Chemometrics and Intelligent Laboratory Systems 57 (2001) 1–14
www.elsevier.com/locate/chemometrics

Conditional Fisher's exact test as a selection criterion for pair-correlation method. Type I and Type II errors

Róbert Rajkó a,*, Károly Héberger b,*

a Department of Unit Operations and Environmental Engineering, Institute of Food Industry College, University of Szeged, P.O. Box 433, H-6701 Szeged, Hungary
b Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences, P.O. Box 17, H-1525 Budapest, Hungary

Received 1 February 2000; accepted 20 December 2000
Abstract

The pair-correlation method (PCM) has been developed recently for discrimination between two variables. PCM can be used to identify the decisive (fundamental, basic) factor from among correlated variables even in cases when all other statistical criteria fail to indicate a significant difference. Such decisions are frequently needed in QSAR studies and/or chemical model building. The conditional Fisher's exact test, based on testing significance in 2×2 contingency tables, is a suitable selection criterion for PCM. The test statistic provides a probabilistic aid for accepting the hypothesis of a significant difference between two factors that are almost equally correlated with the response (dependent variable). Differentiating between factors can lead to alternative models at any arbitrary significance level. The power function of the test statistic has also been deduced theoretically. A similar derivation was undertaken to describe the influence of Type I (false positive conclusion, error of the first kind) and Type II (false negative conclusion, error of the second kind) errors. The appropriate decision is indicated by low probability levels of both false conclusions. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Variable (or feature) selection; Pair-correlation method (PCM)
1. Introduction

Variable selection (subset selection, feature selection) is one of the key issues in chemometrics. The selection process is more or less solved for linear relationships. Unfortunately, the algorithm for variable selection and the selection criteria are often indistinguishable: the same algorithm can lead to the selection of different variables using different criteria, and vice versa. It is relatively easy to construct data sets for which even the accepted criteria (forward selection, backward elimination, stepwise selection) lead to different conclusions [1a,1b]. Other algorithms based on principal component analysis (PCA), partial least squares (PLS), genetic algorithms and artificial neural networks (ANN) increase the uncertainty concerning the selection of the best subset. The increasing usage of the ANN technique has forced chemometricians to develop nonlinear variable selection methods. The approaches are often heuristic and lack

* Corresponding authors. E-mail addresses: rajko@sol.cc.u-szeged.hu (R. Rajkó), heberger@chemres.hu (K. Héberger).
any firm theoretical basis. Centner et al. [2] emphasised that the "...weakness of all these methods is the estimation of a suitable number of variables (cut-off level). No explicit rule exists up to now. As a result all approaches work with a user-defined number of variables or with a user-defined critical value for the considered selection criterion".

All the above-mentioned methods build models for prediction. However, prediction is not necessarily a goal to be achieved by the model building process. Models having a theoretical basis and physical relevance are superior to empirical ones. However, there is no algorithmic way to select important or basic factors; the connection to physical significance has to be examined individually, again and again. The present paper introduces a new technique that uses a different portion of the information present in the data than is usual. The technique is able to select "superior" factors if such superiority exists.

Consider the following example based on the correlation coefficient. Two independent variables (X_1 and X_2) can be discriminated according to the magnitudes of the correlation coefficients r_{Y vs. X_1} and r_{Y vs. X_2}. The discrimination can be formulated as an F-test to identify significant differences at a given probability level.

The classical Pearson product-moment correlation coefficient is not the only measure of correlation; there are also nonparametric measures, e.g., Spearman's rho and Kendall's tau [3]. They are, however, not yet used for variable selection. The pair-correlation method (PCM) provides an alternative possibility to characterise different correlations without using the correlation coefficient.

PCM [4–6] has been developed recently for the discrimination of variables as a nonparametric method, in contrast with methods that require the assumption of normality. PCM can be used to choose the decisive (fundamental, basic) factor from among correlated (collinear) variables, even if all classical statistical criteria cannot indicate any significant difference. PCM, however, needs a test statistic as a selection criterion, i.e., a probabilistic aid for accepting the hypothesis that a significant difference exists between the two factors at any arbitrary significance level.

There are two hypotheses that must be specified in
any statistical testing procedure [7–9]: the null hypothesis, denoted H_0, and the alternative hypothesis, denoted H_A. Acceptance or rejection of the null hypothesis is the task to be solved. However, statistical hypothesis testing is based on sample information, so nobody can be sure that the decision is correct. When H_0 is true but, by chance, the sample data incorrectly suggest that it is false, this is referred to as a Type I error, or error of the first kind (the probability of this event is ε). When H_0 is false but, by bad luck, the sample data mistakenly suggest that it could be true, this is called a Type II error, or error of the second kind (the probability of this event is β). The power of a test (equal to 1 − β) is a measure of how good the test is at rejecting a false null hypothesis.

PCM is used to choose between two factors X_1 and X_2 that are approximately equally correlated with the dependent variable Y. Hence, the determination of β is of crucial importance (PCM can only discriminate between X_1 and X_2 if the null hypothesis can be rejected). Low levels of both ε and β indicate that the correct decision has been made.

Our aim in this paper was to develop a selection criterion for PCM. The theoretical deduction of Type I and II errors will justify the usage of the method. Moreover, we would like to communicate an improvement of the algorithm for PCM; the improvement is summarised in Appendix A. Finally, we present some examples to validate the method and to better understand how it works.
2. Theoretical principles of PCM

PCM is based on nonparametric, i.e., distribution-free, combinatorial analysis. The formulation of the initial task is given below.

Let us define three vectors: the dependent variable Y and the independent variables X_1 and X_2. The task is to choose the superior one from the coequal X_1 and X_2. Both of the independent variables correlate positively with the dependent variable. The case when one of them, or both, does not correlate with Y does not cause a serious limitation; this will be discussed in the validation part of the paper. Likewise, a negative correlation does not limit the usage of the method.

Consider all the possible element pairs of the Y vector that can occur when the differences ΔX_1 for
Table 1
Distribution of events A, B, C, and D; frequencies obtained using PCM

            ΔX_1 > 0    ΔX_1 < 0
ΔX_2 > 0    A: k_A      C: k_C
ΔX_2 < 0    B: k_B      D: k_D
Y vs. X_1, and ΔX_2 for Y vs. X_2 are determined. Only the signs of the differences are important:

\[
\Delta X_1 = (X_{1i} - X_{1j})\,\operatorname{sgn}(Y_i - Y_j),\qquad
\Delta X_2 = (X_{2i} - X_{2j})\,\operatorname{sgn}(Y_i - Y_j),\qquad
1 \le i < j \le m, \tag{1}
\]
\[
\operatorname{sgn}(Y_i - Y_j) =
\begin{cases}
0, & \text{if } Y_i = Y_j,\\
\dfrac{Y_i - Y_j}{\lvert Y_i - Y_j\rvert}, & \text{otherwise},
\end{cases}
\]

where m is the number of measurements. There will be \(\binom{m}{2} = m(m-1)/2 = n\) point pairs and differences ΔX_1 and ΔX_2. There are only four possible sign combinations of ΔX_1 and ΔX_2; they are termed A, B, C and D. Table 1 summarises the four possibilities (events). The frequencies of the events A, B, C and D (k_A, k_B, k_C, and k_D, respectively) are counted and ordered (see Table 1). Fig. 1 represents the fundamental nature of the four events as the basis of PCM. The cases with Y_i = Y_j are ignored; however, this cannot cause any limitation, since these cases do not hold any information on the differences in the independent variables.
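The pair-scanning and event-counting scheme above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the function name `pcm_counts` and the handling of ties in the independent variables (pairs with a zero ΔX are simply left uncounted, since Table 1 lists strict signs only) are our own choices:

```python
from itertools import combinations

def pcm_counts(y, x1, x2):
    """Count the PCM events A, B, C, D over all point pairs (i < j).

    For each pair, the differences in x1 and x2 are multiplied by
    sgn(y_i - y_j) as in Eq. (1); pairs with y_i == y_j are skipped,
    as they carry no information on the independent variables.
    """
    kA = kB = kC = kD = 0
    for i, j in combinations(range(len(y)), 2):
        s = (y[i] > y[j]) - (y[i] < y[j])  # sgn(y_i - y_j)
        if s == 0:
            continue                       # tied responses are ignored
        d1 = s * (x1[i] - x1[j])           # Delta X_1
        d2 = s * (x2[i] - x2[j])           # Delta X_2
        if d1 > 0 and d2 > 0:
            kA += 1   # event A: both variables change with Y
        elif d1 > 0 and d2 < 0:
            kB += 1   # event B: only X_1 follows Y
        elif d1 < 0 and d2 > 0:
            kC += 1   # event C: only X_2 follows Y
        elif d1 < 0 and d2 < 0:
            kD += 1   # event D: neither follows Y
    return kA, kB, kC, kD
```

The loop visits all n = m(m−1)/2 pairs, so the cost grows quadratically with the number of measurements.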
Because of the initial assumption (both positive or negative correlations for Y vs. X_1 and Y vs. X_2), the frequency of event A should be the largest; that is, both X_1 and X_2 must change in the same direction as Y. Event D shows how the correlation tends to be reduced by chance; its frequency is therefore expected to be the lowest. If the frequency of event A is not the highest, then either one or both of the X variables correlate with Y negatively. The rearrangement of the boxes is equivalent to multiplying X_1 or X_2, or both, by minus one to obtain a positive correlation between Y and X_1, as well as between Y and X_2. This can be seen from the formulas in brackets in Appendix A, where the rearrangement procedure is given in detail.

Events A and D carry no direct information for choosing between X_1 and X_2. If the frequency value k_B belonging to event B is larger than k_C for event C, then X_1 overrides X_2, and vice versa. Further details of the properties of PCM are given in [6]. The word 'larger' must be interpreted statistically; thus, a test statistic is required to determine whether the frequency value associated with event B is significantly larger than that for event C (or vice versa).

This paper describes a test statistic based on testing the significance of a 2×2 contingency table. The power function of this test statistic and the influence
Fig. 1. Graphical representation of four possible events as the basis of PCM.
of Type I and Type II errors are also investigated and described.
3. Conditional Fisher's exact test

3.1. Type I error

Consider Table 1 as an example of a 2×2 contingency table. Similar contingency tables are frequently used, e.g., in the medical sciences, so several tests have been developed and investigated. The most important one is Fisher's exact test [3,9–14].

The contingency table shown in Table 2 can be created by applying PCM to the data. If k_B is significantly larger than k_C, then variable X_1 is more strongly correlated with Y than variable X_2, and vice versa. The null hypothesis assumes that X_1 and X_2 are equally correlated with Y:

\[
\mathrm{H}_0\colon\; k_B = k_C. \tag{2}
\]

Consider the following alternative hypothesis:

\[
\mathrm{H}_A\colon\; k_B \ne k_C. \tag{3}
\]

If H_0 is rejected, then X_1 (or X_2) is more strongly correlated with Y than the other X variable. If H_0 is not rejected, then it can be supposed that the probabilities of events B and C are equal, i.e., they have, in addition to the same predictive property, the same correlation. It can be further supposed that event B appears in only half of the n pairs, and event C appears in the other half. The test statistic is based on the probability of the 2×2 contingency table being realised with the factual values k_A, k_B, k_C, k_D [3,9,13]:

\[
P\!\left(k_B \,\middle|\, k_B + k_C, \frac{n}{2}, \frac{n}{2}\right)
= \frac{\dbinom{n/2}{k_B}\dbinom{n/2}{k_C}}{\dbinom{n}{k_B + k_C}}. \tag{4}
\]
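The probability in Eq. (4) can be evaluated exactly with integer binomial coefficients. The following sketch assumes n is even, as implied by the n/2 marginals; the function names are ours:

```python
from math import comb

def pcm_table_prob(kB, kC, n):
    """Eq. (4): probability of the observed 2x2 table, conditional on
    the marginals, when events B and C are equally likely under H0.
    n is the total number of point pairs (assumed even here)."""
    half = n // 2
    return comb(half, kB) * comb(half, kC) / comb(n, kB + kC)

def pcm_cdf(t, K, n):
    """Eq. (5): hypergeometric cumulative distribution F(t, K),
    with K = kB + kC."""
    half = n // 2
    return sum(comb(half, k) * comb(half, K - k)
               for k in range(t + 1)) / comb(n, K)
```

The same values can be obtained from `scipy.stats.hypergeom(n, n // 2, K)` if SciPy is available; exact integer arithmetic via `math.comb` avoids the dependency.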
Alternative developments of this hypergeometric formula can be found in Ref. [15]. Use of this formula for determining the optimal sample size in forensic casework has recently appeared in [16]. Thus, the cumulative distribution function of the test statistic will be hypergeometric:

\[
F(t, K) = \sum_{k=0}^{t} P\!\left(k \,\middle|\, K, \frac{n}{2}, \frac{n}{2}\right)
= \sum_{k=0}^{t} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K-k}}{\dbinom{n}{K}}, \tag{5}
\]

where K = k_B + k_C. The following equation must hold:
\[
\sum_{k=0}^{k'_\varepsilon} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K-k}}{\dbinom{n}{K}}
+ \sum_{k=k''_\varepsilon}^{K} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K-k}}{\dbinom{n}{K}}
= \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon, \tag{6}
\]

where k'_ε and k''_ε are chosen according to ε, which is
Table 2
The 2×2 contingency table to help test the discrimination between variables X_1 and X_2 based on calculations by PCM

                                      Frequencies with        Frequencies without     Marginal
                                      information to          information to          sum
                                      discriminate between    discriminate between
                                      variables               variables
X_1 may have stronger correlation     k_B                     (n/2) − k_B             n/2
X_2 may have stronger correlation     k_C                     (n/2) − k_C             n/2
Marginal sum                          k_B + k_C               k_A + k_D               n
the probability of the Type I error, so that k''_ε = K − k'_ε. Because of the symmetry, Eq. (6) can be reduced to

\[
2F(k'_\varepsilon, K) = \varepsilon. \tag{7}
\]
Ž . Ž .
´
The above part of the paper describes Fisher’s test.
w x
As stated by Massart et al. 14 this is the best choicefor testing hypotheses on 2
=
2 contingency tables.Now all the statistical tools are at the disposal tomake a decision for discrimination between the twovariables
X
and
X
. If
k

k
X
or
k
)
k
Y
, then the
1 2 B
´
B
´
null hypothesis H should be rejected at a confi
0
Ž .
dence level 1
y
´
, and again, if k is larger than
B
k
, then
X
correlates with
Y
stronger than
X
and
C 1 2
vice versa. The procedure is visualised in Fig. 2.
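The two-sided decision rule can be sketched as follows. This is our own illustrative reading of Eqs. (6) and (7): instead of tabulating k'_ε and k''_ε explicitly, it doubles the smaller tail probability F(min(k_B, k_C), K) and compares it with ε; boundary conventions for discrete distributions may differ slightly from the paper's choice of critical values:

```python
from math import comb

def pcm_cdf(t, K, n):
    """Eq. (5): hypergeometric cumulative distribution F(t, K)."""
    half = n // 2
    return sum(comb(half, k) * comb(half, K - k)
               for k in range(t + 1)) / comb(n, K)

def pcm_test(kA, kB, kC, kD, eps=0.05):
    """Two-sided conditional exact test of H0: k_B = k_C.

    By the symmetry of Eq. (6), the two-sided probability is twice
    the smaller tail, 2*F(min(kB, kC), K); H0 is rejected when it
    does not exceed eps.  Returns (reject, winner), where winner
    names the variable with the stronger correlation, or None for
    a tie.
    """
    n = kA + kB + kC + kD
    K = kB + kC
    p_two_sided = 2 * pcm_cdf(min(kB, kC), K, n)
    reject = p_two_sided <= eps
    if kB > kC:
        winner = "X1"
    elif kC > kB:
        winner = "X2"
    else:
        winner = None
    return reject, winner
```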
To apply Eq. (6), the binomial coefficients \(\binom{n}{k} = n!/(k!\,(n-k)!)\) must be calculated via the factorials n!, k! and (n − k)!. This can preferably be done using an approximation, the Stirling formula; see Appendix B.
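Since the factorials overflow fixed-precision arithmetic quickly, the binomial coefficients are best handled in log space. The following sketch uses the first-order Stirling formula, ln n! ≈ n ln n − n + ½ ln(2πn); it is our own illustration (the paper's Appendix B may use a different variant), with `math.lgamma(n + 1)` available as the exact log-factorial for comparison:

```python
import math

def ln_factorial_stirling(n):
    """First-order Stirling approximation:
    ln n! ~= n*ln n - n + 0.5*ln(2*pi*n); exact for n < 2."""
    if n < 2:
        return 0.0  # ln 0! = ln 1! = 0
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def ln_binom(n, k, lnfact=ln_factorial_stirling):
    """ln C(n, k) assembled from log-factorials; pass
    lnfact=lambda n: math.lgamma(n + 1) for the exact variant."""
    return lnfact(n) - lnfact(k) - lnfact(n - k)
```

The relative error of the truncated series is of order 1/(12n), which is ample for comparing tail probabilities with ε.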
3.2. Type II error, a strictly conditional approximation

The power function (Pow) of the previously described test can be deduced by taking into account the wrong acceptance of the null hypothesis H_0: k_B = k_C.
Fig. 2. Hypergeometric distribution function helping the acceptance or rejection of the null hypothesis H_0 based on the test statistic described in the text (k_A = 40, K = k_B + k_C = 20, k_D = 2, ε = 0.02, k'_ε = 6, k''_ε = 14).
First, an approximation is investigated: it can be assumed that the alternative hypothesis H_A: k_B = k_3 is true. Let

\[
K_\varepsilon = k_B + k_C \;\ne\; K_\beta = k_B + k_3 \qquad (k_C \ne k_3), \tag{8}
\]

and the probability under H_A becomes

\[
P\!\left(k_B \,\middle|\, k_B + k_3, \frac{n'}{2}, \frac{n'}{2}\right)
= \frac{\dbinom{n'/2}{k_B}\dbinom{n'/2}{k_3}}{\dbinom{n'}{k_B + k_3}}, \tag{9}
\]

where n' = k_A + k_B + k_3 + k_D.
Fig. 3 shows an example of making a Type II error and its probability. The calculation of the cross-hatched area in Fig. 3 can be done with the help of Eq. (5):

\[
\beta = F(k''_\varepsilon, K_\beta) - F(k'_\varepsilon, K_\beta)
= F(K_\varepsilon - k'_\varepsilon, K_\beta) - F(k'_\varepsilon, K_\beta), \tag{10}
\]
then the power function will be

\[
\operatorname{Pow}\!\left(\frac{K_\beta}{2} = \frac{k_B + k_3}{2}\right) = 1 - \beta
= 1 + \sum_{k=0}^{k'_\varepsilon} \frac{\dbinom{n'/2}{k}\dbinom{n'/2}{K_\beta - k}}{\dbinom{n'}{K_\beta}}
- \sum_{k=0}^{K_\varepsilon - k'_\varepsilon} \frac{\dbinom{n'/2}{k}\dbinom{n'/2}{K_\beta - k}}{\dbinom{n'}{K_\beta}}. \tag{11}
\]
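Eqs. (10) and (11) can be evaluated numerically with the same hypergeometric CDF. The following sketch is our own; the choice of k_3 is illustrative, and n' is assumed even:

```python
from math import comb

def hyper_cdf(t, K, n):
    """F(t, K) of Eq. (5), population size n with marginals n/2, n/2."""
    half = n // 2
    return sum(comb(half, k) * comb(half, K - k)
               for k in range(t + 1)) / comb(n, K)

def type_ii_error(kA, kB, kD, k3, k_eps_lo, K_eps):
    """beta of Eq. (10): probability of accepting H0 when the
    alternative H_A: k_B = k_3 holds.

    k_eps_lo is k'_eps and K_eps = k_B + k_C, both taken from the
    Type I analysis; under H_A the population size becomes
    n' = k_A + k_B + k_3 + k_D (assumed even) and K_beta = k_B + k_3.
    """
    n_prime = kA + kB + k3 + kD
    K_beta = kB + k3
    return (hyper_cdf(K_eps - k_eps_lo, K_beta, n_prime)
            - hyper_cdf(k_eps_lo, K_beta, n_prime))
```

The power of Eq. (11) is then simply 1 minus the returned value. With the Fig. 2 numbers (k_A = 40, k_D = 2, K_ε = 20, k'_ε = 6) and an illustrative k_3, β stays strictly between 0 and 1, and widening the acceptance region (smaller k'_ε) increases it, as expected.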