Conditional Fisher's exact test as a selection criterion for pair-correlation method. Type I and Type II errors

Chemometrics and Intelligent Laboratory Systems 57 (2001) 1–14
www.elsevier.com/locate/chemometrics
PII: S0169-7439(01)00101-0

Róbert Rajkó (a,*), Károly Héberger (b,*)

a Department of Unit Operations and Environmental Engineering, Institute of Food Industry College, University of Szeged, P.O. Box 433, H-6701 Szeged, Hungary
b Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences, P.O. Box 17, H-1525 Budapest, Hungary

* Corresponding authors. E-mail addresses: rajko@sol.cc.u-szeged.hu (R. Rajkó), heberger@chemres.hu (K. Héberger).

Received 1 February 2000; accepted 20 December 2000

Abstract

The pair-correlation method (PCM) has been developed recently for discrimination between two variables. PCM can be used to identify the decisive (fundamental, basic) factor from among correlated variables even in cases when all other statistical criteria fail to indicate a significant difference. Such decisions are needed frequently in QSAR studies and/or chemical model building. The conditional Fisher's exact test, based on testing significance in 2×2 contingency tables, is a suitable selection criterion for PCM. The test statistic provides a probabilistic aid for accepting the hypothesis of a significant difference between two factors which are almost equally correlated with the response (dependent variable). Differentiating between factors can lead to alternative models at any arbitrary significance level. The power function of the test statistic has also been deduced theoretically. A similar derivation was undertaken to describe the influence of Type I (false-positive conclusion, error of the first kind) and Type II (false-negative conclusion, error of the second kind) errors. The appropriate decision is indicated by low probability levels of both false conclusions. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Variable (or feature) selection; Pair-correlation method (PCM)

1. Introduction

Variable selection (subset selection, feature selection) is one of the key issues in chemometrics. The selection process is more or less solved for linear relationships. Unfortunately, the algorithm for variable selection and the selection criterion are often indistinguishable: the same algorithm can lead to the selection of different variables under different criteria, and vice versa. It is relatively easy to construct data sets for which even the accepted criteria (forward selection, backward elimination, stepwise selection) lead to different conclusions [1a,1b]. Other algorithms based on principal component analysis (PCA), partial least squares (PLS), genetic algorithms and artificial neural networks (ANN) increase the uncertainty concerning the selection of the best subset. The increasing usage of the ANN technique has forced chemometricians to develop non-linear variable selection methods. These approaches are often heuristic and lack any firm theoretical basis. Centner et al. [2] emphasised that a "...weakness of all these methods is the estimation of a suitable number of variables (cut-off level). No explicit rule exists up to now. As a result all approaches work with a user-defined number of variables or with a user-defined critical value for the considered selection criterion."

All the above-mentioned methods build models for prediction. However, prediction is not necessarily the goal to be achieved by the model building process. Models having a theoretical basis and physical relevance are superior to empirical ones.
However, there are no algorithmic ways to select important or basic factors. The connection to physical significance has to be examined individually, again and again. The present paper introduces a new technique, which uses a portion of the information present in the data other than the usual one. The technique is able to select "superior" factors if such superiority exists.

Consider the following example based on the correlation coefficient. Two independent variables (X_1 and X_2) can be discriminated according to the magnitude of the correlation coefficients r_{Y vs. X_1} and r_{Y vs. X_2}. The discrimination can be formulated as an F-test to identify significant differences at a given probability level.

The classical Pearson product-moment correlation coefficient is not the only measure of correlation. There are also non-parametric measures of correlation, e.g., Spearman's rho and Kendall's tau [3]; they are, however, not yet used for variable selection. The pair-correlation method (PCM) provides an alternative possibility to characterise different correlations without using the correlation coefficient.

PCM [4–6] has been developed recently for the discrimination of variables as a non-parametric method, in contrast with methods that require the assumption of normality. PCM can be used to choose the decisive (fundamental, basic) factor from among correlated (collinear) variables, even if all classical statistical criteria cannot indicate any significant difference. PCM, however, needs a test statistic as a selection criterion, i.e., a probabilistic aid for accepting the hypothesis that a significant difference exists between the two factors at any arbitrary significance level.

There are two hypotheses that must be specified in any statistical testing procedure [7–9]: the null hypothesis, denoted H_0, and the alternative hypothesis, denoted H_A. Acceptance or rejection of the null hypothesis is the task to be solved.
However, statistical hypothesis testing is based on sample information, so nobody can be sure that the decision is correct. When H_0 is true but, by chance, the sample data incorrectly suggest that it is false, this is referred to as a Type I error, or the error of the first kind (the probability of this event is ε). When H_0 is false but, by bad luck, the sample data mistakenly suggest that it could be true, this is called a Type II error, or the error of the second kind (the probability of this event is β). The power of a test (equal to 1 − β) is a measure of how good the test is at rejecting a false null hypothesis.

PCM is used to choose between two factors X_1 and X_2, which are approximately equally correlated with the dependent variable Y. Hence, determination of β is of crucial importance (PCM can only discriminate between X_1 and X_2 if the null hypothesis can be rejected). Low levels of both ε and β indicate that the correct decision has been made.

Our aim in this paper was to develop a selection criterion for PCM. The theoretical deduction of Type I and II errors will justify the usage of the method. Moreover, we would like to communicate an improvement of the algorithm for PCM; the improvement is summarised in Appendix A. Finally, we present some examples to validate the method and to better understand how it works.

2. Theoretical principles of PCM

PCM is based on non-parametric, i.e., distribution-free, combinatorial analysis. The formulation of the initial task is given below.

Let us define three vectors: the dependent variable (Y) and the independent variables (X_1 and X_2). The task is to choose the superior one from the coequal X_1 and X_2. Both of the independent variables correlate positively with the dependent variable. The case when one of them, or both, does not correlate with Y does not cause a serious limitation; this will be discussed in the validation part of the paper.
Likewise, a negative correlation does not limit the usage of the method.

Consider all the possible element pairs of the Y vector that can occur when the differences ΔX_1 for Y vs. X_1 and ΔX_2 for Y vs. X_2 are determined. Only the signs of the differences are important:

    ΔX_1 = (X_{1i} − X_{1j}) sgn(Y_i − Y_j),
    ΔX_2 = (X_{2i} − X_{2j}) sgn(Y_i − Y_j),     1 ≤ i < j ≤ m,   (1)

where m is the number of measurements and

    sgn(Y_i − Y_j) = 0 if Y_i = Y_j, and (Y_i − Y_j)/|Y_i − Y_j| otherwise.

There will be \binom{m}{2} = m(m − 1)/2 = n point pairs and differences ΔX_1 and ΔX_2. There are only four possible sign combinations of ΔX_1 and ΔX_2; they are termed A, B, C and D. Table 1 summarises the four possibilities (events). The frequencies of the events A, B, C and D (k_A, k_B, k_C and k_D, respectively) are counted and ordered (see Table 1). Fig. 1 represents the fundamental nature of the four events as the basis of PCM. The cases with Y_i = Y_j are ignored; however, this cannot cause any limitation, since these cases do not hold any information on the differences in the independent variables.

Table 1
Distribution of events A, B, C and D; frequencies obtained using PCM

                ΔX_1 > 0    ΔX_1 < 0
    ΔX_2 > 0    A: k_A      C: k_C
    ΔX_2 < 0    B: k_B      D: k_D

Because of the initial assumption (both positive correlations, for Y vs. X_1 and for Y vs. X_2), the frequency of event A should be the largest: both X_1 and X_2 must change in the same direction as Y. Event D shows how the correlation tends to be reduced by chance; its frequency is then expected to be the lowest.
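The counting procedure of Eq. (1) and Table 1 is easy to mechanise. Below is a minimal Python sketch, not the authors' implementation: the function name and the toy data are ours, and pairs with ΔX_1 = 0 or ΔX_2 = 0 are simply left uncounted, a tie convention the text does not fix for the X variables.

```python
from itertools import combinations

def pcm_counts(Y, X1, X2):
    """Count the frequencies k_A, k_B, k_C, k_D of the four PCM events (Table 1)."""
    kA = kB = kC = kD = 0
    for i, j in combinations(range(len(Y)), 2):
        if Y[i] == Y[j]:
            continue  # ties in Y are ignored (Eq. (1): sgn = 0)
        s = 1 if Y[i] > Y[j] else -1      # sgn(Y_i - Y_j)
        d1 = (X1[i] - X1[j]) * s          # Delta X1 of Eq. (1)
        d2 = (X2[i] - X2[j]) * s          # Delta X2 of Eq. (1)
        if   d1 > 0 and d2 > 0: kA += 1   # both X variables follow Y
        elif d1 > 0 and d2 < 0: kB += 1   # only X1 follows Y
        elif d1 < 0 and d2 > 0: kC += 1   # only X2 follows Y
        elif d1 < 0 and d2 < 0: kD += 1   # neither follows Y
    return kA, kB, kC, kD

# Hypothetical data: X1 tracks Y perfectly, X2 has two swapped neighbours
Y  = [1, 2, 3, 4, 5]
X1 = [1, 2, 3, 4, 5]
X2 = [1, 3, 2, 5, 4]
print(pcm_counts(Y, X1, X2))  # prints: (8, 2, 0, 0)
```

Since k_B > k_C here, X_1 is the candidate superior variable; whether k_B is *significantly* larger than k_C is exactly what the test statistic of Section 3 decides.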
If the frequency of event A is not the highest, then one of the X variables, or both, correlate with Y negatively. The rearrangement of the boxes is equivalent to multiplying X_1 or X_2, or both, by minus one to obtain positive correlations between Y and X_1 as well as between Y and X_2. This can be seen from the formulas in brackets in Appendix A, where the rearrangement procedure is given in detail.

Events A and D hold no direct information for choosing between X_1 and X_2. If the frequency value k_B belonging to event B is larger than k_C for event C, then X_1 overrides X_2, and vice versa. Further details of the properties of PCM are given in [6]. The word 'larger' can be interpreted statistically; thus, a test statistic is required to determine whether the frequency value associated with event B is significantly larger than that for event C (or vice versa).

The paper describes a test statistic based on testing the significance of a 2×2 contingency table. The power function of this test statistic and the influence of Type I and Type II errors are also investigated and described.

Fig. 1. Graphical representation of the four possible events as the basis of PCM.

3. Conditional Fisher's exact test

3.1. Type I error

Consider Table 1 as an example of 2×2 contingency tables. Similar contingency tables are frequently used, e.g., in the medical sciences, so several tests have been developed and investigated; the most important one is Fisher's exact test [3,9–14].

The contingency table shown in Table 2 can be created by applying PCM to the data. If k_B is significantly larger than k_C, then variable X_1 is more strongly correlated with Y than variable X_2, and vice versa. The null hypothesis assumes that X_1 and X_2 are equally correlated with Y:

    H_0: k_B = k_C.   (2)

Consider the following alternative hypothesis:

    H_A: k_B ≠ k_C.   (3)

If H_0 is rejected, then X_1 (or X_2) is more strongly correlated with Y than the other X variable. If H_0 is not rejected, then it can be supposed that the probabilities of events B and C are equal, i.e., they have, in addition to the same predictive property, the same correlation. It can be further supposed that event B appears in only one half of n, and event C in the other half of n. The test statistic is based on the probability of the 2×2 contingency table being realised with the factual values k_A, k_B, k_C, k_D [3,9,13]:

    P(k_B | k_B + k_C, n/2, n/2) = \binom{n/2}{k_B} \binom{n/2}{k_C} / \binom{n}{k_B + k_C}.   (4)

Alternative developments of this hypergeometric formula can be found in Ref. [15]. Use of this formula for determining the optimal sample size in forensic casework has recently appeared in [16]. Thus, the cumulative distribution function of the test statistic will be hypergeometric:

    F(t, K) = Σ_{k=0}^{t} P(k | K, n/2, n/2) = Σ_{k=0}^{t} \binom{n/2}{k} \binom{n/2}{K−k} / \binom{n}{K},   (5)

where K = k_B + k_C. The following equation must hold:

    Σ_{k=0}^{k'_ε} \binom{n/2}{k} \binom{n/2}{K−k} / \binom{n}{K} + Σ_{k=k''_ε}^{K} \binom{n/2}{k} \binom{n/2}{K−k} / \binom{n}{K} = ε/2 + ε/2 = ε,   (6)

where k'_ε and k''_ε are chosen according to ε, the probability of a Type I error, so that k''_ε = K − k'_ε.

Table 2
The 2×2 contingency table to help test the discrimination between variables X_1 and X_2, based on calculations by PCM

                                           Frequencies with      Frequencies without   Marginal
                                           information to        information to        sum
                                           discriminate          discriminate
    X_1 may have stronger correlation      k_B                   (n/2) − k_B           n/2
    X_2 may have stronger correlation      k_C                   (n/2) − k_C           n/2
    Marginal sum                           k_B + k_C             k_A + k_D             n

Because of the symmetry, Eq. (6) can be reduced to

    2 F(k'_ε, K) = ε.   (7)

The above part of the paper describes Fisher's test. As stated by Massart et al. [14], this is the best choice for testing hypotheses on 2×2 contingency tables. Now all the statistical tools are at our disposal to make a decision for discriminating between the two variables X_1 and X_2. If k_B < k'_ε or k_B > k''_ε, then the null hypothesis H_0 should be rejected at a confidence level of (1 − ε); and again, if k_B is larger than k_C, then X_1 correlates with Y more strongly than X_2, and vice versa. The procedure is visualised in Fig. 2.

To apply Eq. (6), the binomial coefficients \binom{n}{k} = n!/(k!(n − k)!) must be calculated via the factorials n!, k! and (n − k)!. This can preferably be done using an approximation, the Stirling formula; see Appendix B.

Fig. 2. Hypergeometric distribution function helping the acceptance or rejection of the null hypothesis H_0 based on the test statistic described in the text (k_A = 40, K = k_B + k_C = 20, k_D = 2, ε = 0.02, k'_ε = 6, k''_ε = 14).

3.2. Type II error, a strictly conditional approximation

The power function (Pow) of the previously described test can be deduced by taking into account the wrong acceptance of the null hypothesis H_0: k_B = k_C.

First, an approximation is investigated: it can be considered that the alternative hypothesis H_A: k_B = k_{B3} is true. Let

    K_ε = k_B + k_C ≠ K_β = k_{B3} + k_{C3}   (k_{B3} ≠ k_{C3}),   (8)

and the probability under H_A becomes

    P(k_{B3} | k_{B3} + k_{C3}, n'/2, n'/2) = \binom{n'/2}{k_{B3}} \binom{n'/2}{k_{C3}} / \binom{n'}{k_{B3} + k_{C3}},   (9)

where n' = k_A + k_{B3} + k_{C3} + k_D.

Fig. 3 shows an example of making a Type II error and its probability. The calculation of the crosshatched area in Fig. 3 can be done with the help of Eq. (5):

    β = F(k''_ε, K_β) − F(k'_ε, K_β) = F(K_ε − k'_ε, K_β) − F(k'_ε, K_β),   (10)

and the power function will then be

    Pow = 1 − β = 1 + Σ_{k=0}^{k'_ε} \binom{n'/2}{k} \binom{n'/2}{K_β − k} / \binom{n'}{K_β} − Σ_{k=0}^{K_ε − k'_ε} \binom{n'/2}{k} \binom{n'/2}{K_β − k} / \binom{n'}{K_β}.   (11)
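Eqs. (4)–(7) and (10) involve only binomial coefficients, so for moderate n the selection criterion can be computed exactly, without the Stirling approximation. The following minimal Python sketch is ours, not the authors' code: the function names, the convention of taking k'_ε as the largest k with 2F(k, K) ≤ ε (Eq. (7) rarely holds with equality for a discrete distribution, so the rejection region here is inclusive), and the example frequencies are all assumptions for illustration.

```python
from math import comb  # exact integer binomial coefficients

def pmf(k, K, n):
    """Eq. (4): hypergeometric P(k | K, n/2, n/2); n must be even."""
    return comb(n // 2, k) * comb(n // 2, K - k) / comb(n, K)

def cdf(t, K, n):
    """Eq. (5): F(t, K) = sum_{k=0}^{t} P(k | K, n/2, n/2)."""
    return sum(pmf(k, K, n) for k in range(t + 1))

def critical_value(K, n, eps):
    """Largest k with 2*F(k, K) <= eps, a discrete analogue of Eq. (7);
    returns -1 when even k = 0 is not extreme enough to reject."""
    kcrit, k = -1, 0
    while k <= K // 2 and 2 * cdf(k, K, n) <= eps:
        kcrit = k
        k += 1
    return kcrit

def pcm_select(kA, kB, kC, kD, eps=0.05):
    """Reject H0: kB = kC when kB falls outside (k'_eps, k''_eps);
    then the variable with the larger frequency wins ('X1', 'X2' or None)."""
    n = kA + kB + kC + kD
    K = kB + kC
    kp = critical_value(K, n, eps)
    if kp < 0:
        return None                 # cannot reject H0 at this level
    kpp = K - kp                    # k''_eps, by the symmetry of Eq. (7)
    if kB <= kp or kB >= kpp:
        return 'X1' if kB > kC else 'X2'
    return None

def beta(kp, K_eps, K_beta, n_prime):
    """Eq. (10): probability of wrongly accepting H0 when the alternative
    margin K_beta (with total n_prime, cf. Eq. (9)) is the true one."""
    return cdf(K_eps - kp, K_beta, n_prime) - cdf(kp, K_beta, n_prime)

# Hypothetical PCM frequencies (kA, kB, kC, kD): kB clearly dominates kC
print(pcm_select(40, 18, 2, 2, eps=0.05))  # prints: X1
```

Because the counts are exact integers until the final division, this direct evaluation is numerically safe for the pair counts n = m(m − 1)/2 arising from data sets of moderate size; the Stirling formula of Appendix B becomes relevant only for much larger n.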