A linear discriminant analysis method based on mutual information maximization

Haihong Zhang (a,*), Cuntai Guan (a), Yuanqing Li (b)

a Institute for Infocomm Research, A*STAR, Singapore 138632, Singapore
b School of Automation Science and Technology, South China University of Technology, Guangzhou 510460, China
* Corresponding author.

Article history: Received 9 February 2010; received in revised form 6 September 2010; accepted 7 November 2010.

Keywords: Discriminant analysis; Mutual information; Feature extraction

Abstract

We present a new linear discriminant analysis method based on information theory, in which the mutual information between the linearly transformed input data and the class labels is maximized. First, we introduce a kernel-based estimate of mutual information with a variable kernel size. We then devise a learning algorithm that maximizes the mutual information with respect to the linear transformation. Two experiments are conducted: the first uses a toy problem to visualize and compare the transformation vectors in the original input space; the second evaluates the classification performance of the method through cross-validation tests on four datasets from the UCI repository. Various classifiers are investigated. Our results show that this method can significantly boost class separability over conventional methods, especially for nonlinear classification. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Discriminant analysis (DA), also referred to as discriminant feature extraction, aims to find a transformation of input variables into latent variables (features) with maximum class separability [1]. It is one of the key subjects in pattern recognition, and differs substantially from feature selection (e.g. [2,3]): while feature selection ranks and selects the input variables according to their predictive significance, feature extraction transforms the variables by, e.g., a linear combination.
In this work, we focus on linear DA for its simplicity and its potential for nonlinear extension via the kernel trick [4].

The most widely used DA method is Fisher linear discriminant analysis (LDA) [5,6], based on Fisher's criterion that the ratio of inter-class scatter to intra-class scatter be maximized. It is designed for 2-class problems under the homoscedastic condition that all classes share one Gaussian covariance matrix. The non-optimality or sub-optimality of LDA is well recognized in the literature (see [7–9]): it can neither handle heteroscedastic data (i.e., classes with unequal covariance matrices) properly, nor is it Bayes-optimal for problems with more than two classes.

A number of alternative LDA techniques have been proposed to address these problems. For multi-class problems, non-parametric scatter matrices were proposed in the so-called non-parametric discriminant analysis (NDA) [10]. These matrices are generally of full rank, allowing more features than the number of classes to be extracted. Moreover, the non-parametric methodology allows DA to work well even for non-Gaussian datasets. Another extension of LDA to the multi-class case was reported in [11], where the approximate Pairwise Accuracy Criterion (aPAC) replaced Fisher's criterion. Specifically, aPAC weights the contribution of individual class pairs in terms of Bayes error (a similar weighting scheme was reported in [12]). More recently, a minimum Bayes error method was reported for dealing with multi-class homoscedastic data [7]. In heteroscedastic discriminant analysis (HDA) [13], all classes are allowed to have different covariance matrices; the method was derived in a maximum-likelihood framework to handle heteroscedastic data. Lacking a closed-form solution, it resorts to numerical optimization.
[This article: Pattern Recognition 44 (2011) 877–885. doi:10.1016/j.patcog.2010.11.001]

Another heteroscedastic extension of LDA (HELDA) was presented in [14], which utilizes the Chernoff criterion to handle multi-class, heteroscedastic data. Favorably, the Chernoff criterion leads to a closed-form solution.

Generally, these methods are limited by their unimodal Gaussian assumptions. Furthermore, theoretical analysis [8] has concluded that generalized eigenvalue-based linear equations (widely used in many of the methods above) may fail regardless of whether the data are homoscedastic or heteroscedastic.

An alternative is the maximum mutual information (MMI) criterion. Rooted in information theory, mutual information [15] measures how much knowing the features reduces the uncertainty about the class labels. In [9], the authors studied the relationship between MMI and the criteria of LDA and its heteroscedastic extensions. Importantly, that study showed that MMI is Bayes-optimal under more general conditions than the earlier criteria. Moreover, MMI connects to the minimum Bayes error via lower and upper bounds [16]. The criterion has also been applied to feature selection [2,3,17,18].

In view of its superior capability for handling complex data distributions, MMI-based discriminant analysis [19] and blind source separation [20] have been promoted recently. In particular, the method "MeRMaID_SIG" of [19] replaced Shannon's entropy with Renyi's quadratic entropy in the mutual information formulation, in favor of its lower computational complexity. However, the quadratic entropy generally diverges from Shannon's entropy, which is the information-theoretic foundation of the MMI approach.
Furthermore, that method is limited to a predefined or annealed [21] kernel size, whereas, as we will show at the end of Section 2, a kernel size that is an intrinsic function of the transformation parameters is preferable for making MMI a consistent measure of separability.

In this paper we propose a new MMI-based DA method and demonstrate its superiority. In particular, we introduce a non-parametric, kernel-based estimate of mutual information based on Shannon's entropy. The kernel size is defined as a function of the feature distributions, and the estimate is invariant under dilation/contraction of the output dimensions. In other words, the kernel size becomes a function of the DA transformation matrix. Furthermore, we derive from the estimate a gradient-based learning algorithm.

We first investigate the method on a toy problem, visualizing the transformation vectors in the original 2D input space. We then evaluate the method using cross-validation on four datasets from the UCI repository, comparing it with existing DA methods including aPAC, HELDA and MeRMaID_SIG. To assess the class separability of the features generated by the different methods, we employ a linear and a nonlinear support vector machine (SVM) in addition to a Parzen window classifier ([1, Section 6.1]).

The remainder of the paper is organized as follows. Section 2 presents a robust mutual information estimate as the objective function for linear discriminant analysis. Section 3 describes a gradient-based learning algorithm, followed by experimental results in Section 4. Section 5 presents discussions, and Section 6 concludes the paper.

2. Mutual information estimate

Let a variable in the original space be x ∈ R^n. It is linearly transformed into a latent variable (i.e.
a feature vector) y ∈ R^m by

  y = W x,    (1)

where W ∈ R^{m×n} is a projection matrix comprising n_w column vectors.

The mutual information between the variable Y of y and the class variable C is known as

  I(Y, C) = H(Y) - H(Y|C) = H(Y) - \sum_{c \in C} H(Y|c) P(c),    (2)

where c is a particular class label. The entropy H(Y) is determined by the probability density function p_y(y):

  H(Y) = -\int p_y(y) \log(p_y(y)) \, dy.    (3)

Now we show that the mutual information is invariant under nonsingular linear transformations of the feature space. Consider two linear transformation matrices W and W' that transform the original data vector x to y and y', respectively, and suppose the feature vector y is transformed by a nonsingular (i.e., full-rank) square matrix G and a constant vector g_0:

  y' = G y + g_0,  i.e.  W' x = G W x + g_0.    (4)

Note that the transformation (4) between the two feature spaces (y and y') is equivalent to a transformation between the two linear methods' matrices, W' = G W, plus a constant additive vector g_0. We emphasize that the nonsingular transformation G refers to the relationship between two feature spaces; it should not be confused with the usually non-square transformation matrix W.

The entropy becomes

  H(Y') = H(GY + g_0) = -\int p_{y'}(G y + g_0) \log(p_{y'}(G y + g_0)) \, d(G y + g_0).    (5)

Since G is a full-rank square matrix, we note

  \det\!\left( \frac{d(G y + g_0)}{dy} \right) = |\det(G)|,    (6)

and write

  H(Y') = -\int \frac{p_y(y)}{|\det(G)|} \log\!\left( \frac{p_y(y)}{|\det(G)|} \right) |\det(G)| \, dy
        = \log(|\det(G)|) - \int p_y(y) \log(p_y(y)) \, dy
        = H(Y) + \log(|\det(G)|).    (7)

Therefore, the entropy is changed by \log(|\det(G)|).
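The entropy shift just derived is easy to verify numerically in the Gaussian case, where the differential entropy has a closed form. The following sketch (ours, not from the paper) checks that transforming a Gaussian variable by a nonsingular G shifts its entropy by exactly log|det(G)|, as Eq. (7) predicts:

```python
import numpy as np

# For y ~ N(mu, S), H(Y) = 0.5 * log((2*pi*e)^d * det(S)); under
# y' = G y + g0 the covariance becomes G S G^T, so Eq. (7) predicts
# H(Y') = H(Y) + log|det(G)|.
def gaussian_entropy(cov):
    d = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** d) * np.linalg.det(cov))

S = np.array([[2.0, 0.3],
              [0.3, 1.0]])                  # covariance of y
G = np.array([[1.5, 0.2],
              [-0.4, 0.8]])                 # nonsingular square transform

H_y = gaussian_entropy(S)
H_yu = gaussian_entropy(G @ S @ G.T)        # covariance of y' = G y + g0
shift = np.log(abs(np.linalg.det(G)))
print(np.isclose(H_yu, H_y + shift))        # -> True
```

The translation g_0 plays no role, as it leaves the covariance unchanged.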
But this change cancels in the mutual information:

  I(Y', C) = H(Y') - \sum_{c \in C} P(c) H(Y'|c)
           = H(Y) + \log(|\det(G)|) - \sum_{c \in C} P(c) \left[ H(Y|c) + \log(|\det(G)|) \right]
           = H(Y) - \sum_{c \in C} P(c) H(Y|c) = I(Y, C).    (8)

Hence, the mutual information is invariant under nonsingular linear transformations of the form (4). This is an important property for a feature-extraction metric: since a nonsingular linear transformation is invertible, it shall have no effect on the class separability of the features under transformation.

Shannon's mutual information is a functional of the underlying probability distributions, and thus has no analytical form for computation. Instead, as in [22,18,3], we resort to a Monte Carlo approximation, described below.

Suppose there are l samples of data {x_i}, i = 1, ..., l, transformed to {y_i} by Eq. (1). Now consider how to estimate the mutual information between {y_i} and the corresponding class labels. First of all, the entropy of the feature vector Y can be expressed as the expectation of -\log(p_y(y)):

  H(Y) = E[-\log(p_y(y))] \approx -\frac{1}{l} \sum_{i=1}^{l} \log(p_y(y_i)).    (9)

Subsequently, p_y(y) can be estimated with kernel density estimation [23]:

  p_y(y) \approx \frac{1}{l} \sum_{i=1}^{l} \varphi(y - y_i),    (10)

where \varphi usually takes a Gaussian form:

  \varphi(y - y_i) = a \exp\!\left( -\frac{1}{2} (y - y_i)^T \Psi^{-1} (y - y_i) \right).    (11)

Here the factor a shall make the integral of the estimated density equal to 1, as required for a probability density function; in this work we consider a constant a. The kernel size matrix \Psi is diagonal, and each diagonal element is determined by

  \psi_{k,k} = \zeta \frac{1}{l-1} \sum_{i=1}^{l} (y_{ik} - \bar{y}_k)^2,    (12)

where \bar{y}_k is the empirical mean of y_k, and we set the coefficient \zeta = (4/(3l))^{0.1} according to the normal optimal smoothing strategy [24].

By substituting Eq. (10) into Eq.
(9), the entropy of the feature vector can be estimated as

  \hat{H}(Y) = -\frac{1}{l} \sum_{i=1}^{l} \log\!\left\{ \frac{1}{l} \sum_{j=1}^{l} \varphi(y_i - y_j) \right\},    (13)

and the conditional intra-class entropy \hat{H}(Y|c) can be estimated similarly using the class-c samples only. The mutual information estimate becomes

  \hat{I}(Y, C) = \hat{H}(Y) - \sum_{c} P(c) \hat{H}(Y|c).    (14)

Importantly, we show below that the mutual information estimate is invariant under the transformation

  y' = F y + g_0,    (15)

where F is a full-rank diagonal matrix whose k-th diagonal element is denoted by f_{k,k}, and g_0 is a translation vector. In a geometric sense, this transformation dilates (if f_{k,k} > 1) or contracts (if f_{k,k} < 1) the dimensions of the latent variable y, in addition to a translation. Such a transformation is invertible, so the class separability remains the same after the transformation.

Now we study the impact of the transformation on the mutual information estimate. After the transformation, the diagonal matrix \Psi (Eq. (12)) becomes a diagonal matrix \Psi' whose elements are given by

  \psi'_{k,k} = f_{k,k}^2 \psi_{k,k}.    (16)

With the new diagonal matrix \Psi', the Gaussian kernel of Eq. (11) gives

  \hat{\varphi}(y'_i - y'_j) = a \exp\!\left( -\frac{1}{2} (y'_i - y'_j)^T \Psi'^{-1} (y'_i - y'_j) \right)
    = a \exp\!\left( -\frac{1}{2} \sum_{k=1}^{d_y} \frac{(f_{k,k} y_{ik} - f_{k,k} y_{jk})^2}{f_{k,k}^2 \psi_{k,k}} \right)
    = a \exp\!\left( -\frac{1}{2} (y_i - y_j)^T \Psi^{-1} (y_i - y_j) \right) = \varphi(y_i - y_j).    (17)

Therefore, the transformation in Eq. (15) does not change the Gaussian kernel output, and thus has no effect on the density estimate in Eq. (10) or on the mutual information estimate in Eq.
(2). The above property is important for the mutual information estimate to serve as the metric (criterion) for class separability in discriminant analysis. The Bayes error is known to be the best criterion for class separability, but it is too complex to be useful as an analytical tool ([1, Chapter 10]). Therefore, other criteria are needed in practice, and they shall be as consistent as possible with the Bayes error. If there is a nonsingular transformation between two linear analysis transforms, they are equivalent in Bayes error [25], since no classification information is lost under the transformation. Accordingly, criteria for discriminant analysis shall also be invariant under such transformations, at least under dilation/contraction. In this sense, the mutual information estimate compares favorably with prior art: for example, the criterion used in [19] employs a fixed kernel size and is not invariant under dilation/contraction. As the development above shows, the key point of the proposed estimate is that its kernel size is an appropriate function of the features (i.e., a function of the linear projection matrix W).

We therefore propose to use the mutual information estimate \hat{I}(Y, C) as the objective function (i.e., criterion) for discriminant analysis, and derive in the following an algorithm to learn the optimum W that produces maximum mutual information.

3. Learning algorithm

The objective of learning is to find the transformation matrix W that maximizes the mutual information estimate \hat{I}(Y, C). However, the estimate expressed by Eqs. (13) and (14) takes a rather complicated form, and its maximization admits no closed-form solution. (The estimate can be viewed as a combination of a number of Gaussian functions, which are in turn determined by the transformation matrix W.) We therefore propose a numerical solution employing a gradient-based optimization algorithm.
To this end, we first consider the k-th projection vector w_k of the linear transformation matrix, and note that the gradient of the mutual information estimate with respect to it is

  \nabla_{w_k} I(Y, C) = \nabla_{w_k} H(Y) - \sum_{c \in C} P(c) \nabla_{w_k} H(Y|c).    (18)

From Eq. (13), we have

  \nabla_{w_k} H(Y) = -\frac{1}{l} \sum_{i=1}^{l} b_i \frac{1}{l} \sum_{j=1}^{l} \frac{\partial \varphi(y_i - y_j)}{\partial w_k},    (19)

where

  b_i = \left( \frac{1}{l} \sum_{j=1}^{l} \varphi(y_i - y_j) \right)^{-1}.    (20)

From Eq. (11), we have

  \frac{\partial \varphi(y_i - y_j)}{\partial w_k} = -\frac{1}{2} \varphi(y_i - y_j) \frac{\partial (y_i - y_j)^T \Psi^{-1} (y_i - y_j)}{\partial w_k}.    (21)

Let us denote the quadratic form (y_i - y_j)^T \Psi^{-1} (y_i - y_j) by \Omega_{ij}. It can be decomposed as

  \Omega_{ij} = \sum_{k_1=1}^{d_o} \sum_{k_2=1}^{d_o} \psi^{-1}_{k_1 k_2} (y_{i k_1} - y_{j k_1})(y_{i k_2} - y_{j k_2}).    (22)

The gradient of \Omega_{ij} is

  \frac{\partial \Omega_{ij}}{\partial w_k} = \sum_{k_1=1}^{d_o} \sum_{k_2=1}^{d_o} \left[ \frac{\partial \psi^{-1}_{k_1 k_2}}{\partial w_k} (y_{i k_1} - y_{j k_1})(y_{i k_2} - y_{j k_2}) + \psi^{-1}_{k_1 k_2} \frac{\partial (y_{i k_1} - y_{j k_1})(y_{i k_2} - y_{j k_2})}{\partial w_k} \right].    (23)

Note that (y_{i k_1} - y_{j k_1})(y_{i k_2} - y_{j k_2}) is a function of w_k if and only if k_1 = k and/or k_2 = k, and \psi^{-1}_{k_1 k_2} is a function of w_k if and only if k_1 = k_2 = k. Furthermore, \psi^{-1}_{k_1 k_2} = 0 if k_1 \neq k_2, since \Psi is diagonal. The gradient above therefore reduces to

  \frac{\partial \Omega_{ij}}{\partial w_k} = \frac{\partial \psi^{-1}_{kk}}{\partial w_k} (y_{ik} - y_{jk})^2 + \psi^{-1}_{kk} \frac{\partial (y_{ik} - y_{jk})^2}{\partial w_k}.    (24)

For the computation of \partial \psi^{-1}_{kk} / \partial w_k, we first note from Eq. (12) that \psi_{kk} can be expressed as a direct function of w_k:

  \psi_{k,k} = \zeta \frac{1}{l-1} \sum_{i=1}^{l} w_k^T (x_i - \bar{x})(x_i - \bar{x})^T w_k \equiv \zeta \, w_k^T \Phi w_k,    (25)

where \bar{x} is the empirical mean of x, and \Phi denotes the empirical covariance matrix of x.
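Eq. (25) can be sanity-checked numerically: the kernel size computed directly from the projected samples via Eq. (12) must match the quadratic form ζ w_k^T Φ w_k. A small sketch (ours), computing both sides independently:

```python
import numpy as np

# Check of Eq. (25): the kernel size of an output dimension equals
# zeta * w^T Phi w, where Phi is the empirical covariance of x (with
# the 1/(l-1) factor) and zeta = (4/(3l))^0.1 as in Eq. (12).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))               # l = 200 samples in R^5
w = rng.normal(size=5)                      # one projection vector w_k
l = X.shape[0]
zeta = (4.0 / (3.0 * l)) ** 0.1

psi_direct = zeta * (X @ w).var(ddof=1)     # Eq. (12) on projected data
Phi = np.cov(X, rowvar=False)               # empirical covariance of x
psi_quad = zeta * w @ Phi @ w               # Eq. (25)
print(np.isclose(psi_direct, psi_quad))     # -> True
```

The identity holds because the variance of a projection equals the quadratic form of the projection vector with the data covariance.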
It then follows that

  \frac{\partial \psi^{-1}_{kk}}{\partial w_k} = -\frac{\eta}{2} \frac{\partial (w_k^T \Phi w_k)}{\partial w_k} = -\eta \, \Phi w_k,    (26)

where for simplicity we denote

  \eta = 2 \zeta^{-1} (w_k^T \Phi w_k)^{-2}.    (27)

Furthermore,

  \frac{\partial (y_{ik} - y_{jk})^2}{\partial w_k} = 2 (x_i - x_j)(x_i - x_j)^T w_k.    (28)

With the above equations, we can write the gradient of the entropy estimate \hat{H}(Y) as

  \nabla_{w_k} \hat{H}(Y) = A w_k,    (29)

where

  A = \frac{1}{2 l^2} \sum_{i=1}^{l} b_i \sum_{j=1}^{l} \varphi(y_i - y_j) \left[ 2 \psi^{-1}_{kk} (x_i - x_j)(x_i - x_j)^T - \eta \, (y_{ik} - y_{jk})^2 \, \Phi \right].    (30)

Similarly, for each within-class entropy we obtain a matrix A_c. The gradient of the mutual information is therefore

  \nabla_{w_k} I(Y, C) = \left( A - \sum_{c} P(c) A_c \right) w_k.    (31)

Note that because the multiplier (A - \sum_c P(c) A_c) contains rather complicated functions of W, the gradient above is in fact a nonlinear function of all the w_k. Using these equations to compute the gradient for each projection vector w_k, we employ an iterative optimization procedure that updates the projection vectors by

  w_k^{(iter+1)} = w_k^{(iter)} + \lambda \nabla_{w_k} I^{(iter)}(Y, C),    (32)

where \lambda is the step size. In this work, we determine the step size by a local line search: the method tries a number (tentatively 15) of values of \lambda in the range [0, 0.01] to update W, and checks whether a local maximum of the mutual information exists within the range, excluding the boundary points. If a local maximum exists, that \lambda is used as the final step size in this iteration; otherwise, the range is increased (tentatively by a factor of 1.5) and the line search procedure is repeated.

The pseudocode of the optimization procedure is described in Fig. 1.
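The overall procedure can be sketched end-to-end on a toy problem. The code below is our illustration, not the authors' implementation: for brevity, a numerical gradient stands in for the analytic one of Eqs. (18)–(31), a fixed small step replaces the line search, and the class-conditional entropies reuse the global kernel size. The constant kernel factor a is omitted, since it cancels in the mutual information estimate.

```python
import numpy as np

def mi_est(W, X, c):
    """Kernel-based MI estimate of Eqs. (10)-(14); rows of W are the w_k."""
    Y = X @ W.T                                              # Eq. (1)
    l = Y.shape[0]
    psi = (4.0 / (3.0 * l)) ** 0.1 * Y.var(axis=0, ddof=1)   # Eq. (12)

    def H(Z):                                                # Eq. (13)
        D = Z[:, None, :] - Z[None, :, :]
        phi = np.exp(-0.5 * np.einsum('ijk,k,ijk->ij', D, 1.0 / psi, D))
        return -np.mean(np.log(phi.mean(axis=1)))

    return H(Y) - sum(np.mean(c == k) * H(Y[c == k]) for k in np.unique(c))

def num_grad(W, X, c, eps=1e-5):
    """Central-difference gradient (our stand-in for Eqs. (18)-(31))."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (mi_est(Wp, X, c) - mi_est(Wm, X, c)) / (2 * eps)
    return g

def fit(X, c, m=1, n_init=50, lam=0.05, tol=1e-4, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 of Fig. 1: best of n_init random initializations.
    inits = [rng.normal(size=(m, X.shape[1])) for _ in range(n_init)]
    W = max(inits, key=lambda W0: mi_est(W0, X, c))
    mi = mi_est(W, X, c)
    for _ in range(max_iter):
        W_new = W + lam * num_grad(W, X, c)                  # Eq. (32)
        mi_new = mi_est(W_new, X, c)
        if mi_new - mi < tol:                                # Step 5: stop on small gain
            break
        W, mi = W_new, mi_new
    return W

# Toy data: only the first input dimension is discriminative.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
c = np.repeat([0, 1], 30)
X[c == 1, 0] += 3.0
W = fit(X, c, m=1)
```

On this toy data the learned projection vector is dominated by the discriminative first dimension, mirroring the behavior the paper reports for MMILA on its toy problem.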
Since the above method is a deterministic process, it is important to set an appropriate initial value for W. In this work, we select the initial value from a randomly generated set: in the following studies, we generate 50 random candidates and choose the one that produces the largest mutual information estimate.

4. Experimental results

4.1. Toy problem

We used a toy problem to investigate the proposed method, visualizing the generated transformation vectors in the original space. Unlike the real-world datasets used later, which have high dimensionality (≫3), the toy problem consists of bivariate input samples, allowing explicit visualization.

Fig. 2 illustrates the problem and the result. The two classes are not linearly separable: while the positive class (in red) has a unimodal Gaussian distribution, the negative class (in blue) is randomly scattered around a half-circle that surrounds the positive class. After applying various methods, including aPAC, PCA, MeRMaID_SIG and the proposed method (MMILA), we plotted the resultant transformation vectors as lines starting from the origin (note that the data were zero-meaned beforehand).

The near-horizontal line produced by aPAC can be reversed without affecting the learning and classification system; hence, the three existing methods created quite similar transformation vectors. Furthermore, those vectors were almost parallel to the two axes, implying that none of the three methods explored the data structure well.

The proposed method MMILA produced fairly different results. Interestingly, its two transformation vectors can be viewed as piecewise-linear approximations to the arc-shaped separation between the two classes.
Therefore, it seems that the MMILA features can better describe the discriminative data structure.

The methods were further tested under strong additive noise. As shown in subfigure (b), the sample distributions of the two classes overlap considerably due to the noise. Nevertheless, MMILA still produced similar and reasonable results as in subfigure (a).

Fig. 1. The learning algorithm for the proposed maximum mutual information linear discriminant analysis.
Input: a set of training samples {x_i}, i = 1, ..., l, with corresponding class labels {c_i}.
Output: a linear transformation matrix W.
Algorithm:
1. Initialization: set the iteration step iter = 0; generate a number (50 in the present work) of random W, and choose as W^(0) the one that produces the largest mutual information estimate.
2. Compute the gradient of W^(iter) according to Eq. (31) and the related equations.
3. Perform a line search that seeks a local maximum of the mutual information along the gradient direction: try a range of step sizes (λ in Eq. (32); see Section 3 below Eq. (32)) to update W^(iter) and check the mutual information estimate for each; if no local maximum exists in the range, increase the search range and repeat until a local maximum is found; set the result as W^(iter+1).
4. Increment the iteration step: iter = iter + 1.
5. Check the termination condition: compute the gain of the mutual information estimate in this iteration; terminate learning if the gain is smaller than a preset threshold (1e-4 in the present work); otherwise proceed to Step 2.

4.2. Real-world datasets

We used four real-world classification datasets from the UCI repository, including both 2-class and multi-class data.
The nature of the data differs largely among the sets; details about the datasets are summarized in Table 1.

The experiment was conducted in MATLAB, and every attribute in the data was linearly normalized to the range [0, 1] beforehand. The general objectives of the experiment were: to study the convergence of the iterative optimization method, and to assess the performance of the MMI-based discriminant analysis method (referred to as MMILA hereafter) in terms of the class separability of its features.

We evaluated class separability in terms of classification accuracy, using randomized cross-validation and different classifiers. A 5 × 5-fold cross-validation was conducted, using the function "crossvalind" from the MATLAB Bioinformatics Toolbox to generate random cross-validation partitions of the data. Each cross-validation test was initialized with a different random seed. We used a linear support vector machine (SVM-L), a Gaussian-kernel support vector machine (SVM-G) (both via the LIBSVM toolbox [26]), and a Parzen window classifier (referred to as Parzen hereafter) [1] that shared the same kernel size with MMILA. During cross-validation, each classifier learned from the training set and classified the test-set samples.

The proposed method was compared against the existing methods aPAC, HELDA, PCA and MeRMaID_SIG. The randomized cross-validation settings were kept consistent across the DA methods, output dimensions, and classifier choices.

The statistics of the classification results are summarized in Tables 2 and 3. Consider each combination of dataset and output dimension as a particular case (e.g., "Musk" with dimension = 1). Among all 15 cases, MMILA yielded the highest mean accuracy rate in 11 cases, aPAC in two cases, HELDA in two cases, and PCA in one case only.

Since classification performance varies with the output dimension of the features [27,28], we plot classification accuracy as a function of the output dimension in Fig. 3 to facilitate the analysis of the results.
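The evaluation protocol has a direct Python analogue. The sketch below is ours, under stated substitutions: scikit-learn's StratifiedKFold and SVC stand in for MATLAB's crossvalind and LIBSVM, with the RBF kernel playing the role of the Gaussian kernel.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def cv_accuracy(X, y, clf, n_repeats=5, n_splits=5):
    # Attributes linearly normalized to [0, 1] beforehand, then 5 repeats
    # of 5-fold cross-validation, each repeat with a different random seed.
    Xn = MinMaxScaler().fit_transform(X)
    accs = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr, te in cv.split(Xn, y):
            clf.fit(Xn[tr], y[tr])
            accs.append(clf.score(Xn[te], y[te]))
    return float(np.mean(accs))

# SVM-L and SVM-G counterparts; apply to any (X, y) dataset:
#   acc_l = cv_accuracy(X, y, SVC(kernel='linear'))
#   acc_g = cv_accuracy(X, y, SVC(kernel='rbf'))
```

Keeping the seeds fixed across methods reproduces the paper's requirement that the cross-validation partitions be consistent across DA methods and classifiers.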
In the "Musk" and "Glass" datasets, MMILA improved classification accuracy significantly over the other discriminant analysis methods for dimensions > 2. In "Yeast" and "Vehicle", MMILA also yielded the highest accuracy rates in most cases.

To further examine the statistical significance of the MMILA results compared with the others, we ran t-tests of the hypothesis that MMILA produced a higher mean classification accuracy. Paired t-tests on the cross-validation results were performed using the MATLAB function "ttest". In particular, we compared the results of MMILA against those of aPAC (which provided overall the best accuracy among the existing methods) and those of MeRMaID_SIG, the state-of-the-art mutual information-based feature extraction method.

Fig. 4 further summarizes the comparison in terms of statistics: it illustrates how often MMILA significantly outperforms (p-value < 0.05) the two existing methods aPAC and MeRMaID_SIG. Compared with aPAC, MMILA tends to produce significant improvements in nonlinear classification and at higher dimensions. Compared with MeRMaID_SIG, MMILA likewise outperformed in nonlinear classification.

Fig. 2. Toy problem for discriminative feature extraction (2-class problem). The samples of the two classes are plotted as red circles or blue dots in (a), and (2-class problem with strong noise) as red crosses or blue dots in (b). See Section 4.1 for further description. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1. Datasets used for evaluation.
Only predictive attributes are considered.

Name     #instances  #attributes  #classes  Remark
Musk     476         166          2         Real-valued attributes; class distribution: 43.5%, 56.5%
Yeast    1484        8            10        Real-valued attributes; class distribution: 16.4%, 28.9%, 31.2%, 29.6%, 23.6%, 34.4%, 11.0%, 2.0%, 1.4%, 0.3%
Glass    214         9            6         Real-valued attributes; the null class (i.e., no samples) from the original seven classes is removed; class distribution: 32.7%, 35.5%, 7.9%, 6.1%, 4.2%, 13.6%
Vehicle  846         18           4         Integer attributes in the [0, 1018] range; class distribution: 23.5%, 25.7%, 25.8%, 25.1%
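The paired comparison described above has simple mechanics: per-fold accuracies from the same cross-validation partitions are compared with a one-sided paired t-test. A minimal sketch with SciPy standing in for MATLAB's "ttest"; the accuracy arrays below are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Paired, one-sided t-test: does method A (MMILA's role) have a higher
# mean per-fold accuracy than method B (aPAC's role)? The numbers are
# made-up placeholders for illustration only.
acc_a = np.array([0.84, 0.86, 0.83, 0.88, 0.85, 0.87, 0.84, 0.86, 0.85, 0.88])
acc_b = np.array([0.81, 0.84, 0.80, 0.85, 0.83, 0.84, 0.82, 0.83, 0.82, 0.85])

t, p = stats.ttest_rel(acc_a, acc_b, alternative='greater')
print(p < 0.05)   # -> True for these illustrative numbers
```

Pairing matters here: because both methods are scored on identical folds, the test operates on per-fold differences and discards fold-to-fold variability.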