A New Covariance Estimate for Bayesian Classifiers in Biometric Recognition

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 14, NO. 2, FEBRUARY 2004

A New Covariance Estimate for Bayesian Classifiers in Biometric Recognition
Carlos E. Thomaz, Duncan F. Gillies, and Raul Q. Feitosa

Manuscript received October 18, 2002; revised May 7, 2003. The work of C. E. Thomaz was supported in part by the Brazilian Government Agency CAPES under Grant 1168/99-1. C. E. Thomaz and D. F. Gillies are with the Department of Computing, Imperial College London, London SW7 2AZ, U.K. (e-mail: cet@doc.ic.ac.uk; dfg@doc.ic.ac.uk). R. Q. Feitosa is with the Department of Electrical Engineering, Catholic University of Rio de Janeiro, Rio de Janeiro 22453-900, Brazil, and also with the Department of Computer Engineering, State University of Rio de Janeiro, Rio de Janeiro 205590-900, Brazil (e-mail: raul@ele.puc-rio.br). Digital Object Identifier 10.1109/TCSVT.2003.821984

Abstract— In many biometric pattern-recognition problems, the number of training examples per class is limited, and consequently the sample group covariance matrices often used in parametric and nonparametric Bayesian classifiers are poorly estimated or singular. Thus, a considerable amount of effort has been devoted to the design of other covariance estimators for use in limited-sample, high-dimensional classification problems. In this paper, a new covariance estimate, called the maximum entropy covariance selection (MECS) method, is proposed. It is based on combining covariance matrices under the principle of maximum uncertainty. In order to evaluate the effectiveness of MECS in biometric problems, experiments on face, facial expression, and fingerprint classification were carried out and compared with popular covariance estimates, including regularized discriminant analysis and the leave-one-out covariance for the parametric classifier, and the Van Ness and Toeplitz covariance estimates for the nonparametric classifier. The results show that, in image-recognition applications, whenever the sample group covariance matrices are poorly estimated or ill posed, the MECS method is faster and usually more accurate than the aforementioned approaches in both parametric and nonparametric Bayesian classifiers.

Index Terms— Bayesian classifiers, biometric recognition, covariance estimate, limited sample sizes, maximum entropy.

I. INTRODUCTION

Statistical pattern-recognition techniques have been used successfully to design several recognition systems [11]. In the statistical approach, a pattern is represented by a set of features or parameters, and the region of the feature space occupied by each class is determined by the probability distribution of its corresponding patterns, which must be either specified (parametric approach) or learned (nonparametric approach).

There are a number of classification rules available to define appropriate statistical decision boundaries [11]. The well-known Bayes decision rule, which assigns a pattern to the class with the highest posterior probability, is the one that achieves minimal misclassification risk among all possible rules (see, e.g., [1]).

The idea behind the Bayes rule is that all of the information available about class membership is contained in the set of conditional probability densities. In practice, most of these probability densities are based on Gaussian kernel functions that involve the inverse of the true covariance matrix of each class [3], [7], [8], [12], [13].
Since in real-world problems the true covariance matrices are seldom known, estimates must be computed from the patterns available in a training set. The usual choice for estimating the true covariance matrices is the maximum-likelihood estimate defined by the corresponding sample group covariance matrices. However, it is well known that in limited-sample-size applications the inverse of the sample group covariance matrices is either poorly estimated or cannot be calculated when the number of training patterns per class is smaller than the number of features. This problem is quite common nowadays, especially in image-recognition applications where patterns are frequently composed of thousands of pixels, from which hundreds of preprocessed image features are obtained.

Thus, a considerable amount of effort has been devoted to the design of other covariance estimators targeting limited-sample and high-dimensional problems in parametric and nonparametric Bayesian classifiers [3]-[8], [17], [20], [23]. Most of these covariance estimators either rely on optimization techniques that are time consuming or have restrictive forms that do not necessarily lead to the highest classification accuracy in all circumstances.

In this paper, a new covariance estimate called maximum entropy covariance selection (MECS) is presented. This estimate is based on combining covariance matrices so as to take into account the maximum uncertainty. It assumes that some sources of variation are the same from class to class and, consequently, that similarities in covariance shape may be expected for all the classes. This has often been the case for biometric applications such as face recognition.

In order to evaluate the effectiveness of MECS, experiments on face, facial expression, and fingerprint recognition were carried out, using publicly released databases with different ratios between the training sample sizes and the dimensionality of the feature space. In each application, the performance of the MECS approach was compared with other covariance estimators, including regularized discriminant analysis (RDA) and the leave-one-out covariance (LOOC) for the parametric Bayesian classifier, and the Van Ness and Toeplitz covariance estimates for the nonparametric Bayesian classifier.

The results show that, in such recognition applications, whenever the sample group covariance matrices were poorly estimated or ill posed, the MECS covariance method is preferable to the other approaches in both parametric and nonparametric Bayesian classifiers. Moreover, the MECS approach offers considerable savings in computation time.

II. LIMITED-SAMPLE-SIZE PROBLEM

The similarity measures used for Bayesian classifiers based on Gaussian kernels involve the inverse of the true covariance matrices of each class. Conventionally, those matrices are estimated by the sample group covariance matrices S_i, which are the unbiased maximum-likelihood estimators of the true covariance matrices [1], given by

    S_i = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (x_{i,j} - \bar{x}_i)(x_{i,j} - \bar{x}_i)^T    (1)

where x_{i,j} is the jth training pattern from class i, \bar{x}_i is the sample mean of class i, and N_i is the number of training patterns from class i.

However, the inverse of S_i is especially problematic if, for p-dimensional patterns, fewer than p + 1 training patterns are available from each class. Since the sample group covariance matrix S_i is a function of N_i - 1 or fewer linearly independent vectors, its rank is N_i - 1 or less. Therefore, S_i is a singular matrix whenever N_i - 1 is less than the dimension p of the feature space.
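As a concrete illustration of (1) and of the rank argument above, the short NumPy sketch below (the dimension, sample size, and synthetic data are arbitrary choices for illustration, not values used in the paper) estimates a sample group covariance matrix and confirms that it is singular whenever N_i - 1 < p.

import numpy as np

def sample_group_covariance(X):
    # Unbiased sample group covariance of (1); X holds one training pattern per row.
    n_i = X.shape[0]
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (n_i - 1)

rng = np.random.default_rng(0)
p, n_i = 50, 10                          # feature dimension larger than the class sample size
X_i = rng.normal(size=(n_i, p))          # hypothetical training patterns of one class

S_i = sample_group_covariance(X_i)
print(np.linalg.matrix_rank(S_i))        # at most n_i - 1 = 9, so S_i is singular for p = 50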
Another well-known problem related to the sample group covariance matrix is its instability due to limited samples. This effect can be seen explicitly by writing the matrices in their spectral decomposition form [4], that is,

    S_i = \sum_{k=1}^{p} \lambda_{i,k} \, \phi_{i,k} \phi_{i,k}^T    (2)

where \lambda_{i,k} is the kth eigenvalue of S_i and \phi_{i,k} is the corresponding eigenvector. According to this representation, the inverse of the sample group covariance matrix can be calculated as

    S_i^{-1} = \sum_{k=1}^{p} \frac{\phi_{i,k} \phi_{i,k}^T}{\lambda_{i,k}}    (3)

From (3), it can be observed that the inverse of S_i is heavily weighted by the smallest eigenvalues and the directions associated with their respective eigenvectors [3]. Hence, a poor or unreliable estimate of S_i tends to exaggerate the importance associated with the low-variance information and consequently distorts discriminant similarity measures based on these estimates.

As a general guideline, Jain and Chandrasekaran [10] have suggested that the class sample sizes should be at least five to ten times the dimension of the feature space. However, such conditions have been quite difficult to achieve in practice, particularly in biometric recognition problems where patterns are frequently composed of thousands of pixels or even hundreds of preprocessed image features.

In Sections III and IV, two of the most popular parametric and nonparametric Bayesian classifiers are briefly described, along with their commonly used nonconventional covariance estimates for targeting limited-sample and high-dimensional problems. By "nonconventional covariance estimate," we mean any covariance estimate that is not based solely on the data of the sample group. A new covariance estimate that can be used in both types of Bayesian classifiers is detailed in Section V.

III. QUADRATIC DISCRIMINANT CLASSIFIER

The quadratic discriminant classifier is one of the most popular parametric Bayesian classifiers. It is based on p-variate Gaussian class-conditional probability densities.

Assuming the symmetrical or zero-one loss function, the Bayes quadratic discriminant (QD) rule stipulates that an unknown pattern x should be assigned to the class i that minimizes

    d_i(x) = \ln|S_i| + (x - \bar{x}_i)^T S_i^{-1} (x - \bar{x}_i) - 2 \ln \pi_i    (4)

where \pi_i is the prior probability associated with the ith class, S_i is the maximum-likelihood estimate of the respective true covariance matrix (defined in (1)), and \bar{x}_i is the maximum-likelihood estimate of the corresponding true mean vector, given by

    \bar{x}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{i,j}    (5)

The discriminant rule described in (4) defines the standard or conventional quadratic discriminant function (QDF) classifier.

A. Linear Discriminant Classifier

One straightforward method routinely applied to overcome the limited-sample-size problem of the QDF classifier, and consequently to deal with the singularity and instability of the sample group covariance matrices S_i, is to employ Fisher's linear discriminant function (LDF) classifier.

The LDF classifier is obtained by replacing S_i in (4) with the pooled sample covariance matrix, defined as

    S_p = \frac{1}{N - g} \sum_{i=1}^{g} (N_i - 1) S_i    (6)

where g is the number of classes and N = N_1 + N_2 + ... + N_g. Since more patterns are taken to calculate the pooled covariance matrix, which is in effect a weighted average of the S_i, S_p will potentially have a higher rank than S_i and may well be a full-rank matrix. Theoretically, however, S_p is a consistent estimator of the true covariance matrices only when \Sigma_1 = \Sigma_2 = ... = \Sigma_g.
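For illustration only (synthetic two-class data, not an example from the paper), the following sketch evaluates the quadratic discriminant of (4) for a test pattern; the LDF variant is obtained simply by substituting the pooled matrix of (6) for every S_i.

import numpy as np

def qd_score(x, mean, cov, prior):
    # Quadratic discriminant d_i(x) of (4); the class with the smallest score is chosen.
    _, logdet = np.linalg.slogdet(cov)
    diff = x - mean
    return logdet + diff @ np.linalg.solve(cov, diff) - 2.0 * np.log(prior)

def pooled_covariance(covs, counts):
    # Pooled sample covariance matrix of (6).
    n, g = sum(counts), len(counts)
    return sum((n_i - 1) * S_i for S_i, n_i in zip(covs, counts)) / (n - g)

rng = np.random.default_rng(1)
X1, X2 = rng.normal(0.0, 1.0, (30, 5)), rng.normal(0.5, 1.0, (40, 5))
means = [X1.mean(axis=0), X2.mean(axis=0)]
covs = [np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)]
S_p = pooled_covariance(covs, [len(X1), len(X2)])

x = rng.normal(size=5)
qdf_label = np.argmin([qd_score(x, m, S, 0.5) for m, S in zip(means, covs)])
ldf_label = np.argmin([qd_score(x, m, S_p, 0.5) for m in means])
print(qdf_label, ldf_label)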
B. RDA Estimate

In order to reduce the singularity and instability effects of the QDF classifier due to limited samples, as well as the limitation of the LDF classifier, Friedman proposed an approach called regularized discriminant analysis (RDA) [3]. RDA is a two-dimensional (2-D) optimization method that shrinks the S_i toward S_p and also shrinks the eigenvalues of the resulting matrices toward equality by blending the first shrinkage with multiples of the identity matrix.

In this context, the sample covariance matrices of the QD rule defined in (4) are replaced by the following S_i^{rda}(\lambda, \gamma):

    S_i(\lambda) = \frac{(1 - \lambda)(N_i - 1) S_i + \lambda (N - g) S_p}{(1 - \lambda)(N_i - 1) + \lambda (N - g)}

    S_i^{rda}(\lambda, \gamma) = (1 - \gamma) S_i(\lambda) + \frac{\gamma}{p} \operatorname{tr}[S_i(\lambda)] \, I    (7)

where the notation "tr" denotes the trace of a matrix. The mixing parameters \lambda and \gamma are restricted to the range 0-1 (optimization grid) and are selected to maximize the leave-one-out classification accuracy over all classes [3]. Although RDA has the benefit of being directly related to the classification accuracy, it is a computationally intensive method, particularly when a large number of classes is considered.

C. LOOC Estimate

Hoffbeck and Landgrebe [8] proposed a covariance estimator for QDF classifiers that depends only on the covariance optimization of single classes. The idea is to examine pairwise mixtures of the sample group covariance estimates S_i and the unweighted common covariance estimate S, together with their diagonal forms [8]. The unweighted common covariance estimate is given by

    S = \frac{1}{g} \sum_{i=1}^{g} S_i    (8)

and can be viewed as the pooled covariance matrix when the number of training patterns is equal in each class.

The LOOC estimator has the following form:

    S_i^{looc}(\alpha_i) =
        (1 - \alpha_i) \operatorname{diag}(S_i) + \alpha_i S_i,      0 \le \alpha_i \le 1
        (2 - \alpha_i) S_i + (\alpha_i - 1) S,                       1 < \alpha_i \le 2
        (3 - \alpha_i) S + (\alpha_i - 2) \operatorname{diag}(S),    2 < \alpha_i \le 3    (9)

Its optimization strategy consists of evaluating several values of \alpha_i over the grid 0 \le \alpha_i \le 3 and then choosing the \alpha_i that maximizes the average leave-one-out log likelihood of the corresponding p-dimensional Gaussian density function [8]. As the LOOC estimate requires that only one density function be evaluated for each point on the optimization grid, it generally requires less computation than the RDA estimator.

IV. PARZEN WINDOW CLASSIFIER

The Parzen window classifier is a popular nonparametric Bayesian classifier. In this classifier, the class-conditional probability densities are estimated locally, using kernel functions centered on the neighboring training patterns of each group [4]. Assuming the symmetrical or zero-one loss function, the Bayes classification rule stipulates that an unknown pattern should be assigned to the class with the highest, or maximum, posterior probability.

In the standard Parzen classifier with Gaussian kernels (PZW), the posterior probability of each class is calculated by multiplying the prior probability \pi_i of the ith class with the corresponding Parzen likelihood density estimate \hat{p}(x \mid \omega_i), given by

    \hat{p}(x \mid \omega_i) = \frac{1}{N_i} \sum_{j=1}^{N_i} \frac{1}{(2\pi)^{p/2} \, |h_i^2 S_i|^{1/2}} \exp\left[ -\frac{1}{2 h_i^2} (x - x_{i,j})^T S_i^{-1} (x - x_{i,j}) \right]    (10)

where, as a reminder, p is the dimension of the feature space, x_{i,j} is the jth pattern from class i, N_i is the number of training patterns from class i, and S_i is the sample group covariance matrix. The parameter h_i is the window width of class i and controls the kernel function "spread" or size.

Due to the limited-sample-size problem, several researchers have, analogously to the QDF classifiers, imposed structures on the sample group covariance matrices for use in Gaussian Parzen classifiers [4], [7], [12], [23]. Two approaches commonly employed for overcoming these estimation singularities and instabilities are briefly described in Sections IV-A and IV-B, after the sketch below of the unstructured estimate in (10).
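The following sketch (illustrative only; the window width and the synthetic data are placeholder values) evaluates the Gaussian-kernel Parzen likelihood of (10) for one class. In the structured estimates of Sections IV-A and IV-B, the argument S_i would simply be replaced by the Van Ness or Toeplitz matrix.

import numpy as np

def parzen_likelihood(x, X_i, S_i, h_i):
    # Gaussian-kernel Parzen density estimate of (10) for one class.
    p = X_i.shape[1]
    cov = (h_i ** 2) * S_i                         # kernel covariance h_i^2 * S_i
    _, logdet = np.linalg.slogdet(cov)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** p * np.exp(logdet))
    diffs = x - X_i                                # one row per training pattern
    mahal = np.einsum('ij,jk,ik->i', diffs, np.linalg.inv(cov), diffs)
    return norm * np.exp(-0.5 * mahal).mean()      # average over the N_i kernels

rng = np.random.default_rng(2)
X_i = rng.normal(size=(20, 5))
S_i = np.cov(X_i, rowvar=False)
print(parzen_likelihood(np.zeros(5), X_i, S_i, h_i=0.8))
# the posterior score of the class is then the prior probability times this likelihood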
A. Van Ness Covariance Estimate

Van Ness proposed a flexible diagonal form for the true covariance matrices of Gaussian Parzen classifiers, based solely on the estimation of the variance of each variable [23]. In this approach, the sample group covariance matrices of the Parzen density estimates defined in (10) are replaced with the following matrices:

    S_i^{vn}(h) = h \, \operatorname{diag}(s_{i,1}^2, s_{i,2}^2, \ldots, s_{i,p}^2)    (11)

where h is a smoothing or scale parameter selected to maximize the leave-one-out classification accuracy over all classes, and s_{i,k}^2 is the sample variance of the kth variable computed from the training patterns of class i. Since only the sample variance of each variable has to be calculated, S_i^{vn} is nonsingular as long as there are at least two linearly independent patterns available per class.

B. Toeplitz Covariance Estimate

Another possible structure for the covariance matrices is the Toeplitz approximation, based on the stationarity assumption [4]. The basic idea is to allow each individual variable to have its own variance, whereas all covariance elements along any diagonal are multiplied by the same correlation factor. The Toeplitz approximation of each group covariance matrix can be calculated as

    S_i^{toep} = D_i R D_i, \qquad D_i = \operatorname{diag}(s_{i,1}, s_{i,2}, \ldots, s_{i,p})    (12)

where

    R = \begin{bmatrix}
        1 & \rho & \rho^2 & \cdots & \rho^{p-1} \\
        \rho & 1 & \rho & \cdots & \rho^{p-2} \\
        \vdots & & \ddots & & \vdots \\
        \rho^{p-1} & \rho^{p-2} & \cdots & \rho & 1
    \end{bmatrix}    (13)

and s_{i,k} is the sample standard deviation of variable k calculated from the training patterns of class i. The correlation factor \rho is given by the average of the sample correlations over the variables [4].

Although we would not expect the Toeplitz covariance estimate to be well suited to many pattern-recognition applications, Hamamoto et al. [7] have shown, based on experiments carried out on artificial data sets, that the Toeplitz estimator can be preferable to the Van Ness [23] and orthogonal expansion [15] estimators, particularly in small-training-sample-size and high-dimensional situations.
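A minimal sketch of these two structured estimates, under the reading of (11)-(13) above (the smoothing parameter value is a placeholder, and averaging the correlations of adjacent variables to obtain \rho is an assumption about the convention in [4]):

import numpy as np

def van_ness_covariance(X_i, h):
    # Diagonal Van Ness estimate of (11): scaled per-variable sample variances.
    return h * np.diag(X_i.var(axis=0, ddof=1))

def toeplitz_covariance(X_i):
    # Toeplitz estimate of (12)-(13): individual variances, geometric correlation decay.
    s = X_i.std(axis=0, ddof=1)
    corr = np.corrcoef(X_i, rowvar=False)
    p = len(s)
    rho = np.mean([corr[k, k + 1] for k in range(p - 1)])          # average adjacent correlation
    R = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.outer(s, s) * R                                      # elementwise form of D_i R D_i

rng = np.random.default_rng(3)
X_i = rng.normal(size=(15, 6))
print(np.linalg.matrix_rank(van_ness_covariance(X_i, h=1.0)))
print(np.linalg.matrix_rank(toeplitz_covariance(X_i)))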
V. NEW COVARIANCE ESTIMATE

In several image-recognition applications, especially biometric ones, the pattern classification task is commonly performed on preprocessed or well-framed patterns, and the sources of variation are often the same from class to class. As a consequence, similarities in covariance shape may be assumed for all classes.

In such situations, and when the sample group covariance matrices S_i are singular or not accurately estimated, linear combinations of S_i and a well-defined covariance matrix, e.g., the pooled covariance matrix S_p, may lead to a "loss of covariance information" [21]. This statement, which is discussed first in Section V-A, forms the basis of the new covariance estimate called the maximum entropy covariance selection (MECS) method.

A. "Loss of Covariance Information"

The theoretical interpretation of the "loss of covariance information" can be described as follows. Let a matrix S_{mix} be given by the following linear combination:

    S_{mix} = a S_i + b S_p    (14)

where the mixing parameters a and b are positive constants, and the pooled covariance matrix S_p is a nonsingular, well-defined matrix. The eigenvectors and eigenvalues of S_{mix} are given, respectively, by the matrices \Phi and \Lambda.

From the covariance spectral decomposition formula described in (2), it is possible to write

    \Lambda = \Phi^T S_{mix} \Phi = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)    (15)

where \lambda_1, \lambda_2, \ldots, \lambda_p are the eigenvalues of S_{mix} and p is the dimension of the measurement space considered. Using the information provided by (14), (15) can be rewritten as

    \Lambda = \Phi^T (a S_i + b S_p) \Phi = a \Phi^T S_i \Phi + b \Phi^T S_p \Phi    (16a)

The matrices \Phi^T S_i \Phi and \Phi^T S_p \Phi are not diagonal, because \Phi does not necessarily diagonalize both matrices simultaneously. However, as \Phi is the eigenvector matrix of the linear combination of S_i and S_p, the off-diagonal elements of \Phi^T S_i \Phi and \Phi^T S_p \Phi necessarily cancel each other in order to generate the eigenvalue matrix \Lambda. Therefore, the string of equalities in (16a) can be extended to

    \Lambda = a \operatorname{diag}(\lambda_1^{S_i}, \ldots, \lambda_p^{S_i}) + b \operatorname{diag}(\lambda_1^{S_p}, \ldots, \lambda_p^{S_p})    (16b)

where \lambda_k^{S_i} = \phi_k^T S_i \phi_k and \lambda_k^{S_p} = \phi_k^T S_p \phi_k are, respectively, the variances of the sample and pooled covariance matrices spanned by the eigenvector matrix \Phi. Then, according to (3), the inverse of S_{mix} becomes

    S_{mix}^{-1} = \sum_{k=1}^{p} \frac{\phi_k \phi_k^T}{a \lambda_k^{S_i} + b \lambda_k^{S_p}}    (17)

where \phi_k is the corresponding kth eigenvector of the matrix S_{mix}.

The inverse of S_{mix} described in (17) considers the dispersions of the sample group covariance matrices spanned by all the eigenvectors. However, when the class sample sizes N_i are small, or not large enough compared with the dimension of the feature space p, the corresponding lower dispersion values \lambda_k^{S_i} are often estimated to be zero or approximately zero, implying that these values are not reliable. Therefore, a linear combination of S_i and S_p that uses the same parameters a and b as defined in (14) for the whole feature space fritters away some pooled covariance information.

The geometric idea of a hypothetical "loss of covariance information" in a three-dimensional (3-D) feature space is illustrated in Fig. 1. The constant-probability-density contours of S_i and S_p are represented by a 2-D dark grey ellipse and a 3-D light grey ellipsoid, respectively. As can be seen, S_i is well defined on a plane but not defined at all along the remaining axis; in fact, there is no information from S_i in that direction. As a consequence, a linear combination of S_i and S_p that shrinks or expands both matrices equally all over the feature space simply ignores this evidence. Other covariance estimators, especially the ones developed for QDF classifiers, have not addressed this problem.

Fig. 1. Geometric idea of a hypothetical "loss of covariance information."
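The following numerical sketch (synthetic matrices, arbitrary dimensions) mirrors the argument of (14)-(17): projecting a rank-deficient S_i and a well-defined S_p onto the eigenvectors of S_i + S_p exposes directions in which S_i contributes essentially zero variance, and a fixed convex blend a S_i + b S_p simply discards part of the pooled information in exactly those directions.

import numpy as np

rng = np.random.default_rng(4)
p, n_i = 12, 5
X_i = rng.normal(size=(n_i, p))
S_i = np.cov(X_i, rowvar=False)            # rank at most n_i - 1 = 4
A = rng.normal(size=(p, p))
S_p = A @ A.T / p + np.eye(p)              # stand-in for a well-defined pooled matrix

_, Phi = np.linalg.eigh(S_i + S_p)         # eigenvectors of an unbiased mixture
var_i = np.diag(Phi.T @ S_i @ Phi)         # dispersions of S_i along Phi, the lambda_k^{S_i}
var_p = np.diag(Phi.T @ S_p @ Phi)         # dispersions of S_p along Phi, the lambda_k^{S_p}

a = b = 0.5                                # a fixed convex combination, as in (14)
print(np.round(var_i, 3))                  # several entries are (near) zero
print(np.round((a * var_i + b * var_p) / var_p, 3))   # pooled variance halved exactly there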
B. MECS Method

The MECS method considers the issue of combining the sample group covariance matrices and the pooled covariance matrix on the basis of the maximum entropy (ME) principle, stated briefly as follows: "When we make inferences based on incomplete information, we should draw them from that probability distribution that has the maximum entropy permitted by the information we do have" [14].

In the problem of estimating covariance matrices for Gaussian classifiers, it is known that different covariance estimators are optimal depending not only on the true covariance statistics of each class, but also on the number of training patterns, the dimension of the feature space, and even the ellipsoidal symmetry associated with the Gaussian distribution [3], [13]. In fact, such covariance optimization can be viewed as a problem of estimating the parameters of Gaussian probability distributions under uncertainty. Therefore, the ME criterion, which maximizes the uncertainty in an incomplete-information context, should be a promising solution.

Let a p-dimensional sample x be normally distributed with true mean \mu and true covariance matrix \Sigma, i.e., x \sim N_p(\mu, \Sigma). The entropy of such a multivariate distribution can be written as

    h(x) = \frac{p}{2} \ln(2\pi e) + \frac{1}{2} \ln|\Sigma|    (18)

which is simply a function of the determinant of \Sigma and is invariant under any orthonormal transformation [4]. Thus, when \Phi consists of the eigenvectors of \Sigma, we have

    h(x) = \frac{p}{2} \ln(2\pi e) + \frac{1}{2} \sum_{k=1}^{p} \ln \lambda_k    (19)

In order to maximize (19), or equivalently (18), we must select the covariance estimate of \Sigma that gives the largest eigenvalues.

Considering convex combinations of the sample group covariance matrix S_i and the pooled matrix S_p, (19) can be rewritten [by using (16)] as

    h(x) = \frac{p}{2} \ln(2\pi e) + \frac{1}{2} \sum_{k=1}^{p} \ln\left(a \lambda_k^{S_i} + b \lambda_k^{S_p}\right)    (20)

where \lambda_k^{S_i} and \lambda_k^{S_p} are the variances of the sample and pooled covariance matrices spanned by \Phi, and the parameters a and b are nonnegative and sum to 1. Moreover, as the natural logarithm is a monotonically increasing function, we do not change the problem if, instead of maximizing (20), we maximize

    \sum_{k=1}^{p} \ln\left(a \lambda_k^{S_i} + b \lambda_k^{S_p}\right)    (21)

However, a \lambda_k^{S_i} + b \lambda_k^{S_p} is a convex combination of two real numbers, and the following inequality is valid [9]:

    \ln\left(a \lambda_k^{S_i} + b \lambda_k^{S_p}\right) \le \max\left[\ln \lambda_k^{S_i}, \ln \lambda_k^{S_p}\right]    (22)

for any \lambda_k^{S_i}, \lambda_k^{S_p} and convex parameters a and b that are nonnegative and sum to 1. Equation (22) shows that the maximum of each term of (21) depends on \lambda_k^{S_i} and \lambda_k^{S_p} and is attained at the extreme values of the convex parameters, that is, either a = 1 and b = 0 or a = 0 and b = 1.

One possible way to maximize (21), and consequently the entropy given by the convex combination of S_i and S_p, is to select the maximum variances of the sample and pooled covariance matrices given by an orthonormal projection basis that diagonalizes an unbiased linear mixture of the corresponding matrices. Recalling the assumption that all classes have similar covariance shapes, it is reasonable to expect that the dominant eigenvectors (i.e., the eigenvectors with the largest eigenvalues) of this unbiased mixture will be mostly oriented by the eigenvectors of the covariance matrix with the largest eigenvalues. The choice of sample group or pooled information is then made purely on the size of the eigenvalue, which reflects the reliability of the information. Since any unbiased combination of S_i and S_p gives the same set of eigenvectors, an orthonormal basis that avoids the loss of covariance information is the one composed of the eigenvectors of the covariance matrix given by S_i + S_p.

Therefore, the MECS estimator S_i^{mecs} can be calculated by the following procedure.

1) Find the eigenvectors \Phi of the covariance matrix given by S_i + S_p.
2) Calculate the variance contribution of both S_i and S_p on the \Phi basis, i.e.,

    \lambda_k^{S_i} = \phi_k^T S_i \phi_k, \qquad \lambda_k^{S_p} = \phi_k^T S_p \phi_k, \qquad k = 1, \ldots, p    (23)

3) Form a new variance matrix based on the largest values, that is,

    \Lambda^{mecs} = \operatorname{diag}\left[\max(\lambda_1^{S_i}, \lambda_1^{S_p}), \ldots, \max(\lambda_p^{S_i}, \lambda_p^{S_p})\right]    (24)

4) Form the MECS estimator

    S_i^{mecs} = \Phi \, \Lambda^{mecs} \, \Phi^T    (25)

The MECS approach is a direct procedure that deals with the singularity and instability of S_i whenever similar covariance matrices are linearly combined. It does not require an optimization procedure, and consequently its use, unlike that of the RDA and LOOC estimates, is not exclusive to the quadratic discriminant classifier. In fact, MECS can be used in the parametric quadratic classifier as well as in the nonparametric Gaussian Parzen classifier whenever the sample group covariance matrices are poorly estimated or ill posed.
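A direct NumPy transcription of steps 1)-4) above (a sketch only; both inputs are assumed to be symmetric matrices of the same dimension):

import numpy as np

def mecs_covariance(S_i, S_p):
    # Maximum entropy covariance selection, following (23)-(25).
    _, Phi = np.linalg.eigh(S_i + S_p)      # 1) eigenvectors of the unbiased mixture
    var_i = np.diag(Phi.T @ S_i @ Phi)      # 2) variance of S_i on the Phi basis, (23)
    var_p = np.diag(Phi.T @ S_p @ Phi)      #    variance of S_p on the Phi basis
    lam = np.maximum(var_i, var_p)          # 3) keep the larger variance per direction, (24)
    return Phi @ np.diag(lam) @ Phi.T       # 4) rebuild the estimator, (25)

The result can then be substituted for S_i in the quadratic discriminant of (4) or in the Parzen density of (10).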
VI. EXPERIMENTS

In order to investigate the performance of MECS compared with the parametric QDF, LDF, RDA, and LOOC classifiers, and also with the nonparametric PZW, Van Ness, and Toeplitz classifiers, three biometric applications were considered: face recognition, facial expression recognition, and fingerprint classification.

In the face and facial expression recognition applications, the training sample sizes were chosen to be, respectively, extremely small and small compared to the dimension of the feature space. In contrast, moderate and large training sample sizes compared with the number of features were considered for the fingerprint problem.

Since in all of the applications the same number of training examples per class was considered, the prior probabilities were assumed to be equal for all classes and recognition tasks. Moreover, the RDA optimization grid was taken to be the outer product of the \lambda and \gamma values suggested by Friedman's work [3]. Analogously, the size of the LOOC parameter grid was as given by [8], and the Van