09_analise_componentes

FEATURE SELECTION

The goals:
• Select the “optimum” number $l$ of features
• Select the “best” $l$ features

Large $l$ has a three-fold disadvantage:
• High computational demands
• Low generalization performance
• Poor error estimates

Given $N$:
• $l$ must be large enough to learn
  - what makes classes different
  - what makes patterns in the same class similar
• $l$ must be small enough not to learn what makes patterns of the same class different
• In practice, $l < N/3$ has been reported to be a sensible choice for a number of cases

Once $l$ has been decided, choose the $l$ most informative features.
• Best: large between-class distance, small within-class variance

The basic philosophy
• Discard individual features with poor information content
• The remaining information-rich features are examined jointly as vectors

Feature Selection based on statistical Hypothesis Testing

The goal: for each individual feature, find whether the values that the feature takes for the different classes differ significantly. That is, answer:
• $H_1$: The values differ significantly
• $H_0$: The values do not differ significantly
If they do not differ significantly, reject the feature from subsequent stages.

Hypothesis Testing Basics

$$H_1: \theta \neq \theta_0, \qquad H_0: \theta = \theta_0$$

The steps:
• $N$ measurements $x_i$, $i = 1, 2, \ldots, N$, are known
• Define a function of them, $q = f(x_1, x_2, \ldots, x_N)$, the test statistic, so that $p_q(q; \theta)$ is easily parameterized in terms of $\theta$
• Let $D$ be an interval where $q$ has a high probability to lie under $H_0$, i.e., under $p_q(q \mid \theta_0)$
• Let $\bar{D}$ be the complement of $D$. $D$ is the acceptance interval and $\bar{D}$ is the critical interval
• If $q$, resulting from $x_1, x_2, \ldots, x_N$, lies in $D$ we accept $H_0$, otherwise we reject it

Probability of an error:

$$p_q(q \in \bar{D} \mid H_0) = \rho$$

• $\rho$ is preselected and it is known as the significance level.

Application: the known variance case

Let $x$ be a random variable and the experimental samples, $x_i$, $i = 1, 2, \ldots, N$, are assumed mutually independent.
Also let

$$E[x] = \mu, \qquad E[(x - \mu)^2] = \sigma^2$$

Compute the sample mean

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

This is also a random variable with mean value

$$E[\bar{x}] = \frac{1}{N} \sum_{i=1}^{N} E[x_i] = \mu$$

That is, it is an unbiased estimator.

The variance $\sigma_{\bar{x}}^2$:

$$E[(\bar{x} - \mu)^2] = E\left[\left(\frac{1}{N} \sum_{i=1}^{N} x_i - \mu\right)^2\right] = \frac{1}{N^2} \sum_{i=1}^{N} E[(x_i - \mu)^2] + \frac{1}{N^2} \sum_{i} \sum_{j \neq i} E[(x_i - \mu)(x_j - \mu)]$$

Due to independence, the cross terms vanish and

$$\sigma_{\bar{x}}^2 = \frac{1}{N} \sigma^2$$

That is, it is asymptotically efficient.

Hypothesis test

$$H_1: E[x] \neq \hat{\mu}, \qquad H_0: E[x] = \hat{\mu}$$

Test statistic: define the variable

$$q = \frac{\bar{x} - \hat{\mu}}{\sigma / \sqrt{N}}$$

Central limit theorem: under $H_0$,

$$p_{\bar{x}}(\bar{x}) = \frac{\sqrt{N}}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{N(\bar{x} - \hat{\mu})^2}{2\sigma^2}\right)$$

Thus, under $H_0$,

$$p_q(q) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{q^2}{2}\right), \qquad \text{i.e., } q \approx N(0,1)$$

The decision steps
• Compute $q$ from $x_i$, $i = 1, 2, \ldots, N$
• Choose the significance level $\rho$
• Compute from $N(0,1)$ tables $D = [-x_\rho, x_\rho]$
• Accept $H_0$ if $q \in D$; reject $H_0$ if $q \in \bar{D}$

An example: a random variable $x$ has variance $\sigma^2 = (0.23)^2$. $N = 16$ measurements are obtained, giving $\bar{x} = 1.35$. The significance level is $\rho = 0.05$. Test the hypothesis

$$H_0: \mu = \hat{\mu} = 1.4, \qquad H_1: \mu \neq \hat{\mu}$$

Since $\sigma^2$ is known, $q = \dfrac{\bar{x} - \hat{\mu}}{\sigma / 4}$ is $N(0,1)$. From tables, we obtain the values $x_\rho$ of the acceptance intervals $[-x_\rho, x_\rho]$ for the normal $N(0,1)$:

1-ρ    0.8    0.85   0.9    0.95   0.98   0.99   0.998  0.999
x_ρ    1.28   1.44   1.64   1.96   2.32   2.57   3.09   3.29

Thus

$$\text{Prob}\left\{-1.96 < \frac{\bar{x} - \hat{\mu}}{0.23/4} < 1.96\right\} = 0.95$$

or

$$\text{Prob}\{-0.113 < \bar{x} - \hat{\mu} < 0.113\} = 0.95$$

or

$$\text{Prob}\{1.237 < \hat{\mu} < 1.463\} = 0.95$$

Since $\hat{\mu} = 1.4$ lies within the above acceptance interval, we accept $H_0$, i.e., $\mu = \hat{\mu} = 1.4$. The interval [1.237, 1.463] is also known as the confidence interval at the $1 - \rho = 0.95$ level. We say that there is no evidence at the 5% level that the mean value is not equal to $\hat{\mu}$.

The Unknown Variance Case

Estimate the variance. The estimate

$$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

is unbiased, i.e., $E[\hat{\sigma}^2] = \sigma^2$.

Define the test statistic

$$q = \frac{\bar{x} - \hat{\mu}}{\hat{\sigma} / \sqrt{N}}$$

This is no longer Gaussian.
If $x$ is Gaussian, then $q$ follows a t-distribution with $N - 1$ degrees of freedom.

An example: $x$ is Gaussian, $N = 16$, and the measurements give $\bar{x} = 1.35$ and $\hat{\sigma}^2 = (0.23)^2$. Test the hypothesis $H_0: \mu = \hat{\mu} = 1.4$ at the significance level $\rho = 0.025$.

Table of acceptance intervals for the t-distribution

Degrees of freedom    1-ρ = 0.9    0.95    0.975    0.99
12                    1.78         2.18    2.56     3.05
13                    1.77         2.16    2.53     3.01
14                    1.76         2.15    2.51     2.98
15                    1.75         2.13    2.49     2.95
16                    1.75         2.12    2.47     2.92
17                    1.74         2.11    2.46     2.90
18                    1.73         2.10    2.44     2.88

With $N - 1 = 15$ degrees of freedom and $1 - \rho = 0.975$:

$$\text{Prob}\left\{-2.49 < \frac{\bar{x} - \hat{\mu}}{\hat{\sigma}/4} < 2.49\right\} = 0.975 \quad \Rightarrow \quad 1.207 < \hat{\mu} < 1.493$$

Thus, $\hat{\mu} = 1.4$ is accepted.

Application in Feature Selection

The goal here is to test against zero the difference $\mu_1 - \mu_2$ of the respective means in $\omega_1$, $\omega_2$ of a single feature.
• Let $x_i$, $i = 1, \ldots, N$, be the values of a feature in $\omega_1$
• Let $y_i$, $i = 1, \ldots, N$, be the values of the same feature in $\omega_2$
• Assume that in both classes $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown or not)

The test becomes

$$H_0: \Delta\mu = \mu_1 - \mu_2 = 0, \qquad H_1: \Delta\mu \neq 0$$

Define $z = x - y$. Obviously $E[z] = \mu_1 - \mu_2$. Define the average

$$\bar{z} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i) = \bar{x} - \bar{y}$$

Known variance case: define

$$q = \frac{(\bar{x} - \bar{y}) - (\hat{\mu}_1 - \hat{\mu}_2)}{\sigma \sqrt{2/N}}$$

This is $N(0,1)$ and one follows the procedure as before.

Unknown variance case: define the test statistic

$$q = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{S_z \sqrt{2/N}}, \qquad S_z^2 = \frac{1}{2N - 2} \left( \sum_{i=1}^{N} (x_i - \bar{x})^2 + \sum_{i=1}^{N} (y_i - \bar{y})^2 \right)$$

• $q$ follows a t-distribution with $2N - 2$ degrees of freedom
• Then apply the appropriate tables as before

Example: the values of a feature in two classes are:

ω1: 3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7
ω2: 3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6

Test if the mean values in the two classes differ significantly, at the significance level $\rho = 0.05$.

For $N = 10$, from the table of the t-distribution with $2N - 2 = 18$ degrees of freedom and $\rho = 0.05$, we obtain $D = [-2.10, 2.10]$. Since $q = 4.25$ is outside $D$, $H_1$ is accepted and the feature is selected.
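The worked examples can be re-checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `mean_interval` and `pooled_t_statistic` are illustrative and not part of the slides, and the critical values are copied from the tables in the transcript rather than computed.

```python
import math
import statistics

# Critical values copied from the tables in the transcript (not computed here).
X_RHO_NORMAL_095 = 1.96   # N(0,1), 1 - rho = 0.95
T_15_DOF_0975 = 2.49      # t-distribution, 15 degrees of freedom, 1 - rho = 0.975
T_18_DOF_095 = 2.10       # t-distribution, 18 degrees of freedom, 1 - rho = 0.95


def mean_interval(x_bar, sigma, n, critical_value):
    """Interval x_bar +/- critical_value * sigma / sqrt(n) for the hypothesised mean."""
    half_width = critical_value * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def pooled_t_statistic(x, y):
    """Return (q, degrees_of_freedom) for H0: mu1 - mu2 = 0 with equal sample sizes."""
    if len(x) != len(y):
        raise ValueError("this form of the test assumes equal sample sizes")
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sum_sq = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s_z = math.sqrt(sum_sq / (2 * n - 2))  # pooled standard deviation S_z
    q = (x_bar - y_bar) / (s_z * math.sqrt(2.0 / n))
    return q, 2 * n - 2


# One-sample examples: N = 16, sample mean 1.35, standard deviation 0.23.
known_var_interval = mean_interval(1.35, 0.23, 16, X_RHO_NORMAL_095)
unknown_var_interval = mean_interval(1.35, 0.23, 16, T_15_DOF_0975)

# Two-class example: values of one feature in omega_1 and omega_2.
omega_1 = [3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7]
omega_2 = [3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6]
q, dof = pooled_t_statistic(omega_1, omega_2)

# q outside D = [-2.10, 2.10] means H0 is rejected and the feature is kept.
feature_selected = abs(q) > T_18_DOF_095

print("known variance interval:", known_var_interval)
print("unknown variance interval:", unknown_var_interval)
print("q =", q, "with", dof, "degrees of freedom; selected:", feature_selected)
```

Hard-coding the critical values keeps the sketch dependency-free and tied to the tables above; a statistics library that exposes t-distribution quantiles could replace the constants.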
The sample statistics behind the value of $q$ in this example are:

$$\omega_1: \ \bar{x} = 3.73, \ \hat{\sigma}_1^2 = 0.0601 \qquad \omega_2: \ \bar{y} = 3.25, \ \hat{\sigma}_2^2 = 0.0672$$

$$S_z^2 = \frac{1}{2}\left(\hat{\sigma}_1^2 + \hat{\sigma}_2^2\right), \qquad q = \frac{(\bar{x} - \bar{y}) - 0}{S_z \sqrt{2/10}} = 4.25$$

Class Separability Measures

The emphasis so far was on individually considered features. However, such an approach cannot take into account existing correlations among the features. That is, two features may be rich in information, but if they are highly correlated we need not consider both of them. In order to search for possible correlations, we consider features jointly as elements of vectors. To this end:
• Discard features that are poor in information, by means of a statistical test.
• Choose the maximum number, $l$, of features to be used. This is dictated by the specific problem (e.g., the number, $N$, of available training patterns and the type of the classifier to be adopted).
• Combine the remaining features to search for the “best” combination. To this end:
  - Use different feature combinations to form the feature vector. Train the classifier, and choose the combination resulting in the best classifier performance. A major disadvantage of this approach is the high complexity. Also, local minima may give misleading results.
  - Adopt a class separability measure and choose the best feature combination against this cost.

Class separability measures: let $x$ be the current feature combination vector.
• Divergence. To see the rationale behind this cost, consider the two-class case. Obviously, if on the average the value of $\ln \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}$ is close to zero, then $x$ should be a poor feature combination. Define:
  - $D_{12} = \displaystyle\int_{-\infty}^{+\infty} p(x \mid \omega_1) \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \, dx$
  - $D_{21} = \displaystyle\int_{-\infty}^{+\infty} p(x \mid \omega_2) \ln \frac{p(x \mid \omega_2)}{p(x \mid \omega_1)} \, dx$
  - $d_{12} = D_{12} + D_{21}$
  $d_{12}$ is known as the divergence and can be used as a class separability measure.
  - For the multi-class case, define $d_{ij}$ for every pair of classes $\omega_i$, $\omega_j$.
  - Large values of $d$ are indicative of a good feature combination.
  - The average divergence $d$ and some properties of $d_{ij}$ are:
$$d = \sum_{i=1}^{M} \sum_{j=1}^{M} P(\omega_i) P(\omega_j) \, d_{ij}$$

$$d_{ij} \geq 0, \qquad d_{ij} = 0 \ \text{if } i = j, \qquad d_{ij} = d_{ji}$$
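To make the divergence concrete, here is a hedged sketch that evaluates $D_{12}$, $D_{21}$, $d_{12}$ and the average divergence numerically. It assumes a single feature with Gaussian class-conditional densities, which is an illustrative assumption and not something the slides require; the function names and the trapezoidal integration scheme are likewise choices made for this example.

```python
import math


def log_gauss(x, mu, sigma):
    """Log of the univariate Gaussian density N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def directed_divergence(mu_a, sig_a, mu_b, sig_b, points=20001):
    """Trapezoidal estimate of D_ab = integral of p_a(x) ln(p_a(x) / p_b(x)) dx."""
    half_width = 10.0 * max(sig_a, sig_b)  # tails beyond this are negligible
    lo = min(mu_a, mu_b) - half_width
    hi = max(mu_a, mu_b) + half_width
    h = (hi - lo) / (points - 1)
    total = 0.0
    for k in range(points):
        x = lo + k * h
        la = log_gauss(x, mu_a, sig_a)
        lb = log_gauss(x, mu_b, sig_b)
        weight = 0.5 if k in (0, points - 1) else 1.0
        # Working with log-densities avoids dividing by a vanishing p_b(x).
        total += weight * math.exp(la) * (la - lb)
    return total * h


def divergence(class_a, class_b):
    """d_ab = D_ab + D_ba for two classes given as (mean, standard deviation)."""
    (mu_a, sig_a), (mu_b, sig_b) = class_a, class_b
    return (directed_divergence(mu_a, sig_a, mu_b, sig_b)
            + directed_divergence(mu_b, sig_b, mu_a, sig_a))


def average_divergence(classes, priors):
    """d = sum_i sum_j P(w_i) P(w_j) d_ij over all class pairs."""
    m = len(classes)
    return sum(priors[i] * priors[j] * divergence(classes[i], classes[j])
               for i in range(m) for j in range(m))


# Hypothetical classes: unit-variance Gaussians with means 0 and 1.
class_1 = (0.0, 1.0)
class_2 = (1.0, 1.0)
d_12 = divergence(class_1, class_2)
d_avg = average_divergence([class_1, class_2], [0.5, 0.5])
print("d_12 =", d_12, "average divergence =", d_avg)
```

For two Gaussians with equal variance $\sigma^2$ the divergence reduces to $(\mu_1 - \mu_2)^2 / \sigma^2$, so the hypothetical classes above should give a value close to 1, and the average divergence with equal priors should be close to 0.5.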