A novel feature selection method for classification using a fuzzy criterion

Maria Brigida Ferraro 1, Antonio Irpino 2, Rosanna Verde 2, and Mario Rosario Guarracino 1,3

1 High Performance Computing and Networking Institute, National Research Council, Naples, Italy
2 Department of European and Mediterranean Studies, Second University of Naples, Caserta, Italy
3 Department of Informatics, Kaunas University of Technology, Kaunas, Lithuania

Abstract. Although many classification methods take advantage of fuzzy set theory, the same cannot be said for feature reduction methods. In this paper we explore ideas related to the use of fuzzy sets and we propose a novel fuzzy feature selection method tailored for the Regularized Generalized Eigenvalue Classifier (ReGEC). The method provides small and robust subsets of features that can be used for supervised classification. We show, using real-world datasets, that the performance of the ReGEC classifier on the selected features compares well with that obtained using them all.

1 Introduction

In many practical situations, datasets are large in both size and dimensionality, and many irrelevant and redundant features are included. In a classification context, learning from huge datasets may not work well, even though in theory more features should lead to more discriminant power. Two kinds of algorithms can be used to face this problem: feature transformation (or extraction) and feature selection. Feature transformation consists in constructing new features (in a lower dimensional space) from the original ones. These methods include clustering, basic linear transforms of the input variables (Principal Component Analysis/Singular Value Decomposition, Linear Discriminant Analysis), spectral transforms, wavelet transforms, and convolutions of kernels. The basic idea of a feature transformation is simply to project a high-dimensional feature vector onto a low-dimensional space.
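As a concrete illustration of such a projection, the following sketch performs a PCA-style transformation via the SVD of the centered data. The dataset and the target dimension (5) are made up for the example; this is not the method of the paper, which argues against transformation in favor of selection.

```python
import numpy as np

# Illustrative feature transformation: project a high-dimensional feature
# vector onto a low-dimensional space (PCA via SVD). Data are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 samples, 20 original features

Xc = X - X.mean(axis=0)                 # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_low = Xc @ Vt[:5].T                   # keep the top-5 principal directions

# Note: each new feature is a linear combination of all original ones,
# so the measurement units of the original features are lost.
```

This loss of the original units is exactly the interpretability drawback discussed next.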
Unfortunately, the projection loses the measurement units of the features, and the obtained features are not easy to interpret. Feature selection (FS) may overcome these disadvantages.

FS aims at selecting a subset of features that are relevant in terms of discrimination capability. It avoids the drawback of poor output interpretability, because the selected features are a subset of the given ones. FS is used as a preprocessing phase in many contexts. It plays an important role in applications that involve a large number of features and only a few samples. FS enables data mining algorithms to run when it would otherwise be impossible given the dimensionality of the dataset. Furthermore, it permits focusing only on relevant features and avoiding redundant information. The FS strategy consists of the following steps. From the original set of features, a candidate subset is generated and then evaluated by means of an evaluation criterion. The goodness of each subset is analyzed and, if it fulfills the stopping rule, it is selected and validated in order to check whether the subset is valid. Otherwise, if the subset does not fulfill the stopping rule, another candidate is generated and the whole process is repeated.

FS methods are classified as filters, wrappers and embedded methods, depending on the criterion used to evaluate the feature subsets. Filters are based on intrinsic characteristics of features that reveal their discriminating power and do not depend on the predictor. These methods select features by ranking, and different relevance measures can be used. These measures include correlation criteria [1], the mutual information metric [2, 3, 4], class similarity measures with respect to the selected subset (FFSEM [5] and the filter methods presented in [6, 7]) and the separability of neighboring patterns (ReliefF [8]). A filter procedure may involve a forward or a backward selection.
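The generic FS loop outlined above (generate a candidate subset, evaluate it, check a stopping rule) can be sketched in a few lines. This is an illustrative sketch, not any specific published algorithm: candidates are generated by greedy forward steps, and `evaluate` is a placeholder for any evaluation criterion (a filter measure or, in a wrapper, the accuracy of a classifier trained on the candidate subset).

```python
def forward_select(features, evaluate, eps=1e-6):
    """Greedy forward feature selection: repeatedly add the feature that
    most improves the evaluation score, stopping when no candidate yields
    a significant improvement."""
    selected = []
    best = evaluate(selected)
    candidates = set(features)
    while candidates:
        # Score every candidate subset obtained by adding one feature.
        score, f = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best + eps:          # stopping rule: no improvement
            break
        selected.append(f)
        candidates.remove(f)
        best = score
    return selected

# Toy criterion: only features 2 and 5 carry discriminating information.
print(forward_select(range(8), lambda s: len(set(s) & {2, 5})))  # prints [5, 2]
```

Backward elimination is the mirror image: start from the full set and remove, at each iteration, the features whose removal does not decrease the total evaluation.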
Forward selection starts with no features; at each iteration, one or more features are added if they bring an additional contribution. The algorithm stops when no candidate feature leads to a significant improvement. Backward selection (or elimination) starts with all features. At each iteration, one or more features are removed if they reduce the value of the total evaluation.

Filters have low complexity, but their discriminant power may not be high, since the evaluation criterion may not be associated with the classifier in use. Embedded methods do not separate learning from the feature selection phase, thus embedding the selection within the learning algorithm: the relevant features are picked at the time the predictor is designed. Embedded methods include decision trees, weighted naive Bayes (Duda et al. [9]), and FS using the weight vector of Support Vector Machines (SVM) (Guyon et al. [10], Weston et al. [11]). In wrapper methods, FS depends on the classifier: each candidate subset is evaluated by analyzing the accuracy of a classifier. These methods, unlike filters, are characterized by high computational costs, but high classification rates are usually obtained. Filter algorithms are computationally more efficient, although their performance can be worse than that of wrapper algorithms.

In a classification framework, data may present characteristics of different classes and can be affected by noise. To cope with this problem, classes may be considered as fuzzy sets, and data belong to each class with a degree of membership. Fuzzy logic improves classification by means of overlapping class definitions and improves the interpretability of the results. In recent years, some efforts have been devoted to the development of methodologies for selecting feature subsets in an imprecise and uncertain context. To this end, the idea of a fuzzy set is used to characterize the imprecision. Ramze Rezaee et al.
[12] present a method consisting of the automatic identification of a reduced fuzzy set of a labeled multi-dimensional dataset. The procedure includes the projection of the original dataset onto a fuzzy space, and the determination of the optimal subset of fuzzy features by using conventional search techniques; a k-nearest neighbor (NN) algorithm is used. Pedrycz and Vukovich [13] generalize feature selection by introducing a mechanism of fuzzy feature selection. They propose to consider granular features, rather than numeric ones. A process of fuzzy feature selection is carried out and numerically quantified in the space of membership values generated by fuzzy clusters; in this case a simple Fuzzy C-Means (FCM) algorithm is used. More recently, a new heuristic algorithm has been introduced by Li and Wu [5]. This algorithm is characterized by a new evaluation criterion, based on a min-max learning rule, and a search strategy for feature selection from a fuzzy feature space. The authors consider the accuracy of a k-NN classifier as the evaluation criterion. Hedjazi et al. [14] introduce a new feature selection algorithm, MEmbership Margin Based Attribute Selection (MEMBAS). This approach processes numerical, qualitative and interval data in the same way, based on an appropriate and simultaneous mapping that uses fuzzy logic concepts. They propose to use the Learning Algorithm for Multivariable Data Analysis (LAMBDA), a fuzzy classification algorithm that aims at getting the global membership degree of a sample to an existing class, taking into account the contributions of each feature. Chen et al. [15] introduce an embedded method: an integrated mechanism to extract fuzzy rules and select useful features simultaneously, using the Takagi-Sugeno model for classification. Finally, Vieira et al. [16] consider fuzzy criteria in feature selection by using a fuzzy decision making framework.
The underlying optimization problem is solved using an ant colony optimization algorithm previously proposed by the same authors. The classification accuracy is computed by means of a fuzzy classifier.

A different approach is considered in the work proposed by Moustakidis and Theocharis [17]. They propose a forward filter FS based on a Fuzzy Complementary Criterion (FuzCoC). They introduce the notion of a fuzzy partition vector (FPV) associated with each feature. A local fuzzy evaluation measure with respect to patterns is used, taking advantage of the fuzzy membership degrees of the training patterns (projected on that feature) to their own classes. These grades are obtained using a fuzzy-output kernel-based SVM. The FPV aims at detecting the data discrimination capability provided by each feature. It treats each feature on a pattern-wise basis, thus allowing redundancy between features to be assessed. They obtain subsets of discriminating (highly relevant) and non-redundant features. FuzCoC acts like a minimal-redundancy-maximal-relevance (mRMR) criterion. Once features have been selected, the prediction of class labels is obtained using a 1-NN classifier.

In the present work, we take inspiration from the above methodology and from [18] to devise a novel wrapper FS method. It can be seen as a FuzCoC constructed by a ReGEC (Guarracino et al. [19]) classification approach. By means of a binary linear ReGEC, a one-versus-all (OVA) strategy is implemented, which allows multiclass problems to be solved. For each feature, the distances between each pattern and the classification hyperplanes are computed, and they are used to construct the membership degree of each pattern to its own class. The sum of these grades represents the score associated with the feature, that is, its capability to discriminate the classes. In this way, all features are ranked, and the selection process determines the features leading to an increment of the total accuracy on the training set.
Hence, only the features with the highest discrimination power are selected.

The advantage of this strategy is that it takes into account the peculiarity of the classification method, providing a set of features consistent with it. We show that this process yields a robust subset of features: a change in the training points produces a small variation in the selected features. Furthermore, using standard datasets, we show that the classification accuracy obtained with a small percentage of the available features is comparable with that obtained using all features.

This paper is organized as follows. In the next section, a description of the forward filter FS SVM-FuzCoC ([17]) is given. Section 3 contains our proposal, FFS-ReGEC, and describes the novel algorithm. In order to check the adequacy of the proposed procedure, in Section 4 we present a discussion on the SONAR dataset. Some comparative results on real-world datasets are given in Section 5. Finally, Section 6 contains some concluding remarks and open problems.

2 SVM-FuzCoC

Let D = {x_i, i = 1, ..., N} be the training set, where x_i = {x_ij, j = 1, ..., n} (n is the total number of features). The training patterns in D are initially sorted by class labels: D = {D_1, ..., D_K}, where D_k denotes the set of class k patterns and N_k is the number of patterns included in D_k, with ∑_{k=1}^{K} N_k = N. Following the OVA methodology, the authors initially train a set of M binary K-SVM classifiers on each single feature, to obtain the fuzzy membership of each pattern to its class. Let x_ij denote the feature j component of pattern x_i, i = 1, ..., N. According to FO-K-SVM, the fuzzy membership value μ_k(x_ij) ∈ [0, 1] of x_ij to class k is computed by

μ_k(x_ij) = 0.5,                                                            if f_k(x_ij) = m_ijk = 1,
μ_k(x_ij) = 1 / (1 + exp( ln((1 − γ)/γ) · (f_k(x_ij) − m_ijk) / |1 − m_ijk| )), if m_ijk ≠ 1,   (1)

where f_k(x_ij) is the decision value of the k-th K-SVM binary classifier trained on x_ij, m_ijk = max_{l ≠ k} f_l(x_ij) is the maximum decision value obtained by the remaining (K − 1) K-SVM binary classifiers, and γ is the membership degree threshold fixed by the user.

The fuzzy partition vector (FPV) of feature j is defined as

G(j) = {μ_G(x_1j), ..., μ_G(x_Nj)},   (2)

where μ_G(x_ij) = μ_{c_i}(x_ij) ∈ [0, 1], i = 1, ..., N. In general, μ_G(x_ij) is determined using formula (1) by replacing k with c_i, i.e., the class label to which pattern x_ij belongs. Each FPV can be considered as a fuzzy set defined on D: G(j) = {x_ij, μ_G(x_ij) | x_ij ∈ D}, |D| = N, where μ_G(x_ij) denotes the membership value of x_ij to the fuzzy set G.

Consider a set of initial features, S = {z_1, ..., z_n}, where z_j = [x_1j, ..., x_Nj]^T. For each feature, the associated FPV is constructed in advance by means of the FO-K-SVM technique. Let FS(p) = {z_l1, ..., z_lp} denote the set of p features selected up to and including iteration p. The cumulative set CS(p) is an FPV representing the aggregating effect (union) of the FPVs of the features contained in FS(p):

CS(p) = G(z_l1) ∪ ... ∪ G(z_lp).   (3)

CS(p) approximately quantifies the quality of data coverage obtained by the features selected at the p-th iteration. Let z_lp be a candidate feature at iteration p. AC(p, z_lp) denotes the additional contribution of z_lp with respect to the cumulative set CS(p − 1) obtained at the preceding iteration, and it is determined by

AC(p, z_lp) = |G(z_lp) ∪ CS(p − 1)| − |CS(p − 1)|,   (4)

where |·| denotes the fuzzy cardinality. Feature selection, according to SVM-FuzCoC, follows the algorithm in Fig. 1.
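The quantities above can be made concrete with a small numerical sketch. This is an assumption-laden illustration, not the authors' code: the per-class decision values f_k(x_ij) are random numbers standing in for trained one-versus-all SVM outputs, the fuzzy union is taken pointwise as a maximum, and the additional contribution is read as the gain in fuzzy cardinality (sigma-count).

```python
import numpy as np

def membership(f_own, m, gamma=0.9):
    """Fuzzy membership of a pattern to its own class (Eq. (1)).
    f_own is the decision value of the pattern's class, m the maximum
    decision value of the other classifiers (m != 1 in the else branch)."""
    if f_own == m == 1.0:
        return 0.5
    z = np.log((1.0 - gamma) / gamma) * (f_own - m) / abs(1.0 - m)
    return 1.0 / (1.0 + np.exp(z))

rng = np.random.default_rng(1)
N = 6                                     # toy number of training patterns
f_own = rng.uniform(-1, 1, N)             # decision values of the true class
f_rest = rng.uniform(-1, 1, N)            # max decision values of other classes

# FPV of one feature (Eq. (2)): memberships of all patterns to their classes.
fpv = np.array([membership(a, b) for a, b in zip(f_own, f_rest)])

# Cumulative set (Eq. (3)) as a pointwise-max fuzzy union, and additional
# contribution (Eq. (4)) as the resulting gain in fuzzy cardinality.
cs_prev = rng.uniform(0, 1, N)            # stand-in for CS(p-1)
cs_new = np.maximum(cs_prev, fpv)         # CS(p-1) union G(j)
ac = cs_new.sum() - cs_prev.sum()         # nonnegative by construction
```

Note that with γ > 0.5 the log factor is negative, so a pattern whose own-class decision value exceeds all others receives a membership above 0.5, as intended.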
3 Fuzzy feature selection ReGEC

The proposed FFS-ReGEC is a wrapper FS method incorporating a FuzCoC. The training patterns in D are initially sorted by class labels. Following the OVA methodology, we initially train a set of M binary linear ReGEC classifiers on each single feature, to obtain the fuzzy membership of each pattern to its class. Let x_ij denote the feature j component of pattern x_i, i = 1, ..., N. According to FO-ReGEC (Fuzzy Output ReGEC), the fuzzy membership value μ_{c_i}(x_ij) ∈ [0, 1] of x_ij to its own class c_i is computed by

μ_{c_i}(x_ij) = f + (1 − f) · exp(−‖x_ij − c_i‖² / dm²),   (5)

where ‖x_ij − c_i‖² is the squared distance of x_ij from its original class c_i, dm² = min_{l ≠ c_i} ‖x_ij − c_l‖² is the minimum squared distance of x_ij from the other classes, and f is the minimum membership (fixed). The fuzzy score s_j of feature j is defined as

s_j = ∑_{i=1}^{N} μ_{c_i}(x_ij).   (6)

Feature selection according to FFS-ReGEC consists of the following steps. From the feature set, we select the feature j with the highest score s_j, obtained by (6). Then we consider the set of non-selected features. At each iteration p, we consider the candidate with the highest score among the non-selected ones. Let D_p be the dataset obtained by considering the features selected at iteration (p − 1) together with the candidate. We apply the linear Multi-ReGEC algorithm (Guarracino et al. [20]) and compute the accuracy rate on the training set. If the last added feature increases the accuracy on the training set, we add it to the set of selected features. We iterate the procedure as long as the candidate leads to an increment of the total accuracy. In order to better explain this procedure, the FS algorithm is presented in Fig. 2.
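The scoring and the greedy loop just described can be sketched as follows. This is a hedged outline, not the authors' implementation: the squared distances and the training-accuracy oracle are stand-ins for the quantities that trained ReGEC hyperplanes and linear Multi-ReGEC would provide, and the floor value f = 0.1 is illustrative.

```python
import numpy as np

def fo_regec_membership(d_own, dm2, f=0.1):
    """Eq. (5): mu = f + (1 - f) * exp(-d_own / dm2), where d_own is the
    squared distance from the pattern's own class and dm2 the minimum
    squared distance from the other classes."""
    return f + (1.0 - f) * np.exp(-np.asarray(d_own) / np.asarray(dm2))

def fuzzy_score(d_own, dm2, f=0.1):
    """Eq. (6): score of a feature = sum of memberships over all patterns."""
    return fo_regec_membership(d_own, dm2, f).sum()

def ffs_regec(scores, accuracy):
    """Greedy wrapper loop: take features in decreasing fuzzy-score order
    and keep adding while the training accuracy (an abstract callable,
    standing in for linear Multi-ReGEC) improves."""
    order = [int(j) for j in np.argsort(scores)[::-1]]
    selected = [order[0]]                 # feature with the highest score
    best = accuracy(selected)
    for j in order[1:]:
        acc = accuracy(selected + [j])
        if acc <= best:                   # candidate does not improve: stop
            break
        selected.append(j)
        best = acc
    return selected
```

For instance, with a toy oracle rewarding features 0 and 3, ffs_regec(np.array([3.0, 1.0, 0.5, 2.0]), lambda s: len(set(s) & {0, 3})) returns [0, 3]: the top-scored feature is kept, and the loop stops at the first candidate that fails to raise the accuracy.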