A novel feature selection method for classification using a fuzzy criterion
Maria Brigida Ferraro¹, Antonio Irpino², Rosanna Verde², and Mario Rosario Guarracino¹,³

¹ High Performance Computing and Networking Institute, National Research Council, Naples, Italy
² Department of European and Mediterranean Studies, Second University of Naples, Caserta, Italy
³ Department of Informatics, Kaunas University of Technology, Kaunas, Lithuania
Abstract. Although many classification methods take advantage of fuzzy set theory, the same cannot be said for feature reduction methods. In this paper we explore ideas related to the use of fuzzy sets and we propose a novel fuzzy feature selection method tailored to the Regularized Generalized Eigenvalue Classifier (ReGEC). The method provides small and robust subsets of features that can be used for supervised classification. We show, using real-world datasets, that the performance of the ReGEC classifier on the selected features compares well with that obtained using all of them.
1 Introduction
In many practical situations, datasets are large both in size and in dimensionality, and many irrelevant and redundant features are included. In a classification context, learning from huge datasets may not work well, even though, in theory, more features should provide more discriminant power. To face this problem, two kinds of algorithms can be used: feature transformation (or extraction) and feature selection. Feature transformation consists in constructing new features, in a lower-dimensional space, from the original ones. These methods include clustering, basic linear transforms of the input variables (Principal Component Analysis/Singular Value Decomposition, Linear Discriminant Analysis), spectral transforms, wavelet transforms, and convolutions of kernels. The basic idea of feature transformation is to project a high-dimensional feature vector onto a low-dimensional space. Unfortunately, the projection loses the measurement units of the features, and the obtained features are not easy to interpret. Feature selection (FS) may overcome these disadvantages.

FS aims at selecting a subset of features that are relevant in terms of discrimination capability. It avoids the drawback of poor output interpretability, because the selected features are a subset of the given ones. FS is used as a preprocessing phase in many contexts. It plays an important role in applications that involve a large number of features and only few samples, and it enables data mining algorithms to run when it would otherwise be impossible given the dimensionality of the dataset. Furthermore, it permits focusing only on relevant features and avoiding redundant information. The FS strategy consists of the following steps. From the original set of features, a candidate subset is generated and then evaluated by means of an evaluation criterion. The goodness of each subset is analyzed and, if it fulfills the stopping rule, it is selected and validated in order to check whether the subset is valid. Otherwise, another candidate is generated and the whole process is repeated.

FS methods are classified as filters, wrappers, and embedded methods, depending on the criterion used to evaluate the feature subsets. Filters are based on intrinsic characteristics of the features that reveal their discriminating power, and they do not depend on a predictor. These methods select features by ranking, and different relevance measures can be used, including correlation criteria [1], the mutual information metric [2, 3, 4], class similarity measures with respect to the selected subset (FFSEM [5] and the filter methods presented in [6, 7]), and the separability of neighboring patterns (ReliefF [8]). A filter procedure may involve a forward or a backward selection. Forward selection starts with no features; at each iteration, one or more features are added if they bring an additional contribution, and the algorithm stops when no feature among the candidates yields a significant improvement. Backward selection (or elimination) starts with all features; at each iteration, one or more features are removed if they reduce the value of the total evaluation. Filters have low complexity, but their discriminant power may not be high, since the evaluation criterion may not be associated with the classifier in use. Embedded methods do not separate the learning from the feature selection phase, thus embedding the selection within the learning algorithm: the relevant features are picked while the predictor is designed. Embedded methods include decision trees, weighted naive Bayes (Duda et al. [9]), and FS using the weight vector of Support Vector Machines (SVM) (Guyon et al. [10], Weston et al. [11]). In wrapper methods, FS depends on a classifier: each candidate subset is evaluated by analyzing the accuracy of the classifier. These methods, unlike filters, are characterized by high computational costs, but they usually obtain high classification rates. Filter algorithms are computationally more efficient, although their performance can be worse than that of wrapper algorithms.

In a classification framework, data may present characteristics of different classes and can be affected by noise. To cope with this problem, classes may be considered as fuzzy sets, and data belong to each class with a degree of membership. Fuzzy logic improves classification by means of overlapping class definitions and improves the interpretability of the results. In recent years, some effort has been devoted to developing methodologies for selecting feature subsets in an imprecise and uncertain context; to this end, the idea of a fuzzy set is used to characterize the imprecision. Ramze Rezaee et al. [12] present a method for the automatic identification of a reduced fuzzy set of a labeled multidimensional dataset. The procedure includes the projection of the original dataset onto a fuzzy space and the determination of the optimal subset of fuzzy features by means of conventional search techniques; a k-nearest-neighbor (NN) algorithm is used. Pedrycz and Vukovich [13] generalize feature selection by introducing a mechanism of fuzzy feature selection: they propose to consider granular, rather than numeric, features. A process of fuzzy feature selection is carried out and numerically quantified in the space of membership values generated by fuzzy clusters; in this case a simple Fuzzy C-Means (FCM) algorithm is used. More recently, a new heuristic algorithm has been introduced by Li and Wu [5]. It is characterized by a new evaluation criterion, based on a min-max learning rule, and a search strategy for feature selection in a fuzzy feature space; the authors consider the accuracy of a k-NN classifier as the evaluation criterion. Hedjazi et al. [14] introduce a new feature selection algorithm, MEmbership Margin Based Attribute Selection (MEMBAS). This approach processes numerical, qualitative, and interval data in the same way, based on an appropriate simultaneous mapping that uses fuzzy logic concepts. They propose to use the Learning Algorithm for Multivariable Data Analysis (LAMBDA), a fuzzy classification algorithm that obtains the global membership degree of a sample to an existing class, taking into account the contribution of each feature. Chen et al. [15] introduce an embedded method: an integrated mechanism to extract fuzzy rules and select useful features simultaneously, using the Takagi-Sugeno model for classification. Finally, Vieira et al. [16] consider fuzzy criteria in feature selection by using a fuzzy decision-making framework. The underlying optimization problem is solved by an ant colony optimization algorithm previously proposed by the same authors, and the classification accuracy is computed by means of a fuzzy classifier.

A different approach is proposed by Moustakidis and Theocharis [17]. They propose a forward filter FS method based on a Fuzzy Complementary Criterion (FuzCoC), introducing the notion of a fuzzy partition vector (FPV) associated with each feature. A local fuzzy evaluation measure with respect to patterns is used, which takes advantage of the fuzzy membership degrees of the training patterns (projected on that feature) to their own classes. These grades are obtained using a fuzzy-output kernel-based SVM. The FPV aims at capturing the data discrimination capability provided by each feature. It treats each feature on a pattern-wise basis, thus allowing redundancy between features to be assessed, and it yields subsets of discriminating (highly relevant) and non-redundant features. FuzCoC acts like a minimal-redundancy-maximal-relevance (mRMR) criterion. Once the features have been selected, the prediction of class labels is obtained using a 1-NN classifier.

In the present work, we take inspiration from the above methodology and from [18] to devise a novel wrapper FS method. It can be seen as a FuzCoC built on the ReGEC (Guarracino et al. [19]) classification approach. By means of a binary linear ReGEC, a one-versus-all (OVA) strategy is implemented, which allows multiclass problems to be solved. For each feature, the distances between each pattern and the classification hyperplanes are computed, and they are used to construct the membership degree of each pattern to its own class. The sum of these grades represents the score associated with the feature, that is, its capability to discriminate the classes. In this way, all features are ranked, and the selection process retains the features leading to an increment of the total accuracy on the training set. Hence, only the features with the highest discrimination power are selected.

The advantage of this strategy is that it takes into account the peculiarity of the classification method, providing a set of features consistent with it. We show that this process yields a robust subset of features: a change in the training points produces a small variation in the selected features. Furthermore, using standard datasets, we show that the classification accuracy obtained with a small percentage of the available features is comparable with that obtained using all features.

This paper is organized as follows. In the next section, a description of the forward filter FS method SVM-FuzCoC [17] is given. Section 3 contains our proposal, FFSReGEC, and describes the novel algorithm. In order to check the adequacy of the proposed procedure, in Section 4 we present a discussion on the SONAR dataset. Some comparative results on real-world datasets are given in Section 5. Finally, Section 6 contains some concluding remarks and open problems.
2 SVM-FuzCoC
Let $D = \{x_i,\ i = 1, \dots, N\}$ be the training set, where $x_i = \{x_{ij},\ j = 1, \dots, n\}$ and $n$ is the total number of features. The training patterns in $D$ are initially sorted by class labels: $D = \{D_1, \dots, D_K\}$, where $D_k = \{x_{i_1}, \dots, x_{i_{N_k}}\}$ denotes the set of class-$k$ patterns and $N_k$ is the number of patterns included in $D_k$, with $\sum_{k=1}^{K} N_k = N$. Following the OVA methodology, the authors initially train a set of $M$ binary KSVM classifiers on each single feature, to obtain the fuzzy membership of each pattern to its class. Let $x_{ij}$ denote the feature-$j$ component of pattern $x_i$, $i = 1, \dots, N$. According to FOKSVM, the fuzzy membership value $\mu_k(x_{ij}) \in [0, 1]$ of $x_{ij}$ to class $k$ is computed by
$$\mu_k(x_{ij}) = \begin{cases} 0.5, & \text{if } f_k(x_{ij}) = m_{ijk}, \\[4pt] \dfrac{1}{1 + e^{\,\ln\left(\frac{1-\gamma}{\gamma}\right)\cdot\frac{f_k(x_{ij}) - m_{ijk}}{|1 - m_{ijk}|}}}, & \text{if } m_{ijk} \neq 1, \end{cases} \qquad (1)$$

where $f_k(x_{ij})$ is the decision value of the $k$-th KSVM binary classifier at $x_{ij}$, $m_{ijk} = \max_{l \neq k} f_l(x_{ij})$ is the maximum decision value obtained by the remaining $(K-1)$ KSVM binary classifiers, and $\gamma$ is the membership degree threshold fixed by the user.

The fuzzy partition vector (FPV) of feature $j$ is defined as

$$G(j) = \{\mu_G(x_{1j}), \dots, \mu_G(x_{Nj})\} \qquad (2)$$

where $\mu_G(x_{ij}) = \mu_{c_i}(x_{ij}) \in [0, 1]$, $i = 1, \dots, N$.
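As a concrete illustration, the sigmoidal mapping of Eq. (1) and the assembly of an FPV can be sketched as follows. This is a minimal sketch, not the authors' implementation: the decision values $f_k(x_{ij})$ and the rival maxima $m_{ijk}$ are assumed to be already available from the per-feature OVA classifiers.

```python
import math

def foksvm_membership(f_k, m_ijk, gamma):
    """Sigmoidal membership of Eq. (1).

    f_k   : decision value of the class-k OVA classifier at x_ij
    m_ijk : maximum decision value among the remaining classifiers
    gamma : user-fixed membership threshold (gamma > 0.5), reached
            when the decision value equals 1
    """
    if f_k == m_ijk or m_ijk == 1:   # boundary cases of Eq. (1)
        return 0.5
    exponent = math.log((1 - gamma) / gamma) * (f_k - m_ijk) / abs(1 - m_ijk)
    return 1.0 / (1.0 + math.exp(exponent))

def fuzzy_partition_vector(decisions_own, decisions_rival_max, gamma):
    """FPV of one feature, Eq. (2): one membership per training pattern,
    each computed with respect to the pattern's own class label c_i."""
    return [foksvm_membership(f, m, gamma)
            for f, m in zip(decisions_own, decisions_rival_max)]
```

With `gamma = 0.8`, for instance, a pattern whose own-class decision value equals 1 against a rival maximum of 0.2 receives membership 0.8, while a pattern lying exactly on the boundary between its own class and the best rival receives 0.5.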
Generally, $\mu_G(x_{ij})$ is determined using the general formula (1) by replacing $k$ with $c_i$, i.e., the class label to which pattern $x_{ij}$ belongs. Each FPV can be considered as a fuzzy set defined on $D$: $G(j) = \{(x_{ij},\, \mu_G(x_{ij})) \mid x_{ij} \in D\}$, with $|D| = N$, where $\mu_G(x_{ij})$ denotes the membership value of $x_{ij}$ to the fuzzy set $G$.

Consider the set of initial features $S = \{z_1, \dots, z_n\}$, where $z_j = [x_{1j}, \dots, x_{Nj}]^T$. For each
feature they construct in advance the associated FPV by means of the FOKSVM technique. Let $FS(p) = \{z_{l_1}, \dots, z_{l_p}\}$ denote the set of $p$ features selected up to and including iteration $p$. The cumulative set $CS(p)$ is an FPV representing the aggregated effect (union) of the FPVs of the features contained in $FS(p)$:

$$CS(p) = G(z_{l_1}) \cup \dots \cup G(z_{l_p}) \qquad (3)$$
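Taking the standard max operator as the fuzzy union (an assumption, since the text does not name the operator), the cumulative set of Eq. (3) reduces to a pointwise maximum over the FPVs of the selected features:

```python
def cumulative_set(*fpvs):
    """CS(p) of Eq. (3): pointwise fuzzy union of the FPVs of the features
    selected so far, using the standard max operator."""
    return [max(memberships) for memberships in zip(*fpvs)]
```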
$CS(p)$ approximates the quality of data coverage obtained by the features selected up to the $p$-th iteration. Let $z_{l_p}$ be a candidate feature to be selected at iteration $p$. $AC(p, z_{l_p})$ denotes the additional contribution of $z_{l_p}$ with respect to the cumulative set $CS(p-1)$ obtained at the preceding iteration, and it is determined by

$$AC(p, z_{l_p}) = G(z_{l_p}) - CS(p-1) \qquad (4)$$

Feature selection, according to SVM-FuzCoC, follows the algorithm in Fig. 1.
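Eq. (4) involves a fuzzy set difference, whose operator the text leaves implicit. Assuming the common bounded difference max(μ_G − μ_CS, 0), aggregated over the training patterns into a scalar, the additional contribution can be sketched as:

```python
def additional_contribution(fpv_candidate, cs_prev):
    """AC(p, z_lp) of Eq. (4): how much coverage the candidate's FPV adds
    beyond CS(p-1). The fuzzy difference is assumed to be the bounded
    difference max(mu_G - mu_CS, 0), summed over the training patterns."""
    return sum(max(mu_g - mu_cs, 0.0)
               for mu_g, mu_cs in zip(fpv_candidate, cs_prev))
```

A candidate whose memberships are everywhere dominated by the current cumulative set therefore contributes nothing, which is exactly the redundancy-penalizing behavior FuzCoC is after.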
3 Fuzzy feature selection ReGEC
The proposed FFSReGEC is a wrapper FS method incorporating a FuzCoC. The training patterns in $D$ are initially sorted by class labels. Following the OVA methodology, we initially train a set of $M$ binary linear ReGEC classifiers on each single feature, to obtain the fuzzy membership of each pattern to its class. Let $x_{ij}$ denote the feature-$j$ component of pattern $x_i$, $i = 1, \dots, N$. According to FOReGEC (Fuzzy Output ReGEC), the fuzzy membership value $\mu_{c_i}(x_{ij}) \in [0, 1]$ of $x_{ij}$ to its own class $c_i$ is computed by
$$\mu_{c_i}(x_{ij}) = f + (1 - f) \cdot e^{-\frac{\|x_{ij} - c_i\|^2}{dm^2}} \qquad (5)$$

where $\|x_{ij} - c_i\|^2$ is the squared distance of $x_{ij}$ from its own class $c_i$, $dm^2 = \min_{l \neq c_i} \|x_{ij} - c_l\|^2$ is the minimum squared distance of $x_{ij}$ from the other classes, and $f$ is the fixed minimum membership. The fuzzy score $s_j$ of feature $j$ is defined as

$$s_j = \sum_{i=1}^{N} \mu_{c_i}(x_{ij}) \qquad (6)$$

Feature selection according to FFSReGEC consists of the following steps. From the feature set we select the feature $j$ with the highest score $s_j$, obtained by (6). Then we consider the set of non-selected features. At each iteration $p$, the candidate is the non-selected feature with the highest score. Let $D_p$ be the dataset obtained by considering the features selected at iteration $(p-1)$ together with the candidate. We run the linear MultiReGEC algorithm (Guarracino et al. [20]) and compute the accuracy rate on the training set. If the last added feature increases the accuracy on the training set, we add it to the set of selected features. The procedure iterates as long as a candidate yields an increment of the total accuracy. To explain this procedure better, the FS algorithm is presented in Fig. 2.
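The three ingredients above, the membership of Eq. (5), the score of Eq. (6), and the greedy accuracy-driven loop, can be sketched as follows. This is a structural sketch only: `accuracy_fn` stands in for retraining a linear MultiReGEC on the current subset and measuring its training-set accuracy, the squared distances are assumed to be precomputed from the per-feature classification hyperplanes, and halting at the first candidate that fails to improve accuracy is one reading of the stopping rule described above.

```python
import math

def foregec_membership(d_own_sq, dm_sq, f):
    """Eq. (5): membership of pattern component x_ij to its own class c_i.

    d_own_sq : squared distance of x_ij from the class-c_i hyperplane
    dm_sq    : minimum squared distance of x_ij from the hyperplanes of
               the other classes (assumed > 0)
    f        : fixed minimum membership
    """
    return f + (1.0 - f) * math.exp(-d_own_sq / dm_sq)

def fuzzy_score(own_dists_sq, rival_min_dists_sq, f):
    """Eq. (6): score s_j of feature j, summing the memberships of all
    training patterns to their own classes."""
    return sum(foregec_membership(d, dm, f)
               for d, dm in zip(own_dists_sq, rival_min_dists_sq))

def ffs_regec(scores, accuracy_fn):
    """Greedy wrapper loop: start from the top-scored feature and keep
    adding the best-scored remaining candidate while it improves the
    training-set accuracy reported by accuracy_fn."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # rank by s_j
    selected = [ranked[0]]                                 # top-scored feature first
    best_acc = accuracy_fn(selected)
    for candidate in ranked[1:]:
        acc = accuracy_fn(selected + [candidate])
        if acc > best_acc:
            selected.append(candidate)
            best_acc = acc
        else:
            break  # stop at the first candidate that does not improve accuracy
    return selected
```

For instance, with scores `{0: 3.0, 1: 2.0, 2: 1.0}` the features are tried in the order 0, 1, 2, and the loop stops as soon as an added feature fails to raise the training accuracy.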