International Journal of Artificial Intelligence and Soft Computing (IJAISC) Vol. 1, No. 1

Multiobjective Partitional Clustering for Fuzzy and Mixed Data through Hill Climbing

Pablo Barbaro Martinez Pedroso
Hotel Melia Cayo Coco, Ciego de Ávila, Cuba, +53 58053324
martinezpedroso@gmail.com

Darian Horacio Grass Boada
Centro de Aplicaciones de Tecnologías Avanzadas (CENATAV), La Habana, Cuba, +53 54500567
dgrass@cenatav.co.cu

Amanda Robinson
Provalis Research, Montreal, Canada, +514 268 5…88
a.robinson@provalisresearch.com

ABSTRACT
In this paper we have designed and implemented multiple possible stochastic hill climbing alternatives, applied to mixed (continuous and categorical) data in a fuzzy context, as proposed solutions to a multiobjective partitional clustering problem. To validate the efficacy of this approach we selected the external validity indexes Adjusted Rand Index (ARI) and Minkowski Score (MS). An approach capable of performing multiobjective partitional clustering with mixed data, which also provides solutions modeled with fuzzy logic, allowing for a better description of the distribution of objects among the clusters, was obtained as a result of the research.

CCS Concepts
• Theory of computation → Unsupervised learning and clustering
• Applied computing → Multi-criterion optimization and decision-making
• Computing methodologies → Cluster analysis.

Keywords
Partitional clustering; multiobjective hill climbing; fuzzy domain; mixed data.
1. INTRODUCTION
The large volume of information stored in enterprises, entities, institutions, etc. surpasses the human capability for data analysis and comprehension. The process of knowledge discovery from databases can be applied in order to extract unknown and interesting trends. Partitional clustering is a relevant unsupervised task of this process, which is defined as follows.

Given a set of objects $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ and $x_{ij}$ is a feature of the object, $X$ is divided into $k$ partitions (clusters, groups) $C = \{C_1, C_2, \ldots, C_k\}$, where:

    $C_i \neq \emptyset, \quad i = 1, \ldots, k$    (1)

    $\bigcup_{i=1}^{k} C_i = X$    (2)

    $C_i \cap C_j = \emptyset, \quad i, j = 1, \ldots, k, \; i \neq j$    (3)

Equation (3) defines crisp partitional clustering. However, there exist domains where the frontiers of groups are not very clear. Modeling the problem as fuzzy partitional clustering allows more accuracy with respect to the memberships of objects among the groups, in contrast to crisp partitional clustering, which considers complete belongingness of objects to clusters. This information plays an important role for the decision maker. In order to model fuzzy partitional clustering, Equation (3) is substituted by a data structure known as a membership matrix, defined in (Xu 2009) as $U = [u_{ij}]_{k \times n}$, where $u_{ij} \in [0, 1]$ is the membership coefficient of object $x_j$ in the $i$-th group, satisfying the following restrictions:

    $\sum_{i=1}^{k} u_{ij} = 1, \quad \forall j$    (4)

    $0 < \sum_{j=1}^{n} u_{ij} < n, \quad \forall i$    (5)

where $k$ is the number of clusters and $n$ the total amount of objects. Equation (4) is the membership distribution of each object among the clusters, whereas Equation (5) prevents obtaining empty groups.

As (Xu 2009) outlines, optimal partitioning cannot be obtained due to the extreme computational cost. Thus heuristics are needed: although optimal solutions cannot be guaranteed, at least near-optimal solutions can be found.

A well-known technique is the k-means algorithm, whose procedure is as follows. First, $k$ objects are randomly selected as centers of clusters. All other objects are grouped to the nearest center, based on a distance metric (Euclidean distance). Then the centers of clusters are updated through Equation (7). This procedure iterates until no new centers are computed or an iteration limit is reached. The final centers obtained represent and characterize the clusters.
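The k-means procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `seed` parameter, and the choice to keep the previous center when a cluster becomes empty are assumptions made here, and the center update is taken to be the arithmetic mean of the cluster members, which is the usual k-means update.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centers) for a list of numeric tuples.

    Assumption: an empty cluster keeps its previous center.
    """
    rng = random.Random(seed)
    # Step 1: k objects are randomly selected as initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: every object is grouped to its nearest center.
        labels = [
            min(range(k), key=lambda i: euclidean(p, centers[i]))
            for p in points
        ]
        # Step 3: each center becomes the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[i])
        # Stop when no new centers are computed.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

On this small example the two well-separated groups end up in different clusters; on real data the result depends on the random initialization, which is one reason k-means only promises near-optimal solutions.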
The Euclidean distance and the center update used by k-means are given by:

    $d_E(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$    (6)

    $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$    (7)

The k-means algorithm is only suitable for a numeric domain, since the Euclidean distance is purely numeric. Nevertheless, many real-life data sets are categorical in nature, as pointed out in (Anirban Mukhopadhyay 2007), and a variation of k-means is needed. In this variation the dissimilarity between objects is measured by Equation (8), extracted from (Huang 1998):

    $d_C(x_i, x_j) = \sum_{l=1}^{d} \delta(x_{il}, x_{jl})$    (8)

where:

    $\delta(x_{il}, x_{jl}) = \begin{cases} 0, & x_{il} = x_{jl} \\ 1, & x_{il} \neq x_{jl} \end{cases}$    (9)

Centers are constructed with the mode of each feature of the cluster objects; this variation is known as k-modes.

However, most real-world data sets are mixed in nature (numeric and categorical features). In such domains none of the previous alternatives can be applied in their original design. To overcome this limitation, (Huang 1998) proposes an integration of both techniques in a procedure called k-prototypes. In this procedure the distance function of Equation (6) and the dissimilarity function of Equation (8) are used to compare numerical and categorical features respectively. Thus the difference between two objects $x_i$ and $x_j$ with mixed features, each denoted as a vector of attributes $(x^r_{1}, \ldots, x^r_{p}, x^c_{p+1}, \ldots, x^c_{d})$ with $p$ numerical features followed by $d - p$ categorical features, is calculated as follows:

    $d(x_i, x_j) = d_E(x^r_i, x^r_j) + \gamma \, d_C(x^c_i, x^c_j)$    (10)

where $\gamma$ is used to avoid favoritism toward either feature type. When computing representative objects, Equation (7) is used for numerical data and the mode for categorical data.

Note that in all of these methods the representative objects of clusters are constructed rather than selected from the data. In (Kamber 2006) the authors state that k-means and its variations, as previously presented, are sensitive to outliers, i.e. data extremely distant from a cluster can substantially degrade the solution. An alternative is selecting an existing object as the representative, with all other objects grouped to the most similar representative, computed with Equation (11).
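The mixed dissimilarity of Equation (10) and the k-prototypes representative (mean for numerical features, mode for categorical ones) can be sketched as below. This is an illustrative sketch under stated assumptions: the numeric part is taken to be the plain Euclidean distance of Equation (6), the function names, the index-list arguments, and the value of `gamma` are hypothetical, and ties between equally frequent categories are resolved in favor of the one seen first.

```python
import math
from collections import Counter


def mixed_dissimilarity(a, b, numeric_idx, categorical_idx, gamma=1.0):
    """Euclidean distance on numeric features plus gamma times the
    number of mismatching categorical features."""
    numeric = math.sqrt(sum((a[l] - b[l]) ** 2 for l in numeric_idx))
    mismatches = sum(1 for l in categorical_idx if a[l] != b[l])
    return numeric + gamma * mismatches


def prototype(members, numeric_idx, categorical_idx):
    """Representative of a cluster: mean of each numeric feature,
    mode of each categorical feature."""
    rep = [None] * len(members[0])
    for l in numeric_idx:
        rep[l] = sum(m[l] for m in members) / len(members)
    for l in categorical_idx:
        rep[l] = Counter(m[l] for m in members).most_common(1)[0][0]
    return tuple(rep)


a = (1.0, 2.0, "red", "small")
b = (4.0, 6.0, "red", "large")
# Numeric part: sqrt(3^2 + 4^2) = 5; categorical part: one mismatch.
distance = mixed_dissimilarity(a, b, [0, 1], [2, 3], gamma=0.5)

cluster = [
    (1.0, 2.0, "red", "small"),
    (3.0, 4.0, "red", "large"),
    (5.0, 6.0, "blue", "large"),
]
rep = prototype(cluster, [0, 1], [2, 3])
```

The weight `gamma` controls how much one categorical mismatch counts relative to a unit of numeric distance, so in practice it has to be tuned to the scale of the numeric features.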
    $E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, z_i)$    (11)

where $E$ is the total sum of error, $x$ an object of cluster $C_i$ and $z_i$ its corresponding representative object. This strategy is known as k-medoids, a medoid being the most centrally located object in a cluster, i.e. the representative object. Methods based on k-medoids are not restricted to a specific data type, thus the distance metric of Equation (6) or the dissimilarity measure of Equation (8) can be adapted without limitation to the k-medoids procedure. Nevertheless, Equation (10) is selected in order to cover a mixed data domain.

So far a mixed data domain has been tackled; however, most real data sets do not have clear enough frontiers between clusters. In order to overcome this limitation, a variation of the previously mentioned technique is adopted, known as fuzzy k-medoids. It partitions the entire data set into $k$ clusters, considering that each object belongs to all clusters with a degree of belongingness or membership, and is defined in (Mukhopadhyay 2013) as follows:

    $J(U, Z : X) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^{m} * d(x_j, z_i)$    (12)

where $U$ represents the matrix of the fuzzy partition, $u_{ij}$ the membership degree of object $x_j$ to cluster $i$, $Z = (z_1, \ldots, z_k)$ the vector of medoids, and $m > 1$ the fuzzy exponent.

    $u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{d(x_j, z_i)}{d(x_j, z_l)} \right)^{\frac{1}{m-1}}}$    (13)

The method starts with $k$ randomly selected medoids. In each iteration, after the membership matrix is calculated with Equation (13), it is used to recompute the medoids with Equation (14). The medoid $z_i$ of the $i$-th cluster satisfies $z_i = x_q$ such that:

    $q = \arg\min_{1 \le j \le n} \sum_{l=1}^{n} u_{il}^{m} * d(x_l, x_j)$    (14)

The techniques presented so far optimize only one criterion (compactness) over the entire data set. For distributions of this type, optimizing compactness can yield good solutions. However, since clustering is an unsupervised technique, there is no previous knowledge about the distribution of the objects. Moreover, one criterion alone cannot uncover groups of distinct types; therefore, as (Hruschka 2009) suggests, the quality of clusters should be measured by multiple criteria instead of a single criterion.

Optimizing more than one criterion has been proposed in two main approaches: ensemble and multiobjective (Hruschka 2009). Although (Handl J. 2007) and (Hruschka 2009) note that the ensemble approach tends to be more robust and provides better solutions than single-objective optimization, they posit that it does not fully exploit the potential of using several criteria. Since the ensemble approach is restricted to integrating solutions provided by multiple single-objective optimization techniques, it does not exploit solutions that are simultaneously optimized. On the other hand, such solutions are explored by the multiobjective approach.

The multiobjective approach, introduced in (Handl J. 2004a), simultaneously optimizes several objectives that conflict with each other, so that optimizing one degrades another. In such an approach, all objective functions are considered part of the problem, each with the same level of priority.

The formalization of the multiobjective optimization (MOO) problem is extracted from (Mukhopadhyay 2007a): find the vector $\bar{x}^* = [x_1^*, x_2^*, \ldots, x_n^*]^T$ of decision variables that will satisfy the $m$ inequality constraints

    $g_i(\bar{x}) \geq 0, \quad i = 1, 2, \ldots, m$    (15)

the $p$ equality constraints

    $h_i(\bar{x}) = 0, \quad i = 1, 2, \ldots, p$    (16)

and optimizes the vector function

    $\bar{f}(\bar{x}) = [f_1(\bar{x}), f_2(\bar{x}), \ldots, f_k(\bar{x})]^T$    (17)

To clarify when a solution is considered optimal, the principles of Pareto are applied in this research. Related concepts can be found in (Coello Coello 2007) and are defined as follows.

A solution $x \in \Omega$ is said to be Pareto optimal with respect to $\Omega$ if and only if there is no $x' \in \Omega$ for which $v = (f_1(x'), \ldots, f_k(x'))$ dominates $u = (f_1(x), \ldots, f_k(x))$.

A vector $u = (u_1, \ldots, u_k)$ dominates another vector $v = (v_1, \ldots, v_k)$, denoted by $u \preceq v$, if and only if $u$ is partially less than $v$, that is, $\forall i \in \{1, \ldots, k\}, \; u_i \leq v_i \; \wedge \; \exists i \in \{1, \ldots, k\}$ such that $u_i < v_i$.

Applying the principles of Pareto to MOO, a set of solutions rather than a single one is obtained, known as the Pareto optimal set, which is in fact the aim of the process. For a given MOO problem $\bar{f}(x)$, the Pareto optimal set $P^*$ is defined as:

    $P^* := \{ x \in \Omega \mid \nexists \, x' \in \Omega : \bar{f}(x') \preceq \bar{f}(x) \}$    (18)

The objective of this paper is to develop a multiobjective optimization procedure for partitional clustering capable of covering mixed and fuzzy data.
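The Pareto concepts above translate directly into a small filter over a finite set of candidate solutions. The sketch below assumes that every objective is to be minimized, which matches the "partially less than" definition of dominance; the function names and the toy objective vectors are hypothetical and serve only as illustration.

```python
def dominates(u, v):
    """True if objective vector u dominates v (minimization):
    u is no worse in every objective and strictly better in one."""
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better


def pareto_set(solutions, objectives):
    """Keep the solutions whose objective vectors are not dominated
    by the objective vector of any other candidate."""
    scored = [(s, objectives(s)) for s in solutions]
    return [
        s
        for s, fs in scored
        if not any(dominates(ft, fs) for _, ft in scored)
    ]


# Toy example: each candidate is already its own objective vector.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_set(candidates, objectives=lambda s: s)
```

Here `(3, 3)` is removed because `(2, 2)` is better in both objectives, while the remaining three candidates are mutually non-dominated: each trades one objective against the other, which is why a set of solutions rather than a single one is returned.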