2009 ISECS International Colloquium on Computing, Communication, Control, and Management
978-1-4244-4246-1/09/$25.00 ©2009 IEEE. CCCM 2009
A Semi-supervised Clustering via Orthogonal Projection
Cui Peng
Harbin Engineering University, Harbin 150001, China
cuipeng83@163.com

Zhang Rubo
Harbin Engineering University, Harbin 150001, China
zrbzrb@hrbeu.edu.cn
Abstract: As its dimensionality is very high, an image feature space is usually complex. To process this space effectively, dimensionality reduction techniques are widely used. Semi-supervised clustering incorporates limited supervision into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method via constrained orthogonal projection. Experimental results on different datasets show that the method has good clustering performance on high-dimensional data.
Keywords: dimensionality reduction; clustering; projection; semi-supervised learning
I. INTRODUCTION

In recent years, because of the fast growth of feature information and the volume of image data, many tasks in multimedia processing have become increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved very useful in image retrieval, classification, and clustering.

There are a number of dimensionality reduction techniques in the literature. One of the classical methods is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. One disadvantage of PCA is that it is likely to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3][4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix; LPP, on the other hand, may overlook global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6]-[10]. Various reduction techniques have been developed to utilize this form of knowledge [11][12]. The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

Because many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data. Thus, a dimensionality reduction step must be added to the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

II. COPFC METHOD FRAMEWORK

Figure 1. COPFC framework
Figure 1 shows the framework of the COPFC method. Given a set of instances, a set of must-link constraints C_ML = {(x_i, x_j)}, where each pair (x_i, x_j) must reside in the same cluster, and a set of cannot-link constraints C_CL = {(x_i, x_j)}, where each pair (x_i, x_j) should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method is exploited to reduce the unlabelled instances and pairwise constraints according to the transitivity property of the must-link constraints. In the second step, a constraint-guided orthogonal projection method, called COPFC_proj, is used to project the original data into a low-dimensional space. Finally, we apply a semi-supervised fuzzy clustering algorithm, called COPFC_fuzzy, to produce the clustering results on the projected low-dimensional dataset.
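The transitivity-based preprocessing in the first step can be illustrated with a short sketch (the paper gives no pseudocode, so the function name and representation below are our own): must-linked instances are merged into groups via a union-find pass over the constraint pairs, so that each transitively-closed group can be handled as a unit and redundant constraints dropped.

```python
def mustlink_components(n, must_links):
    """Group instance indices 0..n-1 into components under the
    transitive closure of must-link pairs (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i, j in must_links:
        union(i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

For example, with 5 instances and must-links (0,1), (1,2), (3,4), the pairs (0,2) need not be stated explicitly: transitivity places 0, 1 and 2 in one group and 3, 4 in another.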
III. COPFC_proj: A CONSTRAINED ORTHOGONAL PROJECTION METHOD
In a typical image retrieval system, each image is represented by an m-dimensional feature vector x whose j-th value is denoted x_j. During the retrieval process, the user is allowed to mark with must-links several images which match his query interest, and to indicate with cannot-links those that are apparently irrelevant. COPFC_proj is a linear method and depends on a set of l axes p_i. For a given image x, its embedding coordinates are the projections of x onto the l axes:

$P_i^x = \sum_{j=1}^{m} x_j p_{ij}, \quad 1 \le i \le l.$
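In matrix form the embedding is simply a product with the axis matrix. A minimal NumPy illustration (variable names are ours, not the paper's):

```python
import numpy as np

def embed(x, P):
    """Project an m-dimensional vector x onto l axes stored as the
    rows of P (shape l x m): coordinates P_i^x = sum_j x_j * p_ij."""
    return P @ x

# Example: m = 3 features, l = 2 axes picking out the first two features.
x = np.array([1.0, 2.0, 3.0])
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
coords = embed(x, P)  # -> array([1., 2.])
```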
As the images in the set ML are considered mutually similar to each other, they should be kept compact in the new space. In other words, the distances among them should be kept small, while the irrelevant images in CL are to be mapped as far apart from those in ML as possible. These two criteria can be formally stated as follows:

$\min \sum_{x \in ML} \sum_{y \in ML} \sum_{i=1}^{l} (P_i^x - P_i^y)^2$   (1)

$\max \sum_{x \in ML} \sum_{y \in CL} \sum_{i=1}^{l} (P_i^x - P_i^y)^2$   (2)

Intuitively, equation (1) forces the embedding to have the image points in ML reside in a small local neighborhood in the new feature space, and equation (2) reflects our objective of preventing the points in ML and CL from lying close together after the embedding. To construct a salient embedding, COPFC_proj combines these two criteria and finds the axes in a one-by-one fashion, each optimizing the following objective:
$\min \sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2$   (3)

subject to

$\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = 1$   (4)

$p_i^T p_1 = p_i^T p_2 = p_i^T p_3 = \dots = p_i^T p_{i-1} = 0$   (5)

where T denotes the transpose of a vector. The choice of the constant 1 on the right-hand side of equation (4) is rather arbitrary, as any other value (except 0) would not cause any substantial change in the embedding produced. The constraint in equation (5) forces all the axes to be mutually orthogonal. Equations (3) and (4) are implicit functions of the axes p_i and should be rewritten in explicit form.

First, we introduce the necessary notation. For a given set X of image points, the mean of X is an m-dimensional column vector M(X), whose i-th component is

$M(X)_i = \frac{1}{|X|} \sum_{x \in X} x_i$   (6)

and its covariance matrix C(X) is an m × m matrix:

$C(X)_{ij} = \frac{1}{|X|} \sum_{x \in X} \left( x_i x_j - M(X)_i M(X)_j \right)$   (7)

For two sets X and Y, define an m × m matrix M(X, Y), in which

$M(X, Y) = (M(X) - M(Y)) (M(X) - M(Y))^T.$
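The three statistics above translate directly into NumPy. A sketch (X and Y are arrays whose rows are image vectors; the helper names are ours):

```python
import numpy as np

def mean_vec(X):
    """Equation (6): M(X)_i = (1/|X|) * sum over x in X of x_i."""
    return X.mean(axis=0)

def cov_mat(X):
    """Equation (7): C(X)_ij = (1/|X|) * sum (x_i x_j - M(X)_i M(X)_j)."""
    m = mean_vec(X)
    return (X.T @ X) / len(X) - np.outer(m, m)

def mean_diff_outer(X, Y):
    """M(X, Y) = (M(X) - M(Y)) (M(X) - M(Y))^T."""
    d = mean_vec(X) - mean_vec(Y)
    return np.outer(d, d)
```

Note that equation (7) is the biased (divide-by-|X|) covariance, which is what the (X^T X)/|X| - M M^T identity computes.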
Accordingly, we can rewrite equation (3) as follows:
$\sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2 = 2|ML|^2 \, p_i^T C(ML) \, p_i$   (8)

Similarly, we can rewrite equation (4) as follows:

$\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = |ML||CL| \, p_i^T \left( C(X) + C(Y) + M(X, Y) \right) p_i$   (9)

Hence, the problem to be solved is $\min p_i^T A p_i$, subject to $p_i^T B p_i = 1$ and $p_i^T p_1 = \dots = p_i^T p_{i-1} = 0$, where $A = 2|ML|^2 C(ML)$ and $B = |ML||CL| \left( C(X) + C(Y) + M(X, Y) \right)$. It is easy to see that both A and B are symmetric and positive semidefinite. The above problem can be solved using the method of Lagrange multipliers. Below we discuss the procedure for obtaining the optimal axes.

The first projection axis is the eigenvector of the generalized eigenproblem $A p_1 = \lambda B p_1$ corresponding to the smallest eigenvalue. After that, we compute the remaining axes one by one in the following fashion. Suppose we have already obtained the first (k-1) axes; define
$P^{(k-1)} = [p_1, p_2, \dots, p_{k-1}], \quad Q^{(k-1)} = [P^{(k-1)}]^T B^{-1} P^{(k-1)}$   (10)

Then the k-th axis p_k is the eigenvector associated with the smallest eigenvalue of the eigenproblem

$\left( I - B^{-1} P^{(k-1)} [Q^{(k-1)}]^{-1} [P^{(k-1)}]^T \right) B^{-1} A \, p_k = \lambda p_k$   (11)

We adopt the above procedure to determine the l optimal orthogonal projection axes, which preserve the metric structure of the image space for the given relevance feedback information. The new coordinates for the image data points can then be derived accordingly.
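A compact sketch of this axis-by-axis procedure in plain NumPy (our own code, not the authors'): the generalized problem A p = λ B p is solved as an ordinary eigenproblem of B⁻¹A, a small ridge is added because the paper only guarantees B to be positive semidefinite, and the near-zero eigenvalues that the deflation in (11) produces for already-chosen directions are skipped, which is our reading of "smallest eigenvalue" in the recursion.

```python
import numpy as np

def copfc_axes(A, B, l, ridge=1e-8, tol=1e-6):
    """Sketch of the optimal-axis recursion.  First axis: eigenvector
    of B^{-1} A with the smallest eigenvalue.  Axis k: smallest
    non-degenerate eigenvalue of the deflated operator in eq. (11).
    `ridge` keeps B invertible (our addition)."""
    m = A.shape[0]
    Binv = np.linalg.inv(B + ridge * np.eye(m))

    def smallest(Mat, skip_zero):
        vals, vecs = np.linalg.eig(Mat)
        cand = [(v.real, k) for k, v in enumerate(vals)
                if (v.real > tol) or not skip_zero]
        _, k = min(cand)
        return vecs[:, k].real

    axes = [smallest(Binv @ A, skip_zero=False)]
    for _ in range(1, l):
        P = np.column_stack(axes)                       # P^{(k-1)}
        Q = P.T @ Binv @ P                              # Q^{(k-1)}
        Mk = (np.eye(m) - Binv @ P @ np.linalg.inv(Q) @ P.T) @ Binv @ A
        axes.append(smallest(Mk, skip_zero=True))       # equation (11)
    return np.column_stack(axes)
```

With diagonal A and B = I the recursion should pick out the coordinate axes in increasing order of the diagonal entries of A, which is a convenient sanity check.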
IV. COPFC_fuzzy: SEMI-SUPERVISED CLUSTERING
COPFC_fuzzy is a new search-based semi-supervised clustering algorithm that allows the constraints to guide the clustering process towards an appropriate partition. To this end, we define an objective function that takes into account both the feature-based similarity between data points and the pairwise constraints [14]-[16]. Let ML be the set of must-link constraints, i.e., (x_i, x_j) ∈ ML implies that x_i and x_j should be assigned to the same cluster, and CL the set of cannot-link constraints, i.e., (x_i, x_j) ∈ CL implies that x_i and x_j should be assigned to different clusters. We can write the objective function that COPFC_fuzzy must minimize:

$J(V, U) = \sum_{k=1}^{C} \sum_{i=1}^{N} u_{ik}^2 \, d^2(x_i, \mu_k) + \lambda \left( \sum_{(x_i, x_j) \in ML} \sum_{k=1}^{C} \sum_{l=1, l \ne k}^{C} u_{ik} u_{jl} + \sum_{(x_i, x_j) \in CL} \sum_{k=1}^{C} u_{ik} u_{jk} \right) - \gamma \sum_{k=1}^{C} \left( \sum_{i=1}^{N} u_{ik} \right)^2$   (12)
The first term in equation (12) is the sum of squared distances to the prototypes, weighted by the constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters. The second component in equation (12) is composed of the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints. This term is weighted by λ, a constant factor that specifies the relative importance of the supervision. The third component in equation (12), the sum of the squares of the cluster cardinalities, controls the competition between clusters. It is weighted by γ.

When the parameters are well chosen, the final partition will minimize the sum of intra-cluster distances, while partitioning the dataset into the smallest number of clusters such that the specified constraints are respected as well as possible.
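As a concrete reading of equation (12), the sketch below evaluates J for given memberships and prototypes (a direct, unoptimized transcription with our own names; the actual algorithm would minimize J via alternating updates, which are not detailed here):

```python
import numpy as np

def copfc_objective(X, U, V, ML, CL, lam, gamma):
    """Value of objective (12): X is N x m data, U is N x C memberships,
    V is C x m prototypes, ML/CL are lists of index pairs."""
    N, C = U.shape
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # N x C
    fcm = (U ** 2 * d2).sum()                 # compactness term
    viol = 0.0
    for i, j in ML:                           # must-link pair split apart
        for k in range(C):
            for l in range(C):
                if l != k:
                    viol += U[i, k] * U[j, l]
    for i, j in CL:                           # cannot-link pair together
        viol += (U[i] * U[j]).sum()
    comp = (U.sum(axis=0) ** 2).sum()         # cardinality competition
    return fcm + lam * viol - gamma * comp
```

With crisp memberships the constraint term reduces to a count of violated constraints, which matches the intuition given above.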
V. EXPERIMENTAL EVALUATION

A. Dataset selection and evaluation criterion
We performed experiments on the COREL image database and two datasets from the UCI repository, as follows:

(1) We selected 1500 images from the COREL image database. They were divided into 15 sufficiently distinct classes of 100 images each. In our experiments, each image was represented by a 37-dimensional vector, which included 3 types of features extracted from the image. We compared the COPFC_proj algorithm against PCA and SSDR. The performance of each technique was evaluated under various amounts of domain knowledge and different reduced dimensionalities. In each scenario, after the dimensionality reduction, K-means was applied to classify the test images.

(2) The Iris and Wine datasets from the UCI repository. The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains three classes, 178 instances, and 13 numerical attributes. The simplicity and low dimensionality of these datasets also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFC_fuzzy, we compared the COPFC_fuzzy algorithm against the K-means and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure.
$CRI = \frac{A - C}{n \times (n - 1)/2 - C}$   (13)

where A is the number of instance pairs for which the assigned clustering agrees with the actual clustering; n is the number of instances in the dataset, so that n × (n − 1)/2 is the number of all instance pairs in the dataset; and C is the number of constraints.

For each dataset, we ran each experiment 20 times. To study the effect of constraints, 100 constraints were generated randomly for the test set. Each point on the learning curve is an average of the results over 20 runs.
B. The effectiveness of COPFC

In figure 2, we use three different dimensionality reduction methods (COPFC_proj, PCA, SSDR) on the original images. The dimensionality is reduced to 15 and 20, respectively. On the reduced-dimensional data, we used K-means for clustering. The curves in figure 2 show that the clustering performance of the PCA method is independent of the number of constraints, while the clustering performance of SSDR changes only slightly. For COPFC_proj, the clustering performance improves substantially as the number of constraints increases. When there is only a small number of constraints, the clustering performance of COPFC_proj is the worst of the three methods. In general, however, COPFC_proj outperforms PCA and SSDR at reduced dimensionalities.
Figure 2. Clustering performance with different numbers of constraints (x-axis: number of constraints, 10-100; y-axis: CRI; curves: COPFC_proj, SSDR, PCA; panel (b): dimension = 20)
Figure 3 shows the clustering performance of three methods on the Iris and Wine datasets. On both datasets, COPFC_fuzzy obtained the best performance. Among the three methods, the clustering performance of K-means is the worst. Though the clustering performance of PCKmeans is effectively improved over K-means, it is still worse than that of COPFC_fuzzy.
Figure 3. Clustering performance on UCI datasets: (a) Iris dataset, (b) Wine dataset (x-axis: number of constraints, 10-100; y-axis: CRI; curves: COPFC, PCKmeans, Kmeans)
VI. CONCLUSION AND FUTURE WORK
We propose a semi-supervised fuzzy clustering method via orthogonal projection to handle high-dimensional sparse data in image feature space. The method reduces the dimensionality of the images via orthogonal projection, and clusters the reduced-dimensional data with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right number for the reduced dimensionality based on the background knowledge, rather than requiring a pre-specified value. Second, we plan to explore alternative methods of employing supervision to guide the unsupervised clustering.
REFERENCES

[1] X. Yang, H. Fu, and H. Zha, "Semi-Supervised Nonlinear Dimensionality Reduction," in Proc. of the 23rd Intl. Conf. on Machine Learning, 2006.
[2] C. Ding and X. He, "K-Means Clustering via Principal Component Analysis," in Proc. of the 21st Intl. Conf. on Machine Learning, 2004.
[3] D. Cai and X. F. He, "Orthogonal Locality Preserving Projection," in Proc. of the 28th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005.
[4] X. F. He and P. Niyogi, "Locality Preserving Projections," in Neural Information Processing Systems (NIPS '03), 2003.
[5] H. Cheng, K. Hua, and K. Vu, "Semi-Supervised Dimensionality Reduction in Image Feature Space," Technical Report, University of Central Florida, 2007.
[6] K. Wagstaff and C. Cardie, "Clustering with Instance-Level Constraints," in Proc. of the 17th Int'l Conf. on Machine Learning, San Francisco: Morgan Kaufmann Publishers, 2000.
[7] S. Basu, "Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments," Austin: The University of Texas, 2005.
[8] S. Basu, A. Banerjee, and R. J. Mooney, "Semi-supervised Clustering by Seeding," in Proc. of the 19th Int'l Conf. on Machine Learning (ICML 2002), pp. 19-26.
[9] K. Wagstaff, C. Cardie, and S. Rogers, "Constrained K-means Clustering with Background Knowledge," in Proc. of the 18th Int'l Conf. on Machine Learning, Williamstown: Morgan Kaufmann Publishers, 2001, pp. 577-584.
[10] D. Klein, S. D. Kamvar, and C. D. Manning, "From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering," in Proc. of the 19th Int'l Conf. on Machine Learning, Sydney: Morgan Kaufmann Publishers, 2002, pp. 307-314.
[11] T. Hertz, N. Shental, and A. Bar-Hillel, "Enhancing Image and Video Retrieval: Learning via Equivalence Constraints," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Madison: IEEE Computer Society, 2003, pp. 668-674.
[12] T. Deselaers, D. Keysers, and H. Ney, "Features for Image Retrieval: A Quantitative Comparison," in Pattern Recognition, 26th DAGM Symposium, 2004.
[13] D. Zhang, Z. H. Zhou, and S. Chen, "Semi-Supervised Dimensionality Reduction," in Proc. of the 2007 SIAM Intl. Conf. on Data Mining (SDM '07), 2007.
[14] N. Grira, M. Crucianu, and N. Boujemaa, "Semi-supervised Fuzzy Clustering with Pairwise-Constrained Competitive Agglomeration," in IEEE International Conference on Fuzzy Systems, 2005.
[15] H. Frigui and R. Krishnapuram, "Clustering by Competitive Agglomeration," Pattern Recognition 30 (7), 1997, pp. 1109-1119.
[16] M. Bilenko and R. J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," in International Conference on Knowledge Discovery and Data Mining, Washington, DC, 2003, pp. 39-48.