
A Semi-supervised Clustering via Orthogonal Projection

2009 ISECS International Colloquium on Computing, Communication, Control, and Management (CCCM 2009)

A Semi-supervised Clustering via Orthogonal Projection

Cui Peng, Harbin Engineering University, Harbin 150001, China, cuipeng83@163.com
Zhang Ru-bo, Harbin Engineering University, Harbin 150001, China, zrbzrb@hrbeu.edu.cn

Abstract - Because its dimensionality is very high, the image feature space is usually complex. Dimensionality reduction techniques are therefore widely used to process this space effectively. Semi-supervised clustering incorporates limited supervision into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method based on constrained orthogonal projection. Experiments on several datasets show that the method achieves good clustering performance on high-dimensional data.

Keywords - dimension reduction; clustering; projection; semi-supervised learning

I. INTRODUCTION

In recent years, the rapid growth of feature information and of the volume of image data has made many multimedia processing tasks increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved very useful in image retrieval, classification, and clustering.

There are a number of dimensionality reduction techniques in the literature. One classical method is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. A disadvantage of PCA is that it is likely to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3-4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix; LPP, on the other hand, may overlook global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6-10], and various dimensionality reduction techniques have been developed to exploit this form of knowledge [11-12]. The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

Because many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data, so a dimensionality reduction step must be added to the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

II. COPFC METHOD FRAMEWORK

[Figure 1. COPFC framework]

Figure 1 shows the framework of the COPFC method. Given a set of instances, a set of must-link constraints C_ML = {(x_i, x_j)}, where x_i and x_j must reside in the same cluster, and a set of cannot-link constraints C_CL = {(x_i, x_j)}, where x_i and x_j should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method reduces the unlabelled instances and pairwise constraints according to the transitivity property of must-link constraints (a sketch of this step is given below). In the second step, a constraint-guided orthogonal projection method, called COPFC_proj, projects the original data into a low-dimensional space. Finally, a semi-supervised fuzzy clustering algorithm, called COPFC_fuzzy, produces the clustering results on the projected low-dimensional dataset.
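The paper gives no pseudocode for the preprocessing step, so the following is a minimal sketch of one standard way to exploit must-link transitivity: a union-find pass that merges must-linked instances into groups and restates cannot-links between those groups. The function names and data layout are illustrative assumptions, not the paper's API.

```python
# Sketch of the COPFC preprocessing step (step 1): use the transitivity of
# must-link constraints to merge instances into groups before projection.
# Names and data layout are illustrative, not taken from the paper.

def mustlink_components(n_instances, must_links):
    """Union-find over the must-link graph; returns a group id per instance."""
    parent = list(range(n_instances))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in must_links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # transitivity: merge the two must-link groups

    return [find(i) for i in range(n_instances)]


def lift_cannot_links(groups, cannot_links):
    """Restate cannot-links between must-link groups instead of instances."""
    lifted = set()
    for i, j in cannot_links:
        gi, gj = groups[i], groups[j]
        if gi != gj:  # a cannot-link inside one group would be inconsistent
            lifted.add((min(gi, gj), max(gi, gj)))
    return lifted
```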
III. COPFC_proj - A CONSTRAINED ORTHOGONAL PROJECTION METHOD

In a typical image retrieval system, each image is represented by an m-dimensional feature vector x whose j-th value is denoted x_j. During the retrieval process, the user is allowed to mark several images with must-links that match his query interest, and also to mark apparently irrelevant images with cannot-links. COPFC_proj is a linear method and depends on a set of l axes p_i. For a given image x, its embedding coordinates are the projections of x onto the l axes:

P_i^x = \sum_{j=1}^{m} x_j p_{ij}, \quad 1 \le i \le l.

Since the images in the set ML are considered mutually similar, they should be kept compact in the new space; that is, the distances among them should be kept small, while the irrelevant images in CL should be mapped as far from those in ML as possible. These two criteria can be stated formally as follows:

\min \sum_{x \in ML} \sum_{y \in ML} \sum_{i=1}^{l} (P_i^x - P_i^y)^2    (1)

\max \sum_{x \in ML} \sum_{y \in CL} \sum_{i=1}^{l} (P_i^x - P_i^y)^2    (2)

Intuitively, equation (1) forces the embedding to place the points of ML in a small local neighborhood in the new feature space, and equation (2) reflects the objective of preventing the points of ML and CL from lying close together after the embedding. To construct a salient embedding, COPFC_proj combines these two criteria and finds the axes one by one, optimizing the objective

\min \sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2    (3)

subject to

\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = 1    (4)

p_i^T p_1 = p_i^T p_2 = p_i^T p_3 = \cdots = p_i^T p_{i-1} = 0    (5)

where T denotes the transpose of a vector. The choice of the constant 1 on the right-hand side of equation (4) is rather arbitrary, as any other value (except 0) would not cause any substantial change in the embedding produced. The constraint in equation (5) forces all axes to be mutually orthogonal. Equations (3) and (4) are implicit functions of the axes p_i and should be rewritten in explicit form. First, we introduce the necessary notation. For a given set X of image points, the mean of X is an m-dimensional column vector M(X) whose i-th component is

M(X)_i = \frac{1}{|X|} \sum_{x \in X} x_i    (6)

and its covariance matrix C(X) is the m x m matrix

C(X)_{ij} = \frac{1}{|X|} \sum_{x \in X} (x_i - M(X)_i)(x_j - M(X)_j)    (7)

For two sets X and Y, define the m x m matrix M(X, Y) = (M(X) - M(Y))(M(X) - M(Y))^T. Accordingly, equation (3) can be rewritten as

\sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2 = 2|ML|^2 \, p_i^T C(ML) p_i    (8)

Similarly, equation (4) can be rewritten as

\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = |ML| |CL| \, p_i^T (C(ML) + C(CL) + M(ML, CL)) p_i    (9)

Hence, the problem to be solved is \min p_i^T A p_i, subject to p_i^T B p_i = 1 and p_i^T p_1 = \cdots = p_i^T p_{i-1} = 0, where A = 2|ML|^2 C(ML) and B = |ML| |CL| (C(ML) + C(CL) + M(ML, CL)). It is easy to see that both A and B are symmetric and positive semi-definite. The above problem can be solved using the method of Lagrange multipliers. Below we discuss the procedure for obtaining the optimal axes.

The first projection axis is the eigenvector of the generalized eigenproblem A p_1 = \lambda B p_1 corresponding to the smallest eigenvalue. After that, we compute the remaining axes one by one in the following fashion. Suppose we have already obtained the first (k-1) axes; define

P^{(k-1)} = [p_1, p_2, \ldots, p_{k-1}], \quad Q^{(k-1)} = [P^{(k-1)}]^T B^{-1} P^{(k-1)}    (10)

Then the k-th axis p_k is the eigenvector associated with the smallest eigenvalue of the eigenproblem

(I - B^{-1} P^{(k-1)} [Q^{(k-1)}]^{-1} [P^{(k-1)}]^T) B^{-1} A \, p_k = \lambda p_k    (11)

We adopt the above procedure to determine the l optimal orthogonal projection axes, which preserve the metric structure of the image space for the given relevance feedback information. The new coordinates for the image data points can then be derived accordingly.
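As a rough illustration, here is a NumPy/SciPy sketch of the axis computation under stated assumptions: dense feature matrices, biased (1/N) covariances as in equations (6)-(7), a small ridge added to B so that it is invertible (the derivation assumes B^{-1} exists), and scipy.linalg.eigh / numpy.linalg.eig as eigensolvers. It is a sketch of the procedure above, not the authors' implementation.

```python
# Sketch of COPFC_proj: first axis from the generalized eigenproblem
# A p = lambda B p, remaining axes from the deflated problem in eq. (11).
import numpy as np
from scipy.linalg import eigh

def copfc_proj_axes(X_ml, X_cl, n_axes, ridge=1e-8):
    """Return n_axes constrained orthogonal projection axes as columns."""
    m = X_ml.shape[1]
    C_ml = np.cov(X_ml, rowvar=False, bias=True)      # C(ML), eq. (7)
    C_cl = np.cov(X_cl, rowvar=False, bias=True)      # C(CL)
    d = X_ml.mean(axis=0) - X_cl.mean(axis=0)
    M_ml_cl = np.outer(d, d)                          # M(ML, CL)

    A = 2 * len(X_ml) ** 2 * C_ml
    B = len(X_ml) * len(X_cl) * (C_ml + C_cl + M_ml_cl)
    B += ridge * np.eye(m)                            # keep B invertible

    # First axis: eigenvector of A p = lambda B p with smallest eigenvalue.
    w, V = eigh(A, B)                                 # eigenvalues ascending
    axes = [V[:, 0]]

    B_inv = np.linalg.inv(B)
    for _ in range(1, n_axes):
        P = np.column_stack(axes)                     # P^(k-1), eq. (10)
        Q = P.T @ B_inv @ P                           # Q^(k-1)
        deflate = np.eye(m) - B_inv @ P @ np.linalg.inv(Q) @ P.T
        M = deflate @ B_inv @ A                       # eq. (11), non-symmetric
        w, V = np.linalg.eig(M)
        k = np.argmin(w.real)                         # smallest eigenvalue;
        axes.append(V[:, k].real)                     # spurious near-zero modes
                                                      # may need filtering in practice
    return np.column_stack(axes)                      # embed data with X @ axes
```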
IV. COPFC_fuzzy - SEMI-SUPERVISED CLUSTERING

COPFC_fuzzy is a new search-based semi-supervised clustering algorithm that allows the constraints to steer the clustering process towards an appropriate partition. To this end, we define an objective function that takes into account both the feature-based similarity between data points and the pairwise constraints [14-16]. Let ML be the set of must-link constraints, i.e. (x_i, x_j) \in ML implies that x_i and x_j should be assigned to the same cluster, and CL the set of cannot-link constraints, i.e. (x_i, x_j) \in CL implies that x_i and x_j should be assigned to different clusters. The objective function that COPFC_fuzzy must minimize can then be written as:

J(V, U) = \sum_{k=1}^{C} \sum_{i=1}^{N} u_{ik}^2 \, d^2(x_i, \mu_k) + \lambda \Big( \sum_{(x_i, x_j) \in ML} \sum_{k=1}^{C} \sum_{l=1, l \neq k}^{C} u_{ik} u_{jl} + \sum_{(x_i, x_j) \in CL} \sum_{k=1}^{C} u_{ik} u_{jk} \Big) - \gamma \sum_{k=1}^{C} \Big( \sum_{i=1}^{N} u_{ik} \Big)^2    (12)

The first term in equation (12) is the sum of squared distances to the prototypes, weighted by the constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters. The second component of equation (12) is composed of the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints; it is weighted by \lambda, a constant factor that specifies the relative importance of the supervision. The third component of equation (12), the sum of the squares of the cardinalities of the clusters, controls the competition between clusters and is weighted by \gamma. When the parameters are well chosen, the final partition minimizes the sum of intra-cluster distances while partitioning the dataset into the smallest number of clusters such that the specified constraints are respected as well as possible.
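For concreteness, the following sketch evaluates the cost in equation (12) for a given membership matrix U (N x C) and prototype matrix V (C x m). Only the objective is shown; the membership and prototype update rules that would be derived from it are not given in the paper and are omitted here.

```python
# Direct evaluation of the COPFC_fuzzy objective, eq. (12).
import numpy as np

def copfc_fuzzy_objective(X, U, V, must_links, cannot_links, lam, gamma):
    # Squared Euclidean distances d^2(x_i, mu_k), shape (N, C).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    fcm_term = (U ** 2 * d2).sum()                    # cluster compactness

    # Must-link violation cost: sum over k != l of u_ik * u_jl,
    # computed as (full outer-product sum) minus (diagonal sum).
    ml_cost = sum((U[i][:, None] * U[j][None, :]).sum() - (U[i] * U[j]).sum()
                  for i, j in must_links)
    # Cannot-link violation cost: sum over k of u_ik * u_jk.
    cl_cost = sum((U[i] * U[j]).sum() for i, j in cannot_links)

    card_term = (U.sum(axis=0) ** 2).sum()            # squared cardinalities
    return fcm_term + lam * (ml_cost + cl_cost) - gamma * card_term
```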
V. EXPERIMENTAL EVALUATION

A. Dataset selection and evaluation criterion

We performed experiments on the COREL image database and on two datasets from the UCI repository, as follows:

(1) We selected 1500 images from the COREL image database. They were divided into 15 sufficiently distinct classes of 100 images each. In our experiments, each image was represented by a 37-dimensional vector comprising 3 types of features extracted from the image. We compared the COPFC_proj algorithm against PCA and SSDR. The performance of each technique was evaluated under various amounts of domain knowledge and different reduced dimensionalities. In each scenario, after the dimensionality reduction, K-means was applied to classify the test images.

(2) The Iris and Wine datasets from the UCI repository. The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains three classes, 178 instances, and 13 numerical attributes. The simplicity and low dimensionality of these datasets also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFC_fuzzy, we compared it against the K-means and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure:

CRI = \frac{A - C}{n(n-1)/2 - C}    (13)

where A is the number of instance pairs whose assigned cluster agrees with the actual cluster, n is the number of instances in the dataset (so that n(n-1)/2 is the number of all instance pairs in the dataset), and C is the number of constraints.

For each dataset, we ran each experiment 20 times. To study the effect of the constraints, 100 constraints were generated randomly for the test set. Each point on the learning curves is an average of the results over 20 runs.
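Reading A in equation (13) as the number of instance pairs on which the predicted and actual partitions agree (same cluster in both, or different clusters in both), a minimal sketch of the measure is:

```python
# Sketch of the CRI measure in eq. (13): agreeing pairs minus the number of
# constraints, normalized by the remaining pair count.
from itertools import combinations

def corrected_rand_index(predicted, actual, n_constraints):
    n = len(predicted)
    agree = sum((predicted[i] == predicted[j]) == (actual[i] == actual[j])
                for i, j in combinations(range(n), 2))
    total_pairs = n * (n - 1) // 2
    return (agree - n_constraints) / (total_pairs - n_constraints)
```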
B. The effectiveness of COPFC

In Figure 2, we applied the three dimensionality reduction methods (COPFC_proj, PCA, SSDR) to the original images, reducing the dimensionality to 15 and 20 respectively, and then used K-means to cluster the reduced-dimension data. The curves in Figure 2 show that the clustering performance of PCA is independent of the number of constraints, while the performance of SSDR changes only slightly. For COPFC_proj, clustering performance improves substantially as the number of constraints increases; when only a small number of constraints is available, COPFC_proj performs worst among the three methods. Overall, COPFC_proj outperforms PCA and SSDR at reducing dimensionality.

[Figure 2. Clustering performance with different numbers of constraints. Panels (a) and (b); x-axis: number of constraints (10-100); y-axis: CRI; panel (b) is for dimension = 20.]

Figure 3 shows the clustering performance of the three methods on the Iris and Wine datasets. On all datasets, COPFC_fuzzy obtained the best performance. Among the three methods, the clustering performance of K-means is the worst. Although the clustering performance of PCKmeans is effectively improved, it is still worse than that of COPFC_fuzzy.

[Figure 3. Clustering performance on UCI datasets. (a) Iris dataset, (b) Wine dataset; x-axis: number of constraints; y-axis: CRI.]

VI. CONCLUSION AND FUTURE WORK

We propose a semi-supervised fuzzy clustering method via orthogonal projection to handle high-dimensional sparse data in the image feature space. The method reduces the dimensionality of images via orthogonal projection, and clusters the reduced-dimension data with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right number for the reduced dimensionality from background knowledge, rather than requiring a pre-specified value. Second, we plan to explore alternative methods of employing supervision to guide the unsupervised clustering.

REFERENCES

[1] X. Yang, H. Fu, and H. Zha. "Semi-Supervised Nonlinear Dimensionality Reduction". In Proc. of the 23rd Intl. Conf. on Machine Learning, 2006.
[2] C. Ding and X. He. "K-Means Clustering via Principal Component Analysis". In Proc. of the 21st Intl. Conf. on Machine Learning, 2004.
[3] D. Cai and X. F. He. "Orthogonal Locality Preserving Projection". In Proc. of the 28th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005.
[4] X. F. He and P. Niyogi. "Locality Preserving Projections". Neural Information Processing Systems (NIPS '03), 2003.
[5] H. Cheng, K. Hua, and K. Vu. "Semi-Supervised Dimensionality Reduction in Image Feature Space". Technical Report, University of Central Florida, 2007.
[6] K. Wagstaff and C. Cardie. "Clustering with Instance-Level Constraints". Proc. of the 17th Int'l Conf. on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 2000.
[7] S. Basu. "Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments". Austin: The University of Texas, 2005.
[8] S. Basu, A. Banerjee, and R. J. Mooney. "Semi-supervised Clustering by Seeding". Proc. of the 19th Int'l Conf. on Machine Learning (ICML 2002), pp. 19-26.
[9] K. Wagstaff, C. Cardie, and S. Rogers. "Constrained K-means Clustering with Background Knowledge". Proc. of the 18th Int'l Conf. on Machine Learning. Williamstown: Williams College, Morgan Kaufmann Publishers, 2001, pp. 577-584.
[10] D. Klein, S. D. Kamvar, and C. D. Manning. "From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering". In Proc. of the 19th Int'l Conf. on Machine Learning. University of New South Wales, Sydney: Morgan Kaufmann Publishers, 2002, pp. 307-314.
[11] T. Hertz, N. Shental, and A. Bar-Hillel. "Enhancing Image and Video Retrieval: Learning via Equivalence Constraints". Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. Madison: IEEE Computer Society, 2003, pp. 668-674.
[12] T. Deselaers, D. Keysers, and H. Ney. "Features for Image Retrieval - a Quantitative Comparison". In Pattern Recognition, 26th DAGM Symposium, 2004.
[13] D. Zhang, Z. H. Zhou, and S. Chen. "Semi-Supervised Dimensionality Reduction". In Proc. of the 2007 SIAM Intl. Conf. on Data Mining (SDM '07), 2007.
[14] N. Grira, M. Crucianu, and N. Boujemaa. "Semi-supervised Fuzzy Clustering with Pairwise-Constrained Competitive Agglomeration". In IEEE International Conference on Fuzzy Systems, 2005.
[15] H. Frigui and R. Krishnapuram. "Clustering by Competitive Agglomeration". Pattern Recognition 30 (7), 1997, pp. 1109-1119.
[16] M. Bilenko and R. J. Mooney. "Adaptive Duplicate Detection Using Learnable String Similarity Measures". In International Conference on Knowledge Discovery and Data Mining, Washington, DC, 2003, pp. 39-48.
