A Novel Approach for Determination of Optimal Number of Cluster
Debashis Ganguly
Computer Science and Engineering Department, Heritage Institute of Technology, Anandapur Kolkata – 700107, India DebashisGanguly@gmail.com
Swarnendu Mukherjee, Somnath Naskar
Computer Science and Engineering Department, Heritage Institute of Technology, Anandapur Kolkata – 700107, India mukherjee.swarnendu@gmail.com somnath_naskar_heritage@yahoo.co.in
Partha Mukherjee
Computer Science and Engineering Department, Heritage Institute of Technology, Anandapur Kolkata – 700107, India pmkjr2k@gmail.com
Abstract –
Image clustering and categorization provide a high-level description of image content. In content-based image retrieval (CBIR), the analysis of grayscale images is of great importance because of its wide range of applications, from satellite images to medical images. Analyzing an image with a large number of gray shades is complex, however, so for simplicity we cluster the image into a smaller number of gray levels. The K-means clustering algorithm can be used to cluster an image into segments. The main disadvantage of K-means is that the number of clusters, K, must be supplied as a parameter, and the method itself does not indicate which value of K is optimal. In this paper, we provide a mathematical approach for determining the optimal number of clusters of a clustered grayscale image. We propose a simple index, based on intra-cluster and inter-cluster distance measures, that allows the number of clusters to be determined automatically.
Keywords:
Image, grayscale, clustering, k-means, validity.
I. INTRODUCTION
Image segmentation has long been, and remains, a major research topic in image processing. The reasons are obvious and the applications endless: most computer vision and image analysis problems require a segmentation stage to detect objects or to divide the image into regions that can be considered homogeneous according to a given criterion, such as color, motion or texture. Clustering is the search for distinct groups in the feature space. These groups are expected to have different structures and to be clearly distinguishable. The clustering task separates the data into a number of partitions, which are volumes in the n-dimensional feature space. These partitions define a hard limit between the different groups and depend on the functions used to model the data distribution [1]. In image analysis, an image is treated as a complete data set. Because operating on a huge data set is tedious, we first cluster the image to obtain smaller data sets from the larger one. This technique groups together data points that share the same features, so the given data set yields a collection of smaller data sets, which are treated as segments. The goal of this segmentation is to simplify or change the representation of an image into something more meaningful and easier to analyze, which considerably reduces the overall operational complexity. Many approaches to image segmentation have been proposed over the years [1-12]. Of these methods, clustering is one of the simplest and has been widely used in the segmentation of grey-level images [13-15]. Most algorithms built on the principle of K-means clustering require prior knowledge about the image to be clustered, because the algorithm needs the number of clusters as an input. Moreover, the quality of the clustered image varies with the number of clusters supplied.
This requirement weakens such algorithms because, in real-time image processing, it is not possible to study the images before applying the algorithm. Hence, we need a validity measure that balances this trade-off between the number of clusters and the quality of the clustered image. In this paper, we provide a technique for determining the optimal number of clusters from the supplied set of initial cluster centers in the K-means clustering algorithm. To the best of our knowledge, this work is specifically focused on validating the number of clusters after an image has been clustered with the unsupervised K-means technique. The design of the technique is based on extensive analytical as well as experimental modeling of the data-clustering process.
II. K-MEANS CLUSTERING TECHNIQUE
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure classifies a given data set into a certain number of clusters (say k) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different initial locations lead to different results; a good choice is to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest centroid. When no point is pending, the first step is complete and an early grouping has been made. At this point we recalculate k new centroids as the barycenters of the clusters resulting from the previous step.
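As an illustration, the assignment and recalculation steps described above can be sketched in Python for scalar gray levels. This is a minimal sketch, not the implementation used in this work: the function names, the sample pixel values and the rule that an empty cluster keeps its previous centroid are assumptions made for the example.

```python
def assign_to_nearest(values, centroids):
    """Return, for each value, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values]


def recompute_centroids(values, labels, centroids):
    """Return new centroids as the mean (barycenter) of each cluster.

    Assumption of this sketch: an empty cluster keeps its previous centroid.
    """
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [v for v, lab in zip(values, labels) if lab == j]
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids


gray_levels = [10, 12, 14, 100, 104, 200, 210]  # illustrative pixel values
initial = [10.0, 100.0, 210.0]                  # well-separated starting centroids
labels = assign_to_nearest(gray_levels, initial)
updated = recompute_centroids(gray_levels, labels, initial)
print(labels, updated)
```

With well-separated starting centroids, a single pass already groups the dark, middle and bright gray levels; the updated centroids are the means of those three groups.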
International Conference on Computer and Automation Engineering
978-0-7695-3569-2/09 $25.00 © 2009 IEEE. DOI 10.1109/ICCAE.2009.40
After we have these k new centroids, a new binding has to be made between the same data points and the nearest new centroid. This creates a loop, during which the k centroids change their location step by step until no more changes occur; in other words, the centroids no longer move [22]. The algorithm aims to minimize an objective function, in this case a squared error function:
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - C_j \right\|^2$$
Here $\| x_i^{(j)} - C_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $C_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres. This distance is popularly known as the Euclidean distance. The algorithm is composed of the following steps.
1. Start.
2. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
3. Assign each object to the group that has the closest centroid.
4. When all objects have been assigned, recalculate the positions of the K centroids.
5. Repeat Steps 3 and 4 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
6. Stop.
Although it can be proved that the procedure always terminates, the k-means algorithm does not necessarily find the optimal configuration, i.e., the one corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial, randomly selected cluster centers. It can be run multiple times to reduce this effect [23]. K-means is a simple algorithm that has been adapted to many problem domains. Using the K-means clustering algorithm, we can cluster an image to obtain segments. To run the algorithm, we must provide the value of K, the number of cluster centers, and the algorithm clusters the supplied image accordingly. The algorithm itself, however, offers no way to find the optimal value of K, so some method or heuristic is needed to validate it.
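The steps listed above, together with the remark about running the algorithm several times, can be sketched as follows for scalar gray levels. This is a hedged illustration rather than the code used in this work: the names `kmeans_1d` and `best_of_restarts`, the random choice of initial centroids among the distinct pixel values, the iteration cap and the fixed seed are all assumptions of the sketch.

```python
import random


def kmeans_1d(values, k, rng, max_iter=100):
    """Cluster scalar gray levels into k groups; return (centroids, labels, sse)."""
    # Step 2: pick k distinct data values as initial centroids (an assumption;
    # requires k <= number of distinct values).
    centroids = [float(c) for c in rng.sample(sorted(set(values)), k)]
    labels = []
    for _ in range(max_iter):
        # Step 3: assign each value to the closest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2)
                  for v in values]
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[j])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Squared error objective J for the final partition.
    sse = sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))
    return centroids, labels, sse


def best_of_restarts(values, k, restarts=10, seed=0):
    """Run K-means several times and keep the run with the lowest squared error."""
    rng = random.Random(seed)
    runs = [kmeans_1d(values, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


pixels = [10, 12, 14, 100, 104, 200, 210]  # illustrative pixel values
centroids, labels, sse = best_of_restarts(pixels, k=3)
print(centroids, labels, sse)
```

Keeping the run with the smallest squared error reduces, but does not remove, the dependence on the initial centroids; K itself still has to be chosen by other means.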
III. EXISTING VALIDITY MEASURES
Many criteria have been developed for determining cluster validity [16-21], all of which share a common goal: to find the clustering that yields compact, well-separated clusters. To inform our approach, we concentrated on the techniques described below.
The Davies-Bouldin index [16], for example, is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. The objective is to minimize this measure, since we want to minimize the within-cluster scatter and maximize the between-cluster separation. It is defined as:
$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$$
In this equation, n is the number of clusters, $\sigma_i$ is the average distance of all patterns in cluster i to their cluster center $c_i$, $\sigma_j$ is the average distance of all patterns in cluster j to their cluster center $c_j$, and $d(c_i, c_j)$ is the distance between the cluster centers $c_i$ and $c_j$. Small values of DB correspond to clusters that are compact and whose centers are far away from each other. Consequently, the number of clusters that minimizes DB is taken as the optimal number of clusters.
Bezdek and Pal [18] have given a generalization of Dunn's index [17]. By considering five different measures of the distance between clusters and three different measures of cluster diameter, they obtained fifteen different values of the generalized Dunn's index. Dunn's index is generally represented as:
$$D = \frac{d_{\min}}{d_{\max}}$$
In this relation, $d_{\min}$ denotes the smallest distance between two objects from different clusters, and $d_{\max}$ the largest distance between two objects from the same cluster. Dunn's index is limited to the interval $[0, \infty)$ and should be maximized.
Hubert and Schultz have also proposed a technique for validating cluster numbers [19], introducing the concept of the C-index. The C-index is defined as:
$$C = \frac{S - S_{\min}}{S_{\max} - S_{\min}}$$
In this equation, S is the sum of distances over all pairs of objects from the same cluster, n is the number of those pairs, and $S_{\min}$ is the sum of the n smallest distances when all pairs of objects are considered. Likewise, $S_{\max}$ is the sum of the n largest distances out of all pairs. The C-index is limited to the interval [0, 1] and should be minimized.
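To make the three measures concrete, the sketch below evaluates them on a small set of clustered gray levels. It is an illustration under stated assumptions, not a reference implementation: the distance between two gray levels is taken as their absolute difference, each cluster is given as a list of pixel values, at least one cluster must hold two or more values, and the function names and sample data are hypothetical.

```python
from itertools import combinations


def davies_bouldin(clusters):
    """DB index: mean over clusters of the worst (sigma_i + sigma_j) / d(c_i, c_j)."""
    centers = [sum(c) / len(c) for c in clusters]
    sigmas = [sum(abs(x - m) for x in c) / len(c)
              for c, m in zip(clusters, centers)]
    n = len(clusters)
    worst = [max((sigmas[i] + sigmas[j]) / abs(centers[i] - centers[j])
                 for j in range(n) if j != i)
             for i in range(n)]
    return sum(worst) / n


def dunn(clusters):
    """Dunn index: smallest between-cluster distance over largest within-cluster one."""
    d_min = min(abs(x - y)
                for a, b in combinations(clusters, 2) for x in a for y in b)
    d_max = max(abs(x - y) for c in clusters for x, y in combinations(c, 2))
    return d_min / d_max


def c_index(clusters):
    """C-index: (S - S_min) / (S_max - S_min) over pairwise distances."""
    within = [abs(x - y) for c in clusters for x, y in combinations(c, 2)]
    points = [x for c in clusters for x in c]
    all_pairs = sorted(abs(x - y) for x, y in combinations(points, 2))
    n = len(within)  # number of same-cluster pairs
    s, s_min, s_max = sum(within), sum(all_pairs[:n]), sum(all_pairs[-n:])
    return (s - s_min) / (s_max - s_min)


clusters = [[10, 12, 14], [100, 104], [200, 210]]  # illustrative clustering
print(davies_bouldin(clusters), dunn(clusters), c_index(clusters))
```

For this well-separated example the Davies-Bouldin index is small, Dunn's index is large and the C-index is 0, in line with the direction in which each measure is to be optimized.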
IV. OUR APPROACH
All the measures considered above are based on a global concept. We have observed that using a global concept to find the optimal cluster number yields an improper distribution space. The main objective of determining the optimal cluster number is to obtain more compact clusters. By compact clusters we mean clusters whose segments are tightly packed and whose centers are well separated; a packed segment is a set of data points that lie close to their cluster center. To obtain such fine clusters the global concept does not work well, because the amount of participation of data points in each particular cluster center has to be determined first. A local concept that emphasizes the number of pixels in each cluster, on the other hand, gives satisfactory results for any type of image. Hence, we propose a new validity measure, the DSS index, which is based on the local concept. Our method defines two quantities, Intra and Inter. In their definitions, X is a pixel belonging to the cluster $C_j$, $Z_j$ is the cluster center of $C_j$, and $N_j$ is the number of pixels in the jth cluster. Since the k-means method aims to minimize the sum of squared distances from all points to their cluster centers, it should result in compact clusters. We can therefore use the distances of the points from their cluster centre to determine whether the clusters are compact. For this purpose, we use the Intra-cluster distance measure, which is simply the sum, over the clusters, of the average distance of the points of each cluster from its centre. We want to minimize this measure. We can also measure the inter-cluster distance, i.e., the distance between clusters, which we want to be as large as possible. We calculate it as the distance between cluster centers and take the minimum of these values. This is termed Inter. We take only the minimum because we want the smallest of these distances to be maximized; the other values are automatically larger. After obtaining these two measures, we combine them into a single validity measure subject to two constraints: minimizing Intra and maximizing Inter. We therefore take the ratio of the two, which we call the DSS index:
$$DSS = \frac{Intra}{Inter}$$
As already stated, we use the concept of a local minimum rather than a global maximum. A local minimum in the values of the validity measure is defined to occur at k if
$$DSS(k-1) > DSS(k) < DSS(k+1).$$
Once all the local minima have been found, we take the minimum value among them (we call this the Minimin); the k at which it occurs is the optimal cluster number for the image under consideration. Mathematically,
$$K_{optimal} = \min\,[\text{Local minima}(DSS)]$$
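The selection rule above can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements of the method: the distance is taken as the squared difference of gray levels (the defining equations are not reproduced in the text, so the squared form is a guess consistent with the k-means objective), clusters are plain lists of pixel values, and the function names and the DSS values in `example` are made up for illustration.

```python
def dist(a, b):
    # Assumption: squared difference of gray levels.
    return (a - b) ** 2


def intra(clusters):
    """Sum over clusters of the average distance of members to their center."""
    total = 0.0
    for members in clusters:
        center = sum(members) / len(members)
        total += sum(dist(x, center) for x in members) / len(members)
    return total


def inter(clusters):
    """Minimum distance between any two cluster centers."""
    centers = [sum(c) / len(c) for c in clusters]
    return min(dist(centers[i], centers[j])
               for i in range(len(centers))
               for j in range(i + 1, len(centers)))


def dss(clusters):
    """DSS index: ratio of Intra to Inter."""
    return intra(clusters) / inter(clusters)


def local_minima(dss_by_k):
    """Return the k values where DSS(k - 1) > DSS(k) < DSS(k + 1)."""
    return [k for k in sorted(dss_by_k)
            if k - 1 in dss_by_k and k + 1 in dss_by_k
            and dss_by_k[k - 1] > dss_by_k[k] < dss_by_k[k + 1]]


def optimal_k(dss_by_k):
    """Return the k whose DSS is smallest among the local minima (the Minimin)."""
    candidates = local_minima(dss_by_k)
    return min(candidates, key=lambda k: dss_by_k[k]) if candidates else None


# Made-up DSS values for k = 2..8, for illustration only.
example = {2: 0.90, 3: 0.40, 4: 0.55, 5: 0.30, 6: 0.45, 7: 0.35, 8: 0.50}
print(local_minima(example), optimal_k(example))
```

In practice `dss_by_k` would be filled by clustering the image once for each candidate k and calling `dss` on the resulting clusters.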
Having defined the DSS index, we tested it on several sets of images. The results are shown in the next section.
V. RESULT AND DISCUSSION
As stated in Section III, all the existing validity measures are based on a global concept. However, K-means clustering is best suited to non-textured images with non-overlapping clusters, and it therefore produces discrete, crude clusters. When dealing with a validity index, we should thus emphasize the local behavior and distribution of features rather than their global impact and contribution to image quality. If we use a generalized validity measure with a global feature-set contribution, keeping everything else the same as in our algorithm, the output images at the optimal number of clusters are those in Fig. 1.2, 2.2 and 3.2. Fig. 1.1, 2.1 and 3.1 are the original input images, and Fig. 1.3, 2.3 and 3.3 are the outputs of our algorithm after clustering with the optimal number of clusters.
Fig. 1.1: Lena.jpg (original input). Fig. 1.2: LenaOldOp.jpg (output of existing concept at K_optimal = 10). Fig. 1.3: LenaNewOP.jpg (output of our method at K_optimal = 20).
Fig. 2.1: Brain.jpg (original input). Fig. 2.2: BrainOldOp.jpg (output of existing concept at K_optimal = 5). Fig. 2.3: BrainNewOP.jpg (output of our method at K_optimal = 16).
Fig. 3.1: Cloud.jpg (original input). Fig. 3.2: CloudOldOp.jpg (output of existing concept at K_optimal = 8). Fig. 3.3: CloudNewOP.jpg (output of our method at K_optimal = 23).
If we use the local concept of feature distribution, we arrive at the within-cluster distance, i.e., the calculation of Intra described in Section IV. Our formula differs from the generalized formula based on the global feature-set distribution in how the intra-cluster distances of the participating pixels are summed with respect to a particular cluster centroid. Our algorithm takes into account the number of pixels participating in each cluster, which the global concept ignores, even though it varies considerably from cluster to cluster in the crude clustered image produced by the K-means method. Taking this cluster feature into account is what makes our algorithm perform better than the existing ones, yielding better-quality images both visually and under expert analysis.
It should also be noted that, in our algorithm, the global minimum, i.e., the Minimin, occurs at the first local minimum encountered, which was not the case for the existing algorithm. This property lets our algorithm act as an automatic selector of the optimal number of clusters of an image: it can halt at the point where it first encounters a local minimum and declare that point the optimal cluster number.
The algorithm was tested on a wide variety of images, including natural images such as Lena.jpg, a medical image such as Brain.jpg, an IR picture of fluid motion such as Cloud.jpg, and many more. In all cases the results were verified and the properties claimed above were validated. Our approach thus proves better, yielding more satisfactory clustered images than the existing algorithms.
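The halting behavior described above can be sketched as a search that evaluates the index for increasing k and stops at the first local minimum. This is only an illustration: `dss_of` is a hypothetical callable that returns the DSS value of the image clustered with k clusters, and the defaults `k_start=2` and `k_max=30` as well as the table of values are assumptions of the sketch.

```python
def first_local_minimum(dss_of, k_start=2, k_max=30):
    """Scan k upward and return the first k with DSS(k-1) > DSS(k) < DSS(k+1).

    dss_of is a hypothetical callable mapping k to the DSS value obtained
    after clustering with k clusters. Returns None if no local minimum is
    found up to k_max.
    """
    previous = dss_of(k_start)
    current = dss_of(k_start + 1)
    for k in range(k_start + 1, k_max):
        following = dss_of(k + 1)
        if previous > current < following:
            return k  # halt here: larger values of k are never clustered
        previous, current = current, following
    return None


# Made-up DSS values, for illustration only.
table = {2: 0.9, 3: 0.6, 4: 0.4, 5: 0.5, 6: 0.2, 7: 0.3}
evaluated = []


def lookup(k):
    evaluated.append(k)  # record which k values were actually evaluated
    return table[k]


k_found = first_local_minimum(lookup)
print(k_found, evaluated)
```

Because the scan stops as soon as the condition holds, the image is clustered only for the values of k up to one past the selected one; the later entries of the table are never evaluated.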
VI. CONCLUSION
After analyzing all the results, we can say that the quality of the output images improves with the algorithm we propose. Whenever images are clustered, there is a possibility of information loss that depends on the clustering procedure followed. This possibility becomes much greater when the clustering is poor or the number of cluster centers is not optimal. Grayscale images are highly textured and carry a great deal of information in real-life applications, so proper clustering is required to process them. We have observed that the validity measure criteria proposed by previous researchers are not the ultimate solution, because the results obtained with them were not fully acceptable. Careful inspection of the output images in Section V shows definite information loss in Fig. 1.2, 2.2 and 3.2. The optimal cluster number given by the existing method was approximately within 10. This number may not be optimal, because the output images for some higher cluster numbers were also very similar to the input image, i.e., they were better clustered. The technique we propose for validating a cluster number as optimal, on the other hand, gave satisfactory results. The optimal cluster numbers we obtained were always above 10 and around 20, and the corresponding output images were almost equivalent to the original image. Our method has also shown that merely increasing the number of clusters does not increase image quality, since the first local minimum coincides with the global minimum, i.e., the Minimin over the local minima, and gives the optimal number of clusters for the image.
We hope that the proposed method for obtaining the optimal number of clusters for grayscale images will offer a route for researchers and developers working on problems of this type in image processing.
REFERENCES
[1] N.R. Pal and S.K. Pal, A review on image segmentation techniques, Pattern Recognition, vol. 26, pp. 1277-1294, 1993.
[2] K.S. Fu and J.K. Mui, A survey on image segmentation, Pattern Recognition, vol. 13, pp. 3-16, 1981.
[3] R.M. Haralick and L.G. Shapiro, Survey image segmentation techniques, Comput. Vision Graphics Image Process., vol. 29, pp. 100-132, 1985.
[4] A. Rosenfeld and L.S. Davis, Image segmentation and image models, Proc. IEEE, vol. 67, pp. 764-772, 1979.
[5] P.K. Sahoo, S. Soltani, A.K.C. Wong and Y.C. Chen, A survey of thresholding techniques, Comput. Vision Graphics Image Process., vol. 41, pp. 233-260, 1988.
[6] A. Perez and R.C. Gonzalez, An iterative thresholding algorithm for image segmentation, IEEE Trans. Pattern Anal. Machine Intell., vol. 9, pp. 742-751, 1987.
[7] T. Peli and D. Malah, A study of edge detection algorithms, Comput. Graphics Image Process., vol. 20, pp. 1-21, 1982.
[8] R. Ohlander, K. Price and D.R. Reddy, Picture segmentation using a recursive region splitting method, Comput. Graphics Image Process., vol. 8, pp. 313-333, 1978.
[9] S.L. Horowitz and T. Pavlidis, Picture segmentation by directed split and merge procedure, Proc. 2nd Int. Joint Conf. Pattern Recognition, pp. 424-433, 1974.
[10] J. Liu and Y. Yang, Multiresolution color image segmentation, IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 689-700, 1994.
[11] M. Amadasun and R.A. King, Low level segmentation of multispectral images via agglomerative clustering of uniform neighbours, Pattern Recognition, vol. 21, pp. 261-268, 1988.
[12] B. Bhanu and B.A. Parvin, Segmentation of natural scene, Pattern Recognition, vol. 20, pp. 487-496, 1987.
[13] G.B. Coleman and H.C. Andrews, Image segmentation by clustering, Proc. IEEE, vol. 67, pp. 773-785, 1979.
[14] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, New Jersey: Prentice Hall, 1988.
[15] R. Nevatia, Image segmentation, In T.Y. Young and K.S. Fu (Eds.), Handbook of Pattern Recognition and Image Processing, Orlando: Academic Press, 1986.
[16] D.L. Davies and D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Machine Intell., vol. 1, pp. 224-227, 1979.
[17] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybern., vol. 3, pp. 32-57, 1973.
[18] J.C. Bezdek and N.R. Pal, Some new indexes of cluster validity, IEEE Trans. Syst. Man. Cybern., vol. 28, pp. 301-315, 1998.
[19] G.W. Milligan, Clustering validation: Results and implications for applied analyses, In P. Arabie, L.J. Hubert and G. De Soete (Eds.), Clustering and Classification, Singapore: World Scientific, pp. 341-375, 1996.
[20] G.W. Milligan and M.C. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika, vol. 50, pp. 159-179, 1985.
[21] M.C. Cooper and G.W. Milligan, The effect of measurement error on determining the number of clusters in cluster analysis, In W. Gaul and M. Schader (Eds.), Data, Expert Knowledge and Decisions, Berlin: Springer-Verlag, pp. 319-328, 1988.
[22] A tutorial on K-Means Clustering, http://home.dei.polimi.it/matteucc/Clustering as visited on 20/08/2008.
[23] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, vol. 1, pp. 281-297, 1967.