A variable-length genetic algorithm for clustering and classification

Pattern Recognition Letters 16 (1995) 789-800

R. Srikanth a,*, R. George a, N. Warsi a
a Department of Computer Science, Clark Atlanta University, Atlanta, GA 30314, USA

D. Prabhu b, F.E. Petry b, B.P. Buckles b
b Department of Computer Science, Tulane University, New Orleans, LA 70118, USA

Received 20 November 1994; revised 14 March 1995

Abstract

Pattern clustering and classification can be viewed as a search for, and a labeling of, a set of inherent clusters in a given data set. This approach can be divided broadly into two types, namely supervised and unsupervised clustering. Motivated by human perception and Kohonen's method, we present a novel method of supervised clustering and classification using genetic algorithms. Clusters in the pattern space can be approximated by ellipses or sets of ellipses in two dimensions, and by ellipsoids in general, so the search for clusters can be approximated as the search for ellipsoids or sets of ellipsoids. By assigning fuzzy membership values to points in the pattern space, a fuzzy ellipsoid is obtained. The thresholding process which follows can be thought of as warping the contour of the ellipse to include and exclude certain points in pattern space, in effect producing an arbitrarily shaped cluster. Here we examine the use of genetic algorithms in generating fuzzy ellipsoids for learning the separation of the classes. Our evaluation function drives the genetic search towards the smallest ellipsoid or set of ellipsoids which maximizes the number of correctly classified examples and minimizes the number of misclassified examples.

Keywords: Genetic algorithms; Pattern recognition; Clustering; Classification; Fuzzy; Neural nets

1. Introduction

In this paper we discuss the use of genetic algorithms (GAs) in clustering and classification.
Clustering and classification entail clustering patterns in object space and automatically developing descriptions of the classes. Each class or cluster has certain constraints associated with it, which must be fulfilled for an element to be classified as belonging to that class. In other words, elements have membership values of either 1 or 0. Such sets (clusters) are called crisp sets. Not all situations, however, lend themselves to crisp classification, and in these situations fuzzy sets can be used. In fuzzy sets (Zadeh, 1965; Bezdek and Pal, 1992), elements are allowed membership values ranging from 0 to 1. A membership function maps the elements of a fuzzy set onto the set of real numbers in the closed interval [0, 1]. A membership value is assigned to an element based on the deviation of its properties from a prototype of the set.

* This work was partially supported by the Army Center for Excellence in Information Sciences, Clark Atlanta University, under ARO Grant DAAL03-G-0377.
* Corresponding author.

0167-8655/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0167-8655(95)00043-7

Fig. 1. Clustering.
Fig. 2. Membership value as a measure of distance.

Traditionally, hyperplanes are used to section off portions of the object space, thereby providing a method for clustering and classification. In this paper, however, we generate sets of ellipsoids which enclose portions of the object space and thereby provide for clustering and classification. We take the position that the use of ellipsoids, as opposed to hyperplanes, eliminates the need to classify irrelevant parts of the object space. In Fig. 1, the hyperplane method classifies the point T as belonging to cluster 2, since T lies to the right of the hyperplane. Our fuzzy ellipsoid method, however, classifies T as belonging to cluster 1: by virtue of its position with respect to the two clusters and the ellipses, the point claims a higher membership value in cluster 1 and is therefore assigned to it.

Enclosing the patterns of each class in an ellipsoid or set of ellipsoids is a logical way to cluster patterns, but doing so is non-trivial. The method proposed here attempts to generate ellipsoids which in turn produce clusters of patterns of the same class. In this approach, each pattern not in the training set is assigned a fuzzy membership value with respect to each cluster, depending on its distance from the ellipse that encloses the cluster. Fig. 2 shows the contour plot of the steadily decreasing membership values as a measure of distance from the innermost ellipse, which represents the cluster. The fuzzy values are then defuzzified by assigning the pattern to the cluster which claims the largest membership value for it. In contrast to an approach that starts from a prototype for each ellipsoid and grows it to optimal size, we view the clustering problem as a search for an optimal set of ellipsoids. Most search techniques use gradient descent, as in the case of Srikanth et al. (1993), who used elastic nets (a variant of neural networks) to find appropriate clusters. That technique, however, is very slow and, like most gradient-based techniques, is prone to getting stuck in local minima. We investigate the GA approach, and in particular a variant of the traditional genetic algorithm (a variable-length genotype genetic algorithm), for this search. GAs are less prone to getting stuck in local minima; through the power of their associated operators they are able to sample a significant portion of the solution space.
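The membership assignment and defuzzification described above can be sketched as follows. The decay function used outside the ellipse (exp(1 - d)) and all function and parameter names are illustrative assumptions for this sketch, not the paper's exact formulation:

```python
import math

def ellipse_membership(point, center, axes, theta):
    """Fuzzy membership of a 2-D point with respect to one ellipse.

    Points on or inside the ellipse receive membership 1; outside,
    membership decays with the ellipse-normalized squared distance d.
    The decay exp(1 - d) is an illustrative choice, not the paper's.
    """
    dx, dy = point[0] - center[0], point[1] - center[1]
    # Rotate the offset into the ellipse's own coordinate frame.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    d = (u / axes[0]) ** 2 + (v / axes[1]) ** 2  # d <= 1 inside the ellipse
    return 1.0 if d <= 1.0 else math.exp(1.0 - d)

def classify(point, ellipses_by_class):
    """Defuzzify: assign the point to the class claiming the largest membership."""
    return max(ellipses_by_class,
               key=lambda c: max(ellipse_membership(point, *e)
                                 for e in ellipses_by_class[c]))

# Two clusters, each enclosed by one ellipse (center, axis lengths, rotation).
clusters = {1: [((0.0, 0.0), (2.0, 1.0), 0.0)],
            2: [((10.0, 0.0), (1.0, 1.0), 0.0)]}
print(classify((3.0, 0.0), clusters))  # → 1 (far closer to cluster 1's ellipse)
```

Because the membership surface decreases smoothly with distance from each ellipse, a point like T in Fig. 1 is resolved by relative proximity to the enclosing ellipses rather than by which side of a hyperplane it falls on.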
In the following sections we discuss our approach to clustering using GAs, the details of the special GA model used and the corresponding genetic operators, and the issues of representation and fitness computation. The methods are tested on the doughnut problem (see Fig. 3) and the two-class problem, a traditional benchmark in the pattern recognition literature.

2. Genetic algorithms

Genetic algorithms (GAs) are a class of randomized search procedures capable of adaptive and robust search over a wide range of search-space topologies. They are modeled after the adaptive emergence of biological species from evolutionary mechanisms, and were introduced by Holland (1975) in his seminal work Adaptation in Natural and Artificial Systems. GAs have been successfully applied in diverse fields such as image analysis (Ankenbrandt et al., 1990; Bala and Wechsler, 1993), scheduling (Syswerda and Palmucci, 1991), and engineering design (Bramlette and Bouchard, 1991), among others. Detailed introductions to GAs and their applications can be found in (Goldberg, 1989; Davis, 1991; Buckles and Petry, 1992; Pal et al., 1994; Buckles et al., 1994).

Fig. 3. Illustration of dataset for the Toroid problem.

Genetic algorithms are distinguished from most other search techniques by the following characteristics:
• They search using a set of solutions during each generation rather than a single solution.
• The search is usually done in a string space, using some functional mapping from the solution space to the string space.
• The search requires computation of the fitness of each solution as an estimate of its goodness in the problem domain.
• Their search in the string space represents a much larger parallel search in the space of schemata, the building blocks of encoded solutions.
• The memory of the search done so far is represented solely by the set of solutions available in a generation.
• They are randomized algorithms, since their search mechanisms use probabilistic operators.
• While progressing from one generation to the next, they find a near-optimal balance between knowledge acquisition and exploitation by manipulating schemata.

2.1. Representation

An ellipsoid is completely defined by its geometric parameters, namely its origin, the lengths of its axes, and its orientation with respect to the axes of reference. Thus the search for the optimum ellipsoid or set of ellipsoids can be carried out by a GA searching in this parameter space. For the doughnut problem we have a two-feature space, and the ellipsoid in question becomes a two-dimensional ellipse.

The following section introduces the variable-length genotype GA that was devised and used to investigate our approach.

2.2. Variable-length genotype GAs

A principal drawback of using geometric shapes for clustering is that the number of geometric shapes (i.e., the number of clusters in the data set) is assumed to be known a priori. The research reported here successfully addresses that drawback.

2.2.1. Background

There are some precedents for using variable-length genotypes. Smith (1980) used them to encode sets of fixed-length rules in his "Pittsburgh"-style LS-1 classifier system, and the model discussed here is closest to his approach. Another important approach to variable-length genotypes takes the form of tree-structured genotypes, as exemplified by Cramer (1985), or the LISP-like S-expressions used by Koza (1992); this model is currently referred to as the genetic programming (GP) paradigm. In terms of specifying the structure of the solution, the approach described in this paper can be viewed as a compromise between the traditional GA and GP models. Messy GAs (Goldberg et al., 1989) also make use of genotypes of variable length.
However, interpretation of such genotypes is based on the underlying fixed-length representation of the solution, just as in the traditional GA. Janikow's (1991) genetics-based inductive learning system makes use of variable-length genotypes for representing concepts, together with domain-specific genetic operators. Harp and Samad (1992) used a variable number of fixed-length blocks in constructing the layers of a neural network; their innovative scheme allows the specification of interconnections between different blocks. Finally, Harvey's (1992) SAGA model, geared towards applications in Artificial Life, makes use of genotypes of arbitrarily varying lengths. In that model, addressing between building blocks is further extended by the use of template matching; to achieve meaningful recombination of parents, syntactic similarity measures are used in choosing the crossover points.

2.2.2. Basic model

In the current application, a single ellipsoid constitutes a completely specified subsolution to the problem. Any set or sequence of such subsolutions constitutes a feasible solution, and the fitness of such a set can be determined by how well it solves the given problem. The basic idea of this model of GA is as follows. First, a fixed-length string representation, analogous to the traditional GA, is chosen for completely specifying a single subsolution. Next, an individual genotype is allowed to grow or shrink dynamically through the action of domain-independent genetic operators, in such a way that the genotype always represents a set of an arbitrary number of completely specified subsolutions. In other words, the length of a genotype is always an integer multiple of the length needed to represent a single subsolution.
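A minimal sketch of this representation is shown below. The 5-bit subsolution width follows the paper's running examples; the helper-function names are assumptions made for illustration:

```python
import random

SUB_BITS = 5  # fixed length of one subsolution, as in the paper's examples

def random_genotype(n_subsolutions):
    """A genotype is a bit string whose length is an integer multiple of SUB_BITS."""
    return [random.randint(0, 1) for _ in range(n_subsolutions * SUB_BITS)]

def subsolutions(genotype):
    """Split a genotype into its completely specified subsolutions."""
    assert len(genotype) % SUB_BITS == 0, "length must be a multiple of SUB_BITS"
    return [genotype[i:i + SUB_BITS] for i in range(0, len(genotype), SUB_BITS)]

g = random_genotype(3)                # an individual with three subsolutions
print(len(g), len(subsolutions(g)))   # → 15 3
```

The invariant enforced here (genotype length is always a multiple of SUB_BITS) is what the genetic operators of the next section must preserve.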
Assuming the availability of a fixed-length representation for any completely specified subsolution in the problem domain, and a set of appropriate domain-independent genetic operators, this version of the GA can follow the same algorithmic steps as the traditional GA. We can start with a random initial population in which each individual is a singleton set consisting of one subsolution. The number of subsolutions and their allele values emerge as a result of the genetic search and the fitness function chosen for the problem.

2.2.3. Genetic operators

A major issue in any variable-length GA formulation is the choice of genetic operators that always generate meaningful offspring. In this approach we use four such operators: crossover, insertion, deletion, and mutation. For all the examples given, let us assume the fixed length of a single subsolution to be 5 bits.

2.2.4. Crossover

Since insertion and deletion effectively increase or decrease the length of a chromosome while restricting it to an integer multiple of the single-subsolution length, the crossover operator chosen must be compatible with the variable lengths of chromosomes and at the same time maintain the integrity of the offspring created. The crossover operator defined below is analogous to traditional single-point crossover. Consider two chromosomes of lengths (in bits) l1 and l2, where both l1 and l2 are integer multiples of the length of a single subsolution. Let r > 0 be an integer drawn from a uniform random distribution. If c1 = (r mod l1) and c2 = (r mod l2) are taken to be the crossover points for these two individuals, the crossover results in two offspring the sum of whose lengths remains (l1 + l2).

Example. Let the two parents P1 and P2 be made up of 5 and 2 subsolutions (i.e., 25 and 10 bits total length), respectively:

P1 ⇒ 10001 01101 00111 111|11 11100
P2 ⇒ 00110 001|00

Let 68 be the generated random integer. Now c1 = 68 mod 25 = 18 and c2 = 68 mod 10 = 8; the | in P1 and P2 marks these points. The resulting offspring are

O1 ⇒ 10001 01101 00111 11100
O2 ⇒ 00110 00111 11100

Note that each child is again guaranteed to be a set of completely specified subsolutions. We term this operator modulo crossover, for obvious reasons.

Theoretically, coupled with an initial population consisting of genotypes of varying lengths, this crossover operator alone should be sufficient to generate genotypes of the length dictated by the requirements of the problem. For pragmatic reasons, however, we need additional operators for faster recombination and juxtaposition at the level of subsolutions. The following two operators insert and delete sequences of subsolutions.

2.2.5. Insertion

The insertion operator adds new subsolutions to an individual. To apply this operator to an individual, first randomly choose an insert-location on the individual, restricted to fall between subsolutions. Next, use the selection mechanism to choose a mate (or donor). Then randomly choose a cut-location, restricted to fall between subsolutions, followed by a randomly chosen cut-sequence-length in the feasible range. Finally, insert the cut sequence at the insert-location of the individual. This operator effectively increases the length of the individual by a multiple of the single-subsolution length.

Example. Let P1 (an organism with 5 subsolutions (1, 2, 3, 4, 5), where each subsolution is represented by 5 bits, making the total length 5 × 5 = 25 bits) be the individual and M1 (an organism with 4 subsolutions (1, 2, 3, 4) and total length 4 × 5 = 20 bits) be the mate.
P1 ⇒ 10001 01101 00111 | 11111 11100
M1 ⇒ 00110 (00000 10101 00110)

Suppose the subsolution sequence of length 3 starting from the second subsolution (i.e., (2, 3, 4) in M1) is chosen from the mate (delimited by the (···) pair), and the point just after subsolution 3 in P1 is chosen as the insert-location (marked by |). The insert operation results in the following new chromosome for P1 (new length 8 × 5 = 40), where the (···) pair marks the newly inserted portion:

P1 ⇒ 10001 01101 00111 (00000 10101 00110) 11111 11100

2.2.6. Deletion

The probabilistic application of this operator has the opposite effect of the insert operation. Randomly chosen delete-location and delete-sequence-length values, analogous to the above, are used for deleting a sequence of subsolutions from the given individual. This results in a shorter chromosome. Consider the individual P1:

P1 ⇒ 10001 01101 (00111 11111) 11100

The result of deleting the subsolution sequence of length 2 starting from the third subsolution (i.e., (3, 4) in P1) is:

P1 ⇒ 10001 01101 11100

2.2.7. Mutation

This operator has exactly the same function and form as the traditional mutation operator.

3. Fitness function

The primary requirement of a fitness function in GAs is to estimate the goodness of a given encoded solution in the problem domain. For the current application, the computed fitness value must reflect the extent to which the generated ellipsoids correctly classify the given training samples.

Consider a k-dimensional feature space with the training samples belonging to n distinct classes. Let m_i, i = 1, ..., n denote the number of training samples which belong to the ith class. Let P_ij, i = 1, ..., n; j = 1, ..., m_i denote the jth training sample of the ith class. Since each ellipsoid can represent