A study of standardization of variables in cluster analysis

Journal of Classification 5:181-204 (1988)

Glenn W. Milligan, The Ohio State University
Martha C. Cooper, The Ohio State University

Abstract: A methodological problem in applied clustering involves the decision of whether or not to standardize the input variables prior to the computation of a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision process. The present simulation study examined the standardization problem. A variety of data structures were generated which varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments. These involved error free data, error perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. It was found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.

Keywords: Standard scores; Cluster analysis.

Authors' Addresses: Glenn W. Milligan, Faculty of Management Sciences, 301 Hagerty Hall, The Ohio State University, Columbus, Ohio 43210, USA. Martha C. Cooper, Faculty of Marketing, 421 Hagerty Hall, The Ohio State University, Columbus, Ohio 43210, USA.

1. Introduction

Although the issue of standardization of variables in a cluster analytic study is important to an applied researcher, little is known about the impact of this procedural step. Standardization of variables would seem to be necessary in those cases where the dissimilarity measure, such as Euclidean distance, is sensitive to differences in the magnitudes or scales of the input variables. Sneath and Sokal (1973) indicate that standardization is accomplished with the use of translation and expansion. The purpose is to equalize the size or magnitude and the variability of the input variables. Similarly, Anderberg (1973) states that the purpose is to adjust the magnitude of the scores and the relative weighting of the variables. Romesburg (1984) makes the same arguments, particularly with respect to the idea of equal weighting of variables.

The concept of achieving equal weight for each of the input variables has received much support. Cormack (1971) indicates that equal weighting is obtained by using an adjustment factor which is inversely proportional to the variability of the measure.

Several authors have argued that the best strategy is not to standardize across all elements on the variable. Rather, standardization should occur within clusters on each variable. This position has been taken by Cormack (1971), Everitt (1980), Hartigan (1975), Fleiss and Zubin (1969), and Overall and Klett (1972). A number of authors have provided short derivations or graphical illustrations which show that the total variability is not only a function of the within-cluster variances, but also of the distance between cluster means (Bayne, Beauchamp, Begovich, and Kane 1980; Fleiss and Zubin 1969; Hartigan 1975; Lorr 1983; and Späth 1980). The inclusion of the latter distance confounds the standardization process and serves to reduce the apparent separation between clusters. However, it is impossible to directly standardize within clusters in an applied analysis.
To find the clusters, one must know the assignments of the elements to the clusters beforehand in order to perform the standardization. That is, one must know what the clusters are before finding the clusters. Overall and Klett (1972) proposed iterating the cluster analysis in an attempt to overcome this circularity problem. One first obtains clusters based on overall estimates. Next, these initial clusters are used to help determine within-group variances for standardization in a second cluster analysis. The process can continue until no changes in cluster membership occur.

Other logical difficulties face the uncritical use of standardization of variables. First, consider the possibility that the clusters which exist in the data are embedded in the unstandardized variable space. This situation seems at least as likely to occur as the existence of the clusters in the rescaled space. Sawrey, Keller, and Conger (1960) were early advocates of the use of the unscaled input data for direct clustering. Second, although the arguments in favor of equal weighting of the input variables may seem appealing, there is no compelling reason to practice democracy while performing all cluster analyses. In some cases, the differential weighting of the variables before standardization may represent information that defines the clusters. In a study not addressing the issue of standardization, the concept of differential weighting of variables was approached by De Soete, DeSarbo, and Carroll (1985). The authors reported substantial improvements in cluster recovery with the use of a differential weighting algorithm for variables. A different strategy based on applying weights to the variables after standardization was introduced by Hohenegger (1986). Support for differential weighting also can be found in the early clustering literature.
Both Hall (1965) and Williams, Dale, and MacNaughton-Smith (1964) advocated scaling by a measure of the importance of the variable. Hence, there has not been universal agreement that equal weighting is necessary or optimal.

1.1 Forms of Standardization

Numerous approaches to standardization of variables exist. The present study considers only the case involving numerical variables. Categorical variables, or mixtures of categorical and numerical variables, are not considered. Researchers from social science backgrounds usually assume that a standardized variable has been transformed to have zero mean and unit variance, as found with the typical z-score formula. However, other proposals for the standardization or scaling of variables can be found in the classification literature, and these are reviewed in this section. For convenience, the term standardization will be used in a generic sense in this article.

The first form of standardization is the z-score formula used for transforming normal variates to standard score form:

    Z1 = (X - X̄) / s ,    (1)

where X is the original data value, and X̄ and s are the sample mean and standard deviation, respectively. The transformed variable will have a mean of 0.0 and a variance of 1.00. The transformation has been proposed by numerous authors including Dubes and Jain (1980), Everitt (1980), Lorr (1983), Romesburg (1984), SAS (SAS User's Guide: Statistics, 1985), Sokal (1961), Späth (1980), and Williams, Lambert, and Lance (1966). Späth warns that Z1 may not perform properly if there are substantial differences among the within-cluster standard deviations.

It is important to note that standardization Z1 must be applied in a global manner (across all items on the variable) and not within individual clusters. To understand this restriction, consider the case where three well-separated clusters exist in the data.
Further, assume that a sample point is located at each of the three cluster centroids. Standardization within clusters would lead to a vector of scores for each of the three centroid points which contains zeros for all entries. Almost any clustering procedure using the standardized data would place the three centroid points in the same cluster. More generally, the same result occurs if the three points are located at the same standardized coordinate vector relative to each of the three cluster centroids. For example, there will be three different locations in a two-variable space that would have coordinate values (1.0, -1.0). Data points found near these three different locations would have near-zero interpoint distances and likely would be clustered together. Such results would lead to an erroneous and very misleading clustering solution. Thus, Z1 must not be used when standardization is computed within clusters.

The next standardization is similar to Z1 and is computed as:

    Z2 = X / s .    (2)

Formula Z2 will result in a transformed variable with a variance of 1.00 and a transformed mean equal to X̄ / s. However, since the scores have not been centered by subtracting the mean, the information as to the comparative location between scores remains. As such, Z2 will not suffer from the problem of the loss of information about the cluster centroids, as is the case for Z1 as noted earlier. Several authors have proposed the use of Z2, including Anderberg (1973), Cormack (1971), Fleiss and Zubin (1969), Hartigan (1975), and Overall and Klett (1972). The reader should note that Z1 and Z2 are linear functions of each other. As such, Euclidean distances computed using the two formulas lead to identical dissimilarity values when global means and variances are used.

The third procedure involves standardization with the use of the maximum score on the variable:

    Z3 = X / Max(X) .    (3)

If all values are greater than or equal to zero, then the transformed variable acts as a proportional measure with all scores falling between 0.0 and 1.0. (If some X's are negative, a sufficiently large positive constant can be added to all values to obtain the proportionality property.) The transformed mean and standard deviation are X̄ / Max(X) and s / Max(X), respectively. Although the upper limit of 1.00 is obtained in each data set, the lowest observed value may be greater than 0.0. Note that Z3 will not result in constant variances across transformed variables. In fact, Z3 leaves the relative variability (the range divided by the maximum value) unchanged before and after transformation (Sneath and Sokal 1973). Further, Z3 is susceptible to the presence of outliers. A single large observation on a variable can have the effect of compressing the remaining values near 0.0. It would seem that Z3 is meaningful only when the variable constitutes a ratio scale. Standardization Z3 has been proposed by Cain and Harrison (1958), Hall (1969), and Romesburg (1984).

The fourth and fifth standardizations involve using the range of the variable as the divisor:

    Z4 = X / (Max(X) - Min(X)) ,    (4)

    Z5 = (X - Min(X)) / (Max(X) - Min(X)) .    (5)

Assuming nonnegative values, standardization Z5 is bounded by 0.0 and 1.0 with at least one observed value at each of these end points. Formula Z4 is not bounded in this manner and will not usually behave as a proportion. The transformed mean will be X̄ / (Max(X) - Min(X)) for Z4 and (X̄ - Min(X)) / (Max(X) - Min(X)) for Z5. Both procedures result in a transformed standard deviation of s / (Max(X) - Min(X)). The transformed mean or variance will not be constant across variables for either Z4 or Z5. However, an upper limit for the variance exists and is equal to 0.25. As with Z3, both Z4 and Z5 may be adversely affected by the presence of outliers on the variable.
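
The outlier sensitivity and the variance bound described above can be checked numerically. The following is a minimal sketch, not part of the original study: the data values and the function names z3 and z5 are hypothetical, and the variance is assumed to be computed with divisor n (the population form), which is the form for which the 0.25 limit holds exactly.

```python
from statistics import pvariance


def z3(xs):
    """Divide by the maximum score: Z3 = X / Max(X)."""
    top = max(xs)
    return [x / top for x in xs]


def z5(xs):
    """Range standardization: Z5 = (X - Min(X)) / (Max(X) - Min(X))."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical scores on one variable, without and with a single outlier.
clean = [2.0, 4.0, 6.0, 8.0, 10.0]
with_outlier = clean + [1000.0]

z5_clean = z5(clean)
z5_outlier = z5(with_outlier)
z3_outlier = z3(with_outlier)

# Without the outlier the five scores span the whole unit interval.
print(min(z5_clean), max(z5_clean))

# With the outlier the same five scores are compressed toward 0.0.
print(max(z5_outlier[:5]), max(z3_outlier[:5]))

# Half of the scores at each end point gives the largest variance (divisor n).
extreme = z5([0.0, 0.0, 7.0, 7.0])
print(pvariance(extreme))
```

In this sketch a single score of 1000 pushes the other five transformed values below 0.01 under both Z3 and Z5, which is the compression effect described in the text.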
Standardization Z4 has been mentioned by Anderberg (1973), Carmichael, George, and Julius (1968), Cormack (1971), and Lance and Williams (1967). Formula Z5 has been proposed by Gower (1971), Romesburg (1984), Sneath and Sokal (1973), and Späth (1980). In particular, Sneath and Sokal prefer this transformation for many applications. As with the pair Z1 and Z2, Z4 is a linear function of Z5. Hence, Euclidean distances based on Z4 will be identical in value to those computed from Z5 for the same data set.

On occasion, a standardization based on normalizing to the sum of the observations has been suggested:

    Z6 = X / ΣX .    (6)

Formula Z6 will normalize the sum of the transformed values to 1.00 and the transformed mean will equal 1 / n. As such, the mean will be constant across variables, but the variances will differ. Procedure Z6 was proposed by Romesburg (1984) and is similar to a formula given by Anderberg (1973) which used division by the sample mean.
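
Formulas (1) through (6) and the distance equivalences noted for the pairs Z1, Z2 and Z4, Z5 can be illustrated with a short sketch. It is an illustration only, not the code used in the study: the function names standardize, distances, and scaled_distances and the small two-variable data set are hypothetical, and s is assumed to be the sample standard deviation with divisor n - 1.

```python
from math import isclose, sqrt
from statistics import mean, stdev


def standardize(xs, form):
    """Apply one of the forms Z1..Z6 globally to the scores on one variable."""
    m, s = mean(xs), stdev(xs)  # global sample mean and standard deviation
    lo, hi = min(xs), max(xs)
    total = sum(xs)
    rules = {
        "Z1": lambda x: (x - m) / s,
        "Z2": lambda x: x / s,
        "Z3": lambda x: x / hi,
        "Z4": lambda x: x / (hi - lo),
        "Z5": lambda x: (x - lo) / (hi - lo),
        "Z6": lambda x: x / total,
    }
    return [rules[form](x) for x in xs]


def distances(columns):
    """Euclidean distance for every pair of elements, data stored by variable."""
    rows = list(zip(*columns))
    return [
        sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    ]


# Hypothetical data: five elements measured on two differently scaled variables.
data = [
    [1.0, 2.0, 4.0, 7.0, 11.0],
    [100.0, 350.0, 200.0, 900.0, 400.0],
]


def scaled_distances(form):
    """Standardize each variable with the given form, then compute distances."""
    return distances([standardize(col, form) for col in data])


# Z1 and Z2 differ only by a constant shift, so the distances agree.
same_z1_z2 = all(
    isclose(a, b) for a, b in zip(scaled_distances("Z1"), scaled_distances("Z2"))
)
# The same holds for Z4 and Z5.
same_z4_z5 = all(
    isclose(a, b) for a, b in zip(scaled_distances("Z4"), scaled_distances("Z5"))
)
print(same_z1_z2, same_z4_z5)

# Z6 scores sum to 1.00, so their mean is 1 / n.
z6 = standardize(data[0], "Z6")
print(sum(z6), mean(z6))
```

Each form is applied one variable at a time using global statistics, matching the requirement stated for Z1. Subtracting a constant from every score on a variable leaves all pairwise differences unchanged, which is why each centered form gives the same distances as its uncentered partner.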