A Robust Estimator Based on Density and Scale Optimization and its Application to Clustering

Olfa NASRAOUI and Raghu KRISHNAPURAM
Department of Computer Engineering and Computer Science
University of Missouri-Columbia, Columbia, MO 65211, USA
olfa or raghu@ece.missouri.edu

December 2, 1999

Abstract

In this paper, we propose a new robust algorithm that estimates the prototype parameters of a given structure from a possibly noisy data set. The new algorithm has several attractive features. It does not make any assumptions about the proportion of noise in the data set. Instead, it dynamically estimates a scale parameter and the weights/memberships associated with each data point, and softly rejects outliers based on these weights. The algorithm essentially optimizes a density criterion, since it tries to minimize the size while maximizing the cardinality. Moreover, the proposed algorithm is computationally simple, and can be extended to perform parameter estimation when the data set consists of multiple clusters.

1 Introduction

It is well known that classical statistical estimators such as Least Squares (LS) are inadequate for most real applications, where data can be corrupted by arbitrary noise. To overcome this problem, several robust estimators have been proposed. Examples of such estimators include M-, W-, and L-estimators [3]. These estimators offer an advantage in terms of easy computation and robustness. However, they have relatively low breakdown points. Most of them are better able to withstand outliers in the dependent variable, but break down very early in the presence of leverage points (outliers in the explanatory variables). More recently, the Least Median of Squares (LMedS), the Least Trimmed Squares (LTS) [5], and the Reweighted Least Squares (RLS) [2] have been proposed. These estimators can reach a 50% breakdown point. However, they have nonlinear or discontinuous objective functions that are not amenable to mathematical optimization. This means that a quasi-exhaustive search over all possible parameter values needs to be done to find the global minimum. As a variation, random sampling/searching of some kind has been suggested to find the best fit. In any case, these estimators are limited to estimating the parameters of a single homogeneous structure in a data set. Other limitations of these estimators include their strong dependence on a good initialization, and their reliance on a known or assumed amount of noise present in the data set (contamination rate), or equivalently an estimated scale value or inlier bound. When faced with more noise than assumed, all these algorithms may fail. And when the amount of noise is less than the assumed level, the parameter estimates suffer in terms of accuracy, since not all the good data points are taken into account.

In this paper, we present a new algorithm for the robust estimation of the parameters of a given component without any presuppositions about the noise proportion. The Maximal Density Estimator (MDE) algorithm yields a robust estimate of the parameters by minimizing an objective function that incorporates both the overall error and the contamination rate via a set of robust weights and a scale factor. Unlike most algorithms, the MDE is computationally attractive and practically insensitive to initialization. A modified version of the algorithm can perform clustering or parameter estimation when multiple components are present in the data set. By selecting an appropriate distance measure in the energy function to be minimized, the algorithm can estimate the prototype parameters of structures of different shapes in the data set. For instance, the Euclidean distance is used when the centers of spherical components are being sought. Other distance measures should be used when estimating the parameters of ellipsoidal, linear, or quadratic clusters.
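To make the plug-in role of the distance measure concrete, the following minimal sketch (the function names and the covariance-based variant are our own illustrations, not the paper's notation) shows how the same estimation machinery could target spherical or ellipsoidal structures simply by swapping the squared-distance function:

```python
import numpy as np

def euclidean_dist2(X, center):
    """Squared Euclidean distance from each row of X to a center (spherical structures)."""
    diff = X - center
    return np.sum(diff * diff, axis=1)

def covariance_dist2(X, center, cov):
    """Mahalanobis-style squared distance for ellipsoidal structures.
    (The Gustafson-Kessel distance is related but also includes a determinant scaling.)"""
    diff = X - center
    inv_cov = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
```

Everything downstream of the squared distance (weights, scale, and prototype updates) is unchanged, which is exactly the modularity described above.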
2 Background

In estimating the parameters θ, instead of minimizing the sum of squared residuals, Rousseeuw [5] proposed minimizing their median, i.e.,

\min_{\theta} \; \operatorname{med}_{j} \, d_j^2, \qquad (1)

where d_j are the residuals or distances from the data points x_j to the prototype being estimated. This estimator basically trims the ⌊n/2⌋ observations having the largest residuals. Hence it assumes that the noise proportion is 50%. A major drawback of the LMedS is its low efficiency, since it only uses the middle residual value.

The LTS [5] offers a more efficient way to find robust estimates by minimizing the objective function given by

\min_{\theta} \; \sum_{j=1}^{h} \left( d^2 \right)_{j:n}, \qquad (2)

where (d²)_{j:n} is the j-th smallest residual or distance when the residuals are ordered in ascending order, i.e., (d²)_{1:n} ≤ (d²)_{2:n} ≤ ··· ≤ (d²)_{n:n}. Since h is the number of data points whose residuals are included in the sum, this estimator basically finds a robust estimate by identifying the (n − h) points having the largest residuals as outliers, and discarding (trimming) them from the data set. The resulting estimates are essentially LS estimates of the trimmed data set. It can be seen that h should be as close as possible to the number of good points in the data set, because the more good points are used in the estimates, the more accurate the estimates are. In this case, the LTS will yield the best possible estimate. One problem with the LTS is that its objective function does not lend itself to mathematical optimization. Besides, the estimation of h itself is difficult in practice. In addition, the LTS objective function is based on hard rejection: a given data point is either totally included in the estimation process or totally excluded from it. This may lead to instabilities when optimizing the objective function with respect to the parameters.

Instead of the noise proportion, some algorithms use weights that distinguish between inliers and outliers. However, these weights usually depend on a scale measure, which is also difficult to estimate. For example, the RLS [2] tries to minimize

\min_{\theta} \; \sum_{j=1}^{n} w_j \, d_j^2, \qquad (3)

where d_j are robust residuals resulting from an approximate LMedS or LTS procedure. Here the weights w_j essentially trim outliers from the data used in LS minimization, and can be computed after a preliminary approximate phase of the LMedS or the LTS. The weight function w_j is usually continuous, has a maximum at zero distance, and is monotonically non-increasing with distance. In addition, w_j depends on an error scale σ which is usually heuristically estimated from the results of the LMedS or the LTS. The RLS was intended to refine the estimates resulting from other robust but less efficient algorithms. Hence it requires a good initialization.

Several algorithms [4, 6] address the dilemma of having to know the noise proportion (or equivalently the scale parameter related to the inlier bound) beforehand. Most of them perform a robust estimation process repetitively, with different fixed contamination rates (or equivalently inlier bounds). They finally choose the estimate that optimizes a goodness-of-fit measure. This procedure can be lengthy and computationally expensive, since it performs an exhaustive search over a discretized contamination rate or scale interval.
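For concreteness, here is a minimal sketch of the LMedS and LTS criteria in (1) and (2), assuming squared Euclidean residuals to a candidate center (the function names and this choice of residual are our own, not part of the paper):

```python
import numpy as np

def lmeds_objective(X, center):
    """LMedS criterion (1): median of the squared residuals to a candidate prototype."""
    d2 = np.sum((X - center) ** 2, axis=1)
    return np.median(d2)

def lts_objective(X, center, h):
    """LTS criterion (2): sum of the h smallest squared residuals,
    i.e. the (n - h) largest residuals are trimmed as outliers."""
    d2 = np.sum((X - center) ** 2, axis=1)
    return np.sort(d2)[:h].sum()
```

Both functions only evaluate the criterion for a given candidate θ; as noted above, the minimization over θ (and, for the LTS, the choice of h) is what makes these estimators expensive.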
3 The maximal density estimator algorithm

To confront the problem of estimating h in the LTS, we may want to allow its value to be variable, and optimize a modified objective function which minimizes the trimmed sum of errors while trying to include as many good points as possible in the estimation process. To reflect these multiple objectives, we can formulate the following compound energy function to be minimized:

\min_{\theta, h} \; \sum_{j=1}^{h} \left( d^2 \right)_{j:n} \;-\; \alpha \, h, \qquad (4)

where α is a constant that reflects the relative importance of the two objectives. Unfortunately, the main drawback of the LTS is still present, since mathematical optimization with respect to h is still not possible. To get around this problem, we consider a close relative of the LTS, the RLS. The weights in the RLS determine its success, and the key to this success usually lies in the estimation of a scale measure, σ, that reflects the variance of the set of good points. In other words, it is related to the inlier bound, which determines the maximal residual of the good data points. Unfortunately, mathematical optimization with respect to a scale parameter usually leads to the scale shrinking to zero. This is because a zero value for the scale corresponds to the case when all weights are zero, and this situation results in a global minimum of the objective function. Since the scale parameter is closely related to the proportion of noise, we can reformulate the above objective function so that the scale and the weights are parameters rather than the number of good points h. This formulation is given by

\min_{\theta, \sigma} \; J \;=\; \sum_{j=1}^{n} \frac{w_j \, d_j^2}{\sigma} \;-\; \alpha \sum_{j=1}^{n} w_j, \qquad (5)

where the w_j are a set of positive decreasing weights. The weight w_j can also be considered as the membership of data point x_j in the set of good points. The first term of this objective function tries to minimize the volume enclosed by the residual vectors of the good points. The second term tries to use as many good points (inliers) as possible in the estimation process, via their high weights. Thus the combined effect is to optimize the density, i.e., the ratio of the total number of good points to the volume. In the first term, the distances are normalized by the scale measure for several reasons. First, this normalization counteracts the tendency of the scale to shrink towards zero. Second, unlike the absolute distance d_j², the normalized distance d_j²/σ is a relative measure that indicates how close a data point is to the center relative to the inlier bound. Therefore, using this normalized measure is a more sensible way to penalize the inclusion of outliers in the estimation process, in a way that is independent of the cluster size. Finally, this normalization makes the two terms of the energy function comparable in magnitude. This relieves us from the problem of estimating a value for α, which otherwise would depend on the data set's contamination rate and cluster sizes. Hence, the value of α is fixed as follows: α = 1.
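As a sketch, the compound criterion (5) with α = 1 can be evaluated directly once a weight function is chosen; here we use the Gaussian weights of Eq. (7) below and squared Euclidean residuals (the function name and the residual choice are our own illustration):

```python
import numpy as np

def mde_objective(X, center, sigma, alpha=1.0):
    """Density criterion J of Eq. (5): scale-normalized weighted error
    minus the total weight (a soft count of inliers)."""
    d2 = np.sum((X - center) ** 2, axis=1)   # squared Euclidean residuals
    w = np.exp(-d2 / (2.0 * sigma))          # Gaussian weights of Eq. (7)
    return np.sum(w * d2) / sigma - alpha * np.sum(w)
```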
Finally, we should note that d_j² should be a suitable distance measure, tailored to detect the desired shapes, such as the Euclidean distance for spherical clusters, or the Gustafson-Kessel (GK) distance [1] for ellipsoidal clusters characterized by a covariance matrix, etc.

Since the objective function depends on several variables, we can use the alternating optimization technique, where in each iteration one variable is optimized while fixing all others. If the weights are fixed, then the optimal prototype parameters are found by setting

\frac{\partial J}{\partial \theta} \;=\; \frac{1}{\sigma} \sum_{j=1}^{n} w_j \, \frac{\partial d_j^2}{\partial \theta} \;=\; 0.

For instance, if d_j² is the squared Euclidean distance d_j² = ||x_j − c||², then the center c is given by

c \;=\; \frac{\sum_{j=1}^{n} w_j \, x_j}{\sum_{j=1}^{n} w_j}. \qquad (6)

To derive the optimal scale regardless of the distance measure being used, we fix the prototype parameters and set ∂J/∂σ = 0. Further simplification of this equation depends on the definition of the weight function. We choose to use the Gaussian weight function

w_j \;=\; \exp\!\left( -\frac{d_j^2}{2\sigma} \right) \qquad (7)

to obtain the following update equation for the scale parameter:

\sigma \;=\; \frac{1}{3} \, \frac{\sum_{j=1}^{n} w_j \, d_j^4}{\sum_{j=1}^{n} w_j \, d_j^2}. \qquad (8)

Therefore, the algorithm consists of alternating updates of the prototype parameters, the scale parameter, and the weights in an iterative fashion until convergence, or for a fixed maximum number of iterations.

4 Generalization of the maximal density estimator algorithm to clustering

The objective function in (5) can be extended to allow the estimation of C prototype parameters simultaneously, Θ = (θ_1, ..., θ_C), and to allow for different scales σ_i, as follows:

\min_{\Theta, \sigma_i} \; J \;=\; \sum_{i=1}^{C} \sum_{x_j \in \mathcal{C}_i} \frac{w_{ij} \, d_{ij}^2}{\sigma_i} \;-\; \sum_{i=1}^{C} \sum_{x_j \in \mathcal{C}_i} w_{ij}, \qquad (9)

where

w_{ij} \;=\; \exp\!\left( -\frac{d_{ij}^2}{2\sigma_i} \right), \qquad (10)

and d_{ij}² is the distance from data point x_j to the prototype of cluster C_i. Here w_{ij} can also be considered as the membership of data point x_j in cluster C_i. The partition of the data space is done in a minimum-distance-classifier sense. That is,

\mathcal{C}_i \;=\; \left\{ x_j \in \mathcal{X} \;\middle|\; d_{ij}^2 = \min_{k=1}^{C} d_{kj}^2 \right\}. \qquad (11)

Since each cluster is independent of the rest, it is easy to show that the optimal update equations are similar to the ones obtained for estimating the parameters of one cluster. The scale parameter of the i-th cluster is given by

\sigma_i \;=\; \frac{1}{3} \, \frac{\sum_{x_j \in \mathcal{C}_i} w_{ij} \, d_{ij}^4}{\sum_{x_j \in \mathcal{C}_i} w_{ij} \, d_{ij}^2}. \qquad (12)
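Putting the updates together, here is a minimal sketch of the single-component MDE with Euclidean distances (the initialization, stopping test, and names are our own assumptions; the clustering version of Section 4 would additionally re-partition the data according to Eq. (11) and run the same updates per cluster):

```python
import numpy as np

def mde(X, max_iter=100, tol=1e-6):
    """Maximal Density Estimator sketch for one spherical component.
    Alternates the weight update (7), the center update (6), and the
    scale update (8) until the center stops moving."""
    center = X.mean(axis=0)                                  # crude initialization (assumption)
    sigma = np.mean(np.sum((X - center) ** 2, axis=1))       # crude initial scale (assumption)
    for _ in range(max_iter):
        d2 = np.sum((X - center) ** 2, axis=1)               # squared Euclidean residuals
        w = np.exp(-d2 / (2.0 * sigma))                      # Eq. (7): soft inlier memberships
        new_center = (w[:, None] * X).sum(axis=0) / w.sum()  # Eq. (6): weighted mean
        sigma = (w * d2 ** 2).sum() / (3.0 * (w * d2).sum()) # Eq. (8): scale update
        if np.linalg.norm(new_center - center) < tol:
            center = new_center
            break
        center = new_center
    return center, sigma, w
```

The soft weights w returned by the sketch play the role of the memberships discussed above: points far from the prototype relative to the estimated scale receive weights near zero and are effectively (but not hard-) rejected.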