A statistical test to identify differences in clustering structures

André Fujita*
Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Brazil.

Daniel Y. Takahashi
Department of Psychology and Neuroscience Institute, Green Hall, Princeton University, USA.

Alexandre G. Patriota
Department of Statistics, Institute of Mathematics and Statistics, University of São Paulo, Brazil.

João R. Sato
Center of Mathematics, Computation, and Cognition, Federal University of ABC, Brazil.

*Corresponding author
Rua do Matão, 1010 - Building C, Cidade Universitária
São Paulo - SP - Brasil - CEP 05508-090
Phone: +55 11 3091 5177

arXiv:1311.6732v1 [stat.ME] 26 Nov 2013

Abstract

Statistical inference on functional magnetic resonance imaging (fMRI) data is an important task in brain imaging. One major hypothesis is that the presence or absence of a psychiatric disorder can be explained by the differential clustering of neurons in the brain. In view of this, it is clearly of interest to address the question of whether the properties of the clusters differ between groups of patients and controls. The usual method of approaching group differences in brain imaging is to carry out a voxel-wise univariate analysis for a difference between the mean group responses using an appropriate test (e.g. a t-test) and to assemble the resulting “significantly different voxels” into clusters, testing again at the cluster level. In this approach, of course, the primary voxel-level test is blind to any cluster structure. Direct assessments of differences between groups (or reproducibility within groups) at the cluster level have been rare in brain imaging. For this reason, we introduce a novel statistical test called ANOCVA - ANalysis Of Cluster structure VAriability, which statistically tests whether two or more populations are equally clustered using specific features.
The proposed method allows us to compare the clustering structure of multiple groups simultaneously, and also to identify features that contribute to the differential clustering. We illustrate the performance of ANOCVA through simulations and an application to an fMRI data set composed of children with ADHD and controls. Results show that there are several differences in the brain’s clustering structure between them, corroborating the hypothesis in the literature. Furthermore, we identified some brain regions previously not described, generating new hypotheses to be tested empirically.

Keywords: clustering; silhouette method; statistical test.

1 Introduction

Biological data sets are growing enormously, leading to an information-driven science (Stein, 2008) and allowing previously impossible breakthroughs. However, it is becoming increasingly difficult to identify relevant characteristics in these large data sets. For example, in medicine, the identification of features that characterize control and disease subjects is key for the development of diagnostic procedures, prognosis and therapy (Rubinov and Sporns, 2010). Among several exploratory methods, the study of clustering structures is a very appealing candidate, mainly because several biological questions can be formalized in the form: Are the features of populations A and B equally clustered? One typical example occurs in neuroscience. It is believed that the brain is organized in clusters of neurons with different major functionalities, and deviations from the typical clustering pattern can lead to a pathological condition (Grossberg, 2000). Another example is in molecular biology, where the gene expression clustering structures depend on the analyzed population (control or tumor, for instance) (Furlan et al., 2011; Wang et al., 2013). Therefore, in order to better understand diseases, it is necessary to differentiate the clustering structures among different populations.
This leads to the problem of how to statistically test the equality of clustering structures of two or more populations, followed by the identification of features that are clustered in a different manner. The traditional approach is to compare some descriptive statistics of the clustering structure (number of clusters, common elements in the clusters, etc.) (Meila, 2007; Cecchi et al., 2009; Kluger et al., 2003), but to the best of our knowledge, little or nothing is known regarding formal statistical methods to test the equality of clustering structures among populations. With this motivation, we introduce a new statistical test called ANOCVA - ANalysis Of Cluster structure VAriability - in order to statistically compare the clustering structures of two or more populations.

Our method is an extension of two well established ideas: the silhouette statistic (Rousseeuw, 1987) and ANOVA. Essentially, we use the silhouette statistic to measure the “variability” of the clustering structure in each population. Next, we compare the silhouette among populations. The intuitive idea behind this approach is that populations with the same clustering structures are assumed to also have the same “variability”. This simple idea allows us to obtain a powerful statistical test for equality of clustering structures, which (1) can be applied to a large variety of clustering algorithms; (2) allows us to compare the clustering structure of multiple groups simultaneously; (3) is fast and easy to implement; and (4) identifies features that significantly contribute to the differential clustering.

We illustrate the performance of ANOCVA through simulation studies under different realistic scenarios and demonstrate the power of the test in identifying small differences in clustering among populations.
We also applied our method to study the whole brain functional magnetic resonance imaging (fMRI) recordings of 759 children with typical development (TD), attention deficit hyperactivity disorder (ADHD) with hyperactivity/impulsivity and inattentiveness, and ADHD with hyperactivity/impulsivity without inattentiveness. ADHD is a psychiatric disorder that usually begins in childhood and often persists into adulthood, affecting at least 5-10% of children in the US and non-US populations (Fair et al., 2007). Given its prevalence, its impact on the children’s social life, and the difficulty of diagnosis, a better understanding of its pathology is fundamental. The statistical analysis using ANOCVA on this large fMRI data set composed of ADHD and TD subjects identified brain regions that are consistent with the existing literature on its pathophysiology. Moreover, we have also identified some brain regions not previously described as associated with this disorder, generating new hypotheses to be tested empirically.

2 Methods

We can describe our problem in the following way. Given $k$ populations $T_1, T_2, \ldots, T_k$, where each population $T_j$ ($j = 1, \ldots, k$) is composed of $n_j$ subjects, and each subject has $N$ items that are clustered in some manner, we would like to verify whether the cluster structures of the $k$ populations are equal and, if not, which items are differently clustered. To further formalize our method, we must define what we mean by cluster structure. The silhouette statistic is used in our proposal to identify the cluster structure. We briefly describe it in the next section.

2.1 The silhouette statistic

The silhouette method was proposed by Rousseeuw (1987) with the purpose of verifying whether a specific item was assigned to an appropriate cluster. In other words, the silhouette statistic is a measure of goodness-of-fit of the clustering procedure.
Let $X = \{x_1, \ldots, x_N\}$ be the items of one subject that are clustered into $r$ clusters $\mathcal{C} = \{C_1, \ldots, C_r\}$ by a clustering algorithm according to an optimal criterion. Note that $X = \bigcup_{q=1}^{r} C_q$. Denote by $d(x, y)$ the dissimilarity (e.g. Euclidean, Manhattan, etc.) between items $x$ and $y$ and define

$$d(x, C) = \frac{1}{\#C} \sum_{y \in C} d(x, y) \tag{1}$$

as the average dissimilarity of $x$ to all items of cluster $C \subset X$ (or $C \in \mathcal{C}$), where $\#C$ is the number of items of $C$. Denote by $D_q \in \mathcal{C}$ the cluster to which $x_q$ has been assigned by the clustering algorithm and by $E_q \in \mathcal{C}$ any other cluster different from $D_q$, for all $q = 1, \ldots, N$. All quantities involved in the silhouette statistic are given by

$$a_q = d(x_q, D_q) \quad \text{and} \quad b_q = \min_{E_q \neq D_q} d(x_q, E_q), \quad \text{for } q = 1, \ldots, N,$$

where $a_q$ is the “within” dissimilarity and $b_q$ is the smallest “between” dissimilarity for the sample unit $x_q$. Then a natural proposal to measure how well item $x_q$ has been clustered is given by the silhouette statistic (Rousseeuw, 1987)

$$s_q = \begin{cases} \dfrac{b_q - a_q}{\max\{b_q, a_q\}}, & \text{if } \#D_q > 1, \\ 0, & \text{if } \#D_q = 1. \end{cases} \tag{2}$$

The choice of the silhouette statistic is interesting due to its interpretations. Notice that, if $s_q \approx 1$, this implies that the “within” dissimilarity is much smaller than the smallest “between” dissimilarity ($a_q \ll b_q$). In other words, item $x_q$ has been assigned to an appropriate cluster since the second-best choice cluster is not nearly as close as the actual cluster. If $s_q \approx 0$, then $a_q \approx b_q$, hence it is not clear whether $x_q$ should have been assigned to the actual cluster or to the second-best choice cluster because it lies equally far away from both.
If $s_q \approx -1$, then $a_q \gg b_q$, so item $x_q$ lies much closer to the second-best choice cluster than to the actual cluster. Therefore it is more natural to assign item $x_q$ to the second-best choice cluster instead of the actual cluster because this item $x_q$ has been “misclassified”. To conclude, $s_q$ measures how well item $x_q$ has been labeled.

Let $\mathbf{Q} = \{d(x_l, x_q)\}$ be the $(N \times N)$-matrix of dissimilarities; it is symmetric and has zero diagonal elements. Let $\mathbf{l} = (l_1, l_2, \ldots, l_N)$ be the labels obtained by a clustering algorithm applied to the dissimilarity matrix $\mathbf{Q}$, i.e., the labels represent the cluster each item belongs to. It can be easily verified that the dissimilarity matrix $\mathbf{Q}$ and the vector of labels $\mathbf{l}$ are sufficient to compute the quantities $s_1, \ldots, s_N$. In order to avoid notational confusion, we will write $s_q^{(\mathbf{Q}, \mathbf{l})}$ rather than $s_q$ for all $q = 1, \ldots, N$, because we deal with many data sets in the next section.

2.2 Extension of the silhouette approach

In the previous section, we introduced notation for the case of $N$ items in one subject. In the present section, we extend the approach to many populations and many subjects in each population. Let $T_1, T_2, \ldots, T_k$ be $k$ types of populations. For the $j$th population, $n_j$ subjects are collected, for $j = 1, \ldots, k$. In order to establish notation, the items of the $i$th subject taken from the $j$th population are represented by the matrix $\mathbf{X}_{i,j} = (\mathbf{x}_{i,j,1}, \ldots, \mathbf{x}_{i,j,N})$, where each item $\mathbf{x}_{i,j,l}$ ($l = 1, \ldots, N$) is a vector.

First we define the matrix of dissimilarities among items of each matrix $\mathbf{X}_{i,j}$ by

$$\mathbf{A}_{i,j} = d(\mathbf{x}_{i,j,l}, \mathbf{x}_{i,j,q}), \quad \text{for } i = 1, \ldots, n_j, \; j = 1, \ldots, k.$$

Notice that each $\mathbf{A}_{i,j}$ is symmetric with diagonal elements equal to zero.
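As a small illustration of this definition, the following Python sketch builds one dissimilarity matrix for a single subject and checks the two properties just mentioned. It is only a sketch under stated assumptions: the Euclidean distance is one of the dissimilarities the text gives as an example, items are stored as rows of an array, and the names `dissimilarity_matrix`, `X_ij` and `A_ij` are hypothetical.

```python
import numpy as np

def dissimilarity_matrix(items):
    """Return the (N x N) matrix of Euclidean dissimilarities between items.

    `items` is an (N x p) array: one row per item, p measurements per item.
    The Euclidean distance and the row layout are assumptions of this sketch.
    """
    items = np.asarray(items, dtype=float)
    # diff[l, q, :] = x_l - x_q, computed for all pairs via broadcasting
    diff = items[:, None, :] - items[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical subject with N = 4 items, each a vector of length 2.
X_ij = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
A_ij = dissimilarity_matrix(X_ij)

# The two properties stated in the text: symmetry and a zero diagonal.
is_symmetric = np.allclose(A_ij, A_ij.T)
has_zero_diagonal = np.allclose(np.diag(A_ij), 0.0)
```

Any other dissimilarity could be substituted, as long as the resulting matrix stays symmetric with a zero diagonal.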
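Also, we define the following average matrices of dissimilarities

$$\bar{\mathbf{A}}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \mathbf{A}_{i,j} = \frac{1}{n_j} \sum_{i=1}^{n_j} d(\mathbf{x}_{i,j,l}, \mathbf{x}_{i,j,q}) \quad \text{and} \quad \bar{\bar{\mathbf{A}}} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar{\mathbf{A}}_j,$$

where $n = \sum_{j=1}^{k} n_j$ and $l, q = 1, \ldots, N$. The $(N \times N)$-matrices $\bar{\mathbf{A}}_1, \ldots, \bar{\mathbf{A}}_k$ and $\bar{\bar{\mathbf{A}}}$ are the only quantities required for proceeding with our proposal.

Now, based on the matrix of dissimilarities $\bar{\bar{\mathbf{A}}}$, we can use a clustering algorithm to find the clustering labels $\mathbf{l}_{\bar{\bar{\mathbf{A}}}}$. Then, we compute the following silhouette statistics

$$s_q^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})} \quad \text{and} \quad s_q^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}, \quad \text{for } q = 1, \ldots, N.$$

The former is the silhouette statistic based on the matrix of dissimilarities $\bar{\bar{\mathbf{A}}}$ and the latter is the silhouette statistic based on the dissimilarity matrix $\bar{\mathbf{A}}_j$, both obtained by using the clustering labels computed via the matrix $\bar{\bar{\mathbf{A}}}$. We expect that if the items from all populations $T_1, \ldots, T_k$ are equally clustered, the quantities $s_q^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}$ and $s_q^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}$ should be close for all $j = 1, \ldots, k$ and $q = 1, \ldots, N$.

2.3 Statistical tests

Define the following vectors

$$\mathbf{S} = \left( s_1^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}, \ldots, s_N^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})} \right)^{\top} \quad \text{and} \quad \mathbf{S}_j = \left( s_1^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}, \ldots, s_N^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})} \right)^{\top}.$$

We want to test if all $k$ populations are clustered in the same manner, i.e.:

$H_0$: “Given the clustering algorithm, the data from $T_1, T_2, \ldots, T_k$ are equally clustered”.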
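The quantities defined in Sections 2.1 to 2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the silhouette follows equation (1) as written (the within-cluster average divides by $\#D_q$ and so includes the zero term $d(x_q, x_q)$, which differs from implementations that exclude the item itself), and `farthest_pair_clustering` is a toy stand-in for whatever clustering algorithm is actually used. The test statistic itself is not defined in the text above, so the sketch stops at the vectors $\mathbf{S}$ and $\mathbf{S}_j$.

```python
import numpy as np

def silhouette(Q, labels):
    """Silhouette s_q of equation (2) from a dissimilarity matrix and labels.

    Assumes at least two clusters. Follows equation (1) as written: the
    average over a cluster divides by its full size #C.
    """
    Q = np.asarray(Q, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # mean_d[q, c] = d(x_q, C_c): average dissimilarity of item q to cluster c
    mean_d = np.column_stack([Q[:, labels == c].mean(axis=1) for c in clusters])
    own = np.searchsorted(clusters, labels)      # column of D_q for each item
    idx = np.arange(len(labels))
    a = mean_d[idx, own]                         # "within" dissimilarity a_q
    other = mean_d.copy()
    other[idx, own] = np.inf                     # exclude D_q from the minimum
    b = other.min(axis=1)                        # smallest "between" b_q
    sizes = np.array([(labels == c).sum() for c in clusters])[own]
    s = np.zeros(len(labels))                    # s_q = 0 when #D_q = 1
    ok = sizes > 1
    s[ok] = (b[ok] - a[ok]) / np.maximum(a[ok], b[ok])
    return s

def group_averages(A_by_group):
    """A_by_group[j] is an (n_j x N x N) array holding the matrices A_{i,j}."""
    n_j = np.array([len(A) for A in A_by_group])
    A_bar = [np.mean(A, axis=0) for A in A_by_group]                # A-bar_j
    A_barbar = sum(n * A for n, A in zip(n_j, A_bar)) / n_j.sum()   # weighted mean
    return A_bar, A_barbar

def silhouette_vectors(A_by_group, cluster_fn):
    """Return S, the list of S_j, and the labels obtained from A-bar-bar."""
    A_bar, A_barbar = group_averages(A_by_group)
    labels = cluster_fn(A_barbar)                # labels computed once, on A-bar-bar
    S = silhouette(A_barbar, labels)
    S_j = [silhouette(A, labels) for A in A_bar]
    return S, S_j, labels

def farthest_pair_clustering(A):
    """Toy stand-in for a clustering algorithm (an assumption of this sketch):
    two clusters seeded by the most dissimilar pair, nearest seed wins."""
    i, j = np.unravel_index(np.argmax(A), A.shape)
    return (A[:, j] < A[:, i]).astype(int)

def line_distances(p):
    """Absolute differences between points on a line (hypothetical toy data)."""
    p = np.asarray(p, dtype=float)
    return np.abs(p[:, None] - p[None, :])

# Two hypothetical populations with N = 4 items each.
group1 = np.stack([line_distances([0, 1, 10, 11])] * 2)   # n_1 = 2 subjects
group2 = np.stack([line_distances([0, 4, 10, 11])])       # n_2 = 1 subject
S, S_j, labels = silhouette_vectors([group1, group2], farthest_pair_clustering)
```

In this toy data the second item of the second population sits further from its cluster, so its entry in `S_j[1]` is noticeably smaller than the corresponding entry of `S`, which is the kind of discrepancy the test is meant to detect.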