A statistical test to identify differences in clustering structures
André Fujita*
Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Brazil.
Daniel Y. Takahashi
Department of Psychology and Neuroscience Institute, Green Hall, Princeton University, USA.
Alexandre G. Patriota
Department of Statistics, Institute of Mathematics and Statistics, University of São Paulo, Brazil.
João R. Sato
Center of Mathematics, Computation, and Cognition, Federal University of ABC, Brazil.
*Corresponding author
Rua do Matão, 1010 - Building C, Cidade Universitária
São Paulo - SP - Brasil - CEP 05508-090
Phone: +55 11 3091 5177
arXiv:1311.6732v1 [stat.ME] 26 Nov 2013

Abstract
Statistical inference on functional magnetic resonance imaging (fMRI) data is an important task in brain imaging. One major hypothesis is that the presence or absence of a psychiatric disorder can be explained by the differential clustering of neurons in the brain. Given this hypothesis, it is of clear interest to ask whether the properties of the clusters differ between groups of patients and controls. The usual approach to group differences in brain imaging is to carry out a voxel-wise univariate analysis for a difference between the mean group responses using an appropriate test (e.g. a t-test) and to assemble the resulting “significantly different voxels” into clusters, testing again at cluster level. In this approach, of course, the primary voxel-level test is blind to any cluster structure. Direct assessments of differences between groups (or reproducibility within groups) at the cluster level have been rare in brain imaging. For this reason, we introduce a novel statistical test called ANOCVA - ANalysis Of Cluster structure VAriability, which statistically tests whether two or more populations are equally clustered using specific features. The proposed method allows us to compare the clustering structure of multiple groups simultaneously, and also to identify features that contribute to the differential clustering. We illustrate the performance of ANOCVA through simulations and an application to an fMRI data set composed of children with ADHD and controls. Results show that there are several differences in the brain's clustering structure between them, corroborating the hypothesis in the literature. Furthermore, we identified some brain regions previously not described, generating new hypotheses to be tested empirically.
Keywords: clustering; silhouette method; statistical test.
1 Introduction
Biological data sets are growing enormously, leading to an information-driven science (Stein, 2008) and allowing previously impossible breakthroughs. However, identifying relevant characteristics among these large data sets is an increasing challenge. For example, in medicine, the identification of features that characterize control and disease subjects is key for the development of diagnostic procedures, prognosis and therapy (Rubinov and Sporns, 2010). Among several exploratory methods, the study of clustering structures is a very appealing candidate method, mainly because several biological questions can be formalized in the form: Are the features of populations A and B equally clustered? One typical example occurs in neuroscience. It is believed that the brain is organized in clusters of neurons with different major functionalities, and deviations from the typical clustering pattern can lead to a pathological condition (Grossberg, 2000). Another example is in molecular biology, where the gene expression clustering structures depend on the analyzed population (control or tumor, for instance) (Furlan et al., 2011; Wang et al., 2013). Therefore, in order to better understand diseases, it is necessary to differentiate the clustering structures among different populations. This leads to the problem of how to statistically test the equality of clustering structures of two or more populations, followed by the identification of features that are clustered in a different manner. The traditional approach is to compare some descriptive statistics of the clustering structure (number of clusters, common elements in the clusters, etc.) (Meila, 2007; Cecchi et al., 2009; Kluger et al., 2003), but to the best of our knowledge, little or nothing is known regarding formal statistical methods to test the equality of clustering structures among populations. With this motivation, we introduce a new statistical test called ANOCVA - ANalysis Of Cluster structure VAriability - in order to statistically compare the clustering structures of two or more populations.

Our method is an extension of two well-established ideas: the silhouette statistic (Rousseeuw, 1987) and ANOVA. Essentially, we use the silhouette statistic to measure the “variability” of the clustering structure in each population. Next, we compare the silhouette among populations. The intuitive idea behind this approach is that we assume that populations with the same clustering structures also have the same “variability”. This simple idea allows us to obtain a powerful statistical test for equality of clustering structures, which (1) can be applied to a large variety of clustering algorithms; (2) allows us to compare the clustering structure of multiple groups simultaneously; (3) is fast and easy to implement; and (4) identifies features that significantly contribute to the differential clustering.

We illustrate the performance of ANOCVA through simulation studies under different realistic scenarios and demonstrate the power of the test in identifying small differences in clustering among populations. We also applied our method to study the whole brain functional magnetic resonance imaging (fMRI) recordings of 759 children with typical development (TD), attention deficit hyperactivity disorder (ADHD) with hyperactivity/impulsivity and inattentiveness, and ADHD with hyperactivity/impulsivity without inattentiveness. ADHD is a psychiatric disorder that usually begins in childhood and often persists into adulthood, affecting at least 5-10% of children in the US and non-US populations (Fair et al., 2007). Given its prevalence, its impact on the children's social life, and the difficulty of diagnosis, a better understanding of its pathology is fundamental. The statistical analysis using ANOCVA on this large fMRI data set composed of subjects with ADHD and subjects with TD identified brain regions that are consistent with the existing literature on this physiopathology. Moreover, we have also identified some brain regions previously not described as associated with this disorder, generating new hypotheses to be tested empirically.
2 Methods
We can describe our problem in the following way. Given $k$ populations $T_1, T_2, \ldots, T_k$, where each population $T_j$ ($j = 1, \ldots, k$) is composed of $n_j$ subjects, and each subject has $N$ items that are clustered in some manner, we would like to verify whether the cluster structures of the $k$ populations are equal and, if not, which items are differently clustered. To further formalize our method, we must define what we mean by cluster structure. The silhouette statistic is used in our proposal to identify the cluster structure. We briefly describe it in the next section.
2.1 The silhouette statistic
The silhouette method was proposed by Rousseeuw (1987) with the purpose of verifying whether a specific item was assigned to an appropriate cluster. In other words, the silhouette statistic is a measure of goodness-of-fit of the clustering procedure. Let $\mathcal{X} = \{x_1, \ldots, x_N\}$ be the items of one subject that are clustered into $\mathcal{C} = \{C_1, \ldots, C_r\}$ clusters by a clustering algorithm according to an optimal criterion. Note that $\mathcal{X} = \bigcup_{q=1}^{r} C_q$. Denote by $d(x, y)$ the dissimilarity (e.g. Euclidean, Manhattan, etc.) between items $x$ and $y$ and define

$$d(x, C) = \frac{1}{\#C} \sum_{y \in C} d(x, y) \qquad (1)$$

as the average dissimilarity of $x$ to all items of cluster $C \subset \mathcal{X}$ (or $C \in \mathcal{C}$), where $\#C$ is the number of items of $C$. Denote by $D_q \in \mathcal{C}$ the cluster to which $x_q$ has been assigned by the clustering algorithm and by $E_q \in \mathcal{C}$ any other cluster different from $D_q$, for all $q = 1, \ldots, N$. All quantities involved in the silhouette statistic are given by

$$a_q = d(x_q, D_q) \quad \text{and} \quad b_q = \min_{E_q \neq D_q} d(x_q, E_q), \quad \text{for } q = 1, \ldots, N,$$

where $a_q$ is the “within” dissimilarity and $b_q$ is the smallest “between” dissimilarity for the sample unit $x_q$. Then a natural proposal to measure how well item $x_q$ has been clustered is given by the silhouette statistic (Rousseeuw, 1987)

$$s_q = \begin{cases} \dfrac{b_q - a_q}{\max\{b_q, a_q\}}, & \text{if } \#D_q > 1, \\ 0, & \text{if } \#D_q = 1. \end{cases} \qquad (2)$$

The choice of the silhouette statistic is interesting due to its interpretations. Notice that, if $s_q \approx 1$, this implies that the “within” dissimilarity is much smaller than the smallest “between” dissimilarity ($a_q \ll b_q$). In other words, item $x_q$ has been assigned to an appropriate cluster since the second-best choice cluster is not nearly as close as the actual cluster. If $s_q \approx 0$, then $a_q \approx b_q$, hence it is not clear whether $x_q$ should have been assigned to the actual cluster or to the second-best choice cluster because it lies equally far away from both. If $s_q \approx -1$, then $a_q \gg b_q$, so item $x_q$ lies much closer to the second-best choice cluster than to the actual cluster. Therefore it is more natural to assign item $x_q$ to the second-best choice cluster instead of the actual cluster because this item $x_q$ has been “misclassified”. To conclude, $s_q$ measures how well item $x_q$ has been labeled.

Let $\mathbf{Q} = \{d(x_l, x_q)\}$ be the $(N \times N)$-matrix of dissimilarities; it is symmetric and has zero diagonal elements. Let $\mathbf{l} = (l_1, l_2, \ldots, l_N)$ be the labels obtained by a clustering algorithm applied to the dissimilarity matrix $\mathbf{Q}$, i.e., the labels represent the cluster each item belongs to. It can be easily verified that the dissimilarity matrix $\mathbf{Q}$ and the vector of labels $\mathbf{l}$ are sufficient to compute the quantities $s_1, \ldots, s_N$. In order to avoid notational confusion, we will write $s_q^{(\mathbf{Q}, \mathbf{l})}$ rather than $s_q$ for all $q = 1, \ldots, N$, because we deal with many data sets in the next section.
2.2 Extension of the silhouette approach
In the previous section, we introduced the notation for $N$ items in one subject. In the present section, we extend the approach to many populations and many subjects in each population. Let $T_1, T_2, \ldots, T_k$ be $k$ types of populations. For the $j$th population, $n_j$ subjects are collected, for $j = 1, \ldots, k$. In order to establish notation, the items of the $i$th subject taken from the $j$th population are represented by the matrix $\mathbf{X}_{i,j} = (\mathbf{x}_{i,j,1}, \ldots, \mathbf{x}_{i,j,N})$, where each item $\mathbf{x}_{i,j,l}$ ($l = 1, \ldots, N$) is a vector.

First we define the matrix of dissimilarities among items of each matrix $\mathbf{X}_{i,j}$ by

$$\mathbf{A}_{i,j} = d(\mathbf{x}_{i,j,l}, \mathbf{x}_{i,j,q}), \quad \text{for } i = 1, \ldots, n_j, \; j = 1, \ldots, k.$$

Notice that each $\mathbf{A}_{i,j}$ is symmetric with diagonal elements equal to zero. Also, we define the following average matrices of dissimilarities

$$\bar{\mathbf{A}}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \mathbf{A}_{i,j} = \frac{1}{n_j} \sum_{i=1}^{n_j} d(\mathbf{x}_{i,j,l}, \mathbf{x}_{i,j,q}) \quad \text{and} \quad \bar{\bar{\mathbf{A}}} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar{\mathbf{A}}_j,$$

where $n = \sum_{j=1}^{k} n_j$ and $l, q = 1, \ldots, N$. The $(N \times N)$-matrices $\bar{\mathbf{A}}_1, \ldots, \bar{\mathbf{A}}_k$ and $\bar{\bar{\mathbf{A}}}$ are the only quantities required for proceeding with our proposal.

Now, based on the matrix of dissimilarities $\bar{\bar{\mathbf{A}}}$ we can use a clustering algorithm to find the clustering labels $\mathbf{l}_{\bar{\bar{\mathbf{A}}}}$. Then, we compute the following silhouette statistics

$$s_q^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})} \quad \text{and} \quad s_q^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}, \quad \text{for } q = 1, \ldots, N.$$

The former is the silhouette statistic based on the matrix of dissimilarities $\bar{\bar{\mathbf{A}}}$ and the latter is the silhouette statistic based on the dissimilarity matrix $\bar{\mathbf{A}}_j$, both obtained by using the clustering labels computed via the matrix $\bar{\bar{\mathbf{A}}}$. We expect that if the items from all populations $T_1, \ldots, T_k$ are equally clustered, the quantities $s_q^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}$ and $s_q^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}$ must be close for all $j = 1, \ldots, k$ and $q = 1, \ldots, N$.
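The averaging step described in this section can be sketched in a few lines. The code below is only an illustration under stated assumptions: the helper names `elementwise_mean` and `pooled_mean` and the small per-subject matrices are hypothetical, and the clustering step that would produce the common labels from the pooled matrix is left to whichever algorithm the reader prefers.

```python
def elementwise_mean(matrices):
    """Return the element-wise mean of equally sized square matrices (A_bar_j)."""
    count = len(matrices)
    size = len(matrices[0])
    return [[sum(m[row][col] for m in matrices) / count for col in range(size)]
            for row in range(size)]


def pooled_mean(group_means, group_sizes):
    """Return (1/n) * sum_j n_j * A_bar_j, where n = sum_j n_j."""
    n = sum(group_sizes)
    size = len(group_means[0])
    return [[sum(n_j * A[row][col] for A, n_j in zip(group_means, group_sizes)) / n
             for col in range(size)]
            for row in range(size)]


# Hypothetical per-subject dissimilarity matrices A_{i,j} with N = 3 items.
populations = [
    [  # population 1: n_1 = 2 subjects
        [[0, 1, 4], [1, 0, 3], [4, 3, 0]],
        [[0, 3, 6], [3, 0, 5], [6, 5, 0]],
    ],
    [  # population 2: n_2 = 1 subject
        [[0, 5, 2], [5, 0, 1], [2, 1, 0]],
    ],
]

A_bar = [elementwise_mean(subjects) for subjects in populations]  # one matrix per population
sizes = [len(subjects) for subjects in populations]               # n_1, ..., n_k
A_pooled = pooled_mean(A_bar, sizes)                              # overall average matrix

print(A_bar[0])
print(A_pooled)
```

Weighting each population average by $n_j$ makes the pooled matrix equal to the plain average over all $n$ subjects, so larger populations contribute proportionally more. Both kinds of matrices stay symmetric with a zero diagonal, which is what the silhouette computation expects.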
2.3 Statistical tests
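Define the following vectors

$$\mathbf{S} = \left( s_1^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}, \ldots, s_N^{(\bar{\bar{\mathbf{A}}}, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})} \right)^{\top} \quad \text{and} \quad \mathbf{S}_j = \left( s_1^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})}, \ldots, s_N^{(\bar{\mathbf{A}}_j, \mathbf{l}_{\bar{\bar{\mathbf{A}}}})} \right)^{\top}.$$

We want to test if all $k$ populations are clustered in the same manner, i.e.:

$H_0$: “Given the clustering algorithm, the data from $T_1, T_2, \ldots, T_k$ are equally clustered”.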
