A Kernel Statistical Test of Independence

Arthur Gretton
MPI for Biological Cybernetics, Tübingen, Germany
arthur@tuebingen.mpg.de

Kenji Fukumizu
Inst. of Statistical Mathematics, Tokyo, Japan
fukumizu@ism.ac.jp

Choon Hui Teo
NICTA, ANU, Canberra, Australia
choonhui.teo@gmail.com

Le Song
NICTA, ANU and University of Sydney
lesong@it.usyd.edu.au

Bernhard Schölkopf
MPI for Biological Cybernetics, Tübingen, Germany
bs@tuebingen.mpg.de

Alexander J. Smola
NICTA, ANU, Canberra, Australia
alex.smola@gmail.com
Abstract
Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs $O(m^2)$, where $m$ is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
1 Introduction
Kernel independence measures have been widely applied in recent machine learning literature, most commonly in independent component analysis (ICA) [2, 11], but also in fitting graphical models [1] and in feature selection [22]. One reason for their success is that these criteria have a zero expected value if and only if the associated random variables are independent, when the kernels are universal (in the sense of [23]). There is presently no way to tell whether the empirical estimates of these dependence measures indicate a statistically significant dependence, however. In other words, we are interested in the threshold an empirical kernel dependence estimate must exceed before we can dismiss with high probability the hypothesis that the underlying variables are independent.

Statistical tests of independence have been associated with a broad variety of dependence measures. Classical tests such as Spearman's $\rho$ and Kendall's $\tau$ are widely applied; however, they are not guaranteed to detect all modes of dependence between the random variables. Contingency table-based methods, and in particular the power-divergence family of test statistics [17], are the best known general purpose tests of independence, but are limited to relatively low dimensions, since they require a partitioning of the space in which each random variable resides. Characteristic function-based tests [6, 13] have also been proposed, which are more general than kernel density-based tests [19], although to our knowledge they have been used only to compare univariate random variables.

In this paper we present three main results: first, and most importantly, we show how to test whether statistically significant dependence is detected by a particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC, from [9]). That is, we provide a fast ($O(m^2)$ for sample size $m$) and accurate means of obtaining a threshold which HSIC will only exceed with small probability when the underlying variables are independent. Second, we show the distribution
of our empirical test statistic in the large sample limit can be straightforwardly parameterised in terms of kernels on the data. Third, we apply our test to structured data (in this case, by establishing the statistical dependence between a text and its translation). To our knowledge, ours is the first independence test for structured data.

We begin our presentation in Section 2, with a short overview of cross-covariance operators between RKHSs and their Hilbert-Schmidt norms: the latter are used to define the Hilbert-Schmidt Independence Criterion (HSIC). In Section 3, we describe how to determine whether the dependence returned via HSIC is statistically significant, by proposing a hypothesis test with HSIC as its statistic. In particular, we show that this test can be parameterised using a combination of covariance operator norms and norms of mean elements of the random variables in feature space. Finally, in Section 4, we give our experimental results, both for testing dependence between random vectors (which could be used for instance to verify convergence in independent subspace analysis [25]), and for testing dependence between text and its translation. Software to implement the test may be downloaded from http://www.kyb.mpg.de/bs/people/arthur/indep.htm
2 Deﬁnitions and description of HSIC
Our problem setting is as follows:
Problem 1 Let $P_{xy}$ be a Borel probability measure defined on a domain $\mathcal{X} \times \mathcal{Y}$, and let $P_x$ and $P_y$ be the respective marginal distributions on $\mathcal{X}$ and $\mathcal{Y}$. Given an i.i.d. sample $Z := (X, Y) = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of size $m$ drawn independently and identically distributed according to $P_{xy}$, does $P_{xy}$ factorise as $P_x P_y$ (equivalently, we may write $x \perp\!\!\!\perp y$)?
We begin with a description of our kernel dependence criterion, leaving to the following section the question of whether this dependence is significant. This presentation is largely a review of material from [9, 11, 22], the main difference being that we establish links to the characteristic function-based independence criteria in [6, 13]. Let $\mathcal{F}$ be an RKHS, with the continuous feature mapping $\phi(x) \in \mathcal{F}$ from each $x \in \mathcal{X}$, such that the inner product between the features is given by the kernel function $k(x, x') := \langle \phi(x), \phi(x') \rangle$. Likewise, let $\mathcal{G}$ be a second RKHS on $\mathcal{Y}$ with kernel $l(\cdot, \cdot)$ and feature map $\psi(y)$. Following [7], the cross-covariance operator $C_{xy} : \mathcal{G} \to \mathcal{F}$ is defined such that for all $f \in \mathcal{F}$ and $g \in \mathcal{G}$,

$$\langle f, C_{xy} g \rangle_{\mathcal{F}} = \mathbf{E}_{xy}\big([f(x) - \mathbf{E}_x(f(x))][g(y) - \mathbf{E}_y(g(y))]\big).$$

The cross-covariance operator itself can then be written

$$C_{xy} := \mathbf{E}_{xy}[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)], \qquad (1)$$

where $\mu_x := \mathbf{E}_x \phi(x)$, $\mu_y := \mathbf{E}_y \psi(y)$, and $\otimes$ is the tensor product [9, Eq. 6]: this is a generalisation of the cross-covariance matrix between random vectors. When $\mathcal{F}$ and $\mathcal{G}$ are universal reproducing kernel Hilbert spaces (that is, dense in the space of bounded continuous functions [23]) on the compact domains $\mathcal{X}$ and $\mathcal{Y}$, then the largest singular value of this operator $C_{xy}$ is zero if and only if $x \perp\!\!\!\perp y$ [11, Theorem 6]: the operator therefore induces an independence criterion, and can be used to solve Problem 1. The maximum singular value gives a criterion similar to that originally proposed in [18], but with more restrictive function classes (rather than functions of bounded variance). Rather than the maximum singular value, we may use the squared Hilbert-Schmidt norm (the sum of the squared singular values), which has a population expression
$$\mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G}) = \mathbf{E}_{xx'yy'}[k(x,x')\,l(y,y')] + \mathbf{E}_{xx'}[k(x,x')]\,\mathbf{E}_{yy'}[l(y,y')] - 2\,\mathbf{E}_{xy}\big[\mathbf{E}_{x'}[k(x,x')]\,\mathbf{E}_{y'}[l(y,y')]\big] \qquad (2)$$

(assuming the expectations exist), where $x'$ denotes an independent copy of $x$ [9, Lemma 1]: we call this the Hilbert-Schmidt independence criterion (HSIC).

We now address the problem of estimating $\mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G})$ on the basis of the sample $Z$. An unbiased estimator of (2) is a sum of three U-statistics [21, 22],

$$\mathrm{HSIC}(Z) = \frac{1}{(m)_2} \sum_{(i,j) \in \mathbf{i}_2^m} k_{ij} l_{ij} + \frac{1}{(m)_4} \sum_{(i,j,q,r) \in \mathbf{i}_4^m} k_{ij} l_{qr} - \frac{2}{(m)_3} \sum_{(i,j,q) \in \mathbf{i}_3^m} k_{ij} l_{iq}, \qquad (3)$$
where $(m)_n := \frac{m!}{(m-n)!}$, the index set $\mathbf{i}_r^m$ denotes the set of all $r$-tuples drawn without replacement from the set $\{1, \ldots, m\}$, $k_{ij} := k(x_i, x_j)$, and $l_{ij} := l(y_i, y_j)$. For the purpose of testing independence, however, we will find it easier to use an alternative, biased empirical estimate [9, Definition 2], obtained by replacing the U-statistics with V-statistics,¹

$$\mathrm{HSIC}_b(Z) = \frac{1}{m^2} \sum_{i,j}^{m} k_{ij} l_{ij} + \frac{1}{m^4} \sum_{i,j,q,r}^{m} k_{ij} l_{qr} - \frac{2}{m^3} \sum_{i,j,q}^{m} k_{ij} l_{iq} = \frac{1}{m^2} \operatorname{trace}(KHLH), \qquad (4)$$
where the summation indices now denote all $r$-tuples drawn with replacement from $\{1, \ldots, m\}$ ($r$ being the number of indices below the sum), $K$ is the $m \times m$ matrix with entries $k_{ij}$, $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, and $\mathbf{1}$ is an $m \times 1$ vector of ones (the cost of computing this statistic is $O(m^2)$). When a Gaussian kernel $k_{ij} := \exp(-\sigma^{-2}\|x_i - x_j\|^2)$ is used (or a kernel deriving from [6, Eq. 4.10]), the latter statistic is equivalent to the characteristic function-based statistic [6, Eq. 4.11] and the $T_n^2$ statistic of [13, p. 54]: details are reproduced in [10] for comparison. Our setting allows for more general kernels, however, such as kernels on strings (as in our experiments in Section 4) and graphs (see [20] for further details of kernels on structures): this is not possible under the characteristic function framework, which is restricted to Euclidean spaces ($\mathbb{R}^d$ in the case of [6, 13]). As pointed out in [6, Section 5], the statistic in (4) can also be linked to the original quadratic test of Rosenblatt [19] given an appropriate kernel choice; the main differences being that characteristic function-based tests (and RKHS-based tests) are not restricted to using kernel densities, nor should they reduce their kernel width with increasing sample size. Another related test described in [4] is based on the functional canonical correlation between $\mathcal{F}$ and $\mathcal{G}$, rather than the covariance: in this sense the test statistic resembles those in [2]. The approach in [4] differs from both the present work and [2], however, in that the function spaces $\mathcal{F}$ and $\mathcal{G}$ are represented by finite sets of basis functions (specifically B-spline kernels) when computing the empirical test statistic.
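To make (4) concrete, the following sketch (our own illustrative NumPy code, not the authors' released software; the Gaussian bandwidth $\sigma = 1$ is an arbitrary choice) computes $\mathrm{HSIC}_b(Z)$ both via the three V-statistic sums and via the $\frac{1}{m^2}\operatorname{trace}(KHLH)$ shortcut, which coincide exactly:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix k_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-D / sigma**2)

def hsic_b(K, L):
    """Biased V-statistic HSIC_b(Z) = trace(KHLH) / m^2, Eq. (4)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m**2

def hsic_b_sums(K, L):
    """The same statistic via the three V-statistic sums in Eq. (4)."""
    m = K.shape[0]
    t1 = np.sum(K * L) / m**2                               # (1/m^2) sum k_ij l_ij
    t2 = K.sum() * L.sum() / m**4                           # (1/m^4) sum k_ij l_qr
    t3 = 2 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / m**3   # (2/m^3) sum k_ij l_iq
    return t1 + t2 - t3

rng = np.random.default_rng(0)
m = 100
x = rng.normal(size=(m, 1))
y = x + 0.25 * rng.normal(size=(m, 1))   # a dependent toy sample
K = gaussian_kernel(x, sigma=1.0)
L = gaussian_kernel(y, sigma=1.0)
assert np.isclose(hsic_b(K, L), hsic_b_sums(K, L))
```

The trace form is the one to use in practice: it needs only the two Gram matrices, whereas naive evaluation of the four-index sum would cost $O(m^4)$.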
3 Test description
We now describe a statistical test of independence for two random variables, based on the test statistic $\mathrm{HSIC}_b(Z)$. We begin with a more formal introduction to the framework and terminology of statistical hypothesis testing. Given the i.i.d. sample $Z$ defined earlier, the statistical test $\mathcal{T}(Z) : (\mathcal{X} \times \mathcal{Y})^m \to \{0, 1\}$ is used to distinguish between the null hypothesis $H_0 : P_{xy} = P_x P_y$ and the alternative hypothesis $H_1 : P_{xy} \neq P_x P_y$. This is achieved by comparing the test statistic, in our case $\mathrm{HSIC}_b(Z)$, with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population HSIC indicates $P_{xy} = P_x P_y$). The acceptance region of the test is thus defined as any real number below the threshold. Since the test is based on a finite sample, it is possible that an incorrect answer will be returned: the Type I error is defined as the probability of rejecting $H_0$ based on the observed sample, despite $x$ and $y$ being independent. Conversely, the Type II error is the probability of accepting $P_{xy} = P_x P_y$ when the underlying variables are dependent. The level $\alpha$ of a test is an upper bound on the Type I error, and is a design parameter of the test, used to set the test threshold. A consistent test achieves a level $\alpha$
, and a Type II error of zero, in the large sample limit.

How, then, do we set the threshold of the test given $\alpha$? The approach we adopt here is to derive the asymptotic distribution of the empirical estimate $\mathrm{HSIC}_b(Z)$ of $\mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G})$ under $H_0$. We then use the $1-\alpha$ quantile of this distribution as the test threshold.² Our presentation in this section is therefore divided into two parts. First, we obtain the distribution of $\mathrm{HSIC}_b(Z)$ under both $H_0$ and $H_1$; the latter distribution is also needed to ensure consistency of the test. We shall see, however, that the null distribution has a complex form, and cannot be evaluated directly. Thus, in the second part of this section, we describe ways to accurately approximate the $1-\alpha$ quantile of this distribution.

Asymptotic distribution of $\mathrm{HSIC}_b(Z)$  We now describe the distribution of the test statistic in (4). The first theorem holds under $H_1$.

¹ The U- and V-statistics differ in that the latter allow indices of different sums to be equal.
² An alternative would be to use a large deviation bound, as provided for instance by [9] based on Hoeffding's inequality. It has been reported in [8], however, that such bounds are generally too loose for hypothesis testing.
Theorem 1 Let

$$h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} \big[ k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \big], \qquad (5)$$

where the sum represents all ordered quadruples $(t,u,v,w)$ drawn without replacement from $(i,j,q,r)$, and assume $\mathbf{E}[h^2] < \infty$. Under $H_1$, $\mathrm{HSIC}_b(Z)$ converges in distribution as $m \to \infty$ to a Gaussian according to

$$m^{1/2} \big( \mathrm{HSIC}_b(Z) - \mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G}) \big) \xrightarrow{D} \mathcal{N}(0, \sigma_u^2). \qquad (6)$$

The variance is $\sigma_u^2 = 16 \big( \mathbf{E}_i \big[ \mathbf{E}_{j,q,r}[h_{ijqr}] \big]^2 - \mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G})^2 \big)$, where $\mathbf{E}_{j,q,r} := \mathbf{E}_{z_j, z_q, z_r}$.
Proof We first rewrite (4) as a single V-statistic,

$$\mathrm{HSIC}_b(Z) = \frac{1}{m^4} \sum_{i,j,q,r}^{m} h_{ijqr}, \qquad (7)$$

where we note that $h_{ijqr}$ defined in (5) does not change with permutation of its indices. The associated U-statistic $\mathrm{HSIC}_s(Z)$ converges in distribution as (6) with variance $\sigma_u^2$ [21, Theorem 5.5.1(A)]: see [22]. Since the difference between $\mathrm{HSIC}_b(Z)$ and $\mathrm{HSIC}_s(Z)$ drops as $1/m$ (see [9], or Theorem 3 below), $\mathrm{HSIC}_b(Z)$ converges asymptotically to the same distribution.

The second theorem applies under $H_0$.
Theorem 2 Under $H_0$, the U-statistic $\mathrm{HSIC}_s(Z)$ corresponding to the V-statistic in (7) is degenerate, meaning $\mathbf{E}_i h_{ijqr} = 0$. In this case, $\mathrm{HSIC}_b(Z)$ converges in distribution according to [21, Section 5.5.2]

$$m \, \mathrm{HSIC}_b(Z) \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l z_l^2, \qquad (8)$$

where $z_l \sim \mathcal{N}(0, 1)$ i.i.d., and $\lambda_l$ are the solutions to the eigenvalue problem

$$\lambda_l \psi_l(z_j) = \int h_{ijqr} \, \psi_l(z_i) \, dF_{i,q,r},$$

where the integral is over the distribution of variables $z_i$, $z_q$, and $z_r$.

Proof This follows from the discussion of [21, Section 5.5.2], making appropriate allowance for the fact that we are dealing with a V-statistic (which is why the terms in (8) are not centred: in the case of a U-statistic, the sum would be over terms $\lambda_l (z_l^2 - 1)$).
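To build intuition for the limiting law (8), the following toy sketch (eigenvalues $\lambda_l$ chosen arbitrarily for illustration; in the actual test they depend on the unknown distribution) simulates a truncated version of the weighted sum of squared normals and reads off an empirical $1-\alpha$ quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5 ** np.arange(1, 11)             # illustrative decaying eigenvalues
z = rng.normal(size=(100_000, lam.size))  # z_l ~ N(0, 1), i.i.d.
draws = (lam * z**2).sum(axis=1)          # truncation of the sum in Eq. (8)
q95 = np.quantile(draws, 0.95)            # empirical 1 - alpha quantile, alpha = 0.05
```

The mean of each draw is $\sum_l \lambda_l$ (since $\mathbf{E}\,z_l^2 = 1$), but the distribution is right-skewed, which is why a quantile rather than the mean sets the test threshold.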
Approximating the $1-\alpha$ quantile of the null distribution  A hypothesis test using $\mathrm{HSIC}_b(Z)$ could be derived from Theorem 2 above by computing the $(1-\alpha)$th quantile of the distribution (8), where consistency of the test (that is, the convergence to zero of the Type II error for $m \to \infty$) is guaranteed by the decay as $m^{-1}$ of the variance of $\mathrm{HSIC}_b(Z)$ under $H_1$. The distribution under $H_0$ is complex, however: the question then becomes how to accurately approximate its quantiles.

One approach, taken by [6], is to use a Monte Carlo resampling technique: the ordering of the $Y$ sample is permuted repeatedly while that of $X$ is kept fixed, and the $1-\alpha$ quantile is obtained from the resulting distribution of $\mathrm{HSIC}_b$ values. This can be very expensive, however. A second approach, suggested in [13, p. 34], is to approximate the null distribution as a two-parameter Gamma distribution [12, p. 343, p. 359]: this is one of the more straightforward approximations of an infinite sum of $\chi^2$ variables (see [12, Chapter 18.8] for further ways to approximate such distributions; in particular, we wish to avoid using moments of order greater than two, since these can become expensive to compute). Specifically, we make the approximation

$$m \, \mathrm{HSIC}_b(Z) \sim \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha} \Gamma(\alpha)} \quad \text{where} \quad \alpha = \frac{\big(\mathbf{E}(\mathrm{HSIC}_b(Z))\big)^2}{\mathrm{var}(\mathrm{HSIC}_b(Z))}, \qquad \beta = \frac{m \, \mathrm{var}(\mathrm{HSIC}_b(Z))}{\mathbf{E}(\mathrm{HSIC}_b(Z))}. \qquad (9)$$
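The Monte Carlo resampling alternative is straightforward to sketch. The code below is our own illustrative implementation of the permutation scheme (not the authors' released software; the Gaussian kernel and bandwidth are arbitrary choices for the toy example):

```python
import numpy as np

def hsic_b(K, L):
    # Biased statistic of Eq. (4): trace(KHLH) / m^2.
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m**2

def gram(X, sigma=1.0):
    # Gaussian Gram matrix k_ij = exp(-||x_i - x_j||^2 / sigma^2).
    sq = np.sum(X**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / sigma**2)

def permutation_threshold(K, L, alpha=0.05, n_perm=500, seed=0):
    """Estimate the 1 - alpha quantile of HSIC_b under H_0 by
    permuting the Y sample (rows and columns of L) with X fixed."""
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    stats = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(m)
        stats[b] = hsic_b(K, L[np.ix_(p, p)])
    return np.quantile(stats, 1 - alpha)

# Toy example: independent x and y should only rarely exceed the threshold.
rng = np.random.default_rng(1)
m = 100
x = rng.normal(size=(m, 1))
y = rng.normal(size=(m, 1))          # independent of x
Kx, Ly = gram(x), gram(y)
reject = hsic_b(Kx, Ly) > permutation_threshold(Kx, Ly)
```

Each permutation reuses the precomputed Gram matrices, so the total cost is roughly proportional to the number of permutations; this repeated expense is what the two-parameter Gamma approximation of (9) avoids.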
Figure 1: $m \, \mathrm{HSIC}_b$ cumulative distribution function (Emp) under $H_0$ for $m = 200$, obtained empirically using 5000 independent draws of $m \, \mathrm{HSIC}_b$. The two-parameter Gamma distribution (Gamma) is fit using $\alpha = 1.17$ and $\beta = 8.3 \times 10^{-4}$ in (9), with mean and variance computed via Theorems 3 and 4. [Plot: empirical (Emp) and Gamma CDFs of $m\,\mathrm{HSIC}_b$, with $m\,\mathrm{HSIC}_b$ on the horizontal axis and $P(m\,\mathrm{HSIC}_b(Z) < m\,\mathrm{HSIC}_b)$ on the vertical axis.]
An illustration of the cumulative distribution function (CDF) obtained via the Gamma approximation is given in Figure 1, along with an empirical CDF obtained by repeated draws of $\mathrm{HSIC}_b$. We note the Gamma approximation is quite accurate, especially in areas of high probability (which we use to compute the test quantile). The accuracy of this approximation will be further evaluated experimentally in Section 4.

To obtain the Gamma distribution from our observations, we need empirical estimates for $\mathbf{E}(\mathrm{HSIC}_b(Z))$ and $\mathrm{var}(\mathrm{HSIC}_b(Z))$ under the null hypothesis. Expressions for these quantities are given in [13, pp. 26-27]; however, these are in terms of the joint and marginal characteristic functions, and not in our more general kernel setting (see also [14, p. 313]). In the following two theorems, we provide much simpler expressions for both quantities, in terms of norms of the mean elements $\mu_x$ and $\mu_y$, and the covariance operators

$$C_{xx} := \mathbf{E}_x[(\phi(x) - \mu_x) \otimes (\phi(x) - \mu_x)]$$

and $C_{yy}$, in feature space. The main advantage of our new expressions is that they are computed entirely in terms of kernels, which makes possible the application of the test to any domains on which kernels can be defined, and not only $\mathbb{R}^d$.
Theorem 3 Under $H_0$,

$$\mathbf{E}(\mathrm{HSIC}_b(Z)) = \frac{1}{m} \operatorname{Tr} C_{xx} \operatorname{Tr} C_{yy} = \frac{1}{m} \big( 1 + \|\mu_x\|^2 \|\mu_y\|^2 - \|\mu_x\|^2 - \|\mu_y\|^2 \big), \qquad (10)$$

where the second equality assumes $k_{ii} = l_{ii} = 1$. An empirical estimate of this statistic is obtained by replacing the norms above with

$$\widehat{\|\mu_x\|^2} = (m)_2^{-1} \sum_{(i,j) \in \mathbf{i}_2^m} k_{ij},$$

bearing in mind that this results in a (generally negligible) bias of $O(m^{-1})$ in the estimate of $\|\mu_x\|^2 \|\mu_y\|^2$.
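As an illustrative rendering of Theorem 3 in code (our own sketch, not the authors' software; it presumes kernels with $k_{ii} = l_{ii} = 1$, such as the Gaussian):

```python
import numpy as np

def mean_hsic_under_h0(K, L):
    """Empirical estimate of E(HSIC_b(Z)) under H_0, Eq. (10).
    Assumes k_ii = l_ii = 1 (e.g. Gaussian kernels)."""
    m = K.shape[0]
    # ||mu_x||^2 and ||mu_y||^2 estimated by the off-diagonal Gram means.
    mu_x_sq = (K.sum() - np.trace(K)) / (m * (m - 1))
    mu_y_sq = (L.sum() - np.trace(L)) / (m * (m - 1))
    return (1.0 + mu_x_sq * mu_y_sq - mu_x_sq - mu_y_sq) / m

# Degenerate check: with all off-diagonal entries zero, the mean-element
# norms vanish and the estimate reduces to 1/m.
est = mean_hsic_under_h0(np.eye(4), np.eye(4))   # = 1/4
```

Note the off-diagonal mean is exactly the $(m)_2^{-1} \sum_{(i,j)\in\mathbf{i}_2^m} k_{ij}$ estimator of the theorem, since the sum runs over pairs with $i \neq j$.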
Theorem 4 Under $H_0$,

$$\mathrm{var}(\mathrm{HSIC}_b(Z)) = \frac{2(m-4)(m-5)}{(m)_4} \|C_{xx}\|_{HS}^2 \, \|C_{yy}\|_{HS}^2 + O(m^{-3}).$$

Denoting by $\odot$ the entrywise matrix product, $A^{\cdot 2}$ the entrywise matrix power, and $B = ((HKH) \odot (HLH))^{\cdot 2}$, an empirical estimate with negligible bias may be found by replacing the product of covariance operator norms with $(m)_2^{-1} \mathbf{1}^\top (B - \operatorname{diag}(B)) \mathbf{1}$: this is slightly more efficient than taking the product of the empirical operator norms (although the scaling with $m$ is unchanged).

Proofs of both theorems may be found in [10], where we also compare with the original characteristic function-based expressions in [13]. We remark that these parameters, like the original test statistic in (4), may be computed in $O(m^2)$.
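Assembling Theorems 3 and 4 with the Gamma approximation (9) yields the complete test. The sketch below is our own illustrative implementation, not the authors' released software: SciPy's `gamma.ppf` supplies the quantile, the kernel choices in the demo are arbitrary, and the $(m)_2$ normalisation of the off-diagonal sum of $B$ is our reading of the variance estimator.

```python
import numpy as np
from scipy.stats import gamma

def hsic_gamma_test(K, L, level=0.05):
    """HSIC independence test with a two-parameter Gamma null model.
    Returns (reject, m * HSIC_b(Z), threshold). Assumes k_ii = l_ii = 1."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kc, Lc = H @ K @ H, H @ L @ H
    stat = np.trace(Kc @ Lc) / m                  # m * HSIC_b(Z), Eq. (4)

    # Null mean via Theorem 3 / Eq. (10).
    mu_x_sq = (K.sum() - np.trace(K)) / (m * (m - 1))
    mu_y_sq = (L.sum() - np.trace(L)) / (m * (m - 1))
    mean_h0 = (1.0 + mu_x_sq * mu_y_sq - mu_x_sq - mu_y_sq) / m

    # Null variance via Theorem 4, with B = ((HKH) o (HLH))^{.2}; we
    # normalise the off-diagonal sum by (m)_2 = m(m-1) (our assumption).
    B = (Kc * Lc) ** 2
    m4 = m * (m - 1) * (m - 2) * (m - 3)          # (m)_4
    var_h0 = (2.0 * (m - 4) * (m - 5) / m4
              * (B.sum() - np.trace(B)) / (m * (m - 1)))

    # Gamma parameters for m * HSIC_b, Eq. (9).
    a = mean_h0**2 / var_h0
    scale = m * var_h0 / mean_h0
    threshold = gamma.ppf(1 - level, a, scale=scale)
    return stat > threshold, stat, threshold
```

Everything here is assembled from the two Gram matrices, so the whole test runs in a single pass over $K$ and $L$: for strongly dependent samples the statistic far exceeds the Gamma threshold, while for independent draws it should exceed it with probability close to `level`.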
4 Experiments

General tests of statistical independence are most useful for data having complex interactions that simple correlation does not detect. We investigate two cases where this situation arises: first, we test vectors in $\mathbb{R}^d$ which have a dependence relation but no correlation, as occurs in independent subspace analysis; and second, we study the statistical dependence between a text and its translation.

Independence of subspaces  One area where independence tests have been applied is in determining the convergence of algorithms for independent component analysis (ICA), which involves separating random variables that have been linearly mixed, using only their mutual independence. ICA generally entails optimisation over a non-convex function (including when HSIC is itself the optimisation criterion [9]), and is susceptible to local minima, hence the need for these tests (in fact, for classical approaches to ICA, the global minimum of the optimisation might not correspond to independence for certain source distributions). Contingency table-based tests have been applied [15]