Copyright © 1996 by the Genetics Society of America

A Statistical Test of a Neutral Model Using the Dynamics of Cytonuclear Disequilibria

Susmita Datta,* Mike Kiparsky,† David M. Rand† and Jonathan Arnold‡

*Department of Biostatistics, Rollins School of Public Health, Emory University, Atlanta, Georgia 30322, †Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island 02912 and ‡Department of Genetics, University of Georgia, Athens, Georgia 30602
Manuscript received February 26, 1996
Accepted for publication September 11, 1996

ABSTRACT

In this paper we use cytonuclear disequilibria to test the neutrality of mtDNA markers. The data considered here involve sample frequencies of cytonuclear genotypes subject to both statistical sampling variation and genetic sampling variation. First, we obtain the dynamics of the sample cytonuclear disequilibria assuming random drift alone as the source of genetic sampling variation. Next, we develop a test statistic using cytonuclear disequilibria via the theory of generalized least squares to test the random drift model. The null distribution of the test statistic is shown to be approximately chi-squared using an asymptotic argument as well as computer simulation. Power of the test statistic is investigated under an alternative model with drift and selection. The method is illustrated using data from cage experiments utilizing different cytonuclear genotypes of Drosophila melanogaster. A program for implementing the neutrality test is available upon request.
CYTONUCLEAR disequilibrium measures provide new inferential tools in analyzing hybrid zone data. In recent years, there has been increasing attention to the association or interaction between a nuclear gene or genotype and maternally inherited cytoplasmic components such as mitochondria (see LAMB and AVISE 1986; ASMUSSEN et al. 1987; ARNOLD 1993). Questions have arisen concerning whether these cytonuclear associations can be explained without invoking natural selection. Since a significant amount of work in the last decade is based on the neutrality of mitochondrial DNA (mtDNA) markers, a number of researchers have recently designed experiments to test the neutrality of mtDNA markers (CLARK and LYCKEGAARD 1988; MACRAE and ANDERSON 1988; FOS et al. 1990; POLLAK 1991; SCRIBNER and AVISE 1994a,b; S. T. KILPATRICK and D. M. RAND, unpublished results; HUTTER and RAND 1995). We have taken a different approach, testing the neutrality of an mtDNA marker with measures of association of mtDNA markers with nuclear genes or genotypes, termed cytonuclear disequilibria.
Stochastic behavior of allelic disequilibria was addressed by FU and ARNOLD (1992). The dynamics of cytonuclear genotypic disequilibria over different generations have been described by DATTA et al. (1996). Stochastic trajectories of cytonuclear disequilibria calculated in FU and ARNOLD (1992) and DATTA et al. (1996) were used to test the neutrality of mitochondrial DNA markers in a vertebrate cage experiment involving two species of mosquito fish (SCRIBNER and AVISE 1994). In this experiment an artificial hybrid zone composed of two competing species of mosquito fish that interbreed was established. Frequencies of cytonuclear genotypes and associated cytonuclear disequilibria were monitored over time and compared with their expectations under drift.

In such an experiment there are two potential sources of variation in cytonuclear frequencies, statistical sampling variation and genetic sampling variation (WEIR 1990). Statistical sampling variation arises from sampling individuals from a population to estimate frequencies of cytonuclear genotypes (ASMUSSEN et al. 1987). Genetic sampling variation arises from genetic drift, the sampling of gametes from a finite breeding pool of individuals in nature to constitute the next generation (… 1996). In the experiment of SCRIBNER and AVISE (1994), in every generation all individuals from an entire population were sampled. As a consequence statistical sampling variation was eliminated, and only genetic sampling variation (i.e., drift) remained. This unusual design for a cage experiment permitted DATTA and ARNOLD (1996) to develop a simple neutrality test of an mtDNA marker using the observed trajectories of cytonuclear disequilibria.

It is the purpose of this report to extend the domain of applicability of our neutrality test of an mtDNA marker to include cage experiments with both statistical and genetic sampling variation in cytonuclear disequilibria. In Figure 1, we present a schematic diagram for the design of a more general sampling scheme in a cage experiment than in DATTA and ARNOLD (1996). We note that a statistical test of neutrality of a single genetic locus was carried out by SCHAFFER et al. (1977) under a similar sampling scheme. We follow a similar approach in testing whether the dynamics (over time) of the observed cytonuclear disequilibria are consistent with those expected under a random drift model. Tests of hypotheses about cytonuclear disequilibria in the absence of genetic sampling variation have been considered by ASMUSSEN et al. (1987), FU and ARNOLD (1992), and ASMUSSEN and BASTEN (1994). The focus of this article is on a test of fit to a hypothesized drift process using sample estimates of cytonuclear disequilibria over time.

FIGURE 1.-Format of experiment. [Schematic; labels: Generation 0, Generation 1, Generation 2; Initial Population, Population, Sample.]

Corresponding author: Jonathan Arnold, Department of Genetics, University of Georgia, Athens, GA 30602. E-mail: arnold@bscr.uga.edu

Genetics 144: 1985-1992 (December, 1996)
THE METHOD

Cytonuclear disequilibria: Let us suppose we observe a hybrid population at a nuclear locus with two alleles A and a, and at a cytoplasmic locus with alleles M and m simultaneously, as in SCRIBNER and AVISE (1994a). The frequencies of the respective cytonuclear genotypic classes are denoted by $p_1, \ldots, p_6$. For example, $p_1$ is the frequency of the class AA/M (see Table 1). Note that the same population can be represented in terms of the frequencies of the allelic combinations (A/M, A/m, a/M, a/m), as in Table 2.
TABLE 1
Frequencies of cytonuclear genotypes

                   Nuclear genotype
Cytoplasm     AA        Aa        aa        Total
M             $p_1$     $p_2$     $p_3$     $q$
m             $p_4$     $p_5$     $p_6$     $1 - q$
Total         $u$       $v$       $w$       1

TABLE 2
Frequencies of allelic combinations

                   Nuclear allele
Cytoplasm     A                        a                        Total
M             $e_1 = p_1 + p_2/2$      $e_3 = p_3 + p_2/2$      $q$
m             $e_2 = p_4 + p_5/2$      $e_4 = p_6 + p_5/2$      $1 - q$
Total         $p$                      $1 - p$                  1
Cytonuclear disequilibrium is defined to be the association of nuclear genotypes or alleles with cytoplasmic alleles (ASMUSSEN et al. 1987). The following are the cytonuclear disequilibria corresponding to the homozygote AA/M and the heterozygote Aa/M, respectively:

$$D_1 = p_1 - uq, \tag{1}$$

$$D_2 = p_2 - vq, \tag{2}$$

where $u = p_1 + p_4$ and $v = p_2 + p_5$. Similarly, the allelic disequilibrium is given by

$$D = e_1 - pq, \tag{3}$$

where $p = e_1 + e_2$ and $q = e_1 + e_3$, and the $e_i$'s are the frequencies of the allelic combinations (Table 3). For detailed definitions and their moment dynamics we refer to DATTA et al. (1996).

All the above quantities can be defined at the sample level as well. We mark each of these variables with a caret to denote its sample counterpart. For example, $\hat{D}_1 = \hat{p}_1 - \hat{u}\hat{q}$ denotes the sample disequilibrium corresponding to AA/M.
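The definitions in (1)-(3) can be illustrated with a short sketch. The function name `disequilibria` and the ordering of the six genotype frequencies (AA/M, Aa/M, aa/M, AA/m, Aa/m, aa/m, following Table 1) are assumptions made for this illustration only; the same function applies to sample frequencies, in which case it returns the caret (sample) versions.

```python
def disequilibria(p):
    """Return (D1, D2, D) from the six cytonuclear genotype frequencies.

    p is assumed to be ordered as in Table 1:
    AA/M, Aa/M, aa/M, AA/m, Aa/m, aa/m.
    """
    p1, p2, p3, p4, p5, p6 = p
    q = p1 + p2 + p3          # frequency of cytoplasm M
    u = p1 + p4               # frequency of nuclear genotype AA
    v = p2 + p5               # frequency of nuclear genotype Aa
    e1 = p1 + p2 / 2          # frequency of allelic combination A/M
    e2 = p4 + p5 / 2          # frequency of allelic combination A/m
    p_A = e1 + e2             # frequency of nuclear allele A
    D1 = p1 - u * q           # equation (1)
    D2 = p2 - v * q           # equation (2)
    D = e1 - p_A * q          # equation (3)
    return D1, D2, D


# Example: a population made up of equal parts AA/M and aa/m.
D1, D2, D = disequilibria([0.5, 0.0, 0.0, 0.0, 0.0, 0.5])
print(D1, D2, D)
```

A useful check on any implementation is the identity $D = D_1 + D_2/2$, which follows directly from the definitions above.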
A statistical test based on disequilibrium dynamics: From now on, we will consider a cage experiment with the sampling scheme in Figure 1. The experiment starts with an initial base population with nonzero allelic disequilibrium $D = d_0$. Generations are discrete and nonoverlapping, and only a fixed number of randomly selected offspring of the previous generation are introduced into the cage, leading to genetic drift (i.e., genetic sampling variation). The offspring are allowed to mature, mate, and have offspring, and are removed from the cage to make room for the next generation. From the adults of generation t removed from the cage, a total of $n(t)$ adults are randomly sampled to estimate frequencies of cytonuclear genotypes in Table 1 (see Figure 1), leading to statistical sampling variation. Consequently, the sample genotypic counts are multinomial with parameters equal to the cytonuclear genotypic frequencies $p_1(t), \ldots, p_6(t)$ of the tth generation of the population. In each generation these genotypic frequencies are the realization of a drift process. The model used here to describe the genetic sampling variation is termed the random union of zygotes (RUZ) model (WATTERSON 1970, 1972).

TABLE 3
Gametic (allelic) disequilibrium

                   Nuclear allele
Cytoplasm     A                        a                              Total
M             $e_1 = pq + D$           $e_3 = (1 - p)q - D$           $q$
m             $e_2 = p(1 - q) - D$     $e_4 = (1 - p)(1 - q) + D$     $1 - q$
Total         $p$                      $1 - p$                        1

Genetic sampling variation:
Let $X_{fm}(t)$ be the number of individuals receiving gamete f from the father and gamete m from the mother at time t, and let $X(t) = (X_{11}(t), \ldots, X_{44}(t))$ be the vector of such counts. The probability distribution of the counts $X(t)$ given the gametic combination counts at time $t - 1$, $X(t - 1)$, is multinomial and is given by

$$\Pr(X(t) = x(t) \mid X(t-1)) = \frac{N(t)!}{\prod_{f,m} x_{fm}(t)!} \prod_{f,m} (e_f e_m)^{x_{fm}(t)},$$

where $N(t) = \sum_{f,m} X_{fm}(t)$, $f, m = 1, 2, 3, 4$ (WATTERSON 1970), and $e_f$ and $e_m$ are frequencies of allelic combinations in fathers and mothers. It is not hard to see that the cytonuclear genotypic frequencies are linear combinations of the gametic combination counts $X_{fm}$ (see, e.g., DATTA et al. 1996). Therefore, one can find the conditional moment generating function of $p_1, \ldots, p_6$ at time t given the frequencies at time $t - 1$, which in turn extends to the calculation of the moments of the cytonuclear disequilibria, $D_1$ and $D_2$. See DATTA et al. (1996) for details.
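One generation of drift followed by statistical sampling can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' program: the function names are hypothetical, gamete frequencies are taken to be the same in fathers and mothers, the cytoplasm is assumed to be inherited strictly from the mother, and each of the N offspring is formed by pairing one paternal and one maternal gamete drawn independently.

```python
import random

# Genotype order (Table 1): AA/M, Aa/M, aa/M, AA/m, Aa/m, aa/m.
# Gamete order used here:   A/M, A/m, a/M, a/m.


def gamete_frequencies(p):
    """Allelic combination frequencies implied by genotype frequencies (Table 2)."""
    p1, p2, p3, p4, p5, p6 = p
    return [p1 + p2 / 2, p4 + p5 / 2, p3 + p2 / 2, p6 + p5 / 2]


def next_generation(p, N, rng):
    """Draw N offspring by random union of gametes; return genotype frequencies."""
    e = gamete_frequencies(p)
    fathers = rng.choices(range(4), weights=e, k=N)
    mothers = rng.choices(range(4), weights=e, k=N)
    counts = [0] * 6
    for f, m in zip(fathers, mothers):
        n_A = (f < 2) + (m < 2)              # number of A alleles in the offspring
        cyto_offset = 0 if m % 2 == 0 else 3  # cytoplasm M or m comes from the mother
        counts[cyto_offset + (2 - n_A)] += 1  # AA -> 0, Aa -> 1, aa -> 2
    return [c / N for c in counts]


def sample_counts(p, n, rng):
    """Multinomial sample of n adults from a population with genotype frequencies p."""
    draws = rng.choices(range(6), weights=p, k=n)
    return [draws.count(j) for j in range(6)]


rng = random.Random(1996)
pop = [0.5, 0.0, 0.0, 0.0, 0.0, 0.5]     # initial frequencies p1 = p6 = 0.5
pop = next_generation(pop, N=500, rng=rng)
print(sample_counts(pop, n=50, rng=rng))
```

Iterating `next_generation` produces the genetic sampling variation, while `sample_counts` adds the statistical sampling variation on top of it.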
Statistical sampling variation: The estimated disequilibria based on the statistical sample from the population at time t are given by

$$\hat{D}_1(t) = \hat{p}_1(t) - \hat{u}(t)\hat{q}(t), \qquad \hat{D}_2(t) = \hat{p}_2(t) - \hat{v}(t)\hat{q}(t),$$

where

$$\hat{p}_j(t) = n_j(t)/n(t), \quad j = 1, \ldots, 6, \tag{7}$$

and $n_1(t), \ldots, n_6(t)$ are the genotypic counts in the sample from generation $t = 0, 1, \ldots$. Here the estimates are subject to multinomial sampling variation, conditional on the genotypic frequencies in Table 1 at time t generated by the drift process.

Total sampling variation: The total (statistical + genetic) variance and covariances of the cytonuclear disequilibria can now be calculated from the above model. These expectations are necessary to construct a neutrality test comparing observed and expected disequilibria. The (unconditional) variance of $\hat{D}_i(t)$, $i = 1, 2$, can be calculated by the formula

$$\mathrm{Var}(\hat{D}_i(t)) = \mathrm{Var}_G(E_S(\hat{D}_i(t))) + E_G(\mathrm{Var}_S(\hat{D}_i(t))), \tag{8}$$

where G and S stand for the genetic and the statistical sampling, respectively. Now ignoring smaller terms of the order $O(n^{-1}(t)N^{-1}(t))$ and $O(n^{-2}(t))$ (i.e., terms like $1/n(t)N(t)$ and $1/n^2(t)$), we can obtain an approximation for the first term in (8):

$$\mathrm{Var}_G(E_S(\hat{D}_i(t))) \approx \mathrm{Var}_G(D_i(t)), \tag{9}$$

which can be calculated from the formulae given in DATTA et al. (1996).
Note that $E_S(\hat{D}_i(t))$ is calculated by differentiating the moment generating function of a multinomial distribution with parameters $n(t)$ and $p_1(t), \ldots, p_6(t)$. Direct calculation using the moment generating function and ignoring the higher order terms shows that

$$E_G(\mathrm{Var}_S(\hat{D}_1(t))) = \frac{1}{n(t)}\{\cdots\}, \tag{10}$$

where $\{\cdots\}$ stands for a polynomial in $p_0$, $q_0$ and $d_{t-1}$, $p_0 = p(0)$ and $q_0 = q(0)$ are the initial gene frequencies, and $d_{t-1} = E_G(D(t-1))$. In deriving the above formula, $\mathrm{Var}_S(\hat{D}_1(t))$ is calculated by arguments similar to those for $E_S(\hat{D}_i(t))$ from the multinomial statistical sampling. For calculating the second term, $E_G(\mathrm{Var}_S(\hat{D}_i(t)))$, in (8), we used the δ-method and the fact that expectations of p and q remain constant over generations. Using a result of FU and ARNOLD (1992), for the random drift model we find expression (11), in which $N_j = N(j)$, $j \ge 1$. In the same way, one obtains the second term in (8) for $\hat{D}_2(t)$, again a polynomial expression in $p_0$, $q_0$ and $d_{t-1}$ (12).

For a complete understanding of the system we also need to calculate the total covariance between the sample disequilibria. Covariances can occur between disequilibria in the same generation or disequilibria in different generations. Both are now calculated. The total covariance between $\hat{D}_1(t)$ and $\hat{D}_2(t)$ in the same generation, $\mathrm{Cov}(\hat{D}_1(t), \hat{D}_2(t))$, is given by

$$\mathrm{Cov}_G(E_S(\hat{D}_1(t)), E_S(\hat{D}_2(t))) + E_G(\mathrm{Cov}_S(\hat{D}_1(t), \hat{D}_2(t))). \tag{13}$$

Using similar calculations as before and ignoring higher order terms, we find

$$\mathrm{Cov}(\hat{D}_1(t), \hat{D}_2(t)) \approx \mathrm{Cov}_G(D_1(t), D_2(t)) + \cdots, \tag{14}$$

where the first term $\mathrm{Cov}_G(D_1(t), D_2(t))$ can be calculated following DATTA et al. (1996) and the remaining terms, which arise from $E_G(\mathrm{Cov}_S(\hat{D}_1(t), \hat{D}_2(t)))$, form a polynomial in $p_0$, $q_0$ and $d_{t-1}$.

The covariances between the estimates based on samples from different generations are calculated next. For
$1 \le i, j \le 2$, $t \ge 0$, $s \ge 1$, given $\mathcal{F}_t$, the composition of the population at time t, $\hat{D}_i(t)$ and $\hat{D}_j(t+s)$ are independent. Also, up to higher order terms,

$$E(\hat{D}_i(t) \mid \mathcal{F}_t) = D_i(t) \tag{15}$$

and

$$E(\hat{D}_j(t+s) \mid \mathcal{F}_t) = E(E(\hat{D}_j(t+s) \mid \mathcal{F}_{t+s}) \mid \mathcal{F}_t) = E(D_j(t+s) \mid \mathcal{F}_t). \tag{16}$$

It can be shown using the one-step conditional moment generating function (cf. DATTA et al. 1996) that the above quantity equals $((N_{t+s} - 1)/N_{t+s})\,E(D(t+s-1)\,p(t+s-1) \mid \mathcal{F}_t)$ if $j = 1$ and $((N_{t+s} - 1)/N_{t+s})\,E(D(t+s-1)(1 - 2p(t+s-1)) \mid \mathcal{F}_t)$ if $j = 2$, which in turn, by the δ-method, is approximately equal to

$$\frac{N_{t+s} - 1}{N_{t+s}}\Bigl(\prod \cdots\Bigr)\, r_j\, D(t)$$

(this follows from the dynamics of D; see FU and ARNOLD 1992), where $r_j = p_0$ if $j = 1$ and $r_j = 1 - 2p_0$ if $j = 2$. Note that the above product needs to be interpreted as 1 for $s = 1$. Consequently one has, modulo higher order terms,

$$\mathrm{Cov}(\hat{D}_i(t), \hat{D}_j(t+s)) = \frac{N_{t+s} - 1}{N_{t+s}}\Bigl(\prod \cdots\Bigr)\, r_j\, \mathrm{Cov}(D_i(t), D(t)). \tag{17}$$

Once again, in the above equation $\mathrm{Cov}(D_i(t), D(t))$ can be found using the recursions of the first two moments of $D_i(t)$ in DATTA et al. (1996) and $D(t)$ in FU and ARNOLD (1992). Note that in obtaining the first term of the above expression we have used the independence of the samples at times t and $t + s$ given the population at time t.
Let the vector

$$Y = (\hat{D}_1(1), \ldots, \hat{D}_1(k), \hat{D}_2(1), \ldots, \hat{D}_2(k))' \tag{18}$$

be the trajectory of sample disequilibria, $\hat{D}_1$ and $\hat{D}_2$. Note that

$$E(\hat{D}_i(t)) = E_G(E_S(\hat{D}_i(t))) \approx E_G(D_i(t)), \tag{19}$$

ignoring higher order terms, which is equal to $a_t p_0 d_0$ if $i = 1$ and $a_t(1 - 2p_0)d_0$ if $i = 2$. Thus, the trajectory of the sample disequilibria Y is approximately normally distributed with mean vector $\mu = X\beta$ and variance-covariance matrix $\Sigma$, where $X = [X_1, X_2]$, $X_1 = (a_1, \ldots, a_k, 0, \ldots, 0)'$, $X_2 = (0, \ldots, 0, a_1, \ldots, a_k)'$, with $\beta = (\beta_1, \beta_2)'$, $\beta_1 = p_0 d_0$, $\beta_2 = (1 - 2p_0)d_0$.

To recapitulate, the trajectory of sample cytonuclear disequilibria is denoted by Y. The expectation of this trajectory is $\mu$. The estimates of the trajectory are approximately jointly normal, and the variance-covariance matrix of the sample cytonuclear disequilibria, $\hat{D}_1$ and $\hat{D}_2$, along the trajectory is $\Sigma$. The vector $\mu$ is determined by the initial state of the cage. The calculation of the variance-covariance matrix of the disequilibria, $\Sigma$, was explained in the earlier paragraphs. The normal approximation is justified when both $n(t)$ and $N(t)$ are large.
Neutrality test: With the mean vector $\mu$ and the variance-covariance matrix $\Sigma$ of the estimated trajectory for $\hat{D}_1$ and $\hat{D}_2$, we can now construct a neutrality test by comparing the inferred trajectory of cytonuclear disequilibria (Y) with its expectation over time ($\mu$). If the initial values $p_0$, $q_0$ and $d_0$ in the cage are known, one can use the following statistic to test the null hypothesis of a random drift model:

$$T = (Y - \mu)'\Sigma^{-1}(Y - \mu), \tag{20}$$

which will have an approximate chi-square distribution with $2k$ degrees of freedom. One would reject a random drift model if $T > \chi^2_\alpha(2k)$, where $\chi^2_\alpha(2k)$ is the upper αth quantile of a chi-square distribution with $2k$ degrees of freedom. However, often in practice, the composition of the initial population (summarized by $\beta$) is not known and only a simple random sample from it is available. In that case, we suggest (consistently) estimating the variance-covariance matrix $\Sigma$ by substituting the sample estimates $\hat{p}(0)$, $\hat{q}(0)$ and $\hat{D}(0)$ in place of $p_0$, $q_0$ and $d_0$, respectively, in the expression for $\Sigma$ from (9)-(14) and (17). Next, an overall estimate of $\beta$ is obtained by the method of weighted least squares: $\hat{\beta} = (X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}Y$. Finally, we could use the test statistic $T = (Y - X\hat{\beta})'\hat{\Sigma}^{-1}(Y - X\hat{\beta})$, which will have an approximately chi-square distribution with $2(k - 1)$ degrees of freedom under the random drift model.

This proposition is validated with a simulation study for $k = 2$. In the simulation we considered that the initial generation had the frequencies $p_1 = 0.5$ and $p_6 = 0.5$, and all the subsequent generations had constant generation size $N = 500$. In each simulation, random samples of constant sample size $n = 50$ are generated for $k + 1 = 3$ generations. The value of the test statistic is calculated for each simulation, and the entire process is repeated 5000 times. Consequently, we have 5000 independent realizations of the test statistic. A histogram was drawn with these values, which is in good agreement with an overlaid true chi-square distribution with 2 degrees of freedom (Figure 2). The agreement is emphasized with a Q-Q plot also.

Note that for this neutrality test it is not necessary to have sample data from all consecutive generations. It is easy to adjust and reinterpret the test statistic in such cases. Software for carrying out this neutrality test is available from the author (datta@fox.sph.emory.edu).
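The weighted least squares step and the resulting statistic can be sketched in a few lines of standard-library Python. This is not the authors' software; the function names are hypothetical, the trajectory `y`, design matrix `X` and covariance matrix `S` are assumed to be supplied by the user (the covariance matrix would come from the variance and covariance formulae of the previous section), and the chi-square tail probability uses the closed form that holds for an even number of degrees of freedom, which covers both $2k$ and $2(k - 1)$.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-300:
            raise ValueError("singular matrix")
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - rest) / M[i][i]
    return x


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def chi2_sf_even(x, df):
    """Upper tail probability of a chi-square variable with even df."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= (x / 2) / j
        total += term
    return math.exp(-x / 2) * total


def gls_neutrality_test(y, X, S):
    """Return (beta_hat, T, df, p_value) for the trajectory y."""
    m = len(X[0])
    cols = [[row[j] for row in X] for j in range(m)]
    S_inv_cols = [solve(S, c) for c in cols]
    S_inv_y = solve(S, y)
    A = [[dot(cols[i], S_inv_cols[j]) for j in range(m)] for i in range(m)]
    b = [dot(cols[i], S_inv_y) for i in range(m)]
    beta = solve(A, b)                                   # weighted least squares estimate
    resid = [yi - dot(row, beta) for yi, row in zip(y, X)]
    T = dot(resid, solve(S, resid))                      # quadratic form in the residuals
    df = len(y) - m                                      # 2k - 2 = 2(k - 1)
    return beta, T, df, chi2_sf_even(T, df)


# Toy illustration with k = 2, made-up numbers and an identity covariance matrix.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
S = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(gls_neutrality_test([1, 3, 2, 6], X, S))
```

When the initial values are known, the same quadratic form can be evaluated with the residuals taken about the known mean vector and compared with a chi-square distribution on $2k$ degrees of freedom.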
Instead of using the chi-square percentiles in constructing the approximate rejection region of the test, one may resort to Monte Carlo simulation to find the exact P value given a sample. This may be preferable if the sample sizes are small and one is afraid that the large sample distribution may not be adequate. Since the P value calculation using present day computing is not expensive in terms of computing time, we anticipate that this approach may be preferred by many users.

FIGURE 2.-Histogram and chi-squared Q-Q plot based on 5000 Monte-Carlo replications of the test statistic T. The overlaid graph in solid line is the density of chi-squared distribution with 2 degrees of freedom.
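A Monte Carlo P value of this kind can be sketched as follows. The function names are hypothetical, and the estimator shown, (number of simulated statistics at least as large as the observed one, plus one) divided by (number of replicates plus one), is one common convention rather than necessarily the one used in the authors' program. The null simulator here is only a stand-in that draws from a chi-square distribution with 2 degrees of freedom; in practice it would simulate a full cage trajectory under drift and recompute the test statistic.

```python
import random


def monte_carlo_p_value(t_obs, simulate_null_statistic, n_rep=5000, seed=0):
    """Estimate P(T >= t_obs) under the null by simulation.

    simulate_null_statistic(rng) must return one draw of the test
    statistic generated under the random drift model.
    """
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(n_rep) if simulate_null_statistic(rng) >= t_obs)
    return (exceed + 1) / (n_rep + 1)


def chi2_2df_stand_in(rng):
    # A chi-square variable with 2 d.f. is exponential with mean 2.
    return rng.expovariate(0.5)


p_value = monte_carlo_p_value(5.99, chi2_2df_stand_in)
print(p_value)
```

Adding one to both numerator and denominator keeps the estimated P value strictly positive, which avoids reporting a value of exactly zero from a finite number of replicates.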
If, in practice, the entire history of the generation sizes is not known, then one may use simple interpolation. We found that the test statistic is not too sensitive to misspecification of the generation sizes.
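As an illustration of filling in unknown generation sizes, the sketch below uses linear interpolation between the nearest known generations. Linear interpolation is an assumption made here (the text says only "simple interpolation"), and the function name is hypothetical.

```python
def interpolate_generation_sizes(sizes):
    """Fill None entries by linear interpolation between known neighbours.

    sizes is a list indexed by generation; the first and last entries
    must be known.
    """
    known = [t for t, n in enumerate(sizes) if n is not None]
    if not known or known[0] != 0 or known[-1] != len(sizes) - 1:
        raise ValueError("first and last generation sizes must be known")
    filled = list(sizes)
    for a, b in zip(known, known[1:]):
        for t in range(a + 1, b):
            # Straight line from (a, sizes[a]) to (b, sizes[b]).
            filled[t] = sizes[a] + (sizes[b] - sizes[a]) * (t - a) / (b - a)
    return filled


print(interpolate_generation_sizes([500, None, 300]))
```

Interpolated sizes need not be integers; whether to round them is left to the user.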
RESULTS

An example: To illustrate our test procedure, we now consider a real cage experiment constituting artificial hybrid zone data on Drosophila melanogaster conducted by M. KIPARSKY. The experiment uses controlled environmental conditions to study the effects of cytonuclear interaction in a population genetic context using genetically manipulated strains of D. melanogaster. The hybrids were formed with crosses of initial stocks of flies collected in Ega, Denmark and Death Valley, CA. Cages were maintained in a cycle of discrete generations. In each generation, flies were allowed to mate, and the adult flies were taken out and kept frozen for genotyping by PCR analysis. Eggs were allowed to form the next generation. A random sample of adult flies was taken from the frozen adult population and, using PCR-based techniques, the cytonuclear genotypes were observed at two nuclear loci (60A and DPP) and at a cytoplasmic locus simultaneously. The experiment was continued for four generations. Data were not initially collected on the third generation except for the population size. For carrying out our test procedure we have ignored the data on the initial generation, as the experiment started with a purely heterozygous population in which all disequilibrium measures are zero. Thus, in
of 8