Copyright

zyxwvutsrqpon

1996 by

the

Genetics

Society

zyxwvutsrqpo

f

zyxwvutsr

merica

A

Statistical Test of

a

Neutral Model

zy

sing

the Dynamics

of

Cytonuclear Disequilibria

Susmita

Datta,*

Mike Kipar~ky,~

avid

M.

Randt

and

onathan

Arnoldf

zy

Department

of

Biostatistics, Rollins School

of

Public Health, Emory University, Atlanta, Geor ‘a 30322, Department

z

f

Ecology and Evolutionary Biology, Brown University, Providence,

mode

Island

02912

and

gt

epartment of Genetics, University

of

Georgia, Athens, Georgia

30602

Manuscript received February 26, 1996 Accepted for publication September 11, 1996 ABSTRACT In this paper

we

use

cytonuclear disequilibria

to

test

the neutrality of

mtDNA

markers. The data considered here involve sample frequencies

f

cytonuclear genotypes subject

o

both statistical sampling variation

as

well

as

genetic sampling variation.

First,

weobtain the dynamics of the sample cytonuclear disequilibria assuming random drift alone

s

the source

of

genetic sampling variation. Next, wedevelop

a

test statistic using cytonuclear disequilibria via the theory of generalized least squares

to test

the random drift model. The null distribution

f

the test statistic

is

shown to be approximately chi-squared using

an

asymptotic argument as well

as

computer simulation. Power of the

test

statistic

is

investigated under an alternative model with drift and selection. The method

is

illustrated using data

from

cage experiments utilizing different cytonuclear genotypes of

Drosophila melunogaster.

A

program for imple- menting the neutrality test

is

available upon request.

C

TONUCLEAR disequilibrium measures provide new inferential tools in analyzing hybrid zone data. In recent years, there has been increasing atten- tion on studying the association

or

interaction between a nuclear gene

or

genotype and maternally inherited cytoplasmic components such

as

mitochondria see

LAMB

and AVISE 1986;

ASMUSSEN

et al.

1987;

ARNOLD

1993). Questions have arisen concerning whether or not these cytonuclear associations can

be

explained without invoking natural selection. Since a significant amount of work in the last decade is based on neutrality of mitochondrial DNA (mtDNA) markers, a number of researchers have recently de- signed experiments o test the neutrality of mtDNA markers (CLARK nd LYCKEGAARD 988;

MACRAE

and

ANDERSON

1988;

FOS

et al.

1990; POLLAK 991; SCRIBNER and AVISE 1994a,b;

S.

T.

KILPATRICK

and

D.

M.

RAND

unpublished results; HUTTER nd

RAND

1995).

We

have taken a different approach to test the neutrality of a mtDNA marker with measures of association of mtDNA markers with nuclear genes

or

genotypes termed

qyte nuclear disequilibria.

Stochastic behavior of allelic disequilibria was ad- dressed by FU and

ARNOLD

(1992). The dynamics of cytonuclear genotypic disequilibria over different gen- erations have been described by DATTA

t

al.

(1996). Stochastic trajectories of cytonuclear disequilibria cal- culated in

Fu

and

ARNOLD

(1992) and DATTA

t

al.

(1996) were used to test the neutrality of mitochondrial

Cmesponding

author:

Jonathan Arnold, Department

of

Genetics, University

of

Georgia, Athens,

GA

30602.

E-mail: amold@bscr.uga.edu

Genetics

144:

1985-1992

(December,

1996)

DNA markers in a vertebrate age experiment involving two species of mosquito fish (SCRIBNER nd AVISE 994). In this experiment an artificial hybrid zone composed of

two

competing species of mosquito fish that inter- breed was established. Frequencies of cytonuclear geno- types and associated cytonuclear disequilibria were monitored over time and compared with their expecta- tions under drift. In such an xperiment here re two potential sources of variation in cytonuclear frequencies,

statisti- cal sampling variation

and

genetic sampling ariation

(WEIR,

1990). Statistical sampling variation arises from sampling individuals from a population to estimate fre- quencies of cytonuclear genotypes

(ASMUSSEN

et

aZ.,

1987). Genetic sampling variation arises from genetic drift, the sampling of gametes from a finite breeding pool of individuals in nature o constitute the next gen- eration

1996). In the experiment of SCRIBNER nd AVISE 1994) in every generation, all individuals from an entire population were sampled.

As

a consequence statistical sampling variation was eliminated, and only genetic sampling variation

(i.e.,

drift) remained. This unusual design to a cage experiment permitted

DATTA

and

ARNOLD

(1996) to develop a simple neutrality test of a mtDNA marker using the observed trajectories of cytonuclear disequilibria. It is the purpose

of

this report to extend the domain of applicability of our neutrality test of an mtDNA marker to include cage experiments with both statistical and genetic sampling variation in cytonuclear disequilibria. In Figure

1,

we present a schematic diagram for the design of a more general sampling scheme in a cage

S.

zyxwvu

atta

zyxwvu

t

al.

986

Generation

0

Generation

1

Generation

2

zyxwvutsr

b

b b

experiment

Initial Population

zyxwvut

zyxwvutsrqpon

.

Population

j

Sample

. __ ____ _.

Population

i

Sample Population

FIGURE

1.-Format

of

experiment.

than in

DATTA

and

ARNOLD

(1996). We note that a statistical test of neutrality of a single genetic locus was carried out by

SCHAFFER

t

zyxwvutsrq

l.

(1977) under a similar sampling scheme. We follow a similar ap- proach in testing whether or not the dynamics (over time) of the observed cytonuclear disequilibria are con- sistent with those expected under a random drift model. Tests of hypotheses about cytonuclear disequi- libria in the absence of genetic sampling variation have been considered by

ASMUSSEN

et

al.

(1987), FU and

ARNOLD

(1992), and

ASMUSSEN

and

BASTEN

(1994). The focus of this article is on a test of

fit

to a hypothe- sized drift process using sample estimates

of

cytonuclear disequilibria over time.

THE METHOD

Cytonuclear disequilibria:

Let us suppose we observe a hybrid population at a nuclear locus with

two

alleles

A

and

a,

and at a cytoplasmic locus with alleles

M

and

m

simultaneously as in

SCRIBNER

nd

AVISE

(1994a). The frequencies of the respective cytonuclear genotypic classes are denoted by

PI

. .

.

s,

respectively. For exam- ple,

pl

is the frequency of the class

AA/M

(see Table

1).

Note that the same population can be represented in terms of the frequencies f their allelic combinations

(A/M, A/m,

a/M,

a/m)

in Table

2.

TABLE

I

Frequencies

of

cytonuclear genotypes

Nuclear genotype Cytoplasm

AA

Aa

aa

Total

M

PI

P2

ps

zyxwvutsrqp

m

4

P5

ps

1-4

Total

U

U

zyxwvutsrq

1

zyxwvu

TABLE

2

Frequencies

of

allelic combinations

Nuclear allele Cytoplasm

A

U

Total

M

el

=

pl

+

p/2

e3

=

p,

+

p/2

Q

m

Q

=

p4

+

p/2

e4

=

ps

+

p/2

1

q

Total

P

1-P

1

Cytonuclear disequilibrium is defined to be the asso- ciation of nuclear genotypes or alleles with cytoplasmic alleles

(ASMUSSEN

et

al.

1987). The following are he cytonuclear disequilibria corresponding to the omozy- gote

AA/M

and the heterozygote

Aa/M,

respectively:

4

P,

Uq,

(1)

4

=

p,

vq,

(2)

where

u

=

P,

+

p4

and

v

=

p,

ps.

Similarly, the allelic disequilibrium is given by

D

=

e1

pq,

3)

where

p

=

el

+

,

and

q

=

el

e3,

and the

e,'s

are the frequencies of the allelic combinations (Table

3).

For detailed definitions and their moment dynamics we re- fer to

DATTA

t

al.

(1996). All the above quantities can be defined at the sample level,

as

well. We will supplement all of these variables with carets to denote their sample counterpart. For ex- ample,

l

=

il

ia

denotes the sample disequilibrium corresponding to

AA/M.

A

statistical test based on disequilibrium dynamics:

From now on, we will consider a cage experiment with the sampling scheme in Figure 1. The experiment

starts

with an initial base population with nonzero allelic dis- equilibrium

D

=

4.

enerations are discrete and non- overlapping, and only a fixed number of randomly se- lected offspring of the previous generation re introduced into the cage, leading to genetic drift

(Le.,

genetic sampling variation). The offspring are allowed to mature, mate, and have offspring, and are removed from the cage to make room for the next generation. From the adults of generation

t

removed from the cage, a total

of

a

)

adults are randomly sampled to estimate frequencies of cytonuclear genotypes in Table

1

(see Figure

1 ,

eading to statistical sampling variation. Con-

TABLE

3

Gametic (allelic) disequilibrium

Nuclear allele

Cytoplasm

A

a

Total

M

e,

=

pq

D

es

=

(1

p)q

D

Q

m

a=p(l-q)-D

~=(l-p)(l-q)+D 1-q

Total

P

1-P

1

Disequilibrium

zyxwv

est

1987

sequently, the sample genotypic counts are multinomial with parameters equal to the ytonuclear genotypic fre- quencies

zyxwvutsrq

l

(t),

zyxwvutsr

. .

,

p6(t)

of the tth generation of the population. n each generation these genotypic fre- quencies are he realization of a drift process. The model used here to describe the genetic sampling varia- tion is termed he random union of zygotes (RUZ)
model

(WATTERSON

970, 1972). Genetic sampling variation:

Let

X@(

t)

be the number of individuals receiving gamete

zyxwvu

rom the father and gamete

m

from the mother at time

t,

and let

x(t)

=

(X11(

t),

.

.

.

,

zyxwv

4(

t))

be the vector of such counts. The probability distribution of the counts

X(t)

given the gametic combination counts at time

t

1,

X(t

1)

is multinomial and is given by Pr(X(t)

=

zyxwvuts

0

1x0

1))

where

N(

t)

=

XJm

@

t),

m

=

1, 2, 3, 4

(WATTEFSON,

1970)

and efand

e,

are frequencies of allelic combina- tions in fathers and mothers. It is not hard to see that the cytonuclear genotypic frequencies are linear ombi- nations of the gametic combination counts

Xfm

(see

e.g., DATTA

et

al.

1996).

Therefore, one can find the condi- tional moment generating function f

pl,

.

. .

,

&

at time

t

given the frequencies at time

t

1,

which in turn extends to the calculation of the moments of the cyto- nuclear disequilibria,

Dl

and

4.

See DATTA

t

al.

(1996)

for details.

Statistical

sampling variation:

The estimated disequi- libria based on the statistical sample from the popula- tion at time

t

are given by

B 4

=

1;dS

+

=

nj(t)/n(t),

zyxwvut

=

1,

. . .

,

6,

(7)

and

nl

(t)

zyx

)

are the genotypic counts in the sample from generation

t

=

0,

1,

*

.

Here the estimates are subject to multinomial sam- pling variation, conditional on the genotypic frequen- cies in Table

1,

at time

t

generated by the drift process. Total

sampling variation:

The total (statistical ge- netic) variance and covariances of the cytonuclear dis- equilibria can now be calculated from the above model. These expectations are necessary to construct a neutral-

ity

test comparing observed and expected disequilibria. The (unconditional) variance of

&t),

i

=

1,

2,

can be calculated by the formula where and

5

stand for the genetic and the statistical sampling, respectively. Now ignoring smaller terms

of

the order

O(n (t)Ar'(t))

and

O(n-'(t)), (i.e.,

terms like

l/n(t)N(t)

nd

l/n2(t)),

we can obtain an approxi- mation for the first term in

(8):

Varf,(E5(Dj(t))) Var,;(D,(t)),

(9)

which can be calculated from the formulae given in

DATTA

t

al.

(1996).

Note that

E,@,(

t))

s calculated by differentiating the moment generating function of a multinomial distribution with parameters

n(t)

and

pl(

),

. .

.

,

p6(

).

Direct calculation using the moment generating unction and gnoring he higher order
terms shows that E4Var5(fi1(t)))

=

-(p

+

2dl-1$-$q0

40

1

n(

)

+

d:-1$

dt-lh)

+

pO(-d;-,&

+

2d:-1g

dl-1

+

2dt-1&0

+

o

3d?-1&0

+

4dt-i&qo

&qo

+

Po&

2$& 3dt-l&&

+

2&&

&&

2d:-1&

+

dt-lf$qo

+

dt-1poqo

+

zy

dt-1&70

)I,

(10)

where

po

=

p(0)

qo

=

q 0)

are the initial gene frequencies and

dl-]

=

Ec,(D(t

1)).

In deriving the above formula, Var,(G

(t)

)

is calculated by arguments similar to those for

E$(&

t))

from the multinomial statistical sampling. For calculating the second term, E,(Var,(a(t))), n

(8),

we used the 6-method and the fact that expectations

of

p

and

q

remain constant over generations. Using a result of

Fu

and

ARNOLD

(1992),

for the random drift model we find where

Nj

=

N i),

1.

In the same way, one obtains the second term in

(8)

1988

zyxwvutsrq

.

Datta

zyxwvu

t

al.

zyxwv

48dt-,&&

+

zyxwvu

66

+

24dt-1p&$ 16&$ 56d,-16$

+

32&$

+

32d,-1&$ 1664

+

8&&

16&&

+

8&$)}. (12) For a complete understanding of the system we also need to calculate the total covariance between the sam- ple disequilibria as well. Covariances can occur between disequilibria in the same generation or disequilibria in different generations. Both are now calculated. The to- tal covariance between

Bl

(t)

and

zyxwvut

(t)

(Cov(Bl(

),

I (

t)

)

in the same generation is given by cov~j(~~(Bl(t)),

\(&(O))

+

ECj(Covs(B1(t),

(t))).

(13)

Using similar calculations as before and ignoring higher order terms, we find cov(~l(t),&(t)) COV,,(Dl(t),

t))

1

+

pO(8d;-lP, l6d;-,& 24d;-,& 32d;-,&

+

l6d;-& l6d;-& 3d,-,pOqO 8d;-lpOqo

+

2&qo 4d?-l&qo

+

16d:-1&q0

+

64d,3_1&q0

zyx

640

8Od:-&o 120d:-l&qo

+

112d:-,&qO

+

64d;-l&qo 48d:-l&O 4&& 32&& 32d:-1&&

+

4&$

+

32d,-,&$

+

152d,'-1&$

4

)

112d,-,&& 216d:-1p&$

+

128dt-,&& 96d:-1&& 48dt-1&& 40d,-,p&$ 16&$

+

144dt-,&$ 4864 l68dt-,g$

+

48&$

+

64d,-lP:$

16&$

16&&

+

48&& 4863

+

16p$$)}, (14) where the first term Cov,(D,(t),

4

))

can be calculated following

DATTA

t

al.

(1996). The covariance between the estimates based on sam- ples from different generations are calculated next. For

1

zyxwvut

j

5

2,

t

0,

zyxwvut

1,

given

5,

the composition of the population at time

t,

Bi(t)

and

Bj(t

s

are independent.

Also

up to higher order terms,

E(Bi(

)

Tt

=

Di(

)

(15)

and

E(B;(t

s)

1s

=

E(E(Bj(t

+

s)

Yt+J

3,)

=

E(D,(t

+

s)

1s .

(16) It can be shown using the one-step conditional moment generating function

(

cJ:

((N,+,r

l)/N,+,)E(D(t

s

l)p(t

+

s

1)(7,),

fj

=

1

and

((N,+,

1)/

N,+,)E D t

s

1)(1 2p(t

+

s

1))17,), fj

=

2, which in turn, by the &method, s approximately equal to (follows from the dynamics of

D,

ee

ARNOLD

and Fu 1992), where

rp

=

p,,,

if

j

=

1

and

=

1

2p0

if

j

=

2. Note that the above product needs to be interpreted as

1

for

s

=

1. Consequently one has, modulo higher order terms, C0V(Bi(t),

Bj(

+

s

=

(N+s

1)

N+

zy

(17) Once again, in the above equation Cov(Di(

),

D

))

can be found using the recursions of the first two moments of

Di(

)

in

DATTA

t

al.

(1996) and

D(t)

in Fu and

AR-

NOLD

(1992). Note that in obtaining the first term of the above expression we have used the independence of the samples at times

t

and

t

+

s given the population at time

t.

Let the vector

Y

=

(B1(1),

.

,

Bl(4,

&(I),

.

.

.

,

&(k))

(18)

be the trajectory of sample disequilibria,

fil

and

I .

Note that

E(Bi(t))

=

Etr(E$(Bi(t)))

S(Di(t)), (19) ignoring higher order terms, which is equal to

a,&&

if

i

=

1

and a,(l 2pO)& if

i

=

2. Thus, the trajectory of the sample disequilibria

Y

s approximately normally distributed with mean vector

p

=

Xp

and variance- covariance matrix

E,

where

X

=

[X, X,],

X1

=

(

al,

. . .

,

~k,

,

.

. .

,

O) ,

X2

=

(0,

. . .

,

0,

ai,

. . .

,

Q)',

with

P

=

PI, Pd,

I

=

A,

PZ

=

(1

2po)&. To recapitu- late, the trajectory of sample cytonuclear disequilibria is denoted by

Y.

The expectation of this trajectory is

p.

The estimates of the trajectory are approximatelyjointly normal, and the variancecovariance matrix of the sam- ple cytonuclear disequilibria,

Bl

and

4,

along the tra- jectory is

X.

The vector

p

is determined by the initial state of the cage. The calculation

of

the variance-covari- ance matrix of the disequilibria

X

was explained in

Disequilibrium Test

zyxwv

989

the earlier paragraphs. The normal approximation is just-ified when both

zyxwvutsr

(

zyxwvut

)

and

zyxwvuts

(

)

are large.

ivPutmlit?l

test:

With the mean vector

p

and the vari- ance-covariance matrix

C

of the estimated trajectory for

Bl

and

zyxwvut

2,

we can now construct

a

neutrality test by comparing the inferred trajectory of cytonuclear dis- equilibria

zyxwvut

Y)

ith

its

expectation over time

p).

If

the initial values

fi),

qo

and

Do

n the cage are known, one can use the following statistics to test the null hy- pothesis of

a

random drift model:

T=

Y

zyxwvut

P)~Z”(Y

p),

20)

which

will

have an approximate chi-square distribution with

2k

degrees of freedom. One would reject

a

random drift model

if

T

>

~:(2k),

here

~2(2k)

s the upper ath quantile of a chi-square distribution with

2k

degrees of freedom. However, often in practice, the composi- tion of the initial population (summarized by

p

is not known and only a simple random sample from it is available. In that case, we suggest (consistently) estimat- ing the variance covariance matrix

C

by substituting in the sample estimates

fi(O),

q 0)

nd

B(0)

in places of

b,

o

and

do,

respectively, in the expression for

E

rom

(9)

14)

and

(17).

Next, an overall estimate of

p

is obtained by the method of weighted least squares:

fi

=

(xrg-

X)

1xtg 1y.

Finally, we could use the test statistic

T

=

Y

@ r

e-’

Y

@),

which

will

have an approximately chi- square distribution with

2(k

1)

freedom under the random drift model. This proposition is validated with a simulation study for

k

=

2.

In the simulation

we

con- sidered that the initial generation had the frequencies

p,

=

0.5

and

p6

=

0.5,

and

all

the subsequent genera- tions had constant generation size

N

=

500.

In each simulation, random samples of constant sample size

n

=

50

are generated for

+

1

=

3

generations. The value of the test statistic is calculated for each simulation, nd the ntire process is repeated

5000

times. Conse- quently,

we

have

5000

independent realizations of the test statistic.

A

histogram

wa5

drawn with these values, which is in good agreement with an overlaid true chi- square distribution with

2

degrees of freedom (Figure

2).

The agreement is emphasized with a Q

Q

plot

also.

Note that for this neutrality test

it

is not necessary

to

have sample data from all consecutive generations. It is easy to adjust and reinterpret the est statistic in such cases. Software for carrying out this neutrality test is available from the author

(datta@fox.sph.emory.edu).

Instead of using the chi-square percentiles in con- structing the approximate rejection region of the test, one may resort to Monte Carlo simulation to find the exact Pvalue given a sample. This may be preferable

if

the sample sizes are small and one is afraid that the large sample distribution may not be adequate. Since

0 5

zyxwv

0

1s

20

TS

w

0

S

io

5

20

Qventlles

01

Chi Squmre disWbullon with

2

d.f.

FIGURE

.-Histogram and chisquared

Q

Q

plot based on

5000

Monte-Carlo replications of the test statistic

T.

The overlaid graph in solid line is the density of chi-squared distri- bution with 2 degrees

of

freedom.

the Pvalue calculation using present day computing is not expensive in terms of computing time, we anticipate that this approach may be preferred by many users.

If,

in practice, the entire history of the generation sizes is not known, then one may use simple interpola- tion. We found that the test statistic is not too sensitive with respective to misspecification of the generation sizes.

RESULTS

An

example:

To illustrate our test procedure, we now consider

a

real cage experiment constituting artificial hybrid zone data on

Drosophila

melanogmter

conducted by

M.

PARSKY.

The experiment uses controlled envi- ronmental conditions o study the effects of ytonuclear interaction in a population genetic context sing genet- ically manipulated strains of

D.

melanogmtpr.

The hy- brids were formed with the crosses of initial stocks of flies collected in Ega, Denmark and Death Valley,

CA.

Cages were maintained in

a

cycle of discrete genera- tions. In each generation, flies were allowed to mate, and the adult flies were taken out and kept frozen for genotyping by PCR analysis. Eggs were allowed o form the next generation.

A

random sample of adult flies were taken from the frozen adult population and using PCR-based techniques, the cytonuclear genotypes were observed at two nuclear loci

(60A

and DPP) and at a cytoplasmic locus simultaneously. The experiment was continued for four generations. Data were not initially collected on the third eneration except for the opula- tion size. For carrying out our test procedure we have ignored the data on their initial generation as the ex- periment started with a purely hetero7ygous population in which

all

disequilibrium measures are zero. Thus, in