Lutz et al. BMC Genetics 2013, 14:13
http://www.biomedcentral.com/1471-2156/14/13
METHODOLOGY ARTICLE Open Access
A general semi-parametric approach to the analysis of genetic association studies in population-based designs
Sharon Lutz (1,3,4*), Wai-Ki Yip (3,4), John Hokanson (2), Nan Laird (3,4) and Christoph Lange (3,4,5,6)
Abstract
Background: For genetic association studies in designs of unrelated individuals, current statistical methodology typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the specification of such model assumptions is not straightforward and is error-prone, potentially causing misleading results.
Results: In this paper, we propose an alternative approach that treats the genotype as the random variable and conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease (COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, that is correlated with COPD affection status. The software package that implements this method is available.
Conclusions: The flexibility of this approach enables the straightforward application to quantitative phenotypes and binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, rare-variant analysis, and others.
Keywords: Genetic association studies, Secondary phenotypes, Case-control, Ascertainment, Semi-parametric
Background
In genetic association studies, individuals are often recruited based on case-control ascertainment conditions of the primary phenotype [1]. For the analysis of secondary phenotypes, this recruitment scheme can become problematic. If the secondary phenotype is correlated with the primary phenotype in a case-control study, the distribution of the secondary phenotype can be fundamentally different from the general population. For example, in a genetic association study of COPD in which all cases have COPD and control subjects have normal pulmonary function, the distribution of quantitative lung phenotypes can deviate substantially from their distribution in the general population. For samples that are ascertained in this fashion, standard statistical methods may lead to misleading results or may lack statistical power to identify true genotype-phenotype associations. There are several methods to accurately estimate the odds ratio of genetic variants for binary secondary phenotypes associated with case-control status [2-10], but these methods cannot easily accommodate continuous secondary phenotypes. For the special case that the secondary phenotype is normally distributed or binary, Lin & Zeng (2009) proposed an adjusted score test that incorporates genetic associations with affection status into the test statistic [11].

*Correspondence: sharon.lutz@ucdenver.edu
1 Department of Biostatistics, University of Colorado Anschutz Medical Campus, Aurora, USA
3 Department of Biostatistics, Harvard School of Public Health, Boston, USA
Full list of author information is available at the end of the article
© 2013 Lutz et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We present a more general approach that does not require any distribution assumptions for the secondary phenotype. We refer to the approach as the non-parametric population-based association test (NPBAT). The approach has a form similar to the Family Based Association Test (FBAT), a non-parametric test statistic that is frequently used in the family based setting [12-15]. The flexibility of our approach allows us to construct a genetic association test for standard and complex phenotypes that is non-parametric with respect to the phenotype. The class of tests is very general. It includes most standard association tests and can be applied to multivariate traits and phenotypes, multiple genetic markers, and case-control studies where phenotypic information is available for the cases but correlated with the case-control status [16-18].

The general concept of the proposed association-testing framework is to condition on the phenotype of interest and treat only the genetic data as random [12,13,15]. By assuming that the phenotype data is deterministic, the validity of the approach does not depend on the correctness of the phenotypic assumptions. Nevertheless, the power of the approach can be increased by incorporating a plausible model for the phenotype into the test statistic. Based on theoretical considerations and on simulation studies, we show that the new approach is robust against misspecification of phenotype assumptions. At the same time, this approach achieves the same power level as standard genetic association tests for population-based designs when the phenotype of interest has a normal distribution or is dichotomous. For studies where a quantitative trait is correlated with case-control status, our simulation studies examine the power and significance levels for the proposed approach, which does not require any adjustment for the ascertainment conditions.

We illustrate the practical advantages of NPBAT by an application to the COPDGene study. The COPDGene study is a case-control study of the genetics of COPD in current or former smokers with at least 10 pack-years of smoking history [19]. We test the genetic association of single nucleotide polymorphisms (SNPs) in the CHRNA3/5 region and the Fagerstrom Nicotine Dependence score (FNDS). FNDS is a validated instrument of nicotine dependence in current smokers and was measured in the current smokers, but not former smokers, in the COPDGene study. NPBAT, which uses the genotype data in both current and former smokers, is compared to the published genetic association of SNPs in the CHRNA3/5 region and FNDS that was performed in current smokers only [20].
Methods
In a genetic association study, $n$ unrelated study subjects have been recruited based on a predefined ascertainment condition. Let $X_i$ denote the genotype of individual $i$. The specific value of $X_i$ will depend upon the genetic model under consideration. For instance, for an additive model, $X_i = 0, 1, 2$ for 0, 1, 2 disease alleles, respectively. $X_i$ may also be a vector in order to test several alleles simultaneously. Let $T_i$ denote the numerical trait information for individual $i$. For example, $T_i$ could equal one for affected individuals and zero for unaffected individuals. Different coding functions are applied depending on the phenotype of interest. For binary and continuous traits, we will discuss efficient coding schemes below. First, we define a general class of test statistics as

$$S = \sum_{i=1}^{n} (X_i - E_x)\, T_i \qquad (1)$$

Note that $E(S) = 0$ under the null hypothesis of no association between the genotype $X$ and the phenotype $Y$. Constructing a conditional score test in which the genotype $X_i$ is the dependent variable and we condition upon the numerical trait information $T_i$, the NPBAT statistic has the following form:
$$\mathrm{Stat}_{\mathrm{NPBAT}} = \frac{S - E[S]}{\sqrt{\mathrm{var}(S)}} = \frac{\sum_{i=1}^{n} (X_i - E_x)\, T_i}{\sqrt{\sum_{i=1}^{n} T_i^2 \; \dfrac{\sum_{i=1}^{n} (X_i - E_x)^2}{n-1}}} \qquad (2)$$

where $E_x$ denotes the expectation of the marker score/genotype $X$ under the null hypothesis of no genetic association between the phenotype and the marker locus. $E_x$ can be estimated based on the sample mean of the genotypes. The asymptotic distribution of the NPBAT statistic under the null hypothesis depends on the estimation of $E_x$ and on the specification of the trait information $T_i$, and is derived in the Appendix.

There are various ways to code the phenotype of interest and define the coding function $T_i$. For the analysis of affection status, one could specify the coding function to be $T_i = 1$ or $T_i = 0$, depending on the disease status of the proband. However, as we show in Appendix A, a more efficient way is to set $T_i = 1 - \frac{\#\text{cases}}{n}$ for the cases and $T_i = 0 - \frac{\#\text{cases}}{n}$ for the controls. Then the NPBAT statistic is approximately the same as the Cochran-Armitage trend test.

If the phenotype $Y_i$ is in fact normally distributed and $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values of regressing the phenotype $Y$ on any covariates, then the NPBAT statistic is approximately the same as a t-statistic from a linear regression. In general, if the phenotype $Y_i$ is a continuous phenotype, we recommend $T_i = Y_i - \mu_y$, where $\mu_y$ is the phenotypic mean in the general population.
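As an illustration of equations (1) and (2), the following minimal Python sketch computes the statistic with the centered binary coding described above. It is not the authors' NPBAT software; the function names `npbat_statistic` and `code_binary_trait` and the toy data are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def npbat_statistic(x, t, e_x=None):
    """Evaluate equation (2) for genotypes x and coded traits t.

    If e_x is not given, E_x is estimated by the sample mean of x
    (one of the choices mentioned in the text).
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    n = x.shape[0]
    if e_x is None:
        e_x = x.mean()
    centered = x - e_x
    s = np.sum(centered * t)                                   # equation (1)
    var_s = np.sum(t ** 2) * np.sum(centered ** 2) / (n - 1)   # denominator of (2), squared
    return s / np.sqrt(var_s)

def code_binary_trait(affected):
    """T_i = 1 - #cases/n for cases and 0 - #cases/n for controls."""
    affected = np.asarray(affected, dtype=float)
    return affected - affected.mean()

# Toy data (illustrative only): additive genotype codes and case/control status.
genotypes = np.array([0, 1, 2, 1, 0, 2, 1, 0])
affected = np.array([1, 1, 1, 1, 0, 0, 0, 0])
stat = npbat_statistic(genotypes, code_binary_trait(affected))
print(stat)
```

With this centered coding and $E_x$ set to the overall sample mean, the statistic equals $\sqrt{n-1}$ times the sample correlation between genotype and affection status, which is one way to see its closeness to a trend test.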
While it is appealing that the NPBAT statistic is comparable to standard methods in these simple scenarios, the real appeal of the NPBAT statistic arises when phenotype information is available for only some subjects but genetic information is available for all subjects. For example, in case-control studies, an additional quantitative phenotype may be available for the cases but not the controls. When testing for a genetic association with this additional quantitative phenotype, the NPBAT statistic uses the genotype of both the cases and the controls with the optimal coded phenotype $T_i = Y_i - Y_{\text{offset}}$, where $Y_{\text{offset}}$ is a constant. The choice of this constant is described in detail in the Simulations sub-section, and the asymptotic distribution of the NPBAT statistic is derived in the Appendix. Using this optimal offset choice, the NPBAT statistic has a substantial increase in power over other methods, such as the NPBAT statistic with the offset choice $T_i = Y_i - \bar{Y}$ or the improved score test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21].
Adjustments for population admixture
The NPBAT statistic can be adjusted for population admixture by using standard methods such as principal components analysis or genomic control [22,23]. For example, to account for population admixture, one can treat the principal components as additional covariates representing population information and incorporate them into the test statistic in equation (2) by taking $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values of regressing the phenotype $Y$ on the top principal components that explain the greatest amount of variability in the data. Note that the above approach requires that the phenotype $Y$ is dichotomous or roughly normally distributed.
Extension to multiple phenotypes
The NPBAT statistic can be extended to $m$ phenotypes to test the null hypothesis that a marker locus is not linked to any disease-susceptibility locus for any of $m$ selected phenotypes. Then the test statistic becomes

$$S = \sum_{i=1}^{n} (X_i - E_x)\, T_i \qquad (3)$$

Note that $E(S) = 0$, as is the case for the univariate version above. Here, however, $T_i$ is the $m \times 1$ vector for the $m$ phenotypes and $X_i$ is a single marker, so $S$ is $m \times 1$. The $m \times m$ variance matrix is

$$V_S = \hat{\sigma}_X^2 \sum_{i=1}^{n} T_i T_i^t \qquad (4)$$

where $\hat{\sigma}_X^2$ is the variance of marker $X$ estimated from the sample. The NPBAT statistic is then

$$\chi^2_{\mathrm{NPBAT}} = S^t V_S^{-1} S \qquad (5)$$

Due to the estimation of $E_x$ based on the sample, this statistic does not have a chi-square distribution and a permutation test needs to be used to assess significance levels, which can be done by using the NPBAT software package (https://sites.google.com/site/genenpbat/).
Simulations
In genetic association case-control studies, only the cases may have additional phenotypic information available. For instance, in a case-control study where the cases have asthma (the primary phenotype), only the cases may have FEV measurements (the secondary phenotype). In this scenario, the secondary phenotype FEV will be more severe than it would be in the general population, and the analysis of this secondary phenotype can be misleading due to the ascertainment of subjects based on the primary phenotype, asthma. To simulate this scenario, we generated the genotype $X$ for 500 cases and 500 controls and a secondary phenotype $Y$ for only the 500 cases from a truncated normal distribution with standard deviation $\sigma = 1$, mean $aX$ under the alternative and mean 0 under the null, and a cutoff such that the secondary phenotype lies in the top 50 percent of the normal distribution. We consider an allele frequency of $p = 20\%$, and $a$ is chosen such that the heritability $h$ [24] equals 1%, 2%, 3%, or 5%. Solving for $a$ gives

$$a = \sigma \sqrt{\frac{h}{2p(1-p)(1-h)}}.$$

We compute the NPBAT statistic with the coded phenotype $T_i = Y_i - Y_{\text{offset}}$, where $Y_{\text{offset}}$ is a constant that ranges from -5 to 15 and $E_x$ is the sample mean of the genotypes in the cases. We also compute the NPBAT statistic with $E_x$ equal to the sample mean of the genotypes in the controls and with $E_x$ equal to the sample mean of the genotypes in the cases and the controls. We compare the power of these three NPBAT statistics to the Improved Score Test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21]. We also compare the power of the NPBAT approach to a standard linear regression.

Under the null hypothesis, the NPBAT method maintains a significance level of approximately 5% or less, as seen in Figure 1, whether $E_x$ is the sample mean of the cases or the controls or both. Figure 1 also depicts the power results of these simulations. Note that the spike or drop in all the plots occurs where $Y_{\text{offset}} \approx \bar{Y}$, the sample mean of the secondary phenotype for the cases, since the secondary phenotype is not available for the controls in this scenario. The power of the NPBAT approach is maximized when $E_x$ is based on the genotype of the controls and $Y_{\text{offset}}$ is significantly different from the phenotypic mean of the cases. When $E_x$ is based on the genotype of the cases, the power of the NPBAT approach is similar to the improved score test and the regression. Note that the power of the NPBAT approach when $E_x$ is based on the genotype of both the cases and the controls is best for high values of heritability.

Figure 1 Power and significance levels for NPBAT, the Improved Score Test and the Likelihood Ratio Test (LRT). This plot compares the power and type-1 error rate of the NPBAT method using $E_x$ based on the sample mean of the cases, the controls and both the cases and controls. The power and significance levels of this method are compared to the improved score test and a standard linear regression. Note that the spike or drop in all the plots occurs where $Y_{\text{offset}} \approx \bar{Y}$, the sample mean of the secondary phenotype for the cases, since the secondary phenotype is not available for the controls in this scenario. The power of the NPBAT approach is maximized when $E_x$ is based on the genotype of the controls and $Y_{\text{offset}}$ is significantly different from the phenotypic mean of the cases. When $E_x$ is based on the genotype of the cases, the power of the NPBAT approach is similar to the improved score test and the regression. Note that the power of the NPBAT approach when $E_x$ is based on the genotype of both the cases and the controls is best for high values of heritability.

These simulations show that, for case-control studies in which secondary phenotypes correlated with case-control status are analyzed, we recommend setting $Y_{\text{offset}}$ to a constant significantly different from the phenotypic mean of the sample and $E_x$ equal to the genotypic mean of the controls. In this situation, a robust and efficient choice for the offset $Y_{\text{offset}}$ is the phenotypic mean in the general population. Note that the results of these simulations are analogous to the FBAT statistic in family studies, where it was found that when ascertaining cases only from a quantitative distribution, one needed to choose an offset that was outside the range of the cases' phenotypic values [15].
Data analysis
We applied the NPBAT method to the Genetic Epidemiology of COPD (COPDGene) Study, which is a multi-center case/control study designed to identify genetic factors associated with COPD and to characterize COPD-related phenotypes [19]. The study recruited COPD cases and smoking controls who were non-Hispanic whites and African Americans aged 45 to 80 with at least 10 pack-years of smoking history. The study also collected the Fagerstrom Test for Nicotine Dependence (FTND) to assess nicotine dependence, but the FTND score was only available for cases and controls who were current smokers at study enrollment. This data analysis represents the scenario where the secondary phenotype (FTND score) is available only in current smokers but the genotypic information is available for both current and former smokers. In the first 1,000 Non-Hispanic White (NHW) individuals, the FTND score controlling for age and gender was tested for an association with SNPs in the CHRNA3/5 region for COPD cases and controls who are current smokers, and association was found for rs1051730 or rs8034191 [20]. We applied the NPBAT statistic to the first 1,000 NHW using the genotype of both current (307 individuals) and former smokers (669 individuals), controlling for age and gender, and obtained the results shown in Table 1 for these 2 SNPs. Note that the NPBAT statistic yielded smaller p-values than both the Improved Score Test and the regression controlling for age and gender.
Results and discussion
NPBAT is a new statistical framework for population-based genetic association tests that does not require making specific assumptions about the distribution of the phenotype. By conditioning on the phenotype, NPBAT is robust against violations of phenotypic model assumptions. The practical implications of NPBAT are demonstrated when applied to the COPDGene Study. FNDS, a measure of nicotine dependence, was assessed in current smokers, who represent 31% of study participants in COPDGene. We analyzed SNPs shown to be associated with FNDS [20]. NPBAT identified the same SNPs as conventional methods but with slightly greater statistical significance than a linear regression for FNDS controlling for age and gender or the improved score test. Other examples of applications of NPBAT are
1. when a sample is ascertained based on case/control status and the phenotype of interest is correlated with case status
2. in a cohort study in which prevalent cases are excluded (i.e. the classic epidemiologic cohort study) and the phenotype of interest is correlated with the disease of interest
3. a pharmacogenetics study using a randomized clinical trial when participants are ascertained based on the levels of the target of therapy
The broad application of NPBAT is to scenarios where samples are ascertained based on selection criteria that are correlated with the phenotype of interest.
Conclusions
In conclusion, the key advantage that defines the attraction of the proposed approach is its robustness against misspecification of the phenotypic model. This enables extensions to different types of traits and the integration of complex statistical models for the phenotype. At the same time, the validity of the approach is not compromised by such generalization. Though the power is sensitive to the offset choice, NPBAT is valid regardless of the offset. As with all population-based association
Table 1 This table displays the p-values for the association between the Fagerstrom Test for Nicotine Dependence (FTND) and the markers listed for the different statistical tests: the NPBAT where $E_x = \bar{x}_c$ is the genotypic mean of the current smokers, NPBAT where $E_x = \bar{x}_f$ is the genotypic mean of the former smokers, the Improved Score Test and a linear regression

Method      NPBAT: $E_x = \bar{x}_c$   NPBAT: $E_x = \bar{x}_f$   Improved Score Test   Regression
rs1051730   0.00134                    0.00138                    0.00227               0.00259
rs8034191   0.00386                    0.00391                    0.00694               0.00744
