A general semi-parametric approach to the analysis of genetic association studies in population-based designs

Lutz et al. BMC Genetics 2013, 14:13
METHODOLOGY ARTICLE Open Access

Sharon Lutz 1,3,4*, Wai-Ki Yip 3,4, John Hokanson 2, Nan Laird 3,4 and Christoph Lange 3,4,5,6

Abstract
Background: For genetic association studies in designs of unrelated individuals, current statistical methodology typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the specification of such model assumptions is not straightforward and is error-prone, potentially causing misleading results.
Results: In this paper, we propose an alternative approach that treats the genotype as the random variable and conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease (COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, that is correlated with COPD affection status. The software package that implements this method is available.
Conclusions: The flexibility of this approach enables the straightforward application to quantitative phenotypes and binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, rare-variant analysis, and others.
Keywords: Genetic association studies, Secondary phenotypes, Case-control, Ascertainment, Semi-parametric

*Correspondence: sharon.lutz@ucdenver.edu
1 Department of Biostatistics, University of Colorado Anschutz Medical Campus, Aurora, USA
3 Department of Biostatistics, Harvard School of Public Health, Boston, USA
Full list of author information is available at the end of the article

© 2013 Lutz et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
In genetic association studies, individuals are often recruited based on case-control ascertainment conditions of the primary phenotype [1]. For the analysis of secondary phenotypes, this recruitment scheme can become problematic. If the secondary phenotype is correlated with the primary phenotype in a case-control study, the distribution of the secondary phenotype can be fundamentally different from its distribution in the general population. For example, in a genetic association study of COPD in which all cases have COPD and control subjects have normal pulmonary function, the distribution of quantitative lung phenotypes can deviate substantially from their distribution in the general population. For samples that are ascertained in this fashion, standard statistical methods may lead to misleading results or may lack statistical power to identify true genotype-phenotype associations. There are several methods to accurately estimate the odds ratio of genetic variants for binary secondary phenotypes associated with case-control status [2-10], but these methods cannot easily accommodate continuous secondary phenotypes. For the special case that the secondary phenotype is normally distributed or binary, Lin & Zeng (2009) proposed an adjusted score test that incorporates genetic associations with affection status into the test statistic [11].
We present a more general approach that does not require any distributional assumptions for the secondary phenotype. We refer to the approach as the non-parametric population-based association test (NPBAT). The approach has a form similar to the Family Based Association Test (FBAT), a non-parametric test statistic that is frequently used in the family-based setting [12-15]. The flexibility of our approach allows us to construct a genetic association test for standard and complex phenotypes that is non-parametric with respect to the phenotype. The class of tests is very general. It includes most standard association tests and can be applied to multivariate traits and phenotypes, multiple genetic markers, and case-control studies where phenotypic information is available for the cases but correlated with the case-control status [16-18].

The general concept of the proposed association-testing framework is to condition on the phenotype of interest and treat only the genetic data as random [12,13,15]. By assuming that the phenotype data is deterministic, the validity of the approach does not depend on the correctness of the phenotypic assumptions. Nevertheless, the power of the approach can be increased by incorporating a plausible model for the phenotype into the test statistic. Based on theoretical considerations and on simulation studies, we show that the new approach is robust against misspecification of phenotype assumptions. At the same time, this approach achieves the same power level as standard genetic association tests for population-based designs when the phenotype of interest has a normal distribution or is dichotomous. For studies where a quantitative trait is correlated with case-control status, our simulation studies examine the power and significance levels for the proposed approach, which does not require any adjustment for the ascertainment conditions.

We illustrate the practical advantages of NPBAT by an application to the COPDGene study.
The COPDGene study is a case-control study of the genetics of COPD in current or former smokers with at least 10 pack-years of smoking history [19]. We test the genetic association of single nucleotide polymorphisms (SNPs) in the CHRNA3/5 region and the Fagerstrom Nicotine Dependence score (FNDS). FNDS is a validated instrument of nicotine dependence in current smokers and was measured in the current smokers, but not former smokers, in the COPDGene study. NPBAT, which uses the genotype data in both current and former smokers, is compared to the published genetic association of SNPs in the CHRNA3/5 region and FNDS that was performed in current smokers only [20].

Methods
In a genetic association study, $n$ unrelated study subjects have been recruited based on a predefined ascertainment condition. Let $X_i$ denote the genotype of individual $i$. The specific value of $X_i$ will depend upon the genetic model under consideration. For instance, for an additive model, $X_i = 0, 1, 2$ for 0, 1, 2 disease alleles, respectively. $X_i$ may also be a vector in order to test several alleles simultaneously. Let $T_i$ denote the numerical trait information for individual $i$. For example, $T_i$ could equal one for affected individuals and zero for unaffected individuals. Different coding functions are applied depending on the phenotype of interest. For binary and continuous traits, we will discuss efficient coding schemes below.
First, we define a general class of test statistics as

$$ S = \sum_{i=1}^{n} (X_i - E_x)\, T_i \tag{1} $$

Note that $E(S) = 0$ under the null hypothesis of no association between the genotype $X$ and the phenotype $Y$. Constructing a conditional score test in which the genotype $X_i$ is the dependent variable and we condition upon the numerical trait information $T_i$, the NPBAT statistic has the following form:

$$ \mathrm{Stat}_{\mathrm{NPBAT}} = \frac{S - E[S]}{\sqrt{\mathrm{var}(S)}} = \frac{\sum_{i=1}^{n} (X_i - E_x)\, T_i}{\sqrt{\sum_{i=1}^{n} T_i^2 \cdot \frac{\sum_{i=1}^{n} (X_i - E_x)^2}{n-1}}} \tag{2} $$

where $E_x$ denotes the expectation of the marker score/genotype $X$ under the null hypothesis of no genetic association between the phenotype and the marker locus. $E_x$ can be estimated based on the sample mean of the genotypes. The asymptotic distribution of the NPBAT statistic under the null hypothesis depends on the estimation of $E_x$ and on the specification of the trait information $T_i$, and is derived in the Appendix.

There are various ways to code the phenotype of interest and define the coding function $T_i$. For the analysis of affection status, one could specify the coding function to be $T_i = 1$ or $T_i = 0$, depending on the disease status of the proband. However, as we show in Appendix A, a more efficient way is to set $T_i = 1 - \#\mathrm{cases}/n$ for the cases and $T_i = 0 - \#\mathrm{cases}/n$ for the controls. Then the NPBAT statistic is approximately the same as the Cochran-Armitage trend test.

If the phenotype $Y_i$ is in fact normally distributed and $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values from regressing the phenotype $Y$ on any covariates, then the NPBAT statistic is approximately the same as a t-statistic from a linear regression. In general, if the phenotype $Y_i$ is continuous, we recommend $T_i = Y_i - \mu_y$, where $\mu_y$ is the phenotypic mean in the general population.
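As a minimal sketch of equation (2), the following Python functions compute the NPBAT statistic from plain lists of genotypes and coded traits. The names `npbat_statistic` and `code_affection_status` are hypothetical and are not taken from the NPBAT software package. By default $E_x$ is estimated by the sample mean of the genotypes, as described above; the reference distribution derived in the Appendix is not reproduced here.

```python
import math


def npbat_statistic(genotypes, traits, e_x=None):
    """Equation (2): S / sqrt(sum(T_i^2) * sum((X_i - E_x)^2) / (n - 1)).

    genotypes: additive codes X_i (0, 1 or 2), one per subject.
    traits:    coded trait values T_i, one per subject.
    e_x:       genotype expectation under the null; defaults to the
               sample mean of the genotypes (an assumption of this sketch).
    """
    n = len(genotypes)
    if n != len(traits) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    if e_x is None:
        e_x = sum(genotypes) / n
    # Numerator: S from equation (1).
    s = sum((x - e_x) * t for x, t in zip(genotypes, traits))
    # Denominator: sum of squared traits times the genotype variance estimate.
    sum_t2 = sum(t * t for t in traits)
    var_x = sum((x - e_x) ** 2 for x in genotypes) / (n - 1)
    if sum_t2 * var_x == 0:
        raise ValueError("statistic undefined: no variation in traits or genotypes")
    return s / math.sqrt(sum_t2 * var_x)


def code_affection_status(is_case):
    """T_i = 1 - #cases/n for cases and 0 - #cases/n for controls."""
    case_fraction = sum(1 for c in is_case if c) / len(is_case)
    return [(1.0 if c else 0.0) - case_fraction for c in is_case]
```

For case-control data, `npbat_statistic(genotypes, code_affection_status(is_case))` gives the version that the text describes as approximately equal to the Cochran-Armitage trend test.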
While it is appealing that the NPBAT statistic is comparable to standard methods in these simple scenarios, the real appeal of the NPBAT statistic arises when phenotype information is available for only some subjects but genetic information is available for all subjects. For example, in case-control studies, an additional quantitative phenotype may be available for the cases but not the controls. When testing for a genetic association with this additional quantitative phenotype, the NPBAT statistic uses the genotype of both the cases and the controls with the optimal coded phenotype $T_i = Y_i - Y_{\mathrm{offset}}$, where $Y_{\mathrm{offset}}$ is a constant. The choice of this constant is described in detail in the simulations sub-section, and the asymptotic distribution of the NPBAT statistic is derived in the Appendix. Using this optimal offset choice, the NPBAT statistic has a substantial increase in power over other methods, such as the NPBAT statistic with the offset choice $T_i = Y_i - \bar{Y}$ or the improved score test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21].

Adjustments for population admixture
The NPBAT statistic can be adjusted for population admixture by using standard methods such as principal components analysis or genomic control [22,23]. For example, to account for population admixture, one can treat the principal components as additional covariates representing population information, and incorporate them into the test statistic in equation (2) by taking $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values from regressing the phenotype $Y$ on the top principal components that explain the greatest amount of variability in the data. Note that the above approach requires that the phenotype $Y$ is dichotomous or roughly normally distributed.
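The residual coding $T_i = Y_i - \hat{Y}_i$ described above can be sketched as follows. This is an illustrative ordinary least-squares fit in plain Python, assuming the principal-component scores have already been computed elsewhere; the function name `residualize` is hypothetical and is not part of the NPBAT software package.

```python
def residualize(y, covariates):
    """Return T_i = Y_i - Yhat_i from a least-squares fit of y on an
    intercept plus the given covariates.

    y:          phenotype values, one per subject.
    covariates: one row per subject, e.g. scores on the top principal
                components (assumed to be precomputed).
    """
    n = len(y)
    design = [[1.0] + list(row) for row in covariates]
    k = len(design[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    aug = [
        [sum(design[i][r] * design[i][c] for i in range(n)) for c in range(k)]
        + [sum(design[i][r] * y[i] for i in range(n))]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    beta = [aug[r][k] / aug[r][r] for r in range(k)]
    # Residuals: observed phenotype minus fitted value.
    return [y[i] - sum(b * d for b, d in zip(beta, design[i])) for i in range(n)]
```

The returned residuals would then be passed as the coded trait $T_i$ in equation (2). Because an intercept is included, the residuals sum to zero and are orthogonal to each covariate.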
Extension to multiple phenotypes
The NPBAT statistic can be extended to $m$ phenotypes to test the null hypothesis that a marker locus is not linked to any disease-susceptibility locus for any of $m$ selected phenotypes. Then the test statistic becomes

$$ S = \sum_{i=1}^{n} (X_i - E_x)\, T_i \tag{3} $$

Note that $E(S) = 0$, as is the case for the univariate version above. But here $T_i$ is the $m \times 1$ vector for the $m$ phenotypes and $X_i$ is just one marker, so $S$ is $m \times 1$. The $m \times m$ variance matrix is the following

$$ V_S = \hat{\sigma}_X^2 \sum_{i=1}^{n} T_i T_i^t \tag{4} $$

where $\hat{\sigma}_X^2$ is the variance for marker $X$ based on the sample. Then the NPBAT statistic is the following

$$ \chi^2_{\mathrm{NPBAT}} = S^t V_S^{-1} S \tag{5} $$

Due to the estimation of $E_x$ based on the sample, this statistic does not have a chi-square distribution, and a permutation test needs to be used to assess significance levels, which can be done by using the NPBAT software package.

Simulations
In genetic association case-control studies, only the cases may have additional phenotypic information available. For instance, in a case-control study where the cases have asthma (the primary phenotype), only the cases may have FEV measurements (the secondary phenotype). In this scenario, the secondary phenotype FEV will be more severe than it would be in the general population, and the analysis of this secondary phenotype can be misleading due to the ascertainment of subjects based on the primary phenotype, asthma. To simulate this scenario, we generated the genotype $X$ for 500 cases and 500 controls and a secondary phenotype $Y$ for only the 500 cases from a truncated normal distribution with standard deviation $\sigma = 1$, mean $aX$ under the alternative and mean 0 under the null, and a cutoff such that the secondary phenotype is in the top 50 percent of the normal distribution. We consider an allele frequency of $p = 20\%$, and $a$ is chosen such that the heritability $h$ [24] equals 1%, 2%, 3%, 5%.
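One possible reading of this data-generating step is sketched below in plain Python. Several details are assumptions of the sketch and are not stated in the text: genotypes are drawn under Hardy-Weinberg proportions, the truncation cutoff is placed at 0 (the median of the null distribution), the truncated normal is sampled by redrawing $Y$ given $X$ until it exceeds the cutoff, and the effect size uses the additive-model relation $a = \sigma \sqrt{h / (2p(1-p)(1-h))}$. The function names are hypothetical.

```python
import math
import random


def effect_size(h, p, sigma=1.0):
    """Additive effect a implied by heritability h and allele frequency p."""
    return sigma * math.sqrt(h / (2.0 * p * (1.0 - p) * (1.0 - h)))


def draw_genotype(rng, p):
    """Number of minor alleles (0, 1 or 2), assuming Hardy-Weinberg proportions."""
    return int(rng.random() < p) + int(rng.random() < p)


def draw_truncated_normal(rng, mean, sigma, cutoff):
    """Normal draw conditioned on exceeding the cutoff (simple rejection sampling)."""
    while True:
        y = rng.gauss(mean, sigma)
        if y > cutoff:
            return y


def simulate(n_cases=500, n_controls=500, p=0.2, h=0.01,
             sigma=1.0, cutoff=0.0, seed=0):
    """Return (cases, controls): cases are (genotype, phenotype) pairs,
    controls are genotypes only. Setting h=0 gives the null scenario."""
    rng = random.Random(seed)
    a = effect_size(h, p, sigma)
    cases = []
    for _ in range(n_cases):
        x = draw_genotype(rng, p)
        cases.append((x, draw_truncated_normal(rng, a * x, sigma, cutoff)))
    controls = [draw_genotype(rng, p) for _ in range(n_controls)]
    return cases, controls
```

Calling `simulate(h=0.0)` generates data under the null, and repeating the call over many seeds would give the replicates needed to estimate significance levels and power.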
Solving for $a$ gives $a = \sigma \sqrt{h / (2p(1-p)(1-h))}$.

We compute the NPBAT statistic with the coded phenotype $T_i = Y_i - Y_{\mathrm{offset}}$, where $Y_{\mathrm{offset}}$ is a constant that ranges from -5 to 15 and $E_x$ is the sample mean of the genotypes in the cases. We also compute the NPBAT statistic with $E_x$ equal to the sample mean of the genotypes in the controls and with $E_x$ equal to the sample mean of the genotypes in the cases and the controls. We compare the power of these three NPBAT statistics to the Improved Score Test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21]. We also compare the power of the NPBAT approach to a standard linear regression.

Under the null hypothesis, the NPBAT method maintains a significance level of approximately 5% or less, as seen in Figure 1, whether $E_x$ is the sample mean of the cases, the controls, or both. Figure 1 also depicts the power results of these simulations. Note that the spike or drop in all the plots occurs where $Y_{\mathrm{offset}} \approx \bar{Y}$, the sample mean of the secondary phenotype for the cases, since the secondary phenotype is not available for the controls in this scenario. The power of the NPBAT approach is maximized when $E_x$ is based on the genotype of the controls and $Y_{\mathrm{offset}}$ is significantly different from the phenotypic mean of the cases. When $E_x$ is based on the genotype of the cases, the power of the NPBAT approach is similar to that of the improved score test and the regression. Note that the power of the NPBAT approach when $E_x$ is based on the genotype of both the cases and the controls is best for high values of heritability.

Figure 1 Power and significance levels for NPBAT, the Improved Score Test and the Likelihood Ratio Test (LRT). This plot compares the power and type-1 error rate of the NPBAT method using $E_x$ based on the sample mean of the cases, the controls, and both the cases and controls. The power and significance levels of this method are compared to the improved score test and a standard linear regression.

These simulations suggest that, when analyzing secondary phenotypes correlated with case-control status in case-control studies, $Y_{\mathrm{offset}}$ should be set to a constant significantly different from the phenotypic mean of the sample, and $E_x$ to the genotypic mean of the controls. In this situation, a robust and efficient choice for the offset $Y_{\mathrm{offset}}$ is the phenotypic mean in the general population. Note that the results of these simulations are analogous to those for the FBAT statistic in family studies, where it was found that when ascertaining cases only from a quantitative distribution, one needed to choose an offset that was outside the range of the cases' phenotypic values [15].

Data analysis
We applied the NPBAT method to the Genetic Epidemiology of COPD (COPDGene) Study, which is a multi-center case/control study designed to identify genetic factors associated with COPD and to characterize COPD-related phenotypes [19].
The study recruited COPD cases and smoking controls who were non-Hispanic whites and African Americans ages 45 to 80 with at least 10 pack-years of smoking history. The study also collected the Fagerstrom Test for Nicotine Dependence (FTND) to assess nicotine dependence, but the FTND score was only available for cases and controls who were current smokers at study enrollment. This data analysis represents the scenario where the secondary phenotype (FTND score) is available only in current smokers but the genotypic information is available for both current and former smokers.

In the first 1,000 Non-Hispanic White (NHW) individuals, the FTND score controlling for age and gender was tested for an association with SNPs in the CHRNA3/5 region for COPD cases and controls who are current smokers, and an association was found for rs1051730 or rs8034191 [20]. We applied the NPBAT statistic to the first 1,000 NHW individuals using the genotype of both current (307 individuals) and former smokers (669 individuals), controlling for age and gender, and obtained the results shown in Table 1 for these 2 SNPs. Note that the NPBAT statistic yielded smaller p-values than both the Improved Score Test and the regression controlling for age and gender.

Results and discussion
NPBAT is a new statistical framework for population-based genetic association tests that does not require making specific assumptions about the distribution of the phenotype. By conditioning on the phenotype, NPBAT is robust against violations of phenotypic model assumptions. The practical implications of NPBAT are demonstrated in its application to the COPDGene Study. FNDS, a measure of nicotine dependence, was assessed in current smokers, who represent 31% of study participants in COPDGene. We analyzed SNPs shown to be associated with FNDS [20]. NPBAT identified the same SNPs as conventional methods but with slightly greater statistical significance than a linear regression for FNDS controlling for age and gender or the improved score test.
Other examples of applications of NPBAT are
1. when a sample is ascertained based on case/control status and the phenotype of interest is correlated with case status
2. in a cohort study in which prevalent cases are excluded (i.e. the classic epidemiologic cohort study) and the phenotype of interest is correlated with the disease of interest
3. a pharmacogenetics study using a randomized clinical trial when participants are ascertained based on the levels of the target of therapy

The broad application of NPBAT is to scenarios where samples are ascertained based on selection criteria that are correlated with the phenotype of interest.

Conclusions
In conclusion, the key advantage of the proposed approach is its robustness against misspecification of the phenotypic model. This enables extensions to different types of traits and the integration of complex statistical models for the phenotype, while the validity of the approach is not compromised by such generalization. Though the power is sensitive to the offset choice, NPBAT is valid regardless of the offset. As with all population-based association

Table 1 This table displays the p-values for the association between the Fagerstrom Test for Nicotine Dependence (FTND) and the markers listed for the different statistical tests: the NPBAT where $E_x = \bar{x}_c$ is the genotypic mean of the current smokers, NPBAT where $E_x = \bar{x}_f$ is the genotypic mean of the former smokers, the Improved Score Test and a linear regression

SNP         NPBAT: E_x = x̄_c   NPBAT: E_x = x̄_f   Improved Score Test   Regression
rs1051730   0.00134            0.00138            0.00227               0.00259
rs8034191   0.00386            0.00391            0.00694               0.00744