doi:10.1111/j.1467-9469.2007.00560.x © Board of the Foundation of the Scandinavian Journal of Statistics 2007. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA. Vol 34: 816-828, 2007

A Robust Conflict Measure of Inconsistencies in Bayesian Hierarchical Models

FREDRIK A. DAHL, Health Services Research Unit, Akershus University Hospital, and Department of Mathematics, University of Oslo

JØRUND GÅSEMYR and BENT NATVIG, Department of Mathematics, University of Oslo

ABSTRACT. O'Hagan (Highly Structured Stochastic Systems, Oxford University Press, Oxford, 2003) introduces some tools for criticism of Bayesian hierarchical models that can be applied at each node of the model, with a view to diagnosing problems of model fit at any point in the model structure. His method relies on computing the posterior median of a conflict index, typically through Markov chain Monte Carlo simulations. We investigate a Gaussian model of one-way analysis of variance, and show that O'Hagan's approach gives unreliable false warning probabilities. We extend and refine the method, especially avoiding double use of data by a data-splitting approach, accompanied by theoretical justifications from a non-trivial special case. Through extensive numerical experiments we show that our method detects model mis-specification about as well as the method of O'Hagan, while retaining the desired false warning probability for data generated from the assumed model. This also holds for Student's-t and uniform distribution versions of the model.

Key words: double use of data, Markov chain Monte Carlo simulations, model evaluation, one-way analysis of variance

1. Introduction

Modern computer technology combined with Markov chain Monte Carlo (MCMC) algorithms has made it possible to analyse complex Bayesian hierarchical models.
The resulting popularity of complex models has also increased the need for ways of evaluating such models. In a frequentist setting, this is often done by way of p-values, which quantify how surprising the given data set is under the assumed model. By construction, a frequentist p-value is pre-experimentally uniformly distributed on the unit interval, where low values are interpreted as surprising. In the present paper, the term pre-experimental refers to the distribution under the assumed model prior to using the data.

Several Bayesian p-values have been suggested over the last few decades. The so-called prior predictive p-value of Box (1980) measures the degree of surprise of the data, according to some metric of choice, under a probability measure defined by the product of the prior and the likelihood given the model. It therefore differs from a frequentist p-value through the introduction of the prior distribution. The prior predictive p-value is a natural choice in cases where the prior of a Bayesian model represents our true beliefs about the distribution of our parameters prior to seeing data. Usually, however, we apply quite vague priors that represent general uncertainty about parameters that could, in principle, be arbitrarily precisely estimated with enough data. In these cases, sampling under the prior makes little sense, and is not even defined for improper priors. Rubin (1984) therefore introduced posterior predictive p-values that rely on sampling hypothetical future replications from the posterior distribution. This construction also allows metrics that evaluate discrepancies between data and parameter values (see Gelman et al., 1996). However, posterior predictive p-values use data twice: both directly through the discrepancy function, and indirectly by sampling from the posterior distribution. This has been criticized by Dey et al.
(1998) and Bayarri & Berger (2000), both coming up with alternative approaches. The former paper introduces a simulation-based approach where the posterior distribution given the observed data is compared with a medley of posterior distributions given replicated data sets generated from the prior distribution. Hence, the approach is essentially in accordance with the prior predictive approach. The latter paper suggests two variants, the conditional predictive p-value and the partial posterior predictive p-value, both designed to avoid the double use of data by eliminating the influence of a chosen test statistic on the posterior distribution.

Robins et al. (2000) proves that the pre-experimental asymptotic distribution of the posterior predictive p-value is more concentrated around 1/2 than a uniform, as opposed to the two variants of Bayarri & Berger (2000). Hence, as also pointed out by Meng (1994) and Dahl (2006), posterior predictive p-values tend to be conservative in the sense that extreme values get too low probability. Hjort et al. (2006) analyses this in depth, and designs a double simulation scheme that alleviates the problem. This scheme can be thought of as essentially treating the posterior predictive p-value as a test statistic in itself, and using it in an extensive prior predictive p-value computation.

In model choice problems the task is to choose the best model from a given set of candidates. Bayes factors (see Kass & Raftery, 1995) provide a useful methodology for such problems. Information criteria give a different approach to model choice, based on weighing model fit against the number of free parameters. The Bayesian information criterion (BIC) was defined by Schwarz (1978) and more recently analysed by Clyde & George (2004). A different information criterion used for Bayesian models is the so-called deviance information criterion (DIC) (see Spiegelhalter et al., 2002).
Although model evaluation and model choice are related, these tasks are different, and model choice methods cannot readily be applied for the purpose of model evaluation.

The variants of Bayarri & Berger (2000) work well in some simple cases, and, for the partial posterior predictive p-value, also for a simple hierarchical model, as demonstrated in Bayarri & Castellanos (2004). However, it seems difficult to use this method to criticize arbitrary aspects of Bayesian hierarchical models. An approximate cross-validation approach aimed at criticizing such models is given in Marshall & Spiegelhalter (2003). Dey et al. (1998) introduced tools for evaluating different parts of such models. Similarly, in this paper we extend and refine a tool suggested by O'Hagan (2003) for evaluating inconsistencies of a model, through analysis of what he calls information contributions. This is a flexible tool that can be used at any node in the model network. However, in the present paper, we restrict our attention to location parameters. Under suitable conditions, our conflict evaluation for a given node will pre-experimentally be a squared normal variable. Our main hypothesis is that this is close to being true for a larger class of models. This gives a surprise index which is similar to a frequentist p-value. In cases where we have domain knowledge that makes us suspect a given node, we test that one. Otherwise, one should make Bonferroni-like adjustments to the significance level to control the overall false alarm probability. This does not mean that we advocate basing the model-building process on a formal hypothesis testing scheme alone. Rather, we envisage an informal procedure, where the conflict analysis suggests points in the model structure that might be problematic. However, without reasonable control over the pre-experimental distribution of the conflicts in the model, it would be difficult to use this tool in practice without a computationally demanding empirical normalization.
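Such Bonferroni-like adjustments are straightforward to compute. The following is a minimal illustrative sketch (ours, not code from the paper): assuming a conflict measure that is pre-experimentally χ²_1 distributed, i.e. a squared standard normal, the warning threshold for overall significance level alpha over k tested nodes follows from the standard normal quantile function.

```python
from statistics import NormalDist

def conflict_threshold(alpha=0.05, k=1):
    """Critical value for a chi^2_1-distributed conflict measure,
    Bonferroni-adjusted when k nodes are tested at overall level alpha.
    Uses the fact that Z^2 ~ chi^2_1 when Z ~ N(0, 1)."""
    p = 1.0 - alpha / k                        # adjusted coverage probability
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # chi^2_1 quantile via the normal one
    return z * z
```

For a single node at alpha = 0.05 this gives roughly 3.84, close to the informal warning level of 4 used below, which for a χ²_1 variable corresponds to alpha ≈ 0.046.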
The paper is laid out as follows: in section 2 we explain the original idea of O'Hagan (2003) in the setting of a Gaussian hierarchical model, followed by our modifications of the method. Our modifications include the splitting of data, so as to avoid double use of it, and this is discussed further in section 3. Section 4 gives some theoretical results in a special case of our model. In section 5 we give results from our massive simulation experiments, and section 6 concludes the article. In the Appendix, we give the proofs of the theoretical results in section 4.

2. Measuring conflict

O'Hagan (2003) introduces some tools for model criticism that can be applied at each node of a complex hierarchical or graphical model, with a view to diagnosing problems of model fit at any point in the model structure. In general, the model can be supposed to be expressed as a directed acyclic graph. To compare two unimodal densities/likelihoods he suggests the following procedure. First, normalize both densities to have unit maximum height. The height of both curves at their point of intersection is denoted by z. Then the suggested conflict measure is c = −2 ln z. In the present paper we consider, as O'Hagan (2003), the simple hierarchical model for several normal samples (one-way analysis of variance) to clarify what we see as problematic aspects of his approach. Observations y_ij for i = 1, ..., k and j = 1, ..., n_i are available. The model has the form

y_ij | θ, σ² ~ N(θ_i, σ²) independently, i = 1, ..., k; j = 1, ..., n_i,
θ_i | μ, τ² ~ N(μ, τ²) independently, i = 1, ..., k,    (1)

where θ = (θ_1, ..., θ_k), and the model is completed by a prior distribution for σ², τ² and μ.

In the model (1), consider the node for parameter θ_i. In addition to its parents μ and τ², it is linked to its child nodes y_i1, ..., y_{in_i}.
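As a concrete illustration of model (1), the following sketch (ours; the function name and the balanced design n_i = n are simplifying assumptions) draws one data set with the hyperparameters held at fixed values:

```python
import random

def simulate_model(k, n, mu0, tau0, sigma0, seed=None):
    """One draw from model (1) with fixed hyperparameters:
    theta_i ~ N(mu0, tau0^2), y_ij | theta_i ~ N(theta_i, sigma0^2),
    using a balanced design with n_i = n observations per group."""
    rng = random.Random(seed)
    theta = [rng.gauss(mu0, tau0) for _ in range(k)]
    y = [[rng.gauss(t, sigma0) for _ in range(n)] for t in theta]
    return theta, y
```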
The full conditional distribution of θ_i is given by

p(θ_i | y, θ_{−i}, σ², τ², μ) ∝ p(θ_i | μ, τ²) ∏_{j=1}^{n_i} p(y_ij | θ_i, σ²),    (2)

where y = (y_11, ..., y_{kn_k}) is the complete set of data, and θ_{−i} = (θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_k). This shows how each of the n_i + 1 distributions can be considered as a source of information about θ_i. When we are considering the possibility of conflict at the θ_i node, we must consider each of these contributing distributions as functions of θ_i. In the present model, contrasting the information about θ_i from the parent nodes with that from the child nodes, the conflict measure simplifies to

c^1_{θ_i} = (μ − ȳ_i)² / (τ + σ/√n_i)²,    (3)

where ȳ_i = (1/n_i) ∑_{j=1}^{n_i} y_ij, noting that the last n_i factors of (2) can be written as p(ȳ_i | θ_i, σ²).

When the parameters σ², τ² and μ are given by prior distributions, O'Hagan (2003) suggests using MCMC to estimate the quantity

c^{1,y,med}_{θ_i} = M_{σ²,τ²,μ|y}(c^1_{θ_i}),    (4)

where M denotes the median under the posterior distribution of σ², τ² and μ. He claims that a value < 1 should be thought of as indicative of no conflict, whereas values of 4 or more would certainly indicate a conflict to be taken seriously.

A first problem with (3) is the somewhat odd-looking denominator. A more natural choice of normalization seems to be

c^2_{θ_i} = (μ − ȳ_i)² / (τ² + σ²/n_i).    (5)

In a simplified situation where σ², τ² and μ are given numbers, c^2_{θ_i} is χ²_1 distributed pre-experimentally. Hence, in this case we can argue that values of c^2_{θ_i} exceeding 4 do indicate a serious conflict.
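The difference between the normalizations in (3) and (5) is easy to see by simulation. In this sketch (our illustration, with arbitrary fixed hyperparameter values), both measures are drawn pre-experimentally: since ȳ_i is marginally N(μ, τ² + σ²/n_i), the measure c² averages close to 1, as a χ²_1 variable should, while the cross term 2τσ/√n_i in the denominator of (3) pulls c¹ below that.

```python
import random
from math import sqrt

def c1(mu, tau, sigma, ybar_i, n_i):
    """O'Hagan's conflict measure (3)."""
    return (mu - ybar_i) ** 2 / (tau + sigma / sqrt(n_i)) ** 2

def c2(mu, tau, sigma, ybar_i, n_i):
    """Measure (5), normalized by the pre-experimental variance of mu - ybar_i."""
    return (mu - ybar_i) ** 2 / (tau ** 2 + sigma ** 2 / n_i)

# Pre-experimental check with fixed hyperparameters (illustrative values):
# theta ~ N(mu, tau^2) and ybar | theta ~ N(theta, sigma^2/n), so
# mu - ybar ~ N(0, tau^2 + sigma^2/n) and c2 is exactly chi^2_1 (mean 1).
rng = random.Random(1)
mu, tau, sigma, n = 0.0, 1.0, 1.0, 4
draws1, draws2 = [], []
for _ in range(20000):
    theta = rng.gauss(mu, tau)
    ybar = rng.gauss(theta, sigma / sqrt(n))
    draws1.append(c1(mu, tau, sigma, ybar, n))
    draws2.append(c2(mu, tau, sigma, ybar, n))
mean_c1 = sum(draws1) / len(draws1)
mean_c2 = sum(draws2) / len(draws2)
```

With τ = σ = 1 and n = 4, the denominator of (3) is 2.25 against a numerator variance of 1.25, so c¹ has pre-experimental mean about 0.56 rather than 1.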
A second problem with the O'Hagan (2003) approach is that data are used twice: first in the computation of the posterior distribution in (4), and then in the evaluation of the conflict measure. One way to avoid this is to split the data into one part y^p used to obtain a posterior distribution for the parameters τ², μ of the parent (p) nodes' information contribution, and another part y^c used to obtain a posterior distribution for the parameter σ² of the child (c) nodes' information contribution. In the evaluation of the conflict, we use a data vector y^c_i, defined as the components of y_i = (y_i1, ..., y_{in_i}) present in y^c.

A third problem concerns the way in which the use of the posterior distributions of the nuisance parameters σ², τ² and μ affects the level of conflict. It is not at all obvious that the median construction (4) normalizes the conflict in a stable and sensible way. We suggest as an alternative to construct two distributions g^p and g^c representing the two different information sources for θ_i, N(μ, τ²) and N(ȳ^c_i, σ²/n_i), integrated over the posterior distributions of τ², μ given y^p, respectively σ² given y^c. This explains the integrated posterior distributions (ipd) in the following conflict measure, analogous to (5), between g^p and g^c:

c^{2,y^p,y^c,ipd}_{θ_i} = (E_{g^p}(θ_i) − E_{g^c}(θ_i))² / (var_{g^p}(θ_i) + var_{g^c}(θ_i)).    (6)

By a conditional expectation and variance argument, for the two information sources mentioned, this simplifies to

c^{2,y^p,y^c,ipd}_{θ_i} = (E(μ | y^p) − ȳ^c_i)² / (E(τ² | y^p) + var(μ | y^p) + E(σ² | y^c)/n_i).    (7)

Note the additional denominator term var(μ | y^p) of (7) as opposed to (5).

The ipd construction can be generalized to conflicts concerning arbitrary nodes in the hierarchical network. When σ², τ² and μ are fixed, the posterior distributions are degenerate.
Then (7) for y^c_i = y_i coincides with (5) and is hence suitably normalized. However, when these parameters are random, the variance terms in the denominator of c^{2,y,med}_{θ_i} capture only part of the pre-experimental variability of the numerator, while the integration over the posterior distributions ensures that the variance terms in the denominator of (7) approximately reflect the different sources of pre-experimental variability of the numerator.

The two basic conflict measures (3) and (5), the various data splittings, and the possibility to choose between the median and the ipd approach give a large number of possibilities for assessing conflict. In this paper, we investigate these possibilities through MCMC simulations, both with respect to a false warning probability or significance level, and with respect to calibrated detection probabilities. We consider those methods that hit reasonably close to an intended false warning probability as greatly preferable, since these methods may make a computationally costly empirical normalization step unnecessary. Among these methods, we prefer those that have the highest detection probability.

3. Data-splitting approaches

The consequences of double use of data can be impossible to assess. This motivates the introduction of different data-splitting approaches, designed to avoid this. Visualize the model (1) with the nodes of the parameters σ², τ² and μ in the first row, the nodes of θ_i, i = 1, ..., k, in the second row, and the nodes of the transposed of y_i, i = 1, ..., k, as columns in the third row. A horizontal splitting of the data y would be achieved by letting

y^p = (y_11, ..., y_{1m_1}, ..., y_k1, ..., y_{km_k}),
y^c = (y_{1,m_1+1}, ..., y_{1n_1}, ..., y_{k,m_k+1}, ..., y_{kn_k}),    (8)

where 1 ≤ m_i < n_i for i = 1, ...
, k. Let y^c_i = (y_{i,m_i+1}, ..., y_{in_i}), i = 1, ..., k. Furthermore, let c_{θ_i}((τ², μ); (σ², y_i)) be either of the two conflict measures c^1_{θ_i} and c^2_{θ_i} given by (3) and (5). To avoid the double use of data, the approach of (4) could be replaced by

c^{y^p,y^c,med}_{θ_i} = M_{(τ²,μ | y^p) × (σ² | y^c)}(c_{θ_i}((τ², μ); (σ², y^c_i))).    (9)

By running a suitable MCMC algorithm twice, to obtain posterior samples of the parameters σ², τ² and μ given y^p and y^c, respectively, we could estimate all k conflicts c^{y^p,y^c,med}_{θ_i}, i = 1, ..., k. Also note that when using (8), y^p and y^c can be interchanged, and the corresponding quantity be estimated from the same two posterior samples of σ², τ² and μ. If, for n_i even, m_i = n_i/2, i = 1, ..., k, equal weights should be allocated to these two parallel estimates.

A vertical splitting avoiding the double use of data would be achieved by letting, for 1 ≤ l < k,

y^p = (y_11, ..., y_{1n_1}, ..., y_l1, ..., y_{ln_l}),
y^c = (y_{l+1,1}, ..., y_{l+1,n_{l+1}}, ..., y_k1, ..., y_{kn_k}).    (10)

Let y^c_i = y_i, i = l+1, ..., k. By applying this splitting in (9), we can estimate from two posterior samples the conflicts c^{y^p,y^c,med}_{θ_i}, i = l+1, ..., k. The remaining conflicts c^{y^p,y^c,med}_{θ_i}, i = 1, ..., l, are estimated from the same two posterior samples by interchanging y^p and y^c. Assume that we are especially interested in a possible conflict at a specific θ node, and that we need maximum data to arrive at the posterior distribution of the parameters μ and τ². We then denote this node by θ_k, and choose l = k − 1.

4. Theoretical comparisons

We start this section by giving some needed notation. In our simulation experiments, which we will return to in section 5, we will compare the O'Hagan (2003) conflict measure (CM) c^1_{θ_k} given by (3), in the following denoted by CM_1, with c^2_{θ_k} given by (5) (CM_2).
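Given the two posterior samples, the estimate of (9) reduces to a median over per-draw conflict values. A minimal sketch (ours), using c² as the conflict measure; the draw lists are placeholders for the output of whatever MCMC sampler produced (τ², μ) given y^p and σ² given y^c:

```python
from statistics import median

def median_conflict_c2(mu_draws, tau2_draws, sigma2_draws, ybar_ci, n_ci):
    """Estimate (9) with conflict measure c2: the median over paired
    posterior draws of (mu - ybar_ci)^2 / (tau2 + sigma2 / n_ci),
    where (mu, tau2) are drawn given y^p and sigma2 given y^c."""
    conflicts = [
        (mu - ybar_ci) ** 2 / (tau2 + sigma2 / n_ci)
        for mu, tau2, sigma2 in zip(mu_draws, tau2_draws, sigma2_draws)
    ]
    return median(conflicts)
```

The same two posterior samples also yield the estimate with y^p and y^c interchanged, simply by swapping which sample supplies (μ, τ²) and which supplies σ².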
Secondly, we will compare his posterior approach (PA) of evaluating the median posterior conflict, as in (4) (PA_1), with the ipd approach presented in (7) (PA_2). We will also give results regarding data splitting (S) by comparing O'Hagan's 'splitting approach' of not splitting data (S_1) with two horizontal splitting schemes. The first one (S_2) is based on (8), with m_i = n_i/2, i = 1, ..., k, for n_i even. The second one (S_3) relies on splitting the data in the same way, but attempts to make better use of the data: first, the conflict c under splitting S_2 above is computed. Then the roles of y^p and y^c are reversed, and a second S_2 conflict c' is computed. The S_3 approach is to use (c + c')/2 as the conflict estimate. We also compare with two vertical splitting approaches, based on (10), with l = k/2 (S_4) and l = k − 1 (S_5), respectively. Note that S_4 is only defined for even k. Altogether we have 2 × 2 × 5 = 20 combinations.

What we call the test model is the model (1) with the nuisance parameters set at their prior expected values, σ² = σ₀², τ² = τ₀², and μ = μ₀. Denote the actual significance level by α, using a