doi:10.1111/j.1467-9469.2007.00560.x
© Board of the Foundation of the Scandinavian Journal of Statistics 2007. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA. Vol. 34: 816–828, 2007

A Robust Conflict Measure of Inconsistencies in Bayesian Hierarchical Models

FREDRIK A. DAHL
Health Services Research Unit, Akershus University Hospital, and Department of Mathematics, University of Oslo

JØRUND GÅSEMYR and BENT NATVIG
Department of Mathematics, University of Oslo
ABSTRACT. O'Hagan (Highly Structured Stochastic Systems, Oxford University Press, Oxford, 2003) introduces some tools for criticism of Bayesian hierarchical models that can be applied at each node of the model, with a view to diagnosing problems of model fit at any point in the model structure. His method relies on computing the posterior median of a conflict index, typically through Markov chain Monte Carlo simulations. We investigate a Gaussian model of one-way analysis of variance, and show that O'Hagan's approach gives unreliable false warning probabilities. We extend and refine the method, especially avoiding double use of data by a data-splitting approach, accompanied by theoretical justifications from a non-trivial special case. Through extensive numerical experiments we show that our method detects model misspecification about as well as the method of O'Hagan, while retaining the desired false warning probability for data generated from the assumed model. This also holds for Student's t and uniform distribution versions of the model.

Key words: double use of data, Markov chain Monte Carlo simulations, model evaluation, one-way analysis of variance
1. Introduction
Modern computer technology combined with Markov chain Monte Carlo (MCMC) algorithms has made it possible to analyse complex Bayesian hierarchical models. The resulting popularity of complex models has also increased the need for ways of evaluating such models. In a frequentist setting, this is often done by way of p-values, which quantify how surprising the given data set is under the assumed model. By construction, a frequentist p-value is pre-experimentally uniformly distributed on the unit interval, where low values are interpreted as surprising. In the present paper, the term pre-experimental refers to the distribution under the assumed model prior to using the data.

Several Bayesian p-values have been suggested over the last few decades. The so-called prior predictive p-value of Box (1980) measures the degree of surprise of the data, according to some metric of choice, under a probability measure defined by the product of the prior and the likelihood given the model. It therefore differs from a frequentist p-value through the introduction of the prior distribution. The prior predictive p-value is a natural choice in cases where the prior of a Bayesian model represents our true beliefs about the distribution of our parameters prior to seeing data. Usually, however, we apply quite vague priors that represent general uncertainty about parameters that could, in principle, be arbitrarily precisely estimated with enough data. In these cases, sampling under the prior makes little sense, and is not even defined for improper priors. Rubin (1984) therefore introduced posterior predictive p-values, which rely on sampling hypothetical future replications from the posterior distribution. This construction
also allows metrics that evaluate discrepancies between data and parameter values (see Gelman et al., 1996). However, posterior predictive p-values use data twice: both directly through the discrepancy function, and indirectly by sampling from the posterior distribution. This has been criticized by Dey et al. (1998) and Bayarri & Berger (2000), both coming up with alternative approaches. The former paper introduces a simulation-based approach where the posterior distribution given the observed data is compared with a medley of posterior distributions given replicated data sets generated from the prior distribution. Hence, the approach is essentially in accordance with the prior predictive approach. The latter paper suggests two variants, the conditional predictive p-value and the partial posterior predictive p-value, both designed to avoid the double use of data by eliminating the influence of a chosen test statistic on the posterior distribution.

Robins et al. (2000) prove that the pre-experimental asymptotic distribution of the posterior predictive p-value is more concentrated around 1/2 than a uniform, as opposed to the two variants of Bayarri & Berger (2000). Hence, as also pointed out by Meng (1994) and Dahl (2006), posterior predictive p-values tend to be conservative in the sense that extreme values get too low probability. Hjort et al. (2006) analyse this in depth, and design a double simulation scheme that alleviates the problem. This scheme can be thought of as essentially treating the posterior predictive p-value as a test statistic in itself, and using it in an extensive prior predictive p-value computation.

In model choice problems the task is to choose the best model from a given set of candidates. Bayes factors (see Kass & Raftery, 1995) provide a useful methodology for such problems. Information criteria give a different approach to model choice, based on weighing model fit against the number of free parameters. The Bayesian information criterion (BIC) was defined by Schwarz (1978) and more recently analysed by Clyde & George (2004). A different information criterion used for Bayesian models is the so-called deviance information criterion (DIC) (see Spiegelhalter et al., 2002). Although model evaluation and model choice are related, these tasks are different, and model choice methods cannot readily be applied for the purpose of model evaluation.

The variants of Bayarri & Berger (2000) work well in some simple cases, and so does the partial posterior predictive p-value for a simple hierarchical model, as demonstrated in Bayarri & Castellanos (2004). However, it seems difficult to use this method to criticize arbitrary aspects of Bayesian hierarchical models. An approximate cross-validation approach aimed at criticizing such models is given in Marshall & Spiegelhalter (2003). Dey et al. (1998) introduced tools for evaluating different parts of such models. Similarly, in this paper we extend and refine a tool suggested by O'Hagan (2003) for evaluating inconsistencies of a model, through analysis of what he calls information contributions. This is a flexible tool that can be used at any node in the model network. However, in the present paper we restrict our attention to location parameters. Under suitable conditions, our conflict evaluation for a given node will pre-experimentally be a squared normal variable. Our main hypothesis is that this is close to being true for a larger class of models. This gives a surprise index which is similar to a frequentist p-value. In cases where we have domain knowledge that makes us suspect a given node, we test that one. Otherwise, one should make Bonferroni-like adjustments to the significance level to control the overall false alarm probability. This does not mean that we advocate basing the model-building process on a formal hypothesis-testing scheme alone. Rather, we envisage an informal procedure, where the conflict analysis suggests points in the model structure that might be problematic. However, without reasonable control over the pre-experimental distribution of the conflicts in the model, it would be difficult to use this tool in practice without a computationally demanding empirical normalization.
The paper is laid out as follows: in section 2 we explain the original idea of O'Hagan (2003) in the setting of a Gaussian hierarchical model, followed by our modifications of the method. Our modifications include the splitting of data, so as to avoid double use of it, and this is discussed further in section 3. Section 4 gives some theoretical results in a special case of our model. In section 5 we give results from our massive simulation experiments, and section 6 concludes the article. In the Appendix we give the proofs of the theoretical results in section 4.
2. Measuring conflict
O'Hagan (2003) introduces some tools for model criticism that can be applied at each node of a complex hierarchical or graphical model, with a view to diagnosing problems of model fit at any point in the model structure. In general, the model can be supposed to be expressed as a directed acyclic graph. To compare two unimodal densities/likelihoods he suggests the following procedure. First, normalize both densities to have unit maximum height. The height of both curves at their point of intersection is denoted by $z$. Then the suggested conflict measure is $c^1 = -2 \ln z$. In the present paper we consider, as O'Hagan (2003), the simple hierarchical model for several normal samples (one-way analysis of variance) to clarify what we see as problematic aspects of his approach. Observations $y_{ij}$ for $i = 1, \ldots, k$ and $j = 1, \ldots, n_i$ are available. The model has the form:

$$y_{ij} \mid \theta, \sigma^2 \sim_{\mathrm{ind}} N(\theta_i, \sigma^2), \quad i = 1, \ldots, k; \; j = 1, \ldots, n_i,$$
$$\theta_i \mid \mu, \tau^2 \sim_{\mathrm{ind}} N(\mu, \tau^2), \quad i = 1, \ldots, k, \qquad (1)$$

where $\theta = (\theta_1, \ldots, \theta_k)$, and is completed by a prior distribution for $\sigma^2$, $\tau^2$ and $\mu$.

In the model (1), consider the node for parameter $\theta_i$. In addition to its parents $\mu$ and $\tau^2$, it is linked to its child nodes $y_{i1}, \ldots, y_{in_i}$. The full conditional distribution of $\theta_i$ is given by:

$$p(\theta_i \mid y, \theta_{-i}, \sigma^2, \tau^2, \mu) \propto p(\theta_i \mid \mu, \tau^2) \prod_{j=1}^{n_i} p(y_{ij} \mid \theta_i, \sigma^2), \qquad (2)$$

where $y = (y_{11}, \ldots, y_{kn_k})$ is the complete set of data, and $\theta_{-i} = (\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_k)$. This shows how each of the $n_i + 1$ distributions can be considered as a source of information about $\theta_i$. When we are considering the possibility of conflict at the $\theta_i$ node, we must consider each of these contributing distributions as functions of $\theta_i$. In the present model, contrasting the information about $\theta_i$ from the parent nodes with that from the child nodes, the conflict measure simplifies to:

$$c^1_i = \frac{(\mu - \bar{y}_i)^2}{(\tau + \sigma/\sqrt{n_i})^2}, \qquad (3)$$

where

$$\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij},$$

noting that the last $n_i$ factors of (2) can be written as $p(\bar{y}_i \mid \theta_i, \sigma^2)$. When the parameters $\sigma^2$, $\tau^2$ and $\mu$ are given by prior distributions, O'Hagan (2003) suggests using MCMC to estimate the quantity

$$c^{1, y, \mathrm{med}}_i = M_{\sigma^2, \tau^2, \mu \mid y}(c^1_i), \qquad (4)$$
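To make the intersection construction concrete, the closed form (3) can be checked numerically for two normal information sources. The following sketch is ours, not the authors' (function names are hypothetical): it normalizes both densities to unit maximum height, locates their intersection between the two modes by bisection, and returns $-2 \ln z$.

```python
import math

def c1_by_intersection(mu, tau, ybar, se, iters=200):
    """O'Hagan's construction: unit-height-normalize both densities,
    find the height z at their intersection, return -2*ln(z)."""
    f = lambda x: math.exp(-0.5 * ((x - mu) / tau) ** 2)   # parent source N(mu, tau^2)
    g = lambda x: math.exp(-0.5 * ((x - ybar) / se) ** 2)  # child source N(ybar, se^2)
    lo, hi = min(mu, ybar), max(mu, ybar)
    # bisection: f - g changes sign exactly once between the two modes
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(lo) - g(lo)) * (f(mid) - g(mid)) <= 0:
            hi = mid
        else:
            lo = mid
    return -2.0 * math.log(f(0.5 * (lo + hi)))

def c1_closed_form(mu, tau, ybar, se):
    """Equation (3), with se standing for sigma / sqrt(n_i)."""
    return (mu - ybar) ** 2 / (tau + se) ** 2
```

For example, with $\mu = 0$, $\tau = 1$, $\bar{y}_i = 2$ and $\sigma/\sqrt{n_i} = 0.5$, both routines give $16/9 \approx 1.78$, illustrating that the intersection-height construction and (3) agree for normal sources.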
where $M$ denotes the median under the posterior distribution of $\sigma^2$, $\tau^2$ and $\mu$. He claims that a value < 1 should be thought of as indicative of no conflict, whereas values of 4 or more would certainly indicate a conflict to be taken seriously.

A first problem with (3) is the somewhat odd-looking denominator. A more natural choice of normalization seems to be

$$c^2_i = \frac{(\mu - \bar{y}_i)^2}{\tau^2 + \sigma^2/n_i}. \qquad (5)$$

In a simplified situation where $\sigma^2$, $\tau^2$ and $\mu$ are given numbers, $c^2_i$ is $\chi^2_1$ distributed pre-experimentally. Hence, in this case we can argue that values of $c^2_i$ exceeding 4 do indicate a serious conflict. A second problem with the O'Hagan (2003) approach is that data are used twice: first in the computation of the posterior distribution in (4), and then in the evaluation of the conflict measure. One way to avoid this is to split the data into one part $y^p$, used to obtain a posterior distribution for the parameters $\tau^2, \mu$ of the parent (p) nodes' information contribution, and another part $y^c$, used to obtain a posterior distribution for the parameter $\sigma^2$ of the child (c) nodes' information contribution. In the evaluation of the conflict, we use a data vector $y^c_i$, defined as the components of $y_i = (y_{i1}, \ldots, y_{in_i})$ present in $y^c$.

A third problem concerns the way in which the use of the posterior distributions of the nuisance parameters $\sigma^2$, $\tau^2$ and $\mu$ affects the level of conflict. It is not at all obvious that the median construction (4) normalizes the conflict in a stable and sensible way. We suggest as an alternative to construct two distributions $g^p$ and $g^c$ representing the two different information sources for $\theta_i$, $N(\mu, \tau^2)$ and $N(\bar{y}^c_i, \sigma^2/n_i)$, integrated over the posterior distributions of $\tau^2, \mu$ given $y^p$, respectively $\sigma^2$ given $y^c$. This explains the integrated posterior distributions (ipd) in the following conflict measure, analogous to (5), between $g^p$ and $g^c$:

$$c^{2, y^p, y^c, \mathrm{ipd}}_i = \frac{(E_{g^p}(\theta_i) - E_{g^c}(\theta_i))^2}{\mathrm{var}_{g^p}(\theta_i) + \mathrm{var}_{g^c}(\theta_i)}. \qquad (6)$$

By a conditional expectation and variance argument, for the two information sources mentioned, this simplifies to

$$c^{2, y^p, y^c, \mathrm{ipd}}_i = \frac{(E(\mu \mid y^p) - \bar{y}^c_i)^2}{E(\tau^2 \mid y^p) + \mathrm{var}(\mu \mid y^p) + E(\sigma^2 \mid y^c)/n_i}. \qquad (7)$$

Note the additional denominator term $\mathrm{var}(\mu \mid y^p)$ of (7) as opposed to (5).

The ipd construction can be generalized to conflicts concerning arbitrary nodes in the hierarchical network. When $\sigma^2$, $\tau^2$ and $\mu$ are fixed, the posterior distributions are degenerate. Then (7) for $y^c_i = y_i$ coincides with (5) and is hence suitably normalized. However, when these parameters are random, the variance terms in the denominator of $c^{2, y, \mathrm{med}}_i$ capture only part of the pre-experimental variability of the numerator, while the integration over the posterior distributions ensures that the variance terms in the denominator of (7) approximately reflect the different sources of pre-experimental variability of the numerator.

The two basic conflict measures (3) and (5), the various data splittings, and the possibility to choose between the median and the ipd approach give a large number of possibilities for assessing conflict. In this paper we investigate these possibilities through MCMC simulations, both with respect to the false warning probability, or significance level, and with respect to calibrated detection probabilities. We consider those methods that hit reasonably close to an intended false warning probability as greatly preferable, since these methods may make a computationally costly empirical normalization step unnecessary. Among these methods, we prefer those that have the highest detection probability.
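As a computational aside, (7) can be estimated directly from the two MCMC runs by replacing the posterior moments with sample moments of the draws. The following standard-library sketch is ours, not the authors' code; the function and argument names are hypothetical.

```python
from statistics import fmean, pvariance

def ipd_conflict(mu_draws, tau2_draws, sigma2_draws, ybar_ci, n_i):
    """Sample-moment estimate of the ipd conflict (7) for group i.

    mu_draws, tau2_draws : posterior draws of mu and tau^2 given y^p
    sigma2_draws         : posterior draws of sigma^2 given y^c
    ybar_ci              : mean of the components of y_i present in y^c
    """
    numerator = (fmean(mu_draws) - ybar_ci) ** 2
    denominator = (fmean(tau2_draws)             # E(tau^2 | y^p)
                   + pvariance(mu_draws)         # var(mu | y^p)
                   + fmean(sigma2_draws) / n_i)  # E(sigma^2 | y^c) / n_i
    return numerator / denominator
```

With degenerate draws (all equal to fixed parameter values) the variance term vanishes and the estimate reduces to (5), consistent with the remark above.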
3. Data-splitting approaches

The consequences of double use of data can be impossible to assess. This motivates the introduction of different data-splitting approaches, designed to avoid this.

Visualize the model (1) with the nodes of the parameters $\sigma^2$, $\tau^2$ and $\mu$ in the first row, the nodes of $\theta_i$, $i = 1, \ldots, k$, in the second row, and the nodes of the transposed $y_i$, $i = 1, \ldots, k$, as columns in the third row. A horizontal splitting of the data $y$ would be achieved by letting

$$y^p = (y_{11}, \ldots, y_{1m_1}, \ldots, y_{k1}, \ldots, y_{km_k}),$$
$$y^c = (y_{1,m_1+1}, \ldots, y_{1n_1}, \ldots, y_{k,m_k+1}, \ldots, y_{kn_k}), \qquad (8)$$

where $1 \le m_i < n_i$ for $i = 1, \ldots, k$. Let $y^c_i = (y_{i,m_i+1}, \ldots, y_{in_i})$, $i = 1, \ldots, k$. Furthermore, let $c_i((\tau^2, \mu); (\sigma^2, y_i))$ be any of the two conflict measures $c^1_i$ and $c^2_i$ given by (3) and (5). To avoid the double use of data, the approach of (4) could be replaced by

$$c^{y^p, y^c, \mathrm{med}}_i = M_{(\tau^2, \mu \mid y^p) \times (\sigma^2 \mid y^c)}\bigl(c_i((\tau^2, \mu); (\sigma^2, y^c_i))\bigr). \qquad (9)$$

By running a suitable MCMC algorithm twice, to obtain posterior samples of the parameters $\sigma^2$, $\tau^2$ and $\mu$ given $y^p$ and $y^c$, respectively, we could estimate all $k$ conflicts $c^{y^p, y^c, \mathrm{med}}_i$, $i = 1, \ldots, k$. Also note that when using (8), $y^p$ and $y^c$ can be interchanged, and the corresponding quantity estimated from the same two posterior samples of $\sigma^2$, $\tau^2$ and $\mu$. If, for $n_i$ even, $m_i = n_i/2$, $i = 1, \ldots, k$, equal weights should be allocated to these two parallel estimates.

A vertical splitting avoiding the double use of data would be achieved by letting, for $1 \le l < k$,

$$y^p = (y_{11}, \ldots, y_{1n_1}, \ldots, y_{l1}, \ldots, y_{ln_l}),$$
$$y^c = (y_{l+1,1}, \ldots, y_{l+1,n_{l+1}}, \ldots, y_{k1}, \ldots, y_{kn_k}). \qquad (10)$$

Let $y^c_i = y_i$, $i = l+1, \ldots, k$. By applying this splitting in (9), we can estimate from two posterior samples the conflicts $c^{y^p, y^c, \mathrm{med}}_i$, $i = l+1, \ldots, k$. The remaining conflicts $c^{y^p, y^c, \mathrm{med}}_i$, $i = 1, \ldots, l$, are estimated from the same two posterior samples by interchanging $y^p$ and $y^c$. Assume that we are especially interested in a possible conflict at a specific $\theta$ node, and that we need maximum data to arrive at the posterior distribution of the parameters $\mu$ and $\tau^2$. We then denote this node by $\theta_k$, and choose $l = k - 1$.
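Representing the data as a list of per-group observation lists, the two splitting schemes can be sketched as follows (a hypothetical illustration of (8) and (10), not the authors' code):

```python
def horizontal_split(groups, m):
    """Split (8): the first m[i] observations of each group go to y^p,
    the remaining n_i - m[i] observations of each group go to y^c."""
    y_p = [g[:mi] for g, mi in zip(groups, m)]
    y_c = [g[mi:] for g, mi in zip(groups, m)]
    return y_p, y_c

def vertical_split(groups, l):
    """Split (10): groups 1, ..., l go to y^p; groups l+1, ..., k go to y^c."""
    return groups[:l], groups[l:]
```

Interchanging the two returned parts then gives the parallel estimate described above, at no extra simulation cost beyond the second MCMC run.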
4. Theoretical comparisons
We start this section by giving some needed notation. In our simulation experiments, which we will return to in section 5, we will compare the O'Hagan (2003) conflict measure (CM) $c^1_k$ given by (3), in the following denoted by CM$_1$, with $c^2_k$ given by (5) (CM$_2$). Secondly, we will compare his posterior approach (PA) of evaluating the median posterior conflict, as in (4) (PA$_1$), with the ipd approach presented in (7) (PA$_2$). We will also give results regarding data splitting ($S$) by comparing O'Hagan's 'splitting approach' of not splitting data ($S_1$) with two horizontal splitting schemes. The first one ($S_2$) is based on (8), with $m_i = n_i/2$, $i = 1, \ldots, k$, for $n_i$ even. The second one ($S_3$) relies on splitting the data in the same way, but attempts to make better use of the data: first, the conflict $c$ under splitting $S_2$ above is computed. Then the roles of $y^p$ and $y^c$ are reversed, and a second $S_2$ conflict $c'$ is computed. The $S_3$ approach is to use $(c + c')/2$ as the conflict estimate. We also compare with two vertical splitting approaches, based on (10), with $l = k/2$ ($S_4$) and $l = k - 1$ ($S_5$), respectively. Note that $S_4$ is only defined for even $k$. Altogether we have $2 \times 2 \times 5 = 20$ combinations.

What we call the test model is the model (1) with the nuisance parameters set at their prior expected values, $\sigma^2 = \sigma_0^2$, $\tau^2 = \tau_0^2$ and $\mu = \mu_0$. Denote the actual significance level by $\alpha$, using a