A graphical selection method for parametric models in noisy inhomogeneous regression

arXiv:astro-ph/0205536v1 30 May 2002

Mon. Not. R. Astron. Soc. 000, 000–000 (0000) Printed 15 January 2014 (MN LaTeX style file v1.4)

Nicolai Bissantz and Axel Munk
Institut für Mathematische Stochastik, Universität Göttingen, Lotzestr. 13, 37083 Göttingen, Germany

ABSTRACT

A common problem in physics is to fit regression data by a parametric class of functions, and to decide whether a certain functional form allows for a good fit of the data. Common goodness of fit methods are based on the calculation of the distribution of certain statistical quantities under the assumption that the model under consideration holds true. This procedure has methodological flaws: a good "fit", even though the model is wrong, might be due to over-fitting, or to the fact that the chosen statistical criterion is not powerful enough against the particular deviation present between model and true regression function. This causes particular difficulties when models with different numbers of parameters are to be compared. Therefore the number of parameters is often penalised additionally. We provide a methodology which circumvents these problems to some extent. It is based on the consideration of the error distribution of the goodness of fit criterion under a broad range of possible models, and not only under the assumption that a given model holds true. We present a graphical method to decide for the most evident model from a range of parametric models of the data. The method allows one to quantify statistical evidence for the model (up to some distance between model and true regression function), and not only absence of evidence against it, as common goodness of fit methods do.
Finally we apply our method to the problem of recovering the luminosity density of the Milky Way from a de-reddened COBE/DIRBE L-band map. We present statistical evidence for flaring of the stellar disc inside the solar circle.

Key words: methods: data analysis - methods: statistical - Galaxy: disc - Galaxy: structure.

1 INTRODUCTION

Often one is confronted with the problem of reconstructing an unknown function $f(t_i)$ from noisy observations $y_i = y(t_i)$, $i = 1,\dots,N$. Astrophysical examples include reverberation mapping of gas in active galactic nuclei and recovery of the spatial (three-dimensional) luminosity density of a galaxy from blurred observations of its surface brightness. See e.g. Lucy (1994) for more examples of astronomical inverse problems. In this paper we are concerned with a new method to compare several competing parametric models for the regression function $f$.

Due to the noisy measurements it is tempting to assume that $y_i = f(t_i) + \varepsilon_i$, where the $\varepsilon_i$ denote some random noise and $f(t_i)$ the expected value of $y_i$, i.e. $E[y_i] = f(t_i)$. In particular we allow for different error distributions of the $\varepsilon_i$, which entails inhomogeneous variance patterns, viz. $V[\varepsilon_i] = \sigma_i^2$, as will be the case in our example of de-projecting the de-reddened COBE/DIRBE L-band surface brightness map of Spergel et al. (1996), as discussed by Bissantz & Munk (2001, [BM1]).

It is a common procedure to fit a class of functions $U = \{f_\vartheta : \vartheta \in \Theta\}$ (parametric model) to the data $y_i$. The parametric model may depend on a parameter $\vartheta$, where $\vartheta \in \Theta \subseteq \mathbb{R}^d$. A popular method to select a "best-fitting" $\vartheta$ from $\Theta$ is to minimise the empirical mean squared error (MSE)

$$Q_N^2(\vartheta) := \sum_{i=1}^{N} \left(y_i - f_\vartheta(t_i)\right)^2 \qquad (1)$$

or weighted variants of it. This gives $\hat\vartheta$, the least squares estimator (LSE) of $\vartheta$. Other measures of goodness of fit are, e.g.,
$L_1$-error criteria, where the absolute deviation between $y_i$ and $f_\vartheta(t_i)$ is considered (Seber & Wild 1989). A small value of $Q_N^2(\hat\vartheta)$ is often used as an indication of a good explanation of the observations by the model $f_{\hat\vartheta}$. Note that in regression models where the noise is inhomogeneous the quantity $Q_N^2(\hat\vartheta)$ is often useless (cf. [BM1] for an explanation) and more subtle methods have to be applied.

An advantage of the parametric fitting methodology over nonparametric curve estimation, i.e. approximating the data by arbitrary functions (e.g. by splines, orthogonal series or wavelets, cf. Efromovich, 1999 or Hart, 1997), is that physical reasoning resulting from a theory often suggests such a class of functions $U$. Furthermore, subsequent data analysis and interpretation become very simple once a proper $f_{\hat\vartheta}$ is selected. Hence it is an important task to pre-specify $U$ correctly in order to obtain a reasonable fit.

Therefore in this paper we discuss the problem of evaluating the goodness of fit of a parametric model $U$. Moreover, we offer a graphical method which allows one to select a proper model $U$ from a class of different models $\mathcal{U} = \{U_k\}_{k=1,\dots,l}$, say.

A common procedure is to assume that the model holds, and to test whether the observed data give reason to reject the model. This type of goodness of fit test is performed by evaluating the probability distribution of a pre-specified measure of discrepancy, such as $Q_N^2(\hat\vartheta)$. This is done under the assumption that $U$ holds true.
Then, when this measure exceeds a certain quantity, the model $U$ is rejected.

One problem of such methods is that a large data set leads essentially to rejection of any model $U$ (an illustrative discussion can be found in Berger, 1985), because the "real world" is never exactly described by such a model and, as the number of observations increases, statistical methods will always detect these deviations between the model and "reality". Conversely, the selected statistical criterion may lead to a decision in favour of $U$ (although it is wrong), because it is not capable of detecting important deviations from $U$, or because the decision is affected by quantities which are not captured in the model $U$ (e.g. correlation between the $y_i$). Another problem can be over-fitting of data by models with too large a number of parameters. Therefore, various methods have been suggested which penalise the number of parameters, i.e. the complexity of a model (Akaike, 1974, Burnham et al., 1998, or Schwarz, 1978).

In this paper, we suggest a methodology which aims to avoid these problems by considering the distribution of a discrepancy measure such as $Q_N^2(\hat\vartheta)$ under all "possible" functions $f$. This extends the method given in [BM1] to the more realistic case where the "true" function $f$ is not restricted to be in $U$. Furthermore, a graphical method will be presented which allows one to select the most appropriate among several competing models $U_i$. With our method, this is still possible if these models have different numbers of parameters.

In the next section we will describe the method and its algorithmic implementation, the wild bootstrap. Based on the theory presented in sect. 2, we suggest in sect. 3 a graphical method to assess the validity of $U$ as well as to compare different models. This method is denoted as p-value curve analysis. In sect.
4 our method is applied to a near-infrared [NIR] L-band map of the Milky Way [MW], and two different models of the spatial luminosity distribution are compared. One of the models includes a flaring disc component. We analyse the models' p-value curves, and find that flaring in the disc improves the fit to the data.

2 A NEW METHOD OF MODEL SELECTION

In section 2.1 we briefly recall the methodology suggested in [BM1] and extend it to the situation where $f$ is not in the model $U$. This will be used to compute p-value curves, a graphical method of model diagnostics, which was introduced by Munk & Czado (1998) in a different context. In sect. 2.2 we describe the practical application of the method.

2.1 Basic theory of the method

We begin with an introduction to the basic principles of our method. As mentioned above, $Q_N^2(\hat\vartheta)$ fails to be a valid criterion for goodness of fit in inhomogeneous models [BM1]. Instead we replace the pure residuals $y_i - f_{\hat\vartheta}(t_i)$ with smoothed residuals, to allow for a valid statistical analysis. For the smoothing step we require an injective linear integral operator $\mathbf{T}$ with kernel $T$, viz.

$$g(w) = \mathbf{T}(f)(w) = \int T(w,v)\, f(v)\, dv$$

which maps the function $f$ to be recovered onto $g$. In principle any injective operator $\mathbf{T}$ is a valid option for the smoothing; however, a good choice is driven by aspects such as efficiency and simplicity. In our example (cf. sect. 4) we introduce "cumulative smoothing" with $T(w,v) = \min(w,v)$. An extensive simulation study by Munk & Ruymgaart (1999) revealed this smoothing kernel to be a reasonable choice which yields a procedure capable of detecting a broad range of deviations from the class of functions $U = \{f_\vartheta : \vartheta \in \Theta\}$.

A measure of the discrepancy between the "true" $f$ and $U$ is the transformed distance

$$D^2(f) = \min_{\vartheta \in \Theta} \|\mathbf{T}(f - f_\vartheta)\|^2 \qquad (2)$$

where the norm refers to some $L_2$-norm. Now assume that the minimum in eq.
2 is achieved at a parameter vector $\vartheta^* = \vartheta^*(g) \in \Theta$. Because $\vartheta^*$ is unknown it has to be estimated from the data. This can be done by numerical minimisation of the empirical counterpart of the r.h.s. of eq. 2,

$$\hat D^2 := \min_{\vartheta \in \Theta} \|\mathbf{T} f_\vartheta - \hat g\|^2$$

where

$$\hat g(u) = N^{-1} \sum_{i=1}^{N} y_i\, T(u, t_i)$$

is an estimate of $g$ using the noisy data $y_i$. The reasoning behind this approach is that for a sufficiently large number $N$ of observations (in our example $N = 4800$) it can be shown that $\hat g$ converges in probability to the true (but unknown) function $g$, independently of whether a parametric model $U$ is valid or not. On the other hand the empirical minimiser $\hat\vartheta_T$ estimates the best possible fit to $\hat g$ by the model $U$. The resulting estimator $\hat\vartheta_T$ is denoted as a smoothed minimum distance estimator (SMDE) and has the property that, if the true function $f = f_{\vartheta^*}$ is in $U$, $\hat\vartheta_T \to \vartheta^*$ as the sample size increases. For detailed proofs we refer to Munk & Ruymgaart (1999).

Note that $\vartheta^*$ is the "true" best-fitting parameter vector, which could only be determined if the data were free of noise, whereas $\hat\vartheta_T$ is an estimate of the best-fitting parameter vector using the noisy data. Here and in the following, quantities with a hat are estimated from the noisy data, whereas those without a hat are the "true" functions to be recovered.

Munk & Ruymgaart (1999) showed that the probabilistic limiting behaviour of $\hat D^2$ depends on whether $f$ belongs to the model $U$ under investigation. More precisely, when $f$ belongs to $U$ the distribution of $N \hat D^2$ is for large $N$ approximately that of

$$\sum_{i=1}^{\infty} \lambda_i \chi_i^2 \qquad (3)$$

where $\chi_i^2$ denotes a sequence of independent squares of standard normal random variables and $\lambda_i \ge 0$ is a sequence of real numbers, s.t.
$\sum_{i=1}^{\infty} \lambda_i^2 < \infty$, which depend on $\vartheta^*$, the distribution $\mathcal{L}$ of the errors $\varepsilon_i$ and the operator $\mathbf{T}$.

In contrast, if $f$ does not belong to $U$, we have $D^2 > 0$ and $N^{1/2}\left(\hat D^2 - D^2\right)$ tends for large $N$ to a centred normal distribution with variance $\sigma^2_{\mathbf{T},\mathcal{L},\vartheta^*}$, depending on $\mathbf{T}$, $\vartheta^*$, and $\mathcal{L}$.

Observe that we obtain two different types of distributions, according to whether the "true" (unknown) function $f$ is in the model $U$ or not. Because of the complicated dependency of the $(\lambda_i)_{i \in \mathbb{N}}$ and $\sigma^2_{\mathbf{T},\mathcal{L},\vartheta^*}$ on $\mathbf{T}$, $\vartheta^*$, and $\mathcal{L}$, a resampling algorithm should be applied in order to approximate these limiting distributions. Stute et al. (1998) presented a wild bootstrap algorithm which can be used to approximate the law of $N \hat D^2$. Munk (1999) showed that this algorithm is also valid when $f$ does not belong to $U$, which is crucial for our paper. This algorithm will be carefully explained in sect. 2.2.

Figure 1. Binary probability distribution required in step 2 of the wild bootstrap algorithm. The ordinate gives the probability of the random number being $-(\sqrt{5}-1)/2$ and $(\sqrt{5}+1)/2$, respectively.

Recall that the subsequent bootstrap algorithm allows one to determine the probability distribution of the quantity of interest, $\hat D^2$. The general strategy of our method will be the following. Because $\hat D^2$ measures the distance between the model $U$ and the estimator $\hat g$ from noisy data, knowledge of the probability distribution of $\hat D^2$ (which will be determined by the subsequent bootstrap algorithm) allows us to quantify whether an observed value of $\hat D^2$ for a model $U$ is more likely than for a competing model $U'$, say. Even when none of these models is completely true (which is always the case in the real world), $\hat D^2$ quantifies the best possible approximation of $g$ by $U$ or $U'$ respectively.
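The estimator $\hat g$ and the statistic $\hat D^2$ of this section can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: it assumes an equispaced design on $[0,1]$, a hypothetical one-parameter model $f_\vartheta(t) = \vartheta t$, the cumulative smoothing kernel $T(w,v) = \min(w,v)$, and Riemann-sum approximations of both $\mathbf{T} f_\vartheta$ and the $L_2$-norm. The names `smooth` and `smde_linear` are invented for the illustration.

```python
import random


def smooth(values, t, grid):
    """Cumulative smoothing: N^-1 * sum_i values[i] * min(u, t[i]) for each u in grid."""
    n = len(t)
    return [sum(v * min(u, ti) for v, ti in zip(values, t)) / n for u in grid]


def smde_linear(y, t, grid):
    """SMDE for the hypothetical model f_theta(t) = theta * t (an assumption).

    Returns (theta_hat, d2_hat). d2_hat approximates the squared L2 distance
    between g_hat and T f_theta_hat by a mean over the grid points.
    """
    g_hat = smooth(y, t, grid)   # g_hat(u) built from the noisy data
    h = smooth(t, t, grid)       # T applied to the basis function t -> t
    # The model is linear in theta, so the minimiser has a closed form.
    theta_hat = sum(g * hj for g, hj in zip(g_hat, h)) / sum(hj * hj for hj in h)
    d2_hat = sum((g - theta_hat * hj) ** 2 for g, hj in zip(g_hat, h)) / len(grid)
    return theta_hat, d2_hat


if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    t = [(i + 0.5) / n for i in range(n)]
    grid = [(j + 0.5) / 50 for j in range(50)]
    # Inhomogeneous noise: the standard deviation grows with t.
    # The true function t**2 is deliberately outside the linear model.
    y = [ti ** 2 + rng.gauss(0.0, 0.05 * (1.0 + ti)) for ti in t]
    theta_hat, d2_hat = smde_linear(y, t, grid)
    print("theta_hat =", theta_hat, " D2_hat =", d2_hat)
```

For a model that is not linear in $\vartheta$, the closed-form minimiser would have to be replaced by a numerical minimisation over $\Theta$, as described in the text.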
2.2 Practical application of the method

We now introduce the resampling algorithm to approximate the law of $N \hat D^2$. The algorithm starts with the determination of the SMDE $\hat\vartheta_T$ and the residuals between this model and the data (step 1). The resampling part of the algorithm then follows in steps 2-5. The same algorithm is used in [BM1].

Step 1: (Generate residuals). Compute the residuals

$$\hat\varepsilon_i := y_i - f_{\hat\vartheta_T}(t_i), \quad i = 1,\dots,N$$

where $\hat\vartheta_T$ denotes a solution of the minimisation $\hat D^2 := \chi^2(\hat\vartheta_T) := \min_{\vartheta \in \Theta} \|\hat g - \mathbf{T} f_\vartheta\|^2$.

Step 2: (The "wild" part). Generate new random variables $c_i^*$, $i = 1,\dots,N$, which do not depend on the data, where each $c_i^*$ follows the two-point distribution which assigns probability $(\sqrt{5}+1)/(2\sqrt{5})$ to the value $-(\sqrt{5}-1)/2$ and probability $(\sqrt{5}-1)/(2\sqrt{5})$ to the value $(\sqrt{5}+1)/2$, so that $c_i^*$ has mean zero and unit variance. See fig. 1 for a visualisation of this probability distribution.

Step 3: (Bootstrapping residuals). Compute $\varepsilon_i^* := \hat\varepsilon_i c_i^*$ and $y_i^* = f_{\hat\vartheta_T}(t_i) + \varepsilon_i^*$. This gives a new data vector $(y_i^*, t_i)_{i=1,\dots,N}$.

Step 4: (Compute the target). Compute $\hat D_*^2$ from $(y_i^*, t_i)_{i=1,\dots,N}$.

Step 5: (Bootstrap replication). Repeat steps 2-4 $B$ times, which gives values $\hat D^2_{1,*},\dots,\hat D^2_{B,*}$. $B$ is a large number; typically $B = 500$ or $B = 1000$ is sufficient.

From the bootstrap replications $\hat D^2_{1,*},\dots,\hat D^2_{B,*}$ we compute the quantities

$$x_1 = \sqrt{N}\left(\hat D^2_{1,*} - \hat D^2\right), \;\dots,\; x_B = \sqrt{N}\left(\hat D^2_{B,*} - \hat D^2\right),$$

using the number of data points $N$. The $x_1,\dots,x_B$ are realisations of the random quantity $X = \sqrt{N}\left(\hat D_*^2 - \hat D^2\right)$.

It can be proved that the empirical distribution function of $\hat D^2_{1,*},\dots,\hat D^2_{B,*}$ yields an approximation to the true distribution of $\hat D^2$ after a proper re-centring, i.e.
the cumulative probability distribution function $F_B^*$ of $X = \sqrt{N}\left(\hat D_*^2 - \hat D^2\right)$ is close to the cumulative distribution function of $\sqrt{N}\left(\hat D^2 - D^2\right)$ for any $D^2 > 0$ (Munk, 1999).

An important application of this result is to determine an approximation to the probability $p(t, D^2)$ that $\hat D^2$ is below a certain value $t$, provided the distance between the true function $f$ and the model is $D^2$. To this end we use that $F_B^*$, found from the bootstrap replications, approximates the (unknown) cumulative probability distribution of $\sqrt{N}(\hat D^2 - D^2)$. The latter distribution allows one to determine $p(t, D^2)$. Hence we are in a position to compare the probability that the observed value of $\hat D^2$ is achieved in all "possible worlds", i.e. for any possible $f$. In fact, it turns out that this probability depends on $f$ only via $D^2(f)$, which allows a nice geometric interpretation, as we will illustrate in the following.

We will use the asymptotic similarity of the two cumulative probability distributions in the following section to estimate the probability $p(t, D^2)$. From this we then define the p-value curve $\alpha_N(\Pi)$, which can be regarded as a measure of evidence for $D^2 \le \Pi$, given $\hat D^2$ and $F_B^*$. Thus these quantities allow one to constrain $D^2$ for a parametric model of a given set of data.

3 P-VALUE CURVES

The main methodology we propose in this paper is the computation of a p-value curve as a graphical tool for illustrating the evidence for a model. To this end we plot the function

$$\alpha_N(\Pi) = F_B^*\left(\sqrt{N}\left(\hat D^2 - \Pi\right)\right)$$

for $\Pi > 0$, i.e. the value of $\alpha_N(\Pi)$ is given by the probability that the random quantity $X = \sqrt{N}\left(\hat D_*^2 - \hat D^2\right)$ is smaller than $\sqrt{N}\left(\hat D^2 - \Pi\right)$.
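Steps 1-5 of sect. 2.2 and the definition of $\alpha_N(\Pi)$ can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis in the paper: the one-parameter model $f_\vartheta(t) = \vartheta t$, the equispaced design on $[0,1]$, the grid approximation of the norm, and the function names are all hypothetical. The multipliers $c_i^*$ are drawn from the mean-zero, unit-variance two-point distribution with values $-(\sqrt{5}-1)/2$ and $(\sqrt{5}+1)/2$, and $F_B^*$ is taken to be the empirical distribution function of $x_1,\dots,x_B$.

```python
import math
import random

SQRT5 = math.sqrt(5.0)
# Two-point multipliers with mean zero and unit variance.
C_LOW, C_HIGH = -(SQRT5 - 1.0) / 2.0, (SQRT5 + 1.0) / 2.0
P_LOW = (SQRT5 + 1.0) / (2.0 * SQRT5)   # probability of drawing C_LOW


def smooth(values, t, grid):
    """Cumulative smoothing with kernel min(u, t_i), averaged over the design."""
    n = len(t)
    return [sum(v * min(u, ti) for v, ti in zip(values, t)) / n for u in grid]


def smde_linear(y, t, grid):
    """SMDE and D2_hat for the hypothetical model f_theta(t) = theta * t."""
    g_hat = smooth(y, t, grid)
    h = smooth(t, t, grid)
    theta = sum(g * hj for g, hj in zip(g_hat, h)) / sum(hj * hj for hj in h)
    d2 = sum((g - theta * hj) ** 2 for g, hj in zip(g_hat, h)) / len(grid)
    return theta, d2


def wild_bootstrap(y, t, grid, n_boot, rng):
    """Return (d2_hat, x) with x_b = sqrt(N) * (D2*_b - D2_hat)."""
    n = len(y)
    theta, d2_hat = smde_linear(y, t, grid)                  # step 1
    fitted = [theta * ti for ti in t]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    x = []
    for _ in range(n_boot):                                  # step 5
        c = [C_LOW if rng.random() < P_LOW else C_HIGH
             for _ in range(n)]                              # step 2
        y_star = [fi + ri * ci
                  for fi, ri, ci in zip(fitted, resid, c)]   # step 3
        _, d2_star = smde_linear(y_star, t, grid)            # step 4
        x.append(math.sqrt(n) * (d2_star - d2_hat))
    return d2_hat, x


def p_value_curve(d2_hat, x, n, pis):
    """alpha_N(Pi) = F*_B(sqrt(N) * (D2_hat - Pi)), F*_B the empirical CDF of x."""
    root_n, b = math.sqrt(n), len(x)
    return [sum(1 for xb in x if xb <= root_n * (d2_hat - pi)) / b for pi in pis]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 100
    t = [(i + 0.5) / n for i in range(n)]
    grid = [(j + 0.5) / 20 for j in range(20)]
    y = [ti ** 2 + rng.gauss(0.0, 0.05 * (1.0 + ti)) for ti in t]
    d2_hat, x = wild_bootstrap(y, t, grid, 200, rng)
    pis = [k * d2_hat / 2.0 for k in range(9)]
    for pi, a in zip(pis, p_value_curve(d2_hat, x, n, pis)):
        print(f"Pi = {pi:.3e}   alpha_N = {a:.3f}")
```

Because the empirical distribution function is non-decreasing and its argument decreases with $\Pi$, the resulting curve is non-increasing in $\Pi$ by construction.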
Note that this implies that $\alpha_N(\Pi)$ decreases as $\Pi$ increases, because we then evaluate the cumulative distribution function $F_B^*(x)$ for decreasing $x$, and in particular, if $\alpha_N(\Pi)$ is small, at the left tail of $F_B^*$.

The interpretation of the function $\alpha_N(\Pi)$ is as follows. Assume the true distance between the model $U$ and $f$ (i.e. the distance between the minimising $f_{\vartheta^*}$ and the "true" function $f$) is $D^2 = \Pi$. If this holds, the probability that $\sqrt{N}(\hat D^2 - D^2)$ is smaller than some value $t$ is given as

$$P_{D^2 = \Pi}\left(\sqrt{N}\left(\hat D^2 - D^2\right) \le t\right) \approx F_B^*(t) \qquad (4)$$

where the r.h.s. denotes the bootstrap approximation to the true distribution function on the l.h.s. Now we reject the hypothesis $H: D^2 > \Pi$ (vs. the alternative $K: D^2 \le \Pi$) whenever $\alpha_N(\Pi) \le \alpha$ for a given level of significance $\alpha$. Hence $1 - \alpha_N(\Pi)$ can be regarded as the estimated evidence in favour of the model $U$ (up to a distance $D^2 \le \Pi$ between model and data).

Note that this approach highlights the fact that finally the astrophysicist has to decide whether a value of $D^2 = \Pi$ should be regarded as scientifically negligible, or as a deviation from the model $U$ which is too large on astrophysical grounds. We mention that the classical goodness of fit tests do not offer the scientist the specification of such a value $\Pi$.

How can an upper bound for a just acceptable $D^2$ be determined? One simple suggestion is to compute the distance $\tilde D^2 = \|\mathbf{T} f_{\hat\vartheta_T} - \mathbf{T} \tilde f_{\hat\vartheta_T}\|^2$ between the best model $f_{\hat\vartheta_T}$ and "test models" $\tilde f_{\hat\vartheta_T}$. Such test models should then be constructed from $f_{\hat\vartheta_T}$ by adding (systematic) deviations to the model which are still considered as scientifically negligible differences to the best model.
Then, if $D^2$ is not larger than the average $\langle \tilde D^2 \rangle$, computed from a number of such test models, it is considered as scientifically negligible.

Observe that with our proposed method the statistical type one error is the error of deciding for the model (or, more precisely, for a neighbourhood $D^2 \le \Pi$ of the model) although it is not valid. Classical goodness of fit tests are only able to control the error of rejecting the model although it holds, i.e. they are based on testing $H_0: D^2 = 0$ vs. $K_0: D^2 > 0$.

Fixing $\hat D^2$, a small value of $\alpha_N(\Pi)$ indicates a large probability for $D^2 \le \Pi$, and a large value (close to 1) of $\alpha_N(\Pi)$ indicates a large probability for $D^2 > \Pi$. It is important to note that the interesting regions of the resulting curves $\alpha_N(\Pi)$ are those values of $\Pi$ where $\alpha_N(\Pi)$ is rather large (larger than 0.9, say) or rather small (smaller than 0.1), in accordance with the usual choice of levels of significance. In contrast, decisions based on $\alpha_N$ in regions where $\alpha_N(\Pi) \approx 0.5$ would correspond to flipping a coin in order to decide whether $D^2 \le \Pi$ or not.

An important advantage of p-value curves is that they give us not only the estimated probability (p-value) of observing a test statistic (such as $Q_N^2(\hat\vartheta)$ or $\hat D^2$) under the assumption that $U$ underlies the data. Rather, we obtain simultaneously all scenarios over the entire range of "possible worlds", which are parametrised by $D^2$. In particular this implies that models with a large number of parameters are penalised in an automatic way. As the number of parameters increases, the variability of the statistic $\hat D^2$ increases and hence so does the variability of $F_B^*$, i.e. the range of values of $X$ for which $F_B^*$ differs significantly from 0 and 1 is larger. On the other hand the bias is reduced. As the number of parameters decreases the opposite will be the case.
This leads to a curve $\alpha_N(\Pi)$ which decreases only slowly to zero if the variance is too large or if the bias is too large. Hence evidence for a small $\Pi$ can only be claimed if these two quantities are balanced.

In other words, a p-value curve automatically reflects the tradeoff between variance and bias in a regression. Here the bias of the regression function can be viewed as the difference between the "true" expectation value $E[y_i]$ and the value of the regression function $f(t_i)$. The variance provides an estimate of the uncertainty of the best-fitting parameters $\hat\vartheta$ or $\hat\vartheta_T$.

Figure 2. Typical cases for p-value curve comparison of two parametric models. The vertical lines at $\Pi = 0.08$ in the graphs indicate the observed value of $\hat D^2 = 0.08$. In graph 1, model 1 fits better, as it does in graph 2. However, in graph 2 there is additionally strong evidence that model 1 does not hold. Graph 3 is again an example with model 1 the better model. Finally, in graph 4 the situation depends on the assumed distance $D^2$ between the parametric model $U$ and the true regression function $f$ (cf. sect. 2).

Before we analyse two competing models for the structure of the MW, we illustrate typical features of p-value curves in an artificial example. In fig. 2 various scenarios are displayed. In graph 1, model 1 beats model 2 on all fronts. The estimated evidence for $D^2 \le \Pi$ is uniformly larger for any $\Pi > 0$. This coincides with "classical testing", because the classical p-value for testing $H: D^2 = 0$ is also larger. Observe that the classical p-value corresponds in this graph to $1 - \alpha_N(0)$.

Graph 2 is similar. Observe, however, that a classical analysis would indicate that here there is additionally strong evidence that model 1 does not hold ($\alpha_N(0) \gtrsim 0.9$), although it yields a better fit than model 2, exactly as in graph 1. Here the value of $\Pi$ where $\alpha_N(\Pi) = 0.1$ (i.e. where $\Pi \approx 0.7$), say, becomes important because it gives an idea of the order of magnitude of the distance between the model $U$ and the true regression function. Hence it has to be decided for the particular problem whether a distance of $\Pi \approx 0.7$ is considered as "large" or as scientifically irrelevant.

Graph 3 represents a typical case of over-fitting by model 2. Classical reasoning would prefer model 2 because $\alpha_N(0)$ is smaller and hence the classical p-value larger. However, we see that this is due to a lack of power of the test statistic used, because the slope of the curve is very flat due to a large variability of the test statistic. Hence there is not much support for the decision $D^2 \le 0.5$, say ($\alpha_N(0.5) \approx 0.3$), whereas model 1 yields $\alpha_N(0.5) \approx 0.03$. Thus there is strong evidence that the distance between model 1 and the true regression curve is smaller than 0.5, say.

Finally, in graph 4 both models are acceptable, with slight preference for model 2 provided a distance of 0.2 (the point of intersection of both curves) is considered as an acceptable distance between $U$ and $f$. If, however, a larger distance, $\Pi = 0.5$, say, is considered to be tolerable, model 1 has to be preferred.
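The two ways of reading a p-value curve used in the discussion of fig. 2 (finding the $\Pi$ at which $\alpha_N(\Pi)$ falls to a level such as 0.1, and comparing two models at a tolerated distance) can be sketched as follows. The function names are hypothetical, and the curve values in the example are invented purely to exercise the code; they are not read off fig. 2. The sketch assumes both curves are tabulated on the same increasing grid of $\Pi$ values.

```python
def smallest_pi_below(pis, alphas, level=0.1):
    """Smallest grid value Pi with alpha_N(Pi) <= level, or None if never reached."""
    for pi, a in zip(pis, alphas):
        if a <= level:
            return pi
    return None


def preferred_model(pis, alphas_1, alphas_2, tolerance):
    """Compare two curves at the largest grid point not exceeding the tolerance.

    A smaller alpha_N means stronger evidence for D^2 <= Pi, so the model with
    the smaller value is preferred. Returns 1, 2, or 0 for a tie.
    """
    idx = max((i for i, pi in enumerate(pis) if pi <= tolerance), default=None)
    if idx is None:
        raise ValueError("tolerance lies below the tabulated grid")
    if alphas_1[idx] < alphas_2[idx]:
        return 1
    if alphas_2[idx] < alphas_1[idx]:
        return 2
    return 0


if __name__ == "__main__":
    # Invented illustrative curves: model 1 steep, model 2 flat.
    pis = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    alphas_1 = [0.95, 0.80, 0.40, 0.15, 0.05, 0.01]
    alphas_2 = [0.60, 0.50, 0.40, 0.32, 0.25, 0.20]
    print("model 1 reaches level 0.1 at Pi =", smallest_pi_below(pis, alphas_1))
    print("model 2 reaches level 0.1 at Pi =", smallest_pi_below(pis, alphas_2))
    for tol in (0.1, 0.5):
        print("tolerance", tol, "-> preferred model", preferred_model(pis, alphas_1, alphas_2, tol))
```

Which tolerance to use remains a scientific judgement, as stressed in sect. 3; the code only reads off what the tabulated curves imply once that choice is made.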