Outliers and Influential Observations


August 16, 2014

1 Motivation

Refer to the graphs presented in class to distinguish between outliers (observations with "large" residuals) and influential observations (observations that may or may not be outliers, but which influence a subset or all of the coefficients, fits, or variances in a "substantial" way).

2 Single-row Diagnostics

All tests consider what happens if an observation is dropped: what happens to the fit, the estimated coefficients, the t-ratios, etc.

Let the model with all observations be denoted, as usual, as $y = X\beta + \varepsilon$, with OLS estimator $b = (X'X)^{-1}X'y$.

Denote the $t$th diagonal element of the projection matrix $P = X(X'X)^{-1}X'$ as $h_t$, and the $t$th column of $(X'X)^{-1}X'$, that is $(X'X)^{-1}x_t$, as $c_t$, whose $k$th element is $c_{kt}$. Note that $h_t$ can also be written as $x_t'(X'X)^{-1}x_t$.

Say the $t$th observation is dropped. Denote the corresponding dependent variable as $y[t]$, the $X$ matrix as $X[t]$, the residual vector as $e[t]$, etc.

The $t$th observation can be considered influential if its omission has a large impact on the parameter estimates, the fit of the model, etc. This is determined using some rules of thumb:

1. DFBETA: As shown below:

$$b_k - b_k[t] = \frac{c_{kt}\, e_t}{1 - h_t}$$

Proof: Without loss of generality, let the $t$th observation be placed last, i.e. write the data matrices in partitioned form as follows:

$$X' = [\, X[t]' \;\; x_t \,]; \qquad y' = [\, y[t]' \;\; y_t \,]$$

where $X$ is $(n \times K)$, $X[t]$ is $((n-1) \times K)$ and $x_t'$ is $(1 \times K)$; $y_t$ is a scalar, and $y[t]$ is $((n-1) \times 1)$.
$$\Rightarrow X'X = X[t]'X[t] + x_t x_t', \quad \text{or} \quad X[t]'X[t] = X'X - x_t x_t'$$

$$\Rightarrow X'y = X[t]'y[t] + x_t y_t, \quad \text{or} \quad X[t]'y[t] = X'y - x_t y_t$$

Given that, for any nonsingular matrix $A$ and vector $c$ (with $c'A^{-1}c \neq 1$),

$$(A - cc')^{-1} = A^{-1} + A^{-1}c\,(1 - c'A^{-1}c)^{-1}\,c'A^{-1},$$

substitute $(X'X)$ for $A$ and $c = x_t$:

$$(X[t]'X[t])^{-1} = (X'X)^{-1} + (X'X)^{-1}x_t\,(1 - x_t'(X'X)^{-1}x_t)^{-1}\,x_t'(X'X)^{-1}$$

Substituting $h_t = x_t'(X'X)^{-1}x_t$, a scalar,

$$(X[t]'X[t])^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}}{1 - h_t}$$

$$
\begin{aligned}
\Rightarrow b[t] &= (X[t]'X[t])^{-1}X[t]'y[t] \\
&= \left[(X'X)^{-1} + \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}}{1 - h_t}\right](X'y - x_t y_t) \\
&= (X'X)^{-1}X'y - (X'X)^{-1}x_t y_t + \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}X'y}{1 - h_t} - \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}x_t y_t}{1 - h_t} \\
&= b - (X'X)^{-1}x_t y_t + \frac{(X'X)^{-1}x_t x_t' b}{1 - h_t} - \frac{(X'X)^{-1}x_t h_t y_t}{1 - h_t}
\end{aligned}
$$

$$\Rightarrow b - b[t] = \frac{(X'X)^{-1}x_t y_t (1 - h_t) - (X'X)^{-1}x_t x_t' b + (X'X)^{-1}x_t h_t y_t}{1 - h_t}$$

Recognizing that $h_t$ and $y_t$ are scalars, and that $x_t'b = \hat{y}_t$ so that $y_t - x_t'b = e_t$, after cancellation we get

$$b - b[t] = \frac{(X'X)^{-1}x_t\,(y_t - x_t'b)}{1 - h_t} = \frac{c_t e_t}{1 - h_t}$$

Focusing only on the $k$th coefficient, we get the expression above:

$$b_k - b_k[t] = \frac{c_{kt}\, e_t}{1 - h_t}$$

Some standardization is necessary to determine cut-offs:

$$\text{DFBETA}_k = \frac{b_k - b_k[t]}{s[t]\sqrt{\sum_{j=1}^{n} c_{kj}^2}}, \qquad \text{Cutoff}: \pm\frac{2}{\sqrt{n}}$$

where the sum runs over all $n$ observations.

2. DFFITS: It can be shown that:

$$\hat{y}_t - \hat{y}_t[t] = x_t'\,[\,b - b[t]\,] = \frac{h_t e_t}{1 - h_t}$$

With standardization:

$$\text{DFFIT}_t = \frac{\hat{y}_t - \hat{y}_t[t]}{s[t]\sqrt{h_t}}, \qquad \text{Cutoff}: \pm\frac{2\sqrt{K}}{\sqrt{n}}$$

This is the impact of deleting the $t$th observation on the $t$th predicted value. One can analogously consider $\hat{y}_j - \hat{y}_j[t]$.

3. RSTUDENT:

$$\text{RSTUDENT} = \frac{e_t}{s[t]\sqrt{1 - h_t}}, \qquad \text{Cutoff}: \pm 2$$

4. COVRATIO:

$$\text{COVRATIO} = \frac{\left|\, s^2[t]\,(X[t]'X[t])^{-1} \right|}{\left|\, s^2 (X'X)^{-1} \right|}$$

Cutoff: $< 1 - \frac{3K}{n} \rightarrow$ "bad"; $> 1 + \frac{3K}{n} \rightarrow$ "good".

3 Multiple-row Diagnostics

If there is a cluster of more than one outlier, single-row diagnostics will not be able to identify influential observations because of the masking effect, as demonstrated in class. Multiple-row diagnostics can. Let $m$ denote the subset of deleted observations.

The measures defined above can be analogously determined:

$$\text{DFBETA} = \frac{b_k - b_k[m]}{\sqrt{\operatorname{Var}(b_k)}}$$

$$\text{MDFIT} = (b - b[m])'\,(X[m]'X[m])\,(b - b[m])$$

$$\text{VARRATIO} = \frac{\left|\, s^2[m]\,(X[m]'X[m])^{-1} \right|}{\left|\, s^2 (X'X)^{-1} \right|}$$

This is, however, not practical, although there are packages that can consider every subset of 2, 3, 4, ... data points, and also methods to help identify $m$.

3.1 Partial Regression Plots

In a simple regression model (with one independent variable), influential observations, be they single or multiple, are easy to detect visually. But what about a multiple regression model? One easy and practical solution is to collapse a multiple regression model into a series of simple regressions using the FWL Theorem.

For example, say there are four explanatory variables (counting the constant):

$$y = \beta_1 + X_2\beta_2 + \dots + X_4\beta_4 + \varepsilon$$

To find out whether there are observations influencing the estimated $b_2$:

1. Regress $y$ on a constant, $X_3$ and $X_4$ and obtain the residual $\hat{u}$.
2. Regress $X_2$ on a constant, $X_3$ and $X_4$ and obtain the residual $\hat{w}$.

By the FWL Theorem, we know that the regression of $\hat{u}$ on $\hat{w}$ yields the OLS slope coefficient for $X_2$. So a plot of $\hat{u}$ against $\hat{w}$ enables us to collapse a multi-dimensional problem into a two-dimensional one.

Visual inspection of such partial regression plots, along the lines presented earlier, for each of the key parameters of interest can identify influential observations, singly or as a cluster.
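The two auxiliary regressions described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `residuals` and `partial_regression` are hypothetical, the data are synthetic, and the constant is assumed to be the first column of `X`.

```python
import numpy as np

def residuals(target, regressors):
    """Residuals from an OLS regression of target on regressors."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return target - regressors @ coef

def partial_regression(y, X, k):
    """Return (u_hat, w_hat, slope) for column k of X.

    u_hat: residuals of y on all columns of X except k
    w_hat: residuals of X[:, k] on all columns of X except k
    slope: OLS slope of u_hat on w_hat, which FWL says equals b_k
    """
    others = np.delete(X, k, axis=1)   # includes the constant column
    u_hat = residuals(y, others)
    w_hat = residuals(X[:, k], others)
    slope = (w_hat @ u_hat) / (w_hat @ w_hat)
    return u_hat, w_hat, slope

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat, w_hat, slope = partial_regression(y, X, k=1)
# slope and b_full[1] agree up to floating-point error
```

A scatter of `u_hat` against `w_hat` (for instance with a plotting library such as matplotlib) is then the partial regression plot for that coefficient; points far from the fitted line or far out along the horizontal axis are the ones worth a closer look.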
4 What to do

The point is that an influential observation, or set of observations, should not necessarily be jettisoned. A cluster of influential observations could well be an indication of structural change, for example.
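To flag observations for inspection rather than drop them, the single-row diagnostics of Section 2 can be computed without refitting the model $n$ times. The following is a minimal sketch, not a definitive implementation. The function name `single_row_diagnostics` and the synthetic data are hypothetical. Two standard identities that are not derived in these notes are assumed: $(n-K-1)\,s^2[t] = (n-K)\,s^2 - e_t^2/(1-h_t)$ and $|X[t]'X[t]| = (1-h_t)\,|X'X|$.

```python
import numpy as np

def single_row_diagnostics(X, y):
    """Leave-one-out influence measures; X is (n, K) with a constant column."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    C = XtX_inv @ X.T                    # column t is c_t = (X'X)^{-1} x_t
    b = C @ y
    e = y - X @ b
    h = np.sum(X * C.T, axis=1)          # h_t = x_t'(X'X)^{-1} x_t
    s2 = e @ e / (n - K)
    # Assumed identity: residual variance with observation t deleted.
    s2_del = ((n - K) * s2 - e**2 / (1 - h)) / (n - K - 1)
    s_del = np.sqrt(s2_del)

    delta_b = (C * (e / (1 - h))).T      # row t is b - b[t] = c_t e_t / (1 - h_t)
    scale = np.sqrt(np.sum(C**2, axis=1))            # sqrt of sum_j c_kj^2
    dfbeta = delta_b / (s_del[:, None] * scale[None, :])
    dffits = (h * e / (1 - h)) / (s_del * np.sqrt(h))
    rstudent = e / (s_del * np.sqrt(1 - h))
    # Assumed identity: |X[t]'X[t]| = (1 - h_t) |X'X|.
    covratio = (s2_del / s2) ** K / (1 - h)
    return {"delta_b": delta_b, "dfbeta": dfbeta, "dffits": dffits,
            "rstudent": rstudent, "covratio": covratio, "h": h}

# Synthetic data with one contaminated observation (illustration only).
rng = np.random.default_rng(1)
n, K = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 10.0

d = single_row_diagnostics(X, y)
# Rule-of-thumb cutoffs from Section 2.
flag_dfbeta = np.abs(d["dfbeta"]) > 2 / np.sqrt(n)
flag_dffits = np.abs(d["dffits"]) > 2 * np.sqrt(K / n)
flag_rstudent = np.abs(d["rstudent"]) > 2
flag_covratio_low = d["covratio"] < 1 - 3 * K / n
```

The boolean arrays only mark candidates. Whether a flagged observation reflects a data error, a genuine extreme, or structural change is a judgment the cutoffs cannot make.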


