Outliers and Influential Observations
August 16, 2014
1 Motivation
Refer to graphs presented in class to distinguish between outliers (observations with "large" residuals) and influential observations (observations that may or may not be outliers, but which influence a subset or all of the coefficients, fits, or variances in a "substantial" way).
2 Single-row Diagnostics
All tests consider what happens if an observation is dropped: what happens to the fit, the estimated coefficients, the $t$-ratios, etc. Let the model with all observations be denoted as usual as
$$y = X\beta + \varepsilon$$
and the OLS estimator
$$b = (X'X)^{-1}X'y.$$
Denote the $t$th diagonal element of the projection matrix $P = X(X'X)^{-1}X'$ as $h_t$, and write $c_t = (X'X)^{-1}x_t'$ (the $t$th column of $(X'X)^{-1}X'$), whose $k$th element is $c_{kt}$. Note that $h_t$ can also be written as $x_t(X'X)^{-1}x_t'$.

Say the $t$th observation is dropped. Denote the corresponding dependent variable as $y[t]$, the $X$ matrix as $X[t]$, the residual vector as $e[t]$, etc. The $t$th observation can be considered influential if its omission has a large impact on the parameter estimates, the fit of the model, etc. This is determined by using some rules of thumb:

1. DFBETA: As shown below,
$$b_k - b_k[t] = \frac{c_{kt}\,e_t}{1 - h_t}$$
Proof: Without loss of generality, let the $t$th observation be placed last, i.e., write the data matrices in partitioned form as
$$X = \begin{bmatrix} X[t] \\ x_t \end{bmatrix}; \qquad y = \begin{bmatrix} y[t] \\ y_t \end{bmatrix}$$
where $X$ is $(n \times K)$, $X[t]$ is $((n-1) \times K)$, and $x_t$ is $(1 \times K)$; $y_t$ is a scalar, and $y[t]$ is $((n-1) \times 1)$.
$$\Rightarrow X'X = X[t]'X[t] + x_t'x_t, \quad \text{or} \quad X[t]'X[t] = X'X - x_t'x_t$$
$$\Rightarrow X'y = X[t]'y[t] + x_t'y_t, \quad \text{or} \quad X[t]'y[t] = X'y - x_t'y_t$$
Given that for any invertible matrix $A$ and conformable vector $c$,
$$(A - cc')^{-1} = A^{-1} + A^{-1}c\,(I - c'A^{-1}c)^{-1}c'A^{-1}.$$
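As a quick numerical sanity check of this rank-one inverse update, one can compare both sides on simulated matrices (a sketch with arbitrary made-up numbers; the identity is exact whenever $1 - c'A^{-1}c \neq 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4.0 * np.eye(4)   # symmetric positive definite, hence invertible
c = rng.normal(size=(4, 1))

A_inv = np.linalg.inv(A)
# (A - cc')^{-1} = A^{-1} + A^{-1} c (I - c' A^{-1} c)^{-1} c' A^{-1}
lhs = np.linalg.inv(A - c @ c.T)
rhs = A_inv + A_inv @ c @ np.linalg.inv(np.eye(1) - c.T @ A_inv @ c) @ c.T @ A_inv
ok = np.allclose(lhs, rhs)
print(ok)
```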
Substitute $(X'X)$ for $A$ and $c = x_t'$:
$$(X[t]'X[t])^{-1} = (X'X)^{-1} + (X'X)^{-1}x_t'\left(1 - x_t(X'X)^{-1}x_t'\right)^{-1}x_t(X'X)^{-1}$$
Substituting $h_t = x_t(X'X)^{-1}x_t'$, a scalar,
$$= (X'X)^{-1} + \frac{(X'X)^{-1}x_t'x_t(X'X)^{-1}}{1 - h_t}$$
$$\Rightarrow b[t] = (X[t]'X[t])^{-1}X[t]'y[t] = \left[(X'X)^{-1} + \frac{(X'X)^{-1}x_t'x_t(X'X)^{-1}}{1 - h_t}\right](X'y - x_t'y_t)$$
$$= (X'X)^{-1}X'y - (X'X)^{-1}x_t'y_t + \frac{(X'X)^{-1}x_t'x_t(X'X)^{-1}X'y}{1 - h_t} - \frac{(X'X)^{-1}x_t'x_t(X'X)^{-1}x_t'y_t}{1 - h_t}$$
$$= b - (X'X)^{-1}x_t'y_t + \frac{(X'X)^{-1}x_t'x_t\,b}{1 - h_t} - \frac{(X'X)^{-1}x_t'h_t\,y_t}{1 - h_t}$$
$$\Rightarrow b - b[t] = \frac{(X'X)^{-1}x_t'y_t(1 - h_t) - (X'X)^{-1}x_t'x_t\,b + (X'X)^{-1}x_t'h_t\,y_t}{1 - h_t}$$
Recognizing that $h_t$ and $y_t$ are scalars, and that $x_t b = \hat{y}_t$ so that $y_t - x_t b = e_t$, after cancellation we get
$$b - b[t] = \frac{(X'X)^{-1}x_t'(y_t - x_t b)}{1 - h_t} = \frac{c_t\,e_t}{1 - h_t}$$
Focusing only on the $k$th coefficient, we get the expression above:
$$b_k - b_k[t] = \frac{c_{kt}\,e_t}{1 - h_t}$$
Some standardization is necessary to determine cutoffs:
$$DFBETA_k = \frac{b_k - b_k[t]}{s[t]\sqrt{\sum_t c_{kt}^2}}, \qquad \text{Cutoff: } \pm\frac{2}{\sqrt{n}}$$
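The leave-one-out identity above is easy to verify by brute force: delete a row, refit, and compare against the shortcut formula. Below is a minimal sketch with simulated data (all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(42)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # h_t = x_t (X'X)^{-1} x_t'

t = 7                                            # drop the t-th observation
X_t = np.delete(X, t, axis=0)
y_t = np.delete(y, t)
b_t = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)  # brute-force b[t]

shortcut = XtX_inv @ X[t] * e[t] / (1 - h[t])    # c_t e_t / (1 - h_t)
ok = np.allclose(b - b_t, shortcut)
print(ok)
```

The refit is $O(nK^2)$ per deleted row; the shortcut reuses quantities from the full fit.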
2. DFFITS: It can be shown that
$$\hat{y}_t - \hat{y}_t[t] = x_t\left[b - b[t]\right] = \frac{h_t\,e_t}{1 - h_t}$$
With standardization:
$$DFFIT_t = \frac{\hat{y}_t - \hat{y}_t[t]}{s[t]\sqrt{h_t}}, \qquad \text{Cutoff: } \pm 2\sqrt{\frac{K}{n}}$$
This was the impact of deleting the $t$th observation on the $t$th predicted value. One can analogously consider $\hat{y}_j - \hat{y}_j[t]$.
3. RSTUDENT:
$$RSTUDENT_t = \frac{e_t}{s[t]\sqrt{1 - h_t}}, \qquad \text{Cutoff: } \pm 2$$

4. COVRATIO:
$$COVRATIO = \frac{\left|s^2[t]\,(X[t]'X[t])^{-1}\right|}{\left|s^2\,(X'X)^{-1}\right|}$$
Cutoff: $< 1 - \frac{3K}{n} \Rightarrow$ "bad"; $> 1 + \frac{3K}{n} \Rightarrow$ "good".
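All four single-row diagnostics can be computed for every observation at once from the full-sample fit; the extra ingredients are the standard leave-one-out variance identity $(n-K-1)\,s^2[t] = (n-K)\,s^2 - e_t^2/(1-h_t)$ and the determinant identity $|X[t]'X[t]| = |X'X|(1-h_t)$. A sketch with simulated data and one planted outlier (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)
y[5] += 8.0                                   # plant one outlier

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_t
s2 = e @ e / (n - K)
# leave-one-out variance: (n-K-1) s^2[t] = (n-K) s^2 - e_t^2 / (1 - h_t)
s2_t = ((n - K) * s2 - e**2 / (1 - h)) / (n - K - 1)

rstudent = e / np.sqrt(s2_t * (1 - h))
dffits = np.sqrt(h) * e / ((1 - h) * np.sqrt(s2_t))  # (yhat_t - yhat_t[t]) / (s[t] sqrt(h_t))
# |(X[t]'X[t])^{-1}| / |(X'X)^{-1}| = 1 / (1 - h_t), so:
covratio = (s2_t / s2)**K / (1 - h)

flagged = np.where(np.abs(rstudent) > 2)[0]
print(flagged)
```

The planted observation 5 should appear among the flagged indices under the $\pm 2$ rule of thumb.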
3 Multiple-row Diagnostics

If there is a cluster of more than one outlier, it is clear that single-row diagnostics will not be able to identify influential observations because of the masking effect, demonstrated in class. Multiple-row diagnostics can. Let $m$ denote the subset of $m$ deleted observations. The measures defined above can be analogously determined:
$$DFBETA_k = \frac{b_k - b_k[m]}{\sqrt{\mathrm{Var}(b_k)}}$$
$$MDFIT = (b - b[m])'\,(X[m]'X[m])\,(b - b[m])$$
$$VARRATIO = \frac{\left|s^2[m]\,(X[m]'X[m])^{-1}\right|}{\left|s^2\,(X'X)^{-1}\right|}$$
This is, however, not practical in general, although there are packages that can consider every permutation of 2, 3, 4, ... data points, and also methods to help identify $m$.
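For small $n$ the brute-force version of this idea can still be sketched: compute MDFIT over every candidate pair of deletions and rank them. Here two masking outliers are planted at the same high-leverage point (the data and all names are mine):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + 0.5 * rng.normal(size=n)
# plant two masking outliers at the same high-leverage spot:
# each looks mild when deleted alone, because the other still pulls the fit
X[0, 1] = X[1, 1] = 5.0
y[0] = y[1] = -3.0

b = np.linalg.solve(X.T @ X, X.T @ y)

def mdfit(m):
    """MDFIT for deleted subset m: (b - b[m])' (X[m]'X[m]) (b - b[m])."""
    keep = np.setdiff1d(np.arange(n), m)
    Xm, ym = X[keep], y[keep]
    bm = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)
    d = b - bm
    return d @ (Xm.T @ Xm) @ d

pairs = list(combinations(range(n), 2))
scores = [mdfit(np.array(p)) for p in pairs]
worst = pairs[int(np.argmax(scores))]
print(worst)
```

Ranking all $\binom{n}{2}$ pairs identifies the planted cluster even though each point masks the other in single-row diagnostics.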
3.1 Partial Regression Plots

In a simple regression model (with one independent variable), influential observations, be they single or multiple, are easy to detect visually. But what about a multiple regression model? One easy and practical solution is to collapse a multiple regression model into a series of single regressions using the FWL Theorem. For example, say there are four explanatory variables:
$$y = \beta_1 + X_2\beta_2 + \ldots + X_4\beta_4 + \varepsilon$$
To know if there are observations influencing the estimated $b_2$:

1. Regress $y$ on $X_3$ and $X_4$ and obtain the residual $\hat{u}$.

2. Regress $X_2$ on $X_3$ and $X_4$ and obtain the residual $\hat{w}$.

By the FWL Theorem, we know that the regression of $\hat{u}$ on $\hat{w}$ yields the OLS slope coefficient for $X_2$. So, a plot of $\hat{u}$ against $\hat{w}$ enables us to collapse a multidimensional problem into a two-dimensional one. Visual inspection of such partial regression plots, along the lines presented earlier, for each of the key parameters of interest can identify influential observations, singly or as a cluster.
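The FWL equivalence behind these plots is straightforward to confirm numerically. A sketch with simulated data, where the two auxiliary regressions also absorb the intercept (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept, X3, X4
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 + Z[:, 1] - 0.5 * Z[:, 2] + rng.normal(size=n)

def resid(a, B):
    """Residual from regressing a on the columns of B."""
    return a - B @ np.linalg.lstsq(B, a, rcond=None)[0]

u_hat = resid(y, Z)    # step 1: y on X3, X4 (and intercept)
w_hat = resid(x2, Z)   # step 2: X2 on X3, X4 (and intercept)
slope_fwl = (w_hat @ u_hat) / (w_hat @ w_hat)

# coefficient on X2 from the full multiple regression
b_full = np.linalg.lstsq(np.column_stack([Z, x2]), y, rcond=None)[0]
ok = np.isclose(slope_fwl, b_full[-1])
print(ok)
```

A scatter of `u_hat` against `w_hat` is exactly the partial regression plot described above, and its fitted slope is the multiple-regression coefficient.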
4 What to do

The point is that an influential observation, or set of observations, is not necessarily to be jettisoned. A cluster of influential observations could well be an indication of structural change, for example.