The ARIMA Procedure

Detecting Outliers

You can use the OUTLIER statement to detect changes in the level of the response series that are not accounted for by the estimated model. The types of changes considered are additive outliers (AO), level shifts (LS), and temporary changes (TC).

Let $\eta _ t$ be a regression variable that describes some type of change in the mean response. In the time series literature $\eta _ t$ is called a shock signature. An additive outlier at some time point s corresponds to a shock signature $\eta _ t$ such that $\eta _ s = 1.0$ and $\eta _ t$ is 0.0 at all other points. Similarly, a permanent level shift that originates at time s has a shock signature such that $\eta _ t$ is 0.0 for $t < s$ and 1.0 for $t \geq s$. A temporary level shift of duration d that originates at time s has $\eta _ t$ equal to 1.0 between s and $s+d$ and 0.0 otherwise.
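The three shock signatures can be written down directly. The following is a minimal sketch in Python (not part of PROC ARIMA); the function name `shock_signature` is illustrative, and 0-based time indices are assumed:

```python
import numpy as np

def shock_signature(kind, n, s, d=1):
    """Build a shock signature eta_t of length n for a change at time s.

    kind: "AO" (additive outlier), "LS" (permanent level shift), or
          "TC" (temporary level shift of duration d).
    Times are 0-based here; the text uses 1-based time indices.
    """
    eta = np.zeros(n)
    if kind == "AO":
        eta[s] = 1.0            # one-time spike at t = s
    elif kind == "LS":
        eta[s:] = 1.0           # permanent step: 1.0 for all t >= s
    elif kind == "TC":
        eta[s:s + d] = 1.0      # temporary step of duration d
    else:
        raise ValueError(f"unknown shock type: {kind}")
    return eta
```

For example, `shock_signature("LS", 5, 2)` yields the vector `[0, 0, 1, 1, 1]`.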

Suppose that you are estimating the ARIMA model

\[ D(B) Y_ t = \mu _ t + \frac{\theta (B)}{\phi (B)} a_ t \]

where $Y_ t$ is the response series, $D(B)$ is the differencing polynomial in the backward shift operator B (possibly identity), $\mu _ t$ is the transfer function input, $\phi (B)$ and $\theta (B)$ are the AR and MA polynomials, respectively, and $a_ t$ is the Gaussian white noise series.

The problem of detection of level shifts in the OUTLIER statement is formulated as a problem of sequential selection of shock signatures that improve the model in the ESTIMATE statement. This is similar to the forward selection process in the stepwise regression procedure. The selection process starts by considering shock signatures of the type specified in the TYPE= option, originating at each nonmissing measurement. This involves testing $H_{0}\colon \beta = 0$ versus $H_{a}\colon \beta \neq 0$ in the model

\[ D(B) ( Y_ t - \beta \eta _ t ) = \mu _ t + \frac{\theta (B)}{\phi (B)} a_ t \]

for each of these shock signatures. The most significant shock signature, if it also satisfies the significance criterion in the ALPHA= option, is included in the model. If no significant shock signature is found, then the outlier detection process stops; otherwise this augmented model, which incorporates the selected shock signature in its transfer function input, becomes the null model for the subsequent selection process. This iterative process stops if at any stage no more significant shock signatures are found or if the number of iterations exceeds the maximum search number implied by the MAXNUM= and MAXPCT= settings. In all these iterations, the parameters of the ARIMA model in the ESTIMATE statement are held fixed.
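The forward selection loop can be sketched as follows. This is an illustrative Python sketch, not the PROC ARIMA implementation: the quantities $\delta$, $\kappa$, and $\tau^2$ anticipate the definitions given below, the helper name `select_shocks` is hypothetical, and $\sigma^2$ is taken as a fixed input:

```python
import numpy as np
from statistics import NormalDist

def select_shocks(N, Omega_inv, candidates, sigma2=1.0, alpha=0.05, maxnum=5):
    """Forward selection of shock signatures, in the spirit of the OUTLIER
    statement's search (illustrative sketch only).

    N          : "noise" vector from the null model, length n
    Omega_inv  : inverse of the ARMA variance-covariance matrix (held fixed)
    candidates : dict mapping a label to an effective shock signature zeta
    Returns a list of (label, beta_hat) pairs in selection order.
    """
    # chi-square(1) critical value from the normal quantile: chi2_1 = z^2
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    N = np.asarray(N, dtype=float).copy()
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < maxnum:
        best = None
        for label, zeta in remaining.items():
            delta = zeta @ Omega_inv @ N
            kappa = zeta @ Omega_inv @ zeta
            tau2 = delta ** 2 / (sigma2 * kappa)
            if best is None or tau2 > best[1]:
                best = (label, tau2, delta / kappa)
        label, tau2, beta_hat = best
        if tau2 <= crit:
            break                                  # nothing significant left
        selected.append((label, beta_hat))
        # fold the selected shock into the null model's mean and continue
        N -= beta_hat * remaining.pop(label)
    return selected
```

With white noise (`Omega_inv` equal to the identity) and a single spike of size 5 at position 3, the sketch selects exactly that additive outlier and then stops.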

The precise details of the testing procedure for a given shock signature $\eta _ t$ are as follows:

The preceding testing problem is equivalent to testing $H_{0}\colon \beta = 0$ versus $H_{a}\colon \beta \neq 0$ in the following "regression with ARMA errors" model

\[ N_ t = \beta \zeta _ t + \frac{\theta (B)}{\phi (B)} a_ t \]

where $N_ t = ( D(B) Y_ t - \mu _ t )$ is the "noise" process and $\zeta _ t = D(B)\eta _ t$ is the "effective" shock signature.
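The effect of differencing on a shock signature can be seen directly: under first differencing, $D(B) = 1 - B$, a permanent level shift becomes a single spike. A short numerical check (illustrative only, with the pre-sample value taken as 0.0):

```python
import numpy as np

n, s = 8, 3
eta_ls = np.zeros(n)
eta_ls[s:] = 1.0                       # permanent level shift at time s
# effective signature zeta_t = D(B) eta_t with D(B) = 1 - B
zeta = np.diff(eta_ls, prepend=0.0)    # first difference, pre-sample value 0
```

Here `zeta` is 1.0 at $t = s$ and 0.0 elsewhere, which is why a level shift in the original series looks like an additive outlier in the differenced series.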

In this setting, under $ H_0, $ $ N = ( N_1, N_2, \ldots , N_ n )^ T $ is a mean zero Gaussian vector with variance-covariance matrix $ \sigma ^2 \bOmega $. Here $ \sigma ^2 $ is the variance of the white noise process $a_ t$ and $\bOmega $ is the variance-covariance matrix associated with the ARMA model. Moreover, under $H_ a$, N has $\beta \zeta $ as the mean vector where $\zeta = (\zeta _1, \zeta _2, \ldots , \zeta _ n )^ T $. Additionally, the generalized least squares estimate of $\beta $ and its variance are given by

\begin{eqnarray*} \hat{\beta } & = & \delta / \kappa \\ \textrm{Var} ( \hat{\beta } ) & = & \sigma ^2 /\kappa \end{eqnarray*}

where $\delta = \zeta ^ T \bOmega ^{-1} N $ and $\kappa = \zeta ^ T \bOmega ^{-1} \zeta $. The significance of $\beta $ is tested with the statistic $\tau ^2 = \delta ^2/ (\sigma ^{2} \kappa )$, which has an approximate chi-squared distribution with 1 degree of freedom under $H_0$. The type of estimate of $\sigma ^2$ used in the calculation of $\tau ^2$ can be specified by the SIGMA= option. The default setting is SIGMA=ROBUST, which corresponds to a robust estimate suggested in an outlier detection procedure in X-12-ARIMA, the Census Bureau’s time series analysis program; see Findley et al. (1998) for additional information. The robust estimate of $\sigma ^2$ is computed by the formula

\[ \hat{\sigma }^2 = ( 1.49 \times \mr{Median} ( | \hat{a}_ t | ) )^2 \]

where $\hat{a}_ t$ are the standardized residuals of the null ARIMA model. The setting SIGMA=MSE corresponds to the usual mean squared error estimate (MSE) computed the same way as in the ESTIMATE statement with the NODF option.
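The robust estimate is a one-line computation. A minimal Python sketch (the function name `robust_sigma2` is illustrative, not a PROC ARIMA interface):

```python
import numpy as np

def robust_sigma2(resid):
    """Robust variance estimate in the spirit of SIGMA=ROBUST:
    sigma^2 = (1.49 * median(|a_t|))^2, where a_t are the standardized
    residuals of the null model. The factor 1.49 rescales the median
    absolute residual so the estimate is consistent for the standard
    deviation of mean-zero Gaussian residuals.
    """
    resid = np.asarray(resid, dtype=float)
    return (1.49 * np.median(np.abs(resid))) ** 2
```

For residuals $[-2, -1, 0, 1, 2]$ the median absolute residual is 1, so the estimate is $1.49^2 \approx 2.22$.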

The quantities $\delta $ and $\kappa $ are efficiently computed by a method described in De Jong and Penzer (1998); see also Kohn and Ansley (1985).
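The estimate $\hat{\beta}$, its variance, and $\tau^2$ can be verified numerically on a toy example. The sketch below uses brute-force matrix inversion rather than the efficient algorithm cited above, and assumes an AR(1) error process with coefficient 0.5 and $\sigma^2 = 1$, for which $\bOmega_{ij} = \phi^{|i-j|}/(1-\phi^2)$:

```python
import numpy as np

phi, n = 0.5, 6
# AR(1) variance-covariance matrix (up to sigma^2)
idx = np.arange(n)
Omega = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)
Omega_inv = np.linalg.inv(Omega)

zeta = np.zeros(n)
zeta[2] = 1.0                          # effective shock signature (AO at t = 2)
beta_true, sigma2 = 3.0, 1.0
N = beta_true * zeta                   # noiseless case: N has mean beta * zeta

delta = zeta @ Omega_inv @ N
kappa = zeta @ Omega_inv @ zeta
beta_hat = delta / kappa               # generalized least squares estimate
var_beta = sigma2 / kappa
tau2 = delta ** 2 / (sigma2 * kappa)   # equals beta_hat**2 / var_beta
```

In this noiseless setting $\delta = \beta \kappa$, so $\hat{\beta}$ recovers $\beta$ exactly; with noise added, $\hat{\beta}$ is unbiased with variance $\sigma^2/\kappa$.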

Modeling in the Presence of Outliers

In practice, modeling and forecasting time series data in the presence of outliers is a difficult problem for several reasons. The presence of outliers can adversely affect the model identification and estimation steps. Their presence close to the end of the observation period can have a serious impact on the forecasting performance of the model. In some cases, level shifts are associated with changes in the mechanism that drives the observation process, and separate models might be appropriate for different sections of the data. In view of all these difficulties, diagnostic tools such as outlier detection and residual analysis are essential in any modeling process.

The following modeling strategy, which incorporates level shift detection in the familiar Box-Jenkins modeling methodology, seems to work in many cases:

  1. Proceed with model identification and estimation as usual. Suppose this results in a tentative ARIMA model, say M.

  2. Check for additive outliers and permanent level shifts unaccounted for by the model M by using the OUTLIER statement. In this step, keep the number of level shifts searched small unless there is evidence to justify a larger number.

  3. Augment the original data set with the regression variables that correspond to the detected outliers.

  4. Include the first few of these regression variables in M, and call this model M1. Reestimate all the parameters of M1. It is important not to include too many of these outlier variables in the model, in order to avoid the danger of overfitting.

  5. Check the adequacy of M1 by examining the parameter estimates, residual analysis, and outlier detection. Refine it further if necessary.