The AUTOREG Procedure

Autoregressive Error Model

The regression model with autocorrelated disturbances is as follows:

\[  y_{t} = \mb{x} _{\mi{t} }’ {\beta }+{\nu }_{t}  \]
\[  {\nu }_{t}= {\epsilon }_{t} - {\varphi }_{1} {\nu }_{t-1}-{\ldots } - {\varphi }_{m} {\nu }_{t-m}  \]
\[  {\epsilon }_{t} \sim  \mr{N}(0,{\sigma }^{2})  \]

In these equations, ${y_{t}}$ are the dependent values, ${\mb{x} _{t}}$ is a column vector of regressor variables, ${\beta }$ is a column vector of structural parameters, and ${{\epsilon }_{t}}$ is normally and independently distributed with a mean of 0 and a variance of ${\sigma }^{2}$. Note that in this parameterization, the signs of the autoregressive parameters are reversed from the parameterization documented in most of the literature.
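
The sign convention can be illustrated with a short simulation. The following sketch (Python with NumPy; not part of PROC AUTOREG, and the regressors, parameter values, and seed are hypothetical) generates data from an AR(2) error model under the parameterization above. Later sketches in this section reuse the y and X arrays defined here.

    import numpy as np

    # Minimal simulation sketch of the autoregressive error model under the
    # documented sign convention: nu_t = eps_t - phi_1*nu_{t-1} - phi_2*nu_{t-2}.
    rng = np.random.default_rng(1)
    N = 200
    beta = np.array([10.0, 0.5])      # hypothetical structural parameters
    phi = np.array([-0.7, 0.2])       # signs reversed relative to most textbooks
    sigma = 1.0

    X = np.column_stack([np.ones(N), np.arange(N, dtype=float)])  # intercept + trend
    eps = rng.normal(0.0, sigma, size=N)
    nu = np.zeros(N)
    for t in range(N):
        ar = sum(phi[i] * nu[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        nu[t] = eps[t] - ar           # the documented parameterization
    y = X @ beta + nu                 # y_t = x_t' beta + nu_t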

PROC AUTOREG offers four estimation methods for the autoregressive error model. The default method, Yule-Walker (YW) estimation, is the fastest computationally. The Yule-Walker method used by PROC AUTOREG is described in Gallant and Goebel (1976). Harvey (1981) calls this method the two-step full transform method. The other methods are iterated YW, unconditional least squares (ULS), and maximum likelihood (ML). The ULS method is also referred to as nonlinear least squares (NLS) or exact least squares (ELS).

You can use all of the methods with data containing missing values, but you should use ML estimation if the missing values are plentiful. See the section Alternative Autocorrelation Correction Methods later in this chapter for further discussion of the advantages of different methods.

The Yule-Walker Method

Let $\bm {\varphi }$ represent the vector of autoregressive parameters,

\[  {\bm {\varphi }}=({\varphi }_{1},{\varphi }_{2}, {\ldots }, {\varphi }_{m})’  \]

and let the variance matrix of the error vector ${\bm {\nu }=( {\nu }_{1},{\ldots }, {\nu }_{N})’}$ be $\bSigma $,

\[  {E}( {\nu }{{\nu }’} ) = {{\bSigma }} = {\sigma }^{2}\mb{V}  \]

If the vector of autoregressive parameters $\bm {\varphi }$ is known, the matrix $\mb{V}$ can be computed from the autoregressive parameters. $\bSigma $ is then ${{\sigma }^{2}\mb{V} }$. Given $\bSigma $, the efficient estimates of the regression parameters $\beta $ can be computed using generalized least squares (GLS). The GLS estimates then yield the unbiased estimate of the variance $\sigma ^2$.

The Yule-Walker method alternates estimation of $\beta $ using generalized least squares with estimation of $\bm {\varphi }$ using the Yule-Walker equations applied to the sample autocorrelation function. The YW method starts by forming the OLS estimate of $\beta $. Next, $\bm {\varphi }$ is estimated from the sample autocorrelation function of the OLS residuals by using the Yule-Walker equations. Then $\mb{V}$ is estimated from the estimate of $\bm {\varphi }$, and $\bSigma $ is estimated from $\mb{V}$ and the OLS estimate of $\sigma ^2$. The autocorrelation corrected estimates of the regression parameters $\beta $ are then computed by GLS, using the estimated $\bSigma $ matrix. These are the Yule-Walker estimates.

If the ITER option is specified, the Yule-Walker residuals are used to form a new sample autocorrelation function, the new autocorrelation function is used to form a new estimate of $\bm {\varphi }$ and $\mb{V}$, and the GLS estimates are recomputed using the new variance matrix. This alternation of estimates continues until either the maximum change in the $\widehat{\varphi }$ estimate between iterations is less than the value specified by the CONVERGE= option or the maximum number of allowed iterations is reached. This produces the iterated Yule-Walker estimates. Iteration of the estimates may not yield much improvement.
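
The alternation can be sketched for the AR(1) case, in which the Yule-Walker equations given below reduce to $\hat{\varphi }_1 = -r_1$. The following fragment is an illustration only, not the PROC AUTOREG code path (which never forms $\mb{V}$ explicitly but uses the Kalman filter described later); it reuses y and X from the simulation sketch above, and its n_iter argument corresponds loosely to the ITER option.

    import numpy as np

    def yule_walker_ar1(y, X, n_iter=1):
        """Yule-Walker estimation for an AR(1) error model (illustrative sketch)."""
        N = len(y)
        b = np.linalg.lstsq(X, y, rcond=None)[0]         # OLS starting estimate of beta
        phi1 = 0.0
        for _ in range(n_iter):
            n = y - X @ b                                # structural residuals
            r1 = (n[1:] @ n[:-1]) / (n @ n)              # lag-1 sample autocorrelation
            phi1 = -r1                                   # Yule-Walker solution for m = 1
            rho = -phi1                                  # AR coefficient, textbook sign
            idx = np.arange(N)
            V = rho ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - rho ** 2)
            Vinv = np.linalg.inv(V)
            b = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # GLS re-estimate of beta
        return b, phi1

    b_yw, phi1_hat = yule_walker_ar1(y, X)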

The Yule-Walker equations, solved to obtain $\widehat{\varphi }$ and a preliminary estimate of ${\sigma }^{2}$, are

\[  \mb{R}\hat{\varphi }=-\mb{r}  \]

Here ${\mb{r} =(r_{1}, {\ldots } , r_{m})’}$, where ${r_{i}}$ is the lag i sample autocorrelation. The matrix $\mb{R}$ is the Toeplitz matrix whose i,jth element is ${ r_{{|i-j|}}}$. If you specify a subset model, then only the rows and columns of $\mb{R}$ and $\mb{r}$ corresponding to the subset of lags specified are used.
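
For a general AR($m$) model, the system can be solved directly by exploiting the Toeplitz structure of $\mb{R}$. A minimal sketch using SciPy's Toeplitz solver, with assumed sample autocorrelation values:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    r = np.array([0.6, 0.3, 0.1])                # r_1, ..., r_m (assumed values)
    first_col = np.concatenate(([1.0], r[:-1]))  # first column of R: r_0, r_1, ..., r_{m-1}
    phi_hat = solve_toeplitz(first_col, -r)      # solves R * phi_hat = -r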

If the BACKSTEP option is specified, for purposes of significance testing, the matrix $[\mb{R} ~  \mb{r} ]$ is treated as a sum-of-squares-and-crossproducts matrix arising from a simple regression with ${N-k}$ observations, where k is the number of estimated parameters.

The Unconditional Least Squares and Maximum Likelihood Methods

Define the transformed error, ${\mb{e} }$, as

\[  \mb{e} = \mb{L} ^{-1}\mb{n}  \]

where ${\mb{n} = \mb{y} - \mb{X} {\beta }}$.

The unconditional sum of squares for the model, $S$, is

\[  S = \mb{n} ’ \mb{V} ^{-1}\mb{n} = \mb{e} ’\mb{e}  \]

The ULS estimates are computed by minimizing ${S}$ with respect to the parameters ${\beta }$ and ${{\varphi }_{i}}$.
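
For the AR(1) case, the transformation $\mb{e} = \mb{L}^{-1}\mb{n}$ has a simple closed form, so the objective is easy to state. The sketch below (illustrative only; assumes no missing values and writes the textbook AR coefficient as $\rho = -\varphi _1$) computes $S$ for a given parameter vector. A general-purpose optimizer could minimize it over $\beta$ and $\varphi _1$, although PROC AUTOREG uses the Gauss-Newton algorithm described later.

    import numpy as np

    def uls_objective(params, y, X):
        """S = e'e for an AR(1) error model; params = (beta_0, ..., beta_{k-1}, phi1)."""
        params = np.asarray(params, dtype=float)
        beta, phi1 = params[:-1], params[-1]
        rho = -phi1
        n = y - X @ beta                          # structural residuals
        e = np.empty_like(n)
        e[0] = np.sqrt(1.0 - rho ** 2) * n[0]     # anomalous first row of L^{-1}
        e[1:] = n[1:] + phi1 * n[:-1]             # remaining rows apply the AR filter
        return e @ e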

The full log likelihood function for the autoregressive error model is

\[  l = - \frac{N}{2}{\ln }(2 {\pi }) - \frac{N}{2}{\ln }({\sigma }^{2})- \frac{1}{2}{\ln }({|\mb{V} |})-\frac{S}{2{\sigma }^{2}}  \]

where ${{|\mb{V} |}}$ denotes the determinant of ${\mb{V} }$. For the ML method, the likelihood function is maximized by minimizing an equivalent sum-of-squares function.

Maximizing $l$ with respect to ${{\sigma }^{2}}$ (and concentrating ${{\sigma }^{2}}$ out of the likelihood) and dropping the constant term ${-\frac{N}{2}({\ln }(2 {\pi })+1-{\ln }(N))}$ produces the concentrated log likelihood function

\[  l_{c} = -\frac{N}{2}{\ln }(S{|\mb{V} |}^{1/N})  \]

Rewriting the variable term within the logarithm gives

\[  S_{ml} = {|\mb{L} |}^{1/N}\mb{e} ’\mb{e} {|\mb{L} |}^{1/N}  \]

PROC AUTOREG computes the ML estimates by minimizing the objective function ${S_{ml} ={|\mb{L} |}^{1/N}\mb{e} ’\mb{e} {|\mb{L} |}^{1/N}}$.
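
In the AR(1) case, ${|\mb{V} |}=1/(1-\rho ^{2})$ with $\rho = -\varphi _1$, so ${|\mb{L} |}=(1-\rho ^{2})^{-1/2}$ and the objective can be written down directly. A sketch building on the ULS fragment above (again only an illustration of the formula, not the PROC AUTOREG implementation):

    def ml_objective(params, y, X):
        """S_ml = |L|^(2/N) * e'e for an AR(1) error model."""
        phi1 = params[-1]
        rho = -phi1
        N = len(y)
        S = uls_objective(params, y, X)           # e'e from the ULS sketch above
        detL = (1.0 - rho ** 2) ** -0.5           # |L| = |V|^(1/2) for AR(1)
        return detL ** (2.0 / N) * S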

The maximum likelihood estimates may not exist for some data sets (Anderson and Mentz, 1980). This is the case for very regular data sets, such as an exact linear trend.

Computational Methods

Sample Autocorrelation Function

The sample autocorrelation function is computed from the structural residuals or noise ${\mb{n_{t}} =y_{t}-\mb{x} _{\mi{t} }’\mb{b} }$, where $\mb{b}$ is the current estimate of $\beta $. The sample autocorrelation function is the sum of all available lagged products of ${\mb{n} _{t}}$ of order j divided by ${{\ell }+j}$, where ${\ell }$ is the number of such products.

If there are no missing values, then ${{\ell }+j=N}$, the number of observations. In this case, the Toeplitz matrix of autocorrelations, $\mb{R}$, is at least positive semidefinite. If there are missing values, these autocorrelation estimates of r can yield an $\mb{R}$ matrix that is not positive semidefinite. If such estimates occur, a warning message is printed, and the estimates are tapered by exponentially declining weights until $\mb{R}$ is positive definite.
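
One plausible reading of this computation is sketched below for a single lag j, with missing residuals coded as NaN (an illustration only, not the internal PROC AUTOREG routine). The autocorrelations $r_j$ used in the Yule-Walker equations are these lag-j terms scaled by the lag-0 term.

    import numpy as np

    def lagged_product_term(n, j):
        """Sum of available lagged products of order j, divided by ell + j."""
        prod = n[j:] * n[:-j] if j > 0 else n * n
        ell = int(np.sum(~np.isnan(prod)))        # number of available products
        return np.nansum(prod) / (ell + j)

    def sample_autocorr(n, j):
        return lagged_product_term(n, j) / lagged_product_term(n, 0)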

Data Transformation and the Kalman Filter

The calculation of $\mb{V}$ from $\bm {\varphi }$ for the general AR$(m)$ model is complicated, and the size of $\mb{V}$ depends on the number of observations. Instead of actually calculating $\mb{V}$ and performing GLS in the usual way, in practice a Kalman filter algorithm is used to transform the data and compute the GLS results through a recursive process.

In all of the estimation methods, the original data are transformed by the inverse of the Cholesky root of $\mb{V}$. Let $\mb{L}$ denote the Cholesky root of $\mb{V}$; that is, ${\mb{V} =\mb{LL'} }$ with $\mb{L}$ lower triangular. For an AR$(m)$ model, $\mb{L} ^{-1}$ is a band diagonal matrix with m anomalous rows at the beginning and the autoregressive parameters along the remaining rows. Thus, if there are no missing values, after the first ${m}$ observations the data are transformed as

\[  z_{t}= x_{t}+ \hat{{\varphi }}_{1} x_{t-1} +{\ldots }+ \hat{{\varphi }}_{m} x_{t-m}  \]

The transformation is carried out using a Kalman filter, and the lower triangular matrix $\mb{L}$ is never directly computed. The Kalman filter algorithm, as it applies here, is described in Harvey and Phillips (1979) and Jones (1980). Although $\mb{L}$ is not computed explicitly, for ease of presentation the remaining discussion is in terms of $\mb{L}$. If there are missing values, then the submatrix of $\mb{L}$ consisting of the rows and columns with nonmissing values is used to generate the transformations.
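
For rows past the anomalous block, the transformation is simply the AR filter shown above. A minimal sketch (assumes no missing values; the first m rows, which PROC AUTOREG handles through the Kalman filter initial conditions, are omitted here):

    import numpy as np

    def ar_transform_tail(x, phi_hat):
        """z_t = x_t + phi_1*x_{t-1} + ... + phi_m*x_{t-m} for t = m+1, ..., N."""
        x = np.asarray(x, dtype=float)
        m = len(phi_hat)
        z = x[m:].copy()
        for i, p in enumerate(phi_hat, start=1):
            z += p * x[m - i:-i]                  # add phi_i * x_{t-i}
        return z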

Gauss-Newton Algorithms

The ULS and ML estimates employ a Gauss-Newton algorithm to minimize the sum of squares and maximize the log likelihood, respectively. The relevant optimization is performed simultaneously for both the regression and AR parameters. The OLS estimates of ${\beta }$ and the Yule-Walker estimates of $\bm {\varphi }$ are used as starting values for these methods.

The Gauss-Newton algorithm requires the derivatives of $\mb{e}$ or ${{|\mb{L} |}^{1/N}\mb{e} }$ with respect to the parameters. The derivatives with respect to the parameter vector $\beta $ are

\[  \frac{{\partial }\mb{e} }{{\partial }{\beta }'} = - \mb{L} ^{-1}\mb{X}  \]
\[  \frac{{\partial }{|\mb{L} |}^{1/N}\mb{e} }{ {\partial }{\beta }'} = - {|\mb{L} |}^{1/N} \mb{L} ^{-1}\mb{X}  \]

These derivatives are computed by the transformation described previously. The derivatives with respect to $\bm {\varphi }$ are computed by differentiating the Kalman filter recurrences and the equations for the initial conditions.
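
The generic Gauss-Newton update is ${\Delta }{\theta } = -(\mb{J'J} )^{-1}\mb{J} '\mb{e}$, where $\mb{J}$ is the Jacobian of the residual vector with respect to all parameters. A sketch of a single step follows; the Jacobian is formed by finite differences here purely for brevity, whereas PROC AUTOREG uses the analytic derivatives just described.

    import numpy as np

    def gauss_newton_step(params, resid_fun, h=1e-6):
        """One Gauss-Newton step for minimizing resid_fun(params)' resid_fun(params)."""
        params = np.asarray(params, dtype=float)
        e0 = resid_fun(params)
        J = np.empty((e0.size, params.size))
        for i in range(params.size):
            p = params.copy()
            p[i] += h
            J[:, i] = (resid_fun(p) - e0) / h     # forward-difference column of J
        delta = np.linalg.lstsq(J, -e0, rcond=None)[0]
        return params + delta                     # updated parameter estimates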

Variance Estimates and Standard Errors

For the Yule-Walker method, the estimate of the error variance, ${s^{2}}$, is the error sum of squares from the last application of GLS, divided by the error degrees of freedom (number of observations N minus the number of free parameters).

The variance-covariance matrix for the components of $\mb{b}$ is taken as ${s^{2}(\mb{X} ’ \mb{V} ^{-1} \mb{X} )^{-1}}$ for the Yule-Walker method. For the ULS and ML methods, the variance-covariance matrix of the parameter estimates is computed as ${s^{2} (\mb{J'J} )^{-1}}$. For the ULS method, $\mb{J}$ is the matrix of derivatives of $\mb{e}$ with respect to the parameters. For the ML method, $\mb{J}$ is the matrix of derivatives of ${{|\mb{L} |}^{1/N}\mb{e} }$ divided by ${{|\mb{L} |}^{1/N}}$. The estimate of the variance-covariance matrix of $\mb{b}$ assuming that $\bm {\varphi }$ is known is ${s^{2} (\mb{X} ’ \mb{V} ^{-1}\mb{X} )^{-1}}$. For the OLS model, the estimate of the variance-covariance matrix is ${s^{2} (\mb{X} ’ \mb{X} )^{-1}}$.
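
Continuing the AR(1) Yule-Walker sketch above (and reusing its y, X, b_yw, and phi1_hat), the variance estimate and regression standard errors could be formed as follows. This is an illustration only, and it assumes that the free-parameter count includes the single autoregressive parameter.

    import numpy as np

    rho = -phi1_hat
    idx = np.arange(len(y))
    V = rho ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - rho ** 2)
    Vinv = np.linalg.inv(V)
    n = y - X @ b_yw                              # structural residuals from the GLS fit
    k = X.shape[1] + 1                            # beta components plus phi_1
    s2 = (n @ Vinv @ n) / (len(y) - k)            # error SS from GLS / error df
    cov_b = s2 * np.linalg.inv(X.T @ Vinv @ X)    # variance-covariance matrix of b
    std_err = np.sqrt(np.diag(cov_b))             # standard errors of the regression estimates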

Park and Mitchell (1980) investigated the small sample performance of the standard error estimates obtained from some of these methods. In particular, simulating an AR(1) model for the noise term, they found that the standard errors calculated using GLS with an estimated autoregressive parameter underestimated the true standard errors. These estimates of standard errors are the ones calculated by PROC AUTOREG with the Yule-Walker method.

The estimates of the standard errors calculated with the ULS or ML method take into account the joint estimation of the AR and the regression parameters and may give more accurate standard-error values than the YW method. At the same values of the autoregressive parameters, the ULS and ML standard errors will always be larger than those computed from Yule-Walker. However, simulations of the models used by Park and Mitchell (1980) suggest that the ULS and ML standard error estimates can also be underestimates. Caution is advised, especially when the estimated autocorrelation is high and the sample size is small.

High autocorrelation in the residuals is a symptom of lack of fit. An autoregressive error model should not be used as a nostrum for models that simply do not fit. It is often the case that time series variables tend to move as a random walk. This means that an AR(1) process with a parameter near one absorbs a great deal of the variation. See Example 8.3, which fits a linear trend to a sine wave.

For ULS or ML estimation, the joint variance-covariance matrix of all the regression and autoregression parameters is computed. For the Yule-Walker method, the variance-covariance matrix is computed only for the regression parameters.

Lagged Dependent Variables

The Yule-Walker estimation method is not directly appropriate for estimating models that include lagged dependent variables among the regressors. Therefore, the maximum likelihood method is the default when the LAGDEP or LAGDEP= option is specified in the MODEL statement. However, when lagged dependent variables are used, the maximum likelihood estimator is not exact maximum likelihood but is conditional on the first few values of the dependent variable.