The UCM Procedure

Example 34.8 ARIMA Modeling

This example shows how you can use the UCM procedure for ARIMA modeling. The parameter estimates and predictions that PROC UCM produces for an ARIMA model are close to those that PROC ARIMA produces (when the ML option is specified in its ESTIMATE statement) if the model is stationary, or if the model is nonstationary and the data contain no missing values. See Chapter 7: The ARIMA Procedure, for additional details about the ARIMA procedure. However, if the model is nonstationary and the data contain missing values, then the UCM and ARIMA procedures can produce significantly different parameter estimates and predictions. Kohn and Ansley (1986) suggest a statistically sound method of estimation, prediction, and interpolation for nonstationary ARIMA models with missing data. This method is based on an algorithm that is equivalent to the Kalman filtering and smoothing algorithm used in the UCM procedure. The results of an illustrative example in their article are reproduced here by using the UCM procedure. In this example, an ARIMA$(0,1,1)\times (0,1,1)_{12}$ model is applied to the logarithm of the air series in the sashelp.air data set. Four different missing value patterns are considered to highlight different aspects of the problem:

  • Data1. The full data set of 144 observations.

  • Data2. The 78 observations that remain after January through November are omitted in each of the last 6 years (1955 through 1960).

  • Data3. The data set with 5 observations missing: July 1949; June, July, and August 1957; and July 1960.

  • Data4. The data set with all July observations missing and June and August 1957 also missing.

The following DATA steps create these data sets:

data Data1;
   set sashelp.air;
   logair = log(air);
run;
   
data Data2;
   set data1;
   if year(date) >= 1955 and month(date) < 12 then logair = .;
run;

data Data3;
   set data1;
   if (year(date) = 1949 and month(date) = 7) then logair = .;
   if ( year(date) = 1957 and 
       (month(date) = 6 or month(date) = 7 or month(date) = 8))
       then logair = .;
   if (year(date) = 1960 and month(date) = 7) then logair = .;
run;
   
data Data4;
   set data1;
   if month(date) = 7 then logair = .;
   if year(date) = 1957 and (month(date) = 6 or month(date) = 8) 
      then logair = .;
run;
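
As a quick sanity check on the missing value patterns, the nonmissing and missing counts of logair can be tabulated. The following is a minimal sketch that is not part of the original example. It relies on the full series having 144 monthly observations (12 years), in which case the nonmissing counts should be 78, 139, and 130 for Data2, Data3, and Data4, respectively:

```sas
/* Count nonmissing (N) and missing (NMISS) values of logair.        */
/* Data2: 6 years x 11 months = 66 values set to missing, so N = 78. */
proc means data=Data2 n nmiss;
   var logair;
run;

/* Data3: 5 individual months set to missing, so N = 139. */
proc means data=Data3 n nmiss;
   var logair;
run;

/* Data4: 12 Julys plus June and August 1957 set to missing, so N = 130. */
proc means data=Data4 n nmiss;
   var logair;
run;
```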

The following statements specify the ARIMA$(0,1,1)\times (0,1,1)_{12}$ model for the logair series in the first data set (Data1):

proc ucm data=Data1;
   id date interval=month;
   model logair;
   irregular q=1 sq=1 s=12;
   deplag lags=(1)(12) phi=1 1 noest;
   estimate outest=est1;
   forecast outfor=for1;
run;
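
For reference, the model that these statements specify can be written in backshift notation. This is a sketch that assumes the usual convention in which moving average polynomials are written with minus signs. The symbols are introduced here for illustration only: $y_t$ denotes logair, $\theta$ and $\Theta$ denote the nonseasonal and seasonal moving average parameters, and $a_t$ denotes a white noise sequence.

$$(1-B)(1-B^{12})\, y_t = (1-\theta B)(1-\Theta B^{12})\, a_t$$

Under this convention, $\theta$ and $\Theta$ correspond to the parameters labeled MA_1 and SMA_1 in Output 34.8.1.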

The moving average part of the model is specified by using the Q=, SQ=, and S= options in the IRREGULAR statement. The differencing operator, $(1-B)(1-B^{12})$, is specified by using the DEPLAG statement: the LAGS=(1)(12) option defines the two lag polynomials, and the PHI=1 1 and NOEST options fix both lag coefficients at 1 so that they are not estimated. The model does not contain an intercept term; therefore, no LEVEL statement is needed. The parameter estimates are saved in a data set EST1 by using the OUTEST= option in the ESTIMATE statement, and the forecasts and the component estimates are saved in a data set FOR1 by using the OUTFOR= option in the FORECAST statement. The same analysis is performed on the other three data sets, but is not shown here.
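
The runs for the remaining data sets presumably differ only in the input data set and in the names of the output data sets. As a minimal sketch, the analysis of Data2 might look like the following, where the names est2 and for2 are assumptions chosen by analogy with est1 and for1:

```sas
/* Same ARIMA(0,1,1)x(0,1,1)12 specification, applied to Data2.  */
/* The output data set names est2 and for2 are assumed here.     */
proc ucm data=Data2;
   id date interval=month;
   model logair;
   irregular q=1 sq=1 s=12;              /* MA(1) x seasonal MA(1), period 12 */
   deplag lags=(1)(12) phi=1 1 noest;    /* fixed differencing (1-B)(1-B**12) */
   estimate outest=est2;
   forecast outfor=for2;
run;
```

Data3 and Data4 can be handled in the same way, for example with OUTEST=est3 and OUTFOR=for3, and OUTEST=est4 and OUTFOR=for4.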

Output 34.8.1 resembles Table 1 in Kohn and Ansley (1986). This table is generated by merging the parameter estimates from the four analyses. Only the moving average parameter estimates and their standard errors are reported. The columns EST1 and STD1 correspond to the estimates for Data1. The parameter estimates and their standard errors for the other three data sets are similarly named. The parameter estimates closely match those in the article. However, their standard errors differ slightly. This difference could be the result of different ways of computing the Hessian at the optimum. The white noise error variance estimates are not reported here, but they agree quite closely with those in the article.
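
The merge step is not shown in the example. One way it might be done is sketched below. The sketch assumes that the four OUTEST= data sets are named est1 through est4, that each contains the variables PARAMETER, ESTIMATE, and STD, and that the moving average rows are identified by the PARAMETER values MA_1 and SMA_1. The macro name keepma and the data set names ma1 through ma4 and maparms are hypothetical.

```sas
/* Sketch only: assumes est1-est4 come from OUTEST= in four PROC UCM runs */
/* and contain the variables PARAMETER, ESTIMATE, and STD.                */
%macro keepma(i);
   /* Keep the two moving average rows and give the columns run-specific names. */
   data ma&i;
      set est&i;
      where parameter in ('MA_1', 'SMA_1');
      est&i = estimate;
      std&i = std;
      keep parameter est&i std&i;
   run;

   /* Sort so that the BY-merge below is valid. */
   proc sort data=ma&i;
      by parameter;
   run;
%mend keepma;

%keepma(1)
%keepma(2)
%keepma(3)
%keepma(4)

/* One row per parameter, one pair of columns per data set. */
data maparms;
   merge ma1 ma2 ma3 ma4;
   by parameter;
run;

proc print data=maparms noobs;
   format est1-est4 std1-std4 5.3;
run;
```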

Output 34.8.1: Data Sets 1–4: Parameter Estimates and Standard Errors

PARAMETER   est1    std1    est2    std2    est3    std3    est4    std4
MA_1        0.402   0.090   0.457   0.121   0.408   0.092   0.431   0.091
SMA_1       0.557   0.073   0.758   0.236   0.566   0.075   0.573   0.074


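Output 34.8.2 resembles Table 2 in Kohn and Ansley (1986). It contains forecasts and their standard errors for the four data sets. The numbers are very close to those in the article. The July 1961 entries for Data4 are missing (shown as periods), which is to be expected because Data4 contains no July observations from which a July forecast could be determined.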

Output 34.8.2: Data Sets 1–4: Forecasts and Standard Errors

DATE    for1    std1    for2    std2    for3    std3    for4    std4
JAN61   6.110   0.037   6.084   0.052   6.110   0.037   6.111   0.037
FEB61   6.054   0.043   6.091   0.058   6.054   0.043   6.055   0.043
MAR61   6.172   0.048   6.247   0.063   6.173   0.048   6.174   0.048
APR61   6.199   0.053   6.205   0.068   6.199   0.053   6.200   0.052
MAY61   6.233   0.057   6.199   0.072   6.232   0.058   6.233   0.056
JUN61   6.369   0.061   6.308   0.076   6.367   0.062   6.368   0.060
JUL61   6.507   0.065   6.409   0.079   6.497   0.067   .       .
AUG61   6.503   0.069   6.414   0.082   6.503   0.069   6.503   0.067
SEP61   6.325   0.072   6.299   0.085   6.325   0.072   6.326   0.071
OCT61   6.209   0.075   6.174   0.087   6.209   0.076   6.209   0.074
NOV61   6.063   0.079   6.043   0.089   6.064   0.079   6.064   0.077
DEC61   6.168   0.082   6.174   0.086   6.168   0.082   6.169   0.080


Output 34.8.3, which is based on Data2, resembles Table 3 in Kohn and Ansley (1986). The columns S_SERIES and VS_SERIES in the OUTFOR= data set contain the interpolated values of logair and their variances. For January through November 1957, the estimate column in Output 34.8.3 reports the interpolated values (which are the same as S_SERIES), and the std column reports their standard errors (which are computed as the square root of VS_SERIES). The actual logair values for these months, which are missing in Data2, are also provided for comparison. The numbers are very close to those in the article.
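
The computation just described can be sketched as follows. This is an illustration rather than the code used to produce the output: it assumes that the OUTFOR= data set from the Data2 run is named for2, and the data set name interp2 is hypothetical.

```sas
/* Sketch only: for2 is the assumed OUTFOR= data set from the Data2 run. */
/* The actual logair values come from Data1 because they are missing in  */
/* Data2. Both inputs are in date order, so no sorting is needed.        */
data interp2;
   merge Data1(keep=date logair)
         for2(keep=date s_series vs_series);
   by date;
   /* Keep January through November 1957 only. */
   if year(date) = 1957 and month(date) < 12;
   estimate = s_series;        /* interpolated (smoothed) value of logair */
   std      = sqrt(vs_series); /* standard error = square root of variance */
   keep date logair estimate std;
run;

proc print data=interp2 noobs;
   format date monyy5. logair estimate std 6.3;
run;
```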

Output 34.8.3: Data Set 2: Interpolated Values and Standard Errors

DATE    logair   estimate   std
JAN57   5.753    5.733      0.045
FEB57   5.707    5.738      0.049
MAR57   5.875    5.893      0.052
APR57   5.852    5.850      0.054
MAY57   5.872    5.843      0.055
JUN57   6.045    5.951      0.055
JUL57   6.142    6.051      0.055
AUG57   6.146    6.055      0.054
SEP57   6.001    5.938      0.052
OCT57   5.849    5.812      0.049
NOV57   5.720    5.680      0.045


Output 34.8.4 resembles Table 4 in Kohn and Ansley (1986). These numbers are based on Data3, and they also are very close to those in the article.

Output 34.8.4: Data Set 3: Interpolated Values and Standard Errors

DATE    logair   estimate   std
JUL49   4.997    5.013      0.031
JUN57   6.045    6.024      0.030
JUL57   6.142    6.147      0.031
AUG57   6.146    6.148      0.030
JUL60   6.433    6.409      0.031


Output 34.8.5 resembles Table 5 in Kohn and Ansley (1986). As before, the numbers are very close to those in the article.

Output 34.8.5: Data Set 4: Interpolated Values and Standard Errors

DATE    logair   estimate   std
JUN57   6.045    6.023      0.030
AUG57   6.146    6.147      0.030


The similarity between the outputs in this example and the results shown in Kohn and Ansley (1986) demonstrates that PROC UCM can be used effectively for nonstationary ARIMA models with missing data.