The DATA step provides two functions, LAG and DIF, for accessing previous values of a variable or expression. These functions are useful for computing lags and differences of series.
For example, the following statements add the variables CPILAG and CPIDIF to the USCPI data set. The variable CPILAG contains lagged values of the CPI series. The variable CPIDIF contains the changes of the CPI series from the previous period; that is, CPIDIF is CPI minus CPILAG. The new data set is shown in part in Figure 4.14.
    data uscpi;
       set uscpi;
       cpilag = lag( cpi );
       cpidif = dif( cpi );
    run;

    proc print data=uscpi;
    run;
Figure 4.14: USCPI Data Set with Lagged and Differenced Series
Plot of USCPI Data

Obs | date | cpi | cpilag | cpidif |
---|---|---|---|---|
1 | JUN1990 | 129.9 | . | . |
2 | JUL1990 | 130.4 | 129.9 | 0.5 |
3 | AUG1990 | 131.6 | 130.4 | 1.2 |
4 | SEP1990 | 132.7 | 131.6 | 1.1 |
5 | OCT1990 | 133.5 | 132.7 | 0.8 |
6 | NOV1990 | 133.8 | 133.5 | 0.3 |
7 | DEC1990 | 133.8 | 133.8 | 0.0 |
8 | JAN1991 | 134.6 | 133.8 | 0.8 |
9 | FEB1991 | 134.8 | 134.6 | 0.2 |
10 | MAR1991 | 135.0 | 134.8 | 0.2 |
11 | APR1991 | 135.2 | 135.0 | 0.2 |
12 | MAY1991 | 135.6 | 135.2 | 0.4 |
13 | JUN1991 | 136.0 | 135.6 | 0.4 |
14 | JUL1991 | 136.2 | 136.0 | 0.2 |
When used in this simple way, LAG and DIF act as lag and difference functions. However, it is important to keep in mind that, despite their names, the LAG and DIF functions available in the DATA step are not true lag and difference functions.
Rather, LAG and DIF are queuing functions that remember and return argument values from previous calls. The LAG function remembers the value you pass to it and returns as its result the value you passed to it on the previous call. The DIF function works the same way but returns the difference between the current argument and the remembered value. (LAG and DIF return a missing value the first time the function is called.)
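The queuing behavior is easiest to see when LAG is not called on every observation. The following is a minimal sketch; the data set name DEMO and the variables X and LAGX are hypothetical and are used only for illustration.

```sas
/* Hypothetical illustration: DEMO, X, and LAGX are not part of the USCPI examples. */
data demo;
   input x @@;
   /* LAG is called only on even-numbered observations, so its queue */
   /* is updated only on those calls.                                */
   if mod( _n_, 2 ) = 0 then lagx = lag( x );
datalines;
1 2 3 4 5 6
;
run;

proc print data=demo;
run;
```

In this sketch, LAGX is missing for observation 2 (the first call), 2 for observation 4, and 4 for observation 6. Each call returns the argument from the previous call, not the value of X from the previous observation. LAGX is missing on the odd-numbered observations because LAG is never called there.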
A true lag function does not return the value of the argument for the "previous call," as do the DATA step LAG and DIF functions. Instead, a true lag function returns the value of its argument for the "previous observation," regardless of the sequence of previous calls to the function. Thus, for a true lag function to be possible, it must be clear what the "previous observation" is.
If the data are sorted chronologically, then LAG and DIF act as true lag and difference functions. If in doubt, use PROC SORT to sort your data before using the LAG and DIF functions. Beware of missing observations, which can cause LAG and DIF to return values that are not the actual lag and difference values.
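One way to guard against missing observations is to lag the date as well and keep the lagged value only when the previous observation is exactly one period earlier. The following is a sketch, assuming that DATE holds monthly SAS date values as in the USCPI examples; the output data set name USCPI_CHECKED and the helper variable PREVDATE are hypothetical.

```sas
/* Hypothetical names: USCPI_CHECKED and PREVDATE are for illustration only. */
data uscpi_checked;
   set uscpi;
   cpilag   = lag( cpi );
   cpidif   = dif( cpi );
   prevdate = lag( date );   /* date of the observation read just before this one */

   /* From the second observation on, keep the lag and difference only */
   /* if the previous observation is exactly one month earlier.        */
   if _n_ > 1 then do;
      if intck( 'month', prevdate, date ) ne 1 then do;
         cpilag = .;
         cpidif = .;
      end;
   end;
   drop prevdate;
run;
```

The LAG and DIF calls are deliberately unconditional so that every observation passes through their queues; only the results are set to missing where a gap is detected.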
The DATA step is a powerful tool that can read any number of observations from any number of input files or data sets, can create any number of output data sets, and can write any number of output observations to any of the output data sets, all in the same program. Thus, in general, it is not clear what "previous observation" means in a DATA step program. In a DATA step program, the "previous observation" exists only if you write the program in a simple way that makes this concept meaningful.
Since, in general, the previous observation is not clearly defined, it is not possible to make true lag or difference functions for the DATA step. Instead, the DATA step provides queuing functions that make it easy to compute lags and differences.
The LAG and DIF functions compute lags and differences provided that the sequence of calls to the function corresponds to the sequence of observations in the output data set. However, any complexity in the DATA step that breaks this correspondence causes the LAG and DIF functions to produce unexpected results.
For example, suppose you want to add the variable CPILAG to the USCPI data set, as in the previous example, and you also want to subset the series to 1991 and later years. You might use the following statements:
    data subset;
       set uscpi;
       if date >= '1jan1991'd;
       cpilag = lag( cpi ); /* WRONG PLACEMENT! */
    run;
If the subsetting IF statement comes before the LAG function call, the value of CPILAG will be missing for January 1991, even though a value for December 1990 is available in the USCPI data set. To avoid losing this value, you must rearrange the statements to ensure that the LAG function is actually executed for the December 1990 observation.
    data subset;
       set uscpi;
       cpilag = lag( cpi );
       if date >= '1jan1991'd;
    run;
In other cases, the subsetting statement should come before the LAG and DIF functions. For example, the following statements subset the FOREOUT data set shown in a previous example to select only _TYPE_=RESIDUAL observations and also to compute the variable LAGRESID:
    data residual;
       set foreout;
       if _type_ = "RESIDUAL";
       lagresid = lag( cpi );
    run;
Another pitfall of LAG and DIF functions arises when they are used to process time series cross-sectional data sets. For example, suppose you want to add the variable CPILAG to the CPICITY data set shown in a previous example. You might use the following statements:
    data cpicity;
       set cpicity;
       cpilag = lag( cpi );
    run;
However, these statements do not yield the desired result. In the data set produced by these statements, the value of CPILAG for the first observation for the first city is missing (as it should be), but in the first observation for all later cities, CPILAG contains the last value for the previous city. To correct this, set the lagged variable to missing at the start of each cross section, as follows:
    data cpicity;
       set cpicity;
       by city date;
       cpilag = lag( cpi );
       if first.city then cpilag = .;
    run;
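The same reset applies to differences. As a sketch, assuming that CPICITY is sorted by CITY and DATE (which the BY statement requires), the following variant also computes CPIDIF and clears both variables at the start of each city.

```sas
data cpicity;
   set cpicity;
   by city date;
   /* Call LAG and DIF on every observation so the queues stay in step */
   /* with the observations, and then discard the results that cross   */
   /* a city boundary.                                                 */
   cpilag = lag( cpi );
   cpidif = dif( cpi );
   if first.city then do;
      cpilag = .;
      cpidif = .;
   end;
run;
```

Note that the functions are called unconditionally and only their results are reset. Placing the LAG or DIF call itself inside an IF statement would change the sequence of calls and therefore the values returned.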
You can also use the EXPAND procedure to compute lags and differences. For example, the following statements compute lag and difference variables for CPI:
    proc expand data=uscpi out=uscpi method=none;
       id date;
       convert cpi=cpilag / transform=( lag 1 );
       convert cpi=cpidif / transform=( dif 1 );
    run;
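For time series cross-sectional data, PROC EXPAND accepts a BY statement, so lags and differences can be computed separately within each cross section. The following is a sketch, assuming that the CPICITY data set is sorted by CITY and DATE; the output data set name CPICITY_LAG is hypothetical.

```sas
/* Hypothetical output name: CPICITY_LAG is for illustration only. */
proc expand data=cpicity out=cpicity_lag method=none;
   by city;                                    /* process each city as its own series */
   id date;
   convert cpi=cpilag / transform=( lag 1 );
   convert cpi=cpidif / transform=( dif 1 );
run;
```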
You can also calculate lags and differences in the DATA step without using LAG and DIF functions. For example, the following statements add the variables CPILAG and CPIDIF to the USCPI data set:
    data uscpi;
       set uscpi;
       retain cpilag;
       cpidif = cpi - cpilag;
       output;
       cpilag = cpi;
    run;
The RETAIN statement prevents the DATA step from reinitializing CPILAG to a missing value at the start of each iteration and thus allows CPILAG to retain the value of CPI assigned to it in the last statement. The OUTPUT statement causes the output observation to contain values of the variables before CPILAG is reassigned the current value of CPI in the last statement. This is the approach you must use to build a variable that is a function of its own previous values.
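As an illustration of a variable that depends on its own previous value, the following sketch computes a simple exponentially weighted average of CPI with the RETAIN statement. The output data set name USCPI_SMOOTH, the variable CPISMOOTH, and the weight 0.3 are hypothetical choices made only for this example.

```sas
/* Hypothetical names and weight: USCPI_SMOOTH, CPISMOOTH, and 0.3 are illustrative. */
data uscpi_smooth;
   set uscpi;
   retain cpismooth;
   /* Start the recursion at the first observed value. */
   if _n_ = 1 then cpismooth = cpi;
   /* Otherwise combine the current CPI with the previous smoothed value, */
   /* which RETAIN has carried over from the prior iteration.             */
   else cpismooth = 0.3 * cpi + 0.7 * cpismooth;
run;
```

Because CPISMOOTH is retained, the right-hand side of the assignment still holds the value computed for the previous observation when the new value is calculated.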
The preceding discussion of LAG and DIF functions applies to LAG and DIF functions available in the DATA step. However, LAG and DIF functions are also used in the MODEL procedure.
The MODEL procedure LAG and DIF functions do not work like the DATA step LAG and DIF functions. The LAG and DIF functions supported by PROC MODEL are true lag and difference functions, not queuing functions.
Unlike the DATA step, the MODEL procedure processes observations from a single input data set, so the "previous observation" is always clearly defined in a PROC MODEL program. Therefore, PROC MODEL is able to define LAG and DIF as true lagging functions that operate on values from the previous observation. See Chapter 26: The MODEL Procedure, for more information about LAG and DIF functions in the MODEL procedure.
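As a minimal sketch of LAG used inside PROC MODEL, the following statements fit a regression of CPI on its own first lag. The parameter names A and B and the form of the model are assumptions chosen for illustration and are not taken from the preceding examples.

```sas
/* Hypothetical model: the parameters A and B are for illustration only. */
proc model data=uscpi;
   parms a b;
   /* Here LAG refers to the value of CPI in the previous observation. */
   cpi = a + b * lag( cpi );
   fit cpi;
run;
quit;
```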