24494 - Fitting transition models to discrete Markov chain data with or without predictor variables

Usage Note 24494: Fitting transition models to discrete Markov chain data with or without predictor variables

Agresti (2002) shows that the transition model for a first-order Markov chain can be fit as a loglinear model. A loglinear model can be fit in PROC GENMOD as a Poisson regression by using the cell counts of the table as the response and specifying the DIST=POISSON option in the MODEL statement. For example, the following data indicates the presence (1) or absence (2) of wheezing in children observed yearly at ages 9, 10, 11, and 12. The data is summarized with one observation per pattern of wheezing and a variable, COUNT, that contains the number of children exhibiting that pattern. All 16 possible patterns were observed.

      data resp;
        do age9=1,2; do age10=1,2; do age11=1,2; do age12=1,2;
          input count @@; output;
        end; end; end; end;
        datalines;
      94 30 15 28 14  9 12  63
      19 15 10 44 17 42 35 572
      ;

These statements fit the first-order transition model and allow the response to depend on only the preceding response:

      proc genmod;
        model count = age9|age10 age10|age11 age11|age12 / dist=poisson;
        run;
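
Written out, the MODEL statement above corresponds to the loglinear model below. The lambda notation is conventional loglinear notation introduced here for illustration and does not come from the note itself; mu_abcd denotes the expected count for the wheezing pattern (a, b, c, d) at ages 9, 10, 11, and 12.

```latex
\log \mu_{abcd} = \lambda
  + \lambda^{9}_{a} + \lambda^{10}_{b} + \lambda^{11}_{c} + \lambda^{12}_{d}
  + \lambda^{9,10}_{ab} + \lambda^{10,11}_{bc} + \lambda^{11,12}_{cd}
```

Only adjacent ages interact, so the response at age 12 is conditionally independent of the responses at ages 9 and 10 given the response at age 11, which is the first-order Markov property.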

These statements fit the second-order transition model and allow dependence on the last two responses:

      proc genmod;
        model count = age9|age10|age11 age10|age11|age12 / dist=poisson;
        run;
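
Because the first-order model is nested in the second-order model, the two fits can be compared with a likelihood ratio test on the difference in deviances. The following is a minimal sketch of one way to do that, not part of the original note. It assumes that the GENMOD ODS table ModelFit contains the variables Criterion, DF, and Value; the data set names FIT1, FIT2, and LRTEST are hypothetical.

```sas
/* Capture the goodness-of-fit table from each model */
ods output ModelFit=fit1;
proc genmod data=resp;
  model count = age9|age10 age10|age11 age11|age12 / dist=poisson;
  run;

ods output ModelFit=fit2;
proc genmod data=resp;
  model count = age9|age10|age11 age10|age11|age12 / dist=poisson;
  run;

/* Likelihood ratio test: difference in deviances, */
/* referred to chi-square on the difference in DF  */
data lrtest;
  merge fit1(where=(Criterion="Deviance") rename=(DF=df1 Value=dev1))
        fit2(where=(Criterion="Deviance") rename=(DF=df2 Value=dev2));
  lr_chisq = dev1 - dev2;
  lr_df    = df1 - df2;
  p_value  = 1 - probchi(lr_chisq, lr_df);
  keep dev1 df1 dev2 df2 lr_chisq lr_df p_value;
  run;

proc print data=lrtest noobs;
  run;
```

A small p-value would suggest that dependence on the last two responses improves on the first-order model.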

An alternative way to fit the first-order model begins by rearranging the data to have one observation per transition. The variable y_t is the response at each age (10, 11, or 12) and variable y_t_1 is the response at the preceding age (9, 10, or 11).

      data trans; 
        set resp;
        y_t=age10; y_t_1=age9; output;
        y_t=age11; y_t_1=age10; output;
        y_t=age12; y_t_1=age11; output;
        run;

The first-order model is then easily estimated in PROC LOGISTIC. Additionally, the four transition probabilities are directly provided by the PREDPROBS= option in the OUTPUT statement.

      proc logistic data=trans;
        freq count;
        model y_t(event="1")=y_t_1;
        output out=tprobs predprobs=individual;
        run;
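
As a cross-check, the same transition probabilities can be read from a weighted crosstabulation of the rearranged data. With a single two-level predictor, the logistic model has as many parameters as there are predictor levels, so its predicted probabilities reproduce the observed row proportions. A minimal sketch:

```sas
/* Row percentages are the observed transition proportions */
proc freq data=trans;
  weight count;                          /* each row stands for COUNT children */
  tables y_t_1*y_t / nocol nopercent;    /* rows: prior response, columns: current */
  run;
```

Each row of the table corresponds to one prior state (y_t_1), and the row percentages across y_t sum to 100.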

Another example presented by Agresti (2013) examines raw data in the form of a series of 84 monthly changes (y_t), where 1 indicates a change greater than average and 0 indicates a change less than average. The LAG and LAG2 functions are used to define the change one observation earlier (y_t_1) and two observations earlier (y_t_2).

      data evap;
        input y_t @@;
        y_t_1=lag(y_t);
        y_t_2=lag2(y_t);
        datalines;
      1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1
      1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0
      ;
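
Printing the first few observations is a quick way to confirm how the lagged variables line up. This is a small illustrative check, not part of the original note:

```sas
/* Show the first four observations of the series and its lags */
proc print data=evap(obs=4);
  var y_t y_t_1 y_t_2;
  run;
```

In the first observation both y_t_1 and y_t_2 are missing, and in the second observation only y_t_2 is missing, because LAG and LAG2 return missing values until enough prior values have been read.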

The following statements fit the second-order model. The SCALE=NONE and AGGREGATE options allow computation of the deviance statistic for evaluating the model. The second-lag variable, y_t_2, is not significant (p=0.2266), so the first-order model is then fit by removing y_t_2. Because y_t_2 is missing for the first two observations while y_t_1 is missing only for the first, the WHERE statement restricts the first-order fit to exactly the observations used by the second-order model.

      proc logistic data=evap;
        model y_t(event="1") = y_t_1 y_t_2 / scale=none aggregate;
        run;

      proc logistic;
        where y_t_2 ne .;
        model y_t(event="1") = y_t_1;
        output out=tprobs predprobs=individual;
        run;

The y_t_1 parameter in the first-order model is significant (p<0.0001) and is estimated to be 2.996. As noted by Agresti, this estimate means that the odds of a greater than average change (y_t=1) are about exp(2.996) = 20 times as large when the change in the preceding period was greater than average (y_t_1=1) as when it was less than average (y_t_1=0).
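
The interpretation above follows directly from the form of the first-order logistic model. The symbols alpha and beta are labels introduced here for the intercept and the y_t_1 coefficient:

```latex
\operatorname{logit} P(Y_t = 1 \mid Y_{t-1} = y_{t-1}) = \alpha + \beta\, y_{t-1},
\qquad
\frac{\operatorname{odds}(Y_t = 1 \mid y_{t-1} = 1)}{\operatorname{odds}(Y_t = 1 \mid y_{t-1} = 0)}
  = e^{\hat\beta} = e^{2.996} \approx 20.0
```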

Again, the transition probability estimates are the four distinct probabilities provided by the PREDPROBS=INDIVIDUAL option in data set TPROBS.
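
Since TPROBS has one row per input observation, the distinct estimates can be listed by keeping one row per prior state. This is a minimal sketch; it assumes that PREDPROBS=INDIVIDUAL names the predicted probability variables IP_0 and IP_1 after the response levels 0 and 1, and the data set name TP_UNIQUE is hypothetical.

```sas
/* Keep one row for each value of the prior response */
proc sort data=tprobs out=tp_unique(keep=y_t_1 ip_0 ip_1) nodupkey;
  by y_t_1;
  run;

/* Two rows, each holding P(y_t=0) and P(y_t=1) given y_t_1 */
proc print data=tp_unique noobs;
  run;
```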

Agresti (2002) further shows that a transition model that incorporates predictors can easily be fit as a logistic model by using previous response values as additional predictors in the model. In Agresti's example, the presence (Y=1) or absence (Y=0) of illness is measured yearly in children from age 7 to age 10. Whether the mother smokes (MSMOKE=1) or not (MSMOKE=0) is also recorded. Summarized data is again presented and rearranged so that there are three observations per response pattern, one each for ages 8, 9, and 10. A variable that contains the number of children that exhibit the response pattern (COUNT) and another that contains the prior year's response (YPREV) are added. Observations for age 7 are not created because no prior year's response is available; if they were included with missing prior year values, they would be ignored in the analysis. If unsummarized data were presented, it would be arranged as one observation for each measurement on each child, with the prior year's response added as a variable.

      data resp2;
        array agevars {4} age7-age10;
        do age7=0,1; do age8=0,1; do age9=0,1;
        do msmoke=0,1; do age10=0,1;
          input count @@;
          do i=2 to 4;
            age=i+6; y=agevars{i}; yprev=agevars{i-1};
            output;
          end;
        end; end; end; end; end;
        datalines;
      237 10 118 6
       15  4   8 2
       16  2  11 1
        7  3   6 4
       24  3   7 3
        3  2   3 1
        6  2   4 2
        5 11   4 7
      ;
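
For unsummarized data, the prior year's response can be built with the LAG function within each child. The following is a rough sketch under stated assumptions, not taken from the note: the input data set LONG, its variables ID, AGE, MSMOKE, and Y, and the two made-up children in the DATALINES are all hypothetical and serve only to make the step runnable.

```sas
/* Hypothetical long-format data: one row per child per age (made-up values) */
data long;
  input id age msmoke y;
  datalines;
1 7 0 0
1 8 0 1
1 9 0 0
1 10 0 0
2 7 1 1
2 8 1 1
2 9 1 0
2 10 1 1
;

data long_trans;
  set long;
  by id age;                    /* assumes the data is sorted by ID and AGE */
  yprev = lag(y);               /* call LAG on every row so its queue stays in step */
  if first.id then yprev = .;   /* no prior response at a child's first age */
  if yprev ne .;                /* drop the age 7 rows */
  run;
```

The resulting data set has three rows per child (ages 8, 9, and 10) and can be analyzed with the same MODEL statement, without a FREQ statement because each row represents a single measurement.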

These statements fit the transition model, using the prior year's response as a predictor along with maternal smoking and age. Because the data is summarized, the FREQ statement is used.

      proc logistic data=resp2;
        model y(event="1") = msmoke age yprev / scale=none aggregate;
        freq count;
        run;
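
In equation form, the model fit above can be written as follows. The symbols are labels introduced here for illustration: s is MSMOKE, t is AGE (8, 9, or 10), and y_{t-1} is YPREV.

```latex
\operatorname{logit} P(Y_t = 1 \mid Y_{t-1} = y_{t-1})
  = \alpha + \beta_1 s + \beta_2 t + \beta_3\, y_{t-1},
\qquad t = 8, 9, 10
```

Here exp(beta_3) is the odds ratio for illness comparing children who were ill in the prior year with those who were not, holding maternal smoking and age fixed.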

Transition models are also discussed and illustrated in Molenberghs and Verbeke (2005).

Operating System and Release Information

Product Family	Product	System	SAS Release
			Reported	Fixed*
SAS System	SAS/STAT	All	n/a

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:	low
Topic:	SAS Reference ==> Procedures ==> LOGISTIC; SAS Reference ==> Procedures ==> GENMOD; Analytics ==> Categorical Data Analysis; Analytics ==> Longitudinal Analysis

Date Modified:	2006-04-14 08:38:10
Date Created:	2006-03-01 16:04:05
