PROC PLM: By-Group Processing :: SAS/STAT(R) 9.3 User's Guide

Example 68.5 By-Group Processing

This example uses data from a study of the analgesic effects of treatments on elderly patients with neuralgia. Its purpose is to show how PROC PLM behaves in different situations when By-group processing is present. Two test treatments and a placebo are compared, and the response is whether the patient reported pain. For each patient, the age, gender, and duration of complaint before the treatment began were recorded. The following DATA step creates the data set named Neuralgia:

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No
P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes
A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes
A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No
A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No
B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No
P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes
A F 74  1 No  B M 80 21 Yes A F 69  3 No
;

The data set contains five variables. Treatment is a classification variable that has three levels: A and B represent the two test treatments, and P represents the placebo treatment. Sex is a classification variable that indicates each patient’s gender. Age is a continuous variable that indicates the age in years of each patient when a treatment began. Duration is a continuous variable that indicates the duration of complaint in months. The last variable Pain is the response variable with two levels: ‘Yes’ if pain was reported, ‘No’ if no pain was reported.

Suppose there is some preliminary belief that the dependency of pain on the explanatory variables is different for male and female patients, so you want to fit a separate model for each gender. You also believe that some of the explanatory variables might be redundant for predicting the probability of Pain, so you want to perform model selection to eliminate unnecessary effects. You can use the following statements:

proc sort data=Neuralgia;
   by sex;
run;

proc logistic data=Neuralgia;
   class Treatment / param=glm;
   model pain = Treatment Age Duration / selection=backward;
   by sex;
   store painmodel;
   title 'Logistic Model on Neuralgia';
run;

PROC SORT is called to sort the data by the variable Sex. The LOGISTIC procedure is then called to model the probability of no pain. Three variables are specified for the full model: Treatment, Age, and Duration. Backward elimination is used as the model selection method. The BY statement specifies that separate models be fitted for male and female patients. Finally, the STORE statement specifies that the fitted results be saved to an item store named painmodel.
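The statement that the procedure models the probability of no pain relies on the default response ordering in PROC LOGISTIC, where 'No' sorts before 'Yes'. As a minimal sketch, assuming you prefer to state the modeled level explicitly rather than rely on that default, the EVENT= response option can be added to the MODEL statement. The rest of the step is unchanged from the example.

```sas
proc logistic data=Neuralgia;
   class Treatment / param=glm;
   /* event='No' names the modeled response level explicitly;      */
   /* this is intended to match the default ordering used above.   */
   model pain(event='No') = Treatment Age Duration / selection=backward;
   by sex;
   store painmodel;
   title 'Logistic Model on Neuralgia';
run;
```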

Output 68.5.1 lists the parameter estimates from the two models after backward elimination is performed. In the model for female patients, Treatment is the only factor that affects the probability of no pain, and Treatments A and B have the same positive effect on that probability. In the model for male patients, both Treatment and Age are included in the selected model. Treatments A and B have different positive effects, whereas Age has a negative effect on the probability of no pain.

Output 68.5.1 Parameter Estimates for Male and Female Patients

Logistic Model on Neuralgia

The LOGISTIC Procedure

Sex=F

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-0.4055	0.6455	0.3946	0.5299
Treatment	A	1	2.6027	1.2360	4.4339	0.0352
Treatment	B	1	2.6027	1.2360	4.4339	0.0352
Treatment	P	0	0	.	.	.

Sex=M

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	20.6178	9.1638	5.0621	0.0245
Treatment	A	1	3.9982	1.7333	5.3208	0.0211
Treatment	B	1	4.5556	1.9252	5.5993	0.0180
Treatment	P	0	0	.	.	.
Age		1	-0.3416	0.1408	5.8869	0.0153
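As a worked reading of these estimates, the following sketch converts the rounded values displayed in Output 68.5.1 from the logit scale to probabilities and odds ratios for no pain. It uses only the inverse logit and the exponential function; the results are approximate because the displayed estimates are rounded.

```latex
\begin{align*}
\text{Female, placebo:} \quad
  & \hat{p} = \frac{1}{1 + e^{-(-0.4055)}} \approx 0.40 \\
\text{Female, Treatment A or B:} \quad
  & \hat{p} = \frac{1}{1 + e^{-(-0.4055 + 2.6027)}}
            = \frac{1}{1 + e^{-2.1972}} \approx 0.90 \\
\text{Female, odds ratio (A or B versus P):} \quad
  & e^{2.6027} \approx 13.5 \\
\text{Male, odds ratio per additional year of Age:} \quad
  & e^{-0.3416} \approx 0.71
\end{align*}
```

In other words, for male patients each additional year of age multiplies the estimated odds of no pain by roughly 0.71, with Treatment held fixed.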

The fitted models are now saved in the item store painmodel. Suppose you want to use the item store to score several new observations. The following DATA steps create three data sets for scoring:

data score1;
   input Treatment $ Sex $ Age;
   datalines;
A F 20
B F 30
P F 40
A M 20
B M 30
P M 40
;

data score2; 
   set score1(drop=sex);
run;

data score3; 
   set score2(drop=Age);
run;

The first score data set, score1, contains six observations with the BY variable Sex and the variables Treatment and Age. Duration, which is specified in the full model but is eliminated from both selected models, is not included. The second score data set, score2, is a copy of score1 except that Sex is dropped. The third score data set, score3, is a copy of score2 except that Age is dropped. You can use the following statements to score the three data sets:

proc plm source=painmodel;
   score data=score1 out=score1out predicted;
   score data=score2 out=score2out predicted;
   score data=score3 out=score3out predicted;
run;
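The PREDICTED keyword in these SCORE statements produces values on the linear predictor (logit) scale. As a hedged sketch, if predictions on the probability scale are wanted instead, the ILINK option of the SCORE statement applies the inverse link function. The output data set name score1prob is hypothetical and is not part of the original example.

```sas
proc plm source=painmodel;
   /* ILINK requests predictions on the inverse-link (probability) scale */
   /* score1prob is an illustrative name for the output data set         */
   score data=score1 out=score1prob predicted / ilink;
run;
```

With this option the Predicted column should contain probabilities between 0 and 1 rather than logits.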

Output 68.5.2 lists the store information that PROC PLM reads from the item store painmodel. The “Model Effects” entry lists all three variables that are specified in the full model before the By-group processing.

Output 68.5.2 Item Store Information for painmodel

Logistic Model on Neuralgia

The PLM Procedure

Store Information
Item Store	WORK.PAINMODEL
Data Set Created From	WORK.NEURALGIA
Created By	PROC LOGISTIC
Date Created	18FEB11:10:49:28
By Variable	Sex
Response Variable	Pain
Link Function	Logit
Distribution	Binary
Class Variables	Treatment Pain
Model Effects	Intercept Treatment Age Duration
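The "Model Effects" entry reflects the full model, not the effects that survive backward elimination in each By-group. As a minimal sketch, assuming you want to check which parameter estimates the item store actually holds for each By-group before scoring, the SHOW statement in PROC PLM can display them. The output of this step is not shown in the original example.

```sas
proc plm source=painmodel;
   /* Display the parameter estimates saved in the item store */
   show parameters;
run;
```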

The three SCORE statements produce three data sets: score1out, score2out, and score3out. Each contains the linear predictors in addition to all the original variables. The data set score1out contains the values shown in Output 68.5.3:

Output 68.5.3 Values of Data Set score1out

Logistic Model on Neuralgia

Obs	Treatment	Sex	Age	Predicted
1	A	F	20	2.1972
2	B	F	30	2.1972
3	P	F	40	-0.4055
4	A	M	20	17.7850
5	B	M	30	14.9269
6	P	M	40	6.9557

Linear predictors are computed for all six observations. Because the BY variable Sex is available in score1, PROC PLM uses separate models to score the observations for male and female patients. As a result, observations with the same Treatment and Age have different linear predictors for different genders.
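The values in Output 68.5.3 can be checked by hand from the estimates in Output 68.5.1. The following worked arithmetic is a sketch that uses the rounded estimates as displayed, so the results for male patients differ from the reported values in the third decimal place, presumably because PROC PLM uses the unrounded estimates.

```latex
\begin{align*}
\eta_{F,A} = \eta_{F,B} &= -0.4055 + 2.6027 = 2.1972 \\
\eta_{F,P} &= -0.4055 \\
\eta_{M,A,\,\mathrm{Age}=20} &= 20.6178 + 3.9982 - 0.3416 \times 20 = 17.7840
  \quad (\text{reported: } 17.7850) \\
\eta_{M,B,\,\mathrm{Age}=30} &= 20.6178 + 4.5556 - 0.3416 \times 30 = 14.9254
  \quad (\text{reported: } 14.9269) \\
\eta_{M,P,\,\mathrm{Age}=40} &= 20.6178 - 0.3416 \times 40 = 6.9538
  \quad (\text{reported: } 6.9557)
\end{align*}
```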

The data set score2out contains the values shown in Output 68.5.4:

Output 68.5.4 Values of Data Set score2out

Logistic Model on Neuralgia

Obs	Sex	Treatment	Age	Predicted
1	F	A	20	2.1972
2	F	B	30	2.1972
3	F	P	40	-0.4055
4	F	A	20	2.1972
5	F	B	30	2.1972
6	F	P	40	-0.4055
7	M	A	20	17.7850
8	M	B	30	14.9269
9	M	P	40	6.9557
10	M	A	20	17.7850
11	M	B	30	14.9269
12	M	P	40	6.9557

The second score data set, score2, does not contain the BY variable Sex, so PROC PLM scores the full data set twice, once with the fitted model for each By-group. In the output data set, Sex is added as the first column to indicate the By-group. The first six observations are scored with the model for female patients, and the next six with the model for male patients. Because Age is not in the model for female patients and Treatments A and B have the same parameter estimate in that model, observations 1, 2, 4, and 5 have the same linear predictor.

The data set score3out contains the values shown in Output 68.5.5:

Output 68.5.5 Values of Data Set score3out

Logistic Model on Neuralgia

Obs	Sex	Treatment	Predicted
1	F	A	2.19722
2	F	B	2.19722
3	F	P	-0.40547
4	F	A	2.19722
5	F	B	2.19722
6	F	P	-0.40547
7	M	A	.
8	M	B	.
9	M	P	.
10	M	A	.
11	M	B	.
12	M	P	.

Like score2, the third score data set, score3, does not contain the BY variable Sex, so PROC PLM again scores the full data set twice, once with each model. In addition, score3 does not contain the variable Age, which is in the selected model for male patients. PROC PLM therefore computes linear predictors for score3 with the model for female patients, and sets the linear predictor to missing when it scores the data set with the model for male patients.
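If only one By-group's model is relevant for a score data set that lacks the BY variable, the scoring can be restricted to that By-group. The following is a minimal sketch, assuming the WHERE statement in PROC PLM is used to select By-groups from the item store. The output data set name score3F is hypothetical and is not part of the original example.

```sas
proc plm source=painmodel;
   /* Apply the following statements only to the Sex=F By-group */
   where sex = 'F';
   /* score3F is an illustrative name for the output data set   */
   score data=score3 out=score3F predicted;
run;
```

Under this assumption, score3 would be scored once, with the model for female patients, so the observations with missing predictions from the model for male patients would not be produced.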