This example uses a data set from a study of the analgesic effects of treatments on elderly patients with neuralgia. The purpose of this example is to show how PROC PLM behaves in different situations when By-group processing is present. Two test treatments and a placebo are compared, and the response is whether the patient reported pain. For each patient, the age, gender, and duration of complaint before the treatment began were recorded. The following DATA step creates the data set named Neuralgia:
data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68 1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes  B F 67 28 No
B F 77 16 No  A F 71 12 No  B F 72 50 No  B F 76 9 Yes  A M 71 17 Yes
A F 63 27 No  A F 69 18 Yes  B F 66 12 No  A M 62 42 No  P F 64 1 Yes
A F 64 17 No  P M 74 4 No  A F 72 25 No  P M 70 1 Yes  B M 66 19 No
B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69 1 No  B F 78 1 No
P M 83 1 Yes  B F 69 42 No  B M 75 30 Yes  P M 77 29 Yes  P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70 1 No  B M 67 23 No
A M 76 25 Yes  P M 78 12 Yes  B M 77 1 Yes  B F 69 24 No  P M 66 4 Yes
P F 65 29 No  P M 60 26 Yes  A M 78 15 Yes  B M 75 21 Yes  A F 67 11 No
P F 72 27 No  P F 70 13 Yes  A M 75 6 Yes  B F 65 7 No  P F 68 27 Yes
P M 68 11 Yes  P M 67 17 Yes  B M 70 22 No  A M 65 15 No  P F 67 1 Yes
A M 67 10 No  P F 72 11 Yes  A F 74 1 No  B M 80 21 Yes  A F 69 3 No
;
The data set contains five variables. Treatment is a classification variable that has three levels: A and B represent the two test treatments, and P represents the placebo treatment. Sex is a classification variable that indicates each patient’s gender. Age is a continuous variable that indicates the age in years of each patient when a treatment began. Duration is a continuous variable that indicates the duration of complaint in months. The last variable, Pain, is the response variable with two levels: ‘Yes’ if pain was reported and ‘No’ if no pain was reported.
Suppose there is some preliminary belief that the dependence of pain on the explanatory variables differs between male and female patients, which leads to a separate model for each gender. You also believe that some explanatory variables might carry redundant information for predicting the probability of pain, so you want to perform model selection to eliminate unnecessary effects. You can use the following statements:
proc sort data=Neuralgia;
   by sex;
run;

proc logistic data=Neuralgia;
   class Treatment / param=glm;
   model pain = Treatment Age Duration / selection=backward;
   by sex;
   store painmodel;
   title 'Logistic Model on Neuralgia';
run;
PROC SORT sorts the data by the variable Sex. The LOGISTIC procedure then models the probability of no pain. Three variables are specified for the full model: Treatment, Age, and Duration. Backward elimination is used as the model selection method. The BY statement specifies that separate models be fitted for male and female patients. Finally, the STORE statement saves the fitted results to an item store named painmodel.
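The statement that the procedure models the probability of no pain relies on the default response ordering in PROC LOGISTIC, where 'No' sorts before 'Yes'. As a minimal sketch, assuming the data set has already been sorted by Sex, the modeled event can be stated explicitly with the EVENT= response option. The item store name painmodel_explicit is hypothetical and is used here only so that the original store is not overwritten.

```sas
/* Sketch: same By-group fit, with the modeled event named explicitly. */
/* Assumes Neuralgia is already sorted by Sex.                          */
proc logistic data=Neuralgia;
   class Treatment / param=glm;
   model Pain(event='No') = Treatment Age Duration / selection=backward;
   by Sex;
   store painmodel_explicit;   /* hypothetical item store name */
run;
```

Stating the event explicitly guards against a change in response coding silently reversing the sign of every estimate.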
Output 68.5.1 lists parameter estimates from the two models after backward elimination is performed. In the model for female patients, Treatment is the only effect that remains, and Treatments A and B have the same positive effect on the probability of no pain. In the model for male patients, both Treatment and Age are retained. Treatments A and B have different positive effects, whereas Age has a negative effect on the probability of no pain.
Logistic Model on Neuralgia |

Sex=F |

Analysis of Maximum Likelihood Estimates | ||||||
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -0.4055 | 0.6455 | 0.3946 | 0.5299 |
Treatment | A | 1 | 2.6027 | 1.2360 | 4.4339 | 0.0352 |
Treatment | B | 1 | 2.6027 | 1.2360 | 4.4339 | 0.0352 |
Treatment | P | 0 | 0 | . | . | . |

Sex=M |

Analysis of Maximum Likelihood Estimates | ||||||
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | 20.6178 | 9.1638 | 5.0621 | 0.0245 |
Treatment | A | 1 | 3.9982 | 1.7333 | 5.3208 | 0.0211 |
Treatment | B | 1 | 4.5556 | 1.9252 | 5.5993 | 0.0180 |
Treatment | P | 0 | 0 | . | . | . |
Age | | 1 | -0.3416 | 0.1408 | 5.8869 | 0.0153 |
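Reading the estimates in Output 68.5.1 as equations can make the two selected models easier to compare. In the following sketch, the symbol η (the linear predictor, that is, the logit of the probability of no pain) and the indicator notation 1[·] are introduced here for illustration and are not part of the procedure output.

```latex
\begin{aligned}
\eta_{F} &= -0.4055 + 2.6027\,\mathbf{1}[\text{Treatment}=A] + 2.6027\,\mathbf{1}[\text{Treatment}=B] \\
\eta_{M} &= 20.6178 + 3.9982\,\mathbf{1}[\text{Treatment}=A] + 4.5556\,\mathbf{1}[\text{Treatment}=B] - 0.3416\,\text{Age}
\end{aligned}
```

The placebo level P is the reference level under the GLM parameterization, so its contribution is zero in both equations.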
The following DATA steps create three data sets for scoring:

data score1;
   input Treatment $ Sex $ Age;
   datalines;
A F 20
B F 30
P F 40
A M 20
B M 30
P M 40
;

data score2;
   set score1(drop=sex);
run;

data score3;
   set score2(drop=Age);
run;
The first score data set, score1, contains six observations with the variables Treatment, Sex, and Age. These are the BY variable and every effect that is retained in either selected model; Duration was eliminated from both models and is not needed for scoring. The second score data set, score2, is a duplicate of score1 except that Sex is dropped. The third score data set, score3, is a duplicate of score2 except that Age is dropped. You can use the following statements to score the three data sets:
proc plm source=painmodel;
   score data=score1 out=score1out predicted;
   score data=score2 out=score2out predicted;
   score data=score3 out=score3out predicted;
run;
Output 68.5.2 lists the store information that PROC PLM reads from the item store painmodel. The “Model Effects” entry lists all three variables that are specified in the full model before the By-group processing.
Logistic Model on Neuralgia |
Store Information | |
---|---|
Item Store | WORK.PAINMODEL |
Data Set Created From | WORK.NEURALGIA |
Created By | PROC LOGISTIC |
Date Created | 18FEB11:10:49:28 |
By Variable | Sex |
Response Variable | Pain |
Link Function | Logit |
Distribution | Binary |
Class Variables | Treatment Pain |
Model Effects | Intercept Treatment Age Duration |
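The store information is only a summary. If you want to inspect more of what was saved, the SHOW statement in PROC PLM can display further contents of the item store. The following is a minimal sketch, assuming the item store painmodel from the earlier step is still available in the WORK library.

```sas
/* Sketch: display the model effects and the stored parameter estimates. */
proc plm source=painmodel;
   show effects parameters;
run;
```

Comparing the displayed parameter estimates with Output 68.5.1 is a quick way to confirm that the item store holds the models you intend to score with.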
The three SCORE statements produce three data sets: score1out, score2out, and score3out. They contain the linear predictors in addition to all original variables. The data set score1out contains the values shown in Output 68.5.3:
Logistic Model on Neuralgia |
Obs | Treatment | Sex | Age | Predicted |
---|---|---|---|---|
1 | A | F | 20 | 2.1972 |
2 | B | F | 30 | 2.1972 |
3 | P | F | 40 | -0.4055 |
4 | A | M | 20 | 17.7850 |
5 | B | M | 30 | 14.9269 |
6 | P | M | 40 | 6.9557 |
Linear predictors are computed for all six observations. Because the BY variable Sex is available in score1, PROC PLM scores each observation with the model for the matching By group. As a result, observations that have the same Treatment and Age have different linear predictors for male and female patients.
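As a rough check, several of these values can be reproduced by hand from the estimates in Output 68.5.1. The arithmetic below is a sketch that uses the estimates as displayed (rounded to four decimals), so the results for male patients differ slightly from the stored values, presumably because PROC PLM works with the unrounded estimates.

```latex
\begin{aligned}
\text{Obs 1 (A, F, 20):}\quad & -0.4055 + 2.6027 = 2.1972 \\
\text{Obs 3 (P, F, 40):}\quad & -0.4055 \\
\text{Obs 4 (A, M, 20):}\quad & 20.6178 + 3.9982 - 0.3416 \times 20 = 17.7840 \approx 17.7850 \\
\text{Obs 6 (P, M, 40):}\quad & 20.6178 - 0.3416 \times 40 = 6.9538 \approx 6.9557
\end{aligned}
```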
The data set score2out contains the values shown in Output 68.5.4:
Logistic Model on Neuralgia |
Obs | Sex | Treatment | Age | Predicted |
---|---|---|---|---|
1 | F | A | 20 | 2.1972 |
2 | F | B | 30 | 2.1972 |
3 | F | P | 40 | -0.4055 |
4 | F | A | 20 | 2.1972 |
5 | F | B | 30 | 2.1972 |
6 | F | P | 40 | -0.4055 |
7 | M | A | 20 | 17.7850 |
8 | M | B | 30 | 14.9269 |
9 | M | P | 40 | 6.9557 |
10 | M | A | 20 | 17.7850 |
11 | M | B | 30 | 14.9269 |
12 | M | P | 40 | 6.9557 |
The second score data set, score2, does not contain the BY variable Sex. In this case PROC PLM scores the full data set twice, once with the fitted model for each By group. In the output data set, Sex is added as the first column to indicate the By group. The first six observations correspond to the model for female patients, and the next six observations correspond to the model for male patients. Age is not included in the model for female patients, and Treatments A and B have the same parameter estimate in that model, so observations 1, 2, 4, and 5 have the same linear predictor value.
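All predicted values in this example are linear predictors on the logit scale. If probabilities are easier to interpret, the ILINK option in the SCORE statement requests predictions on the data scale through the inverse link function. The following is a minimal sketch; the output data set name score2prob is hypothetical.

```sas
/* Sketch: score on the probability scale instead of the logit scale. */
proc plm source=painmodel;
   score data=score2 out=score2prob predicted / ilink;   /* score2prob is a hypothetical name */
run;
```

Because the fitted models describe the probability of no pain, the resulting values should be read as predicted probabilities of no pain.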
The data set score3out contains the values shown in Output 68.5.5:
Logistic Model on Neuralgia |
Obs | Sex | Treatment | Predicted |
---|---|---|---|
1 | F | A | 2.19722 |
2 | F | B | 2.19722 |
3 | F | P | -0.40547 |
4 | F | A | 2.19722 |
5 | F | B | 2.19722 |
6 | F | P | -0.40547 |
7 | M | A | . |
8 | M | B | . |
9 | M | P | . |
10 | M | A | . |
11 | M | B | . |
12 | M | P | . |
The third score data set, score3, does not contain the BY variable Sex, so PROC PLM again scores the full data set twice, once with each model. In addition, score3 does not contain the variable Age, which is a selected effect in the model for male patients. Thus, PROC PLM computes linear predictor values for score3 when it uses the model for female patients, and it sets the linear predictor to missing when it uses the model for male patients.
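When a scoring data set lacks a variable that some By-group models require, it can be useful to summarize where the missing predictions occur. The following is a minimal sketch, assuming score3out was created by the SCORE statement shown earlier and that the predicted values are stored in the variable Predicted, as in Output 68.5.5.

```sas
/* Sketch: count nonmissing and missing predictions in each By group. */
proc means data=score3out n nmiss;
   class Sex;
   var Predicted;
run;
```

Based on Output 68.5.5, you would expect six nonmissing values for Sex=F and six missing values for Sex=M.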