Usage Note 37108: Setting reference levels for CLASS predictor variables

Many modeling procedures provide options in their CLASS statements (or in other statements) that allow you to specify reference levels for categorical predictor variables. The first section below shows how to specify the reference level in a procedure that offers the REF= option in its CLASS statement. Note that the REF= option for setting reference levels was added to the GLM, MIXED, GLIMMIX, and ORTHOREG procedures beginning in SAS 9.3 TS1M2. Also in that release, the REF= option was made available for use with the GLM parameterization in procedures where it had previously been available only with other parameterizations. In releases prior to SAS 9.3 TS1M2, and in later releases of some procedures such as PROBIT, LIFEREG, and GAM, the REF= option in the CLASS statement is not available. These procedures always use the last level (after the levels are sorted) of a CLASS variable as the reference level. You can use either of the last two approaches below to make your desired reference level the last level.

Some procedures offer several ways to parameterize (code) the multiple design variables that the CLASS statement creates to represent a categorical predictor in the model. All parameterizations produce equivalent models but impose different interpretations on the model parameters. See the section "Parameterization of Model Effects" in the Shared Concepts and Topics chapter of the SAS/STAT User's Guide. This note lists the procedures offering multiple parameterizations and shows how a parameterization can be selected.
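
As a brief illustration of selecting a parameterization, the following sketch assumes PROC LOGISTIC, which accepts a PARAM= option after a slash in its CLASS statement. It reuses the Heights data set created later in this note. The choices of PARAM=REF (reference cell coding) and EVENT="1" are made only for illustration and are not taken from the original examples.

      proc logistic data=Heights;
         /* PARAM=REF requests reference cell coding;        */
         /* REF="F" makes Gender="F" the reference level.    */
         class Gender(ref="F") / param=ref;
         model Response(event="1") = Gender Height;
         run;

With a different PARAM= value the fitted model is equivalent, but the Gender parameter is interpreted differently.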

To set the reference level of a response variable that is categorical (such as in a logistic regression model), see this note.

Use a procedure offering the REF= option in the CLASS statement

Suppose Gender, with levels "M" and "F", is a predictor in your model and you want "F" to be the reference level. In a procedure such as GLIMMIX, which provides the REF= option in the CLASS statement (see the Note at the end of this document), you can explicitly set the reference level for this and any other CLASS predictor. In the CLASS statement below, the REF="F" option specifies that Gender="F" is to be the reference level. If you have additional variables in the CLASS statement, you can specify the REF= option in parentheses following each variable to set its reference level. For instance, suppose you have an additional numeric variable, Trt, with values 0 and 1, for which you want Trt=0 to be the reference level. Note that quotes are used around REF= values whether the value is numeric or character, formatted or unformatted.

      proc glimmix data=Heights;
         class Gender(ref="F") Trt(ref="0");
         model Response(event="0") = Gender Height Trt / dist=binary link=probit solution ddfm=none;
         run;

If formats are used, specify the formatted value of the reference level in the REF= option. For example:

      proc format;
         value $genfmt
               'F' = 'Female'
               'M' = 'Male';
         run;
   
      proc glimmix data=Heights;
         format Gender $genfmt.;
         class Gender (ref="Female");
         model Response(event="0") = Gender Height / dist=binary link=probit solution ddfm=none;
         run;

If the error message "Invalid reference value" appears in the log, see this note for common causes. The most common cause is specifying the unformatted value when a format is associated with the variable.
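
To illustrate that most common cause, the following sketch reuses the $genfmt format and the Heights data from the example above. Because a format is associated with Gender, the commented-out CLASS statement, which names the unformatted value, would be expected to produce the message. The active CLASS statement names the formatted value instead.

      proc glimmix data=Heights;
         format Gender $genfmt.;
         /* class Gender(ref="F");   unformatted value: expected to be rejected */
         class Gender(ref="Female"); /* formatted value: matches a level        */
         model Response(event="0") = Gender Height / dist=binary link=probit solution ddfm=none;
         run;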

Sort and specify the ORDER=DATA option

Consider a CLASS variable, X, with values 0 and 1. By default, these values are arranged in ascending alphanumeric order, which results in 1 being the last level and therefore the reference level. However, if the data are arranged so that the value 1 appears before the value 0 as you read down the data set, and if you specify the ORDER=DATA option in the PROC statement, then the levels of X stay in the order encountered in the data set. In that case, 0 is the last level found and it becomes the reference level. One way to get the values of X in this order is to sort your data set by X using the DESCENDING option.

For example, in the following data set, the Gender variable has levels F and M. Since F occurs before M in ascending alphanumeric sorting, M will be the reference level by default.

      data Heights; 
         input Response Gender$ Height @@; 
         datalines; 
      1 F 67   0 F 66   1 F 64   1 M 71   1 M 72   0 F 63 
      1 F 63   0 F 67   1 M 69   0 M 68   1 M 70   1 F 63 
      0 M 64   1 F 67   1 F 66   0 M 67   0 M 67   0 M 69 
      ;
      
      proc probit data=Heights; 
         class Gender; 
         model Response = Gender Height; 
         run;

The "Class Level Information" table shows that M is the last level of Gender.

Class Level Information
Name      Levels  Values
Gender         2  F M
Response       2  0 1

In the "Analysis of Maximum Likelihood Parameter Estimates" table, M is the reference level since it is the last level shown and has its parameter estimate and degrees of freedom set to zero.

Analysis of Maximum Likelihood Parameter Estimates
Parameter              DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept               1   20.1903         12.1830     -3.6879    44.0685        2.75      0.0975
Gender        F         1   -1.6454          0.9390     -3.4859     0.1950        3.07      0.0797
Gender        M         0    0.0000               .           .          .           .           .
Height                  1   -0.2917          0.1768     -0.6383     0.0548        2.72      0.0990

However, if you sort the data by descending Gender within each level of Response, as in the PROC SORT step below, then M precedes F in the sorted data set (New). By specifying the ORDER=DATA option, this ordering is preserved and F becomes the reference level.

      proc sort data=Heights out=New;
         by Response descending Gender;
         run;
      proc probit data=New order=data; 
         class Gender; 
         model Response = Gender Height; 
         run;

Now, F is the last level in the "Class Level Information" table, and the "Analysis of Maximum Likelihood Parameter Estimates" table shows that F is the reference level.

Class Level Information
Name      Levels  Values
Gender         2  M F
Response       2  0 1

Analysis of Maximum Likelihood Parameter Estimates
Parameter              DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept               1   18.5448         11.4952     -3.9853    41.0750        2.60      0.1067
Gender        M         1    1.6454          0.9390     -0.1950     3.4859        3.07      0.0797
Gender        F         0    0.0000               .           .          .           .           .
Height                  1   -0.2917          0.1768     -0.6383     0.0548        2.72      0.0990

Create a format and specify the ORDER=FORMATTED option

An alternative to reordering or sorting the data is to assign formatted values to the levels such that the last formatted value in ascending alphanumeric order is the desired reference level. Formatted values are used when you specify the ORDER=FORMATTED option in the PROC statement, though this is usually the default when a format exists for the variable.

In the following example, the Group variable indicates use of one of two types of pain reliever. Suppose you want Group=1 to be the reference level. By default, Group=2 would be the reference level since it is the last sorted value.

      data Headache;
         input Minutes Group Censor @@;
         datalines;
      11  1  0   12  1  0   19  1  0   19  1  0
      19  1  0   19  1  0   21  1  0   20  1  0
      21  1  0   21  1  0   20  1  0   21  1  0
      20  1  0   21  1  0   25  1  0   27  1  0
      30  1  0   21  1  1   24  1  1   14  2  0
      16  2  0   16  2  0   21  2  0   21  2  0
      23  2  0   23  2  0   23  2  0   23  2  0
      25  2  1   23  2  0   24  2  0   24  2  0
      26  2  1   32  2  1   30  2  1   30  2  0
      32  2  1   20  2  1
      ;

By assigning the following formatted values to the levels, Group=1 has the last formatted value ('Old') after sorting, so it becomes the reference level when the ORDER=FORMATTED option is in effect.

      proc format;
         value grpfmt
               1 = 'Old'
               2 = 'Improved';
         run;
   
      proc lifereg data=Headache order=formatted;
         format Group grpfmt.;
         class Group;
         model Minutes*Censor(1)=Group;
         run;

Class Level Information
Name      Levels  Values
Group          2  Improved Old

Analysis of Maximum Likelihood Parameter Estimates
Parameter              DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept               1    3.1158          0.0520      3.0139     3.2178     3588.92      <.0001
Group         Improved  1    0.1933          0.0786      0.0393     0.3473        6.05      0.0139
Group         Old       0    0.0000               .           .          .           .           .
Scale                   1    0.2122          0.0304      0.1603     0.2809
Weibull Shape           1    4.7128          0.6742      3.5604     6.2381
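
Before fitting the model, it can help to confirm which formatted value sorts last. One possible check, sketched here with PROC FREQ (this step is an addition and not part of the original example), lists the levels of Group in formatted order using the same grpfmt. format. The last row of the frequency table is the level that would become the reference level under ORDER=FORMATTED.

      proc freq data=Headache order=formatted;
         format Group grpfmt.;   /* same format as in the LIFEREG step */
         tables Group;           /* levels are listed in formatted order */
         run;

Here 'Improved' should be listed before 'Old', which agrees with the "Class Level Information" table above.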

__________

Note: The REF= option for setting reference levels was added to the GLM, MIXED, GLIMMIX, and ORTHOREG procedures beginning in SAS 9.3 TS1M2. Also in that release, the REF= option was made available for use with the GLM parameterization in procedures where it had previously been available only with other parameterizations.



Operating System and Release Information

Product Family: SAS System
Product: SAS/STAT
SAS Release (Reported, Fixed*): none listed
System:
z/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2008
Microsoft Windows XP Professional
Windows Millennium Edition (Me)
Windows Vista
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.