37108 - Setting reference levels for CLASS predictor variables

Usage Note 37108: Setting reference levels for CLASS predictor variables

Many modeling procedures provide options in their CLASS statements (or in other statements) that allow you to specify reference levels for categorical predictor variables. The first section below shows how to specify the reference level in a procedure that offers the REF= option in its CLASS statement. Note that the REF= option for setting reference levels was added to the GLM, MIXED, GLIMMIX, and ORTHOREG procedures beginning in SAS 9.3 TS1M2. Also in that release, the REF= option was made available for use with the GLM parameterization in procedures where it had previously been available only with other parameterizations. In releases prior to SAS 9.3 TS1M2, and in later releases of some procedures such as PROBIT, LIFEREG, and GAM, the REF= option in the CLASS statement is not available. These procedures always use the last level of a CLASS variable (after the levels are sorted) as the reference level. In that case, you can use either of the last two approaches below to make your desired reference level the last level.

Some procedures offer several ways to parameterize (code) the design variables that the CLASS statement creates to represent a categorical predictor in the model. All parameterizations produce equivalent models but impose different interpretations on the model parameters. See the section "Parameterization of Model Effects" in the Shared Concepts and Topics chapter of the SAS/STAT User's Guide. This note lists the procedures offering multiple parameterizations and shows how a parameterization can be selected.
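
As a minimal sketch of selecting a parameterization, the following step assumes PROC LOGISTIC, which accepts the PARAM= option after a slash in its CLASS statement, and uses the Heights data set that is created later in this note. PARAM=REF requests reference cell coding. Other procedures might offer different choices or none, so check the CLASS statement documentation for the procedure you use.

      proc logistic data=Heights;
         /* PARAM=REF selects reference cell coding; REF="F" sets the reference level */
         class Gender(ref="F") / param=ref;
         model Response(event="1") = Gender Height;
         run;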

To set the reference level of a response variable that is categorical (such as in a logistic regression model), see this note.
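
For illustration only, and assuming PROC LOGISTIC with the binary Response variable from the Heights data set created later in this note, a response reference level can be given as a response variable option in the MODEL statement. Other procedures might use a different syntax.

      proc logistic data=Heights;
         /* REF="0" makes Response=0 the reference category, so Response=1 is modeled */
         model Response(ref="0") = Height;
         run;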

Use a procedure offering the REF= option in the CLASS statement

Suppose Gender, with levels "M" and "F", is a predictor in your model and you want "F" to be the reference level. In a procedure such as GLIMMIX (see the Note at the end of this document), which provides the REF= option in the CLASS statement, you can explicitly set the reference level for this and any other CLASS predictor. In the CLASS statement below, the REF="F" option specifies that Gender="F" is to be the reference level. If you have additional variables in the CLASS statement, you can specify the REF= option in parentheses following each variable to set its reference level. For instance, suppose you have an additional numeric variable, Trt, with values 0 and 1, for which you want Trt=0 to be the reference level. Note that quotes are used around REF= values whether the value is numeric or character, formatted or unformatted.

      proc glimmix data=Heights;
         class Gender(ref="F") Trt(ref="0");
         model Response(event="0") = Gender Height Trt / dist=binary link=probit solution ddfm=none;
         run;
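
The REF= option also accepts the keywords FIRST and LAST in place of a quoted level. The following sketch assumes PROC GLM in a release that supports REF= (SAS 9.3 TS1M2 or later) and models Height by Gender purely for illustration, using the Heights data set created later in this note.

      proc glm data=Heights;
         /* REF=FIRST uses the first ordered level ("F") as the reference level */
         class Gender(ref=first);
         model Height = Gender / solution;
         run;
         quit;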

If formats are used, specify the formatted value of the reference level in the REF= option. For example:

      proc format;
         value $genfmt
               'F' = 'Female'
               'M' = 'Male';
         run;
   
      proc glimmix data=Heights;
         format Gender $genfmt.;
         class Gender (ref="Female");
         model Response(event="0") = Gender Height / dist=binary link=probit solution ddfm=none;
         run;

If the error message "Invalid reference value" appears in the log, see this note for common causes. The most common cause is specifying the unformatted value when a format is associated with the variable.
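
One way to check which values the REF= option should match is to list the formatted levels before fitting the model. The following sketch uses PROC FREQ with the $genfmt format defined above and the Heights data set created later in this note.

      proc freq data=Heights;
         format Gender $genfmt.;
         /* The table lists the formatted values of Gender: Female and Male */
         tables Gender;
         run;

The REF= value should be one of the values displayed in this table.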

Sort and specify the ORDER=DATA option

Consider a CLASS variable, X, with values 0 and 1. By default, these values are arranged in ascending alphanumeric order which results in 1 being the last level, and therefore the reference level. However, if the data are arranged so the value 1 appears before the value 0 as you read down the data set, and if you specify the ORDER=DATA option in the PROC statement, then the levels of X will stay in the order encountered in the data set. Then 0 is the last level found and it becomes the reference level. One way to get the values of X in this order is to sort your data set by X using the DESCENDING option.

For example, in the following data set, the Gender variable has levels F and M. Since F occurs before M in ascending alphanumeric sorting, M will be the reference level by default.

      data Heights; 
         input Response Gender$ Height @@; 
         datalines; 
      1 F 67   0 F 66   1 F 64   1 M 71   1 M 72   0 F 63 
      1 F 63   0 F 67   1 M 69   0 M 68   1 M 70   1 F 63 
      0 M 64   1 F 67   1 F 66   0 M 67   0 M 67   0 M 69 
      ;
      
      proc probit data=Heights; 
         class Gender; 
         model Response = Gender Height; 
         run;

The "Class Level Information" table shows that M is the last level of Gender.

Class Level Information
Name	Levels	Values
Gender	2	F M
Response	2	0 1

In the "Analysis of Maximum Likelihood Parameter Estimates" table, M is the reference level since it is the last level shown and has its parameter estimate and degrees of freedom set to zero.

Analysis of Maximum Likelihood Parameter Estimates
Parameter		DF	Estimate	Standard Error	95% Confidence Limits		Chi-Square	Pr > ChiSq
Intercept		1	20.1903	12.1830	-3.6879	44.0685	2.75	0.0975
Gender	F	1	-1.6454	0.9390	-3.4859	0.1950	3.07	0.0797
Gender	M	0	0.0000	.	.	.	.	.
Height		1	-0.2917	0.1768	-0.6383	0.0548	2.72	0.0990

However, if you sort the data by Response and then by descending Gender, M precedes F in the sorted data set (New), while the Response values still appear in the order 0, 1. By specifying the ORDER=DATA option, this ordering is preserved and F becomes the reference level.

      proc sort data=Heights out=New;
         by Response descending Gender;
         run;
      proc probit data=New order=data; 
         class Gender; 
         model Response = Gender Height; 
         run;

Now, F is the last level in the "Class Level Information" table, and the "Analysis of Maximum Likelihood Parameter Estimates" table shows that F is the reference level.

Class Level Information
Name	Levels	Values
Gender	2	M F
Response	2	0 1

Analysis of Maximum Likelihood Parameter Estimates
Parameter		DF	Estimate	Standard Error	95% Confidence Limits		Chi-Square	Pr > ChiSq
Intercept		1	18.5448	11.4952	-3.9853	41.0750	2.60	0.1067
Gender	M	1	1.6454	0.9390	-0.1950	3.4859	3.07	0.0797
Gender	F	0	0.0000	.	.	.	.	.
Height		1	-0.2917	0.1768	-0.6383	0.0548	2.72	0.0990
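
Before fitting the model, you can preview the order in which the levels are encountered. The following sketch uses PROC FREQ with its own ORDER=DATA option on the sorted data set New. M should be listed before F.

      proc freq data=New order=data;
         /* Levels are listed in the order in which they first appear in New */
         tables Gender;
         run;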

Create a format and specify the ORDER=FORMATTED option

An alternative to reordering or sorting the data is to assign formatted values to the levels such that the last formatted value in ascending alphanumeric order is the desired reference level. Formatted values are used when you specify the ORDER=FORMATTED option in the PROC statement, though this is usually the default when a format exists for the variable.

In the following example, the Group variable indicates use of one of two types of pain reliever. Suppose you want Group=1 to be the reference level. By default, Group=2 would be the reference level since it is the last sorted value.

      data Headache;
         input Minutes Group Censor @@;
         datalines;
      11  1  0   12  1  0   19  1  0   19  1  0
      19  1  0   19  1  0   21  1  0   20  1  0
      21  1  0   21  1  0   20  1  0   21  1  0
      20  1  0   21  1  0   25  1  0   27  1  0
      30  1  0   21  1  1   24  1  1   14  2  0
      16  2  0   16  2  0   21  2  0   21  2  0
      23  2  0   23  2  0   23  2  0   23  2  0
      25  2  1   23  2  0   24  2  0   24  2  0
      26  2  1   32  2  1   30  2  1   30  2  0
      32  2  1   20  2  1
      ;

When the following format is assigned to Group, Group=1 has the last formatted value ('Old') after sorting, so it becomes the reference level when the ORDER=FORMATTED option is in effect.

      proc format;
         value grpfmt
               1 = 'Old'
               2 = 'Improved';
         run;
   
      proc lifereg data=Headache order=formatted;
         format Group grpfmt.;
         class Group;
         model Minutes*Censor(1)=Group;
         run;

Class Level Information
Name	Levels	Values
Group	2	Improved Old

Analysis of Maximum Likelihood Parameter Estimates
Parameter		DF	Estimate	Standard Error	95% Confidence Limits		Chi-Square	Pr > ChiSq
Intercept		1	3.1158	0.0520	3.0139	3.2178	3588.92	<.0001
Group	Improved	1	0.1933	0.0786	0.0393	0.3473	6.05	0.0139
Group	Old	0	0.0000	.	.	.	.	.
Scale		1	0.2122	0.0304	0.1603	0.2809
Weibull Shape		1	4.7128	0.6742	3.5604	6.2381
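
For comparison, the same reference level can be requested with the sort approach from the previous section instead of a format. This is a sketch only, and the output data set name HeadacheSorted is a hypothetical name chosen for illustration.

      proc sort data=Headache out=HeadacheSorted;
         by descending Group;
         run;

      proc lifereg data=HeadacheSorted order=data;
         /* Group=2 is encountered first, so Group=1 is the last (reference) level */
         class Group;
         model Minutes*Censor(1)=Group;
         run;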

__________

Note: The REF= option for setting reference levels was added to the GLM, MIXED, GLIMMIX, and ORTHOREG procedures beginning in SAS 9.3 TS1M2. Also in that release, the REF= option was made available for use with the GLM parameterization in procedures where it had previously been available only with other parameterizations.

Type:	Usage Note
Priority:
Topic:	Analytics ==> Analysis of Variance Analytics ==> Categorical Data Analysis Analytics ==> Mixed Models Analytics SAS Reference ==> Procedures ==> ANOVA SAS Reference ==> Procedures ==> CATMOD SAS Reference ==> Procedures ==> GAM SAS Reference ==> Procedures ==> GENMOD SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Procedures ==> GLM SAS Reference ==> Procedures ==> GLMMOD SAS Reference ==> Procedures ==> GLMPOWER SAS Reference ==> Procedures ==> GLMSELECT SAS Reference ==> Procedures ==> HPMIXED SAS Reference ==> Procedures ==> LIFEREG SAS Reference ==> Procedures ==> LOGISTIC SAS Reference ==> Procedures ==> MIXED SAS Reference ==> Procedures ==> PHREG SAS Reference ==> Procedures ==> PLS SAS Reference ==> Procedures ==> PROBIT SAS Reference ==> Procedures ==> QUANTREG SAS Reference ==> Procedures ==> ROBUSTREG SAS Reference ==> Procedures ==> NESTED SAS Reference ==> Procedures ==> SURVEYLOGISTIC SAS Reference ==> Procedures ==> SURVEYREG SAS Reference ==> Procedures ==> TRANSREG SAS Reference ==> Procedures ==> FMM SAS Reference ==> Procedures ==> ORTHOREG SAS Reference ==> Procedures ==> QUANTLIFE SAS Reference ==> Procedures ==> QUANTSELECT SAS Reference ==> Procedures ==> GAMPL SAS Reference ==> Procedures ==> HPFMM SAS Reference ==> Procedures ==> HPGENSELECT SAS Reference ==> Procedures ==> HPLOGISTIC SAS Reference ==> Procedures ==> HPPLS SAS Reference ==> Procedures ==> HPQUANTSELECT SAS Reference ==> Procedures ==> HPREG SAS Reference ==> Procedures ==> ICPHREG SAS Reference ==> Procedures ==> SURVEYPHREG

Date Modified:	2019-07-12 08:46:28
Date Created:	2009-09-07 11:11:11