
The TRANSREG Procedure

Box-Cox Transformations

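Box-Cox (1964) transformations are used to find potentially nonlinear transformations of a dependent variable. The Box-Cox transformation has the form

$$
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[1ex]
\log(y) & \lambda = 0
\end{cases}
$$

This family of transformations of the positive dependent variable $y$ is controlled by the parameter $\lambda$. Transformations linearly related to square root, inverse, quadratic, cubic, and so on are all special cases. The limit as $\lambda$ approaches 0 is the log transformation. More generally, Box-Cox transformations of the following form can be fit:

$$
\begin{cases}
\dfrac{(y + c)^{\lambda} - 1}{\lambda g} & \lambda \neq 0 \\[1ex]
\dfrac{\log(y + c)}{g} & \lambda = 0
\end{cases}
$$

By default, $c = 0$. The parameter $c$ can be used to rescale $y$ so that it is strictly positive. By default, $g = 1$. Alternatively, $g$ can be $\dot{y}^{\lambda - 1}$, where $\dot{y}$ is the geometric mean of $y$.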
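The general form can be checked numerically outside SAS. The following is a minimal Python sketch, not PROC TRANSREG's implementation: the function and argument names (`box_cox`, `lam`, `c`, `use_geometric_mean`) are made up for illustration. The geometric mean is taken over the unshifted values, which follows the wording above and therefore requires them to be positive.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def box_cox(values, lam, c=0.0, use_geometric_mean=False):
    """Apply the general Box-Cox form to a list of numbers.

    lam is the power parameter, c is the shift added to every value, and
    the divisor g is 1 unless use_geometric_mean is True, in which case
    g = (geometric mean of the values) ** (lam - 1).
    """
    shifted = [v + c for v in values]
    if min(shifted) <= 0:
        raise ValueError("every shifted value y + c must be strictly positive")
    g = geometric_mean(values) ** (lam - 1) if use_geometric_mean else 1.0
    if lam == 0:
        # Limiting case of the power form as lam approaches 0.
        return [math.log(s) / g for s in shifted]
    return [(s ** lam - 1) / (lam * g) for s in shifted]


if __name__ == "__main__":
    y = [1.0, 2.0, 4.0]
    print(box_cox(y, 0.5))          # linearly related to the square root
    print(box_cox(y, 0))            # log transformation
    print(box_cox(y, 1e-8))         # power form close to the log limit
    print(box_cox(y, 2, c=1.0))     # shift by 1 before transforming
```

Comparing the output for a very small power parameter with the output for 0 shows the log transformation emerging as the limit.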

The BOXCOX transformation in PROC TRANSREG can be used to perform a Box-Cox transformation of the dependent variable. You can specify a list of power parameters by using the LAMBDA= t-option. By default, LAMBDA=–3 TO 3 BY 0.25. The procedure chooses the optimal power parameter by using a maximum likelihood criterion (Draper and Smith 1981, pp. 225–226). You can specify the PARAMETER= transformation option when you want to shift the values of $y$, usually to avoid negatives. To divide by $\dot{y}^{\lambda - 1}$, specify the GEOMETRICMEAN t-option.

Here are three examples of using the LAMBDA= t-option:

   model BoxCox(y / lambda=0) = identity(x1-x5);
   model BoxCox(y / lambda=-2 to 2 by 0.1) = identity(x1-x5);
   model BoxCox(y) = identity(x1-x5);

Here is the first example:

   model BoxCox(y / lambda=0) = identity(x1-x5);

LAMBDA=0 specifies a Box-Cox transformation with a power parameter of 0. Since a single value of 0 was specified for LAMBDA=, there is no difference between the following models:

   model BoxCox(y / lambda=0) = identity(x1-x5);
   model log(y) = identity(x1-x5);

Here is the second example:

   model BoxCox(y / lambda=-2 to 2 by 0.1) = identity(x1-x5);

LAMBDA= specifies a list of power parameters. PROC TRANSREG tries each power parameter in the list and picks the best transformation. A maximum likelihood approach (Draper and Smith 1981, pp. 225–226) is used. With Box-Cox transformations, PROC TRANSREG finds the transformation before the usual iterations begin. Note that this is quite different from PROC TRANSREG’s usual approach of iteratively finding optimal transformations with ordinary and alternating least squares. It is analogous to SMOOTH and PBSPLINE, which also find transformations before the iterations begin based on a criterion other than least squares.
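The grid search described above can be imitated with a short Python sketch. It assumes the textbook Box-Cox profile log likelihood, $-\tfrac{n}{2}\log(\mathrm{SSE}(\lambda)/n) + (\lambda - 1)\sum_i \log y_i$; the exact expression and constants that PROC TRANSREG evaluates are not given here, so treat this as an illustration of the idea rather than a reproduction of the procedure. All function names are hypothetical, and the random numbers differ from the SAS stream, so the log likelihood values will not match SAS output.

```python
import math
import random


def box_cox(y, lam):
    """Basic Box-Cox transformation (no shift, divisor 1)."""
    return [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in y]


def sse_simple_regression(x, t):
    """Residual sum of squares from a least-squares line of t on x."""
    n = len(x)
    mean_x, mean_t = sum(x) / n, sum(t) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxt = sum((a - mean_x) * (b - mean_t) for a, b in zip(x, t))
    stt = sum((b - mean_t) ** 2 for b in t)
    return stt - sxt ** 2 / sxx


def profile_log_likelihood(x, y, lam):
    """Textbook profile log likelihood, up to an additive constant."""
    n = len(y)
    sse = sse_simple_regression(x, box_cox(y, lam))
    return -0.5 * n * math.log(sse / n) + (lam - 1) * sum(math.log(v) for v in y)


def best_lambda(x, y, lambdas):
    """Try every power parameter in the list and keep the most likely one."""
    return max(lambdas, key=lambda lam: profile_log_likelihood(x, y, lam))


# Data with the same structure as y = exp(x + error), error ~ N(0, 1).
rng = random.Random(7)
x = [1 + 0.025 * i for i in range(281)]
y = [math.exp(a + rng.gauss(0, 1)) for a in x]

# Same grid as the default LAMBDA= list: -3 to 3 by 0.25.
grid = [-3 + 0.25 * k for k in range(25)]
chosen = best_lambda(x, y, grid)
print("chosen lambda:", chosen)
```

Because the transformation is picked by scanning a fixed list once, no iteration is involved, which mirrors the point that the Box-Cox step happens before the usual iterations begin.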

Here is the third example:

   model BoxCox(y) = identity(x1-x5);

The default LAMBDA= list of –3 TO 3 BY 0.25 is used.

The procedure prints the optimal power parameter, a confidence interval on the power parameter (based on the ALPHA= t-option), a "convenient" power parameter (selected from the CLL= t-option list), and the log likelihood for each power parameter tried (see Example 90.2).

To illustrate how Box-Cox transformations work, data were generated from the model

$$
y = e^{x + \epsilon}
$$

where $\epsilon \sim N(0, 1)$. The transformed data can be fit with a linear model

$$
\log(y) = x + \epsilon
$$

The following statements produce Figure 90.14 through Figure 90.15:

   title 'Basic Box-Cox Example';
   
   data x;
      do x = 1 to 8 by 0.025;
         y = exp(x + normal(7));
         output;
      end;
   run;
   
   ods graphics on;
   
   title2 'Default Options';
   
   proc transreg data=x test;
      model BoxCox(y) = identity(x);
   run;

Figure 90.14 Basic Box-Cox Example, Default Output

Figure 90.14 shows that PROC TRANSREG correctly selects the log transformation ($\lambda = 0$), with a narrow confidence interval. The plot shows that $F = t^2$ is at its largest in the vicinity of the optimal Box-Cox transformation.

The rest of the output, which contains the ANOVA results, is shown in Figure 90.15.

Figure 90.15 Basic Box-Cox Example, Default Output

Dependent Variable BoxCox(y)

Number of Observations Read   281
Number of Observations Used   281

The TRANSREG Procedure Hypothesis Tests for BoxCox(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF   Sum of Squares   Mean Square   F Value   Liberal p
Model               1         1145.884      1145.884   1053.66   >= <.0001
Error             279          303.421         1.088
Corrected Total   280         1449.305

The above statistics are not adjusted for the fact that the dependent variable was transformed and so are generally liberal.

Root MSE          1.04285   R-Square   0.7906
Dependent Mean    4.49653   Adj R-Sq   0.7899
Coeff Var        23.19225   Lambda     0.0000
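The summary statistics in Figure 90.15 follow from the ANOVA sums of squares by the usual regression formulas. The Python sketch below recomputes them from the printed values as a consistency check. The variable names are illustrative, and small differences in the last digit can arise because the inputs are themselves rounded.

```python
import math

# Values copied from the ANOVA table in Figure 90.15.
ss_model, df_model = 1145.884, 1
ss_error, df_error = 303.421, 279
ss_total, df_total = 1449.305, 280
dependent_mean = 4.49653

ms_model = ss_model / df_model
ms_error = ss_error / df_error

f_value = ms_model / ms_error                      # F = MS(Model) / MS(Error)
root_mse = math.sqrt(ms_error)                     # Root MSE
r_square = ss_model / ss_total                     # R-Square
adj_r_square = 1 - (1 - r_square) * df_total / df_error
coeff_var = 100 * root_mse / dependent_mean        # Coeff Var, in percent

print(f"F Value   {f_value:.2f}")
print(f"Root MSE  {root_mse:.5f}")
print(f"R-Square  {r_square:.4f}")
print(f"Adj R-Sq  {adj_r_square:.4f}")
print(f"Coeff Var {coeff_var:.5f}")
```

These statistics are computed on the transformed scale, which is why the note under the table warns that the tests are liberal: the degrees of freedom do not account for choosing the power parameter.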

This next example uses several options. The LAMBDA= t-option specifies power parameters sparsely from –2 to –0.5 and 0.5 to 2 just to get the general shape of the log-likelihood function in that region. Between –0.5 and 0.5, more power parameters are tried. The CONVENIENT t-option is specified so that if a power parameter like $\lambda = 1$ or $\lambda = 0$ is found in the confidence interval, it is used instead of the optimal power parameter. PARAMETER=2 is specified to add 2 to each $y$ before performing the transformations. ALPHA=0.00001 specifies a wide confidence interval.

These next statements perform the Box-Cox analysis and produce Figure 90.16 and Figure 90.17:

   title2 'Several Options Demonstrated';
   
   proc transreg data=x ss2 details
                 plots=(transformation(dependent) scatter
                       observedbypredicted);
      model BoxCox(y / lambda=-2 -1 -0.5 to 0.5 by 0.05 1 2
                       convenient parameter=2 alpha=0.00001) =
            identity(x);
   run;

Figure 90.16 Basic Box-Cox Example, Several Options Demonstrated

The results in Figure 90.16 and Figure 90.17 show that the optimal power parameter is –0.1, but 0 is in the confidence interval, and hence a log transformation is chosen. The actual Box-Cox transformation, the original scatter plot, and observed by predicted values plot are shown in Figure 90.17.

Figure 90.17 Basic Box-Cox Example, Several Options Demonstrated

Dependent Variable BoxCox(y)

Number of Observations Read   281
Number of Observations Used   281

Model Statement Specification Details

Type   DF   Variable      Description       Value
Dep     1   BoxCox(y)     Lambda Used       0
                          Lambda            -0.1
                          Log Likelihood    -1280.1
                          Conv. Lambda      0
                          Conv. Lambda LL   -1287.7
                          CI Limit          -1289.9
                          Alpha             0.00001
                          Parameter         2
                          Options           Convenient Lambda Used
Ind     1   Identity(x)   DF                1


The TRANSREG Procedure Hypothesis Tests for BoxCox(y)

Univariate ANOVA Table Based on the Usual Degrees of Freedom

Source             DF   Sum of Squares   Mean Square   F Value   Liberal p
Model               1          999.438      999.4381   1064.82   >= <.0001
Error             279          261.868        0.9386
Corrected Total   280         1261.306

The above statistics are not adjusted for the fact that the dependent variable was transformed and so are generally liberal.

Root MSE          0.96881   R-Square   0.7924
Dependent Mean    4.61429   Adj R-Sq   0.7916
Coeff Var        20.99591   Lambda     0.0000

Univariate Regression Table Based on the Usual Degrees of Freedom

Variable      DF   Coefficient   Type II Sum of Squares   Mean Square   F Value   Liberal p
Intercept      1    0.42939328                    8.746         8.746      9.32   >= 0.0025
Identity(x)    1    0.92997620                  999.438       999.438   1064.82   >= <.0001

The above statistics are not adjusted for the fact that the dependent variable was transformed and so are generally liberal.
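The CI Limit reported in Figure 90.17 is consistent with a likelihood-ratio interval: a power parameter is inside the interval when its log likelihood is no more than half a chi-square (1 degree of freedom) critical value below the maximum. That reading is an assumption, since the formula is not stated here, but the Python sketch below reproduces the printed limit to within rounding. The function name is hypothetical, and the 1-df chi-square quantile is obtained as the square of a standard normal quantile.

```python
from statistics import NormalDist


def ci_log_likelihood_limit(max_log_likelihood, alpha):
    """Lowest log likelihood still inside a likelihood-ratio interval.

    Assumes the cutoff is max - chi2(1 - alpha, df=1) / 2. For 1 degree of
    freedom the chi-square quantile equals the squared two-sided normal
    quantile.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return max_log_likelihood - (z * z) / 2


# Values copied from Figure 90.17.
limit = ci_log_likelihood_limit(-1280.1, 0.00001)
convenient_ll = -1287.7

print("CI limit:", round(limit, 1))
print("lambda = 0 inside interval:", convenient_ll >= limit)
```

Because the log likelihood at the convenient power parameter (0) lies above this limit, the CONVENIENT t-option replaces the optimal value of –0.1 with 0.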



The next example shows how to find a Box-Cox transformation without an independent variable; the goal is to normalize the univariate histogram. The example generates 500 random observations from a lognormal distribution. It also creates a constant variable z that is zero for every observation, because PROC TRANSREG requires some independent variable to be specified, even if it is constant.

Two options are specified in the PROC TRANSREG statement. MAXITER=0 is specified because the Box-Cox transformation is performed before any iterations begin, and nothing else in this model requires iterations. The NOZEROCONSTANT a-option (which can be abbreviated NOZ) is specified so that PROC TRANSREG does not print any warnings when it encounters the constant independent variable. The MODEL statement asks for a Box-Cox transformation of y and an IDENTITY transformation (which does nothing) of the constant variable z. Finally, PROC UNIVARIATE is run to show histograms of the original variable y and of its Box-Cox transformation, Ty. The following statements fit the univariate Box-Cox model and produce Figure 90.18:


   title 'Univariate Box-Cox';
   
   data x;
      call streaminit(17);
      z = 0;
      do i = 1 to 500;
         y = rand('lognormal');
         output;
      end;
   run;
   
   proc transreg maxiter=0 nozeroconstant;
      model BoxCox(y) = identity(z);
      output;
   run;
   
   proc univariate noprint;
      histogram y ty;
   run;
   
   ods graphics off;

The PROC TRANSREG results in Figure 90.18 show that zero is chosen for lambda, so a log transformation is chosen. The first histogram shows that the original data are skewed, but a log transformation makes the data appear much more nearly normal.
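The univariate case can also be sketched in Python. With only a constant predictor, the residual sum of squares is simply the sum of squared deviations of the transformed values from their mean. The sketch assumes the textbook profile log likelihood and a standard lognormal sample (log-scale mean 0 and standard deviation 1); the function names are hypothetical, and the random stream differs from SAS, so the sample is not the one plotted in Figure 90.18. Sample skewness stands in for the visual comparison of the two histograms.

```python
import math
import random


def box_cox(y, lam):
    """Basic Box-Cox transformation (no shift, divisor 1)."""
    return [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in y]


def univariate_log_likelihood(y, lam):
    """Textbook profile log likelihood with an intercept-only model."""
    t = box_cox(y, lam)
    n = len(t)
    mean_t = sum(t) / n
    sse = sum((v - mean_t) ** 2 for v in t)
    return -0.5 * n * math.log(sse / n) + (lam - 1) * sum(math.log(v) for v in y)


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(17)
y = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

grid = [-3 + 0.25 * k for k in range(25)]   # default LAMBDA= list
chosen = max(grid, key=lambda lam: univariate_log_likelihood(y, lam))

print("chosen lambda:", chosen)
print("skewness before:", skewness(y))
print("skewness after: ", skewness(box_cox(y, chosen)))
```

A chosen value at or near 0, together with skewness dropping toward 0 after the transformation, corresponds to the pattern described for Figure 90.18.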

Figure 90.18 Box-Cox with No Independent Variable
