# The TRANSREG Procedure

### Output Data Set

The OUT= output data set can contain a great deal of information; however, in most cases, the output data set contains a small portion of the entire range of available information.

#### Output Data Set Examples

This section provides three brief examples, illustrating some typical OUT= output data sets. See the section Output Data Set Contents for a complete list of the contents of the OUT= data set.

The first example shows the output data set from a two-way ANOVA model. The following statements produce Figure 104.65:

```
title 'ANOVA Output Data Set Example';

data ReferenceCell;
   input y x1 $ x2 $;
   datalines;
11  a  a
12  a  a
10  a  a
 4  a  b
 5  a  b
 3  a  b
 5  b  a
 6  b  a
 4  b  a
 2  b  b
 3  b  b
 1  b  b
;
```
```
* Fit Reference Cell Two-Way ANOVA Model;
proc transreg data=ReferenceCell;
   model identity(y) = class(x1 | x2);
   output coefficients replace predicted residuals;
run;

* Print the Results;
proc print;
run;

proc contents position;
   ods select position;
run;
```

Figure 104.65: ANOVA Example Output Data Set Contents

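ANOVA Output Data Set Example

| Obs | `_TYPE_` | `_NAME_` | y | Py | Ry | Intercept | x1a | x2a | x1ax2a | x1 | x2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SCORE | ROW1 | 11 | 11 | 0 | 1 | 1.0 | 1 | 1 | a | a |
| 2 | SCORE | ROW2 | 12 | 11 | 1 | 1 | 1.0 | 1 | 1 | a | a |
| 3 | SCORE | ROW3 | 10 | 11 | -1 | 1 | 1.0 | 1 | 1 | a | a |
| 4 | SCORE | ROW4 | 4 | 4 | 0 | 1 | 1.0 | 0 | 0 | a | b |
| 5 | SCORE | ROW5 | 5 | 4 | 1 | 1 | 1.0 | 0 | 0 | a | b |
| 6 | SCORE | ROW6 | 3 | 4 | -1 | 1 | 1.0 | 0 | 0 | a | b |
| 7 | SCORE | ROW7 | 5 | 5 | 0 | 1 | 0.0 | 1 | 0 | b | a |
| 8 | SCORE | ROW8 | 6 | 5 | 1 | 1 | 0.0 | 1 | 0 | b | a |
| 9 | SCORE | ROW9 | 4 | 5 | -1 | 1 | 0.0 | 1 | 0 | b | a |
| 10 | SCORE | ROW10 | 2 | 2 | 0 | 1 | 0.0 | 0 | 0 | b | b |
| 11 | SCORE | ROW11 | 3 | 2 | 1 | 1 | 0.0 | 0 | 0 | b | b |
| 12 | SCORE | ROW12 | 1 | 2 | -1 | 1 | 0.0 | 0 | 0 | b | b |
| 13 | M COEFFI | y | . | . | . | 2 | 2.0 | 3 | 4 | | |
| 14 | MEAN | y | . | . | . | . | 7.5 | 8 | 11 | | |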

 ANOVA Output Data Set Example

The CONTENTS Procedure

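Variables in Creation Order

| # | Variable | Type | Len | Label |
|---|---|---|---|---|
| 1 | `_TYPE_` | Char | 8 | |
| 2 | `_NAME_` | Char | 32 | |
| 3 | y | Num | 8 | |
| 4 | Py | Num | 8 | y Predicted Values |
| 5 | Ry | Num | 8 | y Residuals |
| 6 | Intercept | Num | 8 | Intercept |
| 7 | x1a | Num | 8 | x1 a |
| 8 | x2a | Num | 8 | x2 a |
| 9 | x1ax2a | Num | 8 | x1 a * x2 a |
| 10 | x1 | Char | 32 | |
| 11 | x2 | Char | 32 | |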

The `_TYPE_` variable indicates observation type: score, multiple regression coefficient (parameter estimates), and marginal means. The `_NAME_` variable contains the default observation labels, "ROW1", "ROW2", and so on, and contains the dependent variable name (`y`) for the remaining observations. If you specify an ID statement, `_NAME_` contains the values of the first ID variable for score observations. The `y` variable is the dependent variable, `Py` contains the predicted values, `Ry` contains the residuals, and the variables `Intercept` through `x1ax2a` contain the design matrix. The `x1` and `x2` variables are the original CLASS variables.
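Because the observation type is recorded in `_TYPE_`, the score rows and the coefficient row can be separated with an ordinary DATA step. The following is a minimal sketch, not part of the original example: it reruns the analysis with an explicit OUT= name so that the result can be referenced later, and the data set names `Results`, `Scores`, and `Coefs` are hypothetical.

```sas
* Rerun the ANOVA example, naming the output data set (Results is a hypothetical name);
proc transreg data=ReferenceCell;
   model identity(y) = class(x1 | x2);
   output out=Results coefficients replace predicted residuals;
run;

* Split the output by observation type;
data Scores Coefs;
   set Results;
   if _TYPE_ = 'SCORE' then output Scores;         /* one row per input observation */
   else if _TYPE_ = 'M COEFFI' then output Coefs;  /* parameter estimates */
run;
```

Naming the data set with OUT= avoids relying on the default `DATAn` name when a later step needs to read the results.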

The next example shows the contents of the output data set from fitting a curve through a scatter plot. The following statements produce Figure 104.66:

```
title 'Output Data Set for Curve Fitting Example';

data a;
   do x = 1 to 100;
      y = log(x) + sin(x / 10) + normal(7);
      output;
   end;
run;

proc transreg;
   model identity(y) = spline(x / nknots=9);
   output predicted out=b;
run;

proc contents position;
   ods select position;
run;
```

Figure 104.66: Predicted Values Example Output Data Set Contents

 Output Data Set for Curve Fitting Example

The CONTENTS Procedure

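Variables in Creation Order

| # | Variable | Type | Len | Label |
|---|---|---|---|---|
| 1 | `_TYPE_` | Char | 8 | |
| 2 | `_NAME_` | Char | 32 | |
| 3 | y | Num | 8 | |
| 4 | Ty | Num | 8 | y Transformation |
| 5 | Py | Num | 8 | y Predicted Values |
| 6 | Intercept | Num | 8 | Intercept |
| 7 | x | Num | 8 | |
| 8 | TIntercept | Num | 8 | Intercept Transformation |
| 9 | Tx | Num | 8 | x Transformation |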

The OUT= data set contains `_TYPE_` and `_NAME_` variables. Since no coefficients or coordinates are requested, all observations have `_TYPE_='SCORE'`. The `y` variable is the original dependent variable, `Ty` is the transformed dependent variable, `Py` contains the predicted values, `x` is the original independent variable, and `Tx` is the transformed independent variable. The data set also contains an `Intercept` variable and a transformed intercept variable, `TIntercept`. (In this case, the transformed intercept is the same as the intercept. However, if you specify the TSTANDARD= and ADDITIVE options, these are not always the same.)
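One common use of the predicted values is to overlay the fitted curve on the original scatter plot. The sketch below assumes the data set `b` created by the preceding statements; PROC SGPLOT is a separate procedure and is not part of the original example.

```sas
* Overlay the fitted spline (Py) on the raw data, using data set b from the example;
proc sgplot data=b;
   scatter x=x y=y;      /* original observations */
   series  x=x y=Py;     /* predicted values from the OUT= data set */
run;
```

The SERIES statement connects points in data order, which works here because `b` retains the ascending order of `x` from the input data set.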

The following example shows the results from specifying METHOD=MORALS when there is more than one dependent variable:

```
title 'METHOD=MORALS Output Data Set Example';

data x;
   input y1 y2 x1 $ x2 $;
   datalines;
11 1 a a
10 4 b a
 5 2 a b
 5 9 b b
 4 3 c c
 3 6 b a
 1 8 a b
;
```
```
* Fit a Separate Model for Each Dependent Variable;
proc transreg data=x noprint solve;
   model spline(y1 y2) = opscore(x1 x2 / name=(n1 n2));
   output coefficients predicted residuals;
   id x1 x2;
run;

* Print the Results;
proc print;
run;

proc contents position;
   ods select position;
run;
```

These statements produce Figure 104.67.

Figure 104.67: METHOD=MORALS Rolled Output Data Set

 METHOD=MORALS Output Data Set Example

| Obs | `_DEPVAR_` | `_TYPE_` | `_NAME_` | `_DEPEND_` | `T_DEPEND_` | `P_DEPEND_` | `R_DEPEND_` | Intercept | n1 | n2 | TIntercept | Tn1 | Tn2 | x1 | x2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Spline(y1) | SCORE | a | 11 | 13.1600 | 11.1554 | 2.00464 | 1 | 0 | 0 | 1.0000 | 0.06711 | -0.09384 | a | a |
| 2 | Spline(y1) | SCORE | b | 10 | 6.1931 | 6.8835 | -0.69041 | 1 | 1 | 0 | 1.0000 | 1.51978 | -0.09384 | b | a |
| 3 | Spline(y1) | SCORE | a | 5 | 2.4467 | 4.7140 | -2.26724 | 1 | 0 | 1 | 1.0000 | 0.06711 | 1.32038 | a | b |
| 4 | Spline(y1) | SCORE | b | 5 | 2.4467 | 0.4421 | 2.00464 | 1 | 1 | 1 | 1.0000 | 1.51978 | 1.32038 | b | b |
| 5 | Spline(y1) | SCORE | c | 4 | 4.2076 | 4.2076 | 0.00000 | 1 | 2 | 2 | 1.0000 | 0.23932 | 1.32038 | c | c |
| 6 | Spline(y1) | SCORE | b | 3 | 5.5693 | 6.8835 | -1.31422 | 1 | 1 | 0 | 1.0000 | 1.51978 | -0.09384 | b | a |
| 7 | Spline(y1) | SCORE | a | 1 | 4.9766 | 4.7140 | 0.26261 | 1 | 0 | 1 | 1.0000 | 0.06711 | 1.32038 | a | b |
| 8 | Spline(y1) | M COEFFI | y1 | . | . | . | . | . | . | . | 10.9253 | -2.94071 | -4.55475 | y1 | y1 |
| 9 | Spline(y2) | SCORE | a | 1 | -0.5303 | -0.5199 | -0.01043 | 1 | 0 | 0 | 1.0000 | 0.03739 | -0.09384 | a | a |
| 10 | Spline(y2) | SCORE | b | 4 | 5.5487 | 4.5689 | 0.97988 | 1 | 1 | 0 | 1.0000 | 1.51395 | -0.09384 | b | a |
| 11 | Spline(y2) | SCORE | a | 2 | 3.8940 | 4.5575 | -0.66347 | 1 | 0 | 1 | 1.0000 | 0.03739 | 1.32038 | a | b |
| 12 | Spline(y2) | SCORE | b | 9 | 9.6358 | 9.6462 | -0.01043 | 1 | 1 | 1 | 1.0000 | 1.51395 | 1.32038 | b | b |
| 13 | Spline(y2) | SCORE | c | 3 | 5.6210 | 5.6210 | 0.00000 | 1 | 2 | 2 | 1.0000 | 0.34598 | 1.32038 | c | c |
| 14 | Spline(y2) | SCORE | b | 6 | 3.5994 | 4.5689 | -0.96945 | 1 | 1 | 0 | 1.0000 | 1.51395 | -0.09384 | b | a |
| 15 | Spline(y2) | SCORE | a | 8 | 5.2314 | 4.5575 | 0.67390 | 1 | 0 | 1 | 1.0000 | 0.03739 | 1.32038 | a | b |
| 16 | Spline(y2) | M COEFFI | y2 | . | . | . | . | . | . | . | -0.3119 | 3.44636 | 3.59024 | y2 | y2 |

 METHOD=MORALS Output Data Set Example

The CONTENTS Procedure

Variables in Creation Order

| # | Variable | Type | Len | Label |
|---|---|---|---|---|
| 1 | `_DEPVAR_` | Char | 42 | Dependent Variable Transformation(Name) |
| 2 | `_TYPE_` | Char | 8 | |
| 3 | `_NAME_` | Char | 32 | |
| 4 | `_DEPEND_` | Num | 8 | Dependent Variable |
| 5 | `T_DEPEND_` | Num | 8 | Dependent Variable Transformation |
| 6 | `P_DEPEND_` | Num | 8 | Dependent Variable Predicted Values |
| 7 | `R_DEPEND_` | Num | 8 | Dependent Variable Residuals |
| 8 | Intercept | Num | 8 | Intercept |
| 9 | n1 | Num | 8 | |
| 10 | n2 | Num | 8 | |
| 11 | TIntercept | Num | 8 | Intercept Transformation |
| 12 | Tn1 | Num | 8 | n1 Transformation |
| 13 | Tn2 | Num | 8 | n2 Transformation |
| 14 | x1 | Char | 32 | |
| 15 | x2 | Char | 32 | |

If you specify METHOD=MORALS with multiple dependent variables, PROC TRANSREG performs separate univariate analyses and stacks the results in the OUT= data set. For this example, the results of the first analysis are in the partition designated by `_DEPVAR_='Spline(y1)'` and the results of the second analysis are in the partition designated by `_DEPVAR_='Spline(y2)'`, which are the transformation and dependent variable names. Each partition has `_TYPE_='SCORE'` observations for the variables and a `_TYPE_='M COEFFI'` observation for the coefficients. In this example, an ID variable is specified, so the `_NAME_` variable contains the formatted values of the first ID variable. Since both dependent variables have to go into the same column, the dependent variable is given a new name, `_DEPEND_`. The dependent variable transformation is named `T_DEPEND_`, the predicted values variable is named `P_DEPEND_`, and the residuals variable is named `R_DEPEND_`.

The independent variables are character OPSCORE variables. By default, PROC TRANSREG replaces character OPSCORE variables with category numbers and discards the original character variables. To avoid this, the input variables are renamed from `x1` and `x2` to `n1` and `n2` and the original `x1` and `x2` are added to the data set as ID variables. The `n1` and `n2` variables contain the initial values for the OPSCORE transformations, and the `Tn1` and `Tn2` variables contain optimal scores. The data set also contains an `Intercept` and transformed intercept `TIntercept` variable. The regression coefficients are in the transformation columns, which also contain the variables to which they apply.

#### Output Data Set Contents

Table 104.6 summarizes the various matrices that can result from PROC TRANSREG processing and that appear in the OUT= data set. The exact contents of an OUT= data set depend on many options.

Table 104.6: PROC TRANSREG OUT= Data Set Contents

| `_TYPE_` | Contents | Options, Default Prefix |
|---|---|---|
| SCORE | dependent variables | DREPLACE not specified |
| SCORE | independent variables | IREPLACE not specified |
| SCORE | transformed dependent variables | default, TDPREFIX=T |
| SCORE | transformed independent variables | default, TIPREFIX=T |
| SCORE | predicted values | PREDICTED, PPREFIX=P |
| SCORE | residuals | RESIDUALS, RDPREFIX=R |
| SCORE | leverage | LEVERAGE, LEVERAGE=Leverage |
| SCORE | lower individual confidence limits | |
| SCORE | upper individual confidence limits | |
| SCORE | lower mean confidence limits | |
| SCORE | upper mean confidence limits | |
| SCORE | dependent canonical variables | |
| SCORE | independent canonical variables | |
| SCORE | redundancy variables | |
| SCORE | ID, CLASS, BSPLINE variables | |
| SCORE | independent variables approximations | |
| M COEFFI | multiple regression coefficients | |
| C COEFFI | canonical coefficients | |
| MEAN | marginal means | |
| M REDUND | multiple redundancy coefficients | |
| R REDUND | multiple redundancy coefficients | |
| M POINT | point coordinates | |
| M EPOINT | elliptical point coordinates | |
| M QPOINT | quadratic point coordinates | |
| C POINT | canonical point coordinates | |
| C EPOINT | canonical elliptical point coordinates | |
| C QPOINT | canonical quadratic point coordinates | |

The independent and dependent variables are created from the original input data. Several potential differences exist between these variables and the actual input data. An intercept variable can be added, new variables can be added for POINT, EPOINT, QPOINT, CLASS, IDENTITY, PSPLINE, and BSPLINE variables, and category numbers are substituted for character OPSCORE variables. These matrices are not always what is input to the first iteration. After the expanded data set is stored for inclusion in the output data set, several things happen to the data before they are input to the first iteration: column means are substituted for missing values; zero-degree SPLINE and MSPLINE variables are transformed so that the iterative algorithms get step-function data as input, which conform to the zero-degree transformation family restrictions; and the nonoptimal transformations are performed.

##### Details for the UNIVARIATE Method

When you specify METHOD=UNIVARIATE (in the MODEL or PROC TRANSREG statement), PROC TRANSREG can perform several analyses, one for each dependent variable. Each dependent variable can be transformed, but the independent variables are not transformed. The OUT= data set optionally contains all of the `_TYPE_='SCORE'` observations, optionally followed by coefficients or coordinates.

##### Details for the MORALS Method

When you specify METHOD=MORALS (in the MODEL or PROC TRANSREG statement), successive analyses are performed, one for each dependent variable. Each analysis transforms one dependent variable and the entire set of the independent variables. All information for the first dependent variable (scores, then optionally coefficients) appears first, followed by all information for the second dependent variable in the same order. This arrangement is repeated for all dependent variables.

##### Details for the CANALS and REDUNDANCY Methods

For METHOD=CANALS and METHOD=REDUNDANCY (specified in either the MODEL or PROC TRANSREG statement), one analysis is performed that simultaneously transforms all dependent and independent variables. The OUT= data set optionally contains all of the `_TYPE_='SCORE'` observations, optionally followed by coefficients or coordinates.

#### Variable Names

As shown in the preceding examples, some variables in the output data set directly correspond to input variables, and some are created. All original optimal and nonoptimal transformation variable names are unchanged.

The names of the POINT, QPOINT, and EPOINT expansion variables are also left unchanged, but new variables are created. When independent POINT variables are present, the sum-of-squares variable `_ISSQ_` is added to the output data set. For each EPOINT and QPOINT variable, a new squared variable is created by appending "_2". For example, `Dim1` and `Dim2` are expanded into `Dim1`, `Dim2`, `Dim1_2`, and `Dim2_2`. In addition, for each pair of QPOINT variables, a new crossproduct variable is created by combining the two names, for example, `Dim1Dim2`.

The names of the CLASS variables are constructed from original variable names and levels. Lengths are controlled by the CPREFIX= a-option. For example, when `x1` and `x2` both have values of 'a' and 'b', `CLASS(x1 | x2 / ZERO=NONE)` creates the `x1` main-effect variables `x1a` and `x1b`, the `x2` main-effect variables `x2a` and `x2b`, and the interaction variables `x1ax2a`, `x1ax2b`, `x1bx2a`, and `x1bx2b`.
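The generated names can be inspected by coding the CLASS variables without fitting a model. The following sketch assumes the `ReferenceCell` data set from the first example and uses the DESIGN option in the PROC TRANSREG statement, which requests coding only; the data set name `Coded` is hypothetical.

```sas
* Code the CLASS variables only (no model is fit) and list the resulting names;
proc transreg data=ReferenceCell design;
   model class(x1 | x2 / zero=none);
   output out=Coded;        /* Coded is a hypothetical data set name */
run;

proc contents data=Coded position;
   ods select position;     /* show variables in creation order */
run;
```

Listing the variables in creation order, as the earlier examples do, makes it easy to compare the generated names with the naming rule described above.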

PROC TRANSREG then uses these variable names when creating the transformed, predicted, and residual variable names by affixing the relevant prefix and dropping extra characters if necessary.

##### METHOD=MORALS Variable Names

When you specify METHOD=MORALS and only one dependent variable is present, the output data set is structured exactly as if METHOD=REDUNDANCY (see the section Details for the CANALS and REDUNDANCY Methods). When more than one dependent variable is present, the dependent variables are output in the variable `_DEPEND_`, transformed dependent variables are output in the variable `T_DEPEND_`, predicted values are output in the variable `P_DEPEND_`, and residuals are output in the variable `R_DEPEND_`. You can partition the data set into BY groups, one per dependent variable, by referring to the character variable `_DEPVAR_`, which contains the original dependent variable names and transformations.
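As a sketch of BY-group processing on `_DEPVAR_`, the following statements rerun the earlier METHOD=MORALS example with an explicit OUT= name and print the coefficient row for each dependent variable. The data set `x` comes from that example; the name `Stacked` is hypothetical.

```sas
* Rerun the MORALS example, naming the stacked output data set;
proc transreg data=x noprint solve;
   model spline(y1 y2) = opscore(x1 x2 / name=(n1 n2));
   output out=Stacked coefficients predicted residuals;
   id x1 x2;
run;

* Print the coefficients, one BY group per dependent variable;
proc print data=Stacked;
   where _TYPE_ = 'M COEFFI';
   by _DEPVAR_ notsorted;   /* partitions are stacked in analysis order */
   var _NAME_ TIntercept Tn1 Tn2;
run;
```

The NOTSORTED option is a defensive choice: the partitions appear in the order the dependent variables were analyzed, which is not necessarily alphabetical order of `_DEPVAR_`.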

##### Duplicate Variable Names

When the same name is generated from multiple variables in the OUT= data set, new names are created by appending '2', '3', '4', and so on, until a unique name is created. For 32-character names, the last character is replaced with a numeric suffix until a unique name is created. For example, if there are two output variables that otherwise would be named `x`, then `x` and `x2` are created instead. If there are two output variables that otherwise would be named `ThisIsAThirtyTwoCharacterVarName`, then `ThisIsAThirtyTwoCharacterVarName` and `ThisIsAThirtyTwoCharacterVarNam2` are created instead.