Design Coding

The way the independent effects of the model are interpreted to generate a linear model is called coding. The OPTEX procedure provides for different types of coding. For D-optimality, the type of coding affects only the absolute value of the computed efficiency criteria, not the relative values for two different designs. Thus, different codings do not affect the choice of D-optimal design. In this section, the details and ramifications of the different types of coding are discussed.

Coding the points in a design involves selecting linearly independent columns corresponding to each model term, turning particular values of the factors into a row vector $\mb {x}$. The OPTEX procedure requires a nonsingular coding for the design matrix. Because of this, any two coding schemes are related by a nonsingular transformation.

Static Coding

The default coding for the design points is as follows:

  • Unless you specify CODING=NONE (or NOCODE) in the PROC OPTEX statement, continuous variables are centered and scaled so that their maximum and minimum values are 1 and –1, respectively.

  • The k – 1 columns corresponding to the main effect of a classification variable A are computed as follows: For a design point with A at its ith level, for $1 \leq i \leq k-1$, the columns of the design matrix associated with A are all 0 except for the ith column, which is 1. When A is at its kth level, all k – 1 columns associated with A are –1. Thus, if $\alpha _ i$ denotes the expected response at the ith level of A, the k – 1 columns yield estimates of $\alpha _1 - \alpha _ k, \alpha _2 - \alpha _ k, \dots , \alpha _{k-1} - \alpha _ k$.

  • Columns for crossed effects are computed by taking the horizontal direct product of columns corresponding to the constituent effects.

This coding corresponds to modeling without over-parameterization, by using the same method as the CATMOD procedure in SAS/STAT® software. This is different from the method used by the GLM procedure, which uses an over-parameterized model.
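The default coding rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the procedure's implementation; the function names (`code_continuous`, `code_class`, `horizontal_direct_product`, `static_design_matrix`) are hypothetical, and the data are the six points used in the coding example later in this section.

```python
def code_continuous(values):
    """Center and scale so the minimum maps to -1 and the maximum to +1."""
    lo, hi = min(values), max(values)
    mid, half = (hi + lo) / 2.0, (hi - lo) / 2.0
    return [(v - mid) / half for v in values]


def code_class(level, k):
    """k-1 columns for a classification variable at a 1-based level out of k.

    Levels 1..k-1 get a single 1 in their own column; level k gets all -1.
    """
    if level == k:
        return [-1.0] * (k - 1)
    return [1.0 if j == level else 0.0 for j in range(1, k)]


def horizontal_direct_product(cols_a, cols_b):
    """Columns of a crossed effect for one design point: all products a*b."""
    return [a * b for a in cols_a for b in cols_b]


def static_design_matrix(xs, levels, k):
    """Rows of (intercept, coded X, k-1 columns for A) for a main-effects model."""
    coded_x = code_continuous(xs)
    return [[1.0, cx] + code_class(a, k) for cx, a in zip(coded_x, levels)]


# Six design points: continuous X and a three-level classification variable A.
xs = [1, 2, 3, 4, 5, 6]
levels = [1, 2, 3, 1, 2, 3]
X_static = static_design_matrix(xs, levels, k=3)
for row in X_static:
    print([round(v, 3) for v in row])
```

For a crossed effect such as X*A, `horizontal_direct_product([coded_x], code_class(a, k))` would supply the additional columns for each design point.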

Orthogonal Coding

If you specify CODING=ORTH or CODING=ORTHCAN, the points are first coded as described in the previous section and then recoded so that $X_C'X_C = N_C\cdot I$, where $X_C$ is the design matrix for the candidate points, $N_C$ is the number of candidates, and I is the identity matrix. This is required in order for the D- and A-efficiency measures to make sense. For the option CODING=ORTHCAN, this recoding is accomplished by computing a square matrix R such that $X_C'X_C = R'R$ and then transforming each row vector $\mb {x}$ as

$\displaystyle \mb {x} \rightarrow \mb {x}R^{-1}\sqrt {N_C}$

If you specify CODING=ORTH, the recoding is done in a similar fashion, except that the matrix R is computed according to $X_C'X_C + X_A'X_A + X_I'X_I = R'R$, where $X_A$ and $X_I$ are the design matrices for the AUGMENT= and INITDESIGN= data sets, respectively (coded as described in the previous section). Thus, these two orthogonal coding options differ only when there is an AUGMENT= or an INITDESIGN= data set: the option CODING=ORTH includes points from these data sets in computing the orthogonal coding, while the option CODING=ORTHCAN uses only the candidates themselves.
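The recoding can be sketched as follows. The text requires only that R be a square matrix with $R'R$ equal to the summed cross-product matrix; this sketch assumes R is the upper-triangular Cholesky factor, which is one valid choice and not necessarily the one the procedure uses. The helper names (`gram`, `cholesky_lower`, `recode`) are hypothetical, and keeping the $\sqrt{N_C}$ scaling when AUGMENT= or INITDESIGN= points are included is also an assumption.

```python
import math


def gram(*matrices):
    """Sum of X'X over the given row lists (all with the same column count)."""
    p = len(matrices[0][0])
    g = [[0.0] * p for _ in range(p)]
    for m in matrices:
        for row in m:
            for i in range(p):
                for j in range(p):
                    g[i][j] += row[i] * row[j]
    return g


def cholesky_lower(g):
    """Lower-triangular L with L L' = g, so that R = L' satisfies R'R = g."""
    p = len(g)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = g[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def recode(rows, L, n):
    """Map each row x to x R^{-1} sqrt(n) by forward-solving L z' = x'."""
    out = []
    for x in rows:
        z = []
        for i in range(len(x)):
            z.append((x[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
        out.append([v * math.sqrt(n) for v in z])
    return out


# Candidate points after the default (static) coding: intercept, X, A1, A2.
X_C = [
    [1.0, -1.0, 1.0, 0.0],
    [1.0, -0.6, 0.0, 1.0],
    [1.0, -0.2, -1.0, -1.0],
    [1.0, 0.2, 1.0, 0.0],
    [1.0, 0.6, 0.0, 1.0],
    [1.0, 1.0, -1.0, -1.0],
]

# CODING=ORTHCAN: R comes from the candidates only.
# For CODING=ORTH, pass the coded AUGMENT= and INITDESIGN= rows as well,
# for example gram(X_C, X_A, X_I).
L = cholesky_lower(gram(X_C))
Z = recode(X_C, L, len(X_C))
for row in Z:
    print([round(v, 3) for v in row])
```

With these six candidates, the recoded matrix satisfies $Z'Z = 6\cdot I$, and the first row comes out as approximately (1, –1.464, 0.598, –0.707).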

Example of Coding

For example, consider a main effect model with one continuous variable X and one three-level classification variable A. The results of the various coding options are shown in Table 14.7.

Table 14.7: Different Types of Design Coding

Original        Design Matrix With          Design Matrix With          Design Matrix With
Data            CODING=NONE                 CODING=STATIC               CODING=ORTH
X    A          Int   X    A1    A2         Int   X      A1    A2       Int   X        A1       A2
1    1          1     1    1     0          1     –1     1     0        1     –1.464   0.598    –0.707
2    2          1     2    0     1          1     –0.6   0     1        1     –0.878   –0.478   1.414
3    3          1     3    –1    –1         1     –0.2   –1    –1       1     –0.293   –1.554   –0.707
4    1          1     4    1     0          1     0.2    1     0        1     0.293    1.554    –0.707
5    2          1     5    0     1          1     0.6    0     1        1     0.878    0.478    1.414
6    3          1     6    –1    –1         1     1      –1    –1       1     1.464    –0.598   –0.707

The first column in each design matrix is an all-ones vector corresponding to the intercept, the next column corresponds to the linear effect of X, and the last two columns correspond to the two degrees of freedom for the main effect of A.

General Recommendations

Coding does not affect the relative ordering of designs by D-efficiency, and the same is true for G-efficiency and the average standard error of prediction. This is easy to see for the latter two measures, which are based on the variance of prediction, since how accurately a point is predicted should not be affected by how the independent variables are coded. For D-optimality, note again that coding corresponds to multiplying the design matrix on the right by some nonsingular transformation A, which changes the determinant of the information matrix as follows:

$\displaystyle |X'X| \rightarrow |A'X'XA| = |A'A||X'X| = |A|^2|X'X|$

Thus, recoding simply multiplies the D-criterion by a constant that is the same for all designs. Note, however, that A-optimality is not invariant to coding.
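A small numerical check of this invariance, written as a sketch in plain Python: the two five-point designs, the index lists, and the helper names (`gram`, `det`, `d_criterion`) are made up for illustration. The uncoded and default-coded matrices differ only in the X column, where the coded value is (x – 3.5)/2.5, so the transformation has determinant 0.4 and every design's $|X'X|$ should shrink by the same factor, 0.16.

```python
def gram(rows):
    """Information matrix X'X for a list of design-matrix rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d


def d_criterion(rows):
    """|X'X| for the given design rows."""
    return det(gram(rows))


# Columns for a three-level classification variable A (levels 1, 2, 3).
A_COLS = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [-1.0, -1.0]}
raw = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2), (6, 3)]

# Same candidates with X left as is, and with X centered and scaled to [-1, 1].
X_none = [[1.0, float(x)] + A_COLS[a] for x, a in raw]
X_static = [[1.0, (x - 3.5) / 2.5] + A_COLS[a] for x, a in raw]

# Two hypothetical five-point designs, given as candidate row indices.
design_1 = [0, 1, 2, 3, 5]
design_2 = [0, 1, 2, 3, 4]

ratios = []
for design in (design_1, design_2):
    d_none = d_criterion([X_none[i] for i in design])
    d_static = d_criterion([X_static[i] for i in design])
    ratios.append(d_static / d_none)
    print(d_none, d_static, d_static / d_none)
```

Both printed ratios should agree with $|A|^2 = 0.16$ up to rounding, so the ordering of the two designs by D-criterion is the same under either coding.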

Orthogonal coding is usually the right choice; it is not the default because it depends on the candidate set. Note, however, that for the distance-based criteria, if the distance between two points should be computed from the actual values of the model variables rather than the centered and scaled values, you should specify CODING=NONE or NOCODE. The NOCODE option can also be useful when the NOINT option is specified.