Sometimes, binary data such as Yes/No data are available—for example, 1 means “Yes, I have bought this brand in the last month” and 0 means “No, I have not bought this brand in the last month”. The following statements read a data set with Yes/No purchase data for three hypothetical brands.
```sas
title 'Doubling Yes/No Data';

proc format;
   value yn 0 = 'No '
            1 = 'Yes';
run;

data BrandChoice;
   input a b c;
   label a = 'Brand A'
         b = 'Brand B'
         c = 'Brand B';
   format a b c yn.;
   datalines;
0 0 1
1 1 0
0 1 1
0 1 0
1 0 0
;
```
Data such as these cannot be analyzed directly because the raw data do not consist of partitions, each with one column per
level and exactly one 1 in each row. (See the section Using the TABLES Statement.) The data must be doubled so that both Yes and No are represented by a column in the data matrix. The TRANSREG procedure provides one way of doubling.
In the following statements, the DESIGN option specifies that PROC TRANSREG is being used only for coding, not analysis. The option SEPARATORS=': ' specifies that labels for the coded columns are constructed from input variable labels, followed by a colon and space, followed by the formatted value. The variables are designated in the MODEL statement as CLASS variables, and the ZERO=NONE option creates binary variables for all levels. The OUTPUT statement specifies the output data set and drops the _NAME_, _TYPE_, and Intercept variables. PROC TRANSREG stores a list of coded variable names in a macro variable, &_TRGIND, which in this case has the value “aNo aYes bNo bYes cNo cYes”. This macro variable can be used directly in the VAR statement in PROC CORRESP. The following statements produce Figure 32.12. Only the input table is displayed.
```sas
proc transreg data=BrandChoice design separators=': ';
   model class(a b c / zero=none);
   output out=Doubled(drop=_: Intercept);
run;

proc print label;
run;

proc corresp data=Doubled norow short;
   var &_trgind;
run;
```
Figure 32.12: Doubling Yes/No Data
Doubling Yes/No Data

Obs | Brand A: No | Brand A: Yes | Brand B: No | Brand B: Yes | Brand B: No | Brand B: Yes | Brand A | Brand B | Brand B |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 1 | No | No | Yes |
2 | 0 | 1 | 0 | 1 | 1 | 0 | Yes | Yes | No |
3 | 1 | 0 | 0 | 1 | 0 | 1 | No | Yes | Yes |
4 | 1 | 0 | 0 | 1 | 1 | 0 | No | Yes | No |
5 | 0 | 1 | 1 | 0 | 1 | 0 | Yes | No | No |
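The same doubling can be sketched by hand in a DATA step, which makes the coding explicit: each input variable contributes a No column and a Yes column, and exactly one of the pair is 1 in each row. This is only an illustration, not a description of what PROC TRANSREG does internally. The data set name DoubledByHand is hypothetical, the column names are chosen to match the value of &_TRGIND quoted above, and the sketch assumes that a, b, and c contain no missing values.

```sas
/* Hypothetical manual doubling; DoubledByHand is an illustrative name */
data DoubledByHand;
   set BrandChoice;
   /* A comparison evaluates to 1 when true and 0 when false */
   aNo = (a = 0);  aYes = (a = 1);
   bNo = (b = 0);  bYes = (b = 1);
   cNo = (c = 0);  cYes = (c = 1);
run;

proc corresp data=DoubledByHand norow short;
   var aNo aYes bNo bYes cNo cYes;
run;
```

Unlike the PROC TRANSREG output, these hand-coded columns carry no labels, so they are identified by their variable names.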
A fuzzy-coded indicator also sums to 1.0 across levels of the categorical variable, but it is coded with fractions rather than with 0 and 1. The fractions represent the distribution of the attribute across several levels of the categorical variable.
Ordinal variables, such as survey responses of 1 to 3, can be represented as two fuzzy-coded variables, as shown in Table 32.2.
The values of the coding sum to one across the two coded variables.
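As a minimal sketch of this kind of coding, the following DATA step spreads a 1-to-3 response linearly across two coded variables. The data set name FuzzyOrdinal, the variable names Low and High, and the particular fractions are assumptions made for illustration; they are not necessarily the values in Table 32.2.

```sas
/* Hypothetical example: linear fuzzy coding of a 1-to-3 response */
data FuzzyOrdinal;
   input Response @@;
   High = (Response - 1) / 2;   /* 0, 0.5, 1 for responses 1, 2, 3 */
   Low  = 1 - High;             /* complement, so Low + High = 1   */
   datalines;
1 2 3
;

proc print data=FuzzyOrdinal noobs;
run;
```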
These next steps illustrate the use of binary and fuzzy-coded indicator variables. Fuzzy-coded indicators are used to represent missing data. Note that the missing values in the observation Igor are coded with equal proportions. The following statements produce Figure 32.13.
```sas
title 'Fuzzy Coding of Missing Values';

proc transreg data=Neighbor design cprefix=0;
   model class(Age Sex Height Hair / zero=none);
   output out=Neighbor2(drop=_: Intercept);
   id Name;
run;

data Neighbor3;
   set Neighbor2;
   if Sex = ' ' then do;
      Female = 0.5;
      Male   = 0.5;
   end;
   if Hair = ' ' then do;
      White = 1/3;
      Brown = 1/3;
      Blond = 1/3;
   end;
run;

proc print label noobs data=Neighbor3(drop=age--name);
   format _numeric_ best4.;
run;
```
Figure 32.13: Fuzzy Coding of Missing Values
Fuzzy Coding of Missing Values

Age Old | Age Young | Sex Female | Sex Male | Height Short | Height Tall | Hair Blond | Hair Brown | Hair White |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 0 | 0.5 | 0.5 | 1 | 0 | 0.33 | 0.33 | 0.33 |
There is one set of coded variables for each input categorical variable. If observation 12 is excluded, each set is a binary design matrix. Each design matrix has one column for each category and exactly one 1 in each row. Fuzzy coding is shown in the final observation, which corresponds to Igor. The observation for Igor has missing values for the variables Sex and Hair. The design matrix variables are coded with fractions that sum to one within each categorical variable.
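One way to confirm that property is to sum each set of coded variables and flag any row where a sum departs from 1. The sketch below assumes that, with CPREFIX=0, the coded variables are named after the level values (Old, Young, Female, Male, Short, Tall, Blond, Brown, White). The Sex and Hair names appear in the DATA step that creates Neighbor3, while the Age and Height names are inferred from the column labels in Figure 32.13.

```sas
/* Illustrative check; the Age and Height variable names are assumed */
data _null_;
   set Neighbor3;
   AgeSum    = sum(Old, Young);
   SexSum    = sum(Female, Male);
   HeightSum = sum(Short, Tall);
   HairSum   = sum(Blond, Brown, White);
   /* Tolerance absorbs rounding in fractions such as 1/3 */
   if abs(AgeSum - 1) > 1e-8 or abs(SexSum - 1) > 1e-8 or
      abs(HeightSum - 1) > 1e-8 or abs(HairSum - 1) > 1e-8 then
      put 'Coding does not sum to 1 for ' Name=;
run;
```

If every set sums to one, the step writes nothing to the log.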
An alternative way to represent missing data is to treat missing values as an additional level of the categorical variable. This alternative is available with the MISSING option in the PROC CORRESP statement. This approach yields coordinates for missing responses, allowing the comparison of “missing” along with the other levels of the categorical variables.
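A minimal sketch of that alternative follows. It assumes that the Neighbor data set holds the raw categorical variables Age, Sex, Height, and Hair used in the PROC TRANSREG step, and it requests a multiple correspondence analysis with the MCA option. The choice of the MCA and SHORT options here is illustrative.

```sas
/* Sketch: treat missing values as an extra level of each variable */
proc corresp data=Neighbor mca missing short;
   tables Age Sex Height Hair;
run;
```

With MISSING specified, the blank values of Sex and Hair for Igor form their own categories instead of causing the observation to be excluded.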
Greenacre and Hastie (1987) discuss additional coding schemes, including one for continuous variables. Continuous variables can be coded with PROC TRANSREG by specifying BSPLINE(variables / degree=1) in the MODEL statement.
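For instance, a degree-1 B-spline coding might be requested as follows. The data set Scores, the variable x, and the output data set Coded are hypothetical, and the sketch relies on the default knot settings.

```sas
/* Hypothetical continuous variable to be coded */
data Scores;
   input x @@;
   datalines;
1 2 3 4 5
;

/* DESIGN requests coding only, as in the earlier steps */
proc transreg data=Scores design;
   model bspline(x / degree=1);
   output out=Coded;
run;

proc print data=Coded;
run;
```

Printing Coded shows the coded columns that PROC TRANSREG creates for x alongside the original variable.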