Categories for nominal
and ordinal variables are defined by the normalized, formatted values
of the variable. If you have not explicitly assigned a format to a
variable, the default format for a numeric variable is BEST12., and
the default format for a character variable is $w., where w is the
length of the variable.
The formatted value
is normalized by:
- Removing leading blanks (left justification)
- Truncating to 32 characters
- Changing lowercase letters to uppercase.
Hence, if two values
of a variable differ only in the number of leading blanks or in the
case of their letters, they are assigned to the same category.
Also, if two values differ only past the first 32 characters (after
left justification), they are assigned to the same category.
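The normalization described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the software's actual implementation. The function name normalize_category is hypothetical, and the sketch assumes the value has already been formatted into a string.

```python
def normalize_category(formatted_value: str) -> str:
    """Normalize an already formatted value (hypothetical helper)."""
    left_justified = formatted_value.lstrip(" ")  # remove leading blanks
    truncated = left_justified[:32]               # keep the first 32 characters
    return truncated.upper()                      # lowercase -> uppercase


# Values that differ only in leading blanks or letter case collapse together.
same = {normalize_category(v) for v in ["   yes", "Yes", "YES"]}
print(same)

# Values that differ only past the first 32 characters also collapse together.
long_a = "x" * 32 + "tail one"
long_b = "x" * 32 + "tail two"
print(normalize_category(long_a) == normalize_category(long_b))
```

Truncation is applied after left justification, so leading blanks do not count toward the 32 characters.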
Dummy variables are
generated for categorical variables in the Regression and Neural Network
nodes. If a categorical variable has c categories, the number of dummy
variables is either c or c-1, depending on the role of the variable
and the options that are specified. In these nodes, the computer time
and memory required to analyze a categorical variable with c categories
are the same as those required to analyze c or c-1 interval-level
variables.
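To make the count of c versus c-1 dummy variables concrete, here is a minimal sketch using plain 0/1 indicator coding. The coding scheme, the choice of the last sorted category as the dropped reference level, and the names dummy_code and drop_reference are all assumptions made for illustration. The text above does not say which coding the nodes use.

```python
def dummy_code(values, drop_reference=False):
    """Return (column categories, 0/1 rows) for a list of category values."""
    categories = sorted(set(values))
    # Assumption: when c-1 columns are wanted, drop the last sorted category.
    columns = categories[:-1] if drop_reference else categories
    rows = [[1 if value == column else 0 for column in columns]
            for value in values]
    return columns, rows


observed = ["LOW", "HIGH", "MEDIUM", "LOW"]

full_columns, full_rows = dummy_code(observed)                  # c columns
reduced_columns, reduced_rows = dummy_code(observed, drop_reference=True)  # c-1

print(len(full_columns), len(reduced_columns))
```

Either way, each dummy column is handled like one more interval-level input, which is why the cost grows with the number of categories.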
When a categorical variable
appears in two or more data sets used in the same modeling node, such
as the training set (prior to DMDB processing), validation set, and
decision data set, the variable is not required to have the same type
and length in each data set. For example, a variable named TEMPERAT
could be numeric in the training set with values such as 98.6, but
a variable by the same name in the validation set could be character
with values such as "98.6". As long as the normalized, formatted
values from the two data sets agree, the values of the two variables
will be matched correctly. In the Neural Network node only, a categorical
variable that appears in two or more data sets must have the same
formatted length in each data set.
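The TEMPERAT example can be sketched as follows. The helper format_numeric is a rough stand-in for the default BEST12. format (right-justified in 12 columns) and is not an exact emulation of it. Both helper names are hypothetical.

```python
def normalize_category(formatted_value: str) -> str:
    """Left-justify, truncate to 32 characters, and uppercase."""
    return formatted_value.lstrip(" ")[:32].upper()


def format_numeric(value: float, width: int = 12) -> str:
    """Rough stand-in for BEST12.: compact digits, right-justified."""
    return format(value, ".10g").rjust(width)


# Numeric TEMPERAT in the training set, character TEMPERAT in the validation set.
training_value = format_numeric(98.6)
validation_value = "98.6"

# The raw formatted strings differ, but the normalized categories agree.
print(repr(training_value), repr(validation_value))
print(normalize_category(training_value) == normalize_category(validation_value))
```

Matching is done on the normalized, formatted values rather than on the stored values, so the difference in type and length between the two data sets does not matter here.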