After you view the sample statistics and variable distributions, it is
clear that some variables have highly skewed distributions. In a highly
skewed distribution, a small percentage of the data points can have a
large influence on the final model. Sometimes, transforming an input
variable yields a better-fitting model. The
Transform
Variables node enables you to perform variable transformations.
From the
Modify tab,
drag a
Transform Variables node to your diagram
workspace. Connect the
Data Partition node
to the
Transform Variables node. Click the ellipsis button
next to the
Variables property
of the
Transform Variables node. The
Variables window
appears.
In the
Variables window,
select the
Statistics option in the upper
right corner of the screen. Scroll the variables list all the way
to the right. You should see the
Skewness and
Kurtosis statistics.
The
Skewness statistic
indicates the level of skewness and the direction of skewness for
a distribution. A
Skewness value of 0 indicates
that the distribution is perfectly symmetrical. A positive
Skewness value
indicates that the distribution is skewed to the right, which describes
all of the variables in this data set. A negative value indicates
that the distribution is skewed to the left.
The
Kurtosis statistic
indicates the peakedness of a distribution. However, this example
focuses only on the
Skewness statistic.
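To make the sign convention concrete, the following is a minimal Python sketch of a skewness calculation. It assumes the adjusted sample formula (standard deviation with divisor n - 1); the formula that the Variables window uses is not stated here, so treat this as an illustration only. The function name and the data values are hypothetical.

```python
import math

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    # Sample standard deviation (divisor n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

# Hypothetical data, not taken from the example data set.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [0, 0, 0, 0, 1, 1, 2, 10]     # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image, tail to the left

print(sample_skewness(symmetric))     # approximately 0
print(sample_skewness(right_skewed))  # positive
print(sample_skewness(left_skewed))   # negative
```

Mirroring the data flips the sign of the statistic but not its magnitude, which is why the direction of the skew can be read directly from the sign.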
The
Transform
Variables node enables you to rapidly transform interval
variables using standard transformations. You can also create new
variables whose values are calculated from existing variables in the
data set. Note that the most skewed variables are, in order, DEROG,
DELINQ, VALUE, DEBTINC, NINQ, and LOAN. These six variables all have
a
Skewness value greater than 2. Close the
Variables window.
This example applies
a log transformation to all of the interval input variables. The log transformation
creates a new variable by taking the natural logarithm of each original
input variable. In your diagram workspace, select the
Transform
Variables node. Set the value of the
Interval
Inputs property to
Log.
Right-click the
Transform
Variables node and click
Run.
Click
Yes in the
Confirmation window.
In the
Run Status window, click
Results.
Maximize the
Transformations
Statistics window. This window provides statistics for
the original and transformed variables. The
Formula column
indicates the expression used to transform each variable. Notice that
the absolute value of the
Skewness statistic
for the transformed values is typically smaller than that of the original
variables. Close the
Results window.
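The effect of a log transformation on skewness can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's implementation: the data values are hypothetical, and the offset of 1 is an assumption added so that values of 0 remain defined. Check the Formula column for the expression that the node actually applies.

```python
import math

def sample_skewness(values):
    """Adjusted sample skewness (standard deviation uses divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

def log_transform(values, offset=1.0):
    """Natural log of each value; the offset is an assumption to handle zeros."""
    return [math.log(x + offset) for x in values]

# Hypothetical right-skewed input: most values are small, one is very large.
original = [0, 0, 1, 1, 2, 3, 5, 40]
transformed = log_transform(original)

before = sample_skewness(original)
after = sample_skewness(transformed)
print(f"skewness before: {before:.3f}, after: {after:.3f}")
```

The log compresses large values far more than small ones, so the long right tail is pulled in and the absolute skewness shrinks for this sample.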
The
Transform
Variables node enables you to perform a different transformation
on each variable. This is useful when your input data contains variables
that are skewed in different ways. In your diagram workspace, select
the
Transform Variables node. Click the ellipsis button
next to the
Variables property
of the
Transform Variables node. The
Variables window
appears. In the
Variables window, note the
Method column.
Use this column to set the transformation for each variable individually.
Before doing so,
recall the distribution of each variable. Select the variable
DEROG and
click
Explore. Note that nearly all of the
observations have a value of 0. Close the
Explore window.
Repeat this process
for the
DELINQ variable. Nearly all of the
values for DELINQ are equal to 0. The next largest class is the missing
values.
In situations where
there is a large number of observations at one value and relatively
few observations spread out over the rest of the distribution, it
can be useful to group the levels of an interval variable. Close the
Variables window.
Instead of fitting a
slope to the whole range of values for DEROG and DELINQ, you need
to estimate the mean in each group. Because most of the applicants
in the data set had no delinquent credit lines, there is a high concentration
of observations where DELINQ=0.
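The idea of estimating a mean in each group, rather than a slope, can be sketched in Python. The records below are hypothetical, and BAD is an assumed binary target used only for illustration.

```python
# Hypothetical (DELINQ, BAD) pairs; BAD is an assumed binary target.
records = [
    (0, 0), (0, 0), (0, 1), (0, 0), (1, 1),
    (2, 0), (3, 1), (0, 0), (5, 1), (0, 0),
]

def group_means(rows):
    """Mean of the target within the DELINQ = 0 and DELINQ > 0 groups."""
    groups = {"DELINQ = 0": [], "DELINQ > 0": []}
    for delinq, bad in rows:
        key = "DELINQ > 0" if delinq > 0 else "DELINQ = 0"
        groups[key].append(bad)
    # Skip empty groups to avoid dividing by zero.
    return {key: sum(vals) / len(vals) for key, vals in groups.items() if vals}

means = group_means(records)
for group, mean in means.items():
    print(f"{group}: mean BAD = {mean:.3f}")
```

With most observations concentrated at 0, two group means summarize the relationship without letting the few large DELINQ values dominate a fitted slope.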
In your process flow
diagram, select the
Transform Variables node.
Click the ellipsis button
next to the
Formulas property.
The
Formulas window appears.
The
Formulas window
enables you to create custom variable transformations. Select the
DELINQ variable
and click the
Create variable icon in the upper
left corner of the
Formulas window. The
Add
Transformation window appears.
Complete the following
steps to transform the DELINQ variable:
- Enter INDELINQ for the Name property. The default values are
  acceptable for the other properties.
- In the Formula dialog box, enter DELINQ > 0.
  This definition is an
  example of Boolean logic and illustrates one way to dichotomize an
  interval variable. The statement is either true or false for each
  observation. When the statement is true, the expression evaluates
  as 1. Otherwise, the expression evaluates as 0. In other words, when
  DELINQ>0, INDELINQ=1. Likewise, when DELINQ=0, INDELINQ=0. If the
  value of DELINQ is missing, the expression evaluates to 0, because
  missing values are treated as being smaller than any nonmissing values
  in a numerical comparison. Because a missing value of DELINQ is reasonably
  imputed as DELINQ=0, this does not pose a problem for this example.
- Click OK. The formula now appears in the Formulas window.
- Repeat the above steps for the variable DEROG. Name the new variable
  INDEROG.
- Click Preview in the lower left corner of the screen.
- Even though DEROG and
  DELINQ were used to construct the new variables, the original variables
  are still available for analysis. You can modify this if you want,
  but this example keeps the original variables. This is done because
  the transformed variables contain only a portion of the information
  that is contained in the original variables. Specifically, the new
  variables identify whether DEROG or DELINQ is greater than zero.
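The dichotomization described in the steps above can be sketched in Python. This is only an illustration of the logic, not the expression that the node evaluates: None stands in for a missing value, and it is mapped to 0 explicitly to mirror the behavior described for the DELINQ > 0 formula. The function name and the data values are hypothetical.

```python
def indicator_gt_zero(value):
    """Return 1 when value > 0, else 0; None stands in for a missing value."""
    if value is None:
        # Mirror the described behavior: a missing value yields 0.
        return 0
    return 1 if value > 0 else 0

# Hypothetical DELINQ values, including one missing value.
delinq = [0, 2, None, 0, 1]
indelinq = [indicator_gt_zero(v) for v in delinq]

print(indelinq)  # [0, 1, 0, 0, 1]
```

The indicator keeps only whether the count is positive, which is why the original variable still carries information that the new variable does not.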