Working with Nodes That Modify, Model, and Explore
Overview
Some data can be better mined by modifying the variable values with some transformation function. The data is often useful in its original form, but transforming the data might help maximize the information content that you can retrieve. Transformations are useful when you want to improve the fit of a model to the data. For example, transformations can be used to stabilize variance, remove nonlinearity, improve additivity, and correct non-normality.
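As a rough illustration of correcting non-normality, the sketch below measures skewness before and after a log transformation. It is written in Python rather than generated by Enterprise Miner, and the data is synthetic: the `amounts` list is a hypothetical right-skewed variable, not one of the tutorial's variables.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed data (log-normal), standing in for a gift amount.
rng = random.Random(42)
amounts = [rng.lognormvariate(3.0, 0.8) for _ in range(5000)]

# The natural log of log-normal data is normally distributed.
log_amounts = [math.log(a) for a in amounts]

print(f"skewness before log: {skewness(amounts):.2f}")
print(f"skewness after log:  {skewness(log_amounts):.2f}")
```

A skewness near zero after the transformation suggests a more symmetric distribution, which is the kind of improvement the distribution plots in the Transform Variables node let you check visually.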
You can use the Formula Builder and Expression Builder windows in the Transform Variables node to create variable transformations. You can also view distribution plots of variables before and after the transformation to assess how effective the data transformation is.
View Variable Distribution Plots
Drag a Transform Variables node from the Modify tab of the node toolbar into the Diagram Workspace.
Connect the Impute node to the Transform Variables node.
Select the Transform Variables node in the Diagram Workspace to view its settings in the Properties panel. The default transformation method for all variables is None. You can use the Variables property to configure variable transformation on a case-by-case basis, or you can use the Default Methods section of the Properties panel to set transformation methods according to variable type.
The variable distribution plots that you view in the Transform Variables node are generated using sampled data. You can configure how the data is sampled in the Sample Properties section of the Transform Variables node Properties panel.
In the Properties panel for the Transform Variables node, click the ellipsis button to the right of the Formulas property. This action opens the Formulas window.
In the Formulas window, the Outputs table is empty, because you have not created any variables yet.
Examine the distributions of the current variables, and note which variables might benefit from transformation. A good variable transformation modifies the distribution of the variable so that it more closely resembles a normal distribution (a bell-shaped curve).
View the distribution plots for the variables SES and URBANICITY to see the data before the missing values were replaced with imputed values. Distribution plots for the variables IMP_REPL_SES and IMP_REPL_URBANICITY show the data after the missing values were imputed and replaced.
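The idea of plotting a distribution from sampled data can be sketched outside Enterprise Miner as follows. This is a minimal Python illustration, not the node's actual sampling method: the sample size, the seed, the equal-width binning, and the synthetic `population` values are all assumptions.

```python
import random
from collections import Counter


def sample_rows(values, n, seed=12345):
    """Draw a simple random sample without replacement (n is capped at len(values))."""
    rng = random.Random(seed)
    return rng.sample(values, min(n, len(values)))


def histogram(values, bins=10):
    """Count values in equal-width bins; returns a list of (bin_start, count)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    return [(lo + i * width, counts.get(i, 0)) for i in range(bins)]


# Hypothetical skewed variable with 50,000 rows.
rng = random.Random(7)
population = [rng.expovariate(1 / 20.0) for _ in range(50_000)]

sample = sample_rows(population, 2000)
for start, count in histogram(sample, bins=8):
    print(f"{start:8.1f} | {'#' * (count // 25)}")
```

A text histogram that piles up in the first bins and trails off to the right indicates a skewed variable that might benefit from transformation.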
Add a Variable Transformation
Click the Create icon on the left side of the toolbar to start creating a variable transformation.
The Add Transformation window opens.
Edit the following Value columns to configure the new variable that you are creating:
Change the Name from TRANS_0 to OVERALL_RESP_RATE.
Set Format to PERCENT6. (the trailing period is part of the format name).
Set Label to Overall Response Rate.
In the Add Transformation window, click the button that opens the Expression Builder window. In the Expression Builder window, click All Functions to see the comprehensive list of pre-built SAS functions that are available for variable transformations.
Select the Variables List tab in the Expression Builder window. Scroll down the list of variables to REP_LIFETIME_GIFT_COUNT, select it, and insert it into the expression. The REP_LIFETIME_GIFT_COUNT variable appears in the Expression Text box.
Click the division operator button. Return to the Variables List tab and select the variable REP_LIFETIME_PROM.
Insert the variable into the expression. The REP_LIFETIME_GIFT_COUNT/REP_LIFETIME_PROM expression appears in the Expression Text box.
Confirm the expression in the Expression Builder window.
Confirm the new variable in the Add Transformation window.
In the Formulas window, display a plot of the new variable.
Note that because the distribution of OVERALL_RESP_RATE is skewed, you should transform it further.
Click the Edit Expression button on the left side of the Formulas window.
Select the REP_LIFETIME_GIFT_COUNT/REP_LIFETIME_PROM expression in the Expression Text box.
On the Functions tab, select the Mathematical folder and then select LOG(argument) from the panel on the right.
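Insert the function into the expression. The expression text is updated as follows:
LOG(REP_LIFETIME_GIFT_COUNT/REP_LIFETIME_PROM)
Confirm the expression in the Expression Builder window.
At the bottom left of the Formulas window, display the plot of the variable again.
The distribution is now much closer to a normal distribution.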
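The arithmetic behind the OVERALL_RESP_RATE expression can be sketched in Python as a ratio followed by a natural logarithm (the SAS LOG function returns the natural logarithm). The donor records below are made up for illustration, and representing an undefined result as `None` is an assumption of this sketch, not a statement about how Enterprise Miner stores missing values.

```python
import math


def overall_resp_rate(gift_count, prom_count):
    """Gifts per promotion; None when the ratio is undefined."""
    if gift_count is None or prom_count is None or prom_count == 0:
        return None
    return gift_count / prom_count


def log_resp_rate(gift_count, prom_count):
    """Natural log of the response rate; None when the log is undefined."""
    rate = overall_resp_rate(gift_count, prom_count)
    if rate is None or rate <= 0:
        return None
    return math.log(rate)


# Hypothetical donor records: (REP_LIFETIME_GIFT_COUNT, REP_LIFETIME_PROM)
donors = [(4, 40), (1, 25), (12, 60), (0, 18), (3, 0)]
for gifts, proms in donors:
    print(gifts, proms, overall_resp_rate(gifts, proms), log_resp_rate(gifts, proms))
```

Rates below 1 have negative logarithms, so the transformed values no longer read as proportions. Records with no gifts or no promotions have no defined log and need separate handling.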
Because the Overall Response Rate variable has been mathematically transformed, the variable's format (PERCENT) is no longer accurate. The variable format requires updating. To change the variable format, click the Edit Properties icon on the left side of the Formulas window.
In the Edit Transformation window, select Format and then press the Backspace key to clear the text box. Leave the Format value blank in order to use the default format for numeric values.
Confirm the change in the Edit Transformation window.
Exit the Formulas window.
Apply Standard Variable Transformations
You can now apply standard transformations to some of the original variables to modify the distributions so that they more closely resemble a normal distribution. Typical transformations include logarithmic, square root, and inverse functions, as well as binning. The default transformation method for all target and input measurement levels is None, as shown in the Properties panel.
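For orientation, the typical transformations named above can be written as small Python functions. These are generic textbook forms, not Enterprise Miner's exact definitions: the `offset` added before taking the log and the equal-width binning rule are assumptions made for this sketch.

```python
import math


def log_transform(x, offset=1.0):
    """log(x + offset); the offset is an assumption that keeps zero values defined."""
    return math.log(x + offset)


def sqrt_transform(x):
    """Square root; defined for non-negative values only."""
    return math.sqrt(x)


def inverse_transform(x):
    """Reciprocal; undefined at zero."""
    return 1.0 / x


def equal_width_bin(x, lo, hi, bins=4):
    """0-based index of the equal-width bin containing x, clamped to the range."""
    width = (hi - lo) / bins
    index = int((x - lo) / width)
    return max(0, min(index, bins - 1))


# Hypothetical gift amounts spanning a wide range.
for amount in [5.0, 20.0, 80.0, 320.0]:
    print(
        amount,
        round(log_transform(amount), 3),
        round(sqrt_transform(amount), 3),
        round(inverse_transform(amount), 4),
        equal_width_bin(amount, 0.0, 320.0),
    )
```

The log and square root compress large values more than small ones, which is why they tend to pull in a long right tail. Binning replaces the numeric scale with ordered categories.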
To apply transformations to selected variables, click the ellipsis button to the right of the Variables property in the Transform Variables Properties panel.
The Variables - Trans window opens.
You can transform individual variables in the Variables - Trans window. Apply the Log method to each of the following variables:
REP_FILE_AVG_GIFT
REP_LAST_GIFT_AMT
REP_LIFETIME_AVG_GIFT_AMT
REP_LIFETIME_GIFT_AMOUNT
Apply the Optimal method to the following variables:
REP_LIFETIME_CARD_PROM
REP_LIFETIME_GIFT_COUNT
REP_MEDIAN_HOME_VALUE
REP_MEDIAN_HOUSEHOLD_INCOME
REP_PER_CAPITA_INCOME
REP_RECENT_RESPONSE_PROP
REP_RECENT_STAR_STATUS
Select the Method column heading to sort the variable rows by the transformation method.
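The per-variable assignments above amount to a lookup table with a default, which the following Python sketch mirrors. It uses only a handful of the listed variables, and the `assigned` dictionary and helper names are hypothetical. Nothing here reproduces what the Log or Optimal methods compute.

```python
DEFAULT_METHOD = "None"

# A handful of the assignments listed above.
assigned = {
    "REP_FILE_AVG_GIFT": "Log",
    "REP_LAST_GIFT_AMT": "Log",
    "REP_LIFETIME_GIFT_COUNT": "Optimal",
    "REP_MEDIAN_HOME_VALUE": "Optimal",
}


def method_for(variable):
    """Transformation method for a variable, falling back to the default."""
    return assigned.get(variable, DEFAULT_METHOD)


def sorted_by_method(variables):
    """Rows of (variable, method), ordered by method and then by name."""
    rows = ((v, method_for(v)) for v in variables)
    return sorted(rows, key=lambda row: (row[1], row[0]))


# IMP_REPL_SES has no assignment, so it keeps the default method.
variables = list(assigned) + ["IMP_REPL_SES"]
for name, method in sorted_by_method(variables):
    print(f"{method:8} {name}")
```

Sorting by method groups the variables that share a transformation, which makes it easy to confirm that every intended variable was changed from the default.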
Confirm your selections to close the Variables - Trans window.
Note: When Enterprise Miner creates imputed variable values in a data set, the original data set variables remain, but are automatically assigned a Rejected variable status. Rejected variables are not included in data mining algorithms that follow the data imputation step.
Run the Transform Variables node.
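The note about rejected variables can be pictured as a change in variable metadata. The Python sketch below is a simplified model, not Enterprise Miner's internal representation: the `Variable` class, the role names, and the assumption that an imputed variable takes the original name with an IMP_ prefix (as the name IMP_REPL_SES suggests) are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    name: str
    role: str  # "Input", "Target", or "Rejected" in this simplified model


def add_imputed(variables, original_name, prefix="IMP_"):
    """Mark the original variable Rejected and add an imputed copy as an Input."""
    result = []
    for var in variables:
        if var.name == original_name:
            result.append(Variable(var.name, "Rejected"))
            result.append(Variable(prefix + var.name, "Input"))
        else:
            result.append(var)
    return result


def model_inputs(variables):
    """Names of the variables that later nodes would use as inputs."""
    return [v.name for v in variables if v.role == "Input"]


# Hypothetical metadata before imputation.
metadata = [Variable("REPL_SES", "Input"), Variable("REP_LAST_GIFT_AMT", "Input")]
metadata = add_imputed(metadata, "REPL_SES")
print(model_inputs(metadata))
```

The original column stays in the data set but drops out of the input list, so later nodes see only the imputed version.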
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.