Working with Nodes That Sample, Explore, and Modify
As you begin a project, you should consider creating summary statistics for each of the variables, including their relationship with the target, using tools like the StatExplore node.
In this task, you add a StatExplore node to your diagram.
Select the Explore tab on the toolbar at the top left and select the StatExplore node. Drag this node into the Diagram Workspace. Alternatively, right-click the Diagram Workspace and use the pop-up menus to add nodes to the workspace.
Connect the DONOR_RAW_DATA Data Source node to the StatExplore node.
Select the StatExplore node to view its properties. Details about the node appear in the Properties panel. By default, the StatExplore node creates Chi-Square statistics and correlation statistics.
Note: An alternate way to see all of the properties for a node is to double-click the node in the toolbar above the diagram.
To create Chi-Square statistics for the binned interval variables in addition to the class variables, set the Interval Variables property to Yes.
Right-click the StatExplore node and select Run. A Confirmation window appears. Click Yes. A green border appears around each successive node in the diagram as Enterprise Miner runs the path to the StatExplore node.
Note: An alternate way to run a node is to select the Run icon from the Toolbar Shortcut Buttons. Doing so runs the path from the Input Data node to the selected node on the diagram.
If there are any errors in the path that you ran, the border around the node that contains the error will be red rather than green, and an Error window will appear. The Error window tells you that the run has failed and provides information about what is wrong.
A Run Status window opens when the path has run. Click Results. The Results window opens.
Note: An alternate way to view results is to select the Results icon from the Toolbar Shortcut Buttons.
The Chi-Square plot highlights inputs that are associated with the target. Many of the binned continuous inputs have the largest Cramer's V values. Pearson's correlation coefficients are displayed if the target is a continuous variable.
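For intuition about the Cramer's V values that the Chi-Square plot ranks, the statistic can be computed by hand from a contingency table of a binned input against the target. The following is a minimal Python sketch, not Enterprise Miner code. The function name `cramers_v` and the example counts are hypothetical, and the sketch assumes that every row and column of the table has a nonzero total.

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts.

    Assumes every row total and column total is nonzero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalize by n times (smaller table dimension - 1)
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k))

# Hypothetical counts: rows are bins of an input, columns are target levels.
independent = [[5, 5], [5, 5]]
perfectly_associated = [[10, 0], [0, 10]]
print(cramers_v(independent), cramers_v(perfectly_associated))
```

Cramer's V ranges from 0 (no association) to 1 (perfect association), so inputs can be compared even when they have different numbers of bins.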
Maximize the Output window. The Output window provides distribution and summary statistics for the class and interval inputs, including summaries that are relative to the target.
Scroll down to the Interval Variables Summary Statistics section. The Non-Missing column lists the number of observations that have valid values for each interval variable. The Missing column lists the number of observations that have missing values for each interval variable.
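The Non-Missing and Missing columns amount to a per-variable count of present and absent values. A rough Python equivalent is sketched below. It assumes hypothetical records in which `None` marks a missing value. The rows are made up for illustration and are not taken from DONOR_RAW_DATA, and `missing_summary` is an illustrative helper, not part of Enterprise Miner.

```python
def missing_summary(records, variables):
    """Return {variable: (non_missing, missing)} for a list of dict records."""
    summary = {}
    for name in variables:
        missing = sum(1 for row in records if row.get(name) is None)
        summary[name] = (len(records) - missing, missing)
    return summary

# Hypothetical rows; None marks a missing value.
records = [
    {"DONOR_AGE": 54, "INCOME_GROUP": 3},
    {"DONOR_AGE": None, "INCOME_GROUP": 5},
    {"DONOR_AGE": 41, "INCOME_GROUP": None},
    {"DONOR_AGE": None, "INCOME_GROUP": 2},
]

summary = missing_summary(records, ["DONOR_AGE", "INCOME_GROUP"])
for name, (present, absent) in summary.items():
    print(f"{name}: non-missing={present}, missing={absent}")
```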
Several variables, such as DONOR_AGE, INCOME_GROUP, WEALTH_RATING, and MONTHS_SINCE_LAST_PROM_RESP, have missing values. A regression or neural network analysis excludes a customer's entire case when the value of any input variable is missing for that customer. Later, you will impute some of these variables using the Replacement node.
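To see why missing values matter for regression and neural network models, the case-exclusion behavior can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the rows are invented, `None` stands for a missing value, and `complete_cases` is a hypothetical helper rather than anything Enterprise Miner exposes.

```python
def complete_cases(records, inputs):
    """Keep only the records in which every listed input has a value."""
    return [
        row for row in records
        if all(row.get(name) is not None for name in inputs)
    ]

# Hypothetical rows; each variable is missing in just one of them.
rows = [
    {"DONOR_AGE": 54, "WEALTH_RATING": 7},
    {"DONOR_AGE": None, "WEALTH_RATING": 4},
    {"DONOR_AGE": 41, "WEALTH_RATING": None},
    {"DONOR_AGE": 63, "WEALTH_RATING": 9},
]

kept = complete_cases(rows, ["DONOR_AGE", "WEALTH_RATING"])
print(f"{len(kept)} of {len(rows)} cases remain")
```

In this toy data each variable is missing in only one row, yet half of the cases are dropped, because a single missing input is enough to exclude the whole case.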
Notice that many variables have very large standard deviations. Plot these variables to decide whether transformations are warranted.
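As a numeric complement to plotting (not a replacement for it), skewness can hint at whether a transformation such as a logarithm might help. The sketch below uses the population skewness formula on made-up values. The function name `skewness` and the data are assumptions for illustration, and the function assumes the values are not all identical.

```python
from math import log

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5.

    Assumes the values are not all identical (nonzero variance).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed values, such as gift amounts with one outlier.
raw = [1, 2, 3, 4, 100]
logged = [log(v) for v in raw]  # log requires strictly positive values

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

A positive result indicates a long right tail. If the skewness drops noticeably after the transformation, as it does for these values, the transformed variable is likely to be better behaved in a model.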
Close the Results window.
Note: If you make changes to any of the nodes in your process flow diagram after you have run a path, you need to rerun the path in order for the changes to affect later nodes.
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.