Example Process Flow Diagram
You can use Enterprise Miner to develop predictive models with the Regression, Neural Network, and Tree nodes. You can also import a model that you developed outside Enterprise Miner with a User Defined Model node, or write SAS code in a SAS Code node to create a predictive model. In addition, you can identify the important inputs (reduce the data dimension) with the Variable Selection node before modeling with one of the more intensive modeling nodes. For more information about predictive modeling, see the Predictive Modeling section of the Enterprise Miner online reference documentation (from the Help menu, select EM Reference, and then select Predictive Modeling).
Many credit organizations use logistic regression to model a binary target, such as GOOD_BAD. For this reason, a regression model is the first model that you train in this example.
Add a Regression node to the Diagram Workspace.
Connect the Transform Variables node to the Regression node.
Open the configuration interface to the Regression node. The Variables tab lists the input variables and the target. All of the input variables have a status of use, indicating that they will be used for training. If you know that an input is not important in predicting the target, you might want to set the status of that variable to don't use (right-click in the Status cell for that input, select Set Status, and then select don't use). For this example, all of the input variables are used to train the model.
For this example, use a stepwise regression to build the model. Stepwise regression systematically adds and deletes variables from the model based on the Entry and Stay significance levels (defaults of 0.05). Select the Selection Method tab and then click the Method drop-down arrow to select Stepwise.
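The entry/stay logic that stepwise selection follows can be sketched as a simple control loop. The sketch below is a simplified Python illustration, not the underlying DMREG implementation: the toy significance table is hypothetical, and in a real stepwise regression the p-values would come from refitting the model at each step.

```python
# Sketch of stepwise selection with Entry and Stay significance levels.
# The p-values used here are hypothetical constants for illustration only;
# real stepwise regression recomputes them by refitting at every step.
SLENTRY = 0.05  # Entry significance level (default 0.05)
SLSTAY = 0.05   # Stay significance level (default 0.05)

def stepwise(candidates, p_value, max_steps=100):
    """Repeatedly add the most significant remaining candidate that clears
    SLENTRY, then drop any selected variable whose p-value exceeds SLSTAY."""
    selected = []
    for _ in range(max_steps):
        remaining = [v for v in candidates if v not in selected]
        # Entry step: candidates must beat the Entry threshold.
        entry = [(p_value(selected, v), v) for v in remaining
                 if p_value(selected, v) < SLENTRY]
        if not entry:
            break
        selected.append(min(entry)[1])  # add the smallest p-value
        # Stay step: remove variables that no longer meet the Stay threshold.
        selected = [v for v in selected
                    if p_value([u for u in selected if u != v], v) < SLSTAY]
    return selected

# Toy significance table (hypothetical values).
toy_p = {"CHECKING": 0.001, "DURATION": 0.01, "HISTORY": 0.03, "AMOUNT": 0.20}
print(stepwise(list(toy_p), lambda sel, v: toy_p[v]))
# → ['CHECKING', 'DURATION', 'HISTORY']
```

With these toy p-values, AMOUNT never clears the Entry level of 0.05, so it is never added to the model.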
By default, the Regression node chooses the model with the smallest negative log likelihood. For this example, the node should automatically set the value for Criteria to Profit/loss. Profit/loss chooses the model that minimizes the expected loss for each decision using the validation data set. Because the validation data set will be used to fine-tune the model and to assess the model, ideally you would want to withhold a test data set to perform an unbiased assessment of the model.
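The profit/loss criterion can be illustrated with a short sketch: for each scored case, the decision with the smallest expected loss is taken, and the chosen losses are averaged over the validation data. The loss matrix and posterior probabilities below are hypothetical values assumed only for illustration.

```python
# Sketch of scoring a model with a profit/loss criterion. The loss matrix
# and the predicted probabilities are hypothetical.
LOSS = {                                        # LOSS[actual][decision]
    "good": {"accept": -1.0, "reject": 0.0},    # accepting a good risk earns $1
    "bad":  {"accept":  5.0, "reject": 0.0},    # accepting a bad risk costs $5
}

def expected_loss(p_good, decision):
    """Expected loss of a decision, given the predicted P(GOOD)."""
    return (p_good * LOSS["good"][decision]
            + (1 - p_good) * LOSS["bad"][decision])

def average_expected_loss(posteriors):
    """Take the loss-minimizing decision for each case, then average."""
    total = sum(min(expected_loss(p, d) for d in ("accept", "reject"))
                for p in posteriors)
    return total / len(posteriors)

print(average_expected_loss([0.95, 0.90, 0.60, 0.30]))  # → -0.275
```

A negative average loss is a profit; the model that minimizes this quantity on the validation data is the one the Profit/loss criterion selects.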
Select the Model Options tab. Notice that Type (of regression) is set to Logistic in the Regression subtab. If the target were an interval variable, such as average daily balance, then the type would automatically be set to Linear.
Select the Target Definition subtab of the Model Options tab. Notice that the event level is set to GOOD. If you wanted to model the probability that a customer has bad credit, you would need to reset the event level in the target profile. You can edit the target profile by right-clicking any cell of the target variable row in the Variables tab and selecting Edit profile.
Save the model by using the File menu to select Save Model As. Type a model name and description in the respective text boxes and then click OK. By default, the model is named "Untitled." When you run the node, the model is saved as an entry in the Model Manager.
Close the Regression node.
Train the model by right-clicking the Regression node icon in the Diagram Workspace and selecting Run. Because you did not run the predecessor nodes in the process flow, they execute before the Regression node begins training. In general, when a node is run, all predecessor nodes need to run in order to pass information through the process flow. You can run the Regression node when it is open, but only if you have first run the predecessor nodes in the process flow.
Click Yes in the Message window to view the results. When the Regression Results Browser opens, the Estimates tab is displayed.
Note: The initial bar chart frame may not have room to display all of the bar effects. To view all of the effect estimates, use the Format menu and select Rescale Axes.
The scores are ordered by decreasing value in the chart. The color density legend indicates the size of the score for a bar. The legend also displays the minimum and maximum score to the left and right of the legend, respectively. CHECKING, DURATION, and HISTORY are the most important model predictors.
To display a text box that contains summary statistics for a bar, select the View Info tool, then select the bar that you want to investigate, and hold the mouse button down to see the statistics.
You can use the Move and Resize Legend tool on the toolbox to reposition and increase the size of the legend.
To see a chart of the raw parameter estimates, click Estimates to display the parameter estimates plot. An advantage of effect T-scores over the raw parameter estimates is that their size can be compared directly to show the relative strengths of several explanatory variables for the same target. This eliminates the effect of differences in measurement scale for the different explanatory variables.
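This scale-invariance point can be illustrated numerically. The sketch below uses ordinary least squares rather than logistic regression, purely for simplicity: refitting the same toy data with the predictor rescaled by a factor of 1,000 changes the raw coefficient by the same factor, while the t-score is unchanged.

```python
# Demonstration (with OLS, for simplicity) that t-scores are comparable
# across measurement scales while raw coefficients are not. All data here
# is simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)            # predictor in some unit, e.g. meters
y = 2.0 * x + rng.normal(size=n)  # toy response

def t_scores(X, y):
    """Fit OLS with an intercept and return beta / standard error."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

t_original = t_scores(x[:, None], y)[1]
t_rescaled = t_scores(x[:, None] * 1000, y)[1]  # e.g. meters -> millimeters
print(t_original, t_rescaled)  # identical t-scores despite the unit change
```

The raw coefficient shrinks by a factor of 1,000 under the rescaling, but so does its standard error, so their ratio, the t-score, is unaffected by the choice of units.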
None of the transformations that you created in the Transform Variables node have large absolute effect T-scores. Ideally, you would spend more time on the exploratory and modification phases of the SEMMA methodology before training models. You can also try different Regression node settings to obtain a better-fitting model.
Select the Statistics tab to display statistics, such as Akaike's Information Criterion, the average squared error, and the average expected loss for the training and validation data sets. The average loss for the cases in the validation data set is about -54 cents (a 54-cent profit), adjusted for the prior probabilities that you specified in the prior vector of the target profile.
Select the Output tab to view the DMREG procedure output. PROC DMREG is the underlying Enterprise Miner procedure that is used to generate the results. The Output tab lists background information about the fitting of the logistic model, a response profile table, and a summary of the stepwise selection process. At each step, odds ratio estimates are provided. Odds ratios are often used to make summary statements about standard logistic regression models. By subtracting one from the odds ratio and multiplying by 100, you can state the percentage change in the odds of the response for each unit change in a model input. For nominal and binary inputs, odds ratios are presented versus the last level of the input.
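The odds ratio arithmetic described above is exponentiation of the model coefficient followed by the subtract-one, multiply-by-100 conversion. The coefficient value in this sketch is hypothetical.

```python
# Converting a logistic regression coefficient into an odds ratio, and the
# odds ratio into a percentage change in the odds. The coefficient value
# is hypothetical, chosen only for illustration.
import math

beta = 0.4                          # hypothetical coefficient for one input
odds_ratio = math.exp(beta)         # multiplicative change in odds per unit
pct_change = (odds_ratio - 1) * 100 # percentage change in odds per unit
print(round(odds_ratio, 3), round(pct_change, 1))  # → 1.492 49.2
```

In words: a one-unit increase in this input multiplies the odds of the response by about 1.49, i.e., increases the odds by about 49%.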
Close the Results Browser.
Copyright © 2006 by SAS Institute Inc., Cary, NC, USA. All rights reserved.