Example Process Flow Diagram
This section shows you how to arrange various Enterprise Miner nodes into a data mining diagram. Several key components of the Enterprise Miner process flow diagram are covered:
input data set
target profile for the target
partitioned data sets
variable transformation
supervised modeling, including logistic regression, tree, and neural network models
model comparison and assessment
new data scoring
For more information about the nodes used in this example, see the Enterprise Miner online reference documentation (Help → EM Reference), or attend one of the data mining courses offered by the SAS Education Division.
In the following scenario, you want to build models that predict the credit status of credit applicants. You will use the champion model, or score card, to determine whether to extend credit to new applicants. The aim is to anticipate and reduce charge-offs and defaults, which management has deemed too high.
The input data set that you will use to train the models is named SAMPSIO.DMAGECR (the German Credit benchmark data set). This data set is stored in the SAS Sample Library that is included with Enterprise Miner software. It consists of 1,000 past applicants and their resulting credit rating ("GOOD" or "BAD"). The binary target (dependent, response variable) is named GOOD_BAD. The other 20 variables in the data set will serve as model inputs (independent, explanatory variables).
Sixty percent of the data in the SAMPSIO.DMAGECR data set will be used to train the models (the training data). The remaining 40% will be used to prevent the models from overfitting the training data and to compare the models (the validation data). The models will be judged primarily on their assessed profitability and accuracy, and secondarily on their interpretability.
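In Enterprise Miner, this split is performed by the Data Partition node. As a rough sketch of the idea only (not Enterprise Miner's actual sampling code), a simple random 60/40 partition might look like this; the function name and seed are illustrative:

```python
import random

def partition(records, train_fraction=0.6, seed=12345):
    """Randomly split records into training and validation sets,
    mirroring the idea behind a 60/40 Data Partition node setting."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# With 1,000 applicants, a 60/40 split yields 600 training
# and 400 validation cases.
train, valid = partition(list(range(1000)))
print(len(train), len(valid))  # prints: 600 400
```

Enterprise Miner also supports stratified partitioning, which preserves the target's class proportions in each partition; the sketch above is a plain simple-random split.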
Each of the modeling nodes can make a decision for each case in the data to be scored, based on numerical consequences that you can specify via a decision matrix and cost variables or constant costs. In Enterprise Miner, a decision matrix is defined as part of the target profile for the target. For this example flow, you want to define a loss matrix that adjusts the models for the expected losses for each decision (accept or reject an applicant). Michie, Spiegelhalter, and Taylor (1994, p. 153) propose the following loss matrix for the SAMPSIO.DMAGECR data set:
Decisions:

| Target Values | Accept | Reject |
|---------------|--------|--------|
| Good          | $0     | $1     |
| Bad           | $5     | $0     |
The rows of the matrix represent the target values, and the columns represent the decisions. According to this loss matrix, accepting a bad credit risk is five times worse than rejecting a good credit risk. However, the matrix also says that you cannot make money no matter what you do, which makes the results difficult to interpret. In fact, if you accept a good credit risk, you will make money; that is, you will have a negative loss. And if you reject an applicant (good or bad), there will be no profit or loss aside from the cost of processing the application, which is ignored here. Hence, subtracting one from the first row gives a more realistic loss matrix:
Decisions:

| Target Values | Accept | Reject |
|---------------|--------|--------|
| Good          | -$1    | $0     |
| Bad           | $5     | $0     |
This loss matrix will yield the same decisions and the same model selections as the first matrix, but the summary statistics for the second matrix will be easier to interpret.
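To see why the two matrices yield the same decisions, note that each decision is chosen to minimize expected loss, and subtracting a constant from an entire row shifts every decision's expected loss by the same amount (the constant times that class's posterior probability), leaving the minimizer unchanged. The following sketch (illustrative Python, not Enterprise Miner code) checks this:

```python
def best_decision(posteriors, loss_matrix):
    """Pick the decision with minimum expected loss.
    posteriors:  {"good": p, "bad": 1 - p}
    loss_matrix: {target_value: {decision: loss}}"""
    decisions = ["accept", "reject"]
    def expected_loss(d):
        return sum(posteriors[t] * loss_matrix[t][d] for t in loss_matrix)
    return min(decisions, key=expected_loss)

original = {"good": {"accept": 0,  "reject": 1},
            "bad":  {"accept": 5,  "reject": 0}}
adjusted = {"good": {"accept": -1, "reject": 0},
            "bad":  {"accept": 5,  "reject": 0}}

# Both matrices pick the same decision for every posterior probability.
for p_good in [0.5, 0.8, 0.95]:
    post = {"good": p_good, "bad": 1 - p_good}
    assert best_decision(post, original) == best_decision(post, adjusted)
```

Under either matrix, an applicant is accepted only when the posterior probability of GOOD exceeds 5/6, the break-even point implied by the 5-to-1 loss ratio.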
For a categorical target such as GOOD_BAD, each modeling node can estimate posterior probabilities for each class, which are defined as the conditional probabilities of the classes given the input variables. By default, Enterprise Miner computes the posterior probabilities under the assumption that the prior probabilities are proportional to the class frequencies in the training data set. For this example, you need to specify the correct prior probabilities in the decision data set, because the class proportions in the training data set differ substantially from those in the operational data set to be scored. The training data set contains 70% good and 30% bad credit risk applicants, whereas the assumed proportions in the score data set are 90% good and 10% bad. If you specify the correct priors in the target profile for GOOD_BAD, the posterior probabilities will be correctly adjusted no matter what the proportions are in the training data set.
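Enterprise Miner performs this adjustment internally when you specify priors in the target profile. Conceptually (a sketch of the standard prior-correction formula, not SAS code), each posterior is re-weighted by the ratio of the true prior to the training prior and then renormalized:

```python
def adjust_posteriors(posteriors, train_priors, true_priors):
    """Re-weight class posteriors estimated under the training priors
    so that they reflect the true operational priors, then renormalize."""
    raw = {c: posteriors[c] * true_priors[c] / train_priors[c]
           for c in posteriors}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}

train_priors = {"good": 0.7, "bad": 0.3}  # proportions in SAMPSIO.DMAGECR
true_priors  = {"good": 0.9, "bad": 0.1}  # assumed score-data proportions

# A posterior of 0.7 good / 0.3 bad under the training priors becomes
# 0.9 good / 0.1 bad once the true priors are applied.
adjusted = adjust_posteriors({"good": 0.7, "bad": 0.3},
                             train_priors, true_priors)
```

An applicant who looks borderline relative to the training sample can therefore look much safer once the true 90/10 mix of the operational population is taken into account.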
When the most appropriate model for screening bad credit applicants is determined, the scoring code will be deployed to a fictitious score data set that is named SAMPSIO.DMAGESCR. It contains 75 new applicants. This data set is also stored in the SAS Sample Library. Scoring new data that does not contain the target is the end result of most data mining applications.
Follow these steps to create this process flow diagram:
Note: Example results may differ. Enterprise Miner nodes and their statistical methods may incrementally change between successive releases. Your process flow diagram results may differ slightly from the results shown in this example. However, the overall scope of the analysis will be the same.
Copyright © 2006 by SAS Institute Inc., Cary, NC, USA. All rights reserved.