Example Process Flow Diagram

Process Flow Diagram Scenario

This section shows you how to arrange various Enterprise Miner nodes into a data mining process flow diagram and covers several of the diagram's key components.

For more information about the nodes used in this example, see the Enterprise Miner online reference documentation (Help → EM Reference) or attend one of the data mining courses that the SAS Education Division offers.

In the following scenario, you want to build models that predict the credit status of credit applicants. You will use the champion model, or scorecard, to decide whether to extend credit to new applicants. The aim is to anticipate and reduce charge-offs and defaults, which management has deemed too high.

The input data set that you will use to train the models is named SAMPSIO.DMAGECR (the German Credit benchmark data set). This data set is stored in the SAS Sample Library that is included with Enterprise Miner software. It consists of 1,000 past applicants and their resulting credit rating ("GOOD" or "BAD"). The binary target (dependent, response variable) is named GOOD_BAD. The other 20 variables in the data set will serve as model inputs (independent, explanatory variables).

VARIABLE   ROLE     LEVEL      DESCRIPTION

CHECKING   input    ordinal    Checking account status
DURATION   input    interval   Duration in months
HISTORY    input    ordinal    Credit history
PURPOSE    input    nominal    Purpose
AMOUNT     input    interval   Credit amount
SAVINGS    input    ordinal    Savings account/bonds
EMPLOYED   input    ordinal    Present employment since
INSTALLP   input    interval   Installment rate as % of disposable income
MARITAL    input    nominal    Personal status and gender
COAPP      input    nominal    Other debtors/guarantors
RESIDENT   input    interval   Present residence since
PROPERTY   input    nominal    Property
AGE        input    interval   Age in years
OTHER      input    nominal    Other installment plans
HOUSING    input    nominal    Housing
EXISTCR    input    interval   Number of existing credits at this bank
JOB        input    ordinal    Job title
DEPENDS    input    interval   Number of dependents
TELEPHON   input    binary     Telephone
FOREIGN    input    binary     Foreign worker
GOOD_BAD   target   binary     Good or bad credit rating
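
If you want a quick look at these variables before you build the flow, you can run a short SAS program against the sample data set. The following sketch (assuming the SAMPSIO sample library is available in your SAS session) lists the variable attributes and the distribution of the GOOD_BAD target; Enterprise Miner itself does not require this step.

   /* List the variables in the German Credit sample data set */
   proc contents data=sampsio.dmagecr;
   run;

   /* Show the distribution of the binary target GOOD_BAD */
   proc freq data=sampsio.dmagecr;
      tables good_bad;
   run;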

Sixty percent of the data in the SAMPSIO.DMAGECR data set will be employed to train the models (the training data). The remainder of the data will be used to adjust the models for overfitting with regard to the training data and to compare the models (the validation data). The models will be judged primarily on their assessed profitability and accuracy and secondarily on their interpretability.
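The Data Partition node performs this split for you in the flow, and it can also stratify the split by the target. Purely as an illustration of what a simple 60/40 split amounts to, the following sketch divides the data with a random draw (the seed and output data set names are illustrative):

   /* Illustrative 60/40 split done in code; in the example flow the */
   /* Data Partition node performs this step.                        */
   data train validate;
      set sampsio.dmagecr;
      if ranuni(12345) < 0.6 then output train;   /* about 60% for training   */
      else output validate;                        /* about 40% for validation */
   run;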

Each of the modeling nodes can make a decision for each case in the data to be scored, based on numerical consequences that you can specify via a decision matrix and cost variables or constant costs. In Enterprise Miner, a decision matrix is defined as part of the target profile for the target. For this example flow, you want to define a loss matrix that adjusts the models for the expected losses for each decision (accept or reject an applicant). Michie, Spiegelhalter, and Taylor (1994, p. 153) propose the following loss matrix for the SAMPSIO.DMAGECR data set:


                      Decisions
Target Values      Accept     Reject
Good                 $0         $1
Bad                  $5         $0

The rows of the matrix represent the target values, and the columns represent the decisions. According to this loss matrix, accepting a bad credit risk is five times worse than rejecting a good credit risk. However, the matrix also implies that you cannot make money no matter what you do, which makes the results difficult to interpret. In reality, if you accept a good credit risk, you will make money; that is, you will have a negative loss. And if you reject an applicant (good or bad), there is no profit or loss aside from the cost of processing the application, which is ignored here. Hence, subtracting one from the first row of the matrix yields a more realistic loss matrix:


                      Decisions
Target Values      Accept     Reject
Good                $-1         $0
Bad                  $5         $0

This loss matrix will yield the same decisions and the same model selections as the first matrix, but the summary statistics for the second matrix will be easier to interpret.
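The modeling nodes apply the loss matrix automatically; the arithmetic behind each decision is simply the expected loss under the posterior probabilities. The following sketch shows that calculation for the adjusted matrix, using illustrative variable names (P_GOOD and P_BAD stand in for the posterior probabilities that a node would produce):

   /* Illustrative expected-loss decision using the adjusted loss matrix */
   data decisions;
      set scored;                                /* assumed data set of posterior probabilities */
      length decision $ 6;
      exploss_accept = p_good*(-1) + p_bad*5;    /* expected loss if the applicant is accepted  */
      exploss_reject = p_good*0    + p_bad*0;    /* expected loss if the applicant is rejected  */
      if exploss_accept < exploss_reject then decision = 'ACCEPT';
      else decision = 'REJECT';
   run;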

For a categorical target such as GOOD_BAD, each modeling node can estimate posterior probabilities for each class, which are defined as the conditional probabilities of the classes given the input variables. By default, Enterprise Miner computes the posterior probabilities under the assumption that the prior probabilities are proportional to the frequencies of the classes in the training data set. For this example, you need to specify the correct prior probabilities in the decision data set, because the class proportions in the training data set differ substantially from the proportions in the operational data set to be scored. The training data set contains 70% good and 30% bad credit risk applicants, whereas the assumed proportions in the score data set are 90% good and 10% bad. If you specify the correct priors in the target profile for GOOD_BAD, the posterior probabilities will be correctly adjusted no matter what the proportions are in the training data set.
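The adjustment amounts to reweighting each posterior probability by the ratio of the operational prior to the training proportion and then renormalizing. Enterprise Miner performs this adjustment for you when the priors are specified in the target profile; the following sketch, with illustrative variable names, only shows the underlying arithmetic:

   /* Illustrative prior adjustment of posterior probabilities */
   data adjusted;
      set scored;                               /* assumed data set with unadjusted posteriors */
      w_good = p_good * (0.90 / 0.70);          /* operational prior / training proportion     */
      w_bad  = p_bad  * (0.10 / 0.30);
      p_good_adj = w_good / (w_good + w_bad);   /* renormalize so the posteriors sum to 1      */
      p_bad_adj  = w_bad  / (w_good + w_bad);
   run;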

After the most appropriate model for screening bad credit applicants has been determined, its scoring code will be deployed against a fictitious score data set named SAMPSIO.DMAGESCR, which contains 75 new applicants and is also stored in the SAS Sample Library. Scoring new data that does not contain the target is the end result of most data mining applications.
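In the example flow, the scoring is done within the diagram itself. If you instead export the score code from Enterprise Miner and run it in a SAS session, deploying it against the score data typically looks like the following sketch (the score-code file name is illustrative):

   /* Illustrative deployment of exported score code to the score data set */
   data scored_applicants;
      set sampsio.dmagescr;
      %include 'credit_score_code.sas';   /* assumed path to score code exported from the flow */
   run;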

Follow these steps to create this process flow diagram:

[Process flow diagram: an Input Data Source node connects to a Data Partition node, which connects to a Transform Variables node, which connects to three modeling nodes: Regression, Neural Network, and Tree. All three modeling nodes connect to an Assessment node, which connects to a Score node. The Score node is connected to a second Input Data Source node and to a Distribution Explorer node, which is connected to a SAS Code node.]

Note: Example results may differ. Enterprise Miner nodes and their statistical methods may incrementally change between successive releases. Your process flow diagram results may differ slightly from the results shown in this example. However, the overall scope of the analysis will be the same.
