About the Rapid Predictive Modeler

Overview of the Rapid Predictive Modeler

Sampling Strategies for the SAS Rapid Predictive Modeler

Organizing Data for the SAS Rapid Predictive Modeler

Overview of the Rapid Predictive Modeler

SAS Rapid Predictive Modeler is designed to build models for the following types of data mining classification and regression problems:

classification models, which predict the value of a discrete variable. Examples of such values are True or False; Purchase or Decline; High, Medium, or Low; and Churn or Continues.
regression models, which predict the value of a continuous variable. Examples are models that predict amounts such as revenue, sales, or success rate.

To create a model by using the SAS Rapid Predictive Modeler, you must supply a data set, where every row contains a set of independent predictor variables (known as inputs) and at least one dependent variable (known as a target). The SAS Rapid Predictive Modeler decides whether variables are continuous or categorical, and chooses the input variables that should be included in the model.

Your model can be saved as SAS code and then deployed in a SAS environment. You can use the SAS model code to score new data, and then use the results to make more informed business decisions. This process is called model scoring. For example, you can use scored data to identify customers who are at risk of churn, or to detect transactions that might be fraudulent.

Sampling Strategies for the SAS Rapid Predictive Modeler

The SAS Rapid Predictive Modeler uses a composite sampling approach. The number of observations that are included in the data sample depends on these factors:

number of input variables
total number of observations in the data source
whether the data contains rare event targets
number of events in the data

Here are the guidelines that the SAS Rapid Predictive Modeler uses to determine the number of observations that are processed:

Number of Input Variables	Number of Observations Processed
<100	80,000
100–200	40,000
>200	20,000

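The guidelines in the preceding table can be written as a small lookup. This is a minimal sketch for illustration only; the function name `observations_processed` is hypothetical and is not part of the SAS Rapid Predictive Modeler.

```python
def observations_processed(num_inputs: int) -> int:
    """Return the guideline number of observations for a given input count."""
    if num_inputs < 0:
        raise ValueError("num_inputs must be nonnegative")
    if num_inputs < 100:
        return 80_000
    if num_inputs <= 200:  # the 100-200 band, both ends included
        return 40_000
    return 20_000  # more than 200 input variables


print(observations_processed(150))
```
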
To understand the conditions in the following table, here are some key points:

The number of observations being processed is determined by the number of input variables. See the preceding table.
In predictive modeling, if you are modeling a binary target, your target variable has two levels, typically coded as 0 and 1, and one of those levels is the event. The levels could also be formatted as No and Yes. Here is an example. A bank is trying to predict whether a customer will have bad credit. In the training data, each customer with bad credit is set to Yes, which means an event occurred for that customer. Each customer with good credit is considered a non-event.

Condition	Rare Event: Yes	Rare Event: No
total number of observations < number of observations being processed OR total number of events < (0.10 × number of observations being processed)	Sample the data so that there is a 10:1 ratio of non-events to events.	no sampling
total number of events > (0.10 × number of observations being processed)	Sample the following proportion of the rare events: $10 \times \frac{0.10 \times \text{number of observations being processed}}{\text{number of events}}$	stratified sampling
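
The conditions in the preceding table can be sketched as a decision function. This is an illustration, not the tool's implementation: the names `Strategy` and `choose_strategy` are hypothetical, and the table does not say what happens when the number of events exactly equals 0.10 times the number of observations being processed, so the sketch assumes that this boundary case falls in the second row.

```python
from enum import Enum


class Strategy(Enum):
    RATIO_10_TO_1 = "sample to a 10:1 ratio of non-events to events"
    NO_SAMPLING = "no sampling"
    PROPORTION_OF_EVENTS = "sample a proportion of the rare events"
    STRATIFIED = "stratified sampling"


def choose_strategy(total_obs, total_events, n_processed, rare_event):
    """Pick the table cell that applies to a data source."""
    threshold = 0.10 * n_processed
    # First row: the data is smaller than the sample, or events are scarce.
    if total_obs < n_processed or total_events < threshold:
        return Strategy.RATIO_10_TO_1 if rare_event else Strategy.NO_SAMPLING
    # Second row (assumption: equality with the threshold is handled here).
    return Strategy.PROPORTION_OF_EVENTS if rare_event else Strategy.STRATIFIED


print(choose_strategy(1_000_000, 20_000, 80_000, rare_event=True).value)
```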

Organizing Data for the SAS Rapid Predictive Modeler

Before you can build a model, you need input data that represents historical events and characteristics that can be used for prediction. You also need target data that represents the event or value that you want to predict. In many cases, the input data is derived from one time period and the target data is derived from a later time period. The combined input and target data that you use to develop your model is called training data.

For example, you might mine last year's sales receipts to predict next year's expected revenue or to predict which customers will respond to a special offer. Using historical data from past events to predict performance on future events is called model training.

For the best model results, your model training data should contain a large number of observations stored as rows of data. For example, many retail customer models use input data that has tens of thousands of observations.

If your target variable contains a rare event (for example, an offer that perhaps only 1% of your customers will respond to), you must ensure that your training data contains a significant number of customers who experienced the event. You might want to oversample your training data so that it includes all customers who accepted the offer and an equal number of customers who did not accept. Oversampling makes it easier for a model with a rare event target to find a stable solution.
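
The oversampling idea described above (keep every event and add an equal number of non-events) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function `oversample`, the `accepted` column, and the synthetic customer list are all hypothetical, and the non-events are drawn by simple random sampling.

```python
import random


def oversample(rows, target, event_value, seed=0):
    """Keep all event rows plus an equal-sized random sample of non-events."""
    events = [r for r in rows if r[target] == event_value]
    non_events = [r for r in rows if r[target] != event_value]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(events), len(non_events))
    return events + rng.sample(non_events, k)


# Synthetic data: 1% of 1,000 customers accepted the offer.
customers = [{"id": i, "accepted": "Y" if i % 100 == 0 else "N"}
             for i in range(1000)]
balanced = oversample(customers, target="accepted", event_value="Y")
print(len(balanced))
```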

When you perform oversampling to boost rare event occurrences in your training data, you artificially inflate the occurrence of targeted events in your training data relative to the natural population. To compensate for the difference between the training data and the population data, the SAS Rapid Predictive Modeler provides you with a prior probability setting. Prior probability settings specify the true proportional frequencies of the targeted event in the population data.
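
To see why a prior probability matters, here is a sketch of the standard textbook correction that rescales a predicted event probability from the oversampled training proportion back to the population proportion. It is not claimed to be the computation that the SAS Rapid Predictive Modeler performs internally; the function name `adjust_for_prior` and its parameters are hypothetical.

```python
def adjust_for_prior(p_event, train_rate, prior):
    """Rescale a probability from the training event rate to the population prior.

    p_event    -- event probability predicted by a model fit on oversampled data
    train_rate -- proportion of events in the oversampled training data
    prior      -- true proportion of events in the population
    """
    if not (0 < train_rate < 1 and 0 < prior < 1):
        raise ValueError("train_rate and prior must be strictly between 0 and 1")
    event_term = p_event * prior / train_rate
    non_event_term = (1 - p_event) * (1 - prior) / (1 - train_rate)
    return event_term / (event_term + non_event_term)


# A 50/50 training sample drawn from a population with a 1% event rate:
print(adjust_for_prior(0.5, train_rate=0.5, prior=0.01))
```

When the training proportion already equals the prior, the function returns the predicted probability unchanged.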

The data that you mine using the SAS Rapid Predictive Modeler should be organized into rows (observations) and columns (variables). One of the columns should represent a target variable.

Consider the following example:

Name	Age	Gender	Income	Treatment	Purchase
Ricardo	29	M	33000	Y	Y
Susan	35	F	51000	Y	N
Jeremy	49	M	110000	N	Y

Name

a column that contains ID values for each observation. The SAS Rapid Predictive Modeler does not process ID variable columns for analytical content.

Age, Gender, Income, and Treatment

input columns that are used by the SAS Rapid Predictive Modeler.

Purchase

a target column.
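
The column roles in this example can be pictured as a mapping from column name to role. The sketch below is only an illustration in Python; the `roles` dictionary and the helper `columns_with_role` are hypothetical and are not how roles are assigned in the product.

```python
rows = [
    {"Name": "Ricardo", "Age": 29, "Gender": "M", "Income": 33000,
     "Treatment": "Y", "Purchase": "Y"},
    {"Name": "Susan", "Age": 35, "Gender": "F", "Income": 51000,
     "Treatment": "Y", "Purchase": "N"},
    {"Name": "Jeremy", "Age": 49, "Gender": "M", "Income": 110000,
     "Treatment": "N", "Purchase": "Y"},
]

# One role per column, following the descriptions above.
roles = {"Name": "id", "Age": "input", "Gender": "input",
         "Income": "input", "Treatment": "input", "Purchase": "target"}


def columns_with_role(roles, role):
    """Return the column names assigned to a role, in table order."""
    return [name for name, assigned in roles.items() if assigned == role]


inputs = columns_with_role(roles, "input")
target = columns_with_role(roles, "target")[0]
X = [[row[c] for c in inputs] for row in rows]  # the ID column is left out
y = [row[target] for row in rows]
print(inputs, y)
```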

When you configure your table of input data, you can also designate a frequency column. The values in the frequency column are nonnegative integers; each value indicates how many times that observation is counted in the analysis.

By using the Variables to exclude from the model role, you can also select columns that you want the SAS Rapid Predictive Modeler to ignore during your analysis.

Training data always requires input and target variable values. Data that you use for scoring requires only input variable values; a target column is optional. When the model is used to make predictions from new data, the target column is not required. When the model is used to monitor effectiveness, the target column is required. Data that you use for scoring also typically includes an ID column.
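
These requirements can be expressed as a simple pre-flight check on a table's columns. This is a hedged sketch: the function `missing_columns` and the purpose labels `"training"`, `"prediction"`, and `"monitoring"` are hypothetical names chosen for this illustration.

```python
def missing_columns(columns, inputs, target, purpose):
    """List required columns that are absent for the given purpose."""
    if purpose not in ("training", "prediction", "monitoring"):
        raise ValueError(f"unknown purpose: {purpose!r}")
    required = list(inputs)
    # The target is needed to train a model or to monitor its effectiveness,
    # but not to make predictions from new data.
    if purpose in ("training", "monitoring"):
        required.append(target)
    present = set(columns)
    return [c for c in required if c not in present]


new_data = ["Name", "Age", "Gender", "Income", "Treatment"]
inputs = ["Age", "Gender", "Income", "Treatment"]
print(missing_columns(new_data, inputs, "Purchase", "prediction"))
print(missing_columns(new_data, inputs, "Purchase", "monitoring"))
```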

Reserved Prefixes for Variables

SAS Enterprise Miner uses several default prefixes for generated nodes. If any variable in your input data set has a name that begins with one of these prefixes, you might see an error in the SAS log. In that case, it is recommended that you rename the variable in the input data set.

Reserved Prefixes
BL_	BP_	CL_	CP_
D_	E_	EL_	EP_
F_	I_	IC_	M_
P_	Q_	R_	RA_
RAS_	RAT_	RD_	RDS_
RDT_	ROI_	RS_	RT_
S_	T_	U_	V_
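
Before you submit a data set, you could screen its variable names against the table above. The following Python sketch is one way to do that outside of SAS; the function name `reserved_conflicts` is hypothetical, and the comparison ignores case because SAS variable names are not case sensitive.

```python
RESERVED_PREFIXES = (
    "BL_", "BP_", "CL_", "CP_",
    "D_", "E_", "EL_", "EP_",
    "F_", "I_", "IC_", "M_",
    "P_", "Q_", "R_", "RA_",
    "RAS_", "RAT_", "RD_", "RDS_",
    "RDT_", "ROI_", "RS_", "RT_",
    "S_", "T_", "U_", "V_",
)


def reserved_conflicts(names):
    """Return the variable names that begin with a reserved prefix."""
    return [n for n in names if n.upper().startswith(RESERVED_PREFIXES)]


print(reserved_conflicts(["P_Purchase", "Age", "rd_score", "Income", "Treatment"]))
```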