Data Mining and SEMMA

Definition of Data Mining

Overview of the Data

Predictive and Descriptive Techniques

Overview of SEMMA

Overview of the Nodes

Some General Usage Rules for Nodes

Definition of Data Mining

This document defines data mining as advanced methods for exploring and modeling relationships in large amounts of data.

Overview of the Data

A typical data set has many thousands of observations. An observation can represent an entity such as an individual customer, a specific transaction, or a certain household. Variables in the data set contain specific information such as demographic information, sales history, or financial information for each observation. How this information is used depends on the research question of interest.

When discussing types of data, consider the measurement level of each variable. You can generally classify each variable as one of the following:

Interval — a continuous variable that contains values across a range. For these variables, the mean (or average) is interpretable. Examples include income, temperature, or height.
Categorical — a classification variable with a finite number of distinct, discrete values. Examples include gender (male or female) and drink size (small, medium, large). Because these variables are noncontinuous, the mean is not interpretable. Categorical data can be grouped in several ways. SAS Enterprise Miner uses the following groupings:
- Unary — a variable that has the same value for every observation in the data set.
- Binary — a variable that has two possible values (for example, gender).
- Nominal — a variable that has more than two levels, but the values of each level have no implied order. Examples are pie flavors such as cherry, apple, and peach.
- Ordinal — a variable that has more than two levels. The values of each level have an implied order. Examples are drink sizes such as small, medium, and large.
  
  Note: Ordinal variables can be treated as nominal variables when you are not interested in the ordering of the levels. However, nominal variables cannot be treated as ordinal variables because, by definition, there is no implied ordering.

To obtain a meaningful analysis, you must construct an appropriate data set and specify the correct measurement level for each variable in that data set.
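
The measurement levels above can be illustrated with a small Python sketch. This is not SAS Enterprise Miner code; the helper name `infer_measurement_level` and the sample columns are hypothetical. It guesses a level from the raw values only, so it cannot detect an implied order: a text variable with more than two levels is reported as nominal, and an integer-coded category would be reported as interval. That is one reason the measurement level usually needs to be reviewed by hand.

```python
def infer_measurement_level(values):
    """Guess a measurement level from raw values (hypothetical helper)."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    if len(distinct) <= 1:
        return "unary"      # same value for every observation
    if len(distinct) == 2:
        return "binary"     # exactly two possible values
    # Numeric columns with more than two distinct values are treated as interval.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "interval"
    # Order cannot be inferred from the data alone, so default to nominal
    # and let the analyst promote the variable to ordinal if appropriate.
    return "nominal"


# Made-up example columns.
columns = {
    "country": ["US", "US", "US", "US"],
    "gender": ["male", "female", "female", "male"],
    "pie_flavor": ["cherry", "apple", "peach", "apple"],
    "income": [41000.0, 58500.0, 73250.0, 39900.0],
}

levels = {name: infer_measurement_level(vals) for name, vals in columns.items()}
print(levels)
```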

Predictive and Descriptive Techniques

Predictive modeling techniques enable you to identify whether a set of input variables is useful in predicting some outcome, or target, variable. For example, a financial institution might try to determine whether knowledge of an applicant’s income and credit history (input variables) helps predict whether the client is likely to default on a loan (outcome variable).

To distinguish the input variables from the outcome variables, set the model role for each variable in the data set. Identify outcome variables with the variable role Target, and identify input variables with the variable role Input. Other variable roles include Cost, Frequency, ID, and Rejected. The Rejected variable role specifies that the identified variable is not included in the model building process. The ID variable role indicates that the identified variable contains a unique identifier for each observation.

Predictive modeling techniques require one or more outcome variables of interest. Each technique attempts to predict the outcome as accurately as possible, according to a specified criterion such as maximizing accuracy or minimizing loss. This document shows you how to use several predictive modeling techniques through SAS Enterprise Miner, including regression models, decision trees, and neural networks. Each of these techniques enables you to predict a binary, nominal, ordinal, or continuous variable from any combination of input variables.

Descriptive techniques enable you to identify underlying patterns in a data set. These techniques do not have a specific outcome variable of interest. This document explores how to use SAS Enterprise Miner to perform the following descriptive analyses:

Cluster analysis — This analysis attempts to find natural groupings of observations in the data, based on a set of input variables. After grouping the observations into clusters, you can use the input variables to attempt to characterize each group. When the clusters have been identified and interpreted, you can decide whether to treat each independently.
Association analysis — This analysis identifies groupings of products or services that tend to be purchased at the same time, or at different times by the same customer. This analysis answers questions such as the following:
- What proportion of the people who purchased eggs and milk also purchased bread?
- What proportion of the people who have a car loan with some financial institution later obtain a home mortgage from the same institution?
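
The first question above is a rule confidence: among the transactions that contain eggs and milk, the share that also contain bread. A minimal Python sketch of that calculation follows. The basket data and the function names are made up for illustration and do not reflect how the Association node is implemented.

```python
# Each basket is the set of items in one (made-up) transaction.
baskets = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk", "bread", "butter"},
    {"milk", "bread"},
    {"eggs", "bread"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, fraction that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent <= t)
    return both / len(with_antecedent)


rule_confidence = confidence({"eggs", "milk"}, {"bread"}, baskets)
rule_support = support({"eggs", "milk", "bread"}, baskets)
print(f"confidence={rule_confidence:.3f} support={rule_support:.3f}")
```

Here three baskets contain eggs and milk, and two of those also contain bread, so the confidence of the rule is 2/3.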

Overview of SEMMA

SEMMA is an acronym used to describe the SAS data mining process. It stands for Sample, Explore, Modify, Model, and Assess. SAS Enterprise Miner nodes are arranged on tabs with the same names.

Sample — These nodes identify, merge, partition, and sample input data sets, among other tasks.
Explore — These nodes explore data sets statistically and graphically. These nodes plot the data, obtain descriptive statistics, identify important variables, and perform association analysis, among other tasks.
Modify — These nodes prepare the data for analysis. Examples of the tasks that you can complete for these nodes are creating additional variables, transforming existing variables, identifying outliers, replacing missing values, performing cluster analysis, and analyzing data with self-organizing maps (SOMs) or Kohonen networks.
Model — These nodes fit a predictive model to a target variable. Available models include decision trees, neural networks, least angle regressions, support vector machines, linear regressions, and logistic regressions.
Assess — These nodes compare competing predictive models. They build charts that plot the percentage of respondents, the percentage of respondents captured, lift, and profit.

SAS Enterprise Miner also includes the Utility and Applications tabs for nodes that provide necessary tools but that are not easily categorized on a SEMMA tab. One such example is the SAS Code node. This node enables you to insert custom SAS code into your process flow diagrams. Another example is the Score Code Export node, which exports the files that are necessary for score code deployment in production systems.

Overview of the Nodes

Sample Nodes

The Append node enables you to append data sets that are exported by two or more paths in a single SAS Enterprise Miner process flow diagram. The Append node can append data according to the data role, such as joining training data to training data, transaction data to transaction data, score data to score data, and so on. The Append node can append data that was previously partitioned in train, test, and validate roles into one large training data set.

The Data Partition node enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model weights during estimation and is also used for model assessment. The test data set is an additional hold-out data set that you can use for model assessment. This node uses simple random sampling, stratified random sampling, or user-defined partitions to create partitioned data sets.
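
As a rough illustration of stratified partitioning, the Python sketch below splits a data set into training, validation, and test partitions while keeping the target proportions the same in each. The 40/30/30 split, the seed, and the function name `stratified_partition` are assumptions chosen for the example, not documented defaults of the Data Partition node.

```python
import random
from collections import defaultdict


def stratified_partition(rows, target_key, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split rows into train/validate/test, stratifying on target_key."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target_key]].append(row)

    parts = {"train": [], "validate": [], "test": []}
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * fractions[0])
        n_valid = round(len(shuffled) * fractions[1])
        parts["train"].extend(shuffled[:n_train])
        parts["validate"].extend(shuffled[n_train:n_train + n_valid])
        # Whatever remains in this stratum goes to the test partition.
        parts["test"].extend(shuffled[n_train + n_valid:])
    return parts


# Made-up data: 100 loan applicants, 20 percent of whom defaulted.
rows = [{"id": i, "default": "yes" if i % 5 == 0 else "no"} for i in range(100)]
parts = stratified_partition(rows, "default")
for name, part in parts.items():
    defaults = sum(1 for r in part if r["default"] == "yes")
    print(name, len(part), defaults)
```

Because each target level is split separately, the 20 percent default rate is preserved in all three partitions.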

The File Import node enables you to import data that is stored in external formats into a data source that SAS Enterprise Miner can interpret. The File Import node currently can process CSV flat files, JMP tables, Microsoft Excel and Lotus spreadsheet files, Microsoft Access database tables, and DBF, DLM, and DTA files.

The Filter node enables you to apply a filter to the training data set in order to exclude outliers or other observations that you do not want to include in your data mining analysis. Outliers can greatly affect modeling results and, subsequently, the accuracy and reliability of trained models.

The Input Data node enables you to access SAS data sets and other types of data. The Input Data node represents the introduction of predefined metadata into a Diagram Workspace for processing. You can view metadata information about your source data in the Input Data node, such as initial values for measurement levels and model roles of each variable.

The Merge node enables you to merge observations from two or more data sets into a single observation in a new data set. The Merge node supports both one-to-one and match merging. In addition, you have the option to rename certain variables (for example, predicted values and posterior probabilities) depending on the settings of the node.

The Sample node enables you to take random, stratified random, and cluster samples of data sets. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the sample is sufficiently representative, then relationships found in the sample can be expected to generalize to the complete data set.

The Time Series node enables you to understand trends and seasonal variations in the transaction data that you collect from your customers and suppliers over time by converting transactional data into time series data. Transactional data is timestamped data that is collected over time at no particular frequency. By condensing the information into a time series, you can discover trends and seasonal variations in customer and supplier habits that might not be visible in transactional data.
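
The conversion from transactional data to a time series can be sketched in Python as an accumulation of irregular, timestamped records into equally spaced intervals. The monthly interval, the use of a sum as the accumulation method, the zero fill for empty months, and the sample records are all assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import date

# Made-up transactions: (timestamp, amount), recorded at no particular frequency.
transactions = [
    (date(2023, 1, 3), 120.0),
    (date(2023, 1, 17), 80.0),
    (date(2023, 1, 29), 40.0),
    (date(2023, 3, 8), 200.0),
    (date(2023, 3, 9), 50.0),
]


def to_monthly_series(records, start, end):
    """Accumulate (date, amount) records into monthly totals from start to end.

    start and end are (year, month) tuples; months with no records get 0.0.
    """
    totals = defaultdict(float)
    for when, amount in records:
        totals[(when.year, when.month)] += amount

    series = []
    year, month = start
    while (year, month) <= end:
        series.append(((year, month), totals.get((year, month), 0.0)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return series


monthly = to_monthly_series(transactions, (2023, 1), (2023, 3))
print(monthly)
```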

Explore Nodes

The Association node enables you to identify association relationships within the data. For example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of milk? The node also enables you to perform sequence discovery if a sequence variable is present in the data set.

The Cluster node enables you to segment your data by grouping observations that are statistically similar. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to other tools for use as an input, ID, or target variable. It can also be used as a group variable that enables automatic construction of separate models for each group.

The DMDB node creates a data mining database that provides summary statistics and factor-level information for class and interval variables in the imported data set. The DMDB is a metadata catalog that is used to store valuable counts and statistics for model building.

The Graph Explore node is an advanced visualization tool that enables you to explore large volumes of data graphically to uncover patterns and trends and to reveal extreme values in the database. For example, you can analyze univariate distributions, investigate multivariate distributions, and create scatter and box plots and constellation and 3-D charts. Graph Explore plots are fully interactive and are dynamically linked to highlight data selections in multiple views.

The Link Analysis node transforms unstructured transactional or relational data into a model that can be graphed. Such models can be applied to fraud detection, criminal network conspiracies, telephone traffic patterns, website structure and usage, database visualization, and social network analysis. Also, the node can be used to recommend new products to existing customers.

The Market Basket node performs association rule mining over transaction data in conjunction with item taxonomy. This node is useful in retail marketing scenarios that involve tens of thousands of distinct items, where the items are grouped into subcategories, categories, departments, and so on. This is called item taxonomy. The Market Basket node uses the taxonomy data and generates rules at multiple levels in the taxonomy.

The MultiPlot node is a visualization tool that enables you to explore large volumes of data graphically. The MultiPlot node automatically creates bar charts and scatter plots for the input and target variables without requiring several menu or window item selections. The code that is created by this node can be used to create graphs in a batch environment.

The Path Analysis node enables you to analyze web log data to determine the paths that visitors take as they navigate through a website. You can also use the node to perform sequence analysis.

The SOM/Kohonen node enables you to perform unsupervised learning by using Kohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering method, but SOMs are primarily dimension-reduction methods.

The StatExplore node is a multipurpose node that you use to examine variable distributions and statistics in your data sets. Use the StatExplore node to compute standard univariate statistics, to compute standard bivariate statistics by class target and class segment, and to compute correlation statistics for interval variables by interval input and target. You can also use the StatExplore node to reject variables based on target correlation.

The Variable Clustering node is a useful tool for selecting variables or cluster components for analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps reveal the underlying structure of the input variables in a data set. Large numbers of variables can complicate the task of determining the relationships that might exist between the independent variables and the target variable in a model. Models that are built with too many redundant variables can destabilize parameter estimates, confound variable interpretation, and increase the computing time that is required to run the model. Variable clustering can reduce the number of variables that are required to build reliable predictive or segmentation models.

The Variable Selection node enables you to evaluate the importance of input variables in predicting or classifying the target variable. The node uses either an R-square criterion or a Chi-square (tree-based) selection criterion. The R-square criterion removes variables that have large percentages of missing values, and removes class variables based on the number of unique values. The variables that are not related to the target are set to a status of rejected. Although rejected variables are passed to subsequent tools in the process flow diagram, these variables are not used as model inputs by modeling nodes such as the Neural Network and Decision Tree tools. If a variable of interest is rejected, you can force that variable into the model by reassigning the variable role in any modeling node.

Modify Nodes

The Drop node enables you to drop selected variables from your scored SAS Enterprise Miner data sets. You can drop variables that have the roles of Assess, Classification, Frequency, Hidden, Input, Rejected, Residual, and Target from your scored data sets. Use the Drop node to trim the size of data sets and metadata during tree analysis.

The Impute node enables you to replace missing values. For interval variables, you can replace missing values with the mean, median, midrange, or mid-minimum spacing, with a distribution-based replacement, with a robust M-estimator such as Tukey's biweight, Huber's, or Andrews' wave, or with a value from a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant.
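
Three of the simpler replacement statistics (mean, median, and midrange) and the most-frequent-value rule for class variables can be sketched in Python as follows. The function names and sample values are hypothetical, and missing values are represented by `None` for the purposes of this example only.

```python
import statistics


def impute_interval(values, method="mean"):
    """Replace None in an interval variable with a summary of the observed values."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    elif method == "midrange":
        fill = (min(observed) + max(observed)) / 2
    else:
        raise ValueError(f"unsupported method: {method}")
    return [fill if v is None else v for v in values]


def impute_class(values):
    """Replace None in a class variable with the most frequently occurring value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


income = [30.0, None, 50.0, 100.0, None]      # made-up interval variable
region = ["east", "west", None, "east"]       # made-up class variable

print(impute_interval(income, "mean"))
print(impute_interval(income, "median"))
print(impute_interval(income, "midrange"))
print(impute_class(region))
```

The three statistics give different fill values for the same skewed data (60, 50, and 65 here), which is why the choice of method matters when a variable has outliers.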

The Interactive Binning node is used to model nonlinear functions of multiple modes of continuous distributions. The interactive tool computes initial bins by quintiles, and then you can split and combine the initial quintile-based bins into custom final bins.
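
The idea of starting from quintile-based bins and then combining them can be sketched in Python. This is only an illustration of the concept: the helper names are hypothetical, and `statistics.quantiles` uses its own interpolation rule, which may differ from the one that SAS Enterprise Miner applies.

```python
import statistics
from bisect import bisect_right


def quintile_cut_points(values):
    """Return the four cut points that separate the data into five bins."""
    return statistics.quantiles(values, n=5)


def assign_bin(value, cut_points):
    """Index of the bin that value falls into (0 is the lowest bin)."""
    return bisect_right(cut_points, value)


def merge_bins(cut_points, drop_index):
    """Combine two adjacent bins by removing the cut point between them."""
    return cut_points[:drop_index] + cut_points[drop_index + 1:]


def bin_counts(values, cut_points):
    """Number of observations that fall into each bin."""
    counts = [0] * (len(cut_points) + 1)
    for v in values:
        counts[assign_bin(v, cut_points)] += 1
    return counts


values = list(range(1, 101))                 # made-up interval variable
initial_cuts = quintile_cut_points(values)
initial_counts = bin_counts(values, initial_cuts)

# Manually combine the two lowest bins into one custom bin.
custom_cuts = merge_bins(initial_cuts, 0)
custom_counts = bin_counts(values, custom_cuts)
print(initial_counts, custom_counts)
```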

The Principal Components node enables you to perform a principal components analysis for data interpretation and dimension reduction. The node generates principal components that are uncorrelated linear combinations of the original input variables and that depend on the covariance matrix or correlation matrix of the input variables. In data mining, principal components are usually used as the new set of input variables for subsequent analysis by modeling nodes.

The Replacement node enables you to replace selected values for class variables. The Replacement node summarizes all values of class variables and provides you with an editable variables list. You can also select a replacement value for future unknown values.

The Rules Builder node enables you to create ad hoc sets of rules for your data that result in user-definable outcomes. For example, you might use the Rules Builder node to define outcomes named Deny and Review based on rules such as the following:

IF P_Default_Yes > 0.4 then do;
    EM_OUTCOME="Deny";
    IF AGE > 60 then
      EM_OUTCOME="Review";
END;

The Transform Variables node enables you to create new variables that are transformations of existing variables in your data. Transformations can be used to stabilize variances, remove nonlinearity, improve additivity, and correct nonnormality in variables. You can also use the Transform Variables node to transform class variables and to create interaction variables. Examples of variable transformations include taking the square root of a variable, maximizing the correlation with the target variable, or normalizing a variable.

Model Nodes

The AutoNeural node can be used to automatically configure a neural network. The AutoNeural node implements a search algorithm to incrementally select activation functions for a variety of multilayer networks.

The Decision Tree node enables you to fit decision tree models to your data. The implementation includes features that are found in a variety of popular decision tree algorithms (for example, CHAID, CART, and C4.5). The node supports both automatic and interactive training. When you run the Decision Tree node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modeling. You can override any automatic step with the option to define a splitting rule and prune explicit nodes or subtrees. Interactive training enables you to explore and evaluate data splits as you develop them.

The Dmine Regression node enables you to compute a forward stepwise, least squares regression model. In each step, the independent variable that contributes maximally to the model R-square value is selected. The tool can also automatically bin continuous terms.

The DMNeural node is another modeling node that you can use to fit an additive nonlinear model. The additive nonlinear model uses bucketed principal components as inputs to predict a binary or an interval target variable with automatic selection of an activation function.

The Ensemble node enables you to create new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models.
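
One simple way to combine posterior probabilities is to average them across models, as in the Python sketch below. Averaging is used here only as an example of a combination rule; the function name and the three sets of model outputs are made up.

```python
def average_posteriors(model_outputs):
    """Average class posterior probabilities across models, per observation.

    model_outputs is a list with one entry per model; each entry is a list of
    {class_level: probability} dictionaries, one per observation.
    """
    n_models = len(model_outputs)
    n_obs = len(model_outputs[0])
    combined = []
    for i in range(n_obs):
        classes = model_outputs[0][i].keys()
        combined.append(
            {c: sum(m[i][c] for m in model_outputs) / n_models for c in classes}
        )
    return combined


# Made-up posteriors from three predecessor models for two observations.
tree = [{"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}]
regression = [{"yes": 0.6, "no": 0.4}, {"yes": 0.1, "no": 0.9}]
neural = [{"yes": 0.7, "no": 0.3}, {"yes": 0.5, "no": 0.5}]

ensemble = average_posteriors([tree, regression, neural])
print(ensemble)
```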

The Gradient Boosting node uses tree boosting to create a series of decision trees that together form a single predictive model. Each tree in the series is fit to the residual of the prediction from the earlier trees in the series. The residual is defined in terms of the derivative of a loss function. For squared error loss with an interval target, the residual is simply the target value minus the predicted value. Boosting is defined for binary, nominal, and interval targets.
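
For squared error loss, the procedure described above can be sketched in a few lines of Python. Each tree is reduced to a one-split stump on a single input, and every stump is fit to the current residuals (target minus prediction). The stump learner, the learning rate of 0.5, the number of trees, and the data are all assumptions chosen to keep the example small; they are not settings of the Gradient Boosting node.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that minimizes squared error of the residuals."""
    best = None
    order = sorted(set(xs))
    for left, right in zip(order, order[1:]):
        threshold = (left + right) / 2
        lo = [r for x, r in zip(xs, residuals) if x <= threshold]
        hi = [r for x, r in zip(xs, residuals) if x > threshold]
        lo_mean, hi_mean = sum(lo) / len(lo), sum(hi) / len(hi)
        sse = sum((r - lo_mean) ** 2 for r in lo) + sum((r - hi_mean) ** 2 for r in hi)
        if best is None or sse < best[0]:
            best = (sse, threshold, lo_mean, hi_mean)
    _, threshold, lo_mean, hi_mean = best
    return lambda x: lo_mean if x <= threshold else hi_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit a series of stumps, each one to the residuals of the earlier ones."""
    base = sum(ys) / len(ys)            # initial prediction: the target mean
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # Squared error loss: residual = target value minus predicted value.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(model, xs, ys):
    """Mean squared error of model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]                       # made-up input
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]       # made-up interval target

mse_0 = mse(boost(xs, ys, n_trees=0), xs, ys)       # mean-only baseline
mse_1 = mse(boost(xs, ys, n_trees=1), xs, ys)
mse_20 = mse(boost(xs, ys, n_trees=20), xs, ys)
print(mse_0, mse_1, mse_20)
```

The training error falls as trees are added because each new stump explains part of what the earlier trees left unexplained.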

The LARS node enables you to use Least Angle Regression algorithms to perform variable selection and model fitting tasks. The LARS node can produce models that range from simple intercept models to complex multivariate models that have many inputs. When the LARS node is used to perform model fitting, it uses criteria from either least angle regression or the LASSO regression to choose the optimal model.

The MBR (Memory-Based Reasoning) node enables you to identify similar cases and to apply information that is obtained from these cases to a new record. The MBR node uses k-nearest neighbor algorithms to categorize or predict observations.
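
A minimal Python sketch of k-nearest neighbor classification is shown below: a new record receives the most common label among the k most similar stored cases. Euclidean distance, k=3, the function name, and the training cases are assumptions made for the illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training cases."""
    neighbors = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up stored cases: ((input_1, input_2), class label).
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((6.0, 5.0), "high"),
    ((7.0, 7.0), "high"),
    ((6.5, 6.0), "high"),
]

print(knn_predict(train, (1.8, 1.5)))
print(knn_predict(train, (6.2, 6.1)))
```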

The Model Import node enables you to import models into the SAS Enterprise Miner environment that were not created by SAS Enterprise Miner. For example, models that were created by using SAS PROC LOGISTIC can now be run, assessed, and modified in SAS Enterprise Miner.

The Neural Network node enables you to construct, train, and validate multilayer feedforward neural networks. Users can select from several predefined architectures or manually select input, hidden, and target layer functions and options.

The Partial Least Squares node is a tool for modeling continuous and binary targets based on SAS/STAT PROC PLS. The Partial Least Squares node produces DATA step score code and standard predictive model assessment results.

The Regression node enables you to fit both linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables. You can use both continuous and discrete variables as inputs. The node supports the stepwise, forward, and backward selection methods. A point-and-click interaction builder enables you to create higher-order modeling terms.

The Rule Induction node enables you to improve the classification of rare events in your modeling data. The Rule Induction node creates a Rule Induction model that uses split techniques to remove the largest pure split node from the data. Rule Induction also creates binary models for each level of a target variable and ranks the levels from the most rare event to the most common. After all levels of the target variable are modeled, the score code is combined into a SAS DATA step.

The SVM node uses supervised machine learning to solve binary classification problems. It supports polynomial, radial basis function, and sigmoid nonlinear kernels. The standard SVM problem solves binary classification problems by constructing a set of hyperplanes that maximize the margin between two classes. The SVM node does not support multi-class problems or support vector regression.

The TwoStage node enables you to compute a two-stage model for predicting a class target variable and an interval target variable at the same time. The interval target variable is usually a value that is associated with a level of the class target.

Assess Nodes

The Cutoff node provides tabular and graphical information to help you determine the best cutoff point or points for decision making models that have binary target variables.

The Decisions node enables you to define target profiles to produce optimal decisions. You can define fixed and variable costs, prior probabilities, and profit or loss matrices. These values are used in model selection steps.

The Model Comparison node provides a common framework for comparing models and predictions from any of the modeling tools (such as Regression, Decision Tree, and Neural Network tools). The comparison is based on standard model fit statistics as well as potential expected and actual profits or losses that would result from implementing the model. The node produces the following charts that help describe the usefulness of the model: lift, profit, return on investment, receiver operating characteristic (ROC) curves, diagnostic charts, and threshold-based charts.
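
Lift, one of the charted statistics, compares the response rate among the highest-scored cases with the response rate in the whole data set. The Python sketch below computes cumulative lift at a given depth. The function name, scores, and outcomes are made up, and the rounding rule for the number of top cases is an assumption of this example.

```python
def cumulative_lift(scores, actuals, depth):
    """Lift in the top `depth` fraction of cases ranked by model score.

    actuals holds 1 for a responder and 0 for a non-responder.
    """
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * depth))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    base_rate = sum(actuals) / len(actuals)
    return top_rate / base_rate


# Made-up model scores and observed outcomes for ten cases.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_20 = cumulative_lift(scores, actuals, 0.2)
lift_50 = cumulative_lift(scores, actuals, 0.5)
lift_100 = cumulative_lift(scores, actuals, 1.0)
print(lift_20, lift_50, lift_100)
```

A lift of 1 at full depth is expected for any model, because the top 100 percent of cases is the whole data set.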

The Score node enables you to manage, edit, export, and execute scoring code that is generated from a trained model. Scoring is the generation of predicted values for a data set that might not contain a target variable. The Score node generates and manages scoring formulas in the form of a single SAS DATA step, which can be used in most SAS environments even without the presence of SAS Enterprise Miner.

The Segment Profile node enables you to assess and explore segmented data sets. Segmented data is created from data BY values, clustering, or applied business rules. The Segment Profile node facilitates data exploration to identify factors that differentiate individual segments from the population, and to compare the distribution of key factors between individual segments and the population. The Segment Profile node outputs a Profile plot of variable distributions across segments and the population, a Segment Size pie chart, a Variable Worth plot that ranks factor importance within each segment, and summary statistics for the segmentation results. The Segment Profile node does not generate score code or modify metadata.

Utility Nodes

The Control Point node establishes a nonfunctional connection point to clarify and simplify process flow diagrams. For example, suppose three Input Data nodes are to be connected to three modeling nodes. If no Control Point node is used, then nine connections are required to connect all of the Input Data nodes to all of the modeling nodes. However, if a Control Point node is used, only six connections are required.
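
The connection counts in this example follow from simple arithmetic: connecting every source directly to every target needs one connection per pair, whereas routing through a single Control Point node needs one connection per node. A short Python check, with hypothetical function names:

```python
def direct_connections(n_sources, n_targets):
    """Every source node is connected to every target node."""
    return n_sources * n_targets


def control_point_connections(n_sources, n_targets):
    """Every source connects to the control point, which connects to every target."""
    return n_sources + n_targets


print(direct_connections(3, 3))          # three Input Data nodes, three modeling nodes
print(control_point_connections(3, 3))
```

The saving grows with the size of the diagram: five sources and four targets would need 20 direct connections but only 9 through a control point.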

The End Groups node terminates a group processing segment in the process flow diagram. If the group processing function is stratified, bagging, or boosting, the End Groups node functions as a model node and presents the final aggregated model. (Ensemble nodes are not required as in SAS Enterprise Miner 4.3.) Nodes that follow the End Groups node continue data mining processes normally.

The ExtDemo node illustrates the various UI elements that can be used by SAS Enterprise Miner extension nodes.

The Metadata node enables you to modify the columns metadata information at some point in your process flow diagram. You can modify attributes such as roles, measurement levels, and order.

The Reporter node uses SAS Output Delivery System (ODS) capabilities to create a single document for the given analysis in PDF or RTF format. The document includes important SAS Enterprise Miner results, such as variable selection, model diagnostic tables, and model results plots. The document can be viewed and saved directly and will be included in SAS Enterprise Miner report package files.

The SAS Code node enables you to incorporate SAS code into process flows that you develop using SAS Enterprise Miner. The SAS Code node extends the functionality of SAS Enterprise Miner by making other SAS System procedures available in your data mining analysis. You can also write a SAS DATA step to create customized scoring code, to conditionally process data, and to concatenate or to merge existing data sets.

The Score Code Export node enables you to extract score code and score metadata to an external folder. The Score Code Export node must be preceded by a Score node.

The Start Groups node initiates a group processing segment in the process flow diagram. The Start Groups node performs the following types of group processing:

Stratified group processing that repeats processes for values of a class variable, such as GENDER=M and GENDER=F.
Bagging, or bootstrap aggregation via repeated resampling.
Boosting, or boosted bootstrap aggregation, using repeated resampling with residual-based weights.
Index processing, which repeats processes a fixed number of times. Index processing is normally used with a Sample node or with user code for sample selection.
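
Bagging, the second mode listed above, can be sketched in Python as follows: draw several bootstrap samples (the same size as the original data, sampled with replacement), fit a model to each, and aggregate the results. Here the "model" is simply a mean so that the example stays short. The function names, the seed, the number of samples, and the data are assumptions of this illustration.

```python
import random
import statistics


def bootstrap_samples(rows, n_samples, seed=2024):
    """Draw n_samples resamples of rows, each the same size, with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in rows] for _ in range(n_samples)]


def bagged_estimate(rows, fit, n_samples=25, seed=2024):
    """Fit one model per bootstrap sample and average the results."""
    estimates = [fit(sample) for sample in bootstrap_samples(rows, n_samples, seed)]
    return statistics.mean(estimates)


spend = [12.0, 15.0, 9.0, 30.0, 22.0, 18.0, 11.0, 25.0]   # made-up data

samples = bootstrap_samples(spend, n_samples=25)
bagged_mean = bagged_estimate(spend, statistics.mean)
print(bagged_mean)
```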

Applications Nodes

The Incremental Response node directly models the incremental impact of a treatment (such as a marketing action or incentive) and optimizes customer targeting in order to obtain the maximal response or return on investment. You can use incremental response modeling to determine the likelihood that a customer purchases a product or uses a coupon, or to predict the incremental revenue that is realized during a promotional period.

The Ratemaking node builds a generalized linear model, which is an extension of a traditional linear model. Using class and binned interval input variables, these models enable the population mean to depend on a linear predictor through a nonlinear link function.

The Survival node performs survival analysis on customer databases when there are time-dependent outcomes. Some examples of time-dependent outcomes are customer churn, cancellation of all products and services, unprofitable behavior, and server downgrade or extreme inactivity.

Some General Usage Rules for Nodes

These are some general rules that govern how you place nodes in a process flow diagram.

The Input Data node cannot be preceded by a node that exports a data set.
The Sample node must be preceded by a node that exports a data set.
The Assessment node must be preceded by one or more model nodes.
The Score node and the Score Code Export node must be preceded by a node that produces score code. Any node that modifies the data or builds models generates score code.
The SAS Code node can be used in any stage of a process flow diagram. It does not require an input data set.