Introduction to SAS Enterprise Miner 5.3 Software
The nodes of Enterprise Miner are organized according to the Sample,
Explore, Modify, Model, and Assess (SEMMA) data mining methodology. In addition,
there are also Credit Scoring and Utility node tools. You use the Credit Scoring
node tools to score your data models and to create freestanding code. You
use the Utility node tools to submit SAS programming statements, and to define
control points in the process flow diagram.
Note: The Credit Scoring tab does not appear
in all installed versions of Enterprise Miner.
Remember that in a data mining project, it can be an advantage to repeat
parts of the data mining process. For example, you might want to explore and
plot the data at several intervals throughout your project. It might be advantageous
to fit models, assess them, and then refit and reassess them.
The following tables list the nodes and give each node's primary purpose.
Sample Nodes
Node Name |
Description |
Append |
Use the Append node to append data sets that are exported by two different
paths in a single process flow diagram. The Append node can also append
train, validation, and test data sets into a new training data set. |
Data Partition |
Use the Data Partition node to partition data sets into training, test,
and validation data sets. The training data set is used for preliminary model
fitting. The validation data set is used to monitor and tune the model weights
during estimation and is also used for model assessment. The test data set
is an additional hold-out data set that you can use for model assessment.
This node uses simple random sampling, stratified random sampling, or clustered
sampling to create partitioned data sets. See Chapter 3. |
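The idea behind the node's stratified method can be illustrated outside of Enterprise Miner. In the following Python sketch, the function name, fractions, and data are hypothetical and only loosely mirror the node's behavior:

```python
import random

def stratified_partition(rows, target, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split rows into train/validation/test, preserving the target mix.

    `rows` is a list of dicts; `target` names the class variable whose
    levels are kept in roughly the same proportion in every partition.
    """
    rng = random.Random(seed)
    by_level = {}
    for row in rows:
        by_level.setdefault(row[target], []).append(row)
    train, valid, test = [], [], []
    for level_rows in by_level.values():
        rng.shuffle(level_rows)
        n = len(level_rows)
        n_train = int(n * fractions[0])
        n_valid = int(n * fractions[1])
        train.extend(level_rows[:n_train])
        valid.extend(level_rows[n_train:n_train + n_valid])
        test.extend(level_rows[n_train + n_valid:])
    return train, valid, test

rows = [{"id": i, "bad": i % 5 == 0} for i in range(100)]  # 20% event rate
train, valid, test = stratified_partition(rows, "bad")
```

Stratifying on the target keeps the event rate the same in every partition, which matters most when the event is rare.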
Filter |
Use the Filter node to create and apply filters to your training data
set and optionally, to the validation and test data sets. You can use filters
to exclude certain observations, such as extreme outliers and errant data
that you do not want to include in your mining analysis. Filtering extreme
values from the training data tends to produce better models because the parameter
estimates are more stable. By default, the Filter node ignores target and
rejected variables. |
Input Data Source |
Use the Input Data Source node to access SAS data sets and other types
of data. This node introduces a predefined Enterprise Miner Data Source and
metadata into a Diagram Workspace for processing. You can view metadata information
about your data in the Input Data Source node, such as initial values for
measurement levels and model roles of each variable. Summary statistics are
displayed for interval and class variables. See Chapter 3. |
Merge |
Use the Merge node to merge observations from two or more data sets
into a single observation in a new data set. |
Sample |
Use the Sample node to take random samples, stratified random samples,
and cluster samples of data sets. Sampling is recommended for extremely large
databases because it can significantly decrease model training time. If the
random sample sufficiently represents the source data set, then data relationships
that Enterprise Miner finds in the sample can be extrapolated to the complete
source data set. The Sample node writes the sampled observations to an output
data set and saves the seed values that are used to generate the random numbers
for the samples so that you can replicate the samples. |
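The role of the saved seed can be shown with a small, illustrative Python sketch (not Enterprise Miner code): reusing the stored seed reproduces the sample exactly.

```python
import random

def take_sample(n_total, n_sample, seed):
    """Draw a simple random sample of row indices; saving the seed
    makes the draw exactly replicable later."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_sample))

first = take_sample(1_000_000, 1000, seed=12345)
again = take_sample(1_000_000, 1000, seed=12345)
assert first == again  # same seed, same sample
```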
Time Series |
Use the Time Series node to convert transactional data to time series
data so that you can perform seasonal and trend analysis on the transaction
data that you collect from your customers and suppliers over time. Transactional
data is time-stamped data that is collected over time at no particular frequency.
By contrast, time series data is time-stamped data that is collected over
time at a specific frequency. The size of transaction
data can be very large, which makes traditional data mining tasks difficult.
By condensing the information into a time series, you can discover trends
and seasonal variations in customer and supplier habits that might not be
visible in transactional data. |
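The conversion can be pictured as follows. This illustrative Python sketch (hypothetical data, not Enterprise Miner code) accumulates irregular time-stamped transactions into a regular monthly series, inserting zero totals for months with no activity so that the result has a fixed frequency:

```python
from collections import defaultdict
from datetime import date

# Time-stamped transactions collected at no particular frequency ...
transactions = [
    (date(2008, 1, 3), 120.0),
    (date(2008, 1, 17), 80.0),
    (date(2008, 2, 9), 200.0),
    (date(2008, 2, 9), 50.0),
    (date(2008, 4, 1), 75.0),
]

# ... accumulated into monthly buckets.
totals = defaultdict(float)
for when, amount in transactions:
    totals[(when.year, when.month)] += amount

# Walk every month in the observed span so the series has a fixed
# frequency; months with no activity appear as 0.0.
start, end = min(totals), max(totals)
series = []
year, month = start
while (year, month) <= end:
    series.append(((year, month), totals.get((year, month), 0.0)))
    month += 1
    if month == 13:
        year, month = year + 1, 1
# series now covers 2008-01 through 2008-04, with 0.0 for the empty March
```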
Explore Nodes
Node Name |
Description |
Association |
Use the Association node to identify association relationships within
the data. For example, if a customer buys a loaf of bread, how likely is the
customer to also buy a gallon of milk? You use the Association node to perform
sequence discovery if a time-stamped variable (a sequence variable) is present
in the data set. Binary sequences are constructed automatically, but you can
use the Event Chain Handler to construct longer sequences that are based on
the patterns that the algorithm discovered. |
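The bread-and-milk question reduces to two counts over the transactions. As an illustration only (hypothetical basket data, not Enterprise Miner code), the support and confidence behind an association rule can be computed like this:

```python
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

# "If a customer buys bread, how likely is milk in the same basket?"
conf = confidence({"bread"}, {"milk"})
```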
Cluster |
Use the Cluster node to segment your data so that you can identify data
observations that are similar in some way. When displayed in a plot, observations
that are similar tend to be in the same cluster, and observations that are
different tend to be in different clusters. The cluster identifier for each
observation can be passed to other nodes for use as an input, ID, or target
variable. This identifier can also be passed as a group variable that enables
you to automatically construct separate models for each group. |
DMDB |
The DMDB node creates a data mining database that provides summary
statistics and factor-level information for class and interval variables in
the imported data set.
In Enterprise Miner 4.3, the DMDB database optimized the performance
of the Variable Selection, Tree, Neural Network, and Regression nodes. It
did so by reducing the number of passes through the data that the analytical
engine needed to make when running a process flow diagram. Improvements to
the Enterprise Miner 5.3 software have eliminated the need to use the DMDB
node to optimize the performance of nodes, but the DMDB database can still
provide quick summary statistics for class and interval variables at a given
point in a process flow diagram. |
Graph Explore |
The Graph Explore node is an advanced visualization tool that enables
you to explore large volumes of data graphically to uncover patterns
and trends and to reveal extreme values in the database. You can analyze
univariate distributions, investigate multivariate distributions, create scatter
and box plots, constellation and 3D charts, and so on. If the Graph Explore
node follows a node that exports a data set in the process flow, it can use
either a sample or the entire data set as input. The resulting plot
is fully interactive: you can rotate a chart to different angles and move
it anywhere on the screen to obtain different perspectives on the data. You
can also probe the data by positioning the cursor over a particular bar within
the chart. A text window displays the values that correspond to that bar.
You may also want to use the node downstream in the process flow to perform
tasks, such as creating a chart of the predicted values from a model developed
with one of the modeling nodes. |
Market Basket |
The Market Basket node performs association rule mining over transaction
data in conjunction with item taxonomy. Transaction data contain sales transaction
records with details about items bought by customers. Market basket analysis
uses the information from the transaction data to give you insight about which
products tend to be purchased together. This information can be used to change
store layouts, to determine which products to put on sale, or to determine
when to issue coupons or some other profitable course of action.
Market basket analysis is not limited to the retail marketing domain.
The analysis framework can be abstracted to other areas such as word co-occurrence
relationships in text documents.
The Market Basket node is not included with SAS Enterprise Miner for
the Desktop. |
MultiPlot |
Use the MultiPlot node to explore larger volumes of data graphically.
The MultiPlot node automatically creates bar charts and scatter plots for
the input and target variables without requiring you to make several menu
or window item selections. The code that is created by this node can be used
to create graphs in a batch environment. See Chapter 3. |
Path Analysis |
Use the Path Analysis node to analyze Web log data and to determine
the paths that visitors take as they navigate through a Web site. You can
also use the node to perform sequence analysis. |
SOM/Kohonen |
Use the SOM/Kohonen node to perform unsupervised learning by using Kohonen
vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs
with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering
method, whereas SOMs are primarily dimension-reduction methods. |
StatExplore |
Use the StatExplore node to examine variable distributions and statistics
in your data sets. You can use the StatExplore node to compute standard univariate
distribution statistics, to compute standard bivariate statistics by class
target and class segment, and to compute correlation statistics for interval
variables by interval input and target. You can also combine the StatExplore
node with other Enterprise Miner tools to perform data mining tasks such as
using the StatExplore node with the Metadata node to reject variables, using
the StatExplore node with the Transform Variables node to suggest transformations,
or even using the StatExplore node with the Regression node to create interaction
terms. See Chapter 3. |
Variable Clustering |
Variable clustering is a useful tool for data reduction, such as choosing
the best variables or cluster components for analysis. Variable clustering
removes collinearity, decreases variable redundancy, and helps to reveal the
underlying structure of the input variables in a data set. When properly
used as a variable-reduction tool, the Variable Clustering node can replace
a large set of variables with the set of cluster components with little loss
of information. |
Variable Selection |
Use the Variable Selection node to evaluate the importance of input
variables in predicting or classifying the target variable. To preselect the
important inputs, the Variable Selection node uses either an R-Square or a
Chi-Square selection (tree-based) criterion. You can use the R-Square criterion
to remove variables in hierarchies, remove variables that have large percentages
of missing values, and remove class variables that are based on the number
of unique values. The variables that are not related to the target are set
to a status of rejected. Although rejected variables are passed to subsequent
nodes in the process flow diagram, these variables are not used as model inputs
by a more detailed modeling node, such as the Neural Network and Decision
Tree nodes. You can reassign the status of the input model variables to rejected
in the Variable Selection node. See Chapter 5. |
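The R-Square criterion amounts to measuring how strongly each input tracks the target and rejecting the weak ones. The deliberately simplified Python sketch below uses hypothetical data and a hypothetical cutoff; the node's actual procedure is more elaborate:

```python
def r_square(x, y):
    """Squared correlation between a candidate input x and the target y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
inputs = {
    "useful": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly linear in the target
    "noise":  [5.0, 1.0, 4.0, 1.5, 3.0],   # unrelated
}

# Inputs below the cutoff get a status of "rejected"; the rest stay inputs.
CUTOFF = 0.5
status = {name: ("input" if r_square(x, target) >= CUTOFF else "rejected")
          for name, x in inputs.items()}
```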
Modify Nodes
Node Name |
Description |
Drop |
Use the Drop node to drop certain variables from your scored Enterprise
Miner data sets. You can drop variables that have roles of Assess, Classification,
Frequency, Hidden, Input, Predict, Rejected, Residual, Target, and Other from
your scored data sets. |
Impute |
Use the Impute node to impute (fill in) values for observations that
have missing values. You can replace missing values for interval variables
with the mean, median, midrange, or mid-minimum spacing, or with a distribution-based
replacement. Alternatively, you can use a replacement M-estimator such as
Tukey's biweight, Huber's, or Andrew's Wave. You can also estimate the replacement
values for
each interval input by using a tree-based imputation method. Missing values
for class variables can be replaced with the most frequently occurring value,
distribution-based replacement, tree-based imputation, or a constant. See
Chapter 5. |
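Two of the simplest replacement statistics can be sketched in Python (illustrative only; a hypothetical helper, not the node's implementation): the mean for an interval variable, and the most frequently occurring value for a class variable.

```python
from statistics import mean, mode

def impute(values, method):
    """Fill None entries: mean for interval variables, most frequent
    value (mode) for class variables."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

income = impute([40.0, None, 60.0, 80.0, None], "mean")   # interval variable
marital = impute(["M", "S", None, "M", "M"], "mode")      # class variable
```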
Interactive Binning |
The Interactive Binning node is an interactive grouping tool that you
use to model nonlinear functions of multiple modes of continuous distributions.
The interactive tool computes initial bins by quantiles; then you can interactively
split and combine the initial bins. You use the Interactive Binning node to
create bins or buckets or classes of all input variables. You can create bins
in order to reduce the number of unique levels as well as attempt to improve
the predictive power of each input. The Interactive Binning node enables
you to select strong characteristics based on the Gini statistic and to group
the selected characteristics based on business considerations. The node is
helpful in shaping the data to represent risk ranking trends rather than modeling
quirks, which might lead to overfitting. |
Principal Components |
Use the Principal Components node to perform a principal components
analysis for data interpretation and dimension reduction. The node generates
principal components that are uncorrelated linear combinations of the original
input variables and that depend on the covariance matrix or correlation matrix
of the input variables. In data mining, principal components are usually used
as the new set of input variables for subsequent analysis by modeling nodes. |
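The first principal component is the direction of maximum variance, that is, the leading eigenvector of the covariance matrix. The sketch below (illustrative Python with hypothetical data; not how the node computes components) finds it by power iteration:

```python
def covariance(data):
    """Covariance matrix of column-wise data (a list of columns)."""
    n = len(data[0])
    means = [sum(col) / n for col in data]
    return [[sum((data[i][k] - means[i]) * (data[j][k] - means[j])
                 for k in range(n)) / (n - 1)
             for j in range(len(data))] for i in range(len(data))]

def first_component(cov, iters=50):
    """Leading eigenvector of cov by power iteration: the direction
    of maximum variance, i.e. the first principal component."""
    v = [1.0] * len(cov)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(len(v)))
             for i in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

cols = [[1.0, 2.0, 3.0, 4.0],    # input 1
        [2.1, 3.9, 6.2, 7.8]]    # input 2, roughly 2 * input 1
pc1 = first_component(covariance(cols))
```

Because the second input is roughly twice the first, the leading component points close to the direction (1, 2), and projecting onto it captures nearly all the variance of both inputs in one uncorrelated variable.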
Replacement |
Use the Replacement node to impute (fill in) values for observations
that have missing values and to replace specified non-missing values for class
variables in data sets. You can replace missing values for interval variables
with the mean, median, midrange, or mid-minimum spacing, or with a distribution-based
replacement. Alternatively, you can use a replacement M-estimator such as
Tukey's biweight, Huber's, or Andrew's Wave. You can also estimate the replacement
values for each interval input by using a tree-based imputation method. Missing
values for class variables can be replaced with the most frequently occurring
value, distribution-based replacement, tree-based imputation, or a constant.
See Chapters 3, 4, and 5. |
Rules Builder |
The Rules Builder node opens the Rules Builder window, in which you can
create ad hoc sets of rules with user-definable outcomes. You can interactively
define the values of the outcome variable and the paths to the outcome. This
is useful in ad hoc rule creation such as applying logic for posterior probabilities
and scorecard values. Any Input Data Source data set can be used as an input
to the Rules Builder node. Rules are defined using charts and histograms based
on a sample of the data. |
Transform Variables |
Use the Transform Variables node to create new variables that are transformations
of existing variables in your data. Transformations are useful when you want
to improve the fit of a model to the data. For example, transformations can
be used to stabilize variances, remove nonlinearity, improve additivity, and
correct nonnormality in variables. In Enterprise Miner, the Transform Variables
node also enables you to transform class variables and to create interaction
variables. See Chapter 5. |
Model Nodes
Node Name |
Description |
AutoNeural |
Use the AutoNeural node to automatically configure a neural network.
It conducts limited searches for a better network configuration. See Chapters
5 and 6. |
Decision Tree |
Use the Decision Tree node to fit decision tree models to your data.
The implementation includes features that are found in a variety of popular
decision tree algorithms such as CHAID, CART, and C4.5. The node supports
both automatic and interactive training. When you run the Decision Tree node
in automatic mode, it automatically ranks the input variables, based on the
strength of their contribution to the tree. This ranking can be used to select
variables for use in subsequent modeling. You can override any automatic step
with the option to define a splitting rule and prune explicit nodes or subtrees.
Interactive training enables you to explore and evaluate a large set of trees
as you develop them. See Chapters 4 and 6. |
Dmine Regression |
Use the Dmine Regression node to compute a forward stepwise least-squares
regression model. In each step, an independent variable is selected that contributes
maximally to the model R-square value. |
DMNeural |
Use the DMNeural node to fit an additive nonlinear model. The additive nonlinear
model uses bucketed principal components as inputs to predict a binary or
an interval target variable. |
Ensemble |
Use the Ensemble node to create new models by combining the posterior
probabilities (for class targets) or the predicted values (for interval targets)
from multiple predecessor models. |
Gradient Boosting |
Gradient boosting is a boosting approach that creates a series of simple
decision trees that together form a single predictive model. Each tree in
the series is fit to the residual of the prediction from the earlier trees
in the series. Each time the data is used to grow a tree, the accuracy of
the tree is computed. The successive samples are adjusted to accommodate previously
computed inaccuracies. Because each successive sample is weighted according
to the classification accuracy of previous models, this approach is sometimes
called stochastic gradient boosting. Boosting is defined for binary, nominal,
and interval targets. |
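The fit-to-the-residuals loop can be sketched with one-split regression trees (stumps). This is a simplified illustration with hypothetical data, not the node's algorithm; in particular it omits sampling and the classification-specific weighting described above:

```python
def fit_stump(x, r):
    """One-split regression tree fit to the current residuals r."""
    best = None
    for t in sorted(set(x))[1:]:
        left = [ri for xi, ri in zip(x, r) if xi < t]
        right = [ri for xi, ri in zip(x, r) if xi >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi < t else mr

def boost(x, y, n_trees=20, shrinkage=0.3):
    """Each stump is fit to the residual of the prediction so far."""
    base = sum(y) / len(y)            # start from the overall mean
    pred = [base] * len(x)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + shrinkage * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + shrinkage * sum(s(xi) for s in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]   # a step near x = 3.5
model = boost(x, y)
```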
MBR (Memory-Based Reasoning) |
Use the MBR (Memory-Based Reasoning) node to identify similar cases
and to apply information that is obtained from these cases to a new record.
The MBR node uses k-nearest neighbor algorithms to categorize
or predict observations. |
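A minimal k-nearest neighbor classifier, for illustration only (hypothetical cases; the MBR node's data representation and distance handling are more sophisticated):

```python
from collections import Counter

def knn_classify(train, new_point, k=3):
    """Categorize a new record by majority vote among its k nearest
    training cases (squared Euclidean distance)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda case: dist(case[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
         ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.3), "high")]
label = knn_classify(train, (1.1, 1.0))
```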
Model Import |
Use the Model Import node to import and assess a model that was not
created by one of the Enterprise Miner modeling nodes. You can then use the
Model Comparison node to compare the user-defined model with one or more
models that you developed with an Enterprise Miner modeling node. This process
is called integrated assessment. |
Neural Network |
Use the Neural Network node to construct, train, and validate multilayer
feedforward neural networks. By default, the Neural Network node automatically
constructs a multilayer feedforward network that has one hidden layer consisting
of three neurons. In general, each input is fully connected to the first hidden
layer, each hidden layer is fully connected to the next hidden layer, and
the last hidden layer is fully connected to the output. The Neural Network
node supports many variations of this general form. See Chapters 5 and 6. |
Partial Least Squares |
The Partial Least Squares node is a tool, based on SAS/STAT PROC PLS, for
modeling continuous and binary targets. Partial least squares
regression produces factor scores that are linear combinations of the original
predictor variables. As a result, no correlation exists between the factor
score variables that are used in the predictive regression model. Consider
a data set that has a matrix of response variables Y and a matrix with a large
number of predictor variables X. Some of the predictor variables are highly
correlated. A regression model that uses factor extraction for the data
computes the factor score matrix T=XW, where W is the weight matrix. Next,
the model considers the linear regression model Y=TQ+E, where Q is a matrix
of regression coefficients for the factor score matrix T, and where E is the
noise term. After computing the regression coefficients, the regression model
becomes equivalent to Y=XB+E, where B=WQ, which can be used as a predictive
regression model. |
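The algebra in this description can be traced numerically. In the sketch below (plain Python with illustrative data), the weight matrix W is fixed by hand rather than chosen by PROC PLS, so only the T=XW, Y≈TQ+E, B=WQ bookkeeping is shown, for a single factor:

```python
def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

# Two correlated predictors and one response (tiny illustrative data).
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1]]
y = [3.1, 6.0, 9.2, 12.1]

# Step 1: a weight vector W collapses X into one factor score T = XW.
# (PROC PLS would choose W itself; here a fixed illustrative W keeps
# the algebra visible.)
W = [0.45, 0.89]
T = matvec(X, W)

# Step 2: regress y on T:  Q = (T'y) / (T'T), so  y = TQ + E.
Q = sum(t * yi for t, yi in zip(T, y)) / sum(t * t for t in T)

# Step 3: fold the two steps together:  B = WQ, giving  y = XB + E,
# a predictive regression model in the original inputs.
B = [w * Q for w in W]
fitted = [sum(xij * bj for xij, bj in zip(row, B)) for row in X]
```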
Regression |
Use the Regression node to fit both linear and logistic regression models
to your data. You can use continuous, ordinal, and binary target variables.
You can use both continuous and discrete variables as inputs. The node supports
the stepwise, forward, and backward selection methods. A point-and-click term
editor enables you to customize your model by specifying interaction terms
and the ordering of the model terms. See Chapters 5 and 6. |
Rule Induction |
Use the Rule Induction node to improve the classification of rare events
in your modeling data. The Rule Induction node creates a Rule Induction model
that uses split techniques to remove the largest pure split node from the
data. Rule Induction also creates binary models for each level of a target
variable and ranks the levels from the most rare event to the most common.
After all levels of the target variable are modeled, the score code is combined
into a SAS DATA step. |
Support Vector Machines (Experimental) |
Support Vector Machines are used for classification. They use a hyperplane
to separate points that are mapped into a higher-dimensional space. The data points used
to build this hyperplane are called support vectors. |
TwoStage |
Use the TwoStage node to compute a two-stage model for predicting a
class target variable and an interval target variable at the same time. The interval target
variable is usually a value that is associated with a level of the class target. |
Note: These modeling nodes use a directory table facility, called the
Model Manager, in which you can store and access models on demand.
The modeling nodes also enable you to modify the target profile or profiles
for a target variable.
Assess Nodes
Node Name |
Description |
Cutoff |
The Cutoff node provides tabular and graphical information to assist
users in determining an appropriate probability cutoff point for decision
making with binary target models. The establishment of a cutoff decision point
entails the risk of generating false positives and false negatives, but an
appropriate use of the Cutoff node can help minimize those risks.
You will typically run the node at least twice. In the first run,
you obtain all the plots and tables. In subsequent runs, you can change the
values of the Cutoff Method and Cutoff User Input properties, customizing
the plots, until an optimal cutoff value is obtained. |
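Choosing a cutoff is a trade between false positives and false negatives. The sketch below (hypothetical scored data, not the node's method) scans candidate cutoffs and keeps the first one that minimizes the total error count:

```python
# Scored records: model probability of the event, and the actual outcome.
scored = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0),
          (0.35, 1), (0.3, 0), (0.2, 0), (0.1, 0)]

def errors(cutoff):
    """False positives and false negatives at a given probability cutoff."""
    fp = sum(1 for p, y in scored if p >= cutoff and y == 0)
    fn = sum(1 for p, y in scored if p < cutoff and y == 1)
    return fp, fn

# Scan candidate cutoffs and keep the one with the fewest total errors.
best_cutoff = min((c / 100 for c in range(1, 100)),
                  key=lambda c: sum(errors(c)))
```

In practice the two error types rarely cost the same, which is why the node lets you inspect the plots and set the cutoff yourself rather than always taking the raw minimum.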
Decisions |
Use the Decisions node to define target profiles for a target that produces
optimal decisions. The decisions are made using a user-specified decision
matrix and output from a subsequent modeling procedure. |
Model Comparison |
Use the Model Comparison node to compare models and predictions from
any of the modeling tools (such as the Regression, Decision Tree, and Neural
Network tools) within a common framework. The comparison is based on the expected
and actual profits or losses that would result from implementing the model.
The node produces the following charts that help to describe the usefulness
of the model: lift, profit, return on investment, receiver operating characteristic (ROC) curves,
diagnostic charts, and threshold-based charts. See Chapter 6. |
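Lift, one of the charts listed above, compares the response rate among the top-scored records with the overall response rate. An illustrative Python sketch (hypothetical scored data, not the node's implementation):

```python
def cumulative_lift(scored, depth):
    """Lift at a given depth: the response rate among the top-scored
    fraction of records, divided by the overall response rate."""
    ranked = sorted(scored, key=lambda r: r[0], reverse=True)
    n_top = max(1, int(len(ranked) * depth))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall = sum(y for _, y in scored) / len(scored)
    return top_rate / overall

# (p_event, actual) pairs from two competing models on the same data.
model_a = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0),
           (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0), (0.02, 0)]
model_b = [(0.9, 0), (0.8, 1), (0.7, 0), (0.6, 0), (0.4, 1),
           (0.3, 1), (0.2, 0), (0.1, 1), (0.05, 0), (0.02, 0)]

lift_a = cumulative_lift(model_a, 0.2)   # top 20% of records
lift_b = cumulative_lift(model_b, 0.2)
```

Here model A concentrates more responders in its top 20% than model B, so it shows the higher lift at that depth.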
Segment Profile |
Use the Segment Profile node to assess and explore segmented data sets.
Segmented data is created from data BY-values, clustering, or applied business
rules. The Segment Profile node facilitates data exploration to identify factors
that differentiate individual segments from the population, and to compare
the distribution of key factors between individual segments and the population.
The Segment Profile node outputs a Profile plot of variable distributions
across segments and population, a Segment Size pie chart, a Variable Worth
plot that ranks factor importance within each segment, and summary statistics
for the segmentation results. The Segment Profile node does not generate score
code or modify metadata. |
Score |
Use the Score node to manage, edit, export, and execute scoring code
that is generated from a trained model. Scoring is the generation of predicted
values for a data set that might not contain a target variable. The Score
node generates and manages scoring formulas in the form of a single SAS DATA
step, which can be used in most SAS environments even without the presence
of Enterprise Miner. See Chapter 6. |
Utility Nodes
Node Name |
Description |
Control Point |
Use the Control Point node to establish a control point to reduce the
number of connections that are made in process flow diagrams. For example,
suppose three Input Data nodes are to be connected to three modeling nodes.
If no Control Point node is used, then nine connections are required to connect
all of the Input Data nodes to all of the modeling nodes. However, if a Control
Point node is used, only six connections are required. |
End Groups |
The End Groups node is used only in conjunction with the Start Groups
node. The End Groups node acts as a boundary marker that defines the end of
group processing operations in a process flow diagram. Group processing operations
are performed on the portion of the process flow diagram that exists between
the Start Groups node and the End Groups node.
If the group processing function that is specified in the Start Groups
node is stratified, bagging, or boosting, the End Groups node functions as
a model node and presents the final aggregated model. Enterprise Miner tools
that follow the End Groups node continue data mining processes normally. |
Start Groups |
The Start Groups node is useful when your data can be segmented or grouped,
and you want to process the grouped data in different ways. The Start Groups
node uses BY-group processing as a method to process observations from one
or more data sources that are grouped or ordered by values of one or more
common variables. BY variables identify the variable or variables by
which the data source is indexed, and BY statements process data and order
output according to the BY-group values.
You can use the Enterprise Miner Start Groups node to perform these
tasks:
-
define group variables such as GENDER or JOB, in order to obtain
separate analyses for each level of a group variable
-
analyze more than one target variable in the same process flow
-
specify index looping, or how many times the flow that follows
the node should loop
-
resample the data set and use unweighted sampling to create bagging
models
-
resample the training data set and use reweighted sampling to
create boosting models
|
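BY-group processing amounts to splitting the data on the group variable and repeating the same analysis once per level. A schematic Python sketch (hypothetical data; the per-group "analysis" here is just a summary statistic standing in for a model):

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"GENDER": "F", "income": 52.0, "spend": 6.1},
    {"GENDER": "F", "income": 61.0, "spend": 7.0},
    {"GENDER": "M", "income": 48.0, "spend": 4.9},
    {"GENDER": "M", "income": 70.0, "spend": 6.8},
]

# Split the data by the group variable, then run the same analysis
# once per group level, producing a separate result for each.
groups = defaultdict(list)
for row in rows:
    groups[row["GENDER"]].append(row)

results = {level: mean(r["spend"] for r in members)
           for level, members in groups.items()}
```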
Metadata |
Use the Metadata node to modify the column metadata information at
some point in your process flow diagram. You can modify attributes such as
roles, measurement levels, and order. |
Reporter |
The Reporter node uses SAS Output Delivery System (ODS) capability to
create a single PDF or RTF file that contains information about the open process
flow diagram. The PDF or RTF documents can be viewed and saved directly and
are included in Enterprise Miner report package files.
The report contains a header that shows the Enterprise Miner settings,
process flow diagram, and detailed information for each node. Based on the
Nodes property setting, each node that is included in the open process flow
diagram has a header, property settings, and a variable summary. Moreover,
the report also includes results such as variable selection, model diagnostic
tables, and plots from the Results browser. Score code, log, and output listing
are not included in the report. Those items are found in the Enterprise Miner
package folder. |
SAS Code |
Use the SAS Code node to incorporate new or existing SAS code into process
flows that you develop using Enterprise Miner. The SAS Code node extends the
functionality of Enterprise Miner by making other SAS procedures available
in your data mining analysis. You can also write a SAS DATA step to create
customized scoring code, to conditionally process data, and to concatenate
or to merge existing data sets. See Chapter 6. |
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.