About Nodes

Most SAS Enterprise Miner nodes are organized on tabs according to the SEMMA data mining methodology. There are also tabs for Credit Scoring and Utility node groups. Use the Credit Scoring nodes to score data models and to create freestanding code. Use the Utility nodes to submit SAS programming statements and to define control points in the process flow diagram.
Note: The Credit Scoring nodes do not appear in all installed versions of SAS Enterprise Miner. For more information about the Credit Scoring nodes, see the SAS Enterprise Miner Credit Scoring Help.
Sample Nodes
Append
Use the Append node to append data sets that are exported by two different paths in a single process flow diagram. The Append node can also append train, validation, and test data sets into a new training data set.
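For example, the equivalent concatenation can be sketched in a base SAS DATA step; the table names here are hypothetical.

   data work.combined;
      set work.part1 work.part2;   /* stack the observations from both tables */
   run;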
Data Partition
Use the Data Partition node to partition an input data set into training, validation, and test data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the free model parameters during estimation. It is also used for model assessment. The test data set is an additional holdout data set that you can use for model assessment.
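Outside the node, a comparable 60/20/20 random partition can be sketched in a base SAS DATA step; the input table name and seed are hypothetical.

   data work.train work.validate work.test;
      set work.input;
      _r = ranuni(12345);                /* uniform random number, fixed seed */
      if _r < 0.6 then output work.train;
      else if _r < 0.8 then output work.validate;
      else output work.test;
      drop _r;
   run;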
File Import
Use the File Import node to convert selected external flat files, spreadsheets, and database tables into a format that SAS Enterprise Miner recognizes as a data source and can use in data mining process flow diagrams.
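For a delimited file, a comparable base SAS step is PROC IMPORT; the file path and output table are hypothetical.

   proc import datafile='C:\data\customers.csv'
               out=work.customers dbms=csv replace;
      getnames=yes;                      /* read column names from the first row */
   run;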
Filter
Use the Filter node to create and apply filters to the input data. You can use filters to exclude certain observations, such as extreme outliers and errant data that you do not want to include in a mining analysis.
Input Data
The Input Data node represents the data source that you choose for a mining analysis. It provides details (metadata) about the variables in the data source that you want to use.
Merge
Use the Merge node to merge observations from two or more data sets into a single observation in a new data set.
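The node's operation corresponds to a match-merge in a base SAS DATA step; the table and key names are hypothetical, and both inputs must be sorted by the key.

   data work.combined;
      merge work.accounts work.transactions;
      by customer_id;                    /* common key in both sorted inputs */
   run;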
Sample
Use the Sample node to extract a simple random sample, nth-observation sample, stratified sample, first-n sample, or cluster sample of an input data set. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the random sample sufficiently represents the source data set, then data relationships that SAS Enterprise Miner finds in the sample can be applied to the complete source data set.
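For example, a simple random sample can be drawn outside the node with PROC SURVEYSELECT (SAS/STAT); this sketch draws 10% of a SASHELP table.

   proc surveyselect data=sashelp.baseball out=work.sample
                     method=srs          /* simple random sampling */
                     samprate=0.10       /* keep 10% of observations */
                     seed=42;            /* reproducible sample */
   run;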
Time Series
Use the Time Series node to convert transactional data to time series data to perform seasonal and trend analysis. Transactional data is timestamped data that is collected over time at no particular frequency. By contrast, time series data is timestamped data that is collected over time at a specific frequency.
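A comparable conversion can be sketched with PROC TIMESERIES (SAS/ETS); the transaction table and variable names are hypothetical.

   proc timeseries data=work.transactions out=work.monthly;
      id txn_date interval=month accumulate=total;  /* roll timestamps up to monthly totals */
      var amount;
   run;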
Explore Nodes
Association
Use the Association node to identify association and sequence relationships within the data. For example, “If a customer buys cat food, how likely is the customer to also buy cat litter?” In the case of sequence discovery, this question could be extended and posed as, “If a customer buys cat food today, how likely is the customer to buy cat litter within the next week?”
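As a minimal illustration of one underlying measure, this PROC SQL sketch computes the confidence of the rule "cat food implies cat litter" from a hypothetical table with one row per basket-item pair.

   proc sql;
      select (select count(distinct a.basket_id)
                from work.baskets a, work.baskets b
               where a.basket_id = b.basket_id
                 and a.item = 'cat food' and b.item = 'cat litter')
             / count(distinct basket_id) as confidence   /* P(litter | food) */
        from work.baskets
       where item = 'cat food';
   quit;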
Cluster
Use the Cluster node to perform observation clustering, which can be used to segment databases. Clustering places objects into groups or clusters suggested by the data. The objects in each cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar.
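For example, k-means clustering in this spirit can be sketched with PROC FASTCLUS (SAS/STAT).

   proc fastclus data=sashelp.iris maxclusters=3 out=work.clustered;
      var SepalLength SepalWidth PetalLength PetalWidth;  /* interval inputs */
   run;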
DMDB (Data Mining Database)
The DMDB node creates a data mining database that provides summary statistics and factor-level information for class and interval variables in the imported data set. Improvements to SAS Enterprise Miner have eliminated the previous need to use the DMDB node to optimize the performance of nodes. However, the DMDB database can still provide quick summary statistics for class and interval variables at a given point in a process flow diagram.
Graph Explore
The Graph Explore node is an advanced visualization tool that enables you to interactively explore large volumes of data to uncover patterns and trends and to reveal extreme values in the database. You can analyze univariate distributions, investigate multivariate distributions, create scatter and box plots, constellation and 3-D charts, and so on.
Market Basket
The Market Basket node performs association rule mining over transaction data in conjunction with item taxonomy. Market basket analysis uses the information from the transaction data to give you insight (for example, about which products tend to be purchased together). Market basket analysis is not limited to the retail marketing domain; it can be abstracted to other areas, such as word co-occurrence relationships in text documents.
MultiPlot
Use the MultiPlot node to visualize data from a wide range of perspectives. The MultiPlot node automatically creates bar charts and scatter plots for the input and target variables without requiring you to make several menu or window item selections.
Path Analysis
Use the Path Analysis node to analyze web log data and to determine the paths that visitors take as they navigate through a website. You can also use the node to perform sequence analysis.
SOM/Kohonen
Use the SOM/Kohonen node to perform unsupervised learning by using Kohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are primarily dimension-reduction methods.
StatExplore
Use the StatExplore node to examine the statistical properties of an input data set. You can use the StatExplore node to compute standard univariate distribution statistics, to compute standard bivariate statistics by class target and class segment, and to compute correlation statistics for interval variables by interval input and target.
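Comparable summary and correlation statistics can be produced outside the node, as in this sketch against a SASHELP table.

   proc means data=sashelp.heart n mean std min max;  /* univariate statistics */
      var Weight Height Cholesterol;
   run;
   proc corr data=sashelp.heart;                      /* interval correlations */
      var Weight Height Cholesterol;
   run;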
Variable Clustering
Variable clustering is a useful tool for data reduction and can remove collinearity, decrease variable redundancy, and help reveal the underlying structure of the input variables in a data set. When properly used as a variable-reduction tool, the Variable Clustering node can replace a large set of variables with the set of cluster components with little loss of information.
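Outside the node, PROC VARCLUS (SAS/STAT) performs the same kind of variable clustering; this sketch keeps splitting clusters until every cluster's second eigenvalue falls below 1.

   proc varclus data=sashelp.baseball maxeigen=1 short;
      var nAtBat nHits nRuns nRBI nBB CrAtBat CrHits CrRuns;
   run;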
Variable Selection
Use the Variable Selection node to quickly identify input variables that are useful for predicting the target variable.
Modify Nodes
Drop
Use the Drop node to remove variables from data sets or hide variables from the metadata. You can drop specific variables and all variables of a particular type.
Impute
Use the Impute node to replace missing values. For example, you can replace missing values of an interval variable with the mean or with a robust M-estimator such as Andrew's Wave. Missing values for the training, validation, test, and score data sets are replaced using imputation statistics that are calculated from the active training predecessor data set.
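For example, mean imputation of interval variables can be sketched with PROC STDIZE (SAS/STAT); the table and variable names are hypothetical.

   proc stdize data=work.train out=work.train_imputed
               reponly                   /* replace only the missing values */
               method=mean;              /* impute with the variable mean   */
      var income age;
   run;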
Interactive Binning
The Interactive Binning node is an interactive grouping tool that you use to model nonlinear functions of multiple modes of continuous distributions. The interactive tool computes initial bins by quantiles. Then you can interactively split and combine the initial bins. This node enables you to select strong characteristics based on the Gini statistic and to group the selected characteristics based on business considerations. The node is helpful in shaping the data to represent risk-ranking trends rather than modeling quirks, which might lead to overfitting.
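The initial quantile binning step resembles PROC RANK in base SAS; this sketch assigns decile bins to a hypothetical interval input.

   proc rank data=work.train out=work.binned groups=10;  /* 10 quantile bins */
      var income;
      ranks income_bin;                  /* bin number 0 through 9 */
   run;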
Principal Components
Use the Principal Components node to generate principal components. Principal components are uncorrelated linear combinations of the original input variables that depend on the covariance matrix or correlation matrix of the input variables. In data mining, principal components are usually used as the new set of input variables for subsequent analysis by modeling nodes.
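For example, PROC PRINCOMP (SAS/STAT) computes principal component scores that could feed a downstream model.

   proc princomp data=sashelp.baseball out=work.scores n=3;  /* keep 3 components */
      var nAtBat nHits nRuns nRBI nBB;   /* scores appear as Prin1-Prin3 */
   run;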
Replacement
Use the Replacement node to generate score code to process unknown levels when scoring and also to interactively specify replacement values for class and interval levels. In some cases, you might want to reassign specified nonmissing values before performing imputation calculations for the missing values.
Rules Builder
The Rules Builder node opens the Rules Builder window, where you can create ad hoc sets of rules with user-definable outcomes. You can interactively define the values of the outcome variable and the paths to the outcome. This is useful, for example, in applying logic for posterior probabilities and scorecard values. Rules are defined using charts and histograms based on a sample of the data.
Transform Variables
Use the Transform Variables node to create new variables or variables that are transformations of existing variables in the data. Transformations are useful when you want to improve the fit of a model to the data. For example, transformations can be used to stabilize variances, remove nonlinearity, improve additivity, and correct non-normality in variables. The Transform Variables node also enables you to create interaction variables.
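For example, a variance-stabilizing transformation and an interaction variable can be sketched in a DATA step; the table and variable names are hypothetical.

   data work.transformed;
      set work.train;
      log_income = log(income + 1);      /* reduce skewness */
      age_income = age * income;         /* interaction variable */
   run;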
Model Nodes
AutoNeural
Use the AutoNeural node as an automated tool to help you find optimal configurations for a neural network model.
Decision Tree
Use the Decision Tree node to fit decision tree models to the data. The implementation includes features that are found in a variety of popular decision tree algorithms such as CHAID, CART, and C4.5. The node supports both automatic and interactive training. When you run the Decision Tree node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modeling. You can override any automatic step, for example by defining a splitting rule or by pruning explicit nodes or subtrees. Interactive training enables you to explore and evaluate a large set of trees as you develop them.
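The node uses SAS Enterprise Miner's own tree implementation; outside of it, PROC HPSPLIT (SAS/STAT) fits comparable classification trees, as in this sketch.

   proc hpsplit data=sashelp.heart maxdepth=4;
      class Status Sex;                  /* class target and class input */
      model Status = Sex Weight Height Cholesterol;
   run;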
Dmine Regression
Use the Dmine Regression node to compute a forward stepwise least squares regression model. In each step, the node selects the independent variable that contributes maximally to the model R-square value.
DMNeural
Use the DMNeural node to fit an additive nonlinear model. The additive nonlinear model uses bucketed principal components as inputs to predict a binary or an interval target variable. The algorithm that is used in DMNeural network training was developed to overcome problems that commonly affect neural networks, especially when the data set contains highly collinear variables.
Ensemble
Use the Ensemble node to create new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models. One common ensemble approach is to use multiple modeling methods, such as a neural network and a decision tree, to obtain separate models from the same training data set. The component models from the two complementary modeling methods are integrated by the Ensemble node to form the final model solution.
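As a minimal sketch of posterior averaging, the following DATA step combines two scored tables; the table names, the id key, and the P_target1 posterior column are hypothetical stand-ins for SAS Enterprise Miner's posterior-variable naming.

   data work.ensembled;
      merge work.tree_scored(rename=(P_target1=p_tree))
            work.nnet_scored(rename=(P_target1=p_nn));
      by id;                             /* both tables sorted by id */
      P_target1 = (p_tree + p_nn) / 2;   /* averaged posterior probability */
   run;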
Gradient Boosting
Gradient boosting creates a series of simple decision trees that together form a single predictive model. Each tree in the series is fit to the residual of the prediction from the earlier trees in the series. Each time the data is used to grow a tree, the accuracy of the tree is computed. The successive samples are adjusted to accommodate previously computed inaccuracies. Because each successive sample is weighted according to the classification accuracy of previous models, this approach is sometimes called stochastic gradient boosting. Boosting is defined for binary, nominal, and interval targets.
LARS (Least Angle Regression)
The LARS node can perform both variable selection and model-fitting tasks. When used for variable selection, the LARS node selects variables in a continuous fashion, where coefficients for each selected variable grow from zero to the variable's least squares estimates. With a small modification, you can use LARS to efficiently produce LASSO solutions.
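Outside SAS Enterprise Miner, PROC GLMSELECT (SAS/STAT) offers the same LAR and LASSO selection methods.

   proc glmselect data=sashelp.baseball;
      model logSalary = nAtBat nHits nRuns nRBI nBB CrHits
            / selection=lar;             /* or selection=lasso */
   run;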
MBR (Memory-Based Reasoning)
Use the MBR node to identify similar cases and to apply information that is obtained from these cases to a new record. The MBR node uses k-nearest neighbor algorithms to categorize or predict observations.
Model Import
Use the Model Import node to import and assess a model that was not created by one of the SAS Enterprise Miner modeling nodes. You can then use the Model Comparison node to compare the user-defined model with one or more models that you developed with a SAS Enterprise Miner modeling node. This process is called integrated assessment.
Neural Network
Use the Neural Network node to construct, train, and validate multilayer, feed-forward neural networks. By default, the Neural Network node automatically constructs a network that has one hidden layer consisting of three neurons. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network node supports many variations of this general form.
Partial Least Squares
The Partial Least Squares node is a tool for modeling continuous and binary targets. This node extracts factors called components or latent vectors that can be used to explain response variation or predictor variation in the analyzed data.
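For example, PROC PLS (SAS/STAT) extracts latent factors in the same spirit.

   proc pls data=sashelp.baseball nfac=3;   /* extract three latent factors */
      model logSalary = nAtBat nHits nRuns nRBI nBB;
   run;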
Regression
Use the Regression node to fit both linear and logistic regression models to the data. You can use continuous, ordinal, and binary target variables, and you can use both continuous and discrete input variables. The node supports the stepwise, forward, and backward selection methods.
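For example, a stepwise logistic regression for a binary target can be sketched with PROC LOGISTIC (SAS/STAT).

   proc logistic data=sashelp.heart;
      class Sex;
      model Status(event='Dead') = Sex Weight Height Cholesterol
            / selection=stepwise;        /* stepwise input selection */
   run;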
Rule Induction
Use the Rule Induction node to improve the classification of rare events. The Rule Induction node creates a Rule Induction model that uses split techniques to remove the largest pure split node from the data. Rule Induction also creates binary models for each level of a target variable and ranks the levels from the most rare event to the most common. After all levels of the target variable are modeled, the score code is combined into a SAS DATA step.
SVM (Support Vector Machines)
A support vector machine (SVM) is a supervised machine learning method that is used to perform classification and regression analysis. The standard SVM solves binary classification problems and produces nonprobabilistic output (only the sign, +1 or -1) by constructing a set of hyperplanes that maximize the margin between the two classes.
TwoStage
Use the TwoStage node to build a sequential or concurrent two-stage model for predicting a class variable and an interval target variable at the same time. The interval target variable is usually a value that is associated with a level of the class target.
Note: These modeling nodes use a directory table facility, called the Model Manager, in which you can store and access models on demand.
Assess Nodes
Cutoff
The Cutoff node provides tabular and graphical information to assist you in determining an appropriate probability cutoff point for decision making with binary target models. The establishment of a cutoff decision point entails the risk of generating false positives and false negatives, but an appropriate use of the Cutoff node can help minimize those risks. You typically run the node at least twice. In the first run, you obtain all the plots and tables. In subsequent runs, you can change the node properties until an optimal cutoff value is obtained.
Decisions
Use the Decisions node to define target profiles that produce optimal decisions. The decisions are made using a user-specified decision matrix and the output from a subsequent modeling procedure.
Model Comparison
Use the Model Comparison node to compare models and predictions from any of the modeling nodes. The comparison is based on the expected and actual profits or losses that would result from implementing the model. The node produces charts that help describe the usefulness of the model.
Score
Use the Score node to manage SAS scoring code that is generated from a trained model or models, to save the SAS scoring code to a location on the client computer, and to run the SAS scoring code. Scoring is the generation of predicted values for a data set that might not contain a target variable.
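Exported score code is a DATA step fragment, so a typical deployment pattern is to %INCLUDE it while scoring new data; the path and table names here are hypothetical.

   data work.scored;
      set work.new_customers;            /* data set that might lack a target */
      %include 'C:\models\score.sas';    /* generated scoring statements */
   run;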
Segment Profile
Use the Segment Profile node to examine segmented or clustered data and identify factors that differentiate data segments from the population. The node generates various reports that aid in exploring and comparing the distribution of these factors within the segments and population.
Utility Nodes
Control Point
Use the Control Point node to establish a control point within process flow diagrams. A control point simplifies distributing the connections between process flow steps that have multiple interconnected nodes. The Control Point node can reduce the number of connections that are made.
End Groups
The End Groups node is used only in conjunction with the Start Groups node. The End Groups node acts as a boundary marker that defines the end of group processing operations in a process flow diagram. Group processing operations are performed on the portion of the process flow diagram that exists between the Start Groups node and the End Groups node. If you specify Stratified, Bagging, or Boosting in the group processing function of the Start Groups node, then the End Groups node functions as a model node and presents the final aggregated model.
Ext Demo
The Ext Demo node illustrates the various controls that can be used in SAS Enterprise Miner extension nodes. These controls enable users to pass arguments to an underlying SAS program. By choosing an appropriate user interface control, an extension node developer can specify how information about the node's arguments is presented to the user and can place restrictions on the values of the arguments. The Ext Demo node's results also provide examples of the various types of graphs that can be generated by an extension node using the %EM_REPORT macro.
Metadata
Use the Metadata node to modify column metadata (such as roles, measurement levels, and order) in a process flow diagram.
Reporter
The Reporter node uses SAS Output Delivery System (ODS) capability to create a single PDF or RTF file that contains information about the open process flow diagram. The report shows the SAS Enterprise Miner settings, process flow diagram, and detailed information for each node. The report also includes results such as variable selection, model diagnostic tables, and plots from the Results browser. The score code, log, and output listing are not included in the report; those items are found in the SAS Enterprise Miner package folder.
SAS Code
Use the SAS Code node to incorporate new or existing SAS code into process flows that you develop using SAS Enterprise Miner.
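Code inside a SAS Code node typically reads and writes data through node-supplied macro variables such as &EM_IMPORT_DATA and &EM_EXPORT_TRAIN; the derived variable in this sketch is hypothetical.

   data &EM_EXPORT_TRAIN;
      set &EM_IMPORT_DATA;               /* training data imported by the node */
      ratio = debt / max(income, 1);     /* hypothetical derived input */
   run;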
Score Code Export
The Score Code Export node is an extension for SAS Enterprise Miner that exports files that are necessary for score code deployment. Extensions are programmable add-ins for the SAS Enterprise Miner environment.
Start Groups
The Start Groups node is useful when the data can be segmented or grouped, and you want to process the grouped data in different ways. The Start Groups node uses BY-group processing as a method to process observations from one or more data sources that are grouped or ordered by values of one or more common variables.
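BY-group processing itself is a base SAS idiom, as in this sketch with hypothetical table and variable names.

   proc sort data=work.customers;
      by region;                         /* order observations by the group variable */
   run;
   proc means data=work.customers mean;
      by region;                         /* one analysis per BY group */
      var spend;
   run;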
Applications Nodes
Incremental Response
The Incremental Response node models the incremental impact of a treatment in order to optimize customer targeting for maximum return on investment. The Incremental Response node can determine the likelihood that a customer purchases a product or uses a coupon. It can predict the incremental revenue that is realized during a promotional period.
Survival
Survival data mining is the application of survival analysis to data mining problems concerning customers. The application to the business problem changes the nature of the statistical techniques. The issue in survival data mining is not whether an event will occur in a certain time interval, but when the next event will occur. The SAS Enterprise Miner Survival node performs survival analysis on mining customer databases when there are time-dependent outcomes.