Glossary :: Getting Started with SAS(R) Enterprise Miner(TM) 13.1

assessment

the process of determining how well a model computes good outputs from input data that is not used during training. Assessment statistics are automatically computed when you train a model with a modeling node. By default, assessment statistics are calculated from the validation data set.

champion model

the best predictive model that is chosen from a pool of candidate models in a data mining environment. Candidate models are developed using various data mining heuristics and algorithm configurations. Competing models are compared and assessed using criteria such as training, validation, and test data fit and model score comparisons.

chi-squared automatic interaction detection

a technique for building decision trees. The CHAID technique uses a specified significance level of a chi-square test as the criterion for stopping tree growth. Short form: CHAID.
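
The stopping criterion can be illustrated with a small Python sketch. It computes a Pearson chi-square statistic for a candidate split and compares it with a critical value. The counts, the names `chi_square_statistic`, `candidate_split`, and `CRITICAL_VALUE`, and the 0.05 level are assumptions made for illustration. The sketch is a simplification and is not the CHAID implementation in Enterprise Miner.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the split and the target were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are the two branches of a candidate split,
# columns are the two levels of a binary target.
candidate_split = [[30, 10], [10, 30]]

# Chi-square critical value for 1 degree of freedom at the 0.05 level (assumed level).
CRITICAL_VALUE = 3.841

statistic = chi_square_statistic(candidate_split)
keep_splitting = statistic > CRITICAL_VALUE
print(statistic, keep_splitting)
```

In this example every expected count is 20, so the statistic is 20 and the split would be kept. A table with no association between branch and target yields a statistic of 0, and growth would stop at that node.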

classification and regression trees

a decision tree technique that is used for classifying or segmenting a data set. The technique provides a set of rules that can be applied to new data sets in order to predict which records will have a particular outcome. It also segments a data set by creating 2-way splits. The CART technique requires less data preparation than CHAID. Short form: CART.

data mining database

a SAS data set that is designed to optimize the performance of the modeling nodes. DMDBs enhance performance by reducing the number of passes that the analytical engine needs to make through the data. Each DMDB contains a meta catalog, which includes summary statistics for numeric variables and factor-level information for categorical variables. Short form: DMDB.

data source

a data object that represents a SAS data set in the Java-based Enterprise Miner GUI. A data source contains all the metadata for a SAS data set that Enterprise Miner needs in order to use the data set in a data mining process flow diagram. The SAS data set metadata that is required to create an Enterprise Miner data source includes the name and location of the data set, the SAS code that is used to define its library path, and the variable roles, measurement levels, and associated attributes that are used in the data mining process.

decision tree

the complete set of rules that are used to split data into a hierarchy of successive segments. A tree consists of branches and leaves, in which each set of leaves represents an optimal segmentation of the branches above them according to a statistical measure.

dependent variable

a variable whose value is determined by the value of another variable or by the values of a set of variables.

depth

the number of successive hierarchical partitions of the data in a tree. The initial, undivided segment has a depth of 0.

Gini index

a measure of the total leaf impurity in a decision tree.
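
As a rough illustration, the following Python sketch uses the common textbook form of the Gini index: one minus the sum of squared class proportions in a leaf, averaged over leaves in proportion to leaf size. The function names and the class counts are hypothetical, and the exact formula that Enterprise Miner uses is not stated in this glossary.

```python
def gini_impurity(class_counts):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def total_leaf_impurity(leaves):
    """Average of the leaf impurities, weighted by the number of cases in each leaf."""
    n_cases = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / n_cases * gini_impurity(leaf) for leaf in leaves)


# Hypothetical tree with two leaves; each list holds the case counts per target level.
leaves = [[10, 0], [5, 5]]
print(gini_impurity(leaves[0]))     # pure leaf
print(gini_impurity(leaves[1]))     # evenly mixed leaf
print(total_leaf_impurity(leaves))
```

A pure leaf has an impurity of 0, and an evenly mixed two-class leaf has an impurity of 0.5, so lower totals indicate leaves that separate the target levels more cleanly.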

hidden layer

in a neural network, a layer between input and output to which one or more activation functions are applied. Hidden layers are typically used to introduce nonlinearity.

imputation

the computation of replacement values for missing input values.
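
One simple replacement rule is to substitute the mean of the observed values. The Python sketch below assumes that missing values are represented by `None`. The function name `impute_mean` and the sample values are hypothetical, and mean substitution is only one of several possible imputation methods.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical input variable with one missing value.
income = [1.0, None, 3.0]
print(impute_mean(income))
```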

input variable

a variable that is used in a data mining process to predict the value of one or more target variables.

interval variable

a continuous variable that contains values across a range. For example, a continuous variable called Temperature could have values such as 0, 32, 34, 36, 43.5, 44, 56, 80, 99, 99.9, and 100.

leaf

in a tree diagram, any segment that is not further segmented. The final leaves in a tree are called terminal nodes.

logistic regression

a form of regression analysis in which the target variable (response variable) represents a binary-level or ordinal-level response.
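
For a binary target, a fitted logistic regression model converts a linear combination of the inputs into a probability by using the logistic function. The Python sketch below shows only that prediction step. The coefficients are hypothetical values chosen for illustration and are not estimated from data.

```python
import math


def predict_probability(intercept, coefficients, inputs):
    """Probability of the event level for one observation under a logistic model."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical fitted model with two input variables.
intercept = -1.0
coefficients = [0.5, 0.25]

print(predict_probability(intercept, coefficients, [2.0, 0.0]))
print(predict_probability(intercept, coefficients, [6.0, 4.0]))
```

A linear predictor of 0 maps to a probability of 0.5, and larger linear predictors map to probabilities closer to 1.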

metadata

a description or definition of data or information.

MLP

See multilayer perceptron.

model

a formula or algorithm that computes outputs from inputs. A data mining model includes information about the conditional distribution of the target variables, given the input variables.

multilayer perceptron

a neural network that has one or more hidden layers, each of which has a linear combination function and executes a nonlinear activation function on the input to that layer. Short form: MLP. See also hidden layer.
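
The forward computation of a multilayer perceptron with a single hidden layer can be sketched in Python as follows. The hyperbolic tangent activation, the identity output function, and all weights are assumptions made for illustration. Training, that is, the computation of the weights, is not shown.

```python
import math


def layer(inputs, weights, biases, activation):
    """Apply a linear combination function and then an activation function per unit."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Forward pass: inputs -> hidden layer (tanh) -> output layer (identity)."""
    hidden = layer(inputs, hidden_weights, hidden_biases, math.tanh)
    return layer(hidden, output_weights, output_biases, lambda z: z)


# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[1.0, 1.0]]
output_biases = [0.5]

print(mlp_forward([1.0, 1.0], hidden_weights, hidden_biases, output_weights, output_biases))
```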

neural networks

a class of flexible nonlinear regression models, discriminant models, data reduction models, and nonlinear dynamic systems that often consist of a large number of neurons. These neurons are usually interconnected in complex ways and are often organized into layers. See also neuron.

node

(1) in the SAS Enterprise Miner user interface, a graphical object that represents a data mining task in a process flow diagram. The statistical tools that perform the data mining tasks are called nodes when they are placed on a data mining process flow diagram. Each node performs a mathematical or graphical operation as a component of an analytical and predictive data model. (2) in a neural network, a linear or nonlinear computing element that accepts one or more inputs, computes a function of the inputs, and can direct the result to one or more other neurons. Nodes are also known as neurons or units. (3) a leaf in a tree diagram. The terms leaf, node, and segment are closely related and sometimes refer to the same part of a tree. See also process flow diagram and internal node.

observation

a row in a SAS data set. All of the data values in an observation are associated with a single entity such as a customer or a state. Each observation contains either one data value or a missing-value indicator for each variable.

overfit

to train a model to the random variation in the sample data. Overfitted models contain too many parameters (weights), and they do not generalize well. See also underfit.

partition

to divide available data into training, validation, and test data sets.
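
A random partition can be sketched in Python as follows. The 40/30/30 proportions, the seed, and the function name `partition_rows` are assumptions made for illustration. The sketch uses simple random assignment and does not stratify on the target.

```python
import random


def partition_rows(rows, train=0.4, validation=0.3, seed=12345):
    """Shuffle the rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the partition repeatable
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_rows = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains becomes test data
    return training, validation_rows, test


# Hypothetical data set of 100 observation identifiers.
training, validation_rows, test = partition_rows(range(100))
print(len(training), len(validation_rows), len(test))
```

Each observation lands in exactly one of the three data sets, so the validation and test data stay unseen during training.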

PFD

See process flow diagram.

predicted value

in a regression model, the value of a dependent variable that is calculated by evaluating the estimated regression equation for a specified set of values of the explanatory variables.

prior probability

a probability that reflects knowledge about the population before obtaining the sample on hand.

process flow diagram

a graphical representation of the various data mining tasks that are performed by individual Enterprise Miner nodes during a data mining analysis. A process flow diagram consists of two or more individual nodes that are connected in the order in which the data miner wants the corresponding statistical operations to be performed. Short form: PFD.

profit matrix

a table of expected revenues and expected costs for each decision alternative for each level of a target variable.
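
The way a profit matrix turns predicted probabilities into a decision can be sketched in Python as follows. The target levels, the decision alternatives, and the profit values are hypothetical. The sketch chooses the decision with the largest expected profit, which is an assumed decision rule for illustration.

```python
# Hypothetical profit matrix: one row per target level, one column per decision.
profit_matrix = {
    "respond":    {"mail": 10.0, "skip": 0.0},
    "no_respond": {"mail": -1.0, "skip": 0.0},
}


def best_decision(posteriors, matrix):
    """Return the decision with the highest expected profit and all expected profits."""
    decisions = next(iter(matrix.values())).keys()
    expected = {
        decision: sum(p * matrix[level][decision] for level, p in posteriors.items())
        for decision in decisions
    }
    return max(expected, key=expected.get), expected


# Predicted probabilities of each target level for two hypothetical customers.
print(best_decision({"respond": 0.20, "no_respond": 0.80}, profit_matrix))
print(best_decision({"respond": 0.05, "no_respond": 0.95}, profit_matrix))
```

With these values, mailing is the better decision for the first customer, whose expected profit from mailing is positive. Skipping is the better decision for the second customer, whose expected profit from mailing is negative.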

project

a user-created GUI entity that contains the related SAS Enterprise Miner components required for the data mining models. A project contains SAS Enterprise Miner data sources, process flow diagrams, results data sets, and model packages.

pruning

the process of removing nodes from a decision tree when those nodes involve less than optimal decision rules.

root node

the initial segment of a tree. The root node represents the entire data set that is submitted to the tree, before any splits are made.

sampling

the process of selecting a subset of n cases from a population. Because a sample contains fewer cases than the full population, sampling decreases the time required for fitting a model.

SAS data set

a file whose contents are in one of the native SAS file formats. There are two types of SAS data sets: SAS data files and SAS data views. SAS data files contain data values in addition to descriptor information that is associated with the data. SAS data views contain only the descriptor information plus other information that is required for retrieving data values from other SAS data sets or from files that are stored in other software vendors' file formats.

scoring

the process of applying a model to new data in order to compute outputs. Scoring is the last process that is performed in data mining.

SEMMA

the data mining process that is used by Enterprise Miner. SEMMA stands for Sample, Explore, Modify, Model, and Assess.

subdiagram

in a process flow diagram, a collection of nodes that are compressed into a single node. The use of subdiagrams can improve your control of the information flow in the diagram.

target variable

a variable whose values are known in one or more data sets that are available (in training data, for example) but whose values are unknown in one or more future data sets (in a score data set, for example). Data mining models use data from known variables to predict the values of target variables.

training

the process of computing good values for the weights in a model.

training data

currently available data that contains input values and target values that are used for model training.

transformation

the process of applying a function to a variable in order to adjust the variable's range, variability, or both.
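
Two common transformations can be sketched in Python as follows: a min-max rescaling, which adjusts the range, and a log transformation, which compresses large values and so reduces variability in right-skewed data. The function names and sample values are hypothetical, and the choice of these two functions is an assumption made for illustration.

```python
import math


def min_max_scale(values):
    """Rescale values linearly so that the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def log_transform(values):
    """Apply log(1 + x), which is defined for 0 and compresses large values."""
    return [math.log1p(v) for v in values]


# Hypothetical interval variable.
amounts = [2.0, 4.0, 6.0]
print(min_max_scale(amounts))
print(log_transform(amounts))
```

The rescaling assumes that the variable takes at least two distinct values, and the log transformation assumes values greater than -1.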

tree

See decision tree.

tree structure

a type of data structure that uses the graphic analogy of a tree with branches and leaves. Each set of leaves represents an optimal segmentation of the branches above it, according to a statistical measure and the rules that govern the structure.

underfit

to train a model to only part of the actual patterns in the sample data. Underfit models contain too few parameters (weights), and they do not generalize well. See also overfit.

validation data

data that is used to validate the suitability of a data model that was developed using training data. Both training data sets and validation data sets contain target variable values. Target variable values in the training data are used to train the model. Target variable values in the validation data set are used to compare the training model's predictions to the known target values, assessing the model's fit before using the model to score new data.

variable

a column in a SAS data set or in a SAS data view. The data values for each variable describe a single characteristic for all observations. Each SAS variable can have the following attributes: name, data type (character or numeric), length, format, informat, and label.

variable attribute

any of the following characteristics that are associated with a particular variable: name, label, format, informat, data type, and length.

variable level

the set of data dimensions for binary, interval, or class variables. Binary variables have two levels. A binary variable CREDIT could have levels of 1 and 0, Yes and No, or Accept and Reject. Interval variables have levels that correspond to the number of interval variable partitions. For example, an interval variable PURCHASE_AGE might have levels of 0-18, 19-39, 40-65, and >65. Class variables have levels that correspond to the class members. For example, a class variable HOMEHEAT might have four variable levels: Coal/Wood, FuelOil, Gas, and Electric. Data mining decision and profit matrixes are composed of variable levels.
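
The PURCHASE_AGE example above can be sketched in Python as a function that assigns each value to one of the four levels. The function name `purchase_age_level` is hypothetical, and the sketch assumes that ages are nonnegative whole numbers.

```python
def purchase_age_level(age):
    """Map a whole-number age to one of the four PURCHASE_AGE levels."""
    if age <= 18:
        return "0-18"
    if age <= 39:
        return "19-39"
    if age <= 65:
        return "40-65"
    return ">65"


# Hypothetical ages, including values on the level boundaries.
for age in (18, 19, 39, 40, 65, 66):
    print(age, purchase_age_level(age))
```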