Glossary :: Getting Started with SAS(R) Enterprise Miner(TM) 14.1

assessment

the process of determining how well a model predicts outputs for input data that was not used during training. Assessment statistics are automatically computed when you train a model with a modeling node. By default, assessment statistics are calculated from the validation data set.

attribute

See variable attribute.

CART

See classification and regression tree.

CHAID

See chi-squared automatic interaction detection.

champion model

the best predictive model that is chosen from a pool of candidate models in a data mining environment.

chi-squared automatic interaction detection (CHAID)

a technique for building decision trees. The CHAID technique uses the significance level of a chi-square test as the criterion for stopping tree growth.

classification and regression tree (CART)

a decision tree technique that is used for classifying or segmenting a data set. The technique provides a set of rules that can be applied to new data sets in order to predict which records will have a particular outcome. It also segments a data set by creating 2-way splits. The CART technique requires less data preparation than CHAID.

data mining database (DMDB)

a SAS data set that is designed to optimize the performance of the modeling nodes. DMDBs enhance performance by reducing the number of passes that the analytical engine needs to make through the data. Each DMDB contains a meta catalog, which includes summary statistics for numeric variables and factor-level information for categorical variables.

data mining model (model)

a formula or algorithm that computes outputs from inputs. A data mining model includes information about the conditional distribution of the target variables, given the input variables.

data set

See SAS data set.

dependent variable (response variable, experimental variable)

a variable that is observed to change in response to the independent variables. In a function y=f(x), the value of the dependent variable y is a function of the independent variable x.

depth

the number of successive hierarchical partitions of the data in a tree. The initial, undivided segment has a depth of 0.
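
As a rough illustration of this definition, the sketch below counts depth on a small tree. The nested-dictionary layout and the "children" key are assumptions made for the example, not a SAS Enterprise Miner data structure.

```python
def tree_depth(node):
    """Return the largest number of successive partitions below this node."""
    children = node.get("children", [])
    if not children:
        # An undivided segment contributes no partitions.
        return 0
    return 1 + max(tree_depth(child) for child in children)

# Hypothetical tree: the root is split once, and one branch is split again.
root_only = {"name": "all data"}
tree = {
    "name": "all data",
    "children": [
        {"name": "age <= 39"},
        {"name": "age > 39", "children": [{"name": "owner"}, {"name": "renter"}]},
    ],
}

print(tree_depth(root_only))
print(tree_depth(tree))
```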

DMDB

See data mining database.

experimental variable

See dependent variable.

Gini index

a measure of the total leaf impurity in a decision tree.
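
One common textbook form of this measure takes the impurity of a leaf as 1 minus the sum of squared class proportions, then averages over leaves weighted by leaf size. The sketch below assumes that form; this glossary does not state the exact statistic that SAS Enterprise Miner reports, and the function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def total_leaf_impurity(leaves):
    """Size-weighted average of the Gini impurity over all leaves."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini_impurity(leaf) for leaf in leaves)

# Hypothetical target values that fall into two leaves.
leaves = [["yes", "yes", "yes", "no"], ["no", "no", "no", "no"]]
print(total_leaf_impurity(leaves))
```

A pure leaf (all observations in one class) has impurity 0, so splits that produce purer leaves lower the total.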

hidden layer

in a neural network, a layer between input and output to which one or more activation functions are applied. Hidden layers are typically used to introduce nonlinearity.

imputation

the computation of replacement values for missing input values.
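
A minimal sketch of one simple imputation method, replacement by the mean of the observed values. Mean replacement is an assumption chosen for illustration (other methods exist), and missing values are represented here by None.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a replacement from; return the data unchanged.
        return list(values)
    replacement = mean(observed)
    return [replacement if v is None else v for v in values]

# Hypothetical input variable with two missing values.
income = [40.0, None, 60.0, 50.0, None]
print(impute_mean(income))
```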

input variable

a variable that is used in a data mining process to predict the value of one or more target variables.

interval variable

a continuous variable that contains values across a range. For example, a continuous variable called Temperature could have values such as 0, 32, 34, 36, 43.5, 44, 56, 80, 99, 99.9, and 100.

leaf

in a tree diagram, any segment that is not further segmented. The final leaves in a tree are called terminal nodes.

logistic regression

a form of regression analysis in which the target variable (response variable) represents a binary-level, categorical, or ordinal-level response.
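
For the binary case only, the fitted model can be sketched as a logistic function applied to a linear combination of the inputs. The intercept and coefficients below are hypothetical values, not estimates from any real data.

```python
import math

def logistic_probability(inputs, intercept, coefficients):
    """Predicted probability of the event level for a binary target."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    # Two algebraically equal forms keep math.exp from overflowing.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)

# Hypothetical fitted model with two inputs.
p = logistic_probability([2.0, 1.0], intercept=-1.0, coefficients=[0.5, 0.25])
print(p)
```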

metadata

descriptive data about data that is stored and managed in a database, in order to facilitate access to captured and archived data for further use.

MLP

See multilayer perceptron.

model

See data mining model.

multilayer perceptron (MLP)

a neural network that has one or more hidden layers, each of which has a linear combination function and executes a nonlinear activation function on the input to that layer. See also hidden layer.
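
The structure described here can be sketched as a forward pass through one hidden layer. The weights, the tanh activation, and the identity output activation are assumptions chosen for the example; they are not defaults of any particular node.

```python
import math

def layer(inputs, weights, biases, activation):
    """Apply a linear combination function, then an activation, for each unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        combined = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(combined))
    return outputs

def identity(z):
    return z

def mlp_forward(x, hidden_w, hidden_b, output_w, output_b):
    """One hidden layer with tanh activation, followed by a linear output layer."""
    hidden = layer(x, hidden_w, hidden_b, math.tanh)
    return layer(hidden, output_w, output_b, identity)

# Hypothetical weights: two inputs, two hidden units, one output.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, 1.0]]
output_b = [0.0]
result = mlp_forward([1.0, 1.0], hidden_w, hidden_b, output_w, output_b)
print(result)
```

Without the nonlinear activation in the hidden layer, the whole network would reduce to a single linear combination of the inputs.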

neural network

any of a class of models that usually consist of a large number of neurons, interconnected in complex ways and organized into layers. Examples are flexible nonlinear regression models, discriminant models, data reduction models, and nonlinear dynamic systems.

observation

a row in a SAS data set. All of the data values in an observation are associated with a single entity such as a customer or a state. Each observation contains either one data value or a missing-value indicator for each variable.

overfit

to train a model to the random variation in the sample data. Overfitted models contain too many parameters (weights), and they do not generalize well. See also underfit.

PFD

See process flow diagram.

predicted value

in a regression model, the value of a dependent variable that is calculated by evaluating the estimated regression equation for a specified set of values of the explanatory variables.

prior probability

a probability that reflects knowledge about the population before obtaining the sample on hand.

process flow diagram (PFD)

a graphical sequence of interconnected symbols that represent an ordered set of steps or tasks that, when combined, form a workflow designed to yield an analytical result.

profit matrix

a table of expected revenues and expected costs for each decision alternative for each level of a target variable.
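
One way such a table might be used is to weight each profit by the predicted probability of the target level and pick the decision with the largest expected profit. The levels, decisions, and amounts below are hypothetical.

```python
# Hypothetical profit matrix: rows are target levels, columns are decisions.
profit_matrix = {
    "responder": {"solicit": 10.0, "ignore": 0.0},
    "nonresponder": {"solicit": -1.0, "ignore": 0.0},
}

def expected_profit(probabilities, matrix, decision):
    """Sum over target levels of P(level) times the profit of the decision."""
    return sum(p * matrix[level][decision] for level, p in probabilities.items())

def best_decision(probabilities, matrix):
    """Return the decision alternative with the largest expected profit."""
    decisions = list(next(iter(matrix.values())))
    return max(decisions, key=lambda d: expected_profit(probabilities, matrix, d))

# Hypothetical predicted probabilities for one observation.
posterior = {"responder": 0.2, "nonresponder": 0.8}
print(best_decision(posterior, profit_matrix))
```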

project

a user-created GUI entity that contains the related SAS Enterprise Miner components required for the data mining models. A project contains SAS Enterprise Miner data sources, process flow diagrams, results data sets, and model packages.

pruning

the process of removing nodes from a decision tree when those nodes involve less than optimal decision rules.

response variable

See dependent variable.

root

See root node.

root node (root)

the topmost level in a hierarchical tree, representing the entire tree and its contents.

sampling

the process of subsetting a population into n cases. Sampling decreases the time required for fitting a model.
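
A minimal sketch of simple random sampling without replacement. This is only one sampling method, assumed here for illustration; the function name and the seed parameter are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases from the population, each equally likely."""
    cases = list(population)
    if n > len(cases):
        raise ValueError("n cannot exceed the population size")
    # A dedicated generator makes the draw repeatable when a seed is given.
    return random.Random(seed).sample(cases, n)

# Hypothetical population of 1,000 customer IDs.
population = range(1000)
sample = simple_random_sample(population, n=50, seed=12345)
print(len(sample))
```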

SAS data set (data set)

a file whose contents are in one of the native SAS file formats. There are two types of SAS data sets: SAS data files and SAS data views.

SAS variable (variable)

a column in a SAS data set or in a SAS data view. The data values for each variable describe a single characteristic for all observations (rows).

SEMMA

the data mining process that is used by SAS Enterprise Miner. SEMMA stands for Sample, Explore, Modify, Model, and Assess.

subdiagram

in a process flow diagram, a collection of nodes that are compressed into a single node. The use of subdiagrams can improve your control of the information flow in the diagram.

target variable

a variable whose values are known in one or more data sets that are available (in training data, for example) but whose values are unknown in one or more future data sets (in a score data set, for example). Data mining models use data from known variables to predict the values of target variables.

training data

data that contains input values and target values that are used to train and build predictive models.

transformation

in statistics, the process of applying a function to a variable in order to adjust the variable's range, variability, or both.
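
As an example, a log transformation compresses a right-skewed variable, which reduces both its range and its variability. The choice of log(1 + x) and the income values below are assumptions made for the sketch.

```python
import math
from statistics import pstdev

def log_transform(values):
    """Apply log(1 + x) to each value; defined for values greater than -1."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed variable.
incomes = [1_000.0, 10_000.0, 100_000.0, 1_000_000.0]
transformed = log_transform(incomes)

print(max(incomes) - min(incomes), pstdev(incomes))
print(max(transformed) - min(transformed), pstdev(transformed))
```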

tree structure

a type of data structure that uses the graphic analogy of a tree with branches and leaves. Each set of leaves represents an optimal segmentation of the branches above it, according to a statistical measure and the rules that govern the structure.

underfit

to train a model to only part of the actual patterns in the sample data. Underfitted models contain too few parameters (weights), and they do not generalize well. See also overfit.

validation data

data that is used to validate the suitability of a data model that was developed using training data. Both training data sets and validation data sets contain target variable values. Target variable values in the training data are used to train the model. Target variable values in the validation data set are used to compare the trained model's predictions to the known target values, which assesses the model's fit before the model is used to score new data.
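
The roles of the two data sets can be sketched as follows: partition the data, train on one part, and compare predictions to known target values on the other. The 70/30 split, the seed, and the toy threshold model are all assumptions made for illustration.

```python
import random

def partition(rows, validation_fraction=0.3, seed=12345):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

def misclassification_rate(predict, rows):
    """Fraction of rows whose predicted target differs from the known target."""
    wrong = sum(1 for x, target in rows if predict(x) != target)
    return wrong / len(rows)

def train_threshold(rows):
    """Toy training step: choose the cut point with the lowest training error."""
    candidates = sorted({x for x, _ in rows})
    return min(
        candidates,
        key=lambda t: misclassification_rate(lambda x: x >= t, rows),
    )

# Hypothetical data: one input x and a binary target that is true when x >= 5.
rows = [(x, x >= 5) for x in range(20)]
training, validation = partition(rows)
cut = train_threshold(training)
rate = misclassification_rate(lambda x: x >= cut, validation)
print("validation misclassification rate:", rate)
```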

variable

See SAS variable.

variable attribute (attribute)

any of the following characteristics that are associated with a particular variable: name, label, format, informat, data type, and length.

variable level

the set of data dimensions for binary, interval, or class variables. Binary variables have two levels. A binary variable CREDIT could have levels of 1 and 0, Yes and No, or Accept and Reject. Interval variables have levels that correspond to the number of interval variable partitions. For example, an interval variable PURCHASE_AGE might have levels of 0-18, 19-39, 40-65, and >65. Class variables have levels that correspond to the class members. For example, a class variable HOMEHEAT might have four variable levels: Coal/Wood, FuelOil, Gas, and Electric. Data mining decision and profit matrixes are composed of variable levels.
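
The PURCHASE_AGE example above can be sketched as a function that maps a value to its level. The handling of fractional ages (for example, 18.5 is placed in 19-39) is an assumption, because the entry lists only whole-number boundaries.

```python
def purchase_age_level(age):
    """Map a PURCHASE_AGE value to one of the four levels in the example."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 18:
        return "0-18"
    if age <= 39:
        return "19-39"
    if age <= 65:
        return "40-65"
    return ">65"

# Hypothetical ages for five customers.
ages = [17, 19, 39, 65, 70]
print([purchase_age_level(a) for a in ages])
```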