Glossary

assessment

the process of determining how well a model computes good outputs from input data that is not used during training. Assessment statistics are automatically computed when you train a model with a modeling node. By default, assessment statistics are calculated from the validation data set.

association discovery

the process of identifying items that occur together in a particular event or record. This technique is also known as market basket analysis. Association discovery rules are based on frequency counts of the number of times items occur alone and in combination in the database.
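
The frequency counts described above can be sketched in a few lines of Python. This is only an illustration, not how Enterprise Miner computes them; the function name and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_items_and_pairs(transactions):
    """Count how often each item, and each unordered pair of items, occurs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one record
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


# Hypothetical records, invented for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
item_counts, pair_counts = count_items_and_pairs(transactions)
```

Sorting the items in each record keeps every pair in a consistent order, so ("bread", "milk") and ("milk", "bread") are counted as the same combination.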

binary variable

a variable that contains two discrete values (for example, PURCHASE: Yes and No).

branch

a subtree that is rooted in one of the initial divisions of a segment of a tree. For example, if a rule splits a segment into seven subsets, then seven branches grow from the segment.

CART (classification and regression trees)

a decision tree technique that is used for classifying or segmenting a data set. The technique provides a set of rules that can be applied to new data sets in order to predict which records will have a particular outcome. It also segments a data set by creating 2-way splits. The CART technique requires less data preparation than CHAID.

case

a collection of information about one of many entities that are represented in a data set. A case is an observation in the data set.

CHAID (chi-squared automatic interaction detection)

a technique for building decision trees. The CHAID technique uses a specified significance level of a chi-square test as the criterion for stopping tree growth.

champion model

the best predictive model that is chosen from a pool of candidate models in a data mining environment. Candidate models are developed using various data mining heuristics and algorithm configurations. Competing models are compared and assessed using criteria such as training, validation, and test data fit and model score comparisons.

clustering

the process of dividing a data set into mutually exclusive groups such that the observations for each group are as close as possible to one another, and different groups are as far as possible from one another.

cost variable

a variable that is used to track cost in a data mining analysis.

data mining database (DMDB)

a SAS data set that is designed to optimize the performance of the modeling nodes. DMDBs enhance performance by reducing the number of passes that the analytical engine needs to make through the data. Each DMDB contains a meta catalog, which includes summary statistics for numeric variables and factor-level information for categorical variables.

data source

a data object that represents a SAS data set in the Java-based Enterprise Miner GUI. A data source contains all the metadata for a SAS data set that Enterprise Miner needs in order to use the data set in a data mining process flow diagram. The SAS data set metadata that is required to create an Enterprise Miner data source includes the name and location of the data set, the SAS code that is used to define its library path, and the variable roles, measurement levels, and associated attributes that are used in the data mining process.

data subdirectory

a subdirectory within the Enterprise Miner project location. The data subdirectory contains files that are created when you run process flow diagrams in an Enterprise Miner project.

decile

any of the nine points that divide the values of a variable into ten groups of equal frequency, or any of those groups.
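
As a rough illustration, the nine cut points and the ten resulting groups can be computed with the Python standard library. The sample values are hypothetical, and `statistics.quantiles` uses its own default interpolation method, which may differ from the one used by SAS procedures.

```python
import statistics
from bisect import bisect_left
from collections import Counter

# Hypothetical data: the integers 1 through 100.
values = list(range(1, 101))

# n=10 returns the nine cut points that separate ten groups.
cut_points = statistics.quantiles(values, n=10)

# Assign each value to a group numbered 0 through 9.
groups = [bisect_left(cut_points, v) for v in values]
group_sizes = Counter(groups)
```

With this evenly spaced sample, each of the ten groups receives the same number of values.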

dependent variable

a variable whose value is determined by the value of another variable or by the values of a set of variables.

depth

the number of successive hierarchical partitions of the data in a tree. The initial, undivided segment has a depth of 0.
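
A small sketch of this convention, assuming a tree stored as nested dictionaries (a hypothetical representation chosen for illustration): the root segment has depth 0 and each further partition adds 1.

```python
def leaf_depths(segment, depth=0):
    """Return (name, depth) for every segment that is not split further."""
    children = segment.get("children", [])
    if not children:
        return [(segment["name"], depth)]
    result = []
    for child in children:
        result.extend(leaf_depths(child, depth + 1))
    return result


# Hypothetical tree: the root splits into A and B, and B splits into C and D.
tree = {
    "name": "root",
    "children": [
        {"name": "A"},
        {"name": "B", "children": [{"name": "C"}, {"name": "D"}]},
    ],
}
depths = leaf_depths(tree)
```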

diagram

See process flow diagram.

format

a pattern or set of instructions that SAS uses to determine how the values of a variable (or column) should be written or displayed. SAS provides a set of standard formats and also enables you to define your own formats.

generalization

the computation of accurate outputs, using input data that was not used during training.

hidden layer

in a neural network, a layer between input and output to which one or more activation functions are applied. Hidden layers are typically used to introduce nonlinearity.

hidden neuron

in a feed-forward, multilayer neural network, a neuron that is in one or more of the hidden layers that exist between the input and output neuron layers. The size of a neural network depends largely on the number of layers and on the number of hidden units per layer. See also hidden layer.

hold-out data

a portion of the historical data that is set aside during model development. Hold-out data can be used as test data to benchmark the fit and accuracy of the emerging predictive model. See also model.

imputation

the computation of replacement values for missing input values.
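
One simple possibility is to replace each missing value with the mean of the observed values. The sketch below assumes that missing values are represented by `None` and that the mean is an acceptable replacement; both are illustrative assumptions, and other replacement rules are equally valid.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


completed = impute_mean([1.0, None, 3.0])
```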

input variable

a variable that is used in a data mining process to predict the value of one or more target variables.

interval variable

a continuous variable that contains values across a range. For example, a continuous variable called Temperature could have values such as 0, 32, 34, 36, 43.5, 44, 56, 80, 99, 99.9, and 100.

leaf

in a tree diagram, any segment that is not further segmented. The final leaves in a tree are called terminal nodes.

level

a successive hierarchical partition of data in a tree. The first level represents the entire unpartitioned data set. The second level represents the first partition of the data into segments, and so on.

libref (library reference)

a name that is temporarily associated with a SAS library. The complete name of a SAS file consists of two words, separated by a period. The libref, which is the first word, indicates the library. The second word is the name of the specific SAS file. For example, in VLIB.NEWBDAY, the libref VLIB tells SAS which library contains the file NEWBDAY. You assign a libref with a LIBNAME statement or with an operating system command.

lift

in association analyses and sequence analyses, a calculation that is equal to the confidence factor divided by the expected confidence. See also confidence, expected confidence.
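
The ratio can be sketched as follows. This glossary does not define confidence or expected confidence, so the sketch assumes the conventional definitions for a rule A => B: confidence is the share of records containing A that also contain B, and expected confidence is the share of all records that contain B. The function name and sample records are hypothetical.

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent => consequent (assumed definitions)."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_b = sum(1 for t in transactions if consequent in t)
    n_ab = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n == 0 or n_a == 0 or n_b == 0:
        raise ValueError("lift is undefined when an item never occurs")
    confidence = n_ab / n_a          # assumed: P(B | A)
    expected_confidence = n_b / n    # assumed: P(B)
    return confidence / expected_confidence


# Hypothetical records, invented for illustration only.
records = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
bread_milk_lift = lift(records, "bread", "milk")
```

Under these assumptions, a lift greater than 1 means that the two items occur together more often than the frequency of the consequent alone would suggest, and a lift less than 1 means that they occur together less often.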

logistic regression

a form of regression analysis in which the target variable (response variable) represents a binary-level or ordinal-level response.

macro variable

a variable that is part of the SAS macro programming language. The value of a macro variable is a string that remains constant until you change it. Macro variables are sometimes referred to as symbolic variables.

measurement

the process of assigning numbers to an object in order to quantify, rank, or scale an attribute of the object.

measurement level

a classification that describes the type of data that a variable contains. The most common measurement levels for variables are nominal, ordinal, interval, log-interval, ratio, and absolute. See also interval variable, nominal variable, ordinal variable.

metadata

a description or definition of data or information.

metadata sample

a sample of the input data source that is downloaded to the client and that is used throughout SAS Enterprise Miner to determine meta information about the data, such as number of variables, variable roles, variable status, variable level, variable type, and variable label.

model

a formula or algorithm that computes outputs from inputs. A data mining model includes information about the conditional distribution of the target variables, given the input variables.

multilayer perceptron (MLP)

a neural network that has one or more hidden layers, each of which has a linear combination function and executes a nonlinear activation function on the input to that layer. See also hidden layer.
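
A minimal forward pass through a single hidden layer might look like the following sketch. The choice of tanh as the activation function, the identity output layer, and all weights are assumptions made for illustration.

```python
import math


def dense(inputs, weights, biases, activation):
    """Apply a linear combination per neuron, then the activation function."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with tanh, followed by a linear output layer."""
    hidden = dense(inputs, hidden_w, hidden_b, math.tanh)
    return dense(hidden, out_w, out_b, lambda z: z)


# Hypothetical weights: one hidden neuron that looks only at the first input.
output = mlp_forward(
    inputs=[1.0, 5.0],
    hidden_w=[[1.0, 0.0]],
    hidden_b=[0.0],
    out_w=[[2.0]],
    out_b=[0.0],
)
```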

neural networks

a class of flexible nonlinear regression models, discriminant models, data reduction models, and nonlinear dynamic systems that often consist of a large number of neurons. These neurons are usually interconnected in complex ways and are often organized into layers. See also neuron.

node

(1) in the SAS Enterprise Miner user interface, a graphical object that represents a data mining task in a process flow diagram. The statistical tools that perform the data mining tasks are called nodes when they are placed on a data mining process flow diagram. Each node performs a mathematical or graphical operation as a component of an analytical and predictive data model. (2) in a neural network, a linear or nonlinear computing element that accepts one or more inputs, computes a function of the inputs, and optionally directs the result to one or more other neurons. Nodes are also known as neurons or units. (3) a leaf in a tree diagram. The terms leaf, node, and segment are closely related and sometimes refer to the same part of a tree. See also process flow diagram, internal node.

nominal variable

a variable that contains discrete values that do not have a logical order. For example, a nominal variable called Vehicle could have values such as car, truck, bus, and train.

numeric variable

a variable that contains only numeric values and related symbols, such as decimal points, plus signs, and minus signs.

observation

a row in a SAS data set. All of the data values in an observation are associated with a single entity such as a customer or a state. Each observation contains either one data value or a missing-value indicator for each variable.

partition

to divide available data into training, validation, and test data sets.
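
A simple random partition can be sketched as follows. The 60/20/20 proportions, the seed value, and the function name are hypothetical choices and are not Enterprise Miner defaults.

```python
import random


def partition(rows, train_frac=0.6, valid_frac=0.2, seed=12345):
    """Shuffle rows, then split them into training, validation, and test sets."""
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test


rows = list(range(100))
train, valid, test = partition(rows)
```

Every row lands in exactly one of the three sets, so the sets are mutually exclusive and together cover the original data.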

perceptron

a linear or nonlinear neural network with or without one or more hidden layers.

predicted value

in a regression model, the value of a dependent variable that is calculated by evaluating the estimated regression equation for a specified set of values of the explanatory variables.

process flow diagram

a graphical representation of the various data mining tasks that are performed by individual Enterprise Miner nodes during a data mining analysis. A process flow diagram consists of two or more individual nodes that are connected in the order in which the data miner wants the corresponding statistical operations to be performed.

profit matrix

a table of expected revenues and expected costs for each decision alternative for each level of a target variable.
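
As an illustration of how such a table might be used, the sketch below picks the decision with the highest expected profit, given predicted probabilities for each target level. The target levels, decisions, and profit values are hypothetical, and selecting the maximum expected profit is one common use rather than something this definition prescribes.

```python
# Hypothetical net profit for each target level (outer key) and decision (inner key).
profit_matrix = {
    "respond": {"solicit": 10.0, "ignore": 0.0},
    "no_response": {"solicit": -1.0, "ignore": 0.0},
}


def best_decision(probabilities, matrix):
    """Return the decision with the highest expected profit, plus all expectations."""
    decisions = next(iter(matrix.values())).keys()
    expected = {
        d: sum(probabilities[level] * matrix[level][d] for level in matrix)
        for d in decisions
    }
    return max(expected, key=expected.get), expected


decision, expected = best_decision(
    {"respond": 0.2, "no_response": 0.8}, profit_matrix
)
```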

project

a collection of Enterprise Miner process flow diagrams. See also process flow diagram.

root node

the initial segment of a tree. The root node represents the entire data set that is submitted to the tree, before any splits are made.

rule

See association analysis rule, sequence analysis rule, tree splitting rule.

sampling

the process of subsetting a population into n cases. The reason for sampling is to decrease the time required for fitting a model.

SAS data set

a file whose contents are in one of the native SAS file formats. There are two types of SAS data sets: SAS data files and SAS data views. SAS data files contain data values in addition to descriptor information that is associated with the data. SAS data views contain only the descriptor information plus other information that is required for retrieving data values from other SAS data sets or from files whose contents are in other software vendors' file formats.

scoring

the process of applying a model to new data in order to compute outputs. Scoring is the last process that is performed in data mining.

seed

an initial value from which a random number function or CALL routine calculates a random value.
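
The effect of a seed can be seen with any pseudorandom number generator. The sketch below uses Python's `random` module rather than a SAS random number function, purely for illustration; the seed values are arbitrary.

```python
import random


def draw(seed, n=3):
    """Return n pseudorandom numbers generated from the given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = draw(42)
second = draw(42)   # same seed, same sequence
other = draw(43)    # different seed, different sequence
```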

segmentation

the process of dividing a population into sub-populations of similar individuals. Segmentation can be supervised (using a target variable and various techniques, including decision trees) or unsupervised (using clustering or a Kohonen network). See also Kohonen network.

self-organizing map

See SOM (self-organizing map).

SEMMA

the data mining process that is used by Enterprise Miner. SEMMA stands for Sample, Explore, Modify, Model, and Assess.

sequence variable

a variable whose value is a time stamp that is used to determine the sequence in which two or more events occurred.

SOM (self-organizing map)

a competitive learning neural network that is used for clustering, visualization, and abstraction. A SOM classifies the parameter space into multiple clusters, while at the same time organizing the clusters into a map that is based on the relative distances between clusters. See also Kohonen network.

target variable

a variable whose values are known in one or more data sets that are available (in training data, for example) but whose values are unknown in one or more future data sets (in a score data set, for example). Data mining models use data from known variables to predict the values of target variables.

test data

currently available data that contains input values and target values that are not used during training, but which instead are used to estimate how well a model generalizes and to compare models.

training

the process of computing good values for the weights in a model.

training data

currently available data that contains input values and target values that are used for model training.

transformation

the process of applying a function to a variable in order to adjust the variable's range, variability, or both.
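
Standardization is one example of such a function: it shifts a variable to a mean of 0 and rescales it to a standard deviation of 1. The sketch below is illustrative only; the function name and data are hypothetical, and the population standard deviation is an arbitrary choice.

```python
import statistics


def standardize(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


# Hypothetical data with mean 5 and population standard deviation 2.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```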

tree

the complete set of rules that are used to split data into a hierarchy of successive segments. A tree consists of branches and leaves, in which each set of leaves represents an optimal segmentation of the branches above them according to a statistical measure.

validation data

data that is used to validate the suitability of a data model that was developed using training data. Both training data sets and validation data sets contain target variable values. Target variable values in the training data are used to train the model. Target variable values in the validation data set are used to compare the training model's predictions to the known target values, assessing the model's fit before using the model to score new data.
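
One way to make that comparison concrete is a misclassification rate computed on the validation data. The sketch below is a hypothetical example; Enterprise Miner reports its own set of assessment statistics.

```python
def misclassification_rate(actual, predicted):
    """Share of validation cases whose prediction differs from the known target."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical known target values and model predictions.
known = ["Yes", "No", "Yes", "No"]
predicted = ["Yes", "Yes", "Yes", "No"]
rate = misclassification_rate(known, predicted)
```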

variable

a column in a SAS data set or in a SAS data view. The data values for each variable describe a single characteristic for all observations. Each SAS variable can have the following attributes: name, data type (character or numeric), length, format, informat, and label.

variable attribute

any of the following characteristics that are associated with a particular variable: name, label, format, informat, data type, and length.

variable level

the set of data dimensions for binary, interval, or class variables. Binary variables have two levels. A binary variable CREDIT could have levels of 1 and 0, Yes and No, or Accept and Reject. Interval variables have levels that correspond to the number of interval variable partitions. For example, an interval variable PURCHASE_AGE might have levels of 0-18, 19-39, 40-65, and >65. Class variables have levels that correspond to the class members. For example, a class variable HOMEHEAT might have four variable levels: Coal/Wood, FuelOil, Gas, and Electric. Data mining decision and profit matrixes are composed of variable levels.
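
The PURCHASE_AGE example above can be sketched as a small function that maps an age to its level. The function name is hypothetical, and the boundaries are taken directly from the example.

```python
def purchase_age_level(age):
    """Map an age to one of the four PURCHASE_AGE levels from the example."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 18:
        return "0-18"
    if age <= 39:
        return "19-39"
    if age <= 65:
        return "40-65"
    return ">65"


levels = [purchase_age_level(a) for a in (12, 19, 65, 70)]
```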
