Glossary
- assessment
-
the process of determining how well a model computes
accurate outputs from input data that was not used during training. Assessment
statistics are automatically computed when you train a model with
a modeling node. By default, assessment statistics are calculated
from the validation data set.
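- attribute
-
See variable attribute.
- CART
-
See classification and regression tree.
- CHAID
-
See chi-squared automatic interaction detection.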
- champion model
-
the best predictive model that is chosen from
a pool of candidate models in a data mining environment.
- chi-squared automatic interaction detection (CHAID)
-
a technique for building decision trees. The CHAID
technique specifies a significance level of a chi-square test to stop
tree growth.
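The stopping rule can be illustrated with a minimal sketch that covers only the simplest case, a 2x2 table of counts (candidate branch by binary target level). The function names and the default significance level of 0.05 are assumptions made for this illustration; they are not SAS Enterprise Miner settings, and the sketch omits every other part of the CHAID technique. It also assumes that no row or column total is zero.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def should_split(table, alpha=0.05):
    """Keep growing the tree only if the split is significant at level alpha."""
    return p_value_1df(chi_square_2x2(table)) < alpha

# Hypothetical counts: rows are the two candidate branches,
# columns are the two target levels.
print(should_split([[30, 10], [10, 30]]))  # True
print(should_split([[20, 20], [20, 20]]))  # False
```

A stricter (smaller) significance level stops growth earlier and yields a smaller tree.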
- classification and regression tree (CART)
-
a decision tree technique that is used for classifying
or segmenting a data set. The technique provides a set of rules that
can be applied to new data sets in order to predict which records
will have a particular outcome. It also segments a data set by creating
two-way (binary) splits. The CART technique requires less data preparation than
CHAID.
- data mining database (DMDB)
-
a SAS data set that is designed to optimize the
performance of the modeling nodes. DMDBs enhance performance by reducing
the number of passes that the analytical engine needs to make through
the data. Each DMDB contains a meta catalog, which includes summary
statistics for numeric variables and factor-level information for
categorical variables.
- data mining model (model)
-
a formula or algorithm that computes outputs from
inputs. A data mining model includes information about the conditional
distribution of the target variables, given the input variables.
- data set
-
See SAS data set.
- dependent variable (response variable, experimental variable)
-
a variable that is observed to change in response
to the independent variables. In a function y=f(x), the value of the
dependent variable y is a function of the independent variable x.
- depth
-
the number of successive hierarchical partitions
of the data in a tree. The initial, undivided segment has a depth
of 0.
- DMDB
-
See data mining database.
- experimental variable
-
See dependent variable.
- Gini index
-
a measure of the total leaf impurity in a decision
tree.
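As a rough illustration, one common formulation computes the impurity of a single leaf as 1 minus the sum of the squared class proportions, and the total as the size-weighted average over all leaves. The function names below are hypothetical, and this sketch does not claim to reproduce the exact computation that SAS Enterprise Miner performs.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def total_leaf_impurity(leaves):
    """Average of the leaf impurities, weighted by the number of cases per leaf."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini_impurity(leaf) for leaf in leaves)

# Two hypothetical leaves: one mixed, one pure.
leaves = [["yes", "yes", "yes", "no"], ["no", "no", "no", "no"]]
print(gini_impurity(leaves[0]))     # 0.375
print(total_leaf_impurity(leaves))  # 0.1875
```

A pure leaf (all cases in one class) has an impurity of 0, so lower totals indicate cleaner segmentation.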
- hidden layer
-
in a neural network, a layer between input and
output to which one or more activation functions are applied. Hidden
layers are typically used to introduce nonlinearity.
- imputation
-
the computation of replacement values for missing
input values.
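A minimal sketch of one simple imputation method, replacing each missing value with the mean of the observed values, is shown here. Representing missing values as None and the function name impute_mean are assumptions made for this illustration; other replacement strategies (median, mode, model-based) follow the same pattern.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([10.0, None, 20.0, None]))  # [10.0, 15.0, 20.0, 15.0]
```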
- input variable
-
a variable that is used in a data mining process
to predict the value of one or more target variables.
- interval variable
-
a continuous variable that contains values across
a range. For example, a continuous variable called Temperature could
have values such as 0, 32, 34, 36, 43.5, 44, 56, 80, 99, 99.9, and
100.
- leaf
-
in a tree diagram, any segment that is not further
segmented. The final leaves in a tree are called terminal nodes.
- logistic regression
-
a form of regression analysis in which the target
variable (response variable) represents a binary-level, categorical,
or ordinal-level response.
- metadata
-
descriptive data about data that is stored and
managed in a database, in order to facilitate access to captured and
archived data for further use.
- MLP
-
See multilayer perceptron.
- model
-
See data mining model.
- multilayer perceptron (MLP)
-
a neural network that has one or more hidden layers,
each of which has a linear combination function and executes a nonlinear
activation function on the input to that layer.
See also hidden layer.
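The definition can be sketched as a forward pass through one hidden layer. The choice of tanh as the activation function, the identity function on the output layer, and all names and weights here are assumptions made for this illustration only.

```python
import math

def layer(inputs, weights, biases, activation):
    """Apply a linear combination per unit, then the activation function."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]

def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with a tanh activation, followed by a linear output layer."""
    hidden = layer(inputs, hidden_w, hidden_b, math.tanh)
    return layer(hidden, out_w, out_b, lambda z: z)

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
output = mlp_forward(
    inputs=[0.5, -1.0],
    hidden_w=[[1.0, 0.5], [-0.5, 1.0]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -1.0]],
    out_b=[0.2],
)
print(output)
```

Without the nonlinear activation in the hidden layer, the whole network would collapse into a single linear combination of the inputs.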
- neural network
-
any of a class of models that usually consist
of a large number of neurons, interconnected in complex ways and organized
into layers. Examples are flexible nonlinear regression models, discriminant
models, data reduction models, and nonlinear dynamic systems.
- observation
-
a row in a SAS data set. All of the data values
in an observation are associated with a single entity such as a customer
or a state. Each observation contains either one data value or a missing-value
indicator for each variable.
- overfit
-
to train a model to the random variation in the
sample data. Overfitted models contain too many parameters (weights),
and they do not generalize well.
See also underfit.
- PFD
-
See process flow diagram.
- predicted value
-
in a regression model, the value of a dependent
variable that is calculated by evaluating the estimated regression
equation for a specified set of values of the explanatory variables.
- prior probability
-
a probability that reflects knowledge about the
population before obtaining the sample on hand.
- process flow diagram (PFD)
-
a graphical sequence of interconnected symbols
that represent an ordered set of steps or tasks that, when combined,
form a workflow designed to yield an analytical result.
- profit matrix
-
a table of expected revenues and expected costs
for each decision alternative for each level of a target variable.
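To show how such a table might be used, the following sketch weights each entry by the predicted probability of the target level and selects the decision with the highest expected profit. The target levels, decisions, and dollar values are hypothetical, as are the function names.

```python
# Hypothetical profit matrix: target level -> decision -> profit.
profit_matrix = {
    "responder": {"mail": 10.0, "skip": 0.0},
    "non-responder": {"mail": -1.0, "skip": 0.0},
}

def expected_profit(posteriors, decision, matrix):
    """Sum of profit per target level, weighted by that level's probability."""
    return sum(p * matrix[level][decision] for level, p in posteriors.items())

def best_decision(posteriors, matrix):
    """Decision alternative with the highest expected profit."""
    decisions = next(iter(matrix.values())).keys()
    return max(decisions, key=lambda d: expected_profit(posteriors, d, matrix))

print(best_decision({"responder": 0.20, "non-responder": 0.80}, profit_matrix))  # mail
print(best_decision({"responder": 0.05, "non-responder": 0.95}, profit_matrix))  # skip
```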
- project
-
a user-created GUI entity that contains the related
SAS Enterprise Miner components required for the data mining models.
A project contains SAS Enterprise Miner data sources, process flow
diagrams, results data sets, and model packages.
- pruning
-
the process of removing nodes from a decision
tree when those nodes involve less than optimal decision rules.
- response variable
-
See dependent variable.
- root
-
See root node.
- root node (root)
-
the topmost level in a hierarchical tree, representing
the entire tree and its contents.
- sampling
-
the process of subsetting a population into n
cases. Sampling decreases the time required for fitting a model.
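A minimal sketch of simple random sampling without replacement follows. The function name and the fixed seed are assumptions made for this illustration; the seed only makes the draw reproducible.

```python
import random

def simple_random_sample(population, n, seed=12345):
    """Draw n distinct cases from the population, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = list(range(1, 1001))  # hypothetical case IDs
sample = simple_random_sample(population, n=50)
print(len(sample))  # 50
```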
- SAS data set (data set)
-
a file whose contents are in one of the native
SAS file formats. There are two types of SAS data sets: SAS data files
and SAS data views.
- SAS variable (variable)
-
a column in a SAS data set or in a SAS data view.
The data values for each variable describe a single characteristic
for all observations (rows).
- SEMMA
-
the data mining process that is used by SAS Enterprise
Miner. SEMMA stands for Sample, Explore, Modify, Model, and Assess.
- subdiagram
-
in a process flow diagram, a collection of nodes
that are compressed into a single node. The use of subdiagrams can
improve your control of the information flow in the diagram.
- target variable
-
a variable whose values are known in one or more
data sets that are available (in training data, for example) but whose
values are unknown in one or more future data sets (in a score data
set, for example). Data mining models use data from known variables
to predict the values of target variables.
- training data
-
data that contains input values and target values
that are used to train and build predictive models.
- transformation
-
in statistics, the process of applying a function
to a variable in order to adjust the variable's range, variability,
or both.
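Two common transformations can be sketched as follows: a log transformation, which compresses the range of a right-skewed variable, and standardization, which rescales a variable to a mean of 0 and a standard deviation of 1. The function names are hypothetical, and the standardization sketch assumes that the variable is not constant.

```python
import math
from statistics import mean, pstdev

def log_transform(values):
    """Compress the range of nonnegative values by using log(1 + x)."""
    return [math.log1p(v) for v in values]

def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]

incomes = [20000.0, 35000.0, 50000.0, 1000000.0]  # hypothetical values
print(log_transform(incomes))
print(standardize(incomes))
```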
- tree structure
-
a type of data structure that uses the graphic
analogy of a tree with branches and leaves. Each set of leaves represents
an optimal segmentation of the branches above it, according to a statistical
measure and the rules that govern the structure.
- underfit
-
to train a model to capture only some of the actual patterns
in the sample data. Underfit models contain too few parameters (weights),
and they do not generalize well.
See also overfit.
- validation data
-
data that is used to validate the suitability
of a data model that was developed using training data. Both training
data sets and validation data sets contain target variable values.
Target variable values in the training data are used to train the
model. Target variable values in the validation data set are used
to compare the training model's predictions to the known target values,
assessing the model's fit before using the model to score new data.
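The comparison of predictions to known target values can be sketched with one simple fit statistic, the misclassification rate. The function name and the example values are hypothetical, and this is only one of many statistics that could be computed on validation data.

```python
def misclassification_rate(predicted, actual):
    """Fraction of validation cases whose prediction differs from the known target."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be nonempty and of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

# Hypothetical validation results for a binary target.
actual = ["yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "yes"]
print(misclassification_rate(predicted, actual))  # 0.25
```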
- variable
-
See SAS variable.
- variable attribute (attribute)
-
any of the following characteristics that are
associated with a particular variable: name, label, format, informat,
data type, and length.
- variable level
-
the set of data dimensions for binary, interval,
or class variables. Binary variables have two levels. A binary variable
CREDIT could have levels of 1 and 0, Yes and No, or Accept and Reject.
Interval variables have levels that correspond to the number of interval
variable partitions. For example, an interval variable PURCHASE_AGE
might have levels of 0-18, 19-39, 40-65, and >65. Class variables
have levels that correspond to the class members. For example, a class
variable HOMEHEAT might have four variable levels: Coal/Wood, FuelOil,
Gas, and Electric. Data mining decision and profit matrixes are composed
of variable levels.
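The PURCHASE_AGE example can be sketched as a function that maps a value to its level. The function name is hypothetical, and the sketch assumes whole-number ages so that the partitions 0-18, 19-39, 40-65, and >65 leave no gaps.

```python
def purchase_age_level(age):
    """Map a whole-number age to one of the four PURCHASE_AGE levels."""
    if age < 0:
        raise ValueError("age must be nonnegative")
    if age <= 18:
        return "0-18"
    if age <= 39:
        return "19-39"
    if age <= 65:
        return "40-65"
    return ">65"

print([purchase_age_level(a) for a in (18, 19, 65, 66)])
# ['0-18', '19-39', '40-65', '>65']
```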
Copyright © SAS Institute Inc. All rights reserved.