Glossary
- assessment
-
the process of determining how well a model computes
good outputs from input data that is not used during training. Assessment
statistics are automatically computed when you train a model with
a modeling node. By default, assessment statistics are calculated
from the validation data set.
- champion model
-
the best predictive model that is chosen from
a pool of candidate models in a data mining environment. Candidate
models are developed using various data mining heuristics and algorithm
configurations. Competing models are compared and assessed using criteria
such as training, validation, and test data fit and model score comparisons.
- chi-squared automatic interaction detection
-
a technique for building decision trees. The CHAID
technique uses the significance level of a chi-square test as the
criterion for stopping tree growth. Short form: CHAID.
- classification and regression trees
-
a decision tree technique that is used for classifying
or segmenting a data set. The technique provides a set of rules that
can be applied to new data sets in order to predict which records
will have a particular outcome. It also segments a data set by creating
two-way splits. The CART technique requires less data preparation than
CHAID. Short form: CART.
- data mining database
-
a SAS data set that is designed to optimize the
performance of the modeling nodes. DMDBs enhance performance by reducing
the number of passes that the analytical engine needs to make through
the data. Each DMDB contains a meta catalog, which includes summary
statistics for numeric variables and factor-level information for
categorical variables. Short form: DMDB.
- data source
-
a data object that represents a SAS data set in
the Java-based Enterprise Miner GUI. A data source contains all the
metadata for a SAS data set that Enterprise Miner needs in order to
use the data set in a data mining process flow diagram. The SAS data
set metadata that is required to create an Enterprise Miner data source
includes the name and location of the data set, the SAS code that
is used to define its library path, and the variable roles, measurement
levels, and associated attributes that are used in the data mining
process.
- decision tree
-
the complete set of rules that are used to split
data into a hierarchy of successive segments. A tree consists of branches
and leaves, in which each set of leaves represents an optimal segmentation
of the branches above them according to a statistical measure.
- dependent variable
-
a variable whose value is determined by the value
of another variable or by the values of a set of variables.
- depth
-
the number of successive hierarchical partitions
of the data in a tree. The initial, undivided segment has a depth
of 0.
- Gini index
-
a measure of the total leaf impurity in a decision
tree.
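The entry above does not give a formula, so the following is only a
minimal sketch under a common assumption: the Gini impurity of a leaf
is taken as 1 minus the sum of squared class proportions, and the total
leaf impurity as the size-weighted average over all leaves. The function
names and sample leaves are hypothetical, and Enterprise Miner's own
computation may differ in detail.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def total_leaf_impurity(leaves):
    """Size-weighted average of the Gini impurity over all leaves (assumed weighting)."""
    total = sum(len(leaf) for leaf in leaves)
    if total == 0:
        return 0.0
    return sum(len(leaf) / total * gini_impurity(leaf) for leaf in leaves)


# Hypothetical leaves holding the target values of their observations.
pure_leaf = ["yes", "yes", "yes", "yes"]
mixed_leaf = ["yes", "no", "yes", "no"]

print(gini_impurity(pure_leaf))                      # 0.0
print(gini_impurity(mixed_leaf))                     # 0.5
print(total_leaf_impurity([pure_leaf, mixed_leaf]))  # 0.25
```

A pure leaf scores 0, and an evenly mixed two-class leaf scores 0.5,
which is the maximum for a binary target.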
- hidden layer
-
in a neural network, a layer between input and
output to which one or more activation functions are applied. Hidden
layers are typically used to introduce nonlinearity.
- imputation
-
the computation of replacement values for missing
input values.
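As an illustration of the imputation entry above, here is a minimal
sketch of one simple method, mean imputation for a numeric input. The
choice of the mean, the function name, and the sample values are
assumptions made for the example; other replacement methods are
equally valid.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical input variable with two missing values.
income = [40.0, None, 60.0, 50.0, None]
print(impute_mean(income))  # [40.0, 50.0, 60.0, 50.0, 50.0]
```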
- input variable
-
a variable that is used in a data mining process
to predict the value of one or more target variables.
- interval variable
-
a continuous variable that contains values across
a range. For example, a continuous variable called Temperature could
have values such as 0, 32, 34, 36, 43.5, 44, 56, 80, 99, 99.9, and
100.
- leaf
-
in a tree diagram, any segment that is not further
segmented. The final leaves in a tree are called terminal nodes.
- logistic regression
-
a form of regression analysis in which the target
variable (response variable) represents a binary-level or ordinal-level
response.
- metadata
-
a description or definition of data or information.
- MLP
-
See multilayer perceptron.
- model
-
a formula or algorithm that computes outputs from
inputs. A data mining model includes information about the conditional
distribution of the target variables, given the input variables.
- multilayer perceptron
-
a neural network that has one or more hidden layers,
each of which has a linear combination function and executes a nonlinear
activation function on the input to that layer. Short form: MLP. See
also hidden layer.
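The multilayer perceptron entry above can be sketched as a forward
pass through one hidden layer. This is an illustration only: the tanh
hidden activation, the identity output activation, the weights, and the
function names are all assumptions chosen for the example.

```python
import math


def layer(inputs, weights, biases, activation):
    """Apply a linear combination function, then an activation, for each unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with tanh activation, then a linear output layer."""
    hidden = layer(inputs, hidden_w, hidden_b, math.tanh)
    return layer(hidden, out_w, out_b, lambda z: z)


# Hypothetical weights: two inputs, two hidden units, one output.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 1.0]]
out_b = [0.0]

print(mlp_forward([0.0, 0.0], hidden_w, hidden_b, out_w, out_b))  # [0.0]
print(mlp_forward([1.0, 1.0], hidden_w, hidden_b, out_w, out_b))
```

Without the nonlinear activation in the hidden layer, the whole network
would collapse to a single linear combination of the inputs.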
- neural networks
-
a class of flexible nonlinear regression models,
discriminant models, data reduction models, and nonlinear dynamic
systems that often consist of a large number of neurons. These neurons
are usually interconnected in complex ways and are often organized
into layers. See also neuron.
- node
-
(1) in the SAS Enterprise Miner user interface,
a graphical object that represents a data mining task in a process
flow diagram. The statistical tools that perform the data mining tasks
are called nodes when they are placed on a data mining process flow
diagram. Each node performs a mathematical or graphical operation
as a component of an analytical and predictive data model. (2) in
a neural network, a linear or nonlinear computing element that accepts
one or more inputs, computes a function of the inputs, and can direct
the result to one or more other neurons. Nodes are also known as neurons
or units. (3) a leaf in a tree diagram. The terms leaf, node, and
segment are closely related and sometimes refer to the same part of
a tree. See also process flow diagram and internal node.
- observation
-
a row in a SAS data set. All of the data values
in an observation are associated with a single entity such as a customer
or a state. Each observation contains either one data value or a missing-value
indicator for each variable.
- overfit
-
to train a model so closely that it fits the random variation in the
sample data. Overfitted models contain too many parameters (weights),
and they do not generalize well. See also underfit.
- partition
-
to divide available data into training, validation,
and test data sets.
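A minimal sketch of partitioning, assuming a simple random split. The
40/30/30 proportions, the seed, and the function name are hypothetical
choices for the example, not settings taken from this glossary.

```python
import random


def partition(rows, train=0.4, validate=0.3, seed=12345):
    """Shuffle rows and split them into training, validation, and test lists."""
    if train < 0 or validate < 0 or train + validate > 1.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # the remainder becomes the test set
    )


train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 40 30 30
```

Every observation lands in exactly one of the three sets, so the
validation and test data remain unseen during training.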
- PFD
-
See process flow diagram.
- predicted value
-
in a regression model, the value of a dependent
variable that is calculated by evaluating the estimated regression
equation for a specified set of values of the explanatory variables.
- prior probability
-
a probability that reflects knowledge about the
population before obtaining the sample on hand.
- process flow diagram
-
a graphical representation of the various data
mining tasks that are performed by individual Enterprise Miner nodes
during a data mining analysis. A process flow diagram consists of
two or more individual nodes that are connected in the order in which
the data miner wants the corresponding statistical operations to be
performed. Short form: PFD.
- profit matrix
-
a table of expected revenues and expected costs
for each decision alternative for each level of a target variable.
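To show how a profit matrix might be used, the sketch below picks the
decision with the highest expected profit for one observation. It
assumes that each cell already holds net profit (revenue minus cost)
and that the model supplies a probability for each target level. The
target levels, decisions, and amounts are hypothetical.

```python
def expected_profits(posterior, profit_matrix):
    """Expected profit of each decision, weighted by target-level probabilities."""
    decisions = next(iter(profit_matrix.values())).keys()
    return {
        d: sum(posterior[level] * profit_matrix[level][d] for level in posterior)
        for d in decisions
    }


# Hypothetical profit matrix: rows are target levels, columns are decisions.
profit_matrix = {
    "responder": {"mail": 10.0, "skip": 0.0},
    "nonresponder": {"mail": -1.0, "skip": 0.0},
}

# Hypothetical predicted probabilities for one observation.
posterior = {"responder": 0.2, "nonresponder": 0.8}

profits = expected_profits(posterior, profit_matrix)
best_decision = max(profits, key=profits.get)
print(best_decision)  # mail
```

Even with only a 20% chance of response, mailing is the better decision
here because the gain from a responder outweighs the cost of a
nonresponder.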
- project
-
a collection of Enterprise Miner process flow
diagrams. See also process flow diagram.
- pruning
-
the process of removing nodes from a decision
tree when those nodes involve less than optimal decision rules.
- root node
-
the initial segment of a tree. The root node represents
the entire data set that is submitted to the tree, before any splits
are made.
- sampling
-
the process of selecting a subset of n cases from
a population. Sampling decreases the time required for fitting a model.
- SAS data set
-
a file whose contents are in one of the native
SAS file formats. There are two types of SAS data sets: SAS data files
and SAS data views. SAS data files contain data values in addition
to descriptor information that is associated with the data. SAS data
views contain only the descriptor information plus other information
that is required for retrieving data values from other SAS data sets
or from files that are stored in other software vendors' file formats.
- scoring
-
the process of applying a model to new data in
order to compute outputs. Scoring is the last process that is performed
in data mining.
- SEMMA
-
the data mining process that is used by Enterprise
Miner. SEMMA stands for Sample, Explore, Modify, Model, and Assess.
- subdiagram
-
in a process flow diagram, a collection of nodes
that are compressed into a single node. The use of subdiagrams can
improve your control of the information flow in the diagram.
- target variable
-
a variable whose values are known in one or more
data sets that are available (in training data, for example) but whose
values are unknown in one or more future data sets (in a score data
set, for example). Data mining models use data from known variables
to predict the values of target variables.
- training
-
the process of computing good values for the weights
in a model.
- training data
-
currently available data that contains input values
and target values that are used for model training.
- transformation
-
the process of applying a function to a variable
in order to adjust the variable's range, variability, or both.
- tree
-
See decision tree.
- underfit
-
to train a model so that it captures only part of the actual patterns
in the sample data. Underfitted models contain too few parameters (weights),
and they do not generalize well. See also overfit.
- validation data
-
data that is used to validate the suitability
of a data model that was developed using training data. Both training
data sets and validation data sets contain target variable values.
Target variable values in the training data are used to train the
model. Target variable values in the validation data set are used
to compare the training model's predictions to the known target values,
assessing the model's fit before using the model to score new data.
- variable
-
a column in a SAS data set or in a SAS data view.
The data values for each variable describe a single characteristic
for all observations. Each SAS variable can have the following attributes:
name, data type (character or numeric), length, format, informat,
and label.
- variable attribute
-
any of the following characteristics that are
associated with a particular variable: name, label, format, informat,
data type, and length.
- variable level
-
the set of data dimensions for binary, interval,
or class variables. Binary variables have two levels. A binary variable
CREDIT could have levels of 1 and 0, Yes and No, or Accept and Reject.
Interval variables have levels that correspond to the number of interval
variable partitions. For example, an interval variable PURCHASE_AGE
might have levels of 0-18, 19-39, 40-65, and >65. Class variables
have levels that correspond to the class members. For example, a class
variable HOMEHEAT might have four variable levels: Coal/Wood, FuelOil,
Gas, and Electric. Data mining decision and profit matrices are composed
of variable levels.
Copyright © SAS Institute Inc. All rights reserved.