Glossary
- data source
-
a data object that represents a SAS data set in
the Java-based Enterprise Miner GUI. A data source contains all the
metadata for a SAS data set that Enterprise Miner needs in order to
use the data set in a data mining process flow diagram. The SAS data
set metadata that is required to create an Enterprise Miner data source
includes the name and location of the data set, the SAS code that
is used to define its library path, and the variable roles, measurement
levels, and associated attributes that are used in the data mining
process.
- Gini index
-
a measure of the total leaf impurity in a decision
tree.
- logistic regression
-
a form of regression analysis in which the target
variable (response variable) represents a binary-level or ordinal-level
response.
- metadata
-
a description or definition of data or information.
- model
-
a formula or algorithm that computes outputs from
inputs. A data mining model includes information about the conditional
distribution of the target variables, given the input variables.
- node
-
(1) in the SAS Enterprise Miner user interface,
a graphical object that represents a data mining task in a process
flow diagram. The statistical tools that perform the data mining tasks
are called nodes when they are placed on a data mining process flow
diagram. Each node performs a mathematical or graphical operation
as a component of an analytical and predictive data model. (2) in
a neural network, a linear or nonlinear computing element that accepts
one or more inputs, computes a function of the inputs, and can direct
the result to one or more other neurons. Nodes are also known as neurons
or units. (3) a leaf in a tree diagram. The terms leaf, node, and
segment are closely related and sometimes refer to the same part of
a tree. See also process flow diagram and internal node.
- observation
-
a row in a SAS data set. All of the data values
in an observation are associated with a single entity such as a customer
or a state. Each observation contains either one data value or a missing-value
indicator for each variable.
- overfit
-
to train a model to the random variation in the
sample data. Overfitted models contain too many parameters (weights),
and they do not generalize well. See also underfit.
- partition
-
to divide available data into training, validation,
and test data sets.
- PFD
-
See process flow diagram.
- process flow diagram
-
a graphical representation of the various data
mining tasks that are performed by individual Enterprise Miner nodes
during a data mining analysis. A process flow diagram consists of
two or more individual nodes that are connected in the order in which
the data miner wants the corresponding statistical operations to be
performed. Short form: PFD.
- project
-
a user-created GUI entity that contains the related
SAS Enterprise Miner components required for the data mining models.
A project contains SAS Enterprise Miner data sources, process flow
diagrams, results data sets, and model packages.
- scorecard
-
a report that estimates the likelihood that a
borrower will display a defined behavior such as payment default.
- SAS data set
-
a file whose contents are in one of the native
SAS file formats. There are two types of SAS data sets: SAS data files
and SAS data views. SAS data files contain data values in addition
to descriptor information that is associated with the data. SAS data
views contain only the descriptor information plus other information
that is required for retrieving data values from other SAS data sets
or from files that are stored in other software vendors' file
formats.
- target variable
-
a variable whose values are known in one or more
data sets that are available (in training data, for example) but whose
values are unknown in one or more future data sets (in a score data
set, for example). Data mining models use data from known variables
to predict the values of target variables.
- training
-
the process of computing values for the weights (parameters)
in a model so that the model fits the training data well.
- training data
-
currently available data that contains input values
and target values that are used for model training.
- underfit
-
to train a model that captures only part of the actual patterns
in the sample data. Underfitted models contain too few parameters (weights),
and they do not generalize well. See also overfit.
- validation data
-
data that is used to validate the suitability
of a data model that was developed using training data. Both training
data sets and validation data sets contain target variable values.
Target variable values in the training data are used to train the
model. Target variable values in the validation data set are used
to compare the trained model's predictions to the known target
values, which assesses the model's fit before the model is used to score
new data.
- variable
-
a column in a SAS data set or in a SAS data view.
The data values for each variable describe a single characteristic
for all observations. Each SAS variable can have the following attributes:
name, data type (character or numeric), length, format, informat,
and label.
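
Example: the Gini index entry above can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from SAS Enterprise Miner. It assumes the common formulation in which the impurity of one leaf is 1 minus the sum of squared class proportions, and it assumes that the total for a tree is the average of the leaf impurities weighted by leaf size. The function names and the sample target values are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one leaf: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def total_leaf_impurity(leaves):
    """Average of the leaf impurities, weighted by leaf size.

    The size weighting is an assumption made for this sketch.
    """
    total = sum(len(leaf) for leaf in leaves)
    if total == 0:
        return 0.0
    return sum(len(leaf) / total * gini_impurity(leaf) for leaf in leaves)


# Hypothetical target values for the observations in two leaves.
pure_leaf = ["default", "default", "default", "default"]
mixed_leaf = ["default", "paid", "default", "paid"]

print(gini_impurity(pure_leaf))    # one class only, so impurity is 0
print(gini_impurity(mixed_leaf))   # two equally likely classes
print(total_leaf_impurity([pure_leaf, mixed_leaf]))
```

A leaf that holds a single class has an impurity of 0, and the value grows as the classes in a leaf become more evenly mixed, so lower totals indicate purer leaves.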
Copyright © SAS Institute Inc. All rights reserved.