Getting Started with SAS(R) Enterprise Miner(TM) 4.3: Glossary

Glossary

activation: a mathematical transformation of a neuron's net input to yield the neuron's output.
activation function: in a neural network, a mathematical transformation of the net input that will yield the output of a neuron.
assessment: the process of determining how well a model computes accurate outputs from input data that was not used during training. Assessment statistics are automatically computed when you train a model with a modeling node. By default, assessment statistics are calculated from the validation data set.
association analysis: the identification of items that occur together in a particular event or record. Association analysis rules are based on frequency counts of the number of times items occur alone and in combination in a database. See also association discovery.
association analysis rule: in association analyses, an association between two or more items. An association analysis rule should not be interpreted as direct causation. An association analysis rule is expressed as follows: If item A is part of an event, then item B is also part of the event X percent of the time.
association discovery: the process of identifying items that occur together in a particular event or record. This technique is also known as market basket analysis. Association discovery rules are based on frequency counts of the number of times items occur alone and in combination in the database.
back propagation: the computation of derivatives for a multilayer perceptron. See also multilayer perceptron (MLP).
binary variable: a variable that contains two discrete values (for example, PURCHASE: Yes and No).
branch: a subtree that is rooted in one of the initial divisions of a segment of a tree. For example, if a rule splits a segment into seven subsets, then seven branches grow from the segment.
case: a collection of information about one of many entities that are represented in a data set. A case is an observation in the data set.
character variable: a variable whose values can consist of alphabetic characters and special characters as well as numeric characters.
cluster sampling: the process of selecting a sample of groups or clusters from a population. The sample contains all of the members of each selected group or cluster.
clustering: the process of dividing a data set into mutually exclusive groups such that the observations for each group are as close as possible to one another, and different groups are as far as possible from one another.
combination function: a function that is applied both to inputs and to hidden layers and that computes the net input to a hidden neuron or to an output neuron.
competitive network: a two-layer neural network that is trained over time by tracking and predicting input-output responses. Over time, output nodes become associated with input vector patterns from the input data set. The weight vector for an output node can be approximated. The approximated weight vector is the average of the input data vectors that are in the vector pattern that is associated with an output node.
confidence: in association analyses, a measure of the strength of the association. In the rule A --> B, confidence is the percentage of times that event B occurs after event A occurs. See also association analysis rule.
cost variable: a variable that is used to track cost in a data mining analysis.
data mining database (DMDB): a SAS data set that is designed to optimize the performance of the modeling nodes. DMDBs enhance performance by reducing the number of passes that the analytical engine needs to make through the data. Each DMDB contains a meta catalog, which includes summary statistics for numeric variables and factor-level information for categorical variables.
data subdirectory: a subdirectory within the Enterprise Miner project location. The data subdirectory contains files that are created when you run process flow diagrams in an Enterprise Miner project.
DataSources folder: a folder in the project subdirectory for an Enterprise Miner project. The DataSources folder contains all of the data sources that have been created, configured, and associated with a data mining project. See also project subdirectory.
decile: any of the nine points that divide the values of a variable into ten groups of equal frequency, or any of those groups.
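The decile entry above can be illustrated with a minimal Python sketch. It assumes the standard library's statistics.quantiles function as the way to find the nine cut points; the names decile_cut_points, decile_group, and scores are hypothetical and do not come from Enterprise Miner.

```python
import bisect
import statistics
from collections import Counter


def decile_cut_points(values):
    """Return the nine cut points that split `values` into ten groups."""
    return statistics.quantiles(values, n=10)


def decile_group(value, cut_points):
    """Return the 0-based index (0..9) of the decile group that holds `value`."""
    return bisect.bisect_left(cut_points, value)


# Hypothetical data: the integers 1 through 100.
scores = list(range(1, 101))
cuts = decile_cut_points(scores)

# Count how many values fall into each of the ten groups.
group_sizes = Counter(decile_group(s, cuts) for s in scores)
```

For this evenly spread data each of the ten groups receives the same number of values. With ties or small data sets the groups are only approximately equal in frequency.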
dependent variable: a variable whose value is determined by the value of another variable or by the values of a set of variables.
depth: the number of successive hierarchical partitions of the data in a tree. The initial, undivided segment has a depth of 0.
diagram lock file: a temporary lock (.lck) file that is created when an Enterprise Miner process flow diagram is opened and that is deleted when the diagram is closed. The lock file prevents more than one user from modifying an Enterprise Miner process flow diagram at one time. Diagram lock files are created in the Workspaces folder, one level beneath the project subdirectory. See also project subdirectory.
error function: a function that measures how well a neural network or other model fits the training data. The error function is also called a Lyapunov function or an estimation criterion.
estimation criterion: another term for error function. See error function.
expected confidence: in association analyses, the number of consequent transactions divided by the total number of transactions. For example, suppose that 100 transactions of item B were made, and that the data set includes a total of 10,000 transactions. Given the rule A --> B, the expected confidence for item B is 100 divided by 10,000, or one percent.
format: a pattern or set of instructions that SAS uses to determine how the values of a variable (or column) should be written or displayed. SAS provides a set of standard formats and also enables you to define your own formats.
frequency variable (freq variable): a variable that represents the frequency of occurrence for other values in each observation in a data mining analysis. Unlike some variables that are used by SAS procedures, a frequency variable can contain noninteger values and can be used for weighting observations.
generalization: the computation of accurate outputs, using input data that was not used during training.
Gini index: a measure of the total leaf impurity in a decision tree.
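The glossary does not give a formula for the Gini index. The sketch below assumes the common definition: the impurity of one leaf is 1 minus the sum of squared class proportions, and the total is the average over leaves weighted by leaf size. The function names and the class-count representation are hypothetical, and the Tree node may compute the measure differently.

```python
def gini_impurity(class_counts):
    """Impurity of one leaf, given the number of cases in each target class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def total_leaf_impurity(leaves):
    """Size-weighted average impurity over all leaves (assumed weighting)."""
    n_cases = sum(sum(counts) for counts in leaves)
    if n_cases == 0:
        return 0.0
    return sum(sum(counts) / n_cases * gini_impurity(counts) for counts in leaves)


# Hypothetical tree with two leaves: one pure, one evenly mixed.
leaves = [[10, 0], [5, 5]]
impurity = total_leaf_impurity(leaves)
```

A pure leaf contributes 0, and an evenly mixed two-class leaf contributes 0.5, so lower totals indicate cleaner segmentation.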
hidden layer: in a neural network, a layer between input and output to which one or more activation functions are applied. Hidden layers are typically used to introduce nonlinearity.
hidden neuron: in a feed-forward, multilayer neural network, a neuron that is in one or more of the hidden layers that exist between the input and output neuron layers. The size of a neural network depends largely on the number of layers and on the number of hidden units per layer. See also hidden layer.
imputation: the computation of replacement values for missing input values.
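As an illustration of imputation, the following sketch replaces missing values with the mean of the observed values. Mean replacement is only one possible strategy and is chosen here as an assumption; the function name impute_mean and the use of None to mark a missing value are hypothetical.

```python
def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


# Hypothetical input column with one missing value.
income = [1.0, None, 3.0]
completed = impute_mean(income)
```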
informat: a pattern or set of instructions that SAS uses to determine how data values in an input file should be interpreted. SAS provides a set of standard informats and also enables you to define your own informats.
input variable: a variable that is used in a data mining process to predict the value of one or more target variables.
internal node: in a tree, a segment that has been further segmented. See also node.
interval variable: a continuous variable that contains values across a range. For example, a continuous variable called Temperature could have values such as 0, 32, 34, 36, 43.5, 44, 56, 80, 99, 99.9, and 100.
Kass adjustment: a p-value adjustment that multiplies the p-value by a Bonferroni factor that depends on the number of branches and chi-square target values, and sometimes on the number of distinct input values. The Kass adjustment is used in the Tree node.
Kohonen network: any of several types of competitive networks that were invented by Teuvo Kohonen. Kohonen vector quantization networks and self-organizing maps (SOMs) are two types of Kohonen network that are commonly used in data mining. See also Kohonen vector quantization network, SOM (self-organizing map).
Kohonen vector quantization network: a type of competitive network that can be viewed either as an unsupervised density estimator or as an autoassociator. The Kohonen vector quantization algorithm is closely related to the k-means cluster analysis algorithm. See also Kohonen network.
leaf: in a tree diagram, any segment that is not further segmented. The final leaves in a tree are called terminal nodes.
level: a successive hierarchical partition of data in a tree. The first level represents the entire unpartitioned data set. The second level represents the first partition of the data into segments, and so on.
libref (library reference): a name that is temporarily associated with a SAS library. The complete name of a SAS file consists of two words, separated by a period. The libref, which is the first word, indicates the library. The second word is the name of the specific SAS file. For example, in VLIB.NEWBDAY, the libref VLIB tells SAS which library contains the file NEWBDAY. You assign a libref with a LIBNAME statement or with an operating system command.
lift: in association analyses and sequence analyses, a calculation that is equal to the confidence factor divided by the expected confidence. See also confidence, expected confidence.
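The confidence, expected confidence, and lift entries fit together as simple ratios of frequency counts. The sketch below computes all three for a rule A --> B from a list of transactions. The function name rule_statistics and the sample baskets are hypothetical, and the results are returned as fractions rather than percentages.

```python
def rule_statistics(transactions, antecedent, consequent):
    """Return (confidence, expected confidence, lift) for antecedent --> consequent."""
    n_total = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_b = sum(1 for t in transactions if consequent in t)
    n_ab = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n_total == 0 or n_a == 0 or n_b == 0:
        raise ValueError("both items must occur in at least one transaction")

    # Confidence: share of transactions containing A that also contain B.
    confidence = n_ab / n_a
    # Expected confidence: share of all transactions that contain B.
    expected_confidence = n_b / n_total
    # Lift: confidence divided by expected confidence.
    lift = confidence / expected_confidence
    return confidence, expected_confidence, lift


# Hypothetical market baskets: 8 transactions in total.
baskets = [{"A", "B"}] * 3 + [{"A"}, {"B"}] + [{"C"}] * 3
confidence, expected_confidence, lift = rule_statistics(baskets, "A", "B")
```

In this data, item B appears in 3 of the 4 transactions that contain A but in only 4 of all 8 transactions, so the lift is greater than 1.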
logistic regression: a form of regression analysis in which the target variable (response variable) represents a binary-level or ordinal-level response.
Lyapunov function: another term for error function. See error function.
macro variable: a variable that is part of the SAS macro programming language. The value of a macro variable is a string that remains constant until you change it. Macro variables are sometimes referred to as symbolic variables.
measurement: the process of assigning numbers to an object in order to quantify, rank, or scale an attribute of the object.
measurement level: a classification that describes the type of data that a variable contains. The most common measurement levels for variables are nominal, ordinal, interval, log-interval, ratio, and absolute. See also interval variable, nominal variable, ordinal variable.
metadata: a description or definition of data or information.
metadata sample: a sample of the input data source that is downloaded to the client and that is used throughout SAS Enterprise Miner to determine meta information about the data, such as number of variables, variable roles, variable status, variable level, variable type, and variable label.
model: a formula or algorithm that computes outputs from inputs. A data mining model includes information about the conditional distribution of the target variables, given the input variables.
multilayer perceptron (MLP): a neural network that has one or more hidden layers, each of which has a linear combination function and executes a nonlinear activation function on the input to that layer. See also hidden layer.
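A minimal sketch of how an MLP with one hidden layer computes an output follows. It assumes a hyperbolic tangent activation function in the hidden layer and an identity activation on a single output neuron; these choices, and the names layer_net_input and mlp_forward, are illustrative assumptions rather than Enterprise Miner defaults.

```python
import math


def layer_net_input(inputs, weights, biases):
    """Linear combination function: one net input per neuron in the layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the output of a one-hidden-layer MLP for a single case."""
    # Net input to each hidden neuron (linear combination of the inputs).
    hidden_net = layer_net_input(inputs, hidden_weights, hidden_biases)
    # Nonlinear activation function applied to each net input (assumed: tanh).
    hidden_out = [math.tanh(v) for v in hidden_net]
    # Output neuron: linear combination of hidden outputs, identity activation.
    return sum(w * h for w, h in zip(output_weights, hidden_out)) + output_bias


# Hypothetical network: one input, one hidden neuron, one output.
prediction = mlp_forward([0.5], [[1.0]], [0.0], [2.0], 0.1)
```

Training, as defined elsewhere in this glossary, would adjust the weights and biases; this sketch only evaluates a network whose weights are already fixed.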
neural networks: a class of flexible nonlinear regression models, discriminant models, data reduction models, and nonlinear dynamic systems that often consist of a large number of neurons. These neurons are usually interconnected in complex ways and are often organized into layers. See also neuron.
neuron: a linear or nonlinear computing element in a neural network. Neurons accept one or more inputs. They apply functions to the inputs, and they can send the results to one or more other neurons. Neurons are also called nodes or units.
node: (1) in the SAS Enterprise Miner user interface, a graphical object that represents a data mining task in a process flow diagram. The statistical tools that perform the data mining tasks are called nodes when they are placed on a data mining process flow diagram. Each node performs a mathematical or graphical operation as a component of an analytical and predictive data model. (2) in a neural network, a linear or nonlinear computing element that accepts one or more inputs, computes a function of the inputs, and optionally directs the result to one or more other neurons. Nodes are also known as neurons or units. (3) a leaf in a tree diagram. The terms leaf, node, and segment are closely related and sometimes refer to the same part of a tree. See also process flow diagram, internal node.
nominal variable: a variable that contains discrete values that do not have a logical order. For example, a nominal variable called Vehicle could have values such as car, truck, bus, and train.
numeric variable: a variable that contains only numeric values and related symbols, such as decimal points, plus signs, and minus signs.
observation: a row in a SAS data set. All of the data values in an observation are associated with a single entity such as a customer or a state. Each observation contains one data value for each variable.
ordinal variable: a variable that contains discrete values that have a logical order. For example, a variable called Rank could have values such as 1, 2, 3, 4, and 5.
output variable: in a data mining process, a variable that is computed from the input variables as a prediction of the value of a target variable.
overfit: to train a model to the random variation in the sample data. Overfit models contain too many parameters (weights), and they do not generalize well. See also underfit.
partition: to divide available data into training, validation, and test data sets.
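Partitioning can be sketched as a seeded random shuffle followed by slicing. The fractions, the default seed value, and the function name partition below are assumptions chosen for illustration.

```python
import random


def partition(cases, train_fraction=0.4, validation_fraction=0.3, seed=12345):
    """Split cases into training, validation, and test lists."""
    if train_fraction < 0 or validation_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if train_fraction + validation_fraction > 1:
        raise ValueError("fractions must sum to at most 1")

    shuffled = list(cases)
    # A fixed seed makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_fraction)
    n_valid = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    # Whatever remains becomes the test data.
    test = shuffled[n_train + n_valid:]
    return train, validation, test


# Hypothetical data set of 100 case identifiers.
train, validation, test = partition(range(100))
```

Because the three lists are slices of one shuffled list, every case lands in exactly one of them.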
perceptron: a linear or nonlinear neural network with or without one or more hidden layers.
predicted value: in a regression model, the value of a dependent variable that is calculated by evaluating the estimated regression equation for a specified set of values of the explanatory variables.
prediction variable: a variable that contains predicted values (outputs) for a target variable.
process flow diagram: a graphical representation of the various data mining tasks that are performed by individual Enterprise Miner nodes during a data mining analysis. A process flow diagram consists of two or more individual nodes that are connected in the order in which the data miner wants the corresponding statistical operations to be performed.
profit matrix: a table of expected revenues and expected costs for each decision alternative for each level of a target variable.
project: a collection of Enterprise Miner process flow diagrams. See also process flow diagram.
project subdirectory: a subdirectory that is used for storing Enterprise Miner project files. The project subdirectory contains folders for data sources, process flow diagram workspaces, target profiles, and statistical results that are associated with a project. The project subdirectory also contains the temporary diagram lock (.lck) files that are created whenever a process flow diagram is opened. See also diagram lock file, DataSources folder, Workspaces folder, Results folder.
rejected variable: a variable that is excluded from a data mining analysis. Variables can be rejected manually during data configuration, or data mining nodes can reject a variable that does not meet some specified criterion.
Reports subdirectory: a subdirectory that is used for storing HTML reports that are generated by the Reporter node. Each report has its own subdirectory. The name of the subdirectory is the same as the name of the corresponding report.
Results folder: a folder in the project subdirectory of an Enterprise Miner project. The Results folder contains the result files that are generated by process flow diagrams in the project. See also project subdirectory.
root node: the initial segment of a tree. The root node represents the entire data set that is submitted to the tree, before any splits are made.
rule: See association analysis rule, sequence analysis rule, tree splitting rule.
sampling: the process of subsetting a population into n cases. The reason for sampling is to decrease the time required for fitting a model.
SAS data set: a file whose contents are in one of the native SAS file formats. There are two types of SAS data sets: SAS data files and SAS data views. SAS data files contain data values in addition to descriptor information that is associated with the data. SAS data views contain only the descriptor information plus other information that is required for retrieving data values from other SAS data sets or from files whose contents are in other software vendors' file formats.
SAS data view: a type of SAS data set that retrieves data values from other files. A SAS data view contains only descriptor information such as the data types and lengths of the variables (columns), plus other information that is required for retrieving data values from other SAS data sets or from files that are stored in other software vendors' file formats. SAS data views can be created by the ACCESS and SQL procedures, as well as by the SAS DATA step.
scoring: the process of applying a model to new data in order to compute outputs. Scoring is the last process that is performed in data mining.
seed: an initial value from which a random number function or CALL routine calculates a random value.
segmentation: the process of dividing a population into sub-populations of similar individuals. Segmentation can be done in a supervisory mode (using a target variable and various techniques, including decision trees) or without supervision (using clustering or a Kohonen network). See also Kohonen network.
self-organizing map: See SOM (self-organizing map).
SEMMA: the data mining process that is used by Enterprise Miner. SEMMA stands for Sample, Explore, Modify, Model, and Assess.
sequence analysis rule: in sequence discovery, an association between two or more items, taking a time element into account. For example, the sequence analysis rule A --> B implies that event B occurs after event A occurs.
sequence variable: a variable whose value is a time stamp that is used to determine the sequence in which two or more events occurred.
simple random sample: a sample in which each item in the population has an equal chance of being selected.
SOM (self-organizing map): a competitive learning neural network that is used for clustering, visualization, and abstraction. A SOM classifies the parameter space into multiple clusters, while at the same time organizing the clusters into a map that is based on the relative distances between clusters. See also Kohonen network.
standard deviation: a statistical measure of the variability of a group of data values. This measure, which is the most widely used measure of the dispersion of a frequency distribution, is equal to the positive square root of the variance.
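The relationship between variance and standard deviation can be written out directly. The glossary does not say whether the sample or the population form is meant, so this sketch supports both through a hypothetical sample flag; the function names are also illustrative.

```python
import math


def variance(values, sample=True):
    """Variance with an n - 1 divisor (sample) or an n divisor (population)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough values")
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Positive square root of the variance."""
    return math.sqrt(variance(values, sample))


# Hypothetical data values with mean 5 and population variance 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = standard_deviation(data, sample=False)
sample_sd = standard_deviation(data)
```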
stratified random sample: a sample obtained by dividing a population into nonoverlapping parts, called strata, and randomly selecting items from each stratum.
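A stratified random sample can be sketched by grouping cases into strata and then drawing randomly from each stratum. Sampling the same fraction from every stratum is an assumption made here for simplicity; the function name stratified_sample, the stratum_of callback, and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(cases, stratum_of, fraction, seed=12345):
    """Draw the same fraction of cases, at random, from every stratum."""
    rng = random.Random(seed)

    # Divide the population into nonoverlapping strata.
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)

    sample = []
    # Sort the stratum keys so that results are reproducible for a given seed.
    for key in sorted(strata):
        members = strata[key]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 10 cases in region "east", 20 in region "west".
customers = [("east", i) for i in range(10)] + [("west", i) for i in range(20)]
picked = stratified_sample(customers, lambda c: c[0], 0.5)
```

Each stratum keeps its share of the population in the sample, which a simple random sample does not guarantee.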
subdiagram: in a process flow diagram, a collection of nodes that are compressed into a single node. The use of subdiagrams can improve your control of the information flow in the diagram.
target variable: a variable whose values are known in one or more data sets that are available (in training data, for example) but whose values are unknown in one or more future data sets (in a score data set, for example). Data mining models use data from known variables to predict the values of target variables.
test data: currently available data that contains input values and target values that are not used during training, but that are instead used to estimate generalization and to compare models.
training: the process of computing good values for the weights in a model.
training data: currently available data that contains input values and target values that are used for model training.
transformation: the process of applying a function to a variable in order to adjust the variable's range, variability, or both.
tree: the complete set of rules that are used to split data into a hierarchy of successive segments. A tree consists of branches and leaves, in which each set of leaves represents an optimal segmentation of the branches above them according to a statistical measure.
tree diagram: a graphical representation of part or all of a tree. The tree diagram can include segment statistics, the names of the variables that were used to split the segments, and the values of the variables.
tree splitting rule: in decision trees, a conditional mathematical statement that specifies how to split segments of a tree's data into subsegments.
trial variable: a variable that contains count data for a binomial target. For example, the number of individuals who responded to a mailing would be a trial variable. Some of the trials are classified as events, and the others are classified as non-events.
unary variable: a variable that contains a single, discrete value.
underfit: to train a model to only part of the actual patterns in the sample data. Underfit models contain too few parameters (weights), and they do not generalize well. See also overfit.
validation data: data that is used to validate the suitability of a data model that was developed using training data. Both training data sets and validation data sets contain target variable values. Target variable values in the training data are used to train the model. Target variable values in the validation data set are used to compare the training model's predictions to the known target values, assessing the model's fit before using the model to score new data.
variable: a column in a SAS data set or in a SAS data view. The data values for each variable describe a single characteristic for all observations. Each SAS variable can have the following attributes: name, data type (character or numeric), length, format, informat, and label. See also SAS data set, SAS data view, macro variable.
variable attributes: the name, label, format, informat, data type, and length that are associated with a particular variable.
weight: a constant that is used in a model and whose value is unknown or unspecified prior to the analysis.
Workspaces folder: a folder in the project subdirectory for an Enterprise Miner project. One workspace named Emwsnn is created in the Workspaces folder for every process flow diagram that is created in the data mining project, where nn is the ordinal for the diagram's creation. The Workspaces folder contains a process flow diagram, a subfolder for each node in the diagram, and target profile descriptor tables, as well as Reports and Results subfolders that are associated with the diagram. See also project subdirectory.
