Glossary
- catalog directory
-
a part of a SAS catalog that stores and maintains
information about the name, type, description, and update status of
each member of the catalog.
- clustering
-
the process of dividing a data set into mutually
exclusive groups so that the observations for each group are as close
as possible to one another and different groups are as far as possible
from one another. In SAS Text Miner, clustering involves discovering
groups of documents that are more similar to each other than they
are to the rest of the documents in the collection. When the clusters
are determined, examining the words that occur in the cluster reveals
the focus of the cluster. Forming clusters within the document collection
can help you understand and summarize the collection without reading
every document. The clusters can reveal the central themes and key
concepts that are emphasized by the collection.
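
As a rough illustration of grouping similar documents and then reading each group's frequent words, the sketch below assigns each document to the most similar of a few seed documents by cosine similarity over raw term counts. This is not the algorithm that SAS Text Miner uses; the function names, the seed-based assignment, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words counts for one document (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_to_seeds(docs, seed_indexes):
    """Place each document in the cluster of its most similar seed document."""
    vectors = [term_counts(d) for d in docs]
    clusters = {s: [] for s in seed_indexes}
    for i, vec in enumerate(vectors):
        best = max(seed_indexes, key=lambda s: cosine(vec, vectors[s]))
        clusters[best].append(i)
    return clusters


def top_terms(docs, members, n=3):
    """Most frequent terms across the documents of one cluster."""
    total = Counter()
    for i in members:
        total.update(term_counts(docs[i]))
    return [term for term, _ in total.most_common(n)]


# Hypothetical four-document collection with two obvious themes.
docs = [
    "engine oil leak engine repair",
    "oil change engine service",
    "refund late delivery refund",
    "delivery delayed refund requested",
]
clusters = assign_to_seeds(docs, seed_indexes=[0, 2])
for seed, members in clusters.items():
    print(seed, members, top_terms(docs, members, n=2))
```

The top terms of each group ("engine", "oil" versus "refund", "delivery") hint at the group's focus without reading every document.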
- concept linking
-
finding and displaying the terms that are highly
associated with the selected term in the Terms table.
- data source
-
a data object that represents a SAS data set in
the Java-based Enterprise Miner GUI. A data source contains all the
metadata for a SAS data set that Enterprise Miner needs in order to
use the data set in a data mining process flow diagram. The SAS data
set metadata that is required to create a SAS Enterprise Miner data source
includes the name and location of the data set; the SAS code that
is used to define its library path; and the variable roles, measurement
levels, and associated attributes that are used in the data mining
process.
- diagram
-
See process flow diagram.
- entity
-
any of several types of information that SAS Text
Miner is able to distinguish from general text. For example, SAS Text
Miner can identify names (of people, places, companies, or products,
for example), addresses (including street addresses, post office addresses,
e-mail addresses, and URLs), dates, measurements, currency amounts,
and many other types of entities.
- libref
-
a name that is temporarily associated with a SAS
library. The complete name of a SAS file consists of two words, separated
by a period. The libref, which is the first word, indicates the library.
The second word is the name of the specific SAS file. For example,
in VLIB.NEWBDAY, the libref VLIB tells SAS which library contains
the file NEWBDAY. You assign a libref with a LIBNAME statement or
with an operating system command.
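
The two-word naming convention can be sketched outside SAS as a small helper that separates the libref from the file name. The function name `split_sas_name` is hypothetical and exists only for this illustration; it is not part of any SAS interface.

```python
def split_sas_name(two_level_name):
    """Split a two-level SAS name such as 'VLIB.NEWBDAY' into (libref, member).

    Hypothetical helper for illustration: it only checks that there are
    exactly two non-empty words separated by one period.
    """
    libref, sep, member = two_level_name.partition(".")
    if not sep or not libref or not member or "." in member:
        raise ValueError(f"expected LIBREF.MEMBER, got {two_level_name!r}")
    return libref, member


libref, member = split_sas_name("VLIB.NEWBDAY")
print(libref, member)
```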
- model
-
a formula or algorithm that computes outputs from
inputs. A data mining model includes information about the conditional
distribution of the target variables, given the input variables.
- node
-
(1) in the SAS Enterprise Miner user interface,
a graphical object that represents a data mining task in a process
flow diagram. The statistical tools that perform the data mining tasks
are called nodes when they are placed on a data mining process flow
diagram. Each node performs a mathematical or graphical operation
as a component of an analytical and predictive data model. (2) in
a neural network, a linear or nonlinear computing element that accepts
one or more inputs, computes a function of the inputs, and optionally
directs the result to one or more other neurons. Nodes are also known
as neurons or units. (3) a leaf in a tree diagram. The terms leaf,
node, and segment are closely related and sometimes refer to the same
part of a tree.
- parsing
-
the process of analyzing text in order to separate
it into its constituent words, phrases, multiword terms, punctuation
marks, or other types of information.
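
A minimal sketch of the idea, assuming that a single regular expression is enough to separate words from punctuation marks. Real text parsing (multiword terms, entities, parts of speech) is far more involved; the pattern and the function name `parse` are assumptions for this example.

```python
import re

# A word (optionally with an internal apostrophe, as in "don't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def parse(text):
    """Return the words and punctuation marks of text, in order."""
    return TOKEN_PATTERN.findall(text)


print(parse("Don't stop, please."))
```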
- partitioning
-
the process of dividing available data into training, validation,
and test data sets.
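
One simple way to picture this is a seeded random shuffle followed by slicing. The 60/20/20 proportions, the seed, and the function name `partition` below are assumptions chosen for the example, not defaults taken from the product.

```python
import random


def partition(rows, train_frac=0.6, valid_frac=0.2, seed=12345):
    """Shuffle rows reproducibly, then slice into train, validation, and test."""
    rows = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)      # seeded shuffle for repeatable splits
    n_train = round(len(rows) * train_frac)
    n_valid = round(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]        # whatever remains becomes test data
    return train, valid, test


train, valid, test = partition(range(10))
print(len(train), len(valid), len(test))
```

Every observation lands in exactly one of the three sets, so no row is used both to fit a model and to assess it.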
- process flow diagram
-
a graphical representation of the various data
mining tasks that are performed by individual Enterprise Miner nodes
during a data mining analysis. A process flow diagram consists of
two or more individual nodes that are connected in the order in which
the data miner wants the corresponding statistical operations to be
performed. Short form: PFD.
- roll-up terms
-
the highest-weighted terms in the document collection.
- SAS data set
-
a file whose contents are in one of the native
SAS file formats. There are two types of SAS data sets: SAS data files
and SAS data views. SAS data files contain data values in addition
to descriptor information that is associated with the data. SAS data
views contain only the descriptor information plus other information
that is required for retrieving data values from other SAS data sets
or from files that are stored in other software vendors' file formats.
- scoring
-
the process of applying a model to new data in
order to compute output. Scoring is typically the last step that is performed
in a data mining process.
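
As an illustration, the sketch below scores new rows with a small linear formula. The model structure (an intercept plus per-variable weights), the variable names, and the values are hypothetical; an actual model would come from the training step of a process flow.

```python
def score(model, rows):
    """Apply a linear model to each new row and return the computed outputs."""
    outputs = []
    for row in rows:
        value = model["intercept"]
        for name, weight in model["weights"].items():
            value += weight * row[name]    # each input contributes weight * value
        outputs.append(value)
    return outputs


# Hypothetical fitted model and new, unlabeled observations.
model = {"intercept": 1, "weights": {"x1": 2, "x2": -1}}
new_rows = [{"x1": 3, "x2": 4}, {"x1": 0, "x2": 0}]
print(score(model, new_rows))
```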
- segmentation
-
the process of dividing a population into sub-populations
of similar individuals. Segmentation can be done in a supervised
mode (using a target variable and various techniques, including decision
trees) or without supervision (using clustering or a Kohonen network).
- singular value decomposition
-
a technique through which high-dimensional data
is transformed into lower-dimensional data.
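
In standard linear-algebra notation, and assuming the data is arranged as an m-by-n matrix A whose columns a_j are documents (an arrangement chosen here for illustration), the reduction to k dimensions can be written as follows. The symbols are not taken from the product documentation.

```latex
% Full decomposition: U and V have orthonormal columns,
% \Sigma is diagonal with singular values \sigma_1 \ge \sigma_2 \ge \dots \ge 0.
A = U \Sigma V^{\mathsf{T}}

% Rank-k approximation: keep only the k largest singular values
% and the corresponding columns of U and V.
A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad k < \min(m, n)

% Lower-dimensional representation of document j (column a_j of A):
% a k-vector instead of an m-vector.
\hat{d}_j = U_k^{\mathsf{T}} a_j = \Sigma_k V_k^{\mathsf{T}} e_j
```

Here e_j is the j-th standard basis vector, so the last expression picks out column j of the k-by-n matrix \Sigma_k V_k^{\mathsf{T}}.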
- source-level debugger
-
an interactive environment in SAS that enables
you to detect and resolve logical errors in programs that are being
developed. The debugger consists of windows and a group of commands.
- stemming
-
the process of finding and returning the root
form of a word. For example, the root form of grind, grinds, grinding,
and ground is grind.
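
The grind example can be reproduced with a small sketch that combines a lookup table for irregular forms with naive suffix stripping. Both the table and the suffix rules are assumptions for this illustration; naive stripping gives wrong roots for many English words, and production stemmers rely on much larger dictionaries and rule sets.

```python
# Irregular forms cannot be derived by stripping a suffix, so they are looked up.
IRREGULAR = {"ground": "grind"}

# Suffixes tried in order; longer suffixes come first.
SUFFIXES = ("ing", "s")


def stem(word):
    """Return an approximate root form of word (illustrative rules only)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        # Require at least three characters to remain after stripping.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


print([stem(w) for w in ["grind", "grinds", "grinding", "ground"]])
```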
- stop list
-
a SAS data set that contains a simple collection
of low-information or extraneous words that you want to remove from
text mining analysis.
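
The effect of a stop list can be sketched as a simple filter over parsed terms. The stop words and the function name `remove_stop_words` below are hypothetical; in SAS Text Miner the stop list is a SAS data set rather than a Python collection.

```python
def remove_stop_words(terms, stop_list):
    """Drop every term that appears in the stop list (case-insensitive)."""
    stop = {word.lower() for word in stop_list}
    return [term for term in terms if term.lower() not in stop]


# Hypothetical stop list and parsed terms.
stop_list = ["the", "a", "of", "and"]
terms = ["The", "engine", "of", "the", "car", "and", "a", "leak"]
print(remove_stop_words(terms, stop_list))
```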
- test data
-
currently available data that contains input values
and target values that are not used during training, but that are instead
used to assess generalization and to compare models.
- training data
-
currently available data that contains input values
and target values that are used for model training.
- validation data
-
data that is used to validate the suitability
of a data model that was developed using training data. Both training
data sets and validation data sets contain target variable values.
Target variable values in the training data are used to train the
model. Target variable values in the validation data set are used
to compare the training model's predictions to the known target values,
assessing the model's fit before using the model to score new data.
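
The comparison of predictions to known target values can be sketched as an accuracy calculation over the validation rows. The threshold rule standing in for a trained model, the column names, and the data are all hypothetical.

```python
def accuracy(predict, rows, target="target"):
    """Fraction of rows whose predicted value equals the known target value."""
    correct = sum(1 for row in rows if predict(row) == row[target])
    return correct / len(rows)


def predict(row):
    """Stand-in for a model trained elsewhere: a simple threshold on x."""
    return 1 if row["x"] > 5 else 0


# Hypothetical validation rows with known target values.
validation = [
    {"x": 9, "target": 1},
    {"x": 7, "target": 1},
    {"x": 2, "target": 0},
    {"x": 6, "target": 0},   # the rule predicts 1 here, so this row is a miss
]
print(accuracy(predict, validation))
```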
- variable
-
a column in a SAS data set or in a SAS data view.
The data values for each variable describe a single characteristic
for all observations. Each SAS variable can have the following attributes:
name, data type (character or numeric), length, format, informat,
and label.
Copyright © SAS Institute Inc. All rights reserved.