This example uses the same diagram workspace that you created in Chapter 2. You can create a new diagram for this example instead, but instructions for doing so are not provided here. First, add the SAMPSIO.DMABASE data source to your project.
- In the Project Panel, right-click Data Sources and click Create Data Source.
- In the Data Source Wizard — Metadata Source window, click Next.
- In the Data Source Wizard — Select a SAS Table window, enter SAMPSIO.DMABASE in the Table field. Click Next.
- In the Data Source Wizard — Table Information window, click Next.
- In the Data Source Wizard — Metadata Advisor Options window, select Advanced. Click Next.
- In the Data Source Wizard — Column Metadata window, make the following changes:
  - For the variable NAME, set the Role to ID.
  - For the variables TEAM, POSITION, LEAGUE, DIVISION, and SALARY, set the Role to Rejected.
  - Ensure that all other variables have a Role of Input.
- In the Data Source Wizard — Create Sample window, click Next.
- In the Data Source Wizard — Data Source Attributes window, click Next.
- In the Data Source Wizard — Summary window, click Finish.
This creates the All: Baseball Data data source in your Project Panel. Drag the All: Baseball Data data source to your diagram workspace.
If you explore the data in the All: Baseball Data data source, you will notice that SALARY and LOGSALAR have missing values. Although it is not always necessary to impute missing values, too much missing data can prevent the Cluster node from obtaining a solution, because the node needs some complete observations in order to generate the initial clusters. When the amount of missing data is too extreme, use the Replacement or Impute node to handle the missing values. This example uses imputation to replace missing values with the median.
On the Modify tab, drag an Impute node to your diagram workspace. Connect the All: Baseball Data data source to the Impute node. In the Interval Variables property subgroup, set the value of the Default Input Method property to Median.
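To make the effect of the Median setting concrete, the following Python sketch replaces missing values in one interval variable with the median of the observed values. It is only an illustration of median imputation under simple assumptions, not the Impute node's implementation: the function name `impute_median` and the sample records are hypothetical and are not taken from SAMPSIO.DMABASE.

```python
from statistics import median


def impute_median(rows, column):
    """Return copies of rows with None in `column` replaced by the column median."""
    # The median is computed from the observed (non-missing) values only.
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        raise ValueError(f"no observed values in {column!r}")
    fill = median(observed)
    # Build new records so that the input rows are left unchanged.
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical records for illustration only.
players = [
    {"NAME": "A", "LOGSALAR": 6.0},
    {"NAME": "B", "LOGSALAR": None},
    {"NAME": "C", "LOGSALAR": 5.0},
    {"NAME": "D", "LOGSALAR": 7.0},
]
imputed = impute_median(players, "LOGSALAR")
```

After imputation every record has a value for the variable, so no observation is left incomplete because of that variable.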
On the Explore tab, drag a Cluster node to your diagram workspace. Connect the Impute node to the Cluster node.
By default, the Cluster node uses the Cubic Clustering Criterion (CCC) to approximate the number of clusters. The node first makes a preliminary clustering pass, beginning with the number of clusters that is specified as the Preliminary Maximum value in the Selection Criterion properties. After the preliminary pass completes, the multivariate means of the clusters are used as inputs for a second pass that uses agglomerative, hierarchical algorithms to combine and reduce the number of clusters. Then, the smallest number of clusters that meets all four of the following criteria is selected.
- The number of clusters must be greater than or equal to the number that is specified as the Minimum value in the Selection Criterion properties.
- The cubic clustering criterion statistic for that number of clusters must be greater than the CCC Cutoff that is specified in the Selection Criterion properties.
- The number of clusters must be less than or equal to the Final Maximum value.
- A peak in the number of clusters exists.
Note: If the data to be clustered is such that the four criteria are not met, then the number of clusters is set to the first local peak. In this event, the following warning is displayed in the Cluster node log:
WARNING: The number of clusters selected based on the CCC values may not be valid. Please refer to the documentation of the Cubic Clustering Criterion. There are no number of clusters matching the specified minimum and maximum number of clusters. The number of clusters will be set to the first local peak.
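The selection rule and its fallback can be sketched in Python as follows. This is a rough illustration, not the Cluster node's code. It assumes that "a peak" means a number of clusters whose CCC value is higher than the values for its two neighboring cluster counts, and it returns None when no peak exists at all, a case the text above does not cover. The function name, parameter names, and CCC values are hypothetical.

```python
def select_number_of_clusters(ccc, minimum, final_maximum, ccc_cutoff):
    """Pick a cluster count from `ccc`, a dict of {number of clusters: CCC value}.

    Returns (k, used_fallback). `used_fallback` is True when no count meets
    all four criteria and the first local peak is used instead.
    """
    ks = sorted(ccc)

    def is_local_peak(i):
        # Assumption: a peak is strictly higher than both neighboring counts.
        return 0 < i < len(ks) - 1 and ccc[ks[i - 1]] < ccc[ks[i]] > ccc[ks[i + 1]]

    peaks = [ks[i] for i in range(len(ks)) if is_local_peak(i)]

    # Peaks are in ascending order, so the first match is the smallest count.
    for k in peaks:
        if minimum <= k <= final_maximum and ccc[k] > ccc_cutoff:
            return k, False

    # Fallback described in the note: use the first local peak.
    if peaks:
        return peaks[0], True
    return None, True  # no peak at all; behavior here is an assumption


# Hypothetical CCC values for 2 through 6 clusters.
ccc_values = {2: 1.0, 3: 4.5, 4: 3.0, 5: 6.0, 6: 2.0}
chosen = select_number_of_clusters(ccc_values, minimum=2, final_maximum=20, ccc_cutoff=3.0)
raised_minimum = select_number_of_clusters(ccc_values, minimum=4, final_maximum=20, ccc_cutoff=3.0)
fallback = select_number_of_clusters(ccc_values, minimum=2, final_maximum=20, ccc_cutoff=10.0)
```

With these values, three clusters are chosen under the first settings, raising the minimum to four moves the choice to five clusters, and a cutoff that no peak exceeds triggers the fallback to the first local peak.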
After the number of clusters is determined, a final pass runs to produce the final clustering for the Automatic setting.
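The second pass described earlier, which combines preliminary clusters by working only from their means, can be illustrated with a small agglomerative sketch. It assumes Ward's minimum-variance criterion as the merging rule. The text above says only that agglomerative, hierarchical algorithms are used, so treat that choice, the function names, and the sample data as assumptions.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    (na, mean_a), (nb, mean_b) = a, b
    squared_distance = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return na * nb / (na + nb) * squared_distance


def merge(a, b):
    """Combine two (size, mean) clusters into one size-weighted cluster."""
    (na, mean_a), (nb, mean_b) = a, b
    n = na + nb
    return n, tuple((na * x + nb * y) / n for x, y in zip(mean_a, mean_b))


def agglomerate(clusters, target):
    """Merge (size, mean) clusters pairwise until only `target` remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    clusters = list(clusters)
    while len(clusters) > target:
        # Find the pair whose merge increases the sum of squares the least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical preliminary clusters: (number of observations, mean vector).
preliminary = [
    (10, (0.0, 0.0)),
    (10, (1.0, 0.0)),
    (5, (10.0, 10.0)),
    (5, (11.0, 10.0)),
]
reduced = agglomerate(preliminary, target=2)
```

Because each merge needs only the cluster sizes and means, this step is cheap even when the preliminary pass was run on a large table.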
You are now ready to cluster the input data.