IMSTAT Procedure (Analytics)

CLUSTER Statement

The CLUSTER statement performs either a k-means cluster analysis, which uses the Euclidean distance between values, or a density-based analysis with the DBSCAN algorithm, which was originally developed to discover clusters in large spatial databases with noise.

Performing a Cluster Analysis

Syntax

CLUSTER <variable-list> </ options>;

Optional Argument

variable-list

specifies a list of variables. If you do not specify a variable list, then all the variables in the table are used.

For the k-means clustering, the variable list defines a vector for each observation that is used to compute the Euclidean distance between the observations. By minimizing the within-cluster sum of squares (WCSS) of this distance, a set of clusters and their centers are determined.
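The quantity that k-means minimizes can be illustrated with a short sketch. The following Python code is not the server's implementation; the function names and sample data are hypothetical and serve only to show how the within-cluster sum of squares is computed from the observation vectors, the cluster centers, and the cluster assignments.

```python
# Illustrative only: a plain-Python WCSS computation, not the server's code.

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wcss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(squared_euclidean(p, centers[k]) for p, k in zip(points, labels))

# Hypothetical data: four observations described by two variables each.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
centers = [(1.0, 2.0), (9.0, 2.0)]

good = wcss(points, centers, [0, 0, 1, 1])  # each point assigned to its nearest center
bad = wcss(points, centers, [0, 1, 0, 1])   # two points assigned to the distant center
```

An assignment that places each observation with its nearest center yields the smaller WCSS, which is the behavior that the k-means iterations work toward.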

CLUSTER Statement Options

BUBMAXPTS=n

specifies the maximum number of points in each bubble. The value must exceed the value specified in the BUBMINPTS= option.

Applies to METHOD=DBSCAN

BUBMINPTS=n

specifies the minimum number of points in each bubble.

Default 1
Applies to METHOD=DBSCAN

CLUSTINFO

saves to the temporary table, for each observation, the center of the cluster to which the observation belongs and the distance between the observation and that center.

Interaction You must specify the TEMPTABLE option along with this option.

CONV=c

specifies the convergence criterion c for the k-means analysis. When the relative change in within-cluster sum of squares (WCSS) between successive iterations is less than c, the analysis is presumed to have converged.

Default 1e-05
Applies to METHOD=KMEANS
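The convergence test can be sketched as follows. This Python code is an illustration under stated assumptions: the relative change is taken to be |previous − current| / |previous|, which the documentation does not spell out, and the function names are hypothetical. The defaults mirror the documented CONV= and MAXITER= defaults.

```python
# Illustrative only: the exact formula used by the server is an assumption here.

def relative_change(previous, current):
    """Relative change in WCSS between two successive iterations."""
    if previous == 0:
        return 0.0
    return abs(previous - current) / abs(previous)

def iterations_until_converged(wcss_history, conv=1e-5, max_iter=10):
    """Return the 1-based iteration at which convergence is declared,
    or None if the criterion is not met within max_iter iterations."""
    for i in range(1, min(len(wcss_history), max_iter)):
        if relative_change(wcss_history[i - 1], wcss_history[i]) < conv:
            return i + 1
    return None

# Hypothetical WCSS values recorded after each iteration.
history = [500.0, 120.0, 100.0, 99.9999999]
converged_at = iterations_until_converged(history)
```

With these values, the first two changes are large, and only the change between the third and fourth iterations falls below the criterion.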

DISP=(variable-list)

DISP=variable-name

specifies the variable or variables to include in the clustering results. Use this option with the NSAMP= option to generate output that is suitable for graphing.

DIST= EUC | SQUAREDEUC | MANHATTAN | MAXIMUM | COSINE | JACCARD | HAMMING

specifies the distance measure used in the DBSCAN method. The k-means method uses DIST=EUC.
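For orientation, the following Python sketch shows the textbook definitions of four of the listed measures. It is an assumption that the server uses these standard definitions; COSINE, JACCARD, and HAMMING are omitted because their exact definitions in the server are not stated here. The dictionary keys reuse the option values only as labels.

```python
import math

# Illustrative only: textbook definitions, keyed by the DIST= value names.
DISTANCES = {
    "EUC": lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    "SQUAREDEUC": lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)),
    "MANHATTAN": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "MAXIMUM": lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
}

# Hypothetical pair of observations.
a, b = (0.0, 0.0), (3.0, 4.0)
results = {name: fn(a, b) for name, fn in DISTANCES.items()}
```

Because the measures produce values on different scales for the same pair of points, an EPS= value that works for one measure generally needs to be revisited when the measure changes.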

DMAX=v

specifies the maximum diameter of bubbles with the specified DIST= distance measure.

Default 0

EPS=r

specifies the distance value for neighborhood querying in the DBSCAN method. You must specify a value for r when METHOD=DBSCAN. There is no default value since r is a distance measurement and depends on the range of the data.

The values of EPS= and MINPTS= are important for the number of clusters that DBSCAN can find. The EPS= value determines the minimal radius of the clusters in terms of the distance measurement. (See the DIST= option.) The value of MINPTS=n suggests that data points in a cluster with fewer than n observations are noise.
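The interplay between the two values can be seen in a compact textbook DBSCAN. The following Python sketch is not the server's implementation and ignores bubbling; it assumes the Euclidean distance and the textbook meaning of the minimum point count (the number of points, including the point itself, that must lie within the radius for a point to seed or extend a cluster). The names eps and min_pts are stand-ins for the EPS= and MINPTS= options.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: returns one label per point, NOISE for noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_pts:     # j is a core point, so keep expanding
                queue.extend(more)
    return labels

# Hypothetical data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels_strict = dbscan(points, eps=1.5, min_pts=3)
labels_loose = dbscan(points, eps=1.5, min_pts=1)
labels_tiny_eps = dbscan(points, eps=0.5, min_pts=3)
```

With a minimum of three points, the isolated point is labeled as noise. With a minimum of one point, it becomes a cluster of its own. With a radius that is too small for the data, every point is labeled as noise.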

FREQ=(variable-list)

FREQ=variable-name

specifies the variable or variables that are used to perform the frequency analysis for each cluster. The procedure generates separate output tables for each variable.

INITMETHOD=FORGY | RAND | AVG

specifies the method for obtaining the initial estimate of cluster assignment. For the INIT=FORGY partition method, the initial mean centers are computed from NUMCLUS × nt × nn observations, where nt and nn are the number of threads and the number of nodes used by the server.

When you specify INIT=RAND, all observations are assigned randomly to one of the NUMCLUS clusters. When you specify INIT=AVG, the initial centers are formed by averaging the observations on a thread-by-thread basis.

Alias INIT=
Default FORGY
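The difference between the two random initializations can be sketched in Python. This is a single-threaded illustration under stated assumptions, not the server's distributed logic: the Forgy-style function simply draws one observation per cluster (which corresponds to nt = nn = 1), and the random-partition function assigns every observation to a random cluster before averaging. The function names and the seed handling are hypothetical.

```python
import random

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def init_forgy(points, num_clusters, seed):
    """Forgy-style start: pick observations at random to serve as centers."""
    rng = random.Random(seed)
    return rng.sample(points, num_clusters)

def init_random_partition(points, num_clusters, seed):
    """Random-partition start: assign each observation to a random cluster,
    then use the mean of each cluster as its initial center."""
    rng = random.Random(seed)
    labels = [rng.randrange(num_clusters) for _ in points]
    centers = []
    for k in range(num_clusters):
        members = [p for p, lab in zip(points, labels) if lab == k]
        # Fall back to a random observation if a cluster received no members.
        centers.append(mean(members) if members else rng.choice(points))
    return centers

# Hypothetical data.
pts = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
forgy_centers = init_forgy(pts, 2, seed=7)
partition_centers = init_random_partition(pts, 2, seed=7)
```

Supplying the same seed reproduces the same starting centers in this sketch, which is the same motivation behind specifying a nonzero SEED= value.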

MAXITER=i

specifies a positive integer that determines the maximum number of iterations for clustering.

Default 10

METHOD=KMEANS | DBSCAN

specifies the clustering method for the analysis.

Default KMEANS

MINPTS=n

specifies the minimum number of points required in one cluster for the DBSCAN method. The EPS= and MINPTS= options can have a dramatic effect on the clusters that are generated with the DBSCAN method. You should exercise care in specifying the values for these options.

Default 1, which means that any cluster should contain at least one data point.

NOCASE

specifies that the comparison between terms and the values of character variables is case insensitive. By default, comparisons are case sensitive.

NOIDF

specifies that only the term frequency is used to construct the vectors, and not the inverse document frequency.

NONORM

specifies that the term frequency-inverse document frequency (TF-IDF) vectors are not normalized.

NOPREPARSE

prevents the procedure from pre-parsing and pre-generating code for temporary expressions, scoring programs, and other user-written SAS statements.

When this option is specified, the user-written statements are sent to the server as-is, and the server then attempts to generate code from them. If the server detects problems with the code, the error messages might not be as detailed as the messages that are generated by the SAS client. If you are debugging your user-written program, then you might want to pre-parse and pre-generate code in the procedure. However, if your SAS statements compile and run as you want them to, then you can specify this option to avoid the work of parsing and generating code on the SAS client.

When you specify this option in the PROC IMSTAT statement, the option applies to all statements that can generate code. You can also exclude specific statements from pre-parsing by using the NOPREPARSE option in statements that allow temporary columns or the SCORE statement.

Alias NOPREP

NREP=k

specifies the number of representative points for each bubble.

Default 1

NSAMP=k

specifies the number of sample points that are returned for each cluster. This option returns the k nearest and k farthest points from the cluster centers, including their distances.

Default 0
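The selection of sample points can be sketched as follows. This Python code is an illustration only; it assumes the Euclidean distance and a single cluster whose members and center are already known, and the function name is hypothetical.

```python
import math

def sample_points(points, center, k):
    """Return the k nearest and k farthest (distance, point) pairs
    relative to the cluster center."""
    ranked = sorted((math.dist(p, center), p) for p in points)
    nearest = ranked[:k]
    farthest = ranked[-k:] if k > 0 else []
    return nearest, farthest

# Hypothetical members of one cluster and its center.
members = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
nearest, farthest = sample_points(members, center=(0.0, 0.0), k=1)
```

Returning both extremes gives a compact picture of each cluster: the nearest points show what is typical, and the farthest points show how far the cluster extends.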

NUMCLUSTERS=number

specifies the number of clusters for the k-means analysis.

Alias NUMCLUS=
Default 2

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SAVETERMS

specifies to save the TF-IDF vectors into the temporary table along with clustering results.

Interaction You must specify the TEMPTABLE option along with this option.

SAVEVECTORS

specifies to save the distance vectors to the temporary table along with clustering results. When the VARS= option already includes the variables used to obtain the distance vectors, these variables are not saved again.

Alias SAVEVEC
Interaction You must specify the TEMPTABLE option along with this option.

SEED=number

specifies the random number seed to use when the initialization methods INIT=FORGY or INIT=RAND are also specified. Specifying a nonzero number results in a reproducible random number stream for the specific combination of number of threads and number of worker nodes in the server.

SETSIZE

requests that the server estimate the size of the result set. The procedure does not create a result table if the SETSIZE option is specified. Instead, the procedure reports the number of rows that are returned by the request and the expected memory consumption for the result set (in KB). If you specify the SETSIZE option, the SAS log includes the number of observations and the estimated result set size. See the following log sample:

NOTE: The LASR Analytic Server action request for the STATEMENT
      statement would return 17 rows and approximately
      3.641 kBytes of data.

The typical use of the SETSIZE option is to get an estimate of the size of the result set in situations where you are unsure whether the SAS session can handle a large result set. Be aware that in order to determine the size of the result set, the server has to perform the work as if you were receiving the actual result set. Requesting the estimated size of the result set does consume resources on the server. The estimated number of KB is very close to the actual memory consumption of the result set. It might not be immediately obvious how this size relates to the displayed table, since many tables contain hidden columns. In addition, some elements of the result set might not be converted to tabular output by the procedure.

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias TN=

TEMPTABLE

generates an in-memory temporary table from the result set. The IMSTAT procedure displays the name of the table and stores it in the &_TEMPLAST_ macro variable, provided that the statement executed successfully.

When the IMSTAT procedure exits, all temporary tables created during the IMSTAT session are removed. Temporary tables are not displayed on a TABLEINFO request, unless the temporary table is the active table for the request.

TERMS=("term1" <, "term2"...>)

specifies the terms used in computing term frequency. Each string represents one term. Specifying the TERMS= option triggers the use of TF-IDF to compute distance vectors for character variables. The TERMS= option is useful if you have a fairly small number of terms to pass to the server. If the number of terms is large, you might want to use the TERMDATA= option instead.

TERMDATA=table-name

specifies an in-memory table in the server that contains the term list. Specifying the TERMDATA= option triggers the use of TF-IDF to compute distance vectors for character variables. If you specify both the TERMS= and TERMDATA= options, the server uses the union of the two sets in the TF-IDF calculation. Note that the IMSTAT procedure assumes that the TERMDATA= table already exists in the server.

TIMEOUT=seconds

specifies the maximum number of seconds that the server should run the statement. If the time-out is reached, the server terminates the request and generates an error and error message. By default, there is no time-out.

TOKENS=("token1" <, "token2"...>)

specifies the tokens that separate terms when scanning character variables. If you do not specify the tokens (term delimiters), then the terms are compared against the full raw length of the character variable.

For example, if the term list is "better", "worse", and the variable Opinion contains "Recent news is better than last week," then without a token list that contains the " " (blank) delimiter, the term "better" is not counted. In other words, the absence of the blank token prevents the TF-IDF from scanning "better". If your token list is large, you might want to use the TOKENDATA= option instead.
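The effect of the token list in this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the server's scanner: tokens are treated as single-character delimiters, terms are matched against whole pieces of the split text, and the nocase argument is a stand-in for the NOCASE option. The function name is hypothetical.

```python
import re

def term_counts(text, terms, tokens=(), nocase=False):
    """Count how often each term appears among the pieces of text that
    result from splitting on the token characters."""
    if tokens:
        pattern = "[" + re.escape("".join(tokens)) + "]"
        pieces = [p for p in re.split(pattern, text) if p]
    else:
        pieces = [text]                  # no tokens: compare against the full value
    if nocase:
        pieces = [p.lower() for p in pieces]
    counts = {}
    for term in terms:
        key = term.lower() if nocase else term
        counts[term] = pieces.count(key)
    return counts

opinion = "Recent news is better than last week"
terms = ["better", "worse"]

without_tokens = term_counts(opinion, terms)
with_blank = term_counts(opinion, terms, tokens=(" ",))
```

Without the blank token, the only candidate is the entire value of the variable, so neither term matches. With the blank token, the value is split into words and the term "better" is counted once.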

TOKENDATA=table-name

specifies an in-memory table in the server that contains the tokens list.

VARS=variable-name

VARS=(variable-name1 <, variable-name2, ...>)

specifies the names of the variables to transfer to a temporary table in the server. This option is ignored unless you score an in-memory table and the TEMPTABLE option is specified. The observations with these variables are copied to the generated temporary table.

Details

Two clustering methods are implemented in the CLUSTER statement:
  • The default clustering method is k-means clustering. For this method, the optional list of variables defines a vector for each observation, which is used to compute the Euclidean distance between the observations. By minimizing the within-cluster sum of squares (WCSS) of this distance, a set of clusters and their centers can be determined.
  • You can also use a density-based algorithm, DBSCAN. It was originally developed to discover clusters in large spatial databases with noise, and was published in the proceedings of KDD 1996. You do not need to supply the number of clusters for the DBSCAN method. The bubbling scheme is designed specifically to improve the DBSCAN performance for large data sets. Bubbling is disabled by default, because the default value of the DMAX= option is 0.

The statement also supports TF-IDF (term frequency-inverse document frequency) to compute distance vectors for character variables. For any observation with TF-IDF, the total length of the distance vector is given by the number of numerical variables plus the number of terms. The following options relate to TF-IDF calculations: TERMS=, TERMDATA=, TOKENS=, TOKENDATA=, NOCASE, NOIDF, NONORM, and SAVETERMS. You trigger TF-IDF calculations by specifying the TERMS= or TERMDATA= options in the CLUSTER statement.
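The composition of the distance vector can be sketched in Python. The weighting shown here is an assumption: it uses the common tf × log(N / df) formula and an L2 normalization of the TF-IDF part, neither of which is confirmed as the server's exact computation. The use_idf and normalize arguments are stand-ins for the behavior that the NOIDF and NONORM options switch off, and all function names are hypothetical.

```python
import math

def tfidf_part(counts, use_idf=True, normalize=True):
    """Turn per-observation term counts into TF-IDF weights (assumed formula)."""
    n_obs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of observations in which each term occurs.
    df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]
    idf = []
    for d in df:
        if not use_idf:
            idf.append(1.0)              # term frequency only
        elif d == 0:
            idf.append(0.0)              # term never occurs
        else:
            idf.append(math.log(n_obs / d))
    vectors = []
    for row in counts:
        v = [c * w for c, w in zip(row, idf)]
        if normalize:
            length = math.sqrt(sum(x * x for x in v))
            if length > 0:
                v = [x / length for x in v]
        vectors.append(v)
    return vectors

def distance_vectors(numeric_rows, counts, **kwargs):
    """Numeric values followed by one TF-IDF component per term."""
    return [list(nums) + tf
            for nums, tf in zip(numeric_rows, tfidf_part(counts, **kwargs))]

# Hypothetical data: three observations, two numeric variables, two terms.
numeric_rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
counts = [[1, 0], [2, 1], [0, 0]]

weighted = distance_vectors(numeric_rows, counts)
raw = distance_vectors(numeric_rows, counts, use_idf=False, normalize=False)
```

Each resulting vector has four components, that is, two numeric variables plus two terms, which matches the length rule described above.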