The CLUSTER statement performs a cluster analysis with one of two methods: k-means, which uses the Euclidean distance between values, or DBSCAN, a density-based algorithm that was originally developed to discover clusters in large spatial databases with noise.
Example: | Performing a Cluster Analysis |
specifies a list of variables. If you do not specify this option, then all the variables in the table are used.
specifies the maximum number of points in each bubble and must exceed the value specified in the BUBMINPTS= option.
Applies to | METHOD=DBSCAN |
specifies the minimum number of points in each bubble.
Default | 1 |
Applies to | METHOD=DBSCAN |
specifies to save the cluster center that each observation belongs to, and the distance between them, into the temporary table.
Interaction | You must specify the TEMPTABLE option along with this option. |
specifies the convergence criterion c for the k-means analysis. When the relative change in within-cluster sum of squares (WCSS) between successive iterations is less than c, the analysis is presumed to have converged.
Default | 1e-05 |
Applies to | METHOD=KMEANS |
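The convergence test described above can be illustrated outside of SAS. The following Python sketch is not the server's implementation. It assumes that the relative change is computed as |previous WCSS - current WCSS| / previous WCSS, which the documentation does not spell out, and the function names `wcss` and `has_converged` are hypothetical.

```python
def wcss(points, centers, labels):
    """Within-cluster sum of squares: the squared Euclidean distance
    from each point to the center of the cluster it is assigned to."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


def has_converged(prev_wcss, curr_wcss, c=1e-05):
    """Return True when the relative change in WCSS is less than c.

    Assumption: relative change = |prev - curr| / prev.
    """
    if prev_wcss == 0.0:
        # A WCSS of zero cannot improve further, so treat it as converged.
        return True
    return abs(prev_wcss - curr_wcss) / prev_wcss < c
```

In a k-means loop, a check of this kind would typically run after each reassignment step, alongside the limit on the maximum number of iterations.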
specifies the variable or variables to include in the clustering results. Use this option with the NSAMP= option to generate output that is suitable for graphing.
specifies the distance measure used in the DBSCAN method. The k-means method uses DIST=EUC.
specifies the maximum diameter of bubbles with the specified DIST= distance measure.
Default | 0 |
specifies the distance value for neighborhood querying in the DBSCAN method. You must specify a value for r when METHOD=DBSCAN. There is no default value since r is a distance measurement and depends on the range of the data.
specifies the variable or variables that are used to perform the frequency analysis for each cluster. The procedure generates separate output tables for each variable.
specifies the method for obtaining the initial estimate of cluster assignment. For the INIT=FORGY partition method, the initial mean centers are computed from NUMCLUS × nt × nn observations, where nt is the number of threads and nn is the number of nodes used by the server.
Alias | INIT= |
Default | FORGY |
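As a rough illustration of Forgy-style initialization, the following Python sketch picks observations at random to serve as initial centers. It shows the textbook single-machine form, not the server's distributed variant: how the server combines the NUMCLUS × nt × nn sampled observations into NUMCLUS centers is not described here. The names `forgy_centers` and `forgy_sample_size` are hypothetical.

```python
import random


def forgy_centers(points, numclus, seed=None):
    """Textbook Forgy initialization: choose numclus distinct observations
    at random and use them as the initial cluster centers."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(points, numclus)


def forgy_sample_size(numclus, nt, nn):
    """Number of observations drawn according to the description above:
    NUMCLUS x (number of threads) x (number of nodes)."""
    return numclus * nt * nn
```

With a fixed seed, repeated calls return the same centers, which mirrors the reproducibility that a nonzero random number seed provides for a given thread and node configuration.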
specifies a positive integer that determines the maximum number of iterations for clustering.
Default | 10 |
specifies the clustering method for the analysis.
Default | KMEANS |
specifies the minimum number of points required in one cluster for the DBSCAN method. The EPS= and MINPTS= options can have a dramatic effect on the clusters that are generated with the DBSCAN method. You should exercise care in specifying the values for these options.
Default | 1, which means that any cluster should contain at least one data point. |
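To show why the neighborhood distance and the minimum point count shape the result so strongly, the following Python sketch implements textbook DBSCAN. It is an illustration under assumptions, not the server's algorithm: it uses Euclidean distance, treats `minpts` as the minimum neighborhood size for a core point (the standard DBSCAN meaning), and ignores the bubble-related options. The names `region_query` and `dbscan` are hypothetical.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, minpts=1):
    """Textbook DBSCAN. Returns one label per point:
    0, 1, ... for clusters and -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels
```

With a minimum of 1, every point is a core point and no observation is labeled as noise. Raising the minimum, or shrinking the neighborhood distance, turns sparse observations into noise and can split clusters.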
specifies that the comparison between terms and the values of character variables is case insensitive. By default, comparisons are case sensitive.
specifies that only the term frequency is used to construct the vectors, and not the inverse document frequency.
specifies that the term frequency-inverse document frequency (TF-IDF) vectors are not normalized.
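The effect of the three preceding options (case-insensitive comparison, term frequency only, and no normalization) can be sketched in Python. This is a minimal illustration, not the server's computation. It assumes that term frequency is a raw count, that inverse document frequency is log(N / df), and that normalization scales each vector to unit Euclidean length; none of these formulas is stated in this documentation. The function name `tfidf_vectors` and all of its parameter names are hypothetical.

```python
import math
import re


def tfidf_vectors(docs, terms, tokens=" ", ignore_case=False,
                  no_idf=False, no_norm=False):
    """Build one vector per document, with one component per term."""
    if ignore_case:
        docs = [d.lower() for d in docs]
        terms = [t.lower() for t in terms]
    # Split each document on any run of the delimiter characters.
    splitter = "[" + re.escape(tokens) + "]+"
    split_docs = [[w for w in re.split(splitter, d) if w] for d in docs]
    n_docs = len(docs)
    # Assumed IDF: log(N / df); a weight of 1.0 means "term frequency only".
    idf = []
    for term in terms:
        df = sum(1 for words in split_docs if term in words)
        idf.append(1.0 if no_idf or df == 0 else math.log(n_docs / df))
    vectors = []
    for words in split_docs:
        vec = [words.count(term) * w for term, w in zip(terms, idf)]
        if not no_norm:
            length = math.sqrt(sum(v * v for v in vec))
            if length > 0:
                vec = [v / length for v in vec]  # unit Euclidean length
        vectors.append(vec)
    return vectors
```

Under these assumptions, a term that occurs in every document receives an IDF weight of zero, so it contributes nothing to the distance unless the inverse document frequency is switched off.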
prevents the procedure from pre-parsing and pre-generating code for temporary expressions, scoring programs, and other user-written SAS statements.
Alias | NOPREP |
specifies the number of representative points for each bubble.
Default | 1 |
specifies the number of sample points that are returned for each cluster. This option returns the k nearest and k farthest points from the cluster centers, including their distances.
Default | 0 |
specifies the number of clusters for the k-means analysis.
Alias | NUMCLUS= |
Default | 2 |
specifies the number of sample points to return for each cluster. This option returns the k nearest and k farthest points from the cluster centers, including their Euclidean distances.
Default | 0 |
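Selecting the k nearest and k farthest points for one cluster can be sketched as follows. The Python function below is a hypothetical illustration (the name `sample_points` is not part of SAS). It assumes Euclidean distance and handles a single cluster center at a time.

```python
import math


def sample_points(points, center, k):
    """Return the k nearest and the k farthest points from center,
    each paired with its Euclidean distance as (distance, point)."""
    ranked = sorted(
        ((math.dist(p, center), p) for p in points),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:k]
    # ranked[-0:] would return the whole list, so handle k == 0 explicitly.
    farthest = ranked[-k:] if k > 0 else []
    return nearest, farthest
```

With k equal to 0, which is the default, no sample points are returned. Plotting the nearest and farthest points gives a quick impression of how compact each cluster is.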
saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.
specifies to save the TF-IDF vectors into the temporary table along with clustering results.
Interaction | You must specify the TEMPTABLE option along with this option. |
specifies to save the distance vectors to the temporary table along with clustering results. When the VARS= option already includes the variables used to obtain the distance vectors, these variables are not saved again.
Alias | SAVEVEC |
Interaction | You must specify the TEMPTABLE option along with this option. |
specifies the random number seed to use when the initialization methods INIT=FORGY or INIT=RAND are also specified. Specifying a nonzero number results in a reproducible random number stream for the specific combination of number of threads and number of worker nodes in the server.
requests that the server estimate the size of the result set. The procedure does not create a result table if the SETSIZE option is specified. Instead, the procedure reports the number of rows that are returned by the request and the expected memory consumption for the result set (in KB). If you specify the SETSIZE option, the SAS log includes the number of observations and the estimated result set size. See the following log sample:
NOTE: The LASR Analytic Server action request for the STATEMENT
statement would return 17 rows and approximately
3.641 kBytes of data.
specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.
Alias | TE= |
specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.
Alias | TN= |
generates an in-memory temporary table from the result set. The IMSTAT procedure displays the name of the table and stores it in the &_TEMPLAST_ macro variable, provided that the statement executed successfully.
specifies the terms used in computing term frequency. Each string represents one term. Specifying the TERMS= option triggers the use of TF-IDF to compute distance vectors for character variables. The TERMS= option is useful if you have a fairly small number of terms to pass to the server. If the number of terms is large, you might want to use the TERMDATA= option instead.
specifies an in-memory table in the server that contains the term list. Specifying the TERMDATA= option triggers the use of TF-IDF to compute distance vectors for character variables. If you specify both the TERMS= and TERMDATA= options, the server uses the union of the two sets in the TF-IDF calculation. Note that the IMSTAT procedure assumes that the TERMDATA= table already exists in the server.
specifies the maximum number of seconds that the server should run the statement. If the time-out is reached, the server terminates the request and generates an error and an error message. By default, there is no time-out.
specifies the tokens that separate terms when scanning character variables. If you do not specify the tokens (term delimiters), then the terms are compared against the full raw length of the character variable.
specifies an in-memory table in the server that contains the tokens list.
specifies the names of the variables to transfer to a temporary table in the server. This option is ignored unless you score an in-memory table and the TEMPTABLE option is specified. The observations with these variables are copied to the generated temporary table.