The CLUSTER Procedure

Miscellaneous Formulas

The root mean squared standard deviation of a cluster $C_ K$ is

\[  \mbox{RMSSTD} = \sqrt {\frac{W_ K}{v(N_ K - 1)}}  \]
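
As a rough illustration, the following Python sketch evaluates RMSSTD for one cluster. It assumes notation that is not restated in this section: $W_K$ is taken to be the sum of squared Euclidean distances from the observations in $C_K$ to the cluster mean, $N_K$ the number of observations in $C_K$, and $v$ the number of variables. The function names `within_ss` and `rmsstd` are hypothetical and are not part of any SAS interface.

```python
import math


def within_ss(points):
    """Assumed definition of W_K: sum of squared distances to the cluster mean."""
    n = len(points)
    v = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(v)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(v))


def rmsstd(points):
    """RMSSTD = sqrt(W_K / (v * (N_K - 1))) for a single cluster."""
    n_k = len(points)
    v = len(points[0])
    if n_k < 2:
        raise ValueError("RMSSTD needs at least two observations in the cluster")
    return math.sqrt(within_ss(points) / (v * (n_k - 1)))


# Four observations on two variables (illustrative data only).
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(rmsstd(cluster))  # W_K = 8, v = 2, N_K = 4, so this is sqrt(8 / 6)
```

Under these assumptions, RMSSTD is the square root of the average, over the $v$ variables, of the within-cluster sample variances.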

The R-square statistic for a given level of the hierarchy is

\[  R^2 = 1 - \frac{P_ G}{T}  \]

The squared semipartial correlation for joining clusters $C_ K$ and $C_ L$ is

\[  \mbox{semipartial } R^2 = \frac{B_{KL}}{T}  \]
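
The R-square and semipartial R-square formulas can be sketched in Python as follows. The sketch assumes definitions that this section does not restate: $T$ is the total sum of squared distances to the overall mean, $P_G$ is the sum of the within-cluster sums of squares over the $G$ clusters at the given level, and $B_{KL} = W_M - W_K - W_L$, where $C_M$ is the cluster formed by joining $C_K$ and $C_L$. All function names are hypothetical.

```python
def within_ss(points):
    """Assumed definition of W: sum of squared distances to the mean of `points`."""
    n = len(points)
    v = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(v)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(v))


def r_square(clusters):
    """R^2 = 1 - P_G / T for one level of the hierarchy."""
    all_points = [p for c in clusters for p in c]
    total = within_ss(all_points)                  # T (must be nonzero)
    pooled = sum(within_ss(c) for c in clusters)   # P_G
    return 1.0 - pooled / total


def semipartial_r_square(cluster_k, cluster_l, all_points):
    """B_KL / T, with B_KL assumed to equal W_M - W_K - W_L."""
    merged = list(cluster_k) + list(cluster_l)
    b_kl = within_ss(merged) - within_ss(cluster_k) - within_ss(cluster_l)
    return b_kl / within_ss(all_points)


# Three small clusters on two variables (illustrative data only).
c_k = [(0.0, 0.0), (0.0, 2.0)]
c_l = [(3.0, 0.0), (3.0, 2.0)]
c_j = [(9.0, 0.0), (9.0, 2.0)]
everything = c_k + c_l + c_j

r2_before = r_square([c_k, c_l, c_j])   # three-cluster level
r2_after = r_square([c_k + c_l, c_j])   # after joining C_K and C_L
spr = semipartial_r_square(c_k, c_l, everything)
print(r2_before, r2_after, spr)
```

With these definitions, the semipartial $R^2$ for a join equals the decrease in $R^2$ caused by that join, which is why `r2_before - r2_after` matches `spr` in this example.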

The bimodality coefficient is

\[  b = \frac{m_3^2 + 1}{m_4 + \frac{3(n-1)^2}{(n-2)(n-3)}}  \]

where $m_3$ is the skewness and $m_4$ is the kurtosis (excess kurtosis, which is 0 for a normal distribution). Values of $b$ greater than 0.555 (the value for a uniform population) can indicate bimodal or multimodal marginal distributions. The maximum of 1.0 is attained for a population with only two distinct values (a Bernoulli distribution). Very heavy-tailed distributions have small values of $b$ regardless of the number of modes.
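
The bimodality coefficient can be sketched in Python as shown below. This is only an illustration under stated assumptions: it assumes that $m_3$ and $m_4$ are the usual bias-corrected sample skewness and excess kurtosis (an assumption, because the estimators are not spelled out in this section) and that $n$ is the number of observations. The function name `bimodality_coefficient` is hypothetical.

```python
import math


def bimodality_coefficient(x):
    """b = (m3^2 + 1) / (m4 + 3(n-1)^2 / ((n-2)(n-3))) for one variable."""
    n = len(x)
    if n < 4:
        raise ValueError("at least four observations are required")
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    if var == 0.0:
        raise ValueError("the variable is constant")
    s = math.sqrt(var)
    z3 = sum(((xi - mean) / s) ** 3 for xi in x)
    z4 = sum(((xi - mean) / s) ** 4 for xi in x)
    correction = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    # Assumed estimators: bias-corrected sample skewness and excess kurtosis.
    m3 = n / ((n - 1) * (n - 2)) * z3
    m4 = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4 - correction
    return (m3 ** 2 + 1.0) / (m4 + correction)


two_valued = [0.0] * 5000 + [1.0] * 5000
evenly_spaced = [float(i) for i in range(1000)]
print(bimodality_coefficient(two_valued))     # close to 1 for a large two-valued sample
print(bimodality_coefficient(evenly_spaced))  # close to 0.555 for a large uniform-like sample
```

In finite samples the computed values only approximate the population values of 1.0 and 0.555, so the 0.555 threshold is best read as a rough guide rather than a sharp cutoff.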

Formulas for the cubic clustering criterion and the approximate expected $R^2$ are given in Sarle (1983).

The pseudo $F$ statistic for a given level is

\[  \mbox{pseudo } F = \frac{ \frac{T - P_ G}{G - 1}}{ \frac{P_ G}{n - G}}  \]
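
A minimal Python sketch of the pseudo $F$ computation follows. It assumes, as this section does not restate, that $T$ is the total sum of squares, $P_G$ is the pooled within-cluster sum of squares at the level with $G$ clusters, and $n$ is the number of observations. The function names are hypothetical. The second function expresses the same quantity through $R^2 = 1 - P_G/T$, which follows algebraically from the two formulas.

```python
def pseudo_f(total_ss, pooled_within_ss, n_obs, n_clusters):
    """Pseudo F = ((T - P_G) / (G - 1)) / (P_G / (n - G))."""
    if not 1 < n_clusters < n_obs:
        raise ValueError("pseudo F requires 1 < G < n")
    if pooled_within_ss <= 0.0:
        raise ValueError("P_G must be positive")
    between = (total_ss - pooled_within_ss) / (n_clusters - 1)
    within = pooled_within_ss / (n_obs - n_clusters)
    return between / within


def pseudo_f_from_r_square(r2, n_obs, n_clusters):
    """Equivalent form: (R^2 / (G - 1)) / ((1 - R^2) / (n - G))."""
    return (r2 / (n_clusters - 1)) / ((1.0 - r2) / (n_obs - n_clusters))


# Illustrative values only: T = 90, P_G = 6, n = 6 observations, G = 3 clusters.
f_value = pseudo_f(90.0, 6.0, 6, 3)
print(f_value)  # (84 / 2) / (6 / 3) = 21.0
print(pseudo_f_from_r_square(1.0 - 6.0 / 90.0, 6, 3))  # same value up to rounding
```

The statistic is undefined at the one-cluster level ($G = 1$) and at the level where every observation is its own cluster ($G = n$), which the sketch rejects explicitly.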

The pseudo $t^2$ statistic for joining $C_ K$ and $C_ L$ is

\[  \mbox{pseudo } t^2 = \frac{B_{KL}}{\frac{W_ K + W_ L}{N_ K + N_ L - 2}}  \]
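
The pseudo $t^2$ formula can be sketched in Python as follows. The sketch assumes definitions not restated in this section: $W_K$ and $W_L$ are within-cluster sums of squared distances to the respective cluster means, and $B_{KL} = W_M - W_K - W_L$, where $C_M$ is the cluster formed by joining $C_K$ and $C_L$. The function names are hypothetical.

```python
def within_ss(points):
    """Assumed definition of W: sum of squared distances to the mean of `points`."""
    n = len(points)
    v = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(v)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(v))


def pseudo_t2(cluster_k, cluster_l):
    """Pseudo t^2 = B_KL / ((W_K + W_L) / (N_K + N_L - 2))."""
    w_k = within_ss(cluster_k)
    w_l = within_ss(cluster_l)
    w_m = within_ss(list(cluster_k) + list(cluster_l))
    b_kl = w_m - w_k - w_l                     # assumed definition of B_KL
    dof = len(cluster_k) + len(cluster_l) - 2  # N_K + N_L - 2
    if dof <= 0 or w_k + w_l == 0.0:
        raise ValueError("pseudo t^2 is undefined for these clusters")
    return b_kl / ((w_k + w_l) / dof)


# Two small clusters on two variables (illustrative data only).
c_k = [(0.0, 0.0), (0.0, 2.0)]
c_l = [(3.0, 0.0), (3.0, 2.0)]
t2 = pseudo_t2(c_k, c_l)
print(t2)  # B_KL = 9, W_K + W_L = 4, N_K + N_L - 2 = 2, so 9 / (4 / 2) = 4.5
```

The statistic cannot be computed when both clusters are singletons or when both have zero within-cluster variation, so the sketch raises an error in those cases.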

The pseudo $F$ and $t^2$ statistics can be useful indicators of the number of clusters, but they are not distributed as $F$ and $t^2$ random variables. If the data are independently sampled from a multivariate normal distribution with a scalar covariance matrix and if the clustering method allocates observations to clusters randomly (which no clustering method actually does), then the pseudo $F$ statistic is distributed as an $F$ random variable with $v(G - 1)$ and $v(n - G)$ degrees of freedom. Under the same assumptions, the pseudo $t^2$ statistic is distributed as an $F$ random variable with $v$ and $v(N_K + N_L - 2)$ degrees of freedom.

The pseudo $t^2$ statistic differs computationally from Hotelling’s $T^2$ in that the latter uses a general symmetric covariance matrix instead of a scalar covariance matrix. The pseudo $F$ statistic was suggested by Caliński and Harabasz (1974). The pseudo $t^2$ statistic is related to the $J_e(2)/J_e(1)$ statistic of Duda and Hart (1973) by

\[  \frac{J_ e (2)}{J_ e (1)} = \frac{W_ K + W_ L}{W_ M} = \frac{1}{1 + \frac{t^2}{N_ K + N_ L - 2}}  \]
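
The second equality can be verified in one step, assuming (as this section does not restate) that $C_M$ is the cluster formed by joining $C_K$ and $C_L$ and that $B_{KL} = W_M - W_K - W_L$:

\[  \frac{W_K + W_L}{W_M} = \frac{W_K + W_L}{W_K + W_L + B_{KL}} = \frac{1}{1 + \frac{B_{KL}}{W_K + W_L}}, \qquad \frac{B_{KL}}{W_K + W_L} = \frac{t^2}{N_K + N_L - 2}  \]

The last identity is the definition of the pseudo $t^2$ statistic divided by $N_K + N_L - 2$. Large values of pseudo $t^2$ therefore correspond to small values of the ratio $J_e(2)/J_e(1)$.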

See Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the performance of these statistics in estimating the number of population clusters. Conservative tests for the number of clusters using the pseudo $F$ and $t^2$ statistics can be obtained by the Bonferroni approach (Hawkins, Muller, and ten Krooden 1982, pp. 337–340).