The HPBIN Procedure

Binning Computation and Formulas

For variable x, assume that the data set is { $x_ i$ }, where $i=1,2,\dots ,n$ . Let $\min (x) = \min _{i \in \{ 1..n\} }{\{ x_ i\} }$ , and let $\max (x) = \max _{i \in \{ 1..n\} }{\{ x_ i\} }$ . The range of the variable is $\mr{range}(x) = \max (x) - \min (x)$ .

The computations for the various binning methods are as follows:

For bucket binning, the length of the bucket is

$L = \frac{\max (x) - \min (x)}{n}$

The split points are

$s_ k = \min (x) + L*k$

where $k=1,2,\dots , \Argument{numbin}-1$ , and numbin is the value of the NUMBIN= option in the PROC HPBIN statement.

When the data are evenly distributed on the SAS appliance, the time complexity for bucket binning is $\mathcal{O}(n)/(\Mathtext{nodes}*\Mathtext{cpus})$ , where n is the number of observations, $\Mathtext{nodes}$ is the number of computer nodes on the appliance, and $\Mathtext{cpus}$ is the number of CPUs on each node.
For quantile binning, PROC HPBIN calculates a quantile table $P$ . Let $P = \{ p_ k\}$ , where $k=1,2,\dots ,\Argument{numbin}$ . Then $p_ k$ is described as

$p_ k = \begin{cases} 1.0/\Argument{numbin} & \text { if } k = 0 \\ 1.0/\Argument{numbin} + p_{k-1} & \text { if } 0 < k < \Argument{numbin} \\ 1.0 & \text { if } k = \Argument{numbin} \\ \end{cases}$

Quantile binning often requires data to be sorted in a particular way, and the sorting process usually consumes a significant amount of CPU time and memory. When the input data set is larger than the available memory, the sorting algorithm becomes more complicated. In distributed computing, data communications overhead also increases the sorting challenge. To avoid the time-consuming sorting process, the HPBIN procedure uses an iterative projection method for quantile binning, which runs much faster than sorting-based quantile binning method in most cases.

After calculating the quantile table, PROC HPBIN uses an iterative projection method to compute quantiles (percentiles) according to the formula that is described in the section Computing the Quantiles (Percentiles).

Quantile binning aims to assign the same number of observations to each bin, if the number of observations is evenly divisible by the number of bins. As a result, each bin should have the same number of observations, provided that there are no tied values at the boundaries of the bins. Because PROC HPBIN always assigns observations that have the same value to the same bin, quantile binning might create unbalanced bins if any variable has tied values. For example, if an observation whose value is x is assigned to bin k, then every observation whose value is x is assigned to bin k for this variable, and no observation whose value is x is assigned to the next bin, bin $k+1$ . Therefore, bin k might have more observations than bin $k+1$ , because the tied values at the boundaries between bin k and bin $k+1$ are all assigned to bin k. That is, tied values at the boundaries between two bins are always assigned to the lower-numbered bin.
For pseudo–quantile binning and Winsorized binning, the sorting algorithm is more complex to describe but very efficient to execute. For variable x, PROC HPBIN uses a simple bucket sorting method to obtain the basic information. Let $N$ = 10,000 be the number of buckets, ranging from $\min (x)$ to $\max (x)$ . For each bucket $B_ i$ , $i=1,2,\dots ,N$ , PROC HPBIN keeps following information:
- $c_ i$ : count of x in $B_ i$
- $\min _ i$ : minimum value of x in $B_ i$
- $\max _ i$ : maximum value of x in $B_ i$
- $\sum _ i$ : sum of x in $B_ i$
- $\sum 2_ i$ : sum of $x^2$ in $B_ i$
To calculate the quantile table, let $P = \{ 0.00, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1.00\}$ . For each $p_ k \in P$ , $k=1,2,\dots ,11$ , find the smallest $I_ k$ such that $\sum _{i=1}^{I_ k}{c_ i} \geq p_ k*n$ . Therefore, the quantile value $Q_ k$ is obtained,

$Q_ k= \begin{cases} min_{I_ k} & \text { if } \sum _{i=1}^{I_ k}{c_ i} > p_ k*n \\ max_{I_ k} & \text { if } \sum _{i=1}^{I_ k}{c_ i} = p_ k*n \end{cases}$

where $k=1,2,\dots ,11$ .

PROC HPBIN calculates the split points as follows:
For pseudo–quantile binning, let the base count $bc = \mr{ceil}( \frac{n}{ \Argument{numbin}} )$ . PROC HPBIN finds those integers $\{ I_ k | k=1,2,\dots \}$ such that

$\begin{align*} \bigg(\sum _{i=1}^{I_ k}{c_ i} \geq \sum _{i=1}^{I_{k-1}}{c_ i} + bc & \quad \mathrm{or} \quad \sum _{i=1}^{I_ k}c_ i \geq \frac{n}{\Argument{numbin}}*k\bigg)\\ \mathrm{and} & \quad \sum _{i=1}^{I_ k}{c_ i} > \sum _{i=1}^{I_{k-1}}{c_ i} \\ \mathrm{and} & \quad \sum _{i=1}^{I_ k}{c_ i} < n \end{align*}$

where k is the kth split.

The split value is

$s_ k = \min (x) + \frac{\max (x) - \min (x)}{N}*I_ k$

where $k=1,2,\dots$ , and $k<\Argument{numbin}$ .

The time complexity for pseudo–quantile binning is $C\mathcal{O}(n)/(\Mathtext{nodes}*\Mathtext{cpus})$ , where $C$ is a constant that depends on the number of sorting bucket $N$ , n is the number of observations, $\Mathtext{nodes}$ is the number of computer nodes on the appliance, and $\Mathtext{cpus}$ is the number of CPUs on each node.
For Winsorized binning, the Winsorized statistics are computed first. After the minimum and maximum have been found, the split points are calculated the same way as in bucket binning.

Let the tail count $\mr{wc}$ be $\mr{ceil}(\mr{WinsorRate} * n)$ , and find the smallest $I$ such that $\sum _{i=1}^ Ic_ i \geq \mr{wc}$ . Then, the left tail count is $\mr{lwc} = \sum _{i=1}^ Ic_ i$ . Find the next $I_ l$ such that $\sum _{i=1}^{I_ l}c_ i > lwc$ . Therefore, the minimum value is $\mr{WinsorMin} = \min _{I_ l}$ . Similarly, find the largest $I$ such that $\sum _{i=I}^{N}c_ i \geq \mr{wc}$ . The right tail count is $\mr{rwc} = \sum _{i=I}^{N}c_ i$ . Find the next $I_ r$ such that $\sum _{i={I_ r}}^{N}c_ i > \mr{rwc}$ . Then the maximum value is $\mr{WinsorMax} = \max _{I_ r}$ . The mean is calculated by the formula

$\mr{WinsorMean} = \frac{\mr{lwc} * \mr{WinsorMin} + \sum _{i=I_ l}^{I_ r}{sum_ i} + \mr{rwc} * \mr{WinsorMax} }{n}$

The trimmed mean is calculated by the formula

$\mr{trimmedMean} = \frac{\sum _{i=I_ l}^{I_ r}{sum_ i}}{n - \mr{lwc} - \mr{rwc}}$

Note: PROC HPBIN prints an error or a warning message when the results might not be accurate.

Note: PROC HPBIN does not allow empty bins. If an empty bin is detected because of an insufficient number of nonmissing observations, PROC HPBIN issues an error and exits.

Note: If PROC HPBIN detects an empty bin followed by a bin that is not empty, it skips the empty bin and does not assign a number to it. In this case, the number of bins that PROC HPBIN generates is less than the specified number of bins.