The HPBIN Procedure

Binning Computation and Formulas

For variable x, assume that the data set is {$x_ i$}, where $i=1,2,...,n$. Let $min(x) = \min _{i \in \{ 1..n\} }{\{ x_ i\} }$, and let $max(x) = \max _{i \in \{ 1..n\} }{\{ x_ i\} }$. The range of the variable is $range(x) = max(x) - min(x)$.

The computations for the bucket, pseudo–quantile, and Winsorized binning methods are as follows:

  • For bucket binning, the length of the bucket is

    \[  L = \frac{max(x) - min(x)}{n}  \]

    and the split points are

    \[ s_ k = min(x) + L*k \]

    where $k=1,2,...,numbin-1$.

    When the data are evenly distributed on the SAS appliance, the time complexity for bucket binning is $\mathcal{O}(n)/(nodes*cpus)$, where $n$ is the number of observations, $nodes$ is the number of computer nodes on the appliance, and $cpus$ is the number of CPUs on each node.

  • For pseudo–quantile binning and Winsorized binning, the sorting algorithm is more complex. For variable $x$, a simple bucket sorting method is used to obtain the basic information. Let $N = 10,000$ be the number of buckets, ranging from $min(x)$ to $max(x)$. For each bucket $B_ i$, $i=1,2,...,N$, PROC HPBIN keeps following information:

    • $c_ i$: count of $x$ in $B_ i$

    • $min_ i$: minimum value of $x$ in $B_ i$

    • $max_ i$: maximum value of $x$ in $B_ i$

    • $sum_ i$: sum of $x$ in $B_ i$

    • $sum2_ i$: sum of $x^2$ in $B_ i$

    To calculate the quantile table, let $ P = \{ 0.00, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1.00\} $. For each $p_ k \in P$, $k=1,2,...,11$, find the smallest $I_ k$, such that $\sum _{i=1}^{I_ k}{c_ i} >= p_ k*n$. Therefore, the quantile value $Q_ k$ is obtained,

    \[  Q_ k= \begin{cases}  min_{I_ k} &  \text { if } \sum _{i=1}^{I_ k}{c_ i} > p_ k*n \\ max_{I_ k} &  \text { if } \sum _{i=1}^{I_ k}{c_ i} = p_ k*n \end{cases}  \]

    where $k=1,2,...,11$.

    For pseudo–quantile binning, the split points are calculated. Let the base count $bc = ceil( \frac{n}{ numbin} )$. Find those integers $\{ I_ k | k=1,2,...\} $ such that:

    \begin{align*}  \bigg(\sum _{i=1}^{I_ k}{c_ i} >= \sum _{i=1}^{I_{k-1}}{c_ i} + bc & \quad \mathrm{or} \quad \sum _{i=1}^{I_ k}c_ i >= \frac{n}{numbin}*k\bigg)\\ \mathrm{and} &  \quad \sum _{i=1}^{I_ k}{c_ i} > \sum _{i=1}^{I_{k-1}}{c_ i} \\ \mathrm{and} &  \quad \sum _{i=1}^{I_ k}{c_ i} < n \end{align*}

    where $k$ is the $k$th split. The split value is

    \[  s_ k = min(x) + \frac{max(x) - min(x)}{N}*I_ k  \]

    where $k=1,2,...$, and $k<numbin$.

    The time complexity for pseudo–quantile binning is $C\mathcal{O}(n)/(nodes*cpus)$, where $C$ is a constant that depends on the number of sorting bucket $N$, $n$ is the number of observations, $nodes$ is the number of computer nodes on the appliance, and $cpus$ is the number of CPUs on each node.

  • For Winsorized binning, the Winsorized statistics are computed first. After the minimum and maximum have been found, the split points are calculated the same way as in bucket binning.

    Let the tail count $wc = ceil(WinsorRate * n)$, and find the smallest $I$, such that $\sum _{i=1}^ Ic_ i >= wc$. Then, the left tail count is $lwc = \sum _{i=1}^ Ic_ i$. Find the next $I_ l$, such that $\sum _{i=1}^{I_ l}c_ i > lwc$. Therefore, the minimum value is $WinsorMin = min_{I_ l}$. Similarly, find the largest $I$, such that $\sum _{i=I}^{N}c_ i >= wc$. The right tail count is $rwc = \sum _{i=I}^{N}c_ i $. Find the next $I_ r$, such that $\sum _{i={I_ r}}^{N}c_ i > rwc$. Then the maximum value is $WinsorMax = max_{I_ r}$. The mean is calculated by the formula

    \[  WinsorMean = \frac{lwc * WinsorMin + \sum _{i=I_ l}^{I_ r}{sum_ i} + rwc * WinsorMax }{n}  \]

    The trimmed mean is calculated by the formula

    \[  trimmedMean = \frac{\sum _{i=I_ l}^{I_ r}{sum_ i}}{n - lwc - rwc}  \]

Note: If PROC HPBIN prints an error or a warning message, the results may not be accurate.