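For variable $x$, assume that the data set is $\{x_i\}$, where $i = 1, 2, \ldots, n$. Let $\min(x) = \min_{i \in \{1, \ldots, n\}} x_i$, and let $\max(x) = \max_{i \in \{1, \ldots, n\}} x_i$. The range of the variable is $\mathrm{range}(x) = \max(x) - \min(x)$.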
The computations for the bucket, pseudo–quantile, and Winsorized binning methods are as follows:
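For bucket binning, the length of the bucket is
$$L = \frac{\max(x) - \min(x)}{\mathit{numbin}}$$
and the split points are
$$s_k = \min(x) + L \cdot k$$
where $k = 1, 2, \ldots, \mathit{numbin} - 1$, and $\mathit{numbin}$ is the requested number of bins.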
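The bucket computation can be sketched in a few lines of Python. This is only an illustration of equal-width splitting, not PROC HPBIN code; the function name `bucket_split_points` and the parameter `numbin` are hypothetical names chosen for this example.

```python
def bucket_split_points(values, numbin):
    """Return the numbin - 1 equal-width split points for the given values."""
    if numbin < 1:
        raise ValueError("numbin must be at least 1")
    lo, hi = min(values), max(values)
    length = (hi - lo) / numbin  # bucket length: range divided by number of bins
    # Split point k sits k bucket lengths above the minimum.
    return [lo + length * k for k in range(1, numbin)]


print(bucket_split_points([2.0, 4.0, 7.0, 10.0, 12.0], numbin=4))
# prints [4.5, 7.0, 9.5]
```

Only the minimum and maximum of the data are needed, which is why a single pass over the observations suffices for this method.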
When the data are evenly distributed on the SAS appliance, the time complexity for bucket binning is $O\left(\frac{n}{\mathit{nodes} \times \mathit{cpus}}\right)$, where $n$ is the number of observations, $\mathit{nodes}$ is the number of computer nodes on the appliance, and $\mathit{cpus}$ is the number of CPUs on each node.
For pseudo–quantile binning and Winsorized binning, the sorting algorithm is more complex. For variable $x$, a simple bucket sorting method is used to obtain the basic information. Let $N$ be the number of buckets, ranging from $\min(x)$ to $\max(x)$. For each bucket $B_i$, $i = 1, 2, \ldots, N$, PROC HPBIN keeps the following information:
$c_i$: count of $x$ in $B_i$
$\min_i$: minimum value of $x$ in $B_i$
$\max_i$: maximum value of $x$ in $B_i$
$\mathit{sum}_i$: sum of $x$ in $B_i$
$\mathit{sum2}_i$: sum of $x^2$ in $B_i$
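As a rough illustration of the per-bucket bookkeeping listed above, the following Python sketch sorts values into equal-width buckets between the minimum and the maximum and accumulates the five statistics. The `Bucket` class, the `summarize` function, and the rule that places the maximum value in the last bucket are assumptions made for this example; they are not taken from PROC HPBIN.

```python
import math
from dataclasses import dataclass


@dataclass
class Bucket:
    count: int = 0                # number of values in the bucket
    minimum: float = math.inf     # smallest value seen in the bucket
    maximum: float = -math.inf    # largest value seen in the bucket
    total: float = 0.0            # sum of the values
    total_sq: float = 0.0         # sum of the squared values


def summarize(values, n_buckets):
    """Accumulate per-bucket statistics over equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [Bucket() for _ in range(n_buckets)]
    for x in values:
        if width == 0:
            idx = 0  # all values are equal, so everything lands in one bucket
        else:
            # Assumed convention: the maximum value goes in the last bucket.
            idx = min(int((x - lo) / width), n_buckets - 1)
        b = buckets[idx]
        b.count += 1
        b.minimum = min(b.minimum, x)
        b.maximum = max(b.maximum, x)
        b.total += x
        b.total_sq += x * x
    return buckets


buckets = summarize([1.0, 2.0, 3.0, 4.0, 5.0, 9.0], n_buckets=4)
print([b.count for b in buckets])
```

Because each statistic is a count, a minimum, a maximum, or a sum, summaries computed on separate nodes can be merged bucket by bucket, which suits a distributed setting.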
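To calculate the quantile table, let $P = \{p_1, p_2, \ldots, p_K\}$ be the set of quantile levels in the table. For each $p_k \in P$, $k = 1, 2, \ldots, K$, find the smallest $I_k$ such that $\sum_{i=1}^{I_k} c_i \ge p_k \cdot n$. Therefore, the quantile value $Q_k$ is obtained,
$$Q_k = \begin{cases} \min_{I_k} & \text{if } \sum_{i=1}^{I_k} c_i > p_k \cdot n \\ \max_{I_k} & \text{if } \sum_{i=1}^{I_k} c_i = p_k \cdot n \end{cases}$$
where $k = 1, 2, \ldots, K$.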
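The lookup can be sketched in Python as follows. The sketch assumes one particular reading of the rule: the quantile is the minimum of the selected bucket when the cumulative count exceeds the target, and the maximum of that bucket when the cumulative count equals the target exactly. The function name `bucket_quantile` and its list arguments are hypothetical.

```python
from itertools import accumulate


def bucket_quantile(counts, mins, maxs, p):
    """Approximate the p-th quantile from per-bucket counts, minima, and maxima."""
    n = sum(counts)
    target = p * n  # note: subject to floating-point rounding for some p
    for i, cum in enumerate(accumulate(counts)):
        # Empty buckets carry no minimum or maximum, so skip them.
        if counts[i] > 0 and cum >= target:
            # Assumed rule: bucket minimum if the target is passed,
            # bucket maximum if it is hit exactly.
            return mins[i] if cum > target else maxs[i]
    raise ValueError("p must lie between 0 and 1")


counts = [2, 2, 1, 1]
mins = [1.0, 3.0, 5.0, 9.0]
maxs = [2.0, 4.0, 5.0, 9.0]
print(bucket_quantile(counts, mins, maxs, 0.5))
```

Under this reading, a level of 0 returns the overall minimum and a level of 1 returns the overall maximum. The result is exact only up to the resolution of the buckets, which is why the method is described as a pseudo-quantile.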
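For pseudo–quantile binning, the split points are calculated. Let the base count $bc = \left\lceil \frac{n}{\mathit{numbin}} \right\rceil$. Find those integers $\{I_k \mid k = 1, 2, \ldots\}$ such that:
$$\left| \sum_{i=1}^{I_k} c_i - bc \cdot k \right| = \min \left\{ \left| \sum_{i=1}^{j} c_i - bc \cdot k \right| : I_{k-1} < j \le N \right\}$$
where $k$ is the index of the $k$th split. The split value is
$$s_k = \min(x) + \frac{\max(x) - \min(x)}{N} \cdot I_k$$
where $I_0 = 0$, and $I_k < N$.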
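A minimal Python sketch of the split search follows. It assumes that the $k$th split index is the bucket whose cumulative count is closest to $k$ times the base count, searched only among buckets after the previous split and before the last bucket; that selection rule, the tie-breaking toward the smaller index, and the name `pseudo_quantile_splits` are assumptions of this example rather than documented PROC HPBIN behavior.

```python
import math
from itertools import accumulate


def pseudo_quantile_splits(counts, lo, hi, numbin):
    """Pick split points so that each bin holds roughly n / numbin values."""
    n = sum(counts)
    n_buckets = len(counts)
    base = math.ceil(n / numbin)       # base count per bin
    cum = list(accumulate(counts))     # cum[j - 1] = count in buckets 1..j
    width = (hi - lo) / n_buckets      # width of one sorting bucket
    splits, prev = [], 0
    for k in range(1, numbin):
        target = k * base
        # 1-based bucket indices after the previous split, kept below n_buckets.
        candidates = range(prev + 1, n_buckets)
        if len(candidates) == 0:
            break  # no buckets left to split on
        # Assumed rule: closest cumulative count wins; ties go to the smaller index.
        best = min(candidates, key=lambda j: abs(cum[j - 1] - target))
        splits.append(lo + width * best)  # upper edge of bucket `best`
        prev = best
    return splits


print(pseudo_quantile_splits([1, 3, 2, 2], lo=0.0, hi=8.0, numbin=4))
```

Split values always fall on bucket edges, so the bins are only approximately equal in frequency; a finer bucket grid gives a closer approximation at the cost of more bookkeeping.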
The time complexity for pseudo–quantile binning is $C \cdot O\left(\frac{n}{\mathit{nodes} \times \mathit{cpus}}\right)$, where $C$ is a constant that depends on the number of sorting buckets $N$, $n$ is the number of observations, $\mathit{nodes}$ is the number of computer nodes on the appliance, and $\mathit{cpus}$ is the number of CPUs on each node.
For Winsorized binning, the Winsorized statistics are computed first. After the minimum and maximum have been found, the split points are calculated the same way as in bucket binning.
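Let the tail count $wc = \lceil \mathit{WinsorRate} \cdot n \rceil$, and find the smallest $I_l$ such that $\sum_{i=1}^{I_l} c_i \ge wc$. Then, the left tail count is $lwc = \sum_{i=1}^{I_l} c_i$. Find the next $I_m$ such that $\sum_{i=1}^{I_m} c_i > lwc$. Therefore, the minimum value is $\mathit{WinsorMin} = \min_{I_m}$. Similarly, find the largest $I_r$ such that $\sum_{i=I_r}^{N} c_i \ge wc$. The right tail count is $rwc = \sum_{i=I_r}^{N} c_i$. Find the next $I_{m'}$ such that $\sum_{i=I_{m'}}^{N} c_i > rwc$. Then the maximum value is $\mathit{WinsorMax} = \max_{I_{m'}}$. The mean is calculated by the formula
$$\mathit{WinsorMean} = \frac{lwc \cdot \mathit{WinsorMin} + \sum_{i=I_l+1}^{I_r-1} \mathit{sum}_i + rwc \cdot \mathit{WinsorMax}}{n}$$
The trimmed mean is calculated by the formula
$$\mathit{TrimmedMean} = \frac{\sum_{i=I_l+1}^{I_r-1} \mathit{sum}_i}{n - lwc - rwc}$$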
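The tail search and the two means can be sketched in Python from the per-bucket summaries. The sketch assumes that the tails are whole buckets, that the buckets strictly between the two tails contribute their sums unchanged, and that the rate lies strictly between 0 and 0.5; the function name `winsorized_stats` and its arguments are hypothetical.

```python
import math


def winsorized_stats(counts, mins, maxs, sums, rate):
    """Winsorized minimum, maximum, mean, and trimmed mean from bucket summaries."""
    if not 0 < rate < 0.5:
        raise ValueError("rate is assumed to lie strictly between 0 and 0.5")
    n = sum(counts)
    wc = math.ceil(rate * n)  # tail count; rate * n may carry float rounding

    # Left tail: consume buckets from the start until at least wc values are covered.
    lwc, i_l = 0, -1
    while lwc < wc:
        i_l += 1
        lwc += counts[i_l]

    # Right tail: consume buckets from the end in the same way.
    rwc, i_r = 0, len(counts)
    while rwc < wc:
        i_r -= 1
        rwc += counts[i_r]

    if n - lwc - rwc <= 0:
        raise ValueError("tails overlap; too few values between them")

    # The next non-empty bucket inside each tail supplies the Winsorized bound.
    i_m = next(i for i in range(i_l + 1, i_r) if counts[i] > 0)
    i_m2 = next(i for i in range(i_r - 1, i_l, -1) if counts[i] > 0)
    winsor_min, winsor_max = mins[i_m], maxs[i_m2]

    middle_sum = sum(sums[i_l + 1:i_r])  # buckets strictly between the tails
    return {
        "winsor_min": winsor_min,
        "winsor_max": winsor_max,
        "winsor_mean": (lwc * winsor_min + middle_sum + rwc * winsor_max) / n,
        "trimmed_mean": middle_sum / (n - lwc - rwc),
    }


stats = winsorized_stats(
    counts=[1, 3, 2, 2],
    mins=[1.0, 3.5, 5.0, 8.0],
    maxs=[1.0, 4.5, 6.0, 9.0],
    sums=[1.0, 12.0, 11.0, 17.0],
    rate=0.125,
)
print(stats)
```

In this example the requested tail count is 1, but the right tail covers 2 values because a whole bucket is taken at once. The actual tail counts can therefore exceed the requested count when buckets are coarse.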
Note: If PROC HPBIN prints an error or a warning message, the results may not be accurate.