The HPBIN Procedure

PROC HPBIN Statement

PROC HPBIN <options> ;

The PROC HPBIN statement invokes the procedure. Table 3.1 summarizes important options in the PROC HPBIN statement by function.

Table 3.1: PROC HPBIN Statement Options

Option                Description

Basic Options
  DATA=               Specifies the input data set
  OUTPUT=             Specifies the output data set
  NOPRINT             Suppresses the ODS output

Binning Level Options
  NUMBIN=             Specifies the global number of bins for all binning variables

Binning Method Options
  BUCKET              Specifies the bucket binning method
  WINSOR WINSORRATE=  Specifies the Winsorized binning method and the rate that it uses
  PSEUDO_QUANTILE     Specifies the pseudo-quantile binning method

Statistics Options
  COMPUTESTATS        Computes the basic statistics of the variables
  COMPUTEQUANTILE     Computes a quantile estimation of the variables

Weight-of-Evidence Options
  BINS_META=          Specifies the BINS_META input data set, which contains the binning results
  WOE                 Computes the weight of evidence and information values
  WOEADJUST=          Specifies the adjustment factor for the weight-of-evidence calculation

You can specify the following optional arguments:

BINS_META=SAS-data-set

specifies the BINS_META input data set, which contains the binning results. The BINS_META data set contains six variables: variable name, binned variable name, lower bound, upper bound, bin, and range. The mapping table that is generated by PROC HPBIN can be used as the BINS_META data set.

BUCKET | WINSOR WINSORRATE=number | PSEUDO_QUANTILE

specifies which binning method to use. If you specify BUCKET, PROC HPBIN uses equal-length binning. If you specify WINSOR, PROC HPBIN uses Winsorized binning, and you must also specify the WINSORRATE= option, where number is a value between 0.0 and 0.5, exclusive. If you specify PSEUDO_QUANTILE, PROC HPBIN generates a result that approximates quantile binning. You can specify only one binning method. The default is BUCKET.

However, when a BINS_META data set is specified, PROC HPBIN does not perform binning and ignores the binning method options, the binning level options, and the INPUT statement. Instead, PROC HPBIN takes the binning results from the BINS_META data set and calculates the weight of evidence and information value.
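The three binning methods described above could be requested as in the following minimal sketch. It assumes that the SASHELP.CARS sample data set is available; the choice of Horsepower and Weight as binning variables and the WINSORRATE= value of 0.05 are illustrative only.

```sas
/* Bucket (equal-length) binning. BUCKET is the default, so naming it is optional. */
proc hpbin data=sashelp.cars bucket;
   input Horsepower Weight;
run;

/* Winsorized binning. WINSORRATE= is required with WINSOR and must lie
   strictly between 0.0 and 0.5; 0.05 is an arbitrary illustrative value. */
proc hpbin data=sashelp.cars winsor winsorrate=0.05;
   input Horsepower Weight;
run;

/* Pseudo-quantile binning, which approximates quantile binning. */
proc hpbin data=sashelp.cars pseudo_quantile;
   input Horsepower Weight;
run;
```

Each step specifies exactly one method, because the binning method options are mutually exclusive.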

COMPUTEQUANTILE

computes estimated quantiles. If you specify COMPUTEQUANTILE, PROC HPBIN generates the estimated quantiles and extremes table, which reports the following percentiles: 0% (Min), 1%, 5%, 10%, 25% (Q1), 50% (Median), 75% (Q3), 90%, 95%, 99%, and 100% (Max).

COMPUTESTATS

computes basic statistics. If you specify COMPUTESTATS, PROC HPBIN computes basic statistical information and can provide it as ODS output. The output table contains six statistics for each binning variable: the mean, pseudo-median, standard deviation, minimum, maximum, and number of bins.
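As a hedged illustration of the two statistics options, the following sketch requests each table in a separate step. It assumes the SASHELP.CARS sample data set; MSRP and Horsepower are arbitrary example variables.

```sas
/* Basic statistics (mean, pseudo-median, standard deviation, minimum,
   maximum, and number of bins) for each binning variable. */
proc hpbin data=sashelp.cars computestats;
   input MSRP Horsepower;
run;

/* Estimated quantiles and extremes, from 0% (Min) through 100% (Max). */
proc hpbin data=sashelp.cars computequantile;
   input MSRP Horsepower;
run;
```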

DATA=SAS-data-set

specifies the input SAS data set or database table to be used by PROC HPBIN.

If the procedure executes in distributed mode, the input data are distributed to memory on the appliance nodes and analyzed in parallel, unless the data are already distributed in the appliance database. In that case, PROC HPBIN reads the data alongside the distributed database.

For single-machine mode, the input must be a SAS data set.

NOPRINT

suppresses the generation of ODS outputs.

NUMBIN=integer

specifies the global number of binning levels for all binning variables. The value of integer can be any integer between 2 and 1,000, inclusive. The default number of binning levels is 16.

The resulting number of binning levels might be less than the specified integer if the sample size is small or if the data are not normalized. In this case, PROC HPBIN issues a warning message.

You can specify a different number of binning levels for each variable in an INPUT statement. The number of binning levels that you specify in an INPUT statement overrides the global number of binning levels.
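The interaction between the global NUMBIN= option and a per-statement setting might look like the following sketch. The `/ NUMBIN=` syntax on the INPUT statement is not shown in this section and is assumed here from the INPUT statement's own documentation; the data set and variables are illustrative.

```sas
proc hpbin data=sashelp.cars numbin=8;
   /* These variables use the global setting of 8 binning levels. */
   input Horsepower Weight;

   /* Assumed INPUT statement option: this variable uses 4 binning
      levels instead of the global 8. */
   input MSRP / numbin=4;
run;
```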

OUTPUT=SAS-data-set

creates an output SAS data set in single-machine mode or a database table that is saved alongside the distributed database in distributed mode. The output data set or table contains binning variables. To avoid data duplication for large data sets, the variables in the input data set are not included in the output data set.
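A minimal single-machine sketch of writing the binning results to a data set while suppressing the ODS tables follows. The output data set name WORK.CARS_BINNED is hypothetical, and SASHELP.CARS is assumed to be available.

```sas
/* Write the binning variables to WORK.CARS_BINNED (hypothetical name)
   and suppress the ODS output with NOPRINT. */
proc hpbin data=sashelp.cars output=work.cars_binned noprint;
   input Horsepower Weight;
run;

/* List the variables in the output data set. The input variables
   themselves are not copied to it. */
proc contents data=work.cars_binned;
run;
```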

WOE

computes the weight of evidence (WOE) and information value (IV).

WOEADJUST=number

specifies the adjustment factor for the weight-of-evidence calculation. You can specify any value from 0.0 to 1.0, inclusive, for number. The default is 0.5.
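The weight-of-evidence options are typically used in a two-step flow: first bin the variables and save the mapping table, then pass that table back through BINS_META=. The sketch below makes several assumptions that are not stated in this section: that the mapping table has the ODS name Mapping, that the target variable is named in a TARGET statement with the LEVEL= and ORDER= options shown, and that a derived binary variable (the hypothetical `expensive`) is an acceptable target. Check the TARGET statement and ODS table name documentation before relying on it.

```sas
/* Hypothetical binary target: 1 if the list price exceeds $30,000. */
data work.cars_flag;
   set sashelp.cars;
   expensive = (MSRP > 30000);
run;

/* Step 1: bin the variables and capture the mapping table.
   The ODS table name Mapping is an assumption. */
proc hpbin data=work.cars_flag numbin=5;
   input Horsepower Weight;
   ods output Mapping=work.mapping;
run;

/* Step 2: reuse the mapping as BINS_META= and compute WOE and IV.
   WOEADJUST=0.5 is the documented default, written out for clarity.
   The TARGET statement options are assumptions. */
proc hpbin data=work.cars_flag woe woeadjust=0.5 bins_meta=work.mapping;
   target expensive / level=nominal order=desc;
run;
```

Because BINS_META= is specified in the second step, no INPUT statement or binning method option is needed there; PROC HPBIN ignores them when a BINS_META data set is present.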