This example shows pseudo-quantile binning that is executed in distributed mode. The following DATA step generates 1,000,000 observations:
    data ex12;
       length id 8;
       do id=1 to 1000000;
          x1 = ranuni(101);
          x2 = 10*ranuni(201);
          output;
       end;
    run;
You can run PROC HPBIN in distributed mode by specifying valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables by using OPTIONS SET commands. See the section Processing Modes in Chapter 3: Shared Concepts and Topics, for details about setting these options or environment variables.
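As a minimal sketch of the environment-variable alternative, the following OPTIONS SET statements assign the two variables. The host name and installation path shown here are placeholders chosen for illustration, not values from this example; substitute the values for your own grid.

```sas
/* Placeholder values (assumptions): replace them with your grid host */
/* name and the grid installation location for your site.            */
option set=GRIDHOST="gridhost.example.com";
option set=GRIDINSTALLLOC="/opt/TKGrid";
```

When the environment variables are set this way, the HOST= and INSTALL= options can be left out of the PERFORMANCE statement.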
The following statements provide an example. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.
    ods output BinInfo=bininfo;
    ods output PerformanceInfo=perfInfo;
    ods output Mapping=mapTable;
    ods output Summary=Summary;
    ods output Quantile=Quantile;
    ods listing close;
    proc hpbin data=ex12 output=out numbin=10 pseudo_quantile
               computestats computequantile;
       input x1-x2;
       performance nodes=4 nthreads=8
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
    run;
    ods listing;
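The macro variables that the PERFORMANCE statement references must be defined before the preceding statements are submitted. A minimal sketch follows; both values are placeholders assumed for illustration and need to be replaced with the host name and installation location of your own grid.

```sas
/* Placeholder values (assumptions): not taken from this example.  */
/* The values are unquoted because the PERFORMANCE statement wraps */
/* the macro references in double quotation marks.                 */
%let GRIDHOST=gridhost.example.com;
%let GRIDINSTALLLOC=/opt/TKGrid;
```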
The "Performance Information" table in Output 4.2.1 shows the grid setting.
The "Binning Information" table in Output 4.2.2 shows the binning method, number of bins, and number of variables.
The "Mapping" table in Output 4.2.3 shows the level mapping of the input variables.
Output 4.2.3: PROC HPBIN Mapping

Mapping

| Variable | Binned Variable | Range | Frequency | Proportion |
|---|---|---|---|---|
| x1 | BIN_x1 | x1 < 0.099900 | 100046 | 0.10004600 |
| | | 0.099900 <= x1 < 0.199500 | 100029 | 0.10002900 |
| | | 0.199500 <= x1 < 0.299300 | 100016 | 0.10001600 |
| | | 0.299300 <= x1 < 0.399500 | 99939 | 0.09993900 |
| | | 0.399500 <= x1 < 0.500000 | 100049 | 0.10004900 |
| | | 0.500000 <= x1 < 0.599800 | 99989 | 0.09998900 |
| | | 0.599800 <= x1 < 0.700400 | 99975 | 0.09997500 |
| | | 0.700400 <= x1 < 0.800300 | 100014 | 0.10001400 |
| | | 0.800300 <= x1 < 0.900299 | 100007 | 0.10000700 |
| | | 0.900299 <= x1 | 99936 | 0.09993600 |
| x2 | BIN_x2 | x2 < 0.997008 | 100006 | 0.10000600 |
| | | 0.997008 <= x2 < 1.995006 | 100025 | 0.10002500 |
| | | 1.995006 <= x2 < 2.994005 | 99986 | 0.09998600 |
| | | 2.994005 <= x2 < 3.995004 | 100034 | 0.10003400 |
| | | 3.995004 <= x2 < 4.999002 | 99990 | 0.09999000 |
| | | 4.999002 <= x2 < 5.998001 | 100063 | 0.10006300 |
| | | 5.998001 <= x2 < 6.993000 | 99929 | 0.09992900 |
| | | 6.993000 <= x2 < 7.998998 | 100008 | 0.10000800 |
| | | 7.998998 <= x2 < 8.999997 | 100010 | 0.10001000 |
| | | 8.999997 <= x2 | 99949 | 0.09994900 |
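The frequencies in the "Mapping" table could be cross-checked against the OUTPUT= data set. The following sketch assumes that the data set out is available in the WORK library on the client after the procedure finishes, and that it contains the binned variables BIN_x1 and BIN_x2 that the table names.

```sas
/* Sketch under the assumptions stated above: tabulate the binned */
/* variables so that the counts can be compared with the          */
/* Frequency column of the "Mapping" table.                       */
proc freq data=out;
   tables BIN_x1 BIN_x2;
run;
```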
The "Summary Statistics" table in Output 4.2.4 displays the basic statistical information, including the number of observations, number of missing observations, mean, median, and so on.
The "Quantiles and Extremes" table in Output 4.2.5 shows the quantile computation of the variables. The ODS table is generated only when the COMPUTEQUANTILE option is specified in the PROC HPBIN statement.
Output 4.2.5: PROC HPBIN Quantile Computation

Quantiles and Extremes

| Variable | Quantile Level | Quantile |
|---|---|---|
| x1 | Max | 0.99999939 |
| | .99 | 0.99011639 |
| | .95 | 0.95024946 |
| | .90 | 0.90023557 |
| | .75 Q3 | 0.75032495 |
| | .50 Median | 0.49991238 |
| | .25 Q1 | 0.24931534 |
| | .10 | 0.09985729 |
| | .05 | 0.04954403 |
| | .01 | 0.01000524 |
| | Min | 2.24449E-7 |
| x2 | Max | 9.99999537 |
| | .99 | 9.90136979 |
| | .95 | 9.49989152 |
| | .90 | 8.99939011 |
| | .75 Q3 | 7.49894200 |
| | .50 Median | 4.99851593 |
| | .25 Q1 | 2.49431827 |
| | .10 | 0.99691767 |
| | .05 | 0.49879104 |
| | .01 | 0.10062442 |
| | Min | 9.10833E-6 |
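The example closes the LISTING destination while PROC HPBIN runs and saves each table in a data set through the ODS OUTPUT statements. One way to display those saved tables afterward is PROC PRINT, as in the following sketch. The data set names are the ones assigned in the ODS OUTPUT statements of this example; the sketch makes no assumption about the column names inside them.

```sas
/* Print two of the tables that the ODS OUTPUT statements saved. */
proc print data=mapTable;
run;

proc print data=Quantile;
run;
```

The same pattern applies to the bininfo, perfInfo, and Summary data sets.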