This example shows the use of the HPCOUNTREG procedure with an emphasis on large data set processing and the performance improvements that are achieved by executing in the high-performance distributed environment.
The following DATA step generates one million replicates from the zero-inflated Poisson (ZIP) model. The model contains seven regressors in the count process and three regressors in the zero-inflation process.
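    data simulate;
       call streaminit(12345);                      /* fix the seed for reproducibility */
       array vars x1-x7;                            /* count-process regressors */
       array zero_vars z1-z3;                       /* zero-inflation regressors */
       array parms{7} (.3 .4 .2 .4 -.3 -.5 -.3);    /* true count-process coefficients */
       array zero_parms{3} (-.6 .3 .2);             /* true zero-inflation coefficients */
       intercept=2;
       z_intercept=-1;
       theta=0.5;                                   /* not used in this ZIP simulation */
       do i=1 to 1000000;
          sum_xb=0;
          sum_gz=0;
          do j=1 to 7;
             vars[j]=rand('NORMAL',0,1);
             sum_xb=sum_xb+parms[j]*vars[j];
          end;
          mu=exp(intercept+sum_xb);                 /* Poisson mean */
          y_p=rand('POISSON', mu);
          do j=1 to 3;
             zero_vars[j]=rand('NORMAL',0,1);
             sum_gz = sum_gz+zero_parms[j]*zero_vars[j];
          end;
          z_gamma = z_intercept+sum_gz;
          pzero = cdf('LOGISTIC',z_gamma);          /* probability of a structural zero */
          cut=rand('UNIFORM');
          if cut<pzero then y_p=0;                  /* overwrite the count with a zero */
          output;
       end;
       keep y_p x1-x7 z1-z3;
    run;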
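For reference, the data-generating process that the DATA step encodes can be written as follows. The symbols $\beta$, $\gamma$, $\varphi_i$, and $\Lambda$ are notation introduced here for illustration; they do not appear in the original example. The coefficient values are those in the `parms` and `zero_parms` arrays.

```latex
\begin{aligned}
\mu_i &= \exp\Bigl(\beta_0 + \sum_{j=1}^{7} \beta_j x_{ij}\Bigr), \qquad \beta_0 = 2, \\
\varphi_i &= \Lambda\Bigl(\gamma_0 + \sum_{k=1}^{3} \gamma_k z_{ik}\Bigr), \qquad
  \Lambda(u) = \frac{1}{1 + e^{-u}}, \quad \gamma_0 = -1, \\
\Pr(y_i = 0) &= \varphi_i + (1 - \varphi_i)\, e^{-\mu_i}, \\
\Pr(y_i = m) &= (1 - \varphi_i)\, \frac{e^{-\mu_i} \mu_i^{m}}{m!}, \qquad m = 1, 2, \dots
\end{aligned}
```

Here $\varphi_i$ corresponds to `pzero` in the DATA step: with probability $\varphi_i$ the Poisson draw is replaced by a zero, and otherwise it is kept.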
The following statements estimate a zero-inflated Poisson model.
    option set=GRIDHOST="&GRIDHOST";
    option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";

    proc hpcountreg data=simulate dist=zip;
       performance nthreads=2 nodes=1 details
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
       model y_p=x1-x7;
       zeromodel y_p ~ z1-z3;
    run;
The model is executed in the distributed computing environment on two threads and only one node. These settings are used to obtain a hypothetical environment that might resemble running the HPCOUNTREG procedure on a desktop workstation with a dual-core CPU. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to the macro variables in the example with the appropriate values. Output 20.1.1 shows the "Performance Information" table for this hypothetical scenario.
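One way to supply the two macro variables is with %LET statements that run before the OPTION statements. This is a minimal sketch: the host name and installation path below are hypothetical placeholders, not values from this example, and must be replaced with the values for your own grid.

```sas
/* Hypothetical values: replace with your grid's head node and */
/* the directory where the grid software is installed.         */
%let GRIDHOST=gridhead.example.com;
%let GRIDINSTALLLOC=/opt/TKGrid;

/* Optional check: write the resolved values to the SAS log. */
%put NOTE: GRIDHOST=&GRIDHOST GRIDINSTALLLOC=&GRIDINSTALLLOC;
```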
Output 20.1.1: Performance Information with One Node and Two Threads
Output 20.1.2 shows the results for the zero-inflated Poisson model. The "Model Fit Summary" table shows detailed information about the model and indicates that all one million observations were used to fit the model. All parameter estimates in the "Parameter Estimates" table are highly significant and are close to the true values that were set in the data-generating process. The optimization of the model that contains one million observations took 40.77 seconds.
Output 20.1.2: Zero-Inflated Poisson Model Execution on One Node and Two Threads
Parameter Estimates

Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 2.0005 | 0.000492 | 4069.80 | <.0001 |
x1 | 1 | 0.2995 | 0.000352 | 850.17 | <.0001 |
x2 | 1 | 0.3998 | 0.000353 | 1132.23 | <.0001 |
x3 | 1 | 0.2008 | 0.000352 | 570.27 | <.0001 |
x4 | 1 | 0.3994 | 0.000353 | 1132.85 | <.0001 |
x5 | 1 | -0.2995 | 0.000353 | -848.95 | <.0001 |
x6 | 1 | -0.5000 | 0.000353 | -1414.9 | <.0001 |
x7 | 1 | -0.3002 | 0.000352 | -852.14 | <.0001 |
Inf_Intercept | 1 | -0.9993 | 0.002521 | -396.45 | <.0001 |
Inf_z1 | 1 | -0.6024 | 0.002585 | -233.02 | <.0001 |
Inf_z2 | 1 | 0.2976 | 0.002454 | 121.25 | <.0001 |
Inf_z3 | 1 | 0.1974 | 0.002430 | 81.20 | <.0001 |
In the following statements, the PERFORMANCE statement is modified to use a grid with 10 nodes, with each node capable of spawning eight threads:
    proc hpcountreg data=simulate dist=zip;
       performance nthreads=8 nodes=10 details
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
       model y_p=x1-x7;
       zeromodel y_p ~ z1-z3;
    run;
Because the two models being estimated are identical, it is reasonable to expect that Output 20.1.2 and Output 20.1.3 would show the same results. However, you can see a significant difference in performance between the two runs. The second run, which used a grid of 10 nodes with eight threads each, took only 3.54 seconds to optimize instead of 40.77 seconds, a speedup of roughly 11.5 times.
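The two reported timings can be turned into a speedup factor and a rough parallel efficiency. The following DATA step is only an illustrative sketch: the variable names are made up for this calculation, and "efficiency" here simply means the speedup divided by the ratio of total thread counts.

```sas
data _null_;
   t_small = 40.77;                      /* seconds: 1 node x 2 threads   */
   t_grid  = 3.54;                       /* seconds: 10 nodes x 8 threads */
   speedup      = t_small / t_grid;      /* observed speedup              */
   thread_ratio = (10 * 8) / (1 * 2);    /* 40 times as many threads      */
   efficiency   = speedup / thread_ratio;
   put speedup= 6.2 thread_ratio= efficiency= 6.3;
run;
```

The speedup is about 11.5 with 40 times as many threads, so the scaling in this example is clearly sublinear even though the absolute time saving is large.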
In certain circumstances, you might observe slight numerical differences in the results, depending on the number of nodes and threads involved. This happens because the order in which partial results are accumulated can make a difference in the final result, owing to the limits of numerical precision and the propagation of error in numerical computations.
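The effect of accumulation order can be seen in a tiny, self-contained DATA step. This sketch is unrelated to the HPCOUNTREG internals and assumes a platform on which SAS stores numeric values as IEEE 754 double-precision numbers; the values are chosen only to make the rounding visible.

```sas
data _null_;
   a = 1e16;
   b = -1e16;
   c = 1;
   left  = (a + b) + c;   /* a and b cancel first, then c is added      */
   right = a + (b + c);   /* c is lost when added to the much larger b  */
   put left= right=;
run;
```

On such a platform the two groupings of the same three numbers give different sums (1 versus 0). Distributed runs that combine partial sums in different orders can differ for the same reason, although usually only in the last few digits.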
Output 20.1.3: Zero-Inflated Poisson Model Execution on 10 Nodes with Eight Threads Each
Parameter Estimates

Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 2.0005 | 0.000492 | 4069.80 | <.0001 |
x1 | 1 | 0.2995 | 0.000352 | 850.17 | <.0001 |
x2 | 1 | 0.3998 | 0.000353 | 1132.23 | <.0001 |
x3 | 1 | 0.2008 | 0.000352 | 570.27 | <.0001 |
x4 | 1 | 0.3994 | 0.000353 | 1132.85 | <.0001 |
x5 | 1 | -0.2995 | 0.000353 | -848.95 | <.0001 |
x6 | 1 | -0.5000 | 0.000353 | -1414.9 | <.0001 |
x7 | 1 | -0.3002 | 0.000352 | -852.14 | <.0001 |
Inf_Intercept | 1 | -0.9993 | 0.002521 | -396.45 | <.0001 |
Inf_z1 | 1 | -0.6024 | 0.002585 | -233.02 | <.0001 |
Inf_z2 | 1 | 0.2976 | 0.002454 | 121.25 | <.0001 |
Inf_z3 | 1 | 0.1974 | 0.002430 | 81.20 | <.0001 |
As this example suggests, increasing the number of nodes and the number of threads per node improves performance significantly. When you use the parallelism afforded by a high-performance distributed environment, you can see an even more dramatic reduction in the time required for the optimization as the number of observations in the data set increases. When the data set is extremely large, the computations might not even be possible on a desktop computer, given its typical memory and computational constraints. Under such circumstances, the high-performance distributed environment becomes a necessity.