This example shows the use of the HPCOUNTREG procedure with an emphasis on large data set processing and the performance improvements that are achieved by executing in the high-performance distributed environment.
The following DATA step generates one million observations from a zero-inflated Poisson (ZIP) model. The count process has seven regressors (x1-x7), and the zero-inflation process has three regressors (z1-z3).
data simulate;
   call streaminit(12345);
   array vars x1-x7;
   array zero_vars z1-z3;
   array parms{7} (.3 .4 .2 .4 -.3 -.5 -.3);
   array zero_parms{3} (-.6 .3 .2);
   intercept=2;
   z_intercept=-1;
   theta=0.5;
   do i=1 to 1000000;
      sum_xb=0;
      sum_gz=0;
      /* Poisson count process */
      do j=1 to 7;
         vars[j]=rand('NORMAL',0,1);
         sum_xb=sum_xb+parms[j]*vars[j];
      end;
      mu=exp(intercept+sum_xb);
      y_p=rand('POISSON', mu);
      /* Zero-inflation process */
      do j=1 to 3;
         zero_vars[j]=rand('NORMAL',0,1);
         sum_gz=sum_gz+zero_parms[j]*zero_vars[j];
      end;
      z_gamma=z_intercept+sum_gz;
      pzero=cdf('LOGISTIC',z_gamma);
      cut=rand('UNIFORM');
      if cut<pzero then y_p=0;
      output;
   end;
   keep y_p x1-x7 z1-z3;
run;
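Before fitting the model, it can be useful to confirm that the simulated response contains the expected excess of zeros. The following statements are a minimal sketch of such a check and are not part of the original example; the data set name CHECK and the variable is_zero are introduced here only for illustration.

```sas
/* Sketch: flag zero responses so that the mean of is_zero */
/* equals the proportion of zeros in WORK.SIMULATE.        */
data check;
   set simulate;
   is_zero = (y_p = 0);   /* 1 if the count is zero, 0 otherwise */
run;

proc means data=check n mean min max;
   var y_p is_zero;
run;
```

The mean of is_zero combines zeros that come from the inflation process with zeros that the Poisson process produces on its own, so it should be somewhat larger than the average inflation probability.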
The following statements estimate a zero-inflated Poisson model. The model is executed in the distributed computing environment on two threads and only one node. These settings are used to obtain a hypothetical environment that might resemble running the HPCOUNTREG procedure on a desktop workstation with a dual-core CPU. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to the macro variables in the example with the appropriate values.
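One way to define the two macro variables is with %LET statements placed before the OPTION statements. The values shown here are placeholders assumed for illustration, not real locations; substitute the host name and installation path that apply at your site.

```sas
/* Hypothetical values: replace both with your site's settings */
%let GRIDHOST=my.grid.host.example.com;   /* placeholder grid host name    */
%let GRIDINSTALLLOC=/opt/TKGrid;          /* placeholder installation path */
```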
option set=GRIDHOST="&GRIDHOST";
option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";

proc hpcountreg data=simulate dist=zip;
   performance nthreads=2 nodes=1 details;
   model y_p=x1-x7;
   zeromodel y_p ~ z1-z3;
run;
Output 3.1.1 shows the results for the zero-inflated Poisson model. The “Performance Information” table shows that the model was estimated in a distributed environment on the grid named by the GRIDHOST macro variable, using one node and two threads. The “Model Fit Summary” table shows detailed information about the model and indicates that all one million observations were used to fit it. All parameter estimates in the “Parameter Estimates” table are highly significant and are close to the true values that were set in the data-generating process. According to the “Procedure Task Timing” table, the optimization took 46.47 seconds.
Output 3.1.1: Zero-Inflated Poisson Model Execution on One Node and Two Threads
Performance Information | |
---|---|
Host Node | << your grid host >> |
Execution Mode | Distributed |
Grid Mode | Symmetric |
Number of Compute Nodes | 1 |
Number of Threads per Node | 2 |
Model Fit Summary | |
---|---|
Dependent Variable | y_p |
Number of Observations | 1000000 |
Data Set | WORK.SIMULATE |
Model | ZIP |
ZI Link Function | Logistic |
Log Likelihood | -2215238 |
Maximum Absolute Gradient | 2.0586E-8 |
Number of Iterations | 7 |
Optimization Method | Newton-Raphson |
AIC | 4430500 |
SBC | 4430642 |
Convergence criterion (FCONV=2.220446E-16) satisfied. |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
Intercept | 1 | 2.0005 | 0.000492 | 4069.80 | <.0001 |
x1 | 1 | 0.2995 | 0.000352 | 850.17 | <.0001 |
x2 | 1 | 0.3998 | 0.000353 | 1132.23 | <.0001 |
x3 | 1 | 0.2008 | 0.000352 | 570.27 | <.0001 |
x4 | 1 | 0.3994 | 0.000353 | 1132.85 | <.0001 |
x5 | 1 | -0.2995 | 0.000353 | -848.95 | <.0001 |
x6 | 1 | -0.5000 | 0.000353 | -1414.9 | <.0001 |
x7 | 1 | -0.3002 | 0.000352 | -852.14 | <.0001 |
Inf_Intercept | 1 | -0.9993 | 0.002521 | -396.45 | <.0001 |
Inf_z1 | 1 | -0.6024 | 0.002585 | -233.02 | <.0001 |
Inf_z2 | 1 | 0.2976 | 0.002454 | 121.25 | <.0001 |
Inf_z3 | 1 | 0.1974 | 0.002430 | 81.20 | <.0001 |
Procedure Task Timing | ||
---|---|---|
Task | Seconds | Percent |
Reading and Levelizing Data | 0.43 | 0.91% |
Communication to Client | 0.00 | 0.01% |
Optimization | 46.47 | 98.91% |
Post-Optimization | 0.08 | 0.17% |
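To make the comparison between the estimates and the true simulation values concrete, the following sketch expresses each deviation in standard-error units. It is an illustration added here, not part of the original example: the data set name COMPARE and the variable names are assumptions, the true values are taken from the DATA step that generates the data, and the estimates and standard errors are copied from the rounded values in Output 3.1.1.

```sas
/* Sketch: (estimate - truth) / standard error for each parameter */
data compare;
   input parameter :$13. truth estimate stderr;
   z = (estimate - truth) / stderr;   /* deviation in standard-error units */
   datalines;
Intercept      2.0   2.0005  0.000492
x1             0.3   0.2995  0.000352
x2             0.4   0.3998  0.000353
x3             0.2   0.2008  0.000352
x4             0.4   0.3994  0.000353
x5            -0.3  -0.2995  0.000353
x6            -0.5  -0.5000  0.000353
x7            -0.3  -0.3002  0.000352
Inf_Intercept -1.0  -0.9993  0.002521
Inf_z1        -0.6  -0.6024  0.002585
Inf_z2         0.3   0.2976  0.002454
Inf_z3         0.2   0.1974  0.002430
;
run;

proc print data=compare noobs;
   format z 8.2;
run;
```

Because the table values are rounded, the resulting z values are approximate. For x1, for example, the deviation is about -1.4 standard errors.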
In the following statements, the PERFORMANCE statement is modified to use a grid of 10 nodes with eight threads on each node:
proc hpcountreg data=simulate dist=zip;
   performance nthreads=8 nodes=10 details;
   model y_p=x1-x7;
   zeromodel y_p ~ z1-z3;
run;
Because the two models being estimated are identical, you would expect Output 3.1.1 and Output 3.1.2 to show the same results, and the parameter estimates do agree. The two runs differ markedly in performance, however. The second run, on a grid of 10 nodes with eight threads each, took only 2.49 seconds to optimize instead of 46.47 seconds, which is roughly 19 times faster.
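The arithmetic behind this comparison can be sketched in a short DATA step. The step is an illustration added here; the variable names are assumptions, and the timings are the optimization times reported in the two “Procedure Task Timing” tables.

```sas
/* Sketch: speedup and parallel efficiency from the reported timings */
data _null_;
   t_small = 46.47;                     /* seconds, 1 node x 2 threads   */
   t_large = 2.49;                      /* seconds, 10 nodes x 8 threads */
   speedup      = t_small / t_large;    /* how many times faster         */
   thread_ratio = (10*8) / (1*2);       /* 80 threads versus 2 threads   */
   efficiency   = speedup / thread_ratio;
   put speedup= 6.2 thread_ratio= efficiency= percent8.1;
run;
```

The speedup is about 18.7 on 40 times as many threads, a parallel efficiency of roughly 47 percent. Speedup that grows more slowly than the thread count is common, because communication and other fixed costs do not shrink as threads are added.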
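In certain circumstances, you might observe slight numerical differences in the results, depending on the number of nodes and threads involved. This happens because floating-point addition is not associative: the order in which partial results are accumulated can change the final result, owing to the limits of numerical precision and the propagation of error in numerical computations. In this example, the maximum absolute gradient is 2.0586E-8 in Output 3.1.1 and 2.0608E-8 in Output 3.1.2, whereas the displayed parameter estimates are identical.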
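The effect of accumulation order can be reproduced on a single machine. The following DATA step is a minimal sketch, not taken from the procedure itself: it adds the same one million terms in ascending and descending order, which loosely mimics partial sums being combined in a different order on different nodes. The variable names are assumptions made for illustration.

```sas
/* Sketch: the same terms summed in two different orders */
data _null_;
   fwd = 0;
   rev = 0;
   do i = 1 to 1000000;              /* largest terms first  */
      fwd = fwd + 1/i;
   end;
   do i = 1000000 to 1 by -1;        /* smallest terms first */
      rev = rev + 1/i;
   end;
   diff = fwd - rev;                 /* typically tiny but nonzero */
   put fwd= best32. rev= best32. diff= e12.;
run;
```

On typical double-precision hardware, the two sums agree to many digits but are usually not bit-for-bit identical, which is the same kind of discrepancy that can appear between runs with different node and thread counts.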
Output 3.1.2: Zero-Inflated Poisson Model Execution on 10 Nodes with Eight Threads Each
Performance Information | |
---|---|
Host Node | << your grid host >> |
Execution Mode | Distributed |
Grid Mode | Symmetric |
Number of Compute Nodes | 10 |
Number of Threads per Node | 8 |
Model Fit Summary | |
---|---|
Dependent Variable | y_p |
Number of Observations | 1000000 |
Data Set | WORK.SIMULATE |
Model | ZIP |
ZI Link Function | Logistic |
Log Likelihood | -2215238 |
Maximum Absolute Gradient | 2.0608E-8 |
Number of Iterations | 7 |
Optimization Method | Newton-Raphson |
AIC | 4430500 |
SBC | 4430642 |
Convergence criterion (FCONV=2.220446E-16) satisfied. |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
Intercept | 1 | 2.0005 | 0.000492 | 4069.80 | <.0001 |
x1 | 1 | 0.2995 | 0.000352 | 850.17 | <.0001 |
x2 | 1 | 0.3998 | 0.000353 | 1132.23 | <.0001 |
x3 | 1 | 0.2008 | 0.000352 | 570.27 | <.0001 |
x4 | 1 | 0.3994 | 0.000353 | 1132.85 | <.0001 |
x5 | 1 | -0.2995 | 0.000353 | -848.95 | <.0001 |
x6 | 1 | -0.5000 | 0.000353 | -1414.9 | <.0001 |
x7 | 1 | -0.3002 | 0.000352 | -852.14 | <.0001 |
Inf_Intercept | 1 | -0.9993 | 0.002521 | -396.45 | <.0001 |
Inf_z1 | 1 | -0.6024 | 0.002585 | -233.02 | <.0001 |
Inf_z2 | 1 | 0.2976 | 0.002454 | 121.25 | <.0001 |
Inf_z3 | 1 | 0.1974 | 0.002430 | 81.20 | <.0001 |
Procedure Task Timing | ||
---|---|---|
Task | Seconds | Percent |
Reading and Levelizing Data | 0.04 | 1.58% |
Communication to Client | 0.07 | 2.67% |
Optimization | 2.49 | 90.09% |
Post-Optimization | 0.16 | 5.66% |
As this example suggests, increasing the number of nodes and the number of threads per node can improve performance significantly. When you use the parallelism afforded by a high-performance distributed environment, you can see an even more dramatic reduction in optimization time as the number of observations in the data set increases. When the data set is extremely large, the computations might not be possible at all within the memory and computational limits of a typical desktop computer. Under such circumstances, the high-performance distributed environment becomes a necessity.