The HPCOUNTREG Procedure

Example 20.1 High-Performance Zero-Inflated Poisson Model

This example shows the use of the HPCOUNTREG procedure with an emphasis on large data set processing and the performance improvements that are achieved by executing in the high-performance distributed environment.

The following DATA step generates one million replicates from the zero-inflated Poisson (ZIP) model. The model contains seven regressors (x1-x7) for the count process and three regressors (z1-z3) for the zero-inflation process.

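    data simulate;
       call streaminit(12345);              /* fix the seed so the data are reproducible */
       array vars x1-x7;                    /* regressors of the count process */
       array zero_vars z1-z3;               /* regressors of the zero-inflation process */

       /* true coefficients used to generate the data */
       array parms{7}  (.3 .4 .2 .4 -.3 -.5 -.3);
       array zero_parms{3} (-.6 .3 .2);

       intercept=2;
       z_intercept=-1;
       theta=0.5;

       do i=1 to 1000000;
          sum_xb=0;
          sum_gz=0;
          /* Poisson part: the mean is exp(intercept + x'beta) */
          do j=1 to 7;
             vars[j]=rand('NORMAL',0,1);
             sum_xb=sum_xb+parms[j]*vars[j];
          end;
          mu=exp(intercept+sum_xb);
          y_p=rand('POISSON', mu);

          /* zero-inflation part: logistic probability of a structural zero */
          do j=1 to 3;
             zero_vars[j]=rand('NORMAL',0,1);
             sum_gz = sum_gz+zero_parms[j]*zero_vars[j];
          end;
          z_gamma = z_intercept+sum_gz;
          pzero = cdf('LOGISTIC',z_gamma);
          cut=rand('UNIFORM');
          /* with probability pzero, overwrite the Poisson draw with zero */
          if cut<pzero then y_p=0;
          output;
       end;
       keep y_p x1-x7 z1-z3;
    run;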
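
For reference, the DATA step above can be read as drawing from the following ZIP model. The symbols $\mu_i$, $\varphi_i$, and $\Lambda$ are notation introduced here for illustration and do not appear in the code: $\mu_i$ corresponds to `mu`, $\varphi_i$ to `pzero`, and $\Lambda$ is the logistic distribution function that `cdf('LOGISTIC', ...)` evaluates. The regressors are independent standard normal draws.

```latex
\begin{aligned}
\mu_i &= \exp\bigl(2 + 0.3x_{i1} + 0.4x_{i2} + 0.2x_{i3} + 0.4x_{i4}
                     - 0.3x_{i5} - 0.5x_{i6} - 0.3x_{i7}\bigr) \\
\varphi_i &= \Lambda\bigl(-1 - 0.6z_{i1} + 0.3z_{i2} + 0.2z_{i3}\bigr),
  \qquad \Lambda(t) = \frac{1}{1 + e^{-t}} \\
\Pr(y_i = 0) &= \varphi_i + (1 - \varphi_i)\,e^{-\mu_i} \\
\Pr(y_i = k) &= (1 - \varphi_i)\,\frac{e^{-\mu_i}\mu_i^{k}}{k!},
  \qquad k = 1, 2, \ldots
\end{aligned}
```

The coefficients in these expressions are the values against which the estimates for Intercept, x1-x7, Inf_Intercept, and Inf_z1-Inf_z3 can be compared.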

The following statements estimate a zero-inflated Poisson model.

    option set=GRIDHOST="&GRIDHOST";
    option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";

    proc hpcountreg data=simulate dist=zip;
       performance nthreads=2 nodes=1 details
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
       model y_p=x1-x7;
       zeromodel y_p ~ z1-z3;
    run;

The model is executed in the distributed computing environment on one node with two threads. These settings approximate running the HPCOUNTREG procedure on a desktop workstation with a dual-core CPU. To run these statements successfully, you must either set the macro variables GRIDHOST and GRIDINSTALLLOC to appropriate values or replace the macro variable references in the example with those values. Output 20.1.1 shows the "Performance Information" table for this hypothetical scenario.

Output 20.1.1: Performance Information with One Node and Two Threads

Performance Information
Host Node                     << your grid host >>
Install Location              << your grid install location >>
Execution Mode                Distributed
Number of Compute Nodes       1
Number of Threads per Node    2
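
One way to supply the two macro variables that the example references is to define them with %LET statements before the OPTION statements run. This is only a sketch: the host name and installation path shown here are hypothetical placeholders, not real values, and must be replaced with the settings of your own grid.

```sas
/* Hypothetical placeholder values: replace with your site's settings */
%let GRIDHOST=mygrid.example.com;
%let GRIDINSTALLLOC=/opt/TKGrid;

/* Write the resolved values to the log as a quick check */
%put NOTE: GRIDHOST=&GRIDHOST GRIDINSTALLLOC=&GRIDINSTALLLOC;
```

Defining the values once as macro variables keeps the PROC HPCOUNTREG steps unchanged when you move between grids.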



Output 20.1.2 shows the results for the zero-inflated Poisson model. The "Model Fit Summary" table shows detailed information about the model and indicates that all one million observations were used to fit the model. All parameter estimates in the "Parameter Estimates" table are highly significant and are close to the true values that were set in the data-generating process. The optimization of the model, which uses one million observations, took 40.77 seconds.

Output 20.1.2: Zero-Inflated Poisson Model Execution on One Node and Two Threads

Model Fit Summary
Dependent Variable           y_p
Number of Observations       1000000
Data Set                     WORK.SIMULATE
Model                        ZIP
ZI Link Function             Logistic
Log Likelihood               -2215238
Maximum Absolute Gradient    2.044E-8
Number of Iterations         7
Optimization Method          Newton-Raphson
AIC                          4430500
SBC                          4430642

Convergence criterion (FCONV=2.220446E-16) satisfied.

Parameter Estimates
Parameter        DF    Estimate    Standard Error     t Value    Pr > |t|
Intercept         1      2.0005          0.000492     4069.80      <.0001
x1                1      0.2995          0.000352      850.17      <.0001
x2                1      0.3998          0.000353     1132.23      <.0001
x3                1      0.2008          0.000352      570.27      <.0001
x4                1      0.3994          0.000353     1132.85      <.0001
x5                1     -0.2995          0.000353     -848.95      <.0001
x6                1     -0.5000          0.000353     -1414.9      <.0001
x7                1     -0.3002          0.000352     -852.14      <.0001
Inf_Intercept     1     -0.9993          0.002521     -396.45      <.0001
Inf_z1            1     -0.6024          0.002585     -233.02      <.0001
Inf_z2            1      0.2976          0.002454      121.25      <.0001
Inf_z3            1      0.1974          0.002430       81.20      <.0001

Procedure Task Timing
Task                           Seconds    Percent
Reading and Levelizing Data       0.32      0.78%
Communication to Client           0.03      0.06%
Optimization                     40.77     98.52%
Post-Optimization                 0.26      0.63%



In the following statements, the PERFORMANCE statement is modified to use a grid with 10 nodes, with each node capable of spawning eight threads:


    proc hpcountreg data=simulate dist=zip;
       performance nthreads=8 nodes=10 details
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
       model y_p=x1-x7;
       zeromodel y_p ~ z1-z3;
    run;

Because the two PROC HPCOUNTREG steps estimate the same model on the same data, it is reasonable to expect that Output 20.1.2 and Output 20.1.3 would show the same results. However, you can see a significant difference in performance between the two runs. The second run, which used a grid of 10 nodes with eight threads each, took only 3.54 seconds to optimize instead of 40.77 seconds.
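
The reported optimization times give a rough sense of how well the computation scaled. The following back-of-the-envelope calculation uses only the optimization timings quoted above. The symbols $S$ (speedup) and $E$ (parallel efficiency) are notation introduced here, and the efficiency figure assumes that all 80 requested threads took part in the optimization.

```latex
S = \frac{40.77\ \text{s}}{3.54\ \text{s}} \approx 11.5,
\qquad
E = \frac{S}{(10 \times 8)/(1 \times 2)} = \frac{11.5}{40} \approx 0.29
```

Under that assumption, 40 times as many threads produced roughly an 11.5-fold reduction in optimization time: a substantial gain, but well short of linear scaling for a problem of this size.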

In certain circumstances, you might observe slight numerical differences in the results, depending on the number of nodes and threads involved. This happens because the order in which partial results are accumulated can make a difference in the final result, owing to the limits of numerical precision and the propagation of error in numerical computations.
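
The effect of accumulation order can be seen on a very small scale with ordinary floating-point addition. The DATA step below is a minimal sketch, unrelated to the internals of PROC HPCOUNTREG, that adds the same three constants in two different orders. On platforms that use IEEE 754 double precision, the two sums are expected to differ in the last bit. In this example, the slightly different "Maximum Absolute Gradient" values in Output 20.1.2 (2.044E-8) and Output 20.1.3 (2.0608E-8) are consistent with this kind of effect.

```sas
data _null_;
   /* The same three terms, accumulated in two different orders */
   a = (0.1 + 0.2) + 0.3;
   b = 0.1 + (0.2 + 0.3);
   diff = a - b;
   /* HEX16. displays the full floating-point representation */
   put a= hex16. / b= hex16. / diff=;
run;
```

When partial sums are computed on different nodes and threads and then combined, the grouping of terms changes in the same way, which is why results can differ in the trailing digits.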

Output 20.1.3: Zero-Inflated Poisson Model Execution on 10 Nodes with Eight Threads Each

The HPCOUNTREG Procedure

Model Fit Summary
Dependent Variable           y_p
Number of Observations       1000000
Data Set                     WORK.SIMULATE
Model                        ZIP
ZI Link Function             Logistic
Log Likelihood               -2215238
Maximum Absolute Gradient    2.0608E-8
Number of Iterations         7
Optimization Method          Newton-Raphson
AIC                          4430500
SBC                          4430642

Convergence criterion (FCONV=2.220446E-16) satisfied.

Parameter Estimates
Parameter        DF    Estimate    Standard Error     t Value    Pr > |t|
Intercept         1      2.0005          0.000492     4069.80      <.0001
x1                1      0.2995          0.000352      850.17      <.0001
x2                1      0.3998          0.000353     1132.23      <.0001
x3                1      0.2008          0.000352      570.27      <.0001
x4                1      0.3994          0.000353     1132.85      <.0001
x5                1     -0.2995          0.000353     -848.95      <.0001
x6                1     -0.5000          0.000353     -1414.9      <.0001
x7                1     -0.3002          0.000352     -852.14      <.0001
Inf_Intercept     1     -0.9993          0.002521     -396.45      <.0001
Inf_z1            1     -0.6024          0.002585     -233.02      <.0001
Inf_z2            1      0.2976          0.002454      121.25      <.0001
Inf_z3            1      0.1974          0.002430       81.20      <.0001

Procedure Task Timing
Task                           Seconds    Percent
Reading and Levelizing Data       0.02      0.61%
Communication to Client           0.06      1.44%
Optimization                      3.54     90.99%
Post-Optimization                 0.27      6.96%



As this example suggests, increasing the number of nodes and the number of threads per node can improve performance significantly. When you use the parallelism afforded by a high-performance distributed environment, you can see an even more dramatic reduction in the time required for the optimization as the number of observations in the data set increases. When the data set is extremely large, the computations might not even be possible in some cases, given the typical memory resources and computational constraints of a desktop computer. Under such circumstances, the high-performance distributed environment becomes a necessity.