The HPPANEL Procedure

Example 7.1 One-Way Random-Effects High-Performance Model

This example fits the one-way random-effects model that is available in the HPPANEL procedure. It emphasizes processing a large data set and the performance improvements that are achieved by executing in a high-performance distributed environment.

The following DATA step generates five million observations from one-way panel data that includes 50,000 cross sections and 100 time periods:

  
data hppan_ex01 (keep = cs ts y x1-x10);
   retain seed 55371;
   array x[10];
   label y  = 'Dependent Variable';
   do cs = 1 to 50000;
      /*- draw the cross-section random effect once per cs */
      dummy = 10 * rannor(seed);
      do ts = 1 to 100;
         /*- generate regressors and compute the structural */
         /*- part of the dependent variable                 */
         y = 5; 
         do k = 1 to 10;
            x[k] = -1 + 2 * ranuni(seed);
            y = y + x[k] * k;
         end;
  
         /*- add an error term, such that e ~ N(0,100)   -------*/
         y = y + 10 * rannor(seed);
         /*- add a random effect, such that v ~ N(0,100) -------*/
         y = y + dummy;
         output;
      end;
   end;
run;
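
Written out, the DATA step above appears to draw from the following model. The symbols v_i (the variable dummy in the code), e_it, and x_k,it are notation introduced here for illustration and do not appear in the original example.

```latex
\begin{aligned}
y_{it} &= 5 + \sum_{k=1}^{10} k\, x_{k,it} + v_i + e_{it},
  \qquad i = 1,\dots,50000,\quad t = 1,\dots,100,\\
x_{k,it} &\sim \mathrm{Uniform}(-1,\,1), \qquad
v_i \sim N(0,\,100), \qquad
e_{it} \sim N(0,\,100).
\end{aligned}
```

Under this reading, the true intercept is 5, the true coefficient on xk is k, and both variance components equal 100. These are the values against which the estimates reported later in the example can be compared.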

The estimation is executed in distributed mode on a grid with ten nodes, with one thread per node. To run the following statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to the macro variables in the example with the appropriate values.

%let GRIDHOST       = <<your grid host>>;
%let GRIDINSTALLLOC = <<your grid install location>>;

option set = GRIDHOST       = "&GRIDHOST";
option set = GRIDINSTALLLOC = "&GRIDINSTALLLOC";

proc hppanel data=hppan_ex01;
   id  cs ts;
   model y = x1-x10 / ranone;
   performance nodes = 10 threads = 1 details
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;

In Output 7.1.1, the "Performance Information" table shows that the model was estimated in distributed mode, with ten nodes and one thread per node, on the grid that the GRIDHOST macro variable identifies. The grid installation location is the value of the GRIDINSTALLLOC macro variable.

Output 7.1.1: Grid Information with Ten Nodes and One Thread per Node

Performance Information
Host Node                     << your grid host >>
Install Location              << your grid install location >>
Execution Mode                Distributed
Number of Compute Nodes       10
Number of Threads per Node    1
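
If no grid is available, the same model can presumably be fit in single-machine mode. The following is a minimal sketch, assuming that NODES=0 in the PERFORMANCE statement requests single-machine execution. The thread count of four is an illustrative assumption, not part of the original example.

```sas
/* Sketch: same random-effects model, run without a grid.  */
/* nodes=0 is assumed to request single-machine mode;      */
/* threads=4 is an arbitrary illustrative value.           */
proc hppanel data=hppan_ex01;
   id  cs ts;
   model y = x1-x10 / ranone;
   performance nodes = 0 threads = 4 details;
run;
```

Timings from such a run are not comparable to the timings reported in this example, all of which come from the ten-node distributed run.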



Output 7.1.2 shows the results for the one-way random-effects model. The "Model Information" table identifies the data source, the response variable, and the model type. The "Number of Observations" table indicates that all five million observations were used to fit the model. All parameter estimates in the "Parameter Estimates" table are highly significant and close to the true values that were set in the data generating process: an intercept of 5 and a coefficient of k on each regressor xk. Both variance component estimates are likewise close to their true value of 100. In the "Procedure Task Timing" table, you can see that for five million observations, computing the moments took 101.53 seconds (99.33% of the measured time), whereas cross-product accumulation took only 0.38 seconds.

Output 7.1.2: One-Way Random-Effects Model

Model Information
Data Source           HPPAN_EX01
Response Variable     y
Model                 RANONE
Variance Component    WANSBEEK
Execution Mode        Distributed

Number of Observations
Number of Observations Read    5000000
Number of Observations Used    5000000
Number of Cross Sections         50000
Number of Time Series              100

Fit Statistics
Sum of Squared Error       4.9976E8
Degrees of Freedom          4999989
Mean Squared Error           99.952
Root Mean Squared Error      9.9976
R-Square                   0.559771

Variance Component Estimates
Variance Component for Cross Sections    99.9117
Variance Component for Error             99.9520

Hausman Test for Random Effects
Coefficients    DF    m Value    Pr > m
          10    10      14.04    0.1713

Parameter Estimates
Parameter    DF    Estimate    Standard Error    t Value    Pr > |t|
Intercept     1     4.96955           0.04492     110.62      <.0001
x1            1     1.00902           0.00778     129.69      <.0001
x2            1     1.99743           0.00778     256.66      <.0001
x3            1     3.00116           0.00778     385.64      <.0001
x4            1     3.99847           0.00778     513.68      <.0001
x5            1     4.99497           0.00778     641.81      <.0001
x6            1     6.01034           0.00778     772.12      <.0001
x7            1     6.99770           0.00778     899.39      <.0001
x8            1     7.98897           0.00778    1026.61      <.0001
x9            1     9.00692           0.00778    1157.12      <.0001
x10           1    10.00563           0.00778    1285.47      <.0001

Procedure Task Timing
Task                                   Seconds    Percent
Data Read and Variable Levelization       0.30      0.29%
Communication to Client                   0.00      0.00%
Computing Moments                       101.53     99.33%
Cross-Product Accumulation                0.38      0.37%
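
The p-value in the "Hausman Test for Random Effects" table can be cross-checked by hand, on the assumption that the m statistic is referred to a chi-square distribution with the reported degrees of freedom. The sketch below uses the DATA step function PROBCHI; the variable names are illustrative.

```sas
/* Sketch: upper-tail chi-square probability for the Hausman */
/* statistic reported in Output 7.1.2 (m = 14.04, DF = 10).  */
data _null_;
   m  = 14.04;
   df = 10;
   p  = 1 - probchi(m, df);   /* Pr > m under chi-square(df) */
   put 'Pr > m = ' p 6.4;
run;
```

A p-value of this size does not reject the random-effects specification at conventional levels. That is the expected outcome here, because the DATA step generates the random effect independently of the regressors.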



For comparison, you now fit a pooled regression model to the same data, again using a grid of ten nodes with one thread each. The following SAS statements perform the estimation on the grid:

proc hppanel data=hppan_ex01;
   id  cs ts;
   model y = x1-x10 / pooled;
   performance nodes = 10 threads = 1 details
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;

Based on Output 7.1.3, you can see that the parameter estimates are similar to those from the random-effects estimator. The timings are also similar (103.68 seconds for computing moments, compared with 101.53 seconds for the random-effects model), which indicates that the bulk of the computational effort is due to tasks common to both random-effects estimation and standard OLS regression. In both cases, estimation is dominated by the calculation of sums of squares and other moment terms over the whole data set.

Output 7.1.3: Pooled Regression Model

The HPPANEL Procedure

Model Information
Data Source          HPPAN_EX01
Response Variable    y
Model                POOLED
Execution Mode       Distributed

Parameter Estimates
Parameter    DF    Estimate    Standard Error    t Value    Pr > |t|
Intercept     1     4.96957           0.00632     786.03      <.0001
x1            1     1.01251           0.01095      92.49      <.0001
x2            1     1.98374           0.01095     181.17      <.0001
x3            1     3.00294           0.01095     274.23      <.0001
x4            1     3.99649           0.01095     364.90      <.0001
x5            1     5.00187           0.01095     456.77      <.0001
x6            1     5.99952           0.01095     547.77      <.0001
x7            1     7.00478           0.01095     639.88      <.0001
x8            1     7.97232           0.01095     728.13      <.0001
x9            1     9.01244           0.01095     822.90      <.0001
x10           1    10.01578           0.01095     914.52      <.0001

Procedure Task Timing
Task                                   Seconds    Percent
Data Read and Variable Levelization       0.28      0.27%
Communication to Client                   0.00      0.00%
Computing Moments                       103.68     99.59%
Cross-Product Accumulation                0.15      0.14%
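
The standard errors in the two "Parameter Estimates" tables can be roughly reproduced from the simulation design. The following back-of-the-envelope sketch rests on several assumptions that are not stated in the original example: the regressors are independent Uniform(-1, 1) draws with variance 1/3, both variance components take their true value of 100, the cross-product matrix is treated as diagonal, and the random-effects estimator is viewed as OLS on quasi-demeaned data. The symbol theta and the other notation are introduced here for illustration; N = 50000 and T = 100.

```latex
\begin{aligned}
\theta &= 1 - \sqrt{\frac{\sigma_e^2}{T\sigma_v^2 + \sigma_e^2}}
        = 1 - \sqrt{\frac{100}{10100}} \approx 0.9005,\\[4pt]
\text{random effects, slope:}\quad
\mathrm{se}(\hat\beta_k) &\approx
  \sqrt{\frac{\sigma_e^2}{NT\,\sigma_x^2\,\bigl(1 - \tfrac{1-(1-\theta)^2}{T}\bigr)}}
  = \sqrt{\frac{100}{5\times10^{6}\times 0.3300}} \approx 0.0078,\\[4pt]
\text{random effects, intercept:}\quad
\mathrm{se}(\hat\beta_0) &\approx
  \sqrt{\frac{T\sigma_v^2 + \sigma_e^2}{NT}}
  = \sqrt{\frac{10100}{5\times10^{6}}} \approx 0.0449,\\[4pt]
\text{pooled, slope:}\quad
\mathrm{se}(\hat\beta_k) &\approx
  \sqrt{\frac{\sigma_v^2 + \sigma_e^2}{NT\,\sigma_x^2}}
  = \sqrt{\frac{200}{5\times10^{6}/3}} \approx 0.0110,\\[4pt]
\text{pooled, intercept:}\quad
\mathrm{se}(\hat\beta_0) &\approx
  \sqrt{\frac{\sigma_v^2 + \sigma_e^2}{NT}}
  = \sqrt{\frac{200}{5\times10^{6}}} \approx 0.0063.
\end{aligned}
```

These approximations land close to the reported values of 0.00778 and 0.04492 for the random-effects model and 0.01095 and 0.00632 for the pooled model. The gap between the two intercept standard errors suggests that the pooled standard errors, which ignore the correlation among observations within a cross section, should be read with caution for data of this kind.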