This example illustrates the one-way random effects model that is available in the HPPANEL procedure, with an emphasis on processing a large data set and on the performance improvements that are achieved by executing in a high-performance distributed environment.
The following DATA step generates a one-way panel data set of 5 million observations (50,000 cross sections, each observed over 100 time periods):
    data hppan_ex01 (keep = cs ts y x1-x10);
       retain seed1 55371 seed2 97335 seed3 19412;
       array x[10];
       label y  = 'dependent var.';
       label x1 = 'first independent var.';
       label x2 = 'second independent var.';
       label x3 = 'third independent var.';
       int = 1;
       do cs = 1 to 50000;
          /*- cross-section random effect, constant over ts -*/
          dummy = 10000*rannor( seed3 );
          do ts = 1 to 100;
             /*- generate regressors and compute the structural */
             /*- part of the dependent variable                 */
             y = 5; /* intercept */
             do k = 1 to 10;
                x[k] = (cs + ts ) * (0.001*ranuni( k ) + 1) ;
                y = y + x[k] * k;
             end;
             /*- add an error term, such that e ~ N(0, 10000**2) ---*/
             y = y + 10000*rannor( seed2 );
             /*- add a random effect, such that u ~ N(0, 10000**2) -*/
             y = y + dummy;
             output;
          end;
       end;
    run;
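Read directly from the statements in the DATA step, the data generating process can be summarized as follows. The symbols $i$ (for `cs`), $t$ (for `ts`), $u_i$ (for `dummy`), $U$, and $Z$ are notation introduced here for illustration and do not appear in the code.

```latex
\begin{align*}
y_{it}   &= 5 + \sum_{k=1}^{10} k \, x_{k,it} + u_i + e_{it},
            \qquad i = 1,\dots,50000, \quad t = 1,\dots,100, \\
x_{k,it} &= (i + t)\,\bigl(1 + 0.001\,U_{k,it}\bigr),
            \qquad U_{k,it} \sim \mathrm{Uniform}(0,1), \\
u_i      &= 10000\,Z_i, \qquad e_{it} = 10000\,Z_{it},
            \qquad Z_i,\ Z_{it} \sim N(0,1).
\end{align*}
```

Under this reading, the theoretical intercept is 5, the theoretical coefficient of the $k$th regressor is $k$, and the random effect and the error term have the same variance, so the random effect accounts for half of the total error variance.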
The model is executed in the distributed computing environment with one thread and only one node. These settings are used to obtain a hypothetical environment that might resemble running the HPPANEL procedure on a desktop workstation with a single-core CPU. To run the following statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to the macro variables in the example with the appropriate values.
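As a minimal sketch, the two macro variables could be defined with %LET statements before the statements that reference them. The host name and install path shown here are placeholders and not real values; substitute the ones that apply to your own grid.

```sas
/* Hypothetical values: replace with your grid host and install location */
%let GRIDHOST       = mygrid.example.com;
%let GRIDINSTALLLOC = /opt/TKGrid;
```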
    option set=GRIDHOST="&GRIDHOST";
    option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";

    proc hppanel data=hppan_ex01 ranone;
       id cs ts;
       model y = x1-x10;
       performance nodes = 1 threads = 1 details
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
    run;
In Output 7.1.1, the "Performance Information" table shows that the model was estimated in a distributed environment on one node with one thread. The grid host and the grid install location are taken from the macro variables GRIDHOST and GRIDINSTALLLOC, respectively.
Output 7.1.2 shows the results for the one-way random effects model. The "Model Information" table shows detailed information about the model. The "Number of Observations" table indicates that all 5 million observations were used to fit the model. In the "Parameter Estimates" table, the estimates for x2 through x10 are highly significant, whereas the intercept and x1 are not significantly different from zero. All estimates lie within two standard errors of the theoretical values that were set for them during the data generating process. In the "Timing" table, you can see that for 5 million observations, computing moments took 5840.62 seconds, and the cross-product accumulation took 278.51 seconds.
Output 7.1.2: One-Way Random Effects Model
Parameter Estimates

Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 27.06229 | 93.06534 | 0.29 | 0.7712 |
x1 | 1 | 0.44857 | 0.51089 | 0.88 | 0.3799 |
x2 | 1 | 2.18393 | 0.51098 | 4.27 | <.0001 |
x3 | 1 | 2.70052 | 0.51099 | 5.28 | <.0001 |
x4 | 1 | 4.49262 | 0.51100 | 8.79 | <.0001 |
x5 | 1 | 5.54728 | 0.51076 | 10.86 | <.0001 |
x6 | 1 | 6.50872 | 0.51088 | 12.74 | <.0001 |
x7 | 1 | 6.54937 | 0.51098 | 12.82 | <.0001 |
x8 | 1 | 7.09160 | 0.51090 | 13.88 | <.0001 |
x9 | 1 | 8.64988 | 0.51092 | 16.93 | <.0001 |
x10 | 1 | 10.82664 | 0.51051 | 21.21 | <.0001 |
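The t Value column tests each estimate against zero. To compare an estimate with its theoretical value instead, one can standardize the difference by the standard error. The following worked check assumes the theoretical values implied by the DATA step (5 for the intercept and $k$ for the $k$th regressor); the symbol $z$ is notation introduced here.

```latex
\begin{align*}
z &= \frac{\hat{\beta} - \beta_{\text{true}}}{\mathrm{SE}(\hat{\beta})}, \\
z_{\text{Intercept}} &= \frac{27.06229 - 5}{93.06534} \approx 0.24, \\
z_{x8}  &= \frac{7.09160 - 8}{0.51090} \approx -1.78, \\
z_{x10} &= \frac{10.82664 - 10}{0.51051} \approx 1.62.
\end{align*}
```

The x8 and x10 estimates are the two slopes that lie farthest from their theoretical values, and both remain within two standard errors.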
In the following statements, the PERFORMANCE statement is modified to request a grid that has 10 nodes, where each node spawns one thread:
    proc hppanel data=hppan_ex01 ranone;
       id cs ts;
       model y = x1-x10;
       performance nodes = 10 threads = 1 details
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
    run;

In Output 7.1.3, the "Performance Information" table shows that the model was estimated in a distributed environment on 10 nodes with one thread each. The grid host and the grid install location are again taken from the macro variables GRIDHOST and GRIDINSTALLLOC.
Although the two models are identical, estimation took only 12 minutes in the second run, which used 10 nodes with one thread each, compared with 1 hour and 37 minutes in the first run, which used a single node with one thread.
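As a rough back-of-the-envelope calculation based only on the rounded times quoted above, the speedup $S$ and the parallel efficiency $E$ on 10 nodes work out as follows. The symbols $T_1$, $T_{10}$, $S$, and $E$ are notation introduced here.

```latex
\begin{align*}
T_1 &= 1\ \text{h}\ 37\ \text{min} = 97\ \text{min}, \qquad T_{10} = 12\ \text{min}, \\
S   &= \frac{T_1}{T_{10}} = \frac{97}{12} \approx 8.1, \\
E   &= \frac{S}{10} \approx 0.81.
\end{align*}
```

That is, the 10-node run is about eight times faster, which corresponds to roughly 80% of ideal linear scaling. Because the 12-minute figure is rounded, these values are approximate.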