This example demonstrates the one-way random effects model in the HPPANEL procedure, with an emphasis on processing a large data set and on the performance improvements that are achieved by executing in a high-performance distributed environment.
The following DATA step generates 5 million observations for a one-way panel data set that includes 50,000 cross sections and 100 time periods:
data hppan_ex01 (keep = cs ts y x1-x10);
   retain seed1 55371 seed2 97335 seed3 19412;
   array x[10];
   label y  = 'dependent var.';
   label x1 = 'first independent var.';
   label x2 = 'second independent var.';
   label x3 = 'third independent var.';
   int = 1;
   do cs = 1 to 50000;
      /*- draw the random effect once per cross section ---*/
      dummy = 10000*rannor( seed3 );
      do ts = 1 to 100;
         /*- generate regressors and compute the structural */
         /*- part of the dependent variable                 */
         y = 5; /* intercept */
         do k = 1 to 10;
            x[k] = (cs + ts ) * (0.001*ranuni( k ) + 1) ;
            y = y + x[k] * k;
         end;
         /*- add an error term, such that e ~ N(0, 10000**2) ---*/
         y = y + 10000*rannor( seed2 );
         /*- add a random effect, such that u ~ N(0, 10000**2) -*/
         y = y + dummy;
         output;
      end;
   end;
run;
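Before fitting the model, it can be worth confirming that the generated panel has the intended dimensions. The following PROC SQL step is an illustrative check that is not part of the original example; the column aliases `n_obs`, `n_cross_sections`, and `n_time_periods` are names chosen here for readability.

```sas
/* Illustrative sanity check (not part of the original example):   */
/* count the rows and the distinct ID values in the generated data */
proc sql;
   select count(*)           as n_obs,
          count(distinct cs) as n_cross_sections,
          count(distinct ts) as n_time_periods
   from hppan_ex01;
quit;
```

Because the DATA step writes one observation for each of the 50,000 × 100 combinations of cs and ts, the query should report 5,000,000 observations, 50,000 cross sections, and 100 time periods.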
The model is executed in the distributed computing environment with one thread and only one node. These settings are used to obtain a hypothetical environment that might resemble running the HPPANEL procedure on a desktop workstation with a single-core CPU. To run the following statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to the macro variables in the example with the appropriate values.
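One way to define the two macro variables is with %LET statements, as in the following sketch. The host name and the install path shown here are hypothetical placeholders, not values from the original example; substitute the values that apply to your site.

```sas
/* Hypothetical placeholder values; replace them with your site's settings */
%let GRIDHOST       = gridhost.example.com;
%let GRIDINSTALLLOC = /opt/TKGrid;
```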
option set=GRIDHOST="&GRIDHOST";
option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";

proc hppanel data=hppan_ex01 ranone;
   id cs ts;
   model y = x1-x10;
   performance nodes = 1 threads = 1 details
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
In Output 7.1.1, the “Performance Information” table shows that the model was estimated in distributed mode on one node with one thread. The grid host and the grid install location are taken from the macro variables GRIDHOST and GRIDINSTALLLOC, respectively.
Output 7.1.1: Grid Information with One Node and One Thread
Performance Information | |
---|---|
Host Node | << your grid host >> |
Install Location | << your grid install location >> |
Execution Mode | Distributed |
Grid Mode | Symmetric |
Number of Compute Nodes | 1 |
Number of Threads per Node | 1 |
Output 7.1.2 shows the results for the one-way random effects model. The “Model Information” table shows detailed information about the model. The “Number of Observations” table indicates that all 5 million observations were used to fit the model. In the “Parameter Estimates” table, the estimates for x2 through x10 are highly significant, whereas the intercept and x1 are not (p = 0.7712 and p = 0.3799, respectively). All estimates lie within about two standard errors of the theoretical values that were set during the data generating process (5 for the intercept and k for the regressor xk). In the “Procedure Task Timing” table, you can see that for 5 million observations, computing moments took 5706.25 seconds, and the cross-product accumulation took 272.95 seconds.
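For reference, the data generating process that the DATA step implements can be written as follows. This is a restatement of the code rather than a formula from the original text; the notation (i for the cross section cs, t for the time period ts, u for the random effect, and e for the error term) is chosen here for illustration.

```latex
y_{it} = 5 + \sum_{k=1}^{10} k \, x_{k,it} + u_i + e_{it},
\qquad
u_i \sim N\!\left(0,\; 10000^2\right),
\qquad
e_{it} \sim N\!\left(0,\; 10000^2\right)
```

Under this specification both variance components equal 10^8, which agrees closely with the estimates of 1.0704E8 for cross sections and 1.0007E8 for error in the “Variance Component Estimates” table.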
Output 7.1.2: One-Way Random Effects Model
Model Information | |
---|---|
Data Source | WORK.HPPAN_EX01 |
Response Variable | y |
Model | RANONE |
Variance Component | WANSBEEK |
Execution Mode | Distributed |
Fit Statistics | |
---|---|
Sum of Squared Error | 5.00008E14 |
Degree of Freedom | 4999989 |
Mean Squared Error | 100001811 |
Root Mean Squared Error | 10000 |
R-Square | 0.98318 |
Variance Component Estimates | |
---|---|
Variance Component for Cross Sections | 1.0704E8 |
Variance Component for Error | 1.0007E8 |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
Intercept | 1 | 27.06229 | 93.06534 | 0.29 | 0.7712 |
x1 | 1 | 0.44857 | 0.51089 | 0.88 | 0.3799 |
x2 | 1 | 2.18393 | 0.51098 | 4.27 | <.0001 |
x3 | 1 | 2.70052 | 0.51099 | 5.28 | <.0001 |
x4 | 1 | 4.49262 | 0.51100 | 8.79 | <.0001 |
x5 | 1 | 5.54728 | 0.51076 | 10.86 | <.0001 |
x6 | 1 | 6.50872 | 0.51088 | 12.74 | <.0001 |
x7 | 1 | 6.54937 | 0.51098 | 12.82 | <.0001 |
x8 | 1 | 7.09160 | 0.51090 | 13.88 | <.0001 |
x9 | 1 | 8.64988 | 0.51092 | 16.93 | <.0001 |
x10 | 1 | 10.82664 | 0.51051 | 21.21 | <.0001 |
Procedure Task Timing | ||
---|---|---|
Task | Seconds | Percent |
Data Read and Variable Levelization | 2.00 | 0.03% |
Communication to Client | 0.00 | 0.00% |
Computing Moments | 5706.25 | 95.40% |
Cross-Product Accumulation | 272.95 | 4.56% |
In the following statements, the PERFORMANCE statement is modified to request a grid that has 10 nodes, where each node spawns one thread:
proc hppanel data=hppan_ex01 ranone;
   id cs ts;
   model y = x1-x10;
   performance nodes = 10 threads = 1 details
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
In Output 7.1.3, the “Performance Information” table shows that the model was estimated in distributed mode on 10 nodes with one thread each. As before, the grid host and the grid install location are taken from the macro variables GRIDHOST and GRIDINSTALLLOC, respectively.
Output 7.1.3: Grid Information for 10 Nodes with One Thread Each
Performance Information | |
---|---|
Host Node | << your grid host >> |
Install Location | << your grid install location >> |
Execution Mode | Distributed |
Grid Mode | Symmetric |
Number of Compute Nodes | 10 |
Number of Threads per Node | 1 |
Although the two models are identical, Output 7.1.4 shows that estimating the model took only about 14 minutes (814 seconds in total) in the second run, which used 10 nodes with one thread each, compared with about 1 hour and 40 minutes (5,981 seconds in total) in the first run.
Output 7.1.4: Timing Information for 10 Nodes with One Thread Each
Procedure Task Timing | ||
---|---|---|
Task | Seconds | Percent |
Data Read and Variable Levelization | 0.48 | 0.06% |
Communication to Client | 0.00 | 0.00% |
Computing Moments | 784.07 | 96.34% |
Cross-Product Accumulation | 29.34 | 3.60% |
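The two timing tables can be compared directly. The following DATA step is a minimal sketch that is not part of the original example: it sums the task times that are reported in Output 7.1.2 and Output 7.1.4 and derives the speedup and the parallel efficiency. The variable names are chosen here for illustration.

```sas
/* Illustrative comparison of the two runs (not part of the original example) */
data _null_;
   /* Seconds copied from the "Procedure Task Timing" tables */
   total_1node  = 2.00 + 0.00 + 5706.25 + 272.95;
   total_10node = 0.48 + 0.00 +  784.07 +  29.34;

   /* Speedup of the 10-node run relative to the 1-node run */
   speedup = total_1node / total_10node;

   /* Fraction of the ideal 10-fold speedup that was achieved */
   efficiency = speedup / 10;

   put total_1node= total_10node= speedup= 5.2 efficiency= percent8.1;
run;
```

The totals are roughly 5,981 and 814 seconds, so the 10-node run is a little more than 7 times faster than the single-node run, which is somewhat short of the ideal 10-fold improvement.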