This example shows the use of the one-way random-effects model that is available in the HPPANEL procedure; the example emphasizes processing a large data set and the performance improvements that are achieved by executing in a high-performance distributed environment.
The following DATA step generates five million observations from one-way panel data that includes 50,000 cross sections and 100 time periods:
data hppan_ex01 (keep = cs ts y x1-x10);
   retain seed 55371;
   array x[10];
   label y = 'Dependent Variable';
   do cs = 1 to 50000;
      dummy = 10 * rannor(seed);
      do ts = 1 to 100;
         /*- generate regressors and compute the structural */
         /*- part of the dependent variable                 */
         y = 5;
         do k = 1 to 10;
            x[k] = -1 + 2 * ranuni(seed);
            y = y + x[k] * k;
         end;
         /*- add an error term, such that e - N(0,100) -------*/
         y = y + 10 * rannor(seed);
         /*- add a random effect, such that v - N(0,100) -------*/
         y = y + dummy;
         output;
      end;
   end;
run;
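Read as a formula, the DATA step above draws each observation from the following one-way random-effects model, where i indexes the cross sections (cs) and t indexes the time periods (ts). The symbols v_i and e_it are shorthand introduced here for the DATA step's dummy variable and its per-observation error term; they are not names that appear in the code.

```latex
\begin{aligned}
y_{it} &= 5 + \sum_{k=1}^{10} k\, x_{itk} + v_i + e_{it},
  \qquad i = 1,\dots,50000,\quad t = 1,\dots,100, \\
x_{itk} &\sim U(-1,\,1), \qquad
v_i \sim N(0,\,100), \qquad
e_{it} \sim N(0,\,100).
\end{aligned}
```

Under this reading, the true intercept is 5 and the true coefficient of xk is k.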
The estimation is executed in distributed mode on a grid with ten nodes, with one thread per node. To run the following statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to the macro variables in the example with the appropriate values.
%let GRIDHOST = <<your grid host>>;
%let GRIDINSTALLLOC = <<your grid install location>>;
option set = GRIDHOST = "&GRIDHOST";
option set = GRIDINSTALLLOC = "&GRIDINSTALLLOC";
proc hppanel data=hppan_ex01;
   id cs ts;
   model y = x1-x10 / ranone;
   performance nodes = 10 threads = 1 details
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
In Output 7.1.1, the "Performance Information" table shows that the model was estimated in a distributed environment with ten nodes and one thread per node, on the grid host that the macro variable GRIDHOST defines. The grid installation location is defined in the macro variable GRIDINSTALLLOC.
Output 7.1.1: Grid Information with Ten Nodes and One Thread per Node
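To gauge how much the distributed run helps, one option is to fit the same random-effects model in single-machine mode and compare the "Procedure Task Timing" tables. The following is a minimal sketch, not part of the original example: it assumes that NODES=0 in the PERFORMANCE statement requests single-machine mode in your installation, and the thread count of 4 is an arbitrary illustrative value.

```sas
/* Sketch only: same model as above, run without the grid.        */
/* Assumption: NODES=0 requests single-machine mode.               */
/* Assumption: THREADS=4 is an arbitrary choice; adjust as needed. */
proc hppanel data=hppan_ex01;
   id cs ts;
   model y = x1-x10 / ranone;
   performance nodes = 0 threads = 4 details;
run;
```

Because the HOST= and INSTALL= options are omitted, this run does not depend on the GRIDHOST and GRIDINSTALLLOC macro variables.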
Output 7.1.2 shows the results for the one-way random-effects model. The "Model Information" table shows detailed information about the model. The "Number of Observations" table indicates that all five million observations were used to fit the model. All parameter estimates in the "Parameter Estimates" table are highly significant and close to the true values that were used in the data generating process: 5 for the intercept and k for the coefficient of xk (k = 1, ..., 10). In the "Procedure Task Timing" table, you can see that for five million observations, computing the moments took 101.53 seconds, and the time taken for cross-product accumulation was negligible.
Output 7.1.2: One-Way Random-Effects Model
Parameter Estimates

Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 4.96955 | 0.04492 | 110.62 | <.0001 |
x1 | 1 | 1.00902 | 0.00778 | 129.69 | <.0001 |
x2 | 1 | 1.99743 | 0.00778 | 256.66 | <.0001 |
x3 | 1 | 3.00116 | 0.00778 | 385.64 | <.0001 |
x4 | 1 | 3.99847 | 0.00778 | 513.68 | <.0001 |
x5 | 1 | 4.99497 | 0.00778 | 641.81 | <.0001 |
x6 | 1 | 6.01034 | 0.00778 | 772.12 | <.0001 |
x7 | 1 | 6.99770 | 0.00778 | 899.39 | <.0001 |
x8 | 1 | 7.98897 | 0.00778 | 1026.61 | <.0001 |
x9 | 1 | 9.00692 | 0.00778 | 1157.12 | <.0001 |
x10 | 1 | 10.00563 | 0.00778 | 1285.47 | <.0001 |
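As a rough plausibility check on the standard errors in Output 7.1.2, the following back-of-the-envelope calculation uses only quantities from the DATA step: n = 50,000 cross sections, T = 100 time periods, regressors that are uniform on (-1, 1) and therefore have variance 1/3, and a variance of 100 for both the random effect (written here as sigma_v squared) and the error term (sigma_e squared). These are large-sample approximations introduced for illustration, not the formulas that PROC HPPANEL evaluates.

```latex
\begin{aligned}
\operatorname{se}(\hat\beta_k)
  &\approx \sqrt{\frac{\sigma_e^2}{nT\,\operatorname{Var}(x)}}
   = \sqrt{\frac{100}{5{,}000{,}000 \times \tfrac{1}{3}}}
   = \sqrt{6 \times 10^{-5}} \approx 0.00775, \\
\operatorname{se}(\hat\beta_0)
  &\approx \sqrt{\frac{\sigma_v^2}{n} + \frac{\sigma_e^2}{nT}}
   = \sqrt{0.002 + 0.00002} \approx 0.0449 .
\end{aligned}
```

Both values are close to the reported standard errors of 0.00778 for the slopes and 0.04492 for the intercept. The slope approximation is slightly low because it ignores the small amount of regressor variation that the random-effects transformation removes.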
For comparison, you now fit a pooled regression model to the same data, again using a grid of ten nodes with one thread per node. The following SAS statements perform the estimation on the grid:
proc hppanel data=hppan_ex01;
   id cs ts;
   model y = x1-x10 / pooled;
   performance nodes = 10 threads = 1 details
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
Output 7.1.3 shows that the parameter estimates are similar to those from the random-effects estimator. The timings are also similar, which indicates that the bulk of the computational effort is due to tasks that are common to both random-effects estimation and standard OLS regression. In both cases, estimation is dominated by the calculation of sums of squares and other moment terms over the whole data set.
Output 7.1.3: Pooled Regression Model
Parameter Estimates

Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 4.96957 | 0.00632 | 786.03 | <.0001 |
x1 | 1 | 1.01251 | 0.01095 | 92.49 | <.0001 |
x2 | 1 | 1.98374 | 0.01095 | 181.17 | <.0001 |
x3 | 1 | 3.00294 | 0.01095 | 274.23 | <.0001 |
x4 | 1 | 3.99649 | 0.01095 | 364.90 | <.0001 |
x5 | 1 | 5.00187 | 0.01095 | 456.77 | <.0001 |
x6 | 1 | 5.99952 | 0.01095 | 547.77 | <.0001 |
x7 | 1 | 7.00478 | 0.01095 | 639.88 | <.0001 |
x8 | 1 | 7.97232 | 0.01095 | 728.13 | <.0001 |
x9 | 1 | 9.01244 | 0.01095 | 822.90 | <.0001 |
x10 | 1 | 10.01578 | 0.01095 | 914.52 | <.0001 |
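A rough calculation from the DATA step's settings reproduces the pooled standard errors in Output 7.1.3. Conventional OLS standard errors treat the composite disturbance (random effect plus error term) as independent noise, here with variance 100 + 100 = 200. With N = 5,000,000 observations and regressors that are uniform on (-1, 1), so that their variance is 1/3, a large-sample approximation is shown below. It is offered as an illustration under these assumptions, not as the procedure's own formula.

```latex
\begin{aligned}
\operatorname{se}(\hat\beta_k)
  &\approx \sqrt{\frac{200}{N\,\operatorname{Var}(x)}}
   = \sqrt{\frac{200}{5{,}000{,}000 \times \tfrac{1}{3}}}
   = \sqrt{1.2 \times 10^{-4}} \approx 0.01095, \\
\operatorname{se}(\hat\beta_0)
  &\approx \sqrt{\frac{200}{N}}
   = \sqrt{4 \times 10^{-5}} \approx 0.00632 .
\end{aligned}
```

These values agree with Output 7.1.3. The pooled intercept standard error is much smaller than the random-effects value of 0.04492 in Output 7.1.2 because the pooled calculation treats all five million observations as independent, even though each random effect is shared by the 100 observations in its cross section.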