PROC HPPRINCOMP shows its real power when the computation is conducted with multiple threads or in a distributed environment. This example shows how you can run PROC HPPRINCOMP in single-machine and distributed modes and how you can switch between these modes of execution. For more information about the execution modes of SAS high-performance analytics procedures, see the section Processing Modes in Chapter 3: Shared Concepts and Topics. The following DATA step generates 5,000,000 observations of 100 uniformly distributed random variables:
    data ex2Data;
       array x{100};
       do i = 1 to 5000000;
          do j = 1 to dim(x);
             x[j] = ranuni(1);
          end;
          output;
       end;
    run;
The following statements use PROC HPPRINCOMP to perform a principal component analysis and to output various statistics to the Stats data set (OUTSTAT=Stats):
    proc hpprincomp data=ex2Data n=20 outstat=Stats;
       var x:;
       performance details;
    run;
Output 12.2.1 shows the "Performance Information" table. This table shows that the HPPRINCOMP procedure executes in single-machine mode on four threads, because the client machine has four CPUs. You can force a certain number of threads on any machine to be involved in the computations by specifying the NTHREADS= option in the PERFORMANCE statement.
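As a minimal sketch of the NTHREADS= option, the following statements repeat the single-machine analysis but request two threads. The value 2 is an arbitrary choice for illustration and is not taken from this example; the outputs discussed in this example come from the statements without NTHREADS=.

```sas
/* Sketch only: NTHREADS=2 is an assumed value chosen for illustration */
proc hpprincomp data=ex2Data n=20 outstat=Stats;
   var x:;
   performance details nthreads=2;
run;
```

If you run these statements, the "Performance Information" table should report the requested number of threads rather than the number of CPUs on the client machine.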
Output 12.2.2 shows timing information for the PROC HPPRINCOMP run. This table is produced when you specify the DETAILS option in the PERFORMANCE statement. You can see that, in this case, the majority of time is spent reading the data and computing the moments.
To switch to running PROC HPPRINCOMP in distributed mode, specify valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to use OPTIONS SET commands to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables. For information about setting these options or environment variables, see the section Processing Modes in Chapter 3: Shared Concepts and Topics.
The following statements provide an example. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.
    proc hpprincomp data=ex2Data n=20 outstat=Stats;
       var x:;
       performance details nodes=4
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
    run;
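The environment-variable alternative mentioned earlier can be sketched as follows. The host name and installation path are placeholders, not values from this example; substitute the values for your own grid. With GRIDHOST and GRIDINSTALLLOC set through OPTIONS SET commands, the PERFORMANCE statement needs only the NODES= option to request distributed mode.

```sas
/* Placeholder values (assumptions): replace with your grid host and install location */
option set=GRIDHOST="your-grid-host.example.com";
option set=GRIDINSTALLLOC="/path/to/TKGrid";

/* HOST= and INSTALL= are omitted; the environment variables supply them */
proc hpprincomp data=ex2Data n=20 outstat=Stats;
   var x:;
   performance details nodes=4;
run;
```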
The execution mode in the "Performance Information" table shown in Output 12.2.3 indicates that the calculations were performed in a distributed environment that uses four nodes, each of which uses 32 threads.
Another indication of distributed execution is the following message in the SAS log, which is issued by all high-performance analytics procedures:
NOTE: The HPPRINCOMP procedure is executing in the distributed computing environment with 4 worker nodes.
Output 12.2.4 shows timing information for this distributed run of the HPPRINCOMP procedure. In contrast with the single-machine mode (where reading the data and computing the moments dominate the time spent), the majority of time in the distributed-mode run is spent distributing the data.
Output 12.2.4: Timing in Distributed Mode

Procedure Task Timing

| Task | Seconds | Percent |
|---|---|---|
| Obtaining Settings | 0.00 | 0.00% |
| Distributing Data | 35.44 | 95.09% |
| Reading Data and Computing Moments | 1.27 | 3.40% |
| Computing Correlation Matrix | 0.29 | 0.78% |
| Performing Eigenvalue Decomposition | 0.00 | 0.00% |
| Producing Output Statistics Data Set | 0.01 | 0.04% |
| Waiting on Client | 0.26 | 0.69% |