The HPSEVERITY Procedure

Example 5.6 Benefits of Distributed and Multithreaded Computing

One of the key features of the HPSEVERITY procedure is that it takes advantage of distributed and multithreaded computing machinery to solve a given problem faster. This example illustrates the benefits of using multithreading and distributed computing.

The example uses a simulated data set Work.Largedata, which contains 10,000,000 observations, some of which are right-censored or left-truncated. The losses are affected by three external effects. The DATA step program that generates this data set is available in the accompanying sample program hpseve06.sas.

The following PROC HPSEVERITY step fits all the predefined distributions to the data in the Work.Largedata data set on the client machine with just one thread of computation:

/* Fit all predefined distributions without any multithreading or 
   distributed computing */
proc hpseverity data=largedata criterion=aicc initsample(size=20000);
   loss y / lt=threshold rc=limit;
   scalemodel x1-x3;
   dist _predef_;
   performance nthreads=1 bufsize=1000000 details;
run;

The NTHREADS=1 option in the PERFORMANCE statement specifies that just one thread of computation be used. The absence of the NODES= option in the PERFORMANCE statement specifies that single-machine mode of execution be used. That is, this step does not use any multithreading or distributed computing. The BUFSIZE= option in the PERFORMANCE statement specifies the number of observations to read at one time. Specifying a larger value tends to decrease the time it takes to load the data. The DETAILS option in the PERFORMANCE statement enables reporting of the timing information. The INITSAMPLE option in the PROC HPSEVERITY statement specifies that a uniform random sample of at most 20,000 observations be used for parameter initialization.

The “Performance Information” and “Procedure Task Timing” tables that PROC HPSEVERITY prepares are shown in Output 5.6.1. The “Performance Information” table contains the information about the execution environment. The “Procedure Task Timing” table indicates the total time and relative time taken by each of the five main tasks of PROC HPSEVERITY. As that table shows, it takes around 25 minutes (1,513.85 seconds) to estimate parameters, which is usually the most time-consuming of all the tasks.

Output 5.6.1: Performance for Single-Machine Mode with No Multithreading

The HPSEVERITY Procedure

Performance Information
Execution Mode	Single-Machine
Number of Threads	1

Procedure Task Timing
Task	Seconds	Percent
Load and Prepare Models	4.41	0.29%
Load and Prepare Data	1.36	0.09%
Initialize Parameters	0.81	0.05%
Estimate Parameters	1513.85	99.48%
Compute Fit Statistics	1.26	0.08%

If a grid appliance is not available, you can still improve performance by using multiple threads of computation, which is in fact the default behavior. The following PROC HPSEVERITY step fits all the predefined distributions by using all the logical CPU cores of the machine:

options cpucount=actual;

/* Fit all predefined distributions with multithreading, but no
   distributed computing */
proc hpseverity data=largedata criterion=aicc initsample(size=20000);
   loss y / lt=threshold rc=limit;
   scalemodel x1-x3;
   dist _predef_;
   performance bufsize=1000000 details;
run;

When you do not specify the NTHREADS= option in the PERFORMANCE statement, the HPSEVERITY procedure uses the value of the CPUCOUNT= system option to decide the number of threads to use in single-machine mode. Setting the CPUCOUNT= option to ACTUAL before the PROC HPSEVERITY step enables the procedure to use all the logical cores of the machine. The machine that is used to obtain these results (and the earlier results in Output 5.6.1) has four physical CPU cores, each with a clock speed of 3.4 GHz. Hyperthreading is enabled on the CPUs to yield eight logical CPU cores; this number is confirmed by the “Performance Information” table in Output 5.6.2. The results in the “Procedure Task Timing” table in Output 5.6.2 indicate that the use of multithreading has improved the performance significantly: the time to estimate parameters drops from around 25 minutes to around 5.6 minutes.

Output 5.6.2: Performance for Single-Machine Mode with Eight Threads

The HPSEVERITY Procedure

Performance Information
Execution Mode	Single-Machine
Number of Threads	8

Procedure Task Timing
Task	Seconds	Percent
Load and Prepare Models	0.34	0.10%
Load and Prepare Data	1.01	0.30%
Initialize Parameters	0.65	0.19%
Estimate Parameters	335.37	99.14%
Compute Fit Statistics	0.89	0.26%
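
As a rough check on the gain from multithreading, the following DATA step computes the speedup and the per-thread efficiency from the “Estimate Parameters” times that are reported in Output 5.6.1 and Output 5.6.2. This is only an illustrative sketch: the data set name Work.Speedup and its variable names are hypothetical, and the timings are copied by hand from the two tables.

```sas
/* Illustrative sketch: speedup from one thread to eight threads.      */
/* Work.Speedup and its variable names are hypothetical. The times     */
/* are the "Estimate Parameters" seconds from Output 5.6.1 and 5.6.2.  */
data speedup;
   t1 = 1513.85;                      /* seconds with NTHREADS=1       */
   t8 = 335.37;                       /* seconds with eight threads    */
   nthreads   = 8;
   speedup    = t1 / t8;              /* how many times faster         */
   efficiency = speedup / nthreads;   /* fraction of the ideal speedup */
   format speedup 6.2 efficiency percent8.1;
run;

proc print data=speedup noobs;
run;
```

The speedup works out to about 4.5, or roughly 56% of the ideal eightfold gain. One plausible reason for the gap is that only four of the eight logical cores are physical cores.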

When a grid appliance is available, performance can be further improved by using more than one node in the distributed mode of execution. Large data sets are usually predistributed on the grid appliance that hosts a distributed database. In other words, large problems are best suited for the alongside-the-database model of execution. However, for the purpose of illustration, this example assumes that the data set is available on the client machine and is then distributed to the grid nodes by the HPSEVERITY procedure according to the options that are specified in the PERFORMANCE statement.

The next few PROC HPSEVERITY steps are run on a grid appliance by varying the number of nodes and the number of threads that are used within each node.

First, the following statements are submitted to specify the appliance host (GRIDHOST= system option) and the installation location of shared libraries on the appliance (GRIDINSTALLLOC= system option):

option set=GRIDHOST      ="&GRIDHOST";
option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";

To run the preceding statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with the appropriate values. For more information about the GRIDHOST= and GRIDINSTALLLOC= options, see the section PERFORMANCE Statement.
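
For illustration, the macro variables might be defined as follows before the OPTION SET= statements are submitted. Both values are hypothetical placeholders, not real locations; substitute the host name and installation path of your own grid appliance.

```sas
/* Hypothetical placeholder values: replace them with the host name  */
/* and the shared-library installation path of your grid appliance.  */
%let GRIDHOST       = mygrid.example.com;
%let GRIDINSTALLLOC = /opt/mygrid/install;
```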

To establish a reference point for the performance of one CPU of a grid node, the results of using only one node of the grid appliance without any multithreading are presented first. The particular grid appliance that is used to obtain these results has eight nodes. Each node has 24 logical CPU cores with a clock speed of 2.93 GHz. The following PROC HPSEVERITY step fits all the predefined distributions to the data in the Work.Largedata data set:

/* Fit all predefined distributions on 1 grid node without 
   any multithreading */
proc hpseverity data=largedata criterion=aicc initsample(size=20000);
   loss y / lt=threshold rc=limit;
   scalemodel x1-x3;
   dist _predef_;
   performance nthreads=1 nodes=1 details;
run;

The PERFORMANCE statement specifies that only one node be used to fit the models, with only one thread of computation on that node. The “Performance Information” and “Procedure Task Timing” tables that PROC HPSEVERITY prepares are shown in Output 5.6.3. It takes around 36 minutes to complete the task of estimating parameters.

The computations and time taken to fit each model are also shown in the “Estimation Details” table, which is generated whenever you specify the DETAILS option in the PERFORMANCE statement. This table can be useful for comparing the relative effort required to fit each model and drawing some broader conclusions. For example, even though the Pareto distribution takes a larger number of iterations, function calls, and gradient and Hessian updates than the gamma distribution, it takes less time to complete; this indicates that the individual PDF and CDF computations of the gamma distribution are more expensive than those of the Pareto distribution.

Output 5.6.3: Performance on One Grid Appliance Node with No Multithreading

The HPSEVERITY Procedure

Performance Information
Host Node	<< your grid host >>
Execution Mode	Distributed
Grid Mode	Symmetric
Number of Compute Nodes	1
Number of Threads per Node	1

Estimation Details
Distribution	Converged	Iterations	Function Calls	Gradient Updates	Hessian Updates	Time (Seconds)
Burr	Yes	11	28	104	90	290.37
Exp	Yes	4	12	27	20	29.98
Gamma	Yes	5	15	35	27	777.45
Igauss	Yes	4	12	27	20	271.45
Logn	Yes	4	12	27	20	114.95
Pareto	Maybe	50	137	1430	1377	461.36
Gpd	Yes	6	17	44	35	116.90
Weibull	Yes	4	12	27	20	70.84

Procedure Task Timing
Task	Seconds	Percent
Load and Prepare Models	0.48	0.02%
Load and Prepare Data	0.70	0.03%
Initialize Parameters	1.22	0.06%
Estimate Parameters	2133.31	99.80%
Compute Fit Statistics	1.91	0.09%
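
The comparison of the gamma and Pareto fits can be made more concrete by dividing each model's time by its number of function calls. The following sketch does that for the values in the “Estimation Details” table in Output 5.6.3. The data set name Work.EstDetails and its variable names are hypothetical, and the ratio is only a crude indicator, because the reported time also covers gradient and Hessian updates.

```sas
/* Illustrative sketch: rough time per function call for each model.  */
/* Work.EstDetails and its variable names are hypothetical. Values    */
/* are copied from the "Estimation Details" table in Output 5.6.3.    */
data estdetails;
   input dist $ iterations funccalls seconds;
   secpercall = seconds / funccalls;   /* crude per-call cost */
   format secpercall 8.2;
   datalines;
Burr     11  28 290.37
Exp       4  12  29.98
Gamma     5  15 777.45
Igauss    4  12 271.45
Logn      4  12 114.95
Pareto   50 137 461.36
Gpd       6  17 116.90
Weibull   4  12  70.84
;

/* List the most expensive models per call first */
proc sort data=estdetails;
   by descending secpercall;
run;

proc print data=estdetails noobs;
run;
```

By this measure the gamma distribution costs about 52 seconds per function call and the Pareto distribution about 3.4 seconds, which is consistent with the observation that the gamma PDF and CDF computations are the more expensive ones.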

To obtain the next reference point for performance, the following PROC HPSEVERITY step specifies that 24 computation threads be used on one node of the grid appliance:

/* Fit all predefined distributions on 1 grid node with multithreading */
proc hpseverity data=largedata criterion=aicc initsample(size=20000);
   loss y / lt=threshold rc=limit;
   scalemodel x1-x3;
   dist _predef_;
   performance nthreads=24 nodes=1 details;
run;

The performance tables that are prepared by the preceding statements are shown in Output 5.6.4. As the “Procedure Task Timing” table shows, use of multithreading has improved the performance significantly over that of the single-threaded case. Now, it takes around 3 minutes to complete the task of estimating parameters.

Output 5.6.4: Performance Information with Multithreading but No Distributed Computing

The HPSEVERITY Procedure

Performance Information
Host Node	<< your grid host >>
Execution Mode	Distributed
Grid Mode	Symmetric
Number of Compute Nodes	1
Number of Threads per Node	24

Procedure Task Timing
Task	Seconds	Percent
Load and Prepare Models	0.36	0.20%
Load and Prepare Data	0.39	0.21%
Initialize Parameters	0.98	0.53%
Estimate Parameters	181.10	98.40%
Compute Fit Statistics	1.21	0.66%

You can combine the power of multithreading and distributed computing by specifying that multiple nodes of the grid and all available threads of execution within each node be used to accomplish the task. The following PROC HPSEVERITY step specifies that all eight nodes of the grid appliance be used:

/* Fit all predefined distributions with distributed computing and 
   multithreading within each node */
proc hpseverity data=largedata criterion=aicc initsample(size=20000);
   loss y / lt=threshold rc=limit;
   scalemodel x1-x3;
   dist _predef_;
   performance nodes=8 details;
run;

Omitting the NTHREADS= option from the PERFORMANCE statement in distributed mode results in the use of all 24 logical CPU cores on each node of the grid.

When the DATA= data set is local to the client machine, as it is in this example, you must specify a nonzero value for the NODES= option in the PERFORMANCE statement in order to enable the distributed mode of execution. In other words, when the procedure is not executing alongside the database, omitting the NODES= option is equivalent to specifying NODES=0, which requests single-machine mode.

The performance tables that are prepared by the preceding statements are shown in Output 5.6.5. If you compare these tables to the tables in Output 5.6.3 and Output 5.6.4, you see that the parameter estimation task, which takes around 36 minutes with a single thread on one node and around 3 minutes with 24 threads on one node, completes in under a minute (51.80 seconds) when you use the full computational resources of the grid appliance to combine the power of multithreaded and distributed computing.

Output 5.6.5: Performance Information with Distributed Computing and Multithreading

The HPSEVERITY Procedure

Performance Information
Host Node	<< your grid host >>
Execution Mode	Distributed
Grid Mode	Symmetric
Number of Compute Nodes	8
Number of Threads per Node	24

Procedure Task Timing
Task	Seconds	Percent
Load and Prepare Models	0.92	1.66%
Load and Prepare Data	0.08	0.14%
Initialize Parameters	1.23	2.22%
Estimate Parameters	51.80	93.25%
Compute Fit Statistics	1.52	2.73%
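
To summarize the three grid runs, the following sketch computes the speedup of each configuration relative to the run that uses one node and one thread, along with the efficiency per thread. The data set name Work.GridSpeedup and its variable names are hypothetical, and the times are the “Estimate Parameters” seconds copied from Output 5.6.3, Output 5.6.4, and Output 5.6.5.

```sas
/* Illustrative sketch: speedup of each grid configuration relative   */
/* to one node with one thread. Work.GridSpeedup and its variable     */
/* names are hypothetical. Times are the "Estimate Parameters"        */
/* seconds from Output 5.6.3, Output 5.6.4, and Output 5.6.5.         */
data gridspeedup;
   input nodes threadspernode seconds;
   retain baseline;
   if _n_ = 1 then baseline = seconds;     /* 1 node, 1 thread        */
   totalthreads = nodes * threadspernode;
   speedup      = baseline / seconds;
   efficiency   = speedup / totalthreads;  /* share of ideal speedup  */
   format speedup 6.2 efficiency percent8.1;
   drop baseline;
   datalines;
1  1 2133.31
1 24  181.10
8 24   51.80
;

proc print data=gridspeedup noobs;
run;
```

The speedup is about 11.8 with 24 threads on one node and about 41 with all eight nodes. The efficiency per thread falls as resources are added, which is common when fixed overheads do not shrink with the number of threads.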

The machines that were used to obtain these performance results are relatively modest machines, and PROC HPSEVERITY was executed in a multiuser environment; that is, background processes were running in single-machine mode or other users were using the grid in distributed mode. For time-critical applications, you can use a larger, dedicated grid that consists of more powerful machines to achieve more dramatic performance improvement.