The HPCDM Procedure (Experimental)

Example 4.2 Using Externally Simulated Count Data

The COUNTREG procedure enables you to estimate count regression models that are based on the most commonly used discrete distributions, such as the Poisson, negative binomial (both $p=1$ and $p=2$), and Conway-Maxwell-Poisson distributions. PROC COUNTREG also enables you to fit zero-inflated models that are based on Poisson, negative binomial ($p=2$), and Conway-Maxwell-Poisson distributions. However, there might be situations in which you want to use some other method of fitting count regression models. For example, if you are modeling the number of loss events that are incurred by two financial instruments such that there is some dependency between the two, then you might use a multivariate frequency modeling method and simulate the counts for each instrument by using the dependency structure between the count model parameters of the two instruments. As another example, you might want to use different types of count models for different BY groups in your data; this is not possible in PROC COUNTREG in SAS/ETS 13.1 and earlier, so you need to simulate the counts for such BY groups externally. PROC HPCDM enables you to supply externally simulated counts by using the EXTERNALCOUNTS statement. PROC HPCDM then does not need to simulate the counts internally; it simulates only the severity of each loss event by using the severity model estimates in the SEVERITYEST= data set. The process is described and illustrated in the section Simulation with External Counts.

Suppose that you are a bank and that, as part of quantifying your operational risk, you want to estimate the aggregate loss distributions for two lines of business, retail banking and commercial banking, by using some key risk indicators (KRIs). Assume that your model fitting and model selection process has determined that the Poisson regression model and negative binomial regression model are the best-fitting count models for the number of loss events that are incurred in the retail banking and commercial banking businesses, respectively. Let CorpKRI1, CorpKRI2, CbKRI1, CbKRI2, and CbKRI3 be the KRIs that are used in the count regression model of the commercial banking business, and let CorpKRI1, RbKRI1, and RbKRI2 be the KRIs that are used in the count regression model of the retail banking business. Some examples of corporate-level KRIs (CorpKRI1 and CorpKRI2 in this example) are the ratio of temporary to permanent employees and the number of security breaches that are reported during a year. Some examples of KRIs that are specific to the commercial banking business (CbKRI1, CbKRI2, and CbKRI3 in this example) are the number of credit defaults, the proportion of financed assets that are movable, and penalty claims against your bank because of processing delays. Some examples of KRIs that are specific to the retail banking business (RbKRI1, RbKRI2, and RbKRI3 in this example) are the number of credit cards that are reported stolen, the fraction of employees who have not undergone fraud detection training, and the number of forged drafts and checks that are presented in a year.

Let the severity of each loss event in the commercial banking business be dependent on two KRIs, CorpKRI1 and CbKRI2. Let the severity of each loss event in the retail banking business be dependent on three KRIs, CorpKRI2, RbKRI1, and RbKRI3. Note that for each line of business, the set of KRIs that are used for the severity model is different from the set of KRIs that are used for the count model, although there is some overlap between the two sets. Further, the severity model for retail banking includes a new regressor (RbKRI3) that is not used for any of the count models. Such use of different sets of KRIs for count and severity models is typical of real-world applications.

Let the parameter estimates of the negative binomial and Poisson regression models, as determined by PROC COUNTREG, be available in the Work.CountEstEx2NB2 and Work.CountEstEx2Poisson data sets, respectively. These data sets are produced by using the OUTEST= option in the respective PROC COUNTREG statements. Let the parameter estimates of the best-fitting severity models, as determined by PROC SEVERITY, be available in the Work.SevEstEx2Best data set. You can find the code to prepare these data sets in the PROC HPCDM sample program hcdmex02.sas.

Now, consider that you want to estimate the distribution of the aggregate loss for a scenario, which is represented by a specific set of KRI values. The following DATA step illustrates one such scenario:

/* Generate a scenario data set for a single operating condition */
data singleScenario (keep=corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3
                          rbKRI1 rbKRI2 rbKRI3);
   array x{8} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3;
   call streaminit(5151);
   do i=1 to dim(x);
      x(i) = rand('NORMAL');
   end;
   output;
run;

The Work.SingleScenario data set contains all the KRIs that are included in the count and severity models of both business lines. Note that if you standardize or scale the KRIs while fitting the count and severity models, then you must apply the same standardization or scaling method to the values of the KRIs that you specify in the scenario. In this particular example, all KRIs are assumed to be standardized.
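As a minimal sketch of that requirement (written in Python for illustration, not taken from the SAS sample program), suppose one KRI was standardized to zero mean and unit variance before the models were fit. The helper name `standardize` and the historical values are hypothetical.

```python
import statistics

def standardize(value, train_mean, train_std):
    """Apply the z-score transform that was estimated on the fitting data."""
    return (value - train_mean) / train_std

# Hypothetical historical values of one KRI used when the models were fit.
history = [12.0, 15.0, 9.0, 14.0, 10.0]
train_mean = statistics.mean(history)
train_std = statistics.stdev(history)   # sample standard deviation

# A raw scenario value is transformed with the training statistics,
# not with statistics recomputed from the scenario itself.
raw_scenario_value = 16.0
scenario_kri = standardize(raw_scenario_value, train_mean, train_std)
print(round(scenario_kri, 4))
```

The point of the sketch is only that the scenario reuses the mean and standard deviation from the fitting data; any other scaling method would be reapplied in the same way.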

The following DATA step uses the scenario in the Work.SingleScenario data set to simulate 10,000 replications of the number of loss events that you might observe for each business line and writes the simulated counts to the NumLoss variable of the Work.LossCounts1 data set:

/* Simulate multiple replications of the number of loss events that
   you can expect in the scenario being analyzed */
data lossCounts1 (keep=line corpKRI1 corpKRI2 cbKRI2 rbKRI1 rbKRI3 numloss);
   array cxR{3} corpKRI1 rbKRI1 rbKRI2;
   array cbetaR{4} _TEMPORARY_;
   array cxC{5} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3;
   array cbetaC{6} _TEMPORARY_;

   retain theta;

   if _n_ = 1 then do;
      call streaminit(5151);
      * read count model estimates *;
      set countEstEx2NB2(where=(line='CommercialBanking' and _type_='PARM'));
      cbetaC(1) = Intercept;
      do i=1 to dim(cxC);
         cbetaC(i+1) = cxC(i);
      end;
      alpha = _Alpha;
      theta = 1/alpha;

      set countEstEx2Poisson(where=(line='RetailBanking' and _type_='PARM'));
      cbetaR(1) = Intercept;
      do i=1 to dim(cxR);
         cbetaR(i+1) = cxR(i);
      end;
   end;

   set singleScenario;
   do iline=1 to 2;
      if (iline=1) then line = 'CommercialBanking';
      else line = 'RetailBanking';
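      do repid=1 to 10000;
         /* Redraw until one nonzero count is obtained (nnz=1) or until
            five draws have been made; a replication whose draws are
            all zero writes no observation */
         nnz = 1;
         maxtries = 5*nnz;
         nc = 0;
         ntries = 0;
         do while (nc < nnz and ntries < maxtries);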
            * draw from count distribution *;
            if (iline=1) then do;
               xbeta = cbetaC(1);
               do i=1 to dim(cxC);
                  xbeta = xbeta + cxC(i) * cbetaC(i+1);
               end;
               Mu = exp(xbeta);
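               /* NEGB(p,k) counts failures before the k-th success; with
                  p = theta/(Mu+theta) and k = theta, its mean equals Mu */
               p = theta/(Mu+theta);
               numloss = rand('NEGB',p,theta);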
            end;
            else do;
               xbeta = cbetaR(1);
               do i=1 to dim(cxR);
                  xbeta = xbeta + cxR(i) * cbetaR(i+1);
               end;
               numloss = rand('POISSON', exp(xbeta));
            end;
            if (numloss > 0) then do;
               output;
               nc = nc + 1;
            end;
            ntries = ntries + 1;
         end;
      end;
   end;
run;

The Work.LossCounts1 data set contains the NumLoss variable in addition to the KRIs that are used by the severity regression model, which are needed by PROC HPCDM to simulate the aggregate loss.
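The reparameterization that the DATA step uses for the commercial banking counts can be checked by hand. The check assumes that RAND('NEGB', p, k) returns the number of failures before the $k$th success when each trial succeeds with probability $p$, so that its mean is $k(1-p)/p$ and its variance is $k(1-p)/p^2$. Substituting $k=\theta=1/\alpha$ and $p=\theta/(\mu+\theta)$, where $\mu=\exp(x'\beta)$, gives

$$
E[N] = \frac{\theta(1-p)}{p}
     = \theta \cdot \frac{\mu/(\mu+\theta)}{\theta/(\mu+\theta)}
     = \mu,
\qquad
\mathrm{Var}[N] = \frac{\theta(1-p)}{p^{2}}
     = \mu \cdot \frac{\mu+\theta}{\theta}
     = \mu + \alpha\mu^{2}.
$$

Under that assumption, the simulated counts have the regression mean $\mu$ and the quadratic variance function of the negative binomial ($p=2$) model.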

By default, PROC HPCDM computes an aggregate loss distribution by using each of the severity models that you specify in the SEVERITYMODEL statement. However, you can restrict PROC HPCDM to use only a subset of the severity models for a given BY group by modifying the SEVERITYEST= data set to include only the estimates of the desired severity models in each BY group, as illustrated in the following DATA step:

/* Keep only the best severity model for each business line
   and set coefficients of unused regressors in each model to 0 */
data sevestEx2Best;
    set sevestEx2;
    if ((line = 'CommercialBanking' and _model_ = 'Logn')) then do;
        corpKRI2 = 0; rbKRI1 = 0; rbKRI3 = 0;
        output;
    end;
    else if ((line = 'RetailBanking' and _model_ = 'Gamma')) then do;
        corpKRI1 = 0; cbKRI2 = 0;
        output;
    end;
run;

Note that the preceding DATA step also sets the coefficients of the unused regressors in each model to 0. This is important because PROC HPCDM uses all the regressors that it detects from the SEVERITYEST= data set for each severity model.
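The effect of a coefficient that is left nonzero can be sketched as follows. The Python code below is illustrative only and assumes the scale regression form in which the scale equals a base scale times the exponential of the sum of coefficient-times-regressor terms; the function `severity_scale` and all numeric values are hypothetical.

```python
import math

def severity_scale(base_scale, coefficients, kri_values):
    """Scale = base_scale * exp(sum of coefficient * KRI value), taken over
    every regressor present in the estimates, whether intended or not."""
    xbeta = sum(beta * kri_values[name] for name, beta in coefficients.items())
    return base_scale * math.exp(xbeta)

# Hypothetical estimates for the commercial banking severity model.
intended = {"corpKRI1": 0.4, "cbKRI2": -0.1}
leftover = dict(intended, corpKRI2=0.7)   # unused regressor left nonzero
zeroed = dict(intended, corpKRI2=0.0)     # unused regressor set to 0

scenario = {"corpKRI1": 0.5, "cbKRI2": 1.0, "corpKRI2": 2.0}

print(severity_scale(100.0, intended, scenario))
print(severity_scale(100.0, zeroed, scenario))    # same as intended
print(severity_scale(100.0, leftover, scenario))  # shifted by corpKRI2
```

A zero coefficient contributes nothing to the exponent, so the regressor can stay in the data set without influencing the scale.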

Now, you are ready to estimate the aggregate loss distribution for each line of business by submitting the following PROC HPCDM step, in which you specify the EXTERNALCOUNTS statement to request that external counts in the NumLoss variable of the DATA= data set be used for simulation of the aggregate loss:

/* Estimate the distribution of the aggregate loss for both
   lines of business by using the externally simulated counts */
proc hpcdm data=lossCounts1 seed=13579 print=all
           severityest=sevestEx2Best;
   by line;
   externalcounts count=numloss;
   severitymodel logn gamma;
run;

Each observation in the Work.LossCounts1 data set represents one replication of the external counts simulation process. For each such replication, the preceding PROC HPCDM step makes as many severity draws from the severity distribution as the value of the NumLoss variable and adds the severity values from those draws to compute one sample point of the aggregate loss. The severity distribution that is used for making the severity draws has a scale parameter value that is decided by the KRI values in the given observation and the regression parameter values that are read from the Work.SevEstEx2Best data set.
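The replication logic described above can be sketched in a few lines. This Python code is an illustration under stated assumptions, not PROC HPCDM's implementation: it assumes a lognormal severity whose log-scale parameter is shifted by the regression term, and the function `aggregate_loss_sample`, the coefficients, and the input rows are hypothetical.

```python
import random

def aggregate_loss_sample(rows, mu0, sigma, coefficients, rng):
    """Return one aggregate loss per row (replication): the sum of
    row['numloss'] lognormal severity draws whose log-scale parameter
    is shifted by the row's KRI values."""
    sample = []
    for row in rows:
        xbeta = sum(beta * row[name] for name, beta in coefficients.items())
        mu = mu0 + xbeta      # scale regression: scale = exp(mu0) * exp(xbeta)
        total = sum(rng.lognormvariate(mu, sigma)
                    for _ in range(row["numloss"]))
        sample.append(total)
    return sample

# Hypothetical severity estimates and three replications of external counts.
coefficients = {"corpKRI1": 0.3, "cbKRI2": -0.1}
rows = [
    {"corpKRI1": 0.45, "cbKRI2": -0.34, "numloss": 2},
    {"corpKRI1": 0.45, "cbKRI2": -0.34, "numloss": 1},
    {"corpKRI1": 0.45, "cbKRI2": -0.34, "numloss": 4},
]
rng = random.Random(13579)
losses = aggregate_loss_sample(rows, mu0=5.0, sigma=1.0,
                               coefficients=coefficients, rng=rng)
print(losses)   # one aggregate loss value per replication
```

Each element of `losses` corresponds to one sample point of the aggregate loss; summary statistics and percentiles are then computed from that sample.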

The summary statistics and percentiles of the aggregate loss distribution for the commercial banking business, which uses the lognormal severity model, are shown in Output 4.2.1. The Input Data Summary table indicates that each of the 9,954 observations in the BY group is treated as one replication and that there are a total of 19,241 loss events produced by all the replications together. For the scenario in the Work.SingleScenario data set, you can expect the commercial banking business to incur an average aggregate loss of 651 units, as shown in the Sample Summary Statistics table, and the chance that the loss will exceed 4,337 units is 0.5%, as shown in the Sample Percentiles table.
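The figures quoted above can be combined in a quick sanity check. Each sample point is the sum of NumLoss severity draws, so the sample mean times the sample size is the total simulated loss, and dividing that total by the total count gives the average simulated severity per loss event:

$$
\bar{N} = \frac{19241}{9954} \approx 1.933,
\qquad
\bar{X} = \frac{651.00065 \times 9954}{19241}
        \approx \frac{6{,}480{,}060}{19241}
        \approx 336.8 .
$$

In other words, a replication in this BY group contains about 1.9 loss events on average, each costing about 337 units.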

Output 4.2.1: Aggregate Loss Summary for Commercial Banking Business

The HPCDM Procedure

Input Data Summary
Name WORK.LOSSCOUNTS1
Observations 9954
Valid Observations 9954
Replications 9954
Total Count 19241

Sample Summary Statistics
Mean                   651.00065   Median                 418.36937
Standard Deviation     726.61443   Interquartile Range    653.67139
Variance                527968.5   Minimum                  8.00493
Skewness                 3.15153   Maximum                  12726.4
Kurtosis                19.08843   Sample Size                 9954

Sample Percentiles
Percentile Value
0 8.00493
1 29.52879
5 59.63848
25 188.00888
50 418.36937
75 841.68028
95 2037.3
99 3472.2
99.5 4337.0
Percentile Method = 5
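The Input Data Summary table reports 9,954 observations rather than 10,000 because the DATA step that creates Work.LossCounts1 writes an observation only when a replication draws a nonzero count within five tries. The following Python sketch mimics that retry loop. It is an illustration, not the sample program: it uses a Poisson count with a hypothetical mean for brevity, and the helper names are made up.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_nonzero_counts(rng, mean, replications, max_tries=5):
    """Per replication, redraw until a nonzero count appears or max_tries
    draws are used; replications with only zero draws are dropped."""
    kept = []
    for _ in range(replications):
        for _ in range(max_tries):
            count = poisson_draw(rng, mean)
            if count > 0:
                kept.append(count)
                break
    return kept

rng = random.Random(5151)
mean = 0.5                      # hypothetical expected count, exp(x'beta)
counts = simulate_nonzero_counts(rng, mean, replications=10000)

# Probability that all five draws are zero, so the replication is dropped.
p_drop = math.exp(-mean) ** 5
print(len(counts), round(10000 * (1 - p_drop)))
```

The number of retained replications is close to 10,000 times one minus the probability of five consecutive zero draws, which is why the observation count falls slightly short of the number of replications requested.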


For the retail banking business, which uses the gamma severity model, the Sample Percentiles table in Output 4.2.2 indicates that the median operational loss of that business is about 85 units and the chance that the loss will exceed 344 units is about 1%.

Output 4.2.2: Aggregate Loss Percentiles for Retail Banking Business

Sample Percentiles
Percentile Value
0 1.19575
1 9.88436
5 19.76335
25 48.97570
50 84.78094
75 141.38838
95 250.92488
99 343.85721
99.5 381.70522
Percentile Method = 5
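The note "Percentile Method = 5" under the percentile tables can be made concrete with a small sketch. The Python function below assumes that method 5 is the empirical distribution function with averaging, as in the PCTLDEF=5 definition used elsewhere in SAS; the function name and the sample values are hypothetical.

```python
import math

def percentile_def5(values, pct):
    """Empirical-distribution percentile with averaging (definition 5).

    With n sorted values and n*pct/100 = j + g (j integer, 0 <= g < 1):
    return the average of the j-th and (j+1)-th values when g = 0,
    otherwise the (j+1)-th value (positions are 1-based)."""
    xs = sorted(values)
    n = len(xs)
    target = n * pct / 100.0
    j = math.floor(target + 1e-9)       # tolerance for floating-point noise
    g = target - j
    if g > 1e-9:
        return xs[min(j, n - 1)]        # (j+1)-th value, 0-based index j
    lo = max(j - 1, 0)                  # j-th value, clamped at the minimum
    hi = min(j, n - 1)                  # (j+1)-th value, clamped at the maximum
    return (xs[lo] + xs[hi]) / 2.0

sample = [8.0, 30.0, 60.0, 190.0, 420.0, 840.0, 2000.0, 3500.0]
print(percentile_def5(sample, 50))    # average of the 4th and 5th values
print(percentile_def5(sample, 99.5))  # largest value
```

With only eight values, the 99.5th percentile is simply the maximum; with thousands of replications, as in this example, the extreme percentiles are determined by the few largest simulated losses.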


When you conduct the simulation and estimation for a scenario that contains only one observation, you assume that the operating environment does not change over the period of time that is being analyzed. That assumption might be valid for shorter durations and stable business environments, but often the operating environments change, especially if you are estimating the aggregate loss over a longer period of time. So you might want to include in your scenario all the possible operating environments that you expect to see during the analysis time period. Each environment is characterized by its own set of KRI values. For example, the operating conditions might change from quarter to quarter, and you might want to estimate the aggregate loss distribution for the entire year. You start the estimation process for such scenarios by creating a scenario data set. The following DATA step creates the Work.MultiConditionScenario data set, which consists of four operating environments, one for each quarter:

/* Generate a scenario data set for multiple operating conditions */
data multiConditionScenario (keep=opEnvId corpKRI1 corpKRI2
      cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3);
   array x{8} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3;
   call streaminit(5151);
   do opEnvId=1 to 4;
      do i=1 to dim(x);
         x(i) = rand('NORMAL');
      end;
      output;
   end;
run;

All four observations of the Work.MultiConditionScenario data set together form one scenario. When simulating the external counts for such multi-entity scenarios, one replication consists of the possible number of loss events that can occur as a result of each of the four operating environments. In any given replication, some operating environments might not produce any loss event or all four operating environments might produce some loss events. Assume that you use a DATA step to create the Work.LossCounts2 data set that contains, for each business line, 10,000 replications of the loss counts and that you identify each replication by using the RepId variable. You can find the DATA step code to prepare the Work.LossCounts2 data set in the PROC HPCDM sample program hcdmex02.sas.
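The structure of such a counts data set can be sketched as follows. This Python code is not the DATA step from hcdmex02.sas; it is a minimal illustration that assumes a Poisson count model with one regressor, uses a hypothetical intercept and coefficient, and borrows the corpKRI1 values of the four operating environments from Output 4.2.3.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_counts(environments, intercept, beta, replications, rng):
    """For each replication, draw one count per operating environment and
    keep only the (repid, opEnvId) rows with at least one loss event."""
    rows = []
    for repid in range(1, replications + 1):
        for env in environments:
            xbeta = intercept + sum(b * env[name] for name, b in beta.items())
            numloss = poisson_draw(rng, math.exp(xbeta))
            if numloss > 0:
                rows.append({"repid": repid, "opEnvId": env["opEnvId"],
                             "numloss": numloss})
    return rows

# corpKRI1 values of the four operating environments; the intercept and
# coefficient are hypothetical, not fitted estimates.
environments = [
    {"opEnvId": 1, "corpKRI1": 0.45224},
    {"opEnvId": 2, "corpKRI1": -0.03799},
    {"opEnvId": 3, "corpKRI1": -0.29120},
    {"opEnvId": 4, "corpKRI1": 0.87499},
]
rng = random.Random(5151)
rows = simulate_counts(environments, intercept=0.5, beta={"corpKRI1": 0.3},
                       replications=3, rng=rng)
for row in rows:
    print(row)
```

Each replication contributes between zero and four rows, one per operating environment that produced a loss event, and the `repid` value ties those rows together.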

Output 4.2.3 shows some observations of the Work.LossCounts2 data set for each business line. For the first replication (RepId=1) of the commercial banking business, only operating environment 3 incurs two loss events, whereas the other environments incur no loss events. For the second replication (RepId=2), all operating environments incur at least one loss event. For the first replication (RepId=1) of the retail banking business, operating environments 2, 3, and 4 incur four, one, and four loss events, respectively.

Output 4.2.3: Snapshot of the External Counts Data with Replication Identifier

line               opEnvId  corpKRI1  corpKRI2    cbKRI2    rbKRI1    rbKRI3  repid  numloss
CommercialBanking        3  -0.29120  -0.45239   0.98855  -0.37208  -1.51534      1        2
CommercialBanking        1   0.45224   0.40661  -0.33680  -1.08692  -2.20557      2        1
CommercialBanking        2  -0.03799   0.98670  -0.03752   1.94589   1.22456      2        3
CommercialBanking        3  -0.29120  -0.45239   0.98855  -0.37208  -1.51534      2        9
CommercialBanking        4   0.87499  -0.67812  -0.04839  -1.44881   0.78221      2        8
CommercialBanking        1   0.45224   0.40661  -0.33680  -1.08692  -2.20557      3        5
CommercialBanking        2  -0.03799   0.98670  -0.03752   1.94589   1.22456      3        1
CommercialBanking        3  -0.29120  -0.45239   0.98855  -0.37208  -1.51534      3        2
RetailBanking            2  -0.03799   0.98670  -0.03752   1.94589   1.22456      1        4
RetailBanking            3  -0.29120  -0.45239   0.98855  -0.37208  -1.51534      1        1
RetailBanking            4   0.87499  -0.67812  -0.04839  -1.44881   0.78221      1        4
RetailBanking            1   0.45224   0.40661  -0.33680  -1.08692  -2.20557      2        2
RetailBanking            2  -0.03799   0.98670  -0.03752   1.94589   1.22456      2        5
RetailBanking            4   0.87499  -0.67812  -0.04839  -1.44881   0.78221      2        3
RetailBanking            1   0.45224   0.40661  -0.33680  -1.08692  -2.20557      3        2
RetailBanking            2  -0.03799   0.98670  -0.03752   1.94589   1.22456      3        3


You can now use this simulated count data to estimate the distribution of the aggregate loss that is incurred in all four operating environments by submitting the following PROC HPCDM step, in which you specify the replication identifier variable RepId in the ID= option of the EXTERNALCOUNTS statement:

/* Estimate the distribution of the aggregate loss for both
   lines of business by using the externally simulated counts
   for the multiple operating environments */
proc hpcdm data=lossCounts2 seed=13579 print=all
           severityest=sevestEx2Best plots=density;
   by line;
   distby repid;
   externalcounts count=numloss id=repid;
   severitymodel logn gamma;
run;

Note that when you specify the ID= variable in the EXTERNALCOUNTS statement, you must also specify that variable in the DISTBY statement. Within each BY group, for each value of the RepId variable, one point of the aggregate loss sample is simulated by using the process that is described in the section Simulation with External Counts.
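The grouping by replication identifier can be sketched as follows. The Python code is illustrative only: the function `aggregate_by_replication` and the lognormal severity parameters are hypothetical, and the input rows reuse the counts of the first two commercial banking replications from Output 4.2.3.

```python
from collections import defaultdict
import random

def aggregate_by_replication(rows, draw_severity):
    """Sum severity draws over all rows that share a repid, so that each
    replication yields exactly one aggregate loss value."""
    totals = defaultdict(float)
    for row in rows:
        for _ in range(row["numloss"]):
            totals[row["repid"]] += draw_severity(row)
    return dict(totals)

rng = random.Random(13579)

def draw_severity(row):
    # Hypothetical lognormal severity whose log-scale depends on one KRI.
    return rng.lognormvariate(5.0 + 0.3 * row["corpKRI1"], 1.0)

# Counts of the first two commercial banking replications.
rows = [
    {"repid": 1, "opEnvId": 3, "corpKRI1": -0.29120, "numloss": 2},
    {"repid": 2, "opEnvId": 1, "corpKRI1": 0.45224, "numloss": 1},
    {"repid": 2, "opEnvId": 2, "corpKRI1": -0.03799, "numloss": 3},
    {"repid": 2, "opEnvId": 3, "corpKRI1": -0.29120, "numloss": 9},
    {"repid": 2, "opEnvId": 4, "corpKRI1": 0.87499, "numloss": 8},
]
losses = aggregate_by_replication(rows, draw_severity)
print(losses)   # one aggregate loss per repid
```

Replication 1 sums two severity draws from a single environment, whereas replication 2 sums 21 draws spread over all four environments, each drawn with that environment's KRI values.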

The summary statistics and percentiles of the distribution of the aggregate loss, which is the aggregate of the losses across all four operating environments, are shown in Output 4.2.4 for the commercial banking business. The Input Data Summary table indicates that there are 10,000 replications in the BY group and that a total of 98,480 loss events are generated across all replications. The Sample Percentiles table indicates that you can expect a median aggregate loss of 3,075 units and a worst-case loss, as defined by the 99.5th percentile, of 13,150 units from the commercial banking business when you combine losses that result from all four operating environments.

Output 4.2.4: Aggregate Loss Summary for the Commercial Banking Business in Multiple Operating Environments

The HPCDM Procedure

Input Data Summary
Name WORK.LOSSCOUNTS2
Observations 32526
Valid Observations 32526
Replications 10000
Total Count 98480

Sample Percentiles
Percentile Value
1 342.53328
5 792.48797
25 1876.8
50 3075.2
75 4694.6
95 8058.6
99 11575.4
99.5 13149.7
Percentile Method = 5


The probability density functions of the aggregate loss for the commercial and retail banking businesses are shown in Output 4.2.5. In addition to the difference in scales of the losses in the two businesses, you can see that the aggregate loss that is incurred in the commercial banking business has a heavier right tail than the aggregate loss that is incurred in the retail banking business.

Output 4.2.5: Density Plots of the Aggregate Losses for Commercial Banking (left) and Retail Banking (right) Businesses
