The PARETO Procedure

Example 15.12 Managing a Large Number of Categories

Note: See Managing a Large Number of Categories in the SAS/QC Sample Library.

The Centers for Disease Control and Prevention (CDC) publishes a variety of public health statistics. The SAS data set CancerDeaths2010 records the number of deaths in the United States in 2010 that were caused by various types of cancer:

data CancerDeaths2010;
   length Type $ 45;
   input Type & @47 Deaths comma7.;
   datalines;
Lip, oral cavity and pharynx                    8,474
Esophagus                                      14,490
Stomach                                        11,390
Colon, rectum and anus                         52,622
Liver and intrahepatic bile ducts              20,305
Pancreas                                       36,888
Larynx                                          3,691
Trachea, bronchus and lung                    158,318
Skin                                            9,154
Breast                                         41,435
Cervix                                          3,939
Uterus                                          8,402
Ovary                                          14,572
Prostate                                       28,561
Kidney and renal pelvis                        13,219
Bladder                                        14,731
Meninges, brain, other central nervous system  14,164
Hodgkin's disease                               1,231
Non-Hodgkin's lymphoma                         20,294
Leukemia                                       22,569
Multiple myeloma and immunoproliferative       11,428
Other lymphoid, hematopoietic and related          68
All other and unspecified                      64,798
;

The following statements produce a Pareto chart for the data in CancerDeaths2010:

proc pareto data=CancerDeaths2010;
   vbar Type / freq = Deaths;
run;

The resulting Pareto chart is shown in Output 15.12.1.

Output 15.12.1: Cancer Deaths Pareto Chart with Default Width


Note that PROC PARETO has labeled the category axis tick marks with numbers and produced a corresponding category legend. This is done by default when there is not enough room to use category names as tick labels on the category axis. Unfortunately, because some of the category names are long, the legend has room for only one column of entries and therefore occupies an inordinate amount of space. Among the alternatives for addressing this problem are the following:

  • replacing the original category names with shorter ones

  • increasing the space available for the graph
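The first alternative can be sketched with a DATA step that copies the data and abbreviates the longest names. The data set name CancerDeathsShort, the variable ShortType, and the abbreviated names are all hypothetical choices made for illustration; none of them comes from the source data.

```sas
/* Sketch: abbreviate the longest category names (hypothetical names). */
data CancerDeathsShort;
   set CancerDeaths2010;
   length ShortType $ 45;   /* same length as Type, so nothing is truncated */
   select (Type);
      when ('Meninges, brain, other central nervous system')
         ShortType = 'Brain and CNS';
      when ('Multiple myeloma and immunoproliferative')
         ShortType = 'Multiple myeloma';
      when ('Other lymphoid, hematopoietic and related')
         ShortType = 'Other lymphoid';
      otherwise ShortType = Type;   /* keep all other names unchanged */
   end;
run;

proc pareto data=CancerDeathsShort;
   vbar ShortType / freq = Deaths;
run;
```

Whether the shortened names fit on the category axis still depends on the graph width, so this approach might need to be combined with a wider graph.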

You can implement the second alternative by specifying the WIDTH= option in the ODS GRAPHICS statement prior to invoking the procedure. (The ODS GRAPHICS statement is documented in the SAS Output Delivery System: User's Guide.) Output 15.12.2 shows the Pareto chart that is produced after the graph width is increased.
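A minimal sketch of the second alternative follows. The width of 800 pixels is borrowed from the statements that appear later in this example; the width that was actually used to produce Output 15.12.2 is not stated, so treat the value as an assumption.

```sas
/* Widen the ODS Graphics output before invoking the procedure. */
/* 800px is an assumed value; adjust it to suit your output.    */
ods graphics / width=800px;

proc pareto data=CancerDeaths2010;
   vbar Type / freq = Deaths;
run;
```

The WIDTH= setting remains in effect for subsequent graphs until you change or reset it.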

Output 15.12.2: Cancer Deaths Pareto Chart with Increased Width


In a standard Pareto chart, the cumulative percentage curve is anchored at the top of the first category bar. In Output 15.12.2 PROC PARETO has automatically relaxed that rule to avoid excessive compression of the bars. You can use the FREQAXIS= option to specify that the frequency axis extend to 100%, which restores the anchoring of the curve. (For more information about scaling the frequency and cumulative percentage axes, see the section Scaling the Cumulative Percentage Curve.)

Note also in Output 15.12.2 that the category All other and unspecified has the second highest frequency. To better indicate the specific types of cancer responsible for the most deaths, you can use the LAST= option to display the All other and unspecified category last.

The following statements incorporate these changes and add other enhancements to the chart:

ods graphics / width=800px;
title 'U.S. Cancer Deaths in 2010 by Type';
proc pareto data=CancerDeaths2010;
   vbar Type / freq        = Deaths
               barlabel    = value
               last        = 'All other and unspecified'
               nocatlabel
               catleglabel = 'Cancer Type'
               freqaxis    = 0 to 100 by 10
               nlegend     = 'Total Cancer Deaths'
               odstitle    = title
               out         = CSummary;
run;

The BARLABEL= option labels each bar with its value in frequency axis units, which in this case is the percentage of cancer deaths that were caused by that type of cancer. The NOCATLABEL option saves some space by eliminating the category axis label, and the CATLEGLABEL= option produces a more informative label for the category legend. The NLEGEND= option displays the total sample size with an appropriate label. The ODSTITLE= option replaces the default graph title with the one specified in the TITLE statement. The OUT= option saves a summary of the Pareto chart in the data set CSummary.

The improved Pareto chart is shown in Output 15.12.3, and a listing of CSummary is shown in Output 15.12.4.

Output 15.12.3: Improved Pareto Chart of 2010 Cancer Deaths


Output 15.12.4: CSummary Data Set

U.S. Cancer Deaths in 2010 by Type

Obs  Type                                           _COUNT_    _PCT_  _CMPCT_
  1  Trachea, bronchus and lung                      158318  27.5459   27.546
  2  Colon, rectum and anus                           52622   9.1557   36.702
  3  Breast                                           41435   7.2093   43.911
  4  Pancreas                                         36888   6.4182   50.329
  5  Prostate                                         28561   4.9694   55.298
  6  Leukemia                                         22569   3.9268   59.225
  7  Liver and intrahepatic bile ducts                20305   3.5329   62.758
  8  Non-Hodgkin's lymphoma                           20294   3.5310   66.289
  9  Bladder                                          14731   2.5631   68.852
 10  Ovary                                            14572   2.5354   71.388
 11  Esophagus                                        14490   2.5211   73.909
 12  Meninges, brain, other central nervous system    14164   2.4644   76.373
 13  Kidney and renal pelvis                          13219   2.3000   78.673
 14  Multiple myeloma and immunoproliferative         11428   1.9884   80.661
 15  Stomach                                          11390   1.9818   82.643
 16  Skin                                              9154   1.5927   84.236
 17  Lip, oral cavity and pharynx                      8474   1.4744   85.710
 18  Uterus                                            8402   1.4619   87.172
 19  Cervix                                            3939   0.6853   87.858
 20  Larynx                                            3691   0.6422   88.500
 21  Hodgkin's disease                                 1231   0.2142   88.714
 22  Other lymphoid, hematopoietic and related           68   0.0118   88.726
 23  All other and unspecified                        64798  11.2743  100.000


The Pareto chart in Output 15.12.3 has 23 categories, some of which account for only a small percentage of the total deaths. Often, only the few categories that have the highest frequencies are of interest. The PARETO procedure provides options for limiting the number of categories that are displayed on a chart. For an example of restricting the number of categories by using the MAXNCAT= and OTHER= options, see the section Restricting the Number of Pareto Categories.
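As a rough sketch of that approach, the following statements limit the chart by using MAXNCAT= and merge the categories that are not selected into a new category by using OTHER=. The cutoff of 10 and the label 'Remaining types' are illustrative assumptions, not values taken from the referenced section.

```sas
/* Sketch: restrict the chart to the highest-frequency categories.   */
/* The cutoff (10) and the merged-category label are assumed values. */
proc pareto data=CancerDeaths2010;
   vbar Type / freq    = Deaths
               maxncat = 10
               other   = 'Remaining types';
run;
```

Because All other and unspecified is itself one of the highest-frequency categories in these data, it would remain a separate bar alongside the new merged category.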

The original CancerDeaths2010 data set appears to have been summarized in advance, with the All other and unspecified category containing the total count for unspecified cancers plus those types that account for fewer deaths than the 22 distinct types that are shown in Output 15.12.3. The All other and unspecified category has the second highest frequency, accounting for 11.3% of all cancer deaths.

The chart statement options that limit the number of categories to be displayed omit or merge low-frequency categories. In this case, it is more useful to merge the low-frequency categories into the existing All other and unspecified category. The following DATA step merges each type that accounts for less than 2% of cancer deaths into the All other and unspecified category:

data CSummary;
   set CSummary;
   if _PCT_ < 2.0 then Type='All other and unspecified';
run;

The modified CSummary data set is shown in Output 15.12.5.

Output 15.12.5: Modified CSummary Data Set

U.S. Cancer Deaths in 2010 by Type

Obs  Type                                           _COUNT_    _PCT_  _CMPCT_
  1  Trachea, bronchus and lung                      158318  27.5459   27.546
  2  Colon, rectum and anus                           52622   9.1557   36.702
  3  Breast                                           41435   7.2093   43.911
  4  Pancreas                                         36888   6.4182   50.329
  5  Prostate                                         28561   4.9694   55.298
  6  Leukemia                                         22569   3.9268   59.225
  7  Liver and intrahepatic bile ducts                20305   3.5329   62.758
  8  Non-Hodgkin's lymphoma                           20294   3.5310   66.289
  9  Bladder                                          14731   2.5631   68.852
 10  Ovary                                            14572   2.5354   71.388
 11  Esophagus                                        14490   2.5211   73.909
 12  Meninges, brain, other central nervous system    14164   2.4644   76.373
 13  Kidney and renal pelvis                          13219   2.3000   78.673
 14  All other and unspecified                        11428   1.9884   80.661
 15  All other and unspecified                        11390   1.9818   82.643
 16  All other and unspecified                         9154   1.5927   84.236
 17  All other and unspecified                         8474   1.4744   85.710
 18  All other and unspecified                         8402   1.4619   87.172
 19  All other and unspecified                         3939   0.6853   87.858
 20  All other and unspecified                         3691   0.6422   88.500
 21  All other and unspecified                         1231   0.2142   88.714
 22  All other and unspecified                           68   0.0118   88.726
 23  All other and unspecified                        64798  11.2743  100.000


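Although CSummary contains frequency data, it now has multiple observations that share the same category value. This is acceptable: PROC PARETO adds the FREQ= values of observations that have the same category value, so the ten All other and unspecified observations are combined into a single bar. The following statements create a Pareto chart from the modified CSummary data set: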

proc pareto data=CSummary;
   vbar Type / freq        = _COUNT_
               last        = 'All other and unspecified'
               barlabel    = value
               nocatlabel
               catleglabel = 'Cancer Type'
               freqaxis    = 0 to 100 by 10
               odstitle    = title;
   inset n='Total Cancer Deaths:'(comma7.) / noframe;
run;

Note that the sample size legend in Output 15.12.3 displays the sample size as an unformatted integer. By using an INSET statement instead of the NLEGEND= option, you can specify a format for the sample size. (For a complete description of the INSET statement, see the section INSET Statement.) The resulting chart is shown in Output 15.12.6.

Output 15.12.6: Cancer Deaths Pareto Chart with Fewer Categories


Output 15.12.6 shows that 21.3% of cancer deaths are now assigned to the All other and unspecified category and that the bar percentages sum to 100%.
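The 21.3% figure can be cross-checked outside PROC PARETO. The following sketch, which is an addition and not part of the original example, uses PROC FREQ with a WEIGHT statement to total _COUNT_ for each value of Type in the modified CSummary data set.

```sas
/* Cross-check: weighted frequency and percent for each cancer type. */
proc freq data=CSummary order=freq;
   tables Type / nocum;   /* omit the cumulative columns        */
   weight _COUNT_;        /* treat _COUNT_ as a frequency count */
run;
```

From the counts listed in Output 15.12.5, the ten observations that are labeled All other and unspecified total 122,575 of 574,743 deaths, or about 21.3%, which agrees with the chart.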