# The SURVEYSELECT Procedure

### Example 115.3 PPS (Dollar-Unit) Sampling

A small company wants to audit employee travel expenses in an effort to improve the expense reporting procedure and possibly reduce expenses. The company does not have resources to examine all expense reports and wants to use statistical sampling to objectively select expense reports for audit.

The data set `TravelExpense` contains the dollar amount of all employee travel expense transactions during the past month:

```data TravelExpense;
input ID\$ Amount @@;
if (Amount < 500) then Level='1_Low ';
else if (Amount > 1500) then Level='3_High';
else Level='2_Avg ';
datalines;
110  237.18   002  567.89   234  118.50
743   74.38   411 1287.23   782  258.10
216  325.36   174  218.38   568 1670.80
302  134.71   285 2020.70   314   47.80
139 1183.45   775  330.54   425  780.10
506  895.80   239  620.10   011  420.18
672  979.66   142  810.25   738  670.85
192  314.58   243   87.50   263 1893.40
496  753.30   332  540.65   486 2580.35
614  230.56   654  185.60   308  688.43
784  505.14   017  205.48   162  650.42
289 1348.34   691   30.50   545 2214.80
517  940.35   382  217.85   024  142.90
478  806.90   107  560.72
;
```

In the SAS data set `TravelExpense`, the variable `ID` identifies the travel expense report. The variable `Amount` contains the dollar amount of the reported expense. The variable `Level` equals '1_Low', '2_Avg', or '3_High', depending on the value of `Amount`.

In the sample design for this audit, expense reports are stratified by `Level`. This ensures that each of these expense levels is included in the sample and also permits a disproportionate allocation of the sample, selecting proportionately more of the expense reports from the higher levels. Within strata, the sample of expense reports is selected with probability proportional to the amount of the expense, thus giving a greater chance of selection to larger expenses. In auditing terms, this is known as monetary-unit sampling. For more information, see Wilburn (1984).

PROC SURVEYSELECT requires that the input data set be sorted by the STRATA variables. The following PROC SORT statements sort the `TravelExpense` data set by the stratification variable `Level`.

```proc sort data=TravelExpense;
by Level;
run;
```

Output 115.3.1 displays the sampling frame data set `TravelExpense`, which contains 41 observations.

Output 115.3.1: Sampling Frame

 Travel Expense Audit

Obs ID Amount Level
1 110 237.18 1_Low
2 234 118.50 1_Low
3 743 74.38 1_Low
4 782 258.10 1_Low
5 216 325.36 1_Low
6 174 218.38 1_Low
7 302 134.71 1_Low
8 314 47.80 1_Low
9 775 330.54 1_Low
10 011 420.18 1_Low
11 192 314.58 1_Low
12 243 87.50 1_Low
13 614 230.56 1_Low
14 654 185.60 1_Low
15 017 205.48 1_Low
16 691 30.50 1_Low
17 382 217.85 1_Low
18 024 142.90 1_Low
19 002 567.89 2_Avg
20 411 1287.23 2_Avg
21 139 1183.45 2_Avg
22 425 780.10 2_Avg
23 506 895.80 2_Avg
24 239 620.10 2_Avg
25 672 979.66 2_Avg
26 142 810.25 2_Avg
27 738 670.85 2_Avg
28 496 753.30 2_Avg
29 332 540.65 2_Avg
30 308 688.43 2_Avg
31 784 505.14 2_Avg
32 162 650.42 2_Avg
33 289 1348.34 2_Avg
34 517 940.35 2_Avg
35 478 806.90 2_Avg
36 107 560.72 2_Avg
37 568 1670.80 3_High
38 285 2020.70 3_High
39 263 1893.40 3_High
40 486 2580.35 3_High
41 545 2214.80 3_High

The following PROC SURVEYSELECT statements select a probability sample of expense reports from the `TravelExpense` data set by using the stratified design with PPS selection within strata:

```title1 'Travel Expense Audit';
title2 'Stratified PPS (Dollar-Unit) Sampling';
proc surveyselect data=TravelExpense method=pps n=(6 10 4)
seed=47279 out=AuditSample;
size Amount;
strata Level;
run;
```

The STRATA statement names the stratification variable `Level`. The SIZE statement specifies the size measure variable `Amount`. In the PROC SURVEYSELECT statement, the METHOD=PPS option requests sample selection with probability proportional to size and without replacement. The N=(6 10 4) option specifies the stratum sample sizes by listing the sample sizes in the same order as the strata appear in the `TravelExpense` data set. The sample size of 6 corresponds to the first stratum (`Level` = '1_Low'); the sample size of 10 corresponds to the second stratum (`Level` = '2_Avg'); and the sample size of 4 corresponds to the last stratum (`Level` = '3_High'). The SEED= option specifies 47279 as the initial seed for random number generation.

Output 115.3.2 displays the output from PROC SURVEYSELECT. A total of 20 expense reports are selected for audit. The data set `AuditSample` contains the sample of travel expense reports.

Output 115.3.2: Sample Selection Summary

 Travel Expense Audit Stratified PPS (Dollar-Unit) Sampling

The SURVEYSELECT Procedure

Selection Method PPS, Without Replacement Amount Level

Input Data Set TRAVELEXPENSE 47279 3 20 AUDITSAMPLE

The following PROC PRINT statements display the audit sample, which is shown in Output 115.3.3:

```title1 'Travel Expense Audit';
title2 'Sample Selected by Stratified PPS Design';
proc print data=AuditSample;
run;
```

Output 115.3.3: Audit Sample

 Travel Expense Audit Sample Selected by Stratified PPS Design

Obs Level ID Amount SelectionProb SamplingWeight
1 1_Low 024 142.90 0.23949 4.17553
2 1_Low 614 230.56 0.38640 2.58797
3 1_Low 110 237.18 0.39750 2.51574
4 1_Low 782 258.10 0.43256 2.31183
5 1_Low 192 314.58 0.52721 1.89676
6 1_Low 216 325.36 0.54528 1.83392
7 2_Avg 239 620.10 0.42503 2.35278
8 2_Avg 308 688.43 0.47186 2.11925
9 2_Avg 496 753.30 0.51633 1.93676
10 2_Avg 478 806.90 0.55307 1.80810
11 2_Avg 142 810.25 0.55536 1.80063
12 2_Avg 517 940.35 0.64454 1.55151
13 2_Avg 672 979.66 0.67148 1.48925
14 2_Avg 139 1183.45 0.81116 1.23280
15 2_Avg 411 1287.23 0.88229 1.13341
16 2_Avg 289 1348.34 0.92418 1.08204
17 3_High 568 1670.80 0.64385 1.55316
18 3_High 263 1893.40 0.72963 1.37056
19 3_High 545 2214.80 0.85348 1.17167
20 3_High 486 2580.35 0.99435 1.00568