SAS/STAT Examples

Estimating Geometric Means Using Data from a Complex Survey Sampling Design



Overview

Geometric means are widely used in a variety of scientific disciplines. They are the natural parameter of interest for a lognormal random variable because a ratio of lognormal random variables has a known lognormal distribution, and the geometric mean of a lognormal ratio is equal to the ratio of the individual geometric means. Some common uses for the geometric mean with survey data include estimating average population growth rates, bacterial contamination rates, and chemical concentration rates.

If you use SAS/STAT® 12.1 or later, you can estimate geometric means by specifying the ALLGEO statistic keyword in the PROC SURVEYMEANS statement. If you use a release prior to SAS/STAT 12.1, none of the SAS/STAT survey procedures directly compute geometric means. However, with a little programming you can still estimate a geometric mean and its variance from sample survey data.

Analysis

Following Wolter (1985), suppose $\bar{Y}$ denotes the population mean of a characteristic $y$ and $\bar{y}$ denotes an estimator of $\bar{Y}$ based on a sample of fixed size $n$. The natural estimator for the exponential function $e^{\bar{Y}}$ is

$$e^{\bar{y}}$$

Suppose $v(\bar{y})$ denotes an estimator of the variance of $\bar{y}$ that is appropriate to the particular sampling design. Then, the Taylor series estimator of variance is

$$v\left(e^{\bar{y}}\right) = e^{2\bar{y}} \, v(\bar{y})$$

These results can be applied directly to the problem of estimating the geometric mean of a finite population characteristic because the geometric mean is the exponentiation of the mean of the natural logarithm. That is, the geometric mean can be expressed as a finite population quantity by

$$\theta = \left( \prod_{i=1}^{N} Y_i \right)^{1/N} = \exp\left( \frac{1}{N} \sum_{i=1}^{N} X_i \right)$$

where $X_i = \ln(Y_i)$ for $i = 1, \ldots, N$. Substituting $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$, then

$$\theta = e^{\bar{X}}$$

and the natural estimator of $\theta$ is

$$\hat{\theta} = e^{\bar{x}}$$

You can estimate the variance of $\hat{\theta}$ by

$$v(\hat{\theta}) = \hat{\theta}^{2} \, v(\bar{x})$$

where the estimator $v(\bar{x})$ is appropriate to the particular sampling design and estimator and is based on the variable $x = \ln(y)$. You can use PROC SURVEYMEANS to estimate the arithmetic mean of $x$; then using the result, you can compute the geometric mean of $y$. Similarly, you can use the variance of the arithmetic mean of $x$ and the computed estimate of the geometric mean of $y$ to compute the variance and standard error of the geometric mean of $y$.
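
The variance result can be motivated by a first-order Taylor (delta method) expansion of the exponential function. The lines below are a sketch of that reasoning, not a quotation from Wolter (1985). Here $\bar{x}$ is the estimated mean of the log-transformed variable, $\bar{X}$ is the corresponding population mean, and $v(\bar{x})$ is the design-based variance estimate of $\bar{x}$.

```latex
\begin{aligned}
e^{\bar{x}} &\approx e^{\bar{X}} + e^{\bar{X}} \, (\bar{x} - \bar{X})
  && \text{(linearize around } \bar{X}\text{)} \\
\operatorname{Var}\!\left(e^{\bar{x}}\right) &\approx e^{2\bar{X}} \, \operatorname{Var}(\bar{x})
  && \text{(variance of the linear term)} \\
v\!\left(e^{\bar{x}}\right) &= e^{2\bar{x}} \, v(\bar{x})
  && \text{(plug in the estimates)} \\
\operatorname{se}\!\left(e^{\bar{x}}\right) &= e^{\bar{x}} \sqrt{v(\bar{x})}
  && \text{(standard error)}
\end{aligned}
```

In words, the standard error of the geometric mean is the geometric mean multiplied by the standard error of the mean of the logs.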

Example

This example uses hypothetical data that represents survey results from an inspection of meat processing plants. The sampling plan is designed to estimate the levels of the Clostridium perfringens bacteria in processed ground beef. The survey has a stratified, two-stage sampling design and includes the variables PLANT, SHIFT, CLOSTRIDIUM, and WGT. PLANT identifies the strata, SHIFT identifies the clusters, WGT contains the sampling weights, and CLOSTRIDIUM measures the levels of bacteria per gram of product.

data example;
   input plant shift clostridium wgt;
   datalines;
 1     1       60    10.3676
 1     1      129    11.4145
 1     1        5    10.3055
 1     2      159    10.3626
 1     2       38    11.1399
 1     2       72    10.7285
 2     1     6335    11.2469
 2     1      605    10.7137
 2     1        1    11.4907

   ... more lines ...

20     2     1017    10.7235
20     2       23    11.1932
20     2        3    11.1999
20     2        1    11.6239
;
run;

After you put your data in a SAS data set, generate a variable that contains the natural logarithm of the variable CLOSTRIDIUM:

data example;
   set example;
   logClostridium = log(clostridium);
run;
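
The natural logarithm is defined only for positive values. The following DATA step is a minimal sketch of one way to flag records for which the transformation is undefined, assuming your own data might contain zeros or negative values (the rows displayed in this example are all positive). The data set name EXAMPLE_CHECKED and the variable NONPOSITIVE are hypothetical names introduced for illustration.

```sas
/* Sketch: guard the log transformation against nonpositive values. */
/* EXAMPLE_CHECKED and NONPOSITIVE are hypothetical names.          */
data example_checked;
   set example;
   /* 1 when the value is present but the log is undefined, else 0 */
   nonpositive = (not missing(clostridium) and clostridium <= 0);
   /* Take the log only for positive values; otherwise set missing */
   if clostridium > 0 then logClostridium = log(clostridium);
   else logClostridium = .;
run;

/* Count how many records were flagged */
proc freq data=example_checked;
   tables nonpositive;
run;
```

How flagged records should be handled is a subject-matter decision that this example does not address.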

Use PROC SURVEYMEANS to estimate the arithmetic mean of the newly generated variable LOGCLOSTRIDIUM. The WEIGHT statement specifies that the variable WGT contains the sampling weights. The STRATA statement specifies that the variable PLANT identifies the strata. The CLUSTER statement specifies that the variable SHIFT identifies the clusters. The VAR statement specifies the variable LOGCLOSTRIDIUM as the variable whose mean you want to estimate. An ODS OUTPUT statement generates a data set that contains the estimation results.

proc surveymeans data=example;
   weight wgt;
   strata plant;
   cluster shift;
   var logClostridium;
   ods output statistics = estimates;
run;

Figure 1 PROC SURVEYMEANS Results

The SURVEYMEANS Procedure

Data Summary
Number of Strata                 20
Number of Clusters               40
Number of Observations          168
Sum of Weights            1854.5865

Statistics
Variable           N       Mean   Std Error of Mean   95% CL for Mean
logClostridium   168   3.277231            0.176325   2.90942422   3.64503856

Table 1 shows the contents of the ODS output data set ESTIMATES.

Table 1 ODS Output Data Set ESTIMATES

Variable Name   Contents
VarName         Variable name
N               Number of observations
Mean            Estimated mean
StdErr          Standard error
LowerCLMean     Lower confidence limit
UpperCLMean     Upper confidence limit

Use the following DATA step to perform the required transformations on the estimates in the ESTIMATES output data set. The first assignment statement replaces the contents of the variable MEAN with its exponentiated value; this transforms the estimated arithmetic mean of the variable LOGCLOSTRIDIUM into the geometric mean of the variable CLOSTRIDIUM. The next assignment statement transforms the standard error of the arithmetic mean of LOGCLOSTRIDIUM into the standard error of the geometric mean of CLOSTRIDIUM. The next two assignment statements replace the lower and upper confidence limits of the arithmetic mean of LOGCLOSTRIDIUM with their exponentiated values; this transforms the confidence limits for the arithmetic mean of LOGCLOSTRIDIUM into confidence limits for the geometric mean of CLOSTRIDIUM. The last assignment statement changes the contents of VARNAME from LOGCLOSTRIDIUM to CLOSTRIDIUM. The LABEL statement relabels the variables Mean, StdErr, LowerCLMean, and UpperCLMean.

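data estimates;
   set estimates;
   /* Exponentiate the log-scale mean to obtain the geometric mean */
   Mean = exp(Mean);
   /* Mean now holds the geometric mean, so this equals Mean*StdErr, */
   /* the Taylor series standard error of the geometric mean         */
   StdErr  = sqrt((Mean**2)*(StdErr**2));
   /* Exponentiate the log-scale confidence limits */
   LowerCLMean = exp(LowerCLMean);
   UpperCLMean = exp(UpperCLMean);
   VarName = 'Clostridium';
   label Mean='Geometric Mean'
         StdErr='Standard Error'
         LowerCLMean='Lower 95% Confidence Limit'
         UpperCLMean='Upper 95% Confidence Limit';
run;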

Finally, print the transformed output data set. The output from the PRINT procedure in Figure 2 displays the variable name, the estimate of the geometric mean, the standard error of the estimated geometric mean, and the lower and upper 95% confidence limits.

proc print data=estimates label noobs;
run;

Figure 2 Geometric Mean Estimation

Variable Name     N   Geometric Mean   Standard Error   Lower 95% Confidence Limit   Upper 95% Confidence Limit
Clostridium     168        26.502297         4.673013                   18.3462321                   38.2842492

The estimated geometric mean for Clostridium perfringens bacteria per gram of product is 26.50, the standard error of the estimate is 4.67, and a 95% confidence interval for the geometric mean is (18.35, 38.28).
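
As a quick check on the transformation, the following DATA step is a sketch that recomputes the geometric mean, its standard error, and the two-sided 95% confidence limits from the log-scale estimates displayed in Figure 1. It assumes that the degrees of freedom equal the number of clusters minus the number of strata (40 - 20 = 20); the variable names are hypothetical and chosen for illustration.

```sas
/* Sketch: recompute the geometric mean results from Figure 1.     */
/* Assumes df = number of clusters - number of strata = 40 - 20.   */
data _null_;
   logMean = 3.277231;          /* mean of logClostridium (Figure 1) */
   logSE   = 0.176325;          /* its standard error (Figure 1)     */
   df      = 40 - 20;           /* assumed degrees of freedom        */
   t       = tinv(0.975, df);   /* two-sided 95% t critical value    */
   geoMean = exp(logMean);      /* geometric mean                    */
   geoSE   = geoMean * logSE;   /* Taylor series standard error      */
   lower   = exp(logMean - t*logSE);
   upper   = exp(logMean + t*logSE);
   put geoMean= geoSE= lower= upper=;   /* write values to the log   */
run;
```

The values written to the SAS log should be close to the geometric mean, standard error, and confidence limits reported for CLOSTRIDIUM, with small differences due to rounding of the inputs.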

If you are using SAS/STAT 12.1 or later, you can estimate the geometric mean of CLOSTRIDIUM by simply specifying the ALLGEO statistic keyword in the PROC SURVEYMEANS statement; this requests all available statistics associated with geometric means.

proc surveymeans data=example allgeo;
   weight wgt;
   strata plant;
   cluster shift;
   var Clostridium;
run;

Figure 3 PROC SURVEYMEANS Results

The SURVEYMEANS Procedure

Data Summary
Number of Strata                 20
Number of Clusters               40
Number of Observations          168
Sum of Weights            1854.5865

Geometric Means
Variable      Geometric Mean   Std Error   95% CL for Mean           Lower 95% One-Sided CL for Mean   Upper 95% One-Sided CL for Mean
clostridium        26.502297    4.673013   18.3462321   38.2842492                         19.552843                         35.921718

The estimated geometric mean for Clostridium perfringens bacteria per gram of product is 26.50, the standard error of the estimate is 4.67, a two-sided 95% confidence interval for the geometric mean is (18.35, 38.28), and the lower and upper one-sided 95% confidence limits are 19.55 and 35.92, respectively.
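
The one-sided limits appear to be consistent with exponentiating one-sided limits computed on the log scale. The following worked arithmetic is a sketch under the assumption that the degrees of freedom are 20 (number of clusters minus number of strata), so that the 95th percentile of the t distribution is approximately 1.7247. The inputs are the log-scale mean and standard error, 3.277231 and 0.176325.

```latex
\begin{aligned}
t_{0.95,\,20} \times 0.176325 &\approx 1.7247 \times 0.176325 \approx 0.3041 \\
\text{lower one-sided limit} &\approx \exp(3.277231 - 0.3041) = \exp(2.9731) \approx 19.55 \\
\text{upper one-sided limit} &\approx \exp(3.277231 + 0.3041) = \exp(3.5813) \approx 35.92
\end{aligned}
```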

References

Wolter, K. M. (1985), Introduction to Variance Estimation, New York: Springer-Verlag.