Contents: |
Purpose / History / Requirements / Usage / Details / References |

*PURPOSE:*-
The %JACK macro does jackknife analyses for simple random samples,
computing approximate standard errors, bias-corrected estimates, and
confidence intervals assuming a normal sampling distribution.
The %BOOT macro does elementary nonparametric bootstrap analyses for simple random samples, computing approximate standard errors, bias-corrected estimates, and confidence intervals assuming a normal sampling distribution. Also, for regression models, the %BOOT macro can resample either observations or residuals.

The %BOOTCI macro computes several varieties of confidence intervals that are suitable for sampling distributions that are not normal.

*REQUIREMENTS:*- The macros require only Base SAS software. However, to apply them to a statistic computed by a procedure in another product, such as a procedure in SAS/STAT software, that product will also be required.
*USAGE:*-
Follow the instructions in the Downloads tab of this
sample to save the %JACK and %BOOT macro definitions. Replace the text within quotes in the following statement with the location of the macro definitions file on your system. In your SAS program or in the SAS editor window, specify this statement to define the macros and make them available for use:
%inc "<location of your file containing the JACK and BOOT macros>";

Following this statement, you may call the %JACK and %BOOT macros. See the Results tab for an example.

To use the %JACK or %BOOT macros, you must write a macro called %ANALYZE to do the data analysis that you want to bootstrap. The %ANALYZE macro must have two arguments:

DATA=  the name of the input data set to analyze
OUT=   the name of the output data set containing the statistics for which you want to compute bootstrap distributions

The %ANALYZE macro is run once by the %JACK or %BOOT macros to analyze the original data set. Then the resampled data sets are generated, and the %JACK or %BOOT macros run the %ANALYZE macro again to analyze each resample. There are two ways to analyze the resamples:

- with a macro loop
- with BY processing

If you don't do anything special, a macro loop will be used. A macro loop takes much more computer time than BY processing but requires less disk space. It is usually better to use BY processing unless you have run out of disk space.

To use BY processing for the resamples, you must write the %ANALYZE macro to use BY processing. Instead of coding ordinary BY statements in the %ANALYZE macro, however, you must use the %BYSTMT macro, which generates an appropriate BY statement automatically: for the analysis of the resamples, %BYSTMT generates a BY statement, while for the analysis of the original data it produces an empty BY statement, which has the same effect as having no BY statement at all.

If the %ANALYZE macro uses the %BYSTMT macro, the %JACK and %BOOT macros create a huge data set (actually a view) containing all of the resampled data sets, and analyze all the resamples by running the %ANALYZE macro once with BY processing. If the %ANALYZE macro does not use the %BYSTMT macro, the %JACK and %BOOT macros execute a macro loop that generates and analyzes the resamples one at a time.

**BY Processing in the Analysis**

Ordinarily, you do not need to worry about what happens inside the %BYSTMT macro. However, if the analysis of the original data set requires BY processing, you will have to modify the %BYSTMT macro to include your BY variable(s) as well as the variable that the %JACK and %BOOT macros use to distinguish the resamples (which is referred to as &by in the %BYSTMT macro). If you're not in a hurry, you may find it simpler to forgo use of the %BYSTMT macro.

*DETAILS:*-
**Introduction**

The %JACK macro does jackknife analyses for simple random samples, computing approximate standard errors, bias-corrected estimates, and confidence intervals assuming a normal sampling distribution.

The %BOOT macro does elementary nonparametric bootstrap analyses for simple random samples, computing approximate standard errors, bias-corrected estimates, and confidence intervals assuming a normal sampling distribution. Also, for regression models, the %BOOT macro can resample either observations or residuals.

The %BOOTCI macro computes several varieties of confidence intervals that are suitable for sampling distributions that are not normal.

In order to use the %JACK or %BOOT macros, you need to know enough about the SAS macro language to write simple macros yourself. See *The SAS Guide to Macro Processing* for information on the SAS macro language.

This document does not explain how the jackknife and bootstrap are performed or how the various confidence intervals are computed, but it does provide some advice and caveats regarding usage. For an elementary introduction, see Dixon in the references below. There is a thorough exposition in E&T (see references below) that should be accessible to anyone who has done a year or more of statistical study.

There is a widespread myth that bootstrapping is a magical spell to perform valid statistical inference on *anything*. S&T dispel this myth very effectively and very technically. For an elementary demonstration of the dangers of bootstrapping, see the "Cautionary Example" below.

**The Jackknife**

The jackknife works only for statistics that are smooth functions of the data. Statistics that are not smooth functions of the data, such as quantiles, may yield inconsistent jackknife estimates. The best results are obtained with statistics that are linear functions of the data. For highly nonlinear statistics, the jackknife can be inaccurate. See S&T, chapter 2, for a detailed discussion of the validity of the jackknife.
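The delete-one jackknife described above has a simple generic form. The following Python sketch is purely illustrative (it is not part of the SAS macros, and the function names are my own); it computes the jackknife standard error and bias-corrected estimate for a smooth statistic such as the mean:

```python
import math

def jackknife(data, stat):
    """Delete-one jackknife: return (bias-corrected estimate, standard error)."""
    n = len(data)
    theta_hat = stat(data)
    # Recompute the statistic n times, each time leaving out one observation.
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    # Jackknife bias estimate and standard error (standard formulas).
    bias = (n - 1) * (loo_mean - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo))
    return theta_hat - bias, se

mean = lambda xs: sum(xs) / len(xs)
est, se = jackknife([1.0, 2.0, 3.0, 4.0, 5.0], mean)
# For the mean, the jackknife SE agrees with the classical s/sqrt(n).
```

For a nonsmooth statistic such as a quantile, the same code runs but, as noted above, the resulting estimates may be inconsistent.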

**The Bootstrap**

Bootstrap estimates of standard errors are valid for many commonly used statistics, generally requiring no major assumptions other than simple random sampling and finite variance. There do exist statistics for which the standard error estimate will fail, such as the maximum or minimum. The bootstrap standard error is consistent for some nonsmooth statistics such as the median. However, the bootstrap standard error may not be consistent even for very smooth statistics when the population distribution has very heavy tails. Inconsistency of the usual bootstrap estimators can often be remedied by using a resample size m(n) that is smaller than the sample size n, chosen so that m(n)->infinity and m(n)/n->0 as n->infinity. Theoretical results on the consistency of the bootstrap standard error are not extensive. See S&T, chapter 3, for details.
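The uniform bootstrap standard error, including the m-out-of-n variant mentioned above, can be sketched in a few lines. This Python illustration is not part of the SAS macros; the names and defaults are my own:

```python
import math
import random

def boot_se(data, stat, samples=500, m=None, seed=123):
    """Bootstrap standard error by uniform resampling with replacement.
    Passing m < n gives an m-out-of-n bootstrap, the remedy mentioned
    above for cases where the usual bootstrap is inconsistent."""
    rng = random.Random(seed)
    n = len(data)
    m = m or n
    # Resampling distribution of the statistic.
    dist = [stat([data[rng.randrange(n)] for _ in range(m)])
            for _ in range(samples)]
    mu = sum(dist) / samples
    return math.sqrt(sum((t - mu) ** 2 for t in dist) / (samples - 1))

mean = lambda xs: sum(xs) / len(xs)
se_full = boot_se([float(i) for i in range(10)], mean)        # usual n-out-of-n
se_m    = boot_se([float(i) for i in range(10)], mean, m=5)   # m-out-of-n variant
```

As expected, the m-out-of-n standard error for the mean is larger, since it describes the variability of a mean of m rather than n observations.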

The bootstrap estimates of bias provided by the %BOOT macro are valid under simple random sampling for many commonly used *plug-in* estimators. A *plug-in* estimator is one that uses the same formula to compute an estimate from a sample that is used to compute a parameter from the population. For example, if the sample variance is computed with a divisor of n (VARDEF=N), it is a plug-in estimate; if it is computed with a divisor of n-1 (VARDEF=DF, the default), it is *not* a plug-in estimate. R^2 is a plug-in estimator; adjusted R^2 is not. Estimating the bias of a non-plug-in estimator requires special treatment; see "Bias Estimation" below. If you are using an estimator that is known to be unbiased, use the BIASCORR=0 argument with %BOOT. See E&T, chapter 10, for more discussion of bootstrap estimation of bias.

The approximate normal confidence intervals computed by the %BOOT macro are valid if both the bias and standard error estimates are valid and if the sampling distribution is approximately normal. For non-normal sampling distributions, you should use the %BOOTCI macro, which requires a much larger number of resamples for adequate approximation. If you plan to use only %BOOT, 200 resamples will typically be enough. If you plan to use %BOOTCI, 1000 or more resamples are likely to be needed for a 90% confidence interval; greater confidence levels require even more resamples. The proper use of bootstrap confidence intervals is a matter of considerable controversy; see S&T, chapter 4, for a review.
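The bootstrap bias estimate is the mean of the resampled statistics minus the original estimate, and the corrected estimate subtracts that bias. This hedged Python sketch (illustrative only, not the %BOOT implementation) demonstrates it on the plug-in (divisor-n) variance, which is biased downward:

```python
import random

def plugin_var(xs):
    """Plug-in (VARDEF=N) variance: divisor n, matching the population formula."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def boot_bias(data, stat, samples=2000, seed=123):
    """Bootstrap bias: mean of resampled statistics minus the original
    estimate; the bias-corrected estimate subtracts this bias."""
    rng = random.Random(seed)
    n = len(data)
    est = stat(data)
    boots = [stat([data[rng.randrange(n)] for _ in range(n)])
             for _ in range(samples)]
    bias = sum(boots) / samples - est
    return est, bias, est - bias

data = [float(i) for i in range(10)]
est, bias, corrected = boot_bias(data, plugin_var)
# The estimated bias is negative, reflecting that the divisor-n variance
# underestimates the population variance (theoretical bootstrap bias -est/n).
```

Applying the same procedure to the divisor-(n-1) variance would wrongly "correct" an estimator that is already unbiased, which is exactly the over-correction the SAS examples below demonstrate.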

The %BOOT macro does balanced resampling when possible. Balanced resampling yields more accurate approximations to the ideal bootstrap estimators of bias and standard errors than does uniform resampling. Of course, both balanced resampling and uniform resampling produce approximations that converge to the same ideal bootstrap estimators as the number of resamples goes to infinity. Balanced resampling is of little benefit with %BOOTCI. See Hall, appendix II, for a discussion of balanced resampling and other methods for improving the computational efficiency of the bootstrap.
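In balanced resampling, each original observation appears exactly the same number of times across all resamples. A common construction, sketched here in Python (illustrative only; not the %BOOT implementation), concatenates B copies of the index set, permutes them, and cuts the result into blocks of size n:

```python
import random
from collections import Counter

def balanced_resamples(n, samples, seed=123):
    """Balanced bootstrap indices: every observation index 0..n-1 appears
    exactly `samples` times in total across the `samples` resamples."""
    rng = random.Random(seed)
    pool = list(range(n)) * samples   # `samples` copies of each index
    rng.shuffle(pool)                 # one random permutation
    return [pool[i * n:(i + 1) * n] for i in range(samples)]

resamples = balanced_resamples(10, 50)
counts = Counter(i for r in resamples for i in r)  # each index appears 50 times
```

The balance removes the between-resample Monte Carlo variation in how often each observation is used, which is why the bias and standard error approximations are more accurate for a given number of resamples.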

**Output Data Sets**

If the %ANALYZE macro uses the %BYSTMT macro, two output data sets are created by the %JACK macro:

- JACKDATA
- contains the jackknife resamples. The variable _SAMPLE_ gives the resample number, and _OBS_ gives the original observation number.
- JACKDIST
- contains the resampling distributions of the statistics in the OUT= data set created by the %ANALYZE macro. The variable _SAMPLE_ gives the resample number.

Two similar data sets are also created by the %BOOT macro when the %BYSTMT macro is used:

- BOOTDATA
- contains the bootstrap resamples. The variable _SAMPLE_ gives the resample number, and _OBS_ gives the original observation number.
- BOOTDIST
- contains the resampling distributions of the statistics in the OUT= data set created by the %ANALYZE macro. The variable _SAMPLE_ gives the resample number.

In addition, the %JACK macro creates a data set JACKSTAT and the %BOOT macro creates a data set BOOTSTAT regardless of whether the %BYSTMT macro is used. These data sets contain the approximate standard errors, bias-corrected estimates, and 95% confidence intervals assuming a normal sampling distribution. The %BOOTCI macro creates a data set BOOTCI containing the confidence intervals.

If the OUT= data set contains more than one observation per BY group, you must specify a list of ID= variables when you run the %JACK or %BOOT macros. These ID= variables identify observations that correspond to the same statistic in different BY groups. For many procedures, these ID= variables would naturally be _TYPE_ and _NAME_, but those names are *not* allowed as ID= variables--you must use the RENAME= data set option to rename them. (Renaming variables can be tricky. You must use the *old* name with the DROP= and KEEP= data set options, but you must use the *new* name with the WHERE= data set option.)

**Bias Estimation**

The sample correlation is a plug-in estimator and hence is suitable for the bias estimator in %BOOT. The sample variance computed with a divisor of n-1 is not a plug-in estimator and therefore requires special treatment. In some procedures, you can use the VARDEF= option to obtain a plug-in estimate of the variance. The default value of VARDEF= is DF, which yields the usual adjustment for degrees of freedom instead of the plug-in estimate. For example:

title2 'The unbiased variance estimator is not a plug-in estimator';
proc means data=law var vardef=df;
   var LSAT GPA;
run;

The following %ANALYZE macro could be used to jackknife the unbiased variance estimator, but the bootstrap over-corrects for the nonexistent bias:

title2 'Estimating the bias of the unbiased estimator of variance';
%macro analyze(data=,out=);
   proc means noprint data=&data vardef=df;
      output out=&out(drop=_freq_ _type_) var=var_LSAT var_GPA;
      var LSAT GPA;
      %bystmt;
   run;
%mend;

title3 'The jackknife computes the correct bias of zero';
%jack(data=law)

title3 'The bootstrap over-corrects for bias';
%boot(data=law,random=123)

By specifying VARDEF=N instead of VARDEF=DF, you can tell the MEANS procedure to compute a plug-in estimate of the variance:

title2 'Estimating the bias of the plug-in estimator of variance';
%macro analyze(data=,out=);
   proc means noprint data=&data vardef=n;
      output out=&out(drop=_freq_ _type_) var=var_LSAT var_GPA;
      var LSAT GPA;
      %bystmt;
   run;
%mend;

With the above %ANALYZE macro, %JACK yields an exact bias correction, while the bias-corrected estimates from %BOOT are very close to the unbiased estimates:

title3 'Jackknife Analysis';
%jack(data=law)

title3 'Bootstrap Analysis';
%boot(data=law,random=123)

If the procedure you are using supports the VARDEF= option to produce plug-in estimates, you can use the %VARDEF macro to obtain correct bootstrap bias estimates for the corresponding non-plug-in estimators. The %VARDEF macro generates a VARDEF= option with a value of either N or DF as appropriate for use with the %BOOT macro (the %JACK macro ignores the %VARDEF macro). In the %ANALYZE macro, use %VARDEF in the procedure statement where the VARDEF= option would be syntactically correct. For example:

title2 'Estimating the bias of the unbiased variance estimator';
%macro analyze(data=,out=);
   proc means noprint data=&data %vardef;
      output out=&out(drop=_freq_ _type_) var=var_LSAT var_GPA;
      var LSAT GPA;
      %bystmt;
   run;
%mend;

title3 'Bootstrap Analysis';
%boot(data=law,random=123)

The variance estimator using VARDEF=DF is unbiased, so the bias correction estimated by bootstrapping is much smaller than in the previous example, in which the biased plug-in estimator was used.

**Confidence Intervals**

The normal bootstrap confidence interval computed by %BOOT or %BOOTSE is accurate only for statistics with an approximately normal sampling distribution. The %BOOTCI macro provides the most commonly used types of bootstrap confidence intervals that:

- are suitable for statistics with nonnormal sampling distributions and
- require only a single level of resampling.

You must run %BOOT before %BOOTCI, and it is advisable to specify at least 1000 resamples in %BOOT for a 90% confidence interval. For a higher level of confidence or for the BC and BCa methods, even more resamples should be used.

The terminology for bootstrap confidence intervals is inconsistent across the literature. The keywords used with the %BOOTCI macro follow S&T:

Keyword      Terms from the references
-------      -------------------------
PCTL or      "bootstrap percentile" in S&T;
PERCENTILE   "percentile" in E&T;
             "other percentile" in Hall;
             "Efron's 'backwards' percentile" in Hjorth

HYBRID       "hybrid" in S&T;
             no term in E&T;
             "percentile" in Hall;
             "simple" in Hjorth

T            "bootstrap-t" in S&T and E&T;
             "percentile-t" in Hall;
             "studentized" in Hjorth

BC           "BC" in all

BCA          "BCa" in S&T, E&T, and Hjorth;
             "ABC" in Hall
             (cannot be used for bootstrapping residuals
             in regression models)

There is considerable controversy concerning the use of bootstrap confidence intervals. To fully appreciate the issues, it is important to read S&T and Hall in addition to E&T. Asymptotically in simple random samples, the T and BCa methods work better than the traditional normal approximation, while the percentile, hybrid, and BC methods have the same accuracy as the traditional normal approximation. In small samples, things get much more complicated:

- The percentile method simply uses the alpha/2 and 1-alpha/2 percentiles of the bootstrap distribution to define the interval. This method performs well for quantiles and for statistics that are unbiased and have a symmetric sampling distribution. For a statistic that is biased, the percentile method amplifies the bias. The main virtue of the percentile method and the closely related BC and BCa methods is that the intervals are equivariant under transformation of the parameters. One consequence of this equivariance is that the interval cannot extend beyond the possible range of values of the statistic. In some cases, however, this property can be a vice--see the "Cautionary Example" below.
- The BC method corrects the percentile interval for bias--median bias, not mean bias. The correction is performed by adjusting the percentile points to values other than alpha/2 and 1-alpha/2. If a large correction is required, one of the percentile points will be very small; hence a very large number of resamples will be required to approximate the interval accurately. See the "Cautionary Example" below.
- The BCa method corrects the percentile interval for bias and skewness. This method requires an estimate of the acceleration, which is related to the skewness of the sampling distribution. The acceleration can be estimated by jackknifing for simple random samples which, of course, requires extra computation. For bootstrapping residuals in regression models, no general method for estimating the acceleration is known. If the acceleration is not estimated accurately, the BCa interval will perform poorly. The length of the BCa interval is not monotonic with respect to alpha (Hall, pp 134-135, 137). For large values of the acceleration and large alpha, the BCa interval is excessively short. The BCa interval is no better than the BC interval for nonsmooth statistics such as the median.
- The HYBRID method is the reverse of the percentile method. While the percentile method amplifies bias, the HYBRID method automatically adjusts for bias and skewness. The HYBRID method works well if the standard error of the statistic does not depend on any unknown parameters; otherwise, the T method works better if a good estimate of the standard error is available. Of all the methods in %BOOTCI, the HYBRID method seems to be the least likely to yield spectacularly wrong results, but often suffers from low coverage in relatively easy cases. The HYBRID method and the closely related T method are not equivariant under transformation of the parameters.
- The T method requires an estimate of the standard error (or a constant multiple thereof) of each statistic being bootstrapped. This requires more work from the user. If the standard errors are not estimated accurately, the T method may perform poorly. In simulation studies, T intervals are often found to be very long. E&T (p 160) claim that the T method is erratic and sensitive to outliers.
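The percentile, hybrid, and BC constructions described above can all be computed from a single bootstrap distribution. This Python sketch is purely illustrative (it is not the %BOOTCI implementation, and the function name is my own); BCa is omitted because it additionally needs the jackknife acceleration estimate:

```python
from statistics import NormalDist

def bootci(theta_hat, boot_dist, alpha=0.10):
    """Percentile, hybrid, and BC intervals from one bootstrap
    distribution; terminology follows S&T, as in the %BOOTCI keywords."""
    nd = NormalDist()
    b = sorted(boot_dist)
    nB = len(b)
    def q(p):                     # crude empirical quantile
        return b[min(nB - 1, max(0, int(p * nB)))]
    lo, hi = q(alpha / 2), q(1 - alpha / 2)
    pctl = (lo, hi)
    # Hybrid: reflect the bootstrap tails around the original estimate.
    hybrid = (2 * theta_hat - hi, 2 * theta_hat - lo)
    # BC: median-bias correction z0 shifts the percentile points.
    frac = sum(t < theta_hat for t in b) / nB
    z0 = nd.inv_cdf(min(1 - 1e-6, max(1e-6, frac)))
    za = nd.inv_cdf(alpha / 2)
    bc = (q(nd.cdf(2 * z0 + za)), q(nd.cdf(2 * z0 - za)))
    return {'PCTL': pctl, 'HYBRID': hybrid, 'BC': bc}

# For a symmetric, unbiased bootstrap distribution the three intervals agree.
ci = bootci(0.5, [i / 1000 for i in range(1001)], alpha=0.10)
```

The sketch makes the contrast above concrete: the percentile interval reads off the bootstrap quantiles directly, the hybrid interval reflects them, and BC adjusts the percentile points when the bootstrap distribution is median-biased relative to the original estimate.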

Numerous other methods exist for bootstrap confidence intervals that require nested resampling, i.e., each resample of the original sample is itself reresampled multiple times. Since the total number of reresamples required is typically 25,000 or more, these methods are extremely expensive and have not yet been implemented in the %BOOT and %BOOTCI macros.

The following example replicates the nonparametric confidence intervals shown in E&T, p 183. This example analyzes the variances of two variables, A and B, while E&T analyze only A. E&T do not show the hybrid interval, the normal ("standard") interval with bias correction, or the jackknife interval.

title 'Spatial Test Data from Efron and Tibshirani, pp 180 & 183';
data spatial;
   input a b @@;
   cards;
48 42  36 33  20 16  29 39  42 38  42 36  20 15  42 33
22 20  41 43  45 34  14 22   6  7   0 15  33 34  28 29
34 41   4 13  32 38  24 25  47 27  41 41  24 28  26 14
30 28  41 40
;

%macro analyze(data=,out=);
   proc means noprint data=&data vardef=n;
      output out=&out(drop=_freq_ _type_) var=var_a var_b;
      var a b;
      %bystmt;
   run;
%mend;

title2 'Jackknife Interval with Bias Correction';
%jack(data=spatial,alpha=.10);

title2 'Normal ("Standard") Confidence Interval with Bias Correction';
%boot(data=spatial,alpha=.10,samples=2000,random=123);

title2 'Normal ("Standard") Confidence Interval without Bias Correction';
%bootse(alpha=.10,biascorr=0);

title2 'Efron''s Percentile Confidence Interval';
%bootci(percentile,alpha=.10)

title2 'Hybrid Confidence Interval';
%bootci(hybrid,alpha=.10)

title2 'BC Confidence Interval';
%bootci(bc,alpha=.10)

title2 'BCa Confidence Interval';
%bootci(bca,alpha=.10)

title2 'Resampling with Computation of Studentizing Statistics';
%macro analyze(data=,out=);
   proc means noprint data=&data vardef=n;
      output out=&out(drop=_freq_ _type_)
         var=var_a var_b kurtosis=kurt_a kurt_b;
      var a b;
      %bystmt;
   run;
   data &out;
      set &out;
      stud_a=var_a*sqrt(kurt_a+2);
      stud_b=var_b*sqrt(kurt_b+2);
      drop kurt_a kurt_b;
   run;
%mend;
%boot(data=spatial,stat=var_a var_b,samples=2000,random=123);

title2 'T Confidence Interval';
%bootci(t,stat=var_a var_b,student=stud_a stud_b,alpha=.10)

If you want to compute *all* the varieties of confidence intervals, you can use the %ALLCI macro:

title2 'All Jackknife and Bootstrap Confidence Intervals';
%allci(stat=var_a var_b,student=stud_a stud_b,alpha=.10)

**Bootstrapping Regression Models**

In regression models, there are two main ways to do bootstrap resampling, depending on whether the predictor variables are random or fixed. Stine provides an elementary introduction to bootstrapping regressions, including discussion of outliers, robust estimators, and heteroscedasticity.

If the predictors are random, you resample observations just as you would for any simple random sample. This method is usually called "bootstrapping pairs".

If the predictors are fixed, the resampling process should keep the same values of the predictors in every resample and change only the values of the response variable by resampling the residuals. To do this with the %BOOT macro, you must do a preliminary analysis in which you fit the regression model using the complete sample and create an output data set containing residuals and predicted values; it is this output data set that is used as input to the %BOOT macro. You must also specify the name of the residual variable and provide an equation for computing the response variable from the residual and predicted values.

title 'Cement Hardening Data from Hjorth, p 31';
data cement;
   input x1-x4 y;
   label x1='3CaOAl2O3'
         x2='3CaOSiO2'
         x3='4CaOAl2O3Fe2O3'
         x4='2CaOSiO2';
   cards;
 7 26  6 60  78.5
 1 29 15 52  74.3
11 56  8 20 104.3
11 31  8 47  87.6
 7 52  6 33  95.9
11 55  9 22 109.2
 3 71 17  6 102.7
 1 31 22 44  72.5
 2 54 18 22  93.1
21 47  4 26 115.9
 1 40 23 34  83.8
11 66  9 12 113.3
10 68  8 12 109.4
;

proc reg data=cement;
   model y=x1-x4;
   output out=cemout r=resid p=pred;
run;

%macro analyze(data=,out=);
   options nonotes;
   proc reg data=&data noprint
            outest=&out(drop=Y _IN_ _P_ _EDF_);
      model y=x1-x4/selection=rsquare start=4;
      %bystmt;
   run;
   options notes;
%mend;

title2 'Resampling Observations';
title3 '(bias correction for _RMSE_ is wrong)';
%boot(data=cement,random=123)

title2 'Resampling Residuals';
title3 '(bias correction for _RMSE_ is wrong)';
%boot(data=cemout,residual=resid,equation=y=pred+resid,random=123)

Either method of resampling for regression models (observations or residuals) can be used regardless of the form of the error distribution. However, residuals should be resampled only if the errors are independent and identically distributed and if the functional form of the model is correct to within a reasonable approximation. If these assumptions are questionable, it is safer to resample observations.

In the above example, R^2 is a plug-in estimator, so the bias correction is appropriate. The root mean squared error, _RMSE_, is not a plug-in estimator, so the bias correction for _RMSE_ is wrong. Unfortunately, the REG procedure does not support the VARDEF= option. _RMSE_ is not *very* biased, so you could choose to ignore the bias and run the %BOOTSE macro to compute the standard error without a bias correction:

title2 'Resampling Observations';
title3 'Without bias correction';
%bootse(stat=_rmse_,biascorr=0)

To get the proper bias correction for _RMSE_, you have to use a DATA step that checks the macro variable &VARDEF and removes the adjustment for degrees of freedom when &VARDEF=N. You must also invoke the %VARDEF macro, but since you do not want to generate a VARDEF= option in this case, just assign the value returned by %VARDEF to an unused macro variable:

%macro analyze(data=,out=);
   options nonotes;
   proc reg data=&data noprint outest=&out;
      model y=x1-x4/selection=rsquare start=4;
      %bystmt;
   run;
   %let junk=%vardef;
   data &out(drop=y _in_ _p_ _edf_);
      set &out;
      _mse_=_rmse_**2;
      %if &vardef=N %then %do;
         _mse_=_mse_*_edf_/(_edf_+_p_);
         _rmse_=sqrt(_mse_);
      %end;
      label _mse_='Mean Squared Error';
   run;
   options notes;
%mend;

title2 'Resampling Observations';
%boot(data=cement,random=123)

Note that _MSE_ is an unbiased estimate, so its estimated bias is very small. _RMSE_ is slightly biased and thus has a larger estimated bias.

*REFERENCES:*-
- Dixon
- Dixon, P.M. (1993), "The bootstrap and the jackknife: Describing the precision of ecological indices," in Scheiner, S.M. and Gurevitch, J., eds., Design and Analysis of Ecological Experiments, New York: Chapman & Hall, pp 290-318.
- E&T
- Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.
- Hall
- Hall, P. (1992), The Bootstrap and Edgeworth Expansion, New York: Springer-Verlag.
- Hjorth
- Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods, London: Chapman & Hall.
- S&T
- Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag.
- Stine
- Stine, R. (1990), "An introduction to bootstrap methods: Examples and ideas," Sociological Methods & Research, 18, 243-291.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.


*EXAMPLE:*-
Consider analyzing the correlation of the LSAT and GPA variables from Efron and Tibshirani (1993):

title 'Law School Data from Efron and Tibshirani, p. 19';
data law;
   input LSAT GPA;
   cards;
576 3.39
635 3.30
558 2.81
578 3.03
666 3.44
580 3.07
555 3.00
661 3.43
651 3.36
605 3.13
653 3.12
575 2.74
545 2.76
572 2.88
594 2.96
;

The following %ANALYZE macro could be used to process all the statistics in the OUT= data set from PROC CORR:

%macro analyze(data=,out=);
   proc corr noprint data=&data
        out=&out(rename=(_type_=stat _name_=with));
      var LSAT GPA;
      %bystmt;
   run;
%mend;

%inc "<location of your file containing the JACK and BOOT macros>";

title2 'Jackknife Analysis';
%jack(data=law,id=stat with)

title2 'Bootstrap Analysis';
%boot(data=law,id=stat with,random=123)

However, if you are interested only in the correlation, it is more efficient to extract only the relevant observations and variables. It is also helpful to provide descriptive names and labels, as in this example:

%macro analyze(data=,out=);
   proc corr noprint data=&data out=&out;
      var LSAT GPA;
      %bystmt;
   run;
   %if &syserr=0 %then %do;
      data &out;
         set &out;
         where _type_='CORR' & _name_='LSAT';
         corr=GPA;
         label corr='Correlation';
         keep corr &by;
      run;
   %end;
%mend;

title2 'Jackknife Analysis';
%jack(data=law)

title2 'Bootstrap Analysis';
%boot(data=law,random=123)

It is advisable to make the OUT= data set as small as possible to conserve computer time and disk space. If you are running release 6.11 or later, you can use a WHERE= data set option on an output data set:

title2 'Using WHERE= with an output data set--6.11 only';
%macro analyze(data=,out=);
   proc corr noprint data=&data
        out=&out(where=(_type_='CORR' & _name_='LSAT')
                 rename=(GPA=corr)
                 keep=GPA _type_ _name_ &by);
      var LSAT GPA;
      %bystmt;
   run;
%mend;

title3 'Jackknife Analysis';
%jack(data=law)

title3 'Bootstrap Analysis';
%boot(data=law,random=123)

Unfortunately, you may not DROP any variable used in a WHERE= data set option.

**A Cautionary Example**

Jackknifing and bootstrapping are no remedy for an inadequate sample size. For nonparametric resampling methods, the sample distribution must be reasonably close in some sense to the population distribution to obtain accurate inferences. In parametric methods, only the estimated parameters need be reasonably close to the population parameters to obtain accurate inferences. The smaller the sample size, the greater the fluctuations in the distribution of the sample. Nonparametric methods that are sensitive to a wide variety of such fluctuations will suffer more from small sample sizes than will parametric methods *if* the assumptions of the parametric methods are valid.

In this example, the purpose of the analysis is to find a 95% confidence interval for R^2 in a linear regression with 20 observations and 10 predictors. The predictors and response are generated from a multivariate normal distribution, so normal-theory methods are applicable. With real data, if the distribution were not known to be normal, you might be tempted to use the jackknife or bootstrap on the theory that normal approximations could not be trusted at such a small sample size. In fact, most of the jackknife and bootstrap methods cannot be trusted either.

This example computes a 95% confidence interval with each of the methods available in %JACK and %BOOT using 1000 resamples. The results are assembled into a single data set called CI for comparison. PROC CANCORR is also used to obtain a normal-theory 95% confidence interval. Two versions of the %ANALYZE macro are shown, one with CANCORR and one with REG; either version can be used for the analysis.

title 'A Cautionary Example';
%let n=20;
%let p=10;

*** generate multivariate normal data with true R**2=0.1;
data x;
   array x x1-x&p;
   do n=1 to &n;
      drop n;
      do over x;
         x=rannor(123);
      end;
      p=sum(of x1-x&p)/sqrt(&p);
      e=rannor(123);
      y=p*sqrt(.1)+e*sqrt(.9);
      output;
   end;
run;

*** normal-theory confidence interval for R**2;
proc cancorr data=x smc vdep short outstat=stat;
   var y;
   with x:;
run;

*** extract confidence interval from cancorr output;
proc transpose data=stat(where=(_type_ in ('LCLRSQ' 'UCLRSQ')))
               out=ci(rename=(LCLRSQ=alcl UCLRSQ=aucl));
   id _type_;
   var y;
run;
data ci;
   set ci;
   drop _name_;
   length method $20;
   method='Normal theory';
run;

%*** macro for bootstrapping using cancorr;
%macro analyze(data=,out=);
   options nonotes;
   proc cancorr data=&data smc vdep noprint
        outstat=&out(where=(_type_='RSQUARED') keep=_type_ y &by);
      var y;
      with x:;
      %bystmt;
   run;
   options notes;
%mend;

%*** macro for bootstrapping using reg;
%macro analyze(data=,out=);
   options nonotes;
   proc reg data=&data noprint outest=&out(keep=_rsq_ &by);
      model y=x:/selection=rsquare start=&p;
      %bystmt;
   run;
   options notes;
%mend;

%jack(data=x)
proc append data=jackstat(keep=alcl aucl method) base=ci;
run;

%boot(data=x,samples=1000,random=123)
proc append data=bootstat(keep=alcl aucl method) base=ci;
run;

%bootci(Hybrid)
proc append data=bootci(keep=alcl aucl method) base=ci;
run;

%bootci(PCTL)
proc append data=bootci(keep=alcl aucl method) base=ci;
run;

%bootci(BC)
proc append data=bootci(keep=alcl aucl method) base=ci;
run;

%bootci(BCa)
proc append data=bootci(keep=alcl aucl method) base=ci;
run;

%*** macro for bootstrapping, with a data step to estimate
     the standard error of R**2;
%macro analyze(data=,out=);
   options nonotes;
   proc reg data=&data noprint outest=&out(keep=_rsq_ &by);
      model y=x:/selection=rsquare start=&p;
      %bystmt;
   run;
   options notes;
   data &out;
      set &out(rename=(_rsq_=rsquare));
      retain n &n p %eval(&p+1);
      drop n p;
      * standard error of rsquare--see Kendall & Stuart (1979),
        The Advanced Theory of Statistics, Vol 2, p 363, London:
        Charles Griffin & Company Ltd.  Surprisingly, the bootstrap t
        interval seems to be slightly less conservative when rsquare
        is plugged into the formula for the standard error than when
        max(0,adjrsq) is plugged in;
      stud=max(.01,sqrt( (n-p)/(n**2-1)/(n-1) * (1-rsquare)**2 *
           ( 2*(p-1) + 4*rsquare*((n-p)*(n-1)+4*(p-1))/(n+3) ) ));
   run;
%mend;
%boot(data=x,stat=rsquare,samples=1000,random=123)
%bootci(t,stat=rsquare,student=stud)
proc append data=bootci(keep=alcl aucl method) base=ci;
run;

proc print data=ci;
   id method;
run;

The actual sampling distribution of R², based on 10,000 simulated data sets, looks like this:

[ASCII histogram of the simulated sampling distribution of R², plotted over midpoints 0.00 to 1.00 in steps of 0.04, with frequencies up to about 1000.]

The bootstrap distribution computed from the one data set in this example is not even close to the true sampling distribution:

[ASCII histogram of the bootstrap distribution of R² from the single sample, on the same axis (midpoints 0.00 to 1.00), with frequencies up to about 300; the distribution is visibly narrower and more skewed than the true sampling distribution above.]

The table of confidence intervals printed by the final PROC PRINT step is:

| METHOD | ALCL | AUCL |
| --- | --- | --- |
| Normal theory | 0.00000 | 0.62876 |
| Jackknife | -0.44648 | 0.54393 |
| Bootstrap Normal | 0.08193 | 0.51356 |
| Bootstrap Hybrid | 0.18391 | 0.55838 |
| Bootstrap PCTL | 0.62552 | 1.00000 |
| Bootstrap BC | 0.47696 | 0.55898 |
| Bootstrap BCa | 0.47696 | 0.55898 |
| Bootstrap t | -3.11368 | 0.55590 |

The true value of R² in this example is 0.10, the sample plug-in estimate is 0.59, and the adjusted estimate is 0.14. The normal-theory interval can be considered the "right answer". The jackknife interval has a negative lower limit, and the upper limit is rather low, but the interval covers the true value.
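As a quick check on the quoted figures, the adjusted estimate can be reproduced with the usual adjusted-R² formula. The following Python snippet is illustrative only and is not part of the SAS sample:

```python
def adjusted_r2(r2, n, p):
    """Standard adjusted R**2: 1 - (1 - R**2)(n - 1)/(n - p - 1),
    with n observations and p regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# n=20 observations and p=10 regressors, as in the example; with the
# (rounded) plug-in estimate 0.59 this gives about 0.134, consistent
# with the adjusted estimate of 0.14 quoted above.
adj = adjusted_r2(0.59, n=20, p=10)
```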

The bootstrap interval based on a normal approximation is short but does cover the true value. However, a glance at the chart of the bootstrap distribution shows that a normal approximation is suspect.
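For readers unfamiliar with the normal-approximation interval, its logic can be sketched as follows. This is an illustrative Python translation, not the %BOOT macro itself (%BOOT additionally bias-corrects the estimate and offers many more options):

```python
import random
import statistics

def boot_normal_ci(data, stat, n_resamples=1000, seed=123):
    """Sketch of a bootstrap 'normal' 95% interval: resample with
    replacement, compute the statistic on each resample, and use
    estimate +/- 1.96 * (bootstrap standard deviation)."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        boots.append(stat(resample))
    se = statistics.stdev(boots)   # bootstrap estimate of the SE
    est = stat(data)
    z = 1.959964                   # normal 97.5th percentile
    return est - z * se, est + z * se
```

Because the interval is centered on the (possibly biased) estimate and is forced to be symmetric, its width can be about right while its location and shape are wrong, which is exactly why a glance at the bootstrap histogram matters.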

The bootstrap hybrid interval is even shorter and does not cover the true value. The hybrid interval is poor because the bootstrap distribution is less variable and far more skewed than the true sampling distribution.

The plug-in estimate is very biased, so it is no surprise that the bootstrap PCTL method works poorly. However, the PCTL interval lies entirely above the plug-in estimate, a dramatic illustration of Hall's claim that the PCTL interval is "backwards"!
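Hall's "backwards" point is easiest to see from the formulas: the percentile interval uses the bootstrap quantiles directly, while the hybrid (basic) interval reflects them about the estimate. A minimal Python sketch, not the macros' code, with hypothetical bootstrap values:

```python
def percentile_ci(boots, alpha=0.05):
    """PCTL interval: bootstrap quantiles used directly (a simple
    order-statistic approximation to the quantiles)."""
    b = sorted(boots)
    lo = b[int(alpha / 2 * len(b))]
    hi = b[int((1 - alpha / 2) * len(b)) - 1]
    return lo, hi

def hybrid_ci(boots, est, alpha=0.05):
    """Hybrid (basic) interval: the percentile endpoints reflected
    about the estimate, (2*est - hi, 2*est - lo)."""
    lo, hi = percentile_ci(boots, alpha)
    return 2 * est - hi, 2 * est - lo

# Toy illustration of the 'backwards' behavior: when the bootstrap
# values sit almost entirely above the estimate (as the overfitted
# R**2 resamples do here), the percentile interval lands above the
# estimate while the hybrid interval lands below it.
est = 0.59
boots = [0.60 + 0.01 * i for i in range(40)]   # hypothetical values
```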

The bootstrap BC interval is extremely short and is not even close to the true value. The lower percentile point for computing the BC interval is .00000000010453, so billions of resamples would be required for an accurate approximation. The lower percentile point for the BCa interval is even smaller at 7.3099E-17, and would require an astronomical number of resamples for an accurate approximation.
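Those vanishing percentile points come from the bias correction in the standard BC formula, which maps the nominal level through the normal CDF as Phi(2*z0 + z_alpha). A sketch of that mapping (the z0 value below is hypothetical, chosen only to show the order of magnitude):

```python
import math

def std_normal_cdf(x):
    # Phi(x) via erfc to avoid cancellation in the far tail
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def bc_lower_percentile(z0, alpha=0.05):
    """Lower percentile actually used by a BC interval:
    Phi(2*z0 + z_{alpha/2}).  A large negative bias correction z0
    drives this toward zero, so no realistic number of resamples can
    estimate the corresponding quantile."""
    z_alpha = -1.959964   # Phi^{-1}(0.025)
    return std_normal_cdf(2 * z0 + z_alpha)

# A hypothetical bias correction of z0 = -2.2 already pushes the
# working percentile down to roughly the 1e-10 order of magnitude
# quoted above.
p = bc_lower_percentile(-2.2)
```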

The bootstrap t interval has a wildly negative lower limit, and the upper limit is rather low, but the interval covers the true value.

A simulation was performed by repeating the above analysis 2000 times on randomly generated data sets. For each method, the coverage probability (COVERAGE), the average length (LENGTH), and the average positive part of the length (POSLEN = AUCL - MAX(0, ALCL)) were computed. Among the jackknife and bootstrap methods, the only acceptable coverage probability is achieved by the bootstrap t interval, which is nevertheless very poor with regard to interval length. Considering only the positive part of the interval, the bootstrap t interval is quite good, but it works well in this example only because we know a lower bound for the parameter and have an analytic expression for the standard error.

| METHOD | COVERAGE | LENGTH | POSLEN |
| --- | --- | --- | --- |
| Bootstrap BC | 0.00426 | 0.148825 | 0.148825 |
| Bootstrap BCa | 0.00426 | 0.148575 | 0.148575 |
| Bootstrap Hybrid | 0.42997 | 0.418778 | 0.351726 |
| Bootstrap Normal | 0.54917 | 0.485234 | 0.361531 |
| Bootstrap PCTL | 0 | 0.418778 | 0.418778 |
| Bootstrap T | 0.95828 | 4.351963 | 0.542228 |
| Jackknife | 0.66050 | 0.561788 | 0.333603 |
| Normal theory | 0.95360 | 0.573977 | 0.573977 |
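The bookkeeping behind COVERAGE, LENGTH, and POSLEN is simple. Here is a Python sketch of how such a table could be tallied from a stack of simulated intervals (illustrative only; the actual simulation was done in SAS):

```python
def summarize(intervals, true_value):
    """Tally coverage probability, average length, and average
    positive part of the length (POSLEN = AUCL - MAX(0, ALCL))
    over a list of (alcl, aucl) pairs from repeated simulations."""
    n = len(intervals)
    coverage = sum(lo <= true_value <= hi for lo, hi in intervals) / n
    length = sum(hi - lo for lo, hi in intervals) / n
    poslen = sum(hi - max(0.0, lo) for lo, hi in intervals) / n
    return coverage, length, poslen
```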

Right-click on the link below and select **Save** to save the %JACK and %BOOT macro definitions to a file. It is recommended that you name the file `jackboot.sas`.

Download and save jackboot.sas

The %JACK and %BOOT macros do jackknife and bootstrap analyses for simple random samples, computing approximate standard errors, bias-corrected estimates, and confidence intervals assuming a normal sampling distribution.

#### Operating System and Release Information

Type: Sample

Date Modified: 2010-12-03 11:16:02

Date Created: 2005-01-13 15:02:40

| Product Family | Product | Host | Starting SAS Release | Ending SAS Release |
| --- | --- | --- | --- | --- |
| SAS System | SAS/STAT | All | n/a | n/a |