The TPSPLINE Procedure |
This example illustrates how you can construct a bootstrap confidence interval by using the multiple responses feature in PROC TPSPLINE.
Numerous epidemiological observations have indicated that exposure to solar radiation is an important factor in the etiology of melanoma. The following data present age-adjusted melanoma incidences for 37 years from the Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980). The data are analyzed by Ramsay and Silverman (1997).
data melanoma; input year incidences @@; datalines; 1936 0.9 1937 0.8 1938 0.8 1939 1.3 1940 1.4 1941 1.2 1942 1.7 1943 1.8 1944 1.6 1945 1.5 1946 1.5 1947 2.0 1948 2.5 1949 2.7 1950 2.9 1951 2.5 1952 3.1 1953 2.4 1954 2.2 1955 2.9 1956 2.5 1957 2.6 1958 3.2 1959 3.8 1960 4.2 1961 3.9 1962 3.7 1963 3.3 1964 3.7 1965 3.9 1966 4.1 1967 3.8 1968 4.7 1969 4.4 1970 4.8 1971 4.8 1972 4.8 ;
The variable incidences records the number of melanoma cases per people for the years 1936 to 1972. The following model fits the data and requests a Bayesian confidence interval along with the estimate:
proc tpspline data=melanoma; model incidences = (year) /alpha = 0.1; output out = result pred uclm lclm; run;
The output is displayed in Output 89.5.1
Summary of Input Data Set | |
---|---|
Number of Non-Missing Observations | 37 |
Number of Missing Observations | 0 |
Unique Smoothing Design Points | 37 |
The following statements produce a plot of the estimated curve:
ods graphics on; proc sgplot data=result; title "Age-adjusted Melanoma Incidence for 37 Years"; xaxis label="year"; yaxis label="Incidences"; band x=year lower=lclm_incidences upper=uclm_incidences/name="range" legendlabel="90% CI of Predicted incidences"; scatter x=year y=incidences/name="obs" legendlabel="incidences"; series x=year y=p_incidences/name="pred" legendlabel="predicted values of incidences" lineattrs=graphfit(thickness=1px); discretelegend "range" "obs" "pred"; run;
The estimated curve is displayed with confidence interval bands in Output 89.5.2. The number of melanoma incidences exhibits a periodic pattern and increases over the years. The periodic pattern is related to sunspot activity and the accompanying fluctuations in solar radiation.
Wang and Wahba (1995) compare several bootstrap confidence intervals to Bayesian confidence intervals for smoothing splines. Both bootstrap and Bayesian confidence intervals are across-the-curve intervals, not pointwise intervals. They concluded that bootstrap confidence intervals work as well as Bayesian intervals concerning average coverage probability. Additionally, bootstrap confidence intervals appear to be better for small sample sizes. Based on their simulation, the "percentile- interval" bootstrap interval performs better than the other types of bootstrap intervals.
Suppose that and are the estimates of and from the data. Assume that is the "true" , and generate the bootstrap sample as follows:
where . Denote as the random variable of the bootstrap estimate at . Repeat this process times, so that at each point , you have bootstrap estimates or realizations of . For each fixed , consider the following statistic , which is similar to the Student’s statistic:
where is the estimate of based on the th bootstrap sample.
Suppose and are the lower and upper points, respectively, of the empirical distribution of . The 100% bootstrap confidence interval is defined as
Bootstrap confidence intervals are easy to interpret and can be used with any distribution. However, because they require model fits, their construction is computationally intensive.
The multiple dependent variables feature in PROC TPSPLINE enables you to fit multiple models with the same independent variables. The procedure calculates the matrix decomposition part of the calculations only once, regardless of the number of dependent variables in the model. These calculations are responsible for most of the computing time used by the TPSPLINE procedure. This feature is particularly useful when you need to generate a bootstrap confidence interval.
To construct a bootstrap confidence interval, perform the following tasks:
Fit the data by using PROC TPSPLINE and obtain estimates and .
Generate bootstrap samples based on and .
Fit the bootstrap samples with the TPSPLINE procedure to obtain estimates of and .
Compute and the values and .
The following statements illustrate this process:
proc tpspline data=melanoma; model incidences = (year) /alpha = 0.05; output out = result pred uclm lclm; run;
The output from the initial PROC TPSPLINE analysis is displayed in Output 89.5.3. The data set result contains the predicted values and confidence limits from the analysis.
Summary of Input Data Set | |
---|---|
Number of Non-Missing Observations | 37 |
Number of Missing Observations | 0 |
Unique Smoothing Design Points | 37 |
Summary of Final Model | |
---|---|
Number of Regression Variables | 0 |
Number of Smoothing Variables | 1 |
Order of Derivative in the Penalty | 2 |
Dimension of Polynomial Space | 2 |
Summary Statistics of Final Estimation | |
---|---|
log10(n*Lambda) | -0.0607 |
Smoothing Penalty | 0.5171 |
Residual SS | 1.2243 |
Tr(I-A) | 22.5852 |
Model DF | 14.4148 |
Standard Deviation | 0.2328 |
The following statements illustrate how you can obtain a bootstrap confidence interval for the Melanoma data set. The following statements create the data set bootstrap. The observations are created with information from the preceding PROC TPSPLINE execution; as displayed in Output 89.5.3, . The values of are stored in the data set result in the variable P_incidence.
data bootstrap; set result; array y{1070} y1-y1070; do i=1 to 1070; y{i} = p_incidences + 0.232823*rannor(123456789); end; keep y1-y1070 p_incidences year; run; proc tpspline data=bootstrap; ods output FitStatistics=FitResult; id p_incidences; model y1-y1070 = (year); output out=result2; run;
The DATA step generates 1,070 bootstrap samples based on the previous estimate from PROC TPSPLINE. For this data set, some of the bootstrap samples result in s (selected by the GCV function) that cause problematic behavior. Thus, an additional 70 bootstrap samples are generated.
The ODS listing destination is closed before PROC TPSPLINE is invoked. The model fits all the y1...y1070 variables as dependent variables, and the models are fit for all bootstrap samples simultaneously. The output data set result2 contains the variables year, y1...y1070, p_y1...p_y1070, and p_incidences.
The ODS OUTPUT statement writes the FitStatistics table to the data set FitResult. The data set FitResult contains the two variables Parameter and Value. The FitResult data set is used in subsequent calculations for .
In the data set FitResult, there are 63 estimates with a standard deviation of zero, suggesting that the estimates provide perfect fits of the data and are caused by s that are approximately equal to zero. For small sample sizes, there is a positive probability that the chosen by the GCV function will be zero (Wang and Wahba 1995).
In the following steps, these cases are removed from the bootstrap samples as "bad" samples: they represent failure of the GCV function.
The following SAS statements manipulate the data set FitResult, retaining the standard deviations for all bootstrap samples and merging FitResult with the data set result2, which contains the estimates for bootstrap samples. In the final data set boot, the statistics are calculated.
data FitResult; set FitResult; if Parameter="Standard Deviation"; keep Value; run; proc transpose data=FitResult out=sd prefix=sd; data result2; if _N_ = 1 then set sd; set result2; data boot; set result2; array y{1070} p_y1-p_y1070; array sd{1070} sd1-sd1070; do i=1 to 1070; if sd{i} > 0 then do; d = (y{i} - P_incidences)/sd{i}; obs = _N_; output; end; end; keep d obs P_incidences year; run;
The following SAS statements retain the first 1,000 bootstrap samples and calculate the values and with .
proc sort data=boot; by obs; run; data boot; set boot; by obs; retain n; if first.obs then n=1; else n=n+1; if n > 1000 then delete; run; proc sort data=boot; by obs d; run; data chi1 chi2 ; set boot; if (_N_ = (obs-1)*1000+50) then output chi1; if (_N_ = (obs-1)*1000+950) then output chi2; run; proc sort data=result; by year; run; proc sort data=chi1; by year; run; proc sort data=chi2; by year; run; data result; merge result chi1(rename=(d=chi05)) chi2(rename=(d=chi95)); keep year incidences P_incidences lower upper LCLM_incidences UCLM_incidences; lower = -chi95*0.232823 + P_incidences; upper = -chi05*0.232823 + P_incidences; label lower="Lower 90% CL (Bootstrap)" upper="Upper 90% CL (Bootstrap)" lclm_incidences="Lower 90% CL (Bayesian)" uclm_incidences="Upper 90% CL (Bayesian)"; run;
The data set result contains the variables year and incidences, the PROC TPSPLINE estimate P_incidences, and the Bayesian and bootstrap confidence intervals.
The following statements produce Output 89.5.4:
proc sgplot data=result; title "Age-adjusted Melanoma Incidence for 37 Years"; xaxis label="year"; yaxis label="Incidences"; band x=year lower=lclm_incidences upper=uclm_incidences/name="bayesian" legendlabel="90% Bayesian CI of Predicted incidences" fillattrs=(color=red); band x=year lower=lower upper=upper/name="bootstrap" legendlabel="90% Bootstrap CI of Predicted incidences" transparency=0.05; scatter x=year y=incidences/name="obs" legendlabel="incidences"; series x=year y=p_incidences/name="pred" legendlabel="predicted values of incidences" lineattrs=graphfit(thickness=1px); discretelegend "bayesian" "bootstrap" "obs" "pred"; run; ods graphics off;
Output 89.5.4 displays the plot of the variable incidences, the predicted values, and the Bayesian and bootstrap confidence intervals.
The plot shows that the bootstrap confidence interval is similar to the Bayesian confidence interval. However, the Bayesian confidence interval is symmetric around the estimates, while the bootstrap confidence interval is not.
Copyright © 2009 by SAS Institute Inc., Cary, NC, USA. All rights reserved.