What's New in SAS/STAT

This release brings several new procedures to SAS/STAT software. The MI and MIANALYZE procedures implement the multiple imputation strategy for missing data. Experimental in Releases 8.1 and 8.2, these procedures are now production. The ROBUSTREG procedure analyzes data that may include outliers and provides stable results in their presence. The TPHREG procedure is a test release of the PHREG procedure that incorporates the CLASS statement.

Power and sample size computations are also available in SAS 9.1. The new POWER and GLMPOWER procedures provide these computations for a number of analyses, and the Power and Sample Size Application surfaces them through a point-and-click interface.

SAS 9.1 introduces two new procedures for the analysis of survey data. The SURVEYFREQ procedure produces one-way to *n*-way frequency and crosstabulation tables for data collected from surveys. These tables include estimates of totals and proportions (overall, row percentages, column percentages) and the corresponding standard errors. The SURVEYLOGISTIC procedure performs logistic regression for survey data, and it can also fit links such as the cumulative logit, generalized logit, probit, and complementary log-log functions. Both of these procedures incorporate complex survey sample designs, including designs with stratification, clustering, and unequal weighting, in their computations.

In addition, this release includes numerous enhancements to existing procedures. For example, conditional logistic regression is available in the LOGISTIC procedure through the new STRATA statement, and scoring of data sets is available through the new SCORE statement. The GLM procedure now provides the ability to form classification groups using the full formatted length of the CLASS variable levels. In addition, the SURVIVAL statement in the LIFETEST procedure enables you to create confidence bands (also known as simultaneous confidence intervals) for the survivor function and to specify a transformation for computing the confidence bands and the pointwise confidence intervals.

Several new procedures have been made available for the Windows release via Web download. See below for more information.

More information about the changes and enhancements to SAS/STAT software follows. Features new in SAS 9.1 are indicated with a 9.1 icon; other features were available with SAS 9.0. Details can be found in the documentation for the individual procedures.

Selected functionalities in the GLM, LOESS, REG, and ROBUSTREG procedures have been multithreaded to exploit hardware with multiple CPUs. Refer to Cohen (2002) for more details.

A number of SAS/STAT procedures use an experimental extension to the Output Delivery System (ODS) that enables them to create statistical graphics automatically. The facility is invoked when you include an ODS GRAPHICS statement before your procedure statements. Graphics are then created automatically, or in response to procedure options for graphics. Procedures taking advantage of ODS graphics are the ANOVA, CORRESP, GAM, GENMOD, GLM, KDE, LIFETEST, LOESS, LOGISTIC, MI, MIXED, PHREG, PLS, PRINCOMP, PRINQUAL, REG, ROBUSTREG, and TPSPLINE procedures. The plots produced and the corresponding options are described in the documentation for the individual procedures.
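As a minimal sketch of how the facility is invoked (the data set and model here are illustrative, and in SAS 9.1 an ODS destination such as HTML must also be open for the graphs to render):

```sas
ods html;                   /* ODS graphics require an open ODS destination */
ods graphics on;            /* experimental statement in SAS 9.1 */

proc reg data=sashelp.class;
   model weight = height;   /* diagnostic plots are then produced automatically */
run;
quit;

ods graphics off;
ods html close;
```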

Several procedures are now available via Web download for the Windows platform; they work only with SAS 9.1. PROC GLIMMIX fits generalized linear mixed models. The experimental QUANTREG procedure performs quantile regression. The experimental GLMSELECT procedure performs effect selection in the framework of general linear models. For more information about these procedures and a link to the download site, see www.sas.com/statistics/

Memory handling has been improved in the CATMOD procedure. The PARAM=REFERENCE option has been added to the MODEL statement and produces reference cell parameterization. Other new options include the ITPRINT, DESIGN, and PROFILE|POPPROFILE options in the PROC statement.

The new DISTANCE procedure computes various measures of distance, dissimilarity, or similarity between the observations (rows) of a SAS data set. These proximity measures are stored as a lower triangular matrix or a square matrix in an output data set (depending on the SHAPE= option) that can then be used as input to the CLUSTER, MDS, and MODECLUS procedures. The input data set may contain numeric or character variables, or both, depending on which proximity measure is used. PROC DISTANCE also provides various nonparametric and parametric methods for standardizing variables. Distance matrices are used frequently in data mining, genomics, marketing, financial analysis, management science, education, chemistry, psychology, biology, and various other fields.

The NOPROMAXNORM option in the FACTOR procedure turns off the default row normalization of the pre-rotated factor pattern, which is used in computing the promax target matrix.

You can now produce standard errors and confidence limits with the METHOD=ML option for the PROMAX factor solutions in PROC FACTOR. You can obtain the standard errors with the SE option, control the coverage displays with the COVER= option, and set the coverage level with the ALPHA= option.

The BDT option in the TABLES statement of the FREQ procedure includes Tarone's adjustment in the Breslow-Day test for homogeneity of odds ratios. Refer to Agresti (1996) and Tarone (1985).

The ZEROS option in the WEIGHT statement includes zero-weight observations in the analysis. (By default, PROC FREQ does not process zero-weight observations.) With the ZEROS option, PROC FREQ displays zero-weight levels in crosstabulation and frequency tables. For one-way tables, the ZEROS option includes zero-weight levels in chi-square tests and binomial statistics. For multiway tables, the ZEROS option includes zero-weight levels in kappa statistics.
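A short sketch of the ZEROS option (the data set name `survey`, the weight variable `wt`, and the table variable `region` are hypothetical):

```sas
proc freq data=survey;   /* 'survey', 'region', and 'wt' are illustrative names */
   tables region / chisq;
   weight wt / zeros;    /* include zero-weight levels in tables and one-way tests */
run;
```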

The CROSSLIST option displays crosstabulation tables in ODS column format. Unlike the default crosstabulation table, the CROSSLIST table has a table definition that you can customize with PROC TEMPLATE. The NLEVELS option provides a table with the number of levels for all TABLES statement variables.

The FREQ procedure now produces exact confidence limits for the common odds ratio and related tests.

The GENMOD procedure now forms classification groups using the full formatted length of the CLASS variable levels. Several new full-rank CLASS variable parameterizations are now available: polynomial, orthogonal polynomial, effect, orthogonal effect, reference, orthogonal reference, ordinal, and orthogonal ordinal. The default parameterization remains the same less-than-full-rank parameterization used in previous releases.

Zero is now a valid value for the negative binomial dispersion parameter corresponding to the Poisson distribution. If a fixed value of zero is specified, a score test for overdispersion (Cameron and Trivedi 1998) is computed.
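The two GENMOD enhancements above might be combined as follows. This is a sketch under stated assumptions: the data set and variables are hypothetical, the CLASS option syntax follows the LOGISTIC-style form that the new full-rank parameterizations suggest, and my reading is that the dispersion is fixed at zero with the SCALE=0 and NOSCALE options together:

```sas
proc genmod data=counts;
   class trt / param=ref;             /* new full-rank reference parameterization */
   model y = trt x / dist=negbin
                     scale=0 noscale; /* fix dispersion at 0: Poisson fit plus
                                         score test for overdispersion */
run;
```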

As an experimental feature, PROC GENMOD now provides model assessment based on aggregates of residuals.

The GLM procedure now forms classification groups using the full formatted length of the CLASS variable levels.

In addition, the MANOVA and REPEATED statements now support exact multivariate tests. The default MSTAT=FAPPROX produces multivariate tests using approximations based on the *F* distribution. Specifying MSTAT=EXACT computes exact *p*-values for three of the four tests (Wilks' Lambda, the Hotelling-Lawley Trace, and Roy's Greatest Root) and an improved *F* approximation for the fourth (Pillai's Trace).

The GLMPOWER procedure performs prospective analyses for linear models, with a variety of goals:

- determining the sample size required to obtain a significant result with adequate probability (power)
- characterizing the power of a study to detect a meaningful effect
- conducting what-if analyses to assess sensitivity of the power or required sample size to other factors

You specify the design and the cell means using an exemplary data set, a data set of artificial values constructed to represent the intended sampling design and the surmised response means in the underlying population. You specify the model and contrasts using MODEL and CONTRAST statements similar to those in the GLM procedure. You specify the remaining parameters with the POWER statement, which is similar to analysis statements in the new POWER procedure.
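For instance, an exemplary data set and GLMPOWER step might look like the following; all names, means, and parameter values are illustrative, not taken from the documentation:

```sas
data exemplary;                 /* surmised cell means for a one-way design */
   input drug $ resp;
   datalines;
A 10
B 12
C 15
;

proc glmpower data=exemplary;
   class drug;
   model resp = drug;
   power
      stddev = 4                /* conjectured error standard deviation */
      alpha  = 0.05
      ntotal = .                /* solve for total sample size */
      power  = 0.9;
run;
```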

The new UNIVAR and BIVAR statements in the KDE procedure provide improved syntax. The BIVAR statement lists variables in the input data set for which bivariate kernel density estimates are to be computed; the UNIVAR statement lists variables for which univariate kernel density estimates are to be computed.

The new SURVIVAL statement in the LIFETEST procedure enables you to create confidence bands (also known as simultaneous confidence intervals) for the survivor function and to specify a transformation for computing the confidence bands and the pointwise confidence intervals. It contains the following options.

- The OUT= option names the output SAS data set that contains survival estimates as in the OUTSURV= option in the PROC LIFETEST statement.
- The CONFTYPE= option specifies the transformation applied to the survivor function estimate to obtain the pointwise confidence intervals and the confidence bands. Four transforms are available: the arcsine-square root transform, the complementary log-log transform, the logarithmic transform, and the logit transform.
- The CONFBAND= option specifies the confidence bands to add to the OUT= data set. You can choose the equal precision confidence bands (Nair 1984), or the Hall-Wellner bands (Hall and Wellner 1980), or both.
- The BANDMAX= option specifies the maximum time for the confidence bands.
- The BANDMIN= option specifies the minimum time for the confidence bands.
- The STDERR option adds the column of standard error of the estimated survivor function to the OUT= data set.
- The ALPHA= option sets the confidence level for pointwise confidence intervals as well as the confidence bands.
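The options above might be combined as in this sketch (the data set `bmt` and the variables `t` and `status` are hypothetical, as are the band limits):

```sas
proc lifetest data=bmt;          /* data set and variables are illustrative */
   time t*status(0);
   survival out=surv
            conftype=loglog      /* complementary log-log transform */
            confband=all         /* equal-precision and Hall-Wellner bands */
            bandmin=100 bandmax=600
            stderr;              /* add standard errors to the OUT= data set */
run;
```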

The LIFETEST procedure now provides additional tests for comparing two or more samples of survival data, including the Tarone-Ware test, Peto-Peto test, modified Peto-Peto test, and the Fleming-Harrington family of tests. Trend tests for ordered alternatives can be requested. Also available are stratified tests for comparing survival functions while adjusting for prognostic factors that affect the event rates.

The LOESS procedure now performs DF computations using a sparse method when appropriate. In addition, the DFMETHOD=APPROX option is available.

The new SCORE statement in the LOGISTIC procedure enables you to score new data sets and compute fit statistics and ROC curves without refitting the model. Information for a fitted model can be saved to a SAS data set with the OUTMODEL= option, while the INMODEL= option inputs the model information required for scoring.
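A sketch of the fit-then-score workflow (all data set and variable names are hypothetical):

```sas
proc logistic data=train outmodel=mfit;   /* save fitted-model information */
   model y(event='1') = x1 x2;
run;

proc logistic inmodel=mfit;               /* read model back in; no refitting */
   score data=new out=scored;             /* score a new data set */
run;
```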

The new STRATA statement enables you to perform conditional logistic regression on highly stratified data using the method of Gail, Lubin, and Rubinstein (1981). The OFFSET option is now enabled for logistic regression.

The LOGISTIC procedure now forms classification groups using the full formatted length of the CLASS variable levels.

Several new CLASS parameterizations are available: ordinal, orthogonal effect, orthogonal reference, and orthogonal ordinal.

You can now output the design matrix using the new OUTDESIGN= option.

The definition of concordance has been changed to make it more meaningful for ordinal models. The new definition is consistent with that used in previous releases for the binary response model.

Enhancements for the exact computations include:

- improved performance
- Monte Carlo method
- mid-*p* confidence intervals

For an exact conditional analysis, specifying the STRATA statement performs an efficient stratified analysis. The method of Mehta, Patel, and Senchaudhuri (1992), which is more efficient than the Hirji, Tsiatis, and Mehta (1989) algorithm for many problems, is now available with the METHOD=NETWORK option.
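A sketch of an exact stratified analysis with the network algorithm (the matched-pairs data set and variable names are hypothetical):

```sas
proc logistic data=matched exactonly;     /* skip the unconditional fit */
   strata pairid;                         /* efficient stratified exact analysis */
   model case(event='1') = exposure;
   exact exposure / estimate=both method=network;  /* Mehta et al. (1992) */
run;
```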

The INITIAL= option in the EM statement of the MI procedure sets the initial estimates for the EM algorithm. Either the means and covariances from complete cases or the means and standard deviations from available cases can be used as the initial estimates for the EM algorithm. You can also specify the correlations for the initial estimates from available cases.

For data sets with monotone missingness, the REGPMM option in the MONOTONE statement uses the predictive mean matching method to impute a value randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model.

You can specify more than one method in the MONOTONE statement, and for each imputed variable, the covariates can be specified separately.

The DETAILS option in the MONOTONE statement requests the display of the model parameters used for each imputation.

The experimental CLASS statement is now available to specify categorical variables. These classification variables are used either as covariates for imputed variables or as imputed variables for data sets with monotone missing patterns.

The experimental options LOGISTIC and DISCRIM in the MONOTONE statement impute missing categorical variables by logistic and discriminant methods, respectively.
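The MONOTONE features above might be combined as in this sketch; the data set, variables, and choice of methods are all illustrative, and the CLASS statement and LOGISTIC method are experimental:

```sas
proc mi data=mono nimpute=5 out=outmi;  /* 'mono' has a monotone missing pattern */
   class c;                             /* experimental CLASS statement */
   monotone reg(y2 = y1)                /* a different method per imputed variable */
            regpmm(y3 = y1 y2)          /* predictive mean matching */
            logistic(c = y1 y2 y3);     /* experimental: impute a CLASS variable */
   var y1 y2 y3 c;
run;
```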

You can now specify the PARMS= data set in the MIANALYZE procedure without specifying either the COVB= or XPXI= option when the data set contains the standard errors for the parameter estimates.

The DATA= option includes data sets that contain both parameter estimates and their associated standard errors in each observation of the data set.

The BCOV, WCOV, and TCOV options control the display of the between-imputation, within-imputation, and total covariance matrices.

A TEST statement tests linear hypotheses about the parameters, *L β = c*. For each TEST statement, the procedure combines the estimate and associated standard error for each linear component (a row of *L*). It can also combine the estimates and associated covariance matrix for all linear components.

The MODELEFFECTS statement lists the effects in the data set to be analyzed. Each effect is a variable or a combination of variables, and is specified with a special notation using variable names and operators. The STDERR statement lists the standard errors associated with the effects in the MODELEFFECTS statement when both parameter estimates and standard errors are saved as variables in the same DATA= data set.
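A sketch of these two statements; the data set and every variable name are hypothetical, assuming one observation per imputation that carries both the estimates and their standard errors:

```sas
proc mianalyze data=parmests;          /* each observation: estimates + standard errors */
   modeleffects intercept age bmi;     /* effect variables in the data set */
   stderr       sintercept sage sbmi;  /* matching standard-error variables */
run;
```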

The experimental CLASS statement specifies categorical variables. PROC MIANALYZE reads and combines parameter estimates and covariance matrices for parameters with CLASS variables.

The MIXED procedure now supports geometrically anisotropic covariance structures and covariance models in the Matérn class. The LCOMPONENTS option in the MODEL statement produces one-degree-of-freedom tests for fixed effects that correspond to individual estimable functions for Type I, II, and III effects.

The experimental RESIDUAL option of the MODEL statement computes Pearson-type and (internally) studentized residuals. The experimental INFLUENCE option in the MODEL statement computes influence diagnostics by noniterative or iterative methods. Experimental ODS graphics display the results for both of these options.

The new D option in the NPAR1WAY procedure provides the one-sided *D*+ and *D*− statistics for the asymptotic two-sample Kolmogorov-Smirnov test, in addition to the two-sided *D* statistic given by the EDF option. The KS option in the EXACT statement gives exact tests for the Kolmogorov-Smirnov *D*, *D*+, and *D*− statistics for two-sample problems.

The new WEIGHT statement in the PHREG procedure enables you to specify case weights when you are using the BRESLOW or EFRON method for handling ties. Robust sandwich variance estimators of Binder (1992) are computed for the estimated regression parameters. You can specify the NORMALIZE option to normalize the weights so that they add up to the actual sample size.
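A sketch of the WEIGHT statement (data set and variable names are hypothetical):

```sas
proc phreg data=study;                      /* 'study', 't', 'status', etc. are illustrative */
   model t*status(0) = trt age / ties=efron;
   weight w / normalize;                    /* case weights, scaled to the sample size;
                                               Binder (1992) sandwich variances are used */
run;
```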

Two options have been added to the TEST statement: AVERAGE and E. The AVERAGE option enables you to compute a combined estimate of all the effects in the given TEST statement. This option gives you an easy way to carry out inferences of the common value of (say) the treatment effects had they been assumed equal. The E option specifies that the linear coefficients and constants be printed. When the AVERAGE option is specified along with the E option, the optimal weights of the average effect are also printed in the same tables as the coefficients.

The recurrence algorithm of Gail, Lubin, and Rubinstein (1981) for computing the exact discrete partial likelihood and its partial derivatives has been modified to use the logarithmic scale. This enables a much larger number of ties to be handled without the numerical problems of overflow and underflow.

You can use the PHREG procedure to fit the rate/mean model for recurrent events data and obtain predictions of the cumulative mean function for a given pattern of fixed covariates.

As an experimental feature, the PHREG procedure now can produce model assessments based on cumulative residuals.

The POWER procedure performs prospective analyses for a variety of goals such as the following:

- determining the sample size required to get a significant result with adequate probability (power)
- characterizing the power of a study to detect a meaningful effect
- conducting what-if analyses to assess sensitivity of the power or required sample size to other factors

This procedure covers a variety of statistical analyses such as *t* tests, equivalence tests,
and confidence intervals for means; exact binomial, chi-square, Fisher's exact, and McNemar
tests for proportions; multiple regression and correlation; one-way analysis of variance;
and rank tests for comparing survival curves.
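For example, a two-sample *t*-test sample size computation might be sketched as follows (the effect size and standard deviation are illustrative):

```sas
proc power;
   twosamplemeans
      meandiff = 5        /* difference worth detecting (assumed) */
      stddev   = 8        /* conjectured common standard deviation */
      power    = 0.9
      ntotal   = .;       /* solve for total sample size */
run;
```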

The POWER procedure is one of several tools available in SAS/STAT software for power and sample size analysis. PROC GLMPOWER covers more complex linear models, and the Power and Sample Size Application provides a user interface and implements many of the analyses supported in the procedures.

The Power and Sample Size Application (PSS) is an interface that provides power and sample size
computations. The application includes tasks for determining sample size and power for a variety of
statistical analyses, including *t*-tests, ANOVA, proportions, equivalence testing,
linear models, survival analysis, and table statistics. The application provides
multiple input parameter options, stores results in a project format, displays power curves,
and produces appropriate narratives for the results. Note that this application is included with
SAS/STAT software but needs to be installed from the Mid Tier CD.

The ROBUSTREG procedure provides resistant (stable) results in the presence of outliers by limiting the influence of outliers. In statistical applications of outlier detection and robust regression, the methods most commonly used today are Huber (1973) M estimation, high breakdown value estimation, and combinations of these two methods. The ROBUSTREG procedure provides four such methods: M estimation, LTS estimation, S estimation, and MM estimation. With these four methods, the ROBUSTREG procedure acts as an integrated tool for outlier detection and robust regression with various contaminated data. The ROBUSTREG procedure is scalable such that it can be used for applications in data cleansing and data mining.

The SURVEYFREQ procedure produces one-way to *n*-way frequency and crosstabulation tables for survey data. These tables include estimates of totals and proportions (overall, row percentages, column percentages) and the corresponding standard errors. Like the other survey procedures, PROC SURVEYFREQ computes these variance estimates based on the sample design used to obtain the survey data. The design can be a complex sample survey design with stratification, clustering, and unequal weighting. PROC SURVEYFREQ also provides design-based tests of association between variables.
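A sketch of a SURVEYFREQ step for a stratified, clustered, weighted design; the data set and all variable names are hypothetical:

```sas
proc surveyfreq data=svy;        /* 'svy' and all variables are illustrative */
   strata region;                /* stratification */
   cluster school;               /* clustering */
   weight samplewt;              /* unequal weighting */
   tables grade*plan / row col chisq;  /* percentages, standard errors, design-based test */
run;
```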

The SURVEYLOGISTIC procedure performs logistic regression on data that arise from a survey sampling scheme. PROC SURVEYLOGISTIC incorporates complex survey sample designs, including designs with stratification, clustering, and unequal weighting, in its estimation process. Variances of the regression parameters and odds ratios are computed using a Taylor expansion approximation. The SURVEYLOGISTIC procedure is similar in syntax to the LOGISTIC procedure, and it can fit link functions such as the logit, cumulative logit, generalized logit, probit, and complementary log-log functions. Maximum likelihood estimation of the regression coefficients is carried out with either the Fisher-scoring algorithm or the Newton-Raphson algorithm.

The STACKING option in the SURVEYMEANS procedure requests that the output data sets use a stacking table structure, which was the default in earlier releases. The new default is a rectangular table structure in the output data sets. The STACKING option affects the Domain, Ratio, Statistics, and StrataInfo tables.

One-sided confidence limits are now available for descriptive statistics in PROC SURVEYMEANS.

The SURVEYREG procedure now provides the ability to form classification groups using the full formatted length of the CLASS variable levels, instead of just the first 16 characters of the levels. The ANOVA option in the MODEL statement requests that the ANOVA table be included in the output.

The OUTALL option in the SURVEYSELECT procedure produces an output data set that includes all observations from the DATA= input data set, both those selected for the sample and those not selected. With the OUTALL option, the OUT= data set contains a variable Selected that indicates whether the observation was selected. The OUTALL option is available for equal probability selection methods (METHOD=SRS, URS, SYS, and SEQ).

The SELECTALL option includes all stratum observations in the sample when the stratum sample size exceeds the number of observations in the stratum. The SELECTALL option is available for without-replacement selection methods (METHOD=SRS, SYS, SEQ, PPS, and PPS_SAMPFORD). It is not available for with-replacement or with-minimum-replacement methods, or for those PPS methods that select two units per stratum.

The OUTSEED option includes the initial seed for each stratum in the output data set. Additionally, you can input initial seeds by strata with the SEED=SAS-data-set option.
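The OUTALL and OUTSEED options might be used together as in this sketch (data set, strata variable, sample size, and seed are all illustrative):

```sas
proc surveyselect data=frame method=srs n=50
                  out=samp outall          /* keep selected AND unselected obs,
                                              flagged by the Selected variable */
                  outseed                  /* record the initial seed per stratum */
                  seed=39647;              /* illustrative seed value */
   strata region;
run;
```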

The experimental TPHREG procedure adds the CLASS statement to the PHREG procedure. The CLASS statement enables you to specify categorical variables (also known as CLASS variables) as explanatory variables. Explanatory effects for the model, including covariates, main effects, interactions, and nested effects, can be specified in the same way as in the GLM procedure. The CLASS statement supports less-than-full-rank parameterization as well as various full-rank parameterizations such as reference coding and effect coding. Other CLASS statement features that are found in PROC LOGISTIC, such as specifying specific categories as reference levels, are also available.

The TPHREG procedure also enables you to specify CONTRAST statements for testing customized hypotheses concerning the regression parameters. Each CONTRAST statement also provides estimation of individual rows of contrasts, which is particularly useful in comparing the hazards between the categories of a CLASS explanatory variable.

The COEF option in the OUTPUT statement of the TPSPLINE procedure enables you to output coefficients of the fitted function.

The TRANSREG procedure has new CENTER and Z transformation options for centering and standardizing variables before the transformations. The new EXKNOTS= option specifies exterior knots for SPLINE and MSPLINE transformations and BSPLINE expansions.

The new algorithm option INDIVIDUAL with METHOD=MORALS fits each model for each dependent variable individually and independently of the other dependent variables.

With hypothesis tests, the TRANSREG procedure now produces a table with the number of observations, and, when there are CLASS variables, a class level information table.

Agresti, A. (1996), *An Introduction to Categorical Data Analysis*, New York: John Wiley & Sons, Inc.

Binder, D.A. (1983), "On the Variances of Asymptotically Normal Estimators from Complex Surveys," *International Statistical Review*, 51, 279-292.

Binder, D.A. (1992), "Fitting Cox's Proportional Hazards Models from Survey Data," *Biometrika*, 79, 139-147.

Cameron, A.C. and Trivedi, P.K. (1998), *Regression Analysis of Count Data*, Cambridge: Cambridge University Press.

Cohen, R. (2002), "SAS Meets Big Iron: High Performance Computing in SAS Analytical Procedures," *Proceedings of the Twenty-seventh Annual SAS Users Group International Conference*.

Gail, M.H., Lubin, J.H., and Rubinstein, L.V. (1981), "Likelihood Calculations for Matched Case-Control Studies and Survival Studies with Tied Survival Times," *Biometrika*, 78, 703-707.

Hall, W.J. and Wellner, J.A. (1980), "Confidence Bands for a Survival Curve for Censored Data," *Biometrika*, 69, 133-143.

Hirji, K.F., Mehta, C.R., and Patel, N.R. (1987), "Computing Distributions for Exact Logistic Regression," *Journal of the American Statistical Association*, 82, 1110-1117.

Hirji, K.F., Tsiatis, A.A., and Mehta, C.R. (1989), "Median Unbiased Estimation for Binary Data," *American Statistician*, 43, 7-11.

Huber, P.J. (1973), "Robust Regression: Asymptotics, Conjectures and Monte Carlo," *Annals of Statistics*, 1, 799-821.

Mehta, C.R., Patel, N., and Senchaudhuri, P. (1992), "Exact Stratified Linear Rank Tests for Ordered Categorical and Binary Data," *Journal of Computational and Graphical Statistics*, 1, 21-40.

Nair, V.N. (1984), "Confidence Bands for Survival Functions with Censored Data: A Comparative Study," *Technometrics*, 14, 265-275.

Tarone, R. (1985), "On Heterogeneity Tests Based on Efficient Scores," *Biometrika*, 72, 91-95.