Statistics / Biostatistics Papers A-Z

A
Paper 1603-2015:
A Model Family for Hierarchical Data with Combined Normal and Conjugate Random Effects
Non-Gaussian outcomes are often modeled using members of the so-called exponential family. Well-known members are the Bernoulli model for binary data, leading to logistic regression, and the Poisson model for count data, leading to Poisson regression. Two of the main reasons for extending this family are (1) the occurrence of overdispersion, meaning that the variability in the data is not adequately described by the models, which often exhibit a prescribed mean-variance link, and (2) the accommodation of hierarchical structure in the data, stemming from clustering in the data which, in turn, might result from repeatedly measuring the outcome, for various members of the same family, and so on. The first issue is dealt with through a variety of overdispersion models such as the beta-binomial model for grouped binary data and the negative-binomial model for counts. Clustering is often accommodated through the inclusion of random subject-specific effects. Though not always, one conventionally assumes such random effects to be normally distributed. While both of these phenomena might occur simultaneously, models combining them are uncommon. This paper proposes a broad class of generalized linear models accommodating overdispersion and clustering through two separate sets of random effects. We place particular emphasis on so-called conjugate random effects at the level of the mean for the first aspect and normal random effects embedded within the linear predictor for the second aspect, even though our family is more general. The binary, count, and time-to-event cases are given particular emphasis. Apart from model formulation, we present an overview of estimation methods, and then settle for maximum likelihood estimation with analytic-numerical integration. Implications for the derivation of marginal correlation functions are discussed.
The methodology is applied to data from a study of epileptic seizures, a clinical trial for a toenail infection named onychomycosis, and survival data in children with asthma.
Read the paper (PDF). | Watch the recording.
Geert Molenberghs, Universiteit Hasselt & KU Leuven
Paper 2141-2015:
A SAS® Macro to Compare Predictive Values of Diagnostic Tests
Medical tests are used for various purposes, including diagnosis, prognosis, risk assessment, and screening. Statistical methodology is often used to evaluate such tests; the most frequently used measures for binary data are sensitivity, specificity, and positive and negative predictive values. An important goal in diagnostic medicine research is to estimate and compare the accuracies of such tests. In this paper I give a gentle introduction to measures of diagnostic test accuracy and introduce a SAS® macro to calculate the generalized score statistic and the weighted generalized score statistic for comparison of predictive values, using formulas generalized and proposed by Andrzej S. Kosinski.
Read the paper (PDF).
Lovedeep Gondara, University of Illinois Springfield
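The four accuracy measures named in the abstract of Paper 2141-2015 can be illustrated with a minimal Python sketch. This is not the paper's SAS macro and it does not compute the generalized score statistics; the function name `diagnostic_measures` and the counts are hypothetical, chosen only for illustration.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Return the four standard accuracy measures from 2x2 table counts.

    tp/fn: diseased subjects testing positive/negative
    fp/tn: non-diseased subjects testing positive/negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # P(test+ | disease)
        "specificity": tn / (tn + fp),  # P(test- | no disease)
        "ppv": tp / (tp + fp),          # P(disease | test+)
        "npv": tn / (tn + fn),          # P(no disease | test-)
    }


# Hypothetical counts, for illustration only
measures = diagnostic_measures(tp=90, fp=30, fn=10, tn=170)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, the predictive values depend on disease prevalence in the sample, which is one reason comparing them between two tests calls for dedicated statistics.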
Paper 2980-2015:
A Set of SAS® Macros for Generating Survival Analysis Reports for Lifetime Data with or without Competing Risks
The paper introduces users to how they can use a set of SAS® macros, %LIFETEST and %LIFETESTEXPORT, to generate survival analysis reports for data with or without competing risks. The macros provide a wrapper of PROC LIFETEST and an enhanced version of the SAS autocall macro %CIF to give users an easy-to-use interface to report both survival estimates and cumulative incidence estimates in a unified way. The macros also provide a number of parameters to enable users to flexibly adjust how the final reports should look without the need to manually input or format the final reports.
Read the paper (PDF). | Download the data file (ZIP).
Zhen-Huan Hu, Medical College of Wisconsin
Paper 3311-2015:
Adaptive Randomization Using PROC MCMC
Based on work by Thall et al. (2012), we implement a method for randomizing patients in a Phase II trial. We accumulate evidence that identifies which dose(s) of a cancer treatment provide the most desirable profile, per a matrix of efficacy and toxicity combinations rated by expert oncologists (0-100). Experts also define the region of Good utility scores and criteria of dose inclusion based on toxicity and efficacy performance. Each patient is rated for efficacy and toxicity at a specified time point. Simulation work is done mainly using PROC MCMC in which priors and likelihood function for joint outcomes of efficacy and toxicity are defined to generate posteriors. Resulting joint probabilities for doses that meet the inclusion criteria are used to calculate the mean utility and probability of having Good utility scores. Adaptive randomization probabilities are proportional to the probabilities of having Good utility scores. A final decision of the optimal dose will be made at the end of the Phase II trial.
Read the paper (PDF).
Qianyi Huang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
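The randomization rule described in the abstract of Paper 3311-2015 (probabilities proportional to each dose's probability of a Good utility score, restricted to doses that meet the inclusion criteria) can be sketched in Python as follows. The posterior probabilities would come from the PROC MCMC output in the paper; here they are hypothetical numbers, and the function name `randomization_probs` is an assumption made for illustration.

```python
def randomization_probs(p_good, admissible):
    """Normalize P(Good utility) over the admissible doses.

    p_good:     posterior probability of a Good utility score per dose
    admissible: True where the dose meets the inclusion criteria
    """
    weights = [p if ok else 0.0 for p, ok in zip(p_good, admissible)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no admissible dose with positive probability")
    return [w / total for w in weights]


# Hypothetical values for four doses; the last dose is excluded
probs = randomization_probs([0.10, 0.30, 0.40, 0.20],
                            [True, True, True, False])
print(probs)
```

Excluded doses receive probability zero, and the remaining probabilities are rescaled so that they sum to one.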
Paper SAS1919-2015:
Advanced Techniques for Fitting Mixed Models Using SAS/STAT® Software
Fitting mixed models to complicated data, such as data that include multiple sources of variation, can be a daunting task. SAS/STAT® software offers several procedures and approaches for fitting mixed models. This paper provides guidance on how to overcome obstacles that commonly occur when you fit mixed models using the MIXED and GLIMMIX procedures. Examples are used to showcase procedure options and programming techniques that can help you overcome difficult data and modeling situations.
Read the paper (PDF).
Kathleen Kiernan, SAS
Paper 3412-2015:
Alternative Methods of Regression When Ordinary Least Squares Regression Is Not Right
Ordinary least squares regression is one of the most widely used statistical methods. However, it is a parametric model and relies on assumptions that are often not met. Alternative methods of regression for continuous dependent variables relax these assumptions in various ways. This paper explores procedures such as QUANTREG, ADAPTIVEREG, and TRANSREG for these kinds of data.
Read the paper (PDF). | Watch the recording.
Peter Flom, Peter Flom Consulting
Paper 3300-2015:
An Empirical Comparison of Multiple Imputation Approaches for Treating Missing Data in Observational Studies
Missing data are a common and significant problem that researchers and data analysts encounter in applied research. Because most statistical procedures require complete data, missing data can substantially affect the analysis and the interpretation of results if left untreated. Methods to treat missing data have been developed so that missing values are imputed and analyses can be conducted using standard statistical procedures. Among these missing data methods, multiple imputation has received considerable attention and its effectiveness has been explored (for example, in the context of survey and longitudinal research). This paper compares four multiple imputation approaches for treating missing continuous covariate data under MCAR, MAR, and NMAR assumptions, in the context of propensity score analysis and observational studies. The comparison of the four MI approaches in terms of bias in parameter estimates, Type I error rates, and statistical power is presented. In addition, complete case analysis (listwise deletion) is presented as the default analysis that would be conducted if missing data are not treated. Issues are discussed, and conclusions and recommendations are provided.
Read the paper (PDF).
Patricia Rodriguez de Gil, University of South Florida
Shetay Ashford, University of South Florida
Chunhua Cao, University of South Florida
Eun-Sook Kim, University of South Florida
Rheta Lanehart, University of South Florida
Reginald Lee, University of South Florida
Jessica Montgomery, University of South Florida
Yan Wang, University of South Florida
Paper SAS2082-2015:
Analyzing Messy and Wide Data on a Desktop
Data comes from a rich variety of sources in a rich variety of types, shapes, sizes, and properties. The analysis can be challenged by data that is too tall or too wide; too full of miscodings, outliers, or holes; or that contains funny data types. Wide data, in particular, has many challenges, requiring the analysis to adapt with different methods. Making covariance matrices with 2.5 billion elements is just not practical. JMP® 12 will address these challenges.
Read the paper (PDF).
John Sall, SAS
Paper SAS1332-2015:
Analyzing Spatial Point Patterns Using the New SPP Procedure
In many spatial analysis applications (including crime analysis, epidemiology, ecology, and forestry), spatial point process modeling can help you study the interaction between different events and help you model the process intensity (the rate of event occurrence per unit area). For example, crime analysts might want to estimate where crimes are likely to occur in a city and whether they are associated with locations of public features such as bars and bus stops. Forestry researchers might want to estimate where trees grow best and test for association with covariates such as elevation and gradient. This paper describes the SPP procedure, new in SAS/STAT® 13.2, for exploring and modeling spatial point pattern data. It describes methods that PROC SPP implements for exploratory analysis of spatial point patterns and for log-linear intensity modeling that uses covariates. It also shows you how to use specialized functions for studying interactions between points and how to use specialized analytical graphics to diagnose log-linear models of spatial intensity. Crime analysis, forestry, and ecology examples demonstrate key features of PROC SPP.
Read the paper (PDF).
Pradeep Mohan, SAS
Randy Tobias, SAS
Paper 3050-2015:
Automation of Statistics Summary for Census Data in SAS®
Census data, such as education and income, has been extensively used for various purposes. The data is usually collected in percentages of census unit levels, based on the population sample. Such presentation of the data makes it hard to interpret and compare. A more convenient way of presenting the data is to use the geocoded percentage to produce counts for a pseudo-population. We developed a very flexible SAS® macro to automatically generate the descriptive summary tables for the census data as well as to conduct statistical tests to compare the different levels of the variable by groups. The SAS macro is not only useful for census data but can be used to generate summary tables for any data with percentages in multiple categories.
Read the paper (PDF).
Janet Lee, Kaiser Permanente Southern California
C
Paper 2080-2015:
Calculate Decision Consistency Statistics for a Single Administration of a Test
Many certification programs classify candidates into performance levels. For example, the SAS® Certified Base Programmer breaks down candidates into two performance levels: Pass and Fail. It is important to note that because all test scores contain measurement error, the performance level categorizations based on those test scores are also subject to measurement error. An important part of psychometric analysis is to estimate the decision consistency of the classifications. This study helps fill a gap in estimating decision consistency statistics for a single administration of a test using SAS.
Read the paper (PDF).
Fan Yang, The University of Iowa
Yi Song, University of Illinois at Chicago
Paper 3148-2015:
Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials
The success of an experimental study almost always hinges on how you design it. Does it provide estimates for everything you're interested in? Does it take all the experimental constraints into account? Does it make efficient use of limited resources? The OPTEX procedure in SAS/QC® software enables you to focus on specifying your interests and constraints, and it takes responsibility for handling them efficiently. With PROC OPTEX, you skip the step of rifling through tables of standard designs to try to find the one that's right for you. You concentrate on the science and the analytics and let SAS® do the computing. This paper reviews the features of PROC OPTEX and shows them in action using examples from field trials and food science experimentation. PROC OPTEX is a useful tool for all these situations, doing the designing and freeing the scientist to think about the food and the biology.
Read the paper (PDF). | Download the data file (ZIP).
Cliff Pereira, Dept of Statistics, Oregon State University
Randy Tobias, SAS
Paper 2380-2015:
Chi-Square and T Tests Using SAS®: Performance and Interpretation
Data analysis begins with cleaning up data, calculating descriptive statistics, and examining variable distributions. Before more rigorous statistical analysis begins, many statisticians perform basic inferential statistical tests such as chi-square and t tests to assess unadjusted associations. These tests help guide the direction of the more rigorous statistical analysis. How to perform chi-square and t tests is presented. We explain how to interpret the output and where to look for the association or difference based on the hypothesis being tested. We propose the next steps for further analysis using example data.
Read the paper (PDF).
Maribeth Johnson, Georgia Regents University
Jennifer Waller, Georgia Regents University
Paper 3291-2015:
Coding Your Own MCMC Algorithm
In Bayesian statistics, Markov chain Monte Carlo (MCMC) algorithms are an essential tool for sampling from probability distributions. PROC MCMC is useful for these algorithms. However, it is often desirable to code an algorithm from scratch. This need is especially common in academia, where students are expected to be able to understand and code an MCMC algorithm. The ability of SAS® to accomplish this is relatively unknown yet quite straightforward. We use SAS/IML® to demonstrate methods for coding an MCMC algorithm with examples of a Gibbs sampler and Metropolis-Hastings random walk.
Read the paper (PDF).
Chelsea Lofland, University of California Santa Cruz
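As a rough illustration of what coding an MCMC algorithm from scratch involves (Paper 3291-2015), here is a minimal random-walk Metropolis sampler in Python. It is not the paper's SAS/IML code. The standard normal target, the step size, and the function names are assumptions chosen to keep the sketch short.

```python
import math
import random


def log_target(x):
    # Unnormalized log density of a standard normal (illustrative target)
    return -0.5 * x * x


def random_walk_metropolis(n_iter, step=1.0, x0=0.0, seed=1):
    """Draw n_iter correlated samples using a Gaussian random-walk proposal."""
    rng = random.Random(seed)
    x = x0
    draws = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Symmetric proposal: accept with probability min(1, target ratio)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        draws.append(x)  # a rejected proposal repeats the current state
    return draws


draws = random_walk_metropolis(20000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
```

The sample mean and variance should land near 0 and 1. In practice one would also discard a burn-in period and check convergence diagnostics.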
Paper 1424-2015:
Competing Risk Survival Analysis Using SAS®: When, Why, and How
Competing risks arise in time-to-event data when the event of interest cannot be observed because a competing event occurs first. For example, when the event of interest is death from a specific cause, death from any other cause is a competing event; when the focus is on relapse, death before relapse is a competing event. It is well established that in the presence of competing risks, the standard product-limit methods yield biased results because their basic assumption is violated. The effect of competing events on parameter estimation depends on their distribution and frequency. Fine and Gray's subdistribution hazard model, which is available in PROC PHREG with the release of version 9.4 of SAS® software, can be used in the presence of competing events.
Read the paper (PDF).
Lovedeep Gondara, University of Illinois Springfield
Paper 3249-2015:
Cutpoint Determination Methods in Survival Analysis Using SAS®: Updated %FINDCUT Macro
Statistical analyses that use data from clinical or epidemiological studies include continuous variables such as patient age, blood pressure, and various biomarkers. Over the years, there has been an increase in studies that focus on assessing associations between biomarkers and disease of interest. Many of the biomarkers are measured as continuous variables. Investigators seek to identify the possible cutpoint to classify patients as high risk versus low risk based on the value of the biomarker. Several data-oriented techniques such as median and upper quartile, and outcome-oriented techniques based on score, Wald, and likelihood ratio tests are commonly used in the literature. Contal and O'Quigley (1999) presented a technique that used the log-rank test statistic in order to estimate the cutpoint. Their method was computationally intensive and hence was overlooked due to the unavailability of built-in options in standard statistical software. In 2003, we provided the %FINDCUT macro that used Contal and O'Quigley's approach to identify a cutpoint when the outcome of interest was measured as time to event. Over the past decade, demand for this macro has continued to grow, which has led us to consider updating the %FINDCUT macro to incorporate new tools and procedures from SAS® such as array processing, Graph Template Language, and the REPORT procedure. New and updated features include: results presented in a much cleaner report format, user-specified cutpoints, macro parameter error checking, temporary data set cleanup, preserving current option settings, and increased processing speed. We present the utility and added options of the revised %FINDCUT macro using a real-life data set. In addition, we critically compare this method to some of the existing methods and discuss the use and misuse of categorizing a continuous covariate.
Read the paper (PDF).
Jay Mandrekar, Mayo Clinic
Jeffrey Meyers, Mayo Clinic
D
Paper 2442-2015:
Don't Copy and Paste--Use BY Statement Processing with ODS to Make Your Summary Tables
Most manuscripts in medical journals contain summary tables that combine simple summaries and between-group comparisons. These tables typically combine estimates for categorical and continuous variables. The statistician generally summarizes the data using the FREQ procedure for categorical variables and compares percentages between groups using a chi-square or a Fisher's exact test. For continuous variables, the MEANS procedure is used to summarize data as either means and standard deviation or medians and quartiles. Then these statistics are generally compared between groups by using the GLM procedure or NPAR1WAY procedure, depending on whether one is interested in a parametric test or a non-parametric test. The outputs from these different procedures are then combined and presented in a concise format ready for publications. Currently there is no straightforward way in SAS® to build these tables in a presentable format that can then be customized to individual tastes. In this paper, we focus on presenting summary statistics and results from comparing categorical variables between two or more independent groups. The macro takes the dataset, the number of treatment groups, and the type of test (either chi-square or Fisher's exact) as input and presents the results in a publication-ready table. This macro automates summarizing data to a certain extent and minimizes risky typographical errors when copying results or typing them into a table.
Read the paper (PDF).
Jeff Gossett, University of Arkansas for Medical Sciences
Mallikarjuna Rettiganti, UAMS
Paper 3381-2015:
Double Generalized Linear Models Using SAS®: The %DOUBLEGLM Macro
The purpose of this paper is to introduce a SAS® macro named %DOUBLEGLM that enables users to model the mean and dispersion jointly using double generalized linear models described in Nelder (1991) and Lee (1998). The R functions FITJOINT and DGLM (R Development Core Team, 2011) were used to verify the suitability of the %DOUBLEGLM macro estimates. The results showed that the macro's estimates were close to those of the R functions.
Read the paper (PDF). | Download the data file (ZIP).
Paulo Silva, Universidade de Brasilia
Alan Silva, Universidade de Brasilia
E
Paper SAS1775-2015:
Encore: Introduction to Bayesian Analysis Using SAS/STAT®
The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS® software has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is designed for general Bayesian modeling. This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then uses examples from the GENMOD and PHREG procedures to describe the built-in Bayesian capabilities, which became available for all platforms in SAS/STAT® 9.3. Discussion includes how to specify prior distributions, evaluate convergence diagnostics, and interpret the posterior summary statistics.
Read the paper (PDF).
Maura Stokes, SAS
Paper 3242-2015:
Entropy-Based Measures of Weight of Evidence and Information Value for Variable Reduction and Segmentation for Continuous Dependent Variables
My SAS® Global Forum 2013 paper 'Variable Reduction in SAS® by Using Weight of Evidence (WOE) and Information Value (IV)' has become the most sought-after online article on variable reduction in SAS since its publication. But the methodology provided by the paper is limited to reduction of numeric variables for logistic regression only. Built on a similar process, the current paper adds several major enhancements: 1) The use of WOE and IV has been expanded to the analytics and modeling for continuous dependent variables. After the standardization of a continuous outcome, all records can be divided into two groups: positive performance (outcome y above sample average) and negative performance (outcome y below sample average). This treatment is rigorously consistent with the concept of entropy in Information Theory: the juxtaposition of two opposite forces in one equation, and a stronger contrast between the two suggests a higher intensity, that is, more information delivered by the variable in question. As the standardization keeps the outcome variable continuous and quantified, the revised formulas for WOE and IV can be used in the analytics and modeling for continuous outcomes such as sales volume, claim amount, and so on. 2) Categorical and ordinal variables can be assessed together with numeric ones. 3) Users of big data usually need to evaluate hundreds or thousands of variables, but it is not uncommon that over 90% of variables contain little useful information. We have added a SAS macro that trims these variables efficiently in a broad-brushed manner without a thorough examination. Afterward, we examine the retained variables more carefully on their behaviors to the target outcome. 4) We add Chi-Square analysis for categorical/ordinal variables and Gini coefficients for numeric variables in order to provide additional suggestions for segmentation and regression.
With the above enhancements added, a SAS macro program is provided at the end of the paper as a complete suite for variable reduction/selection that efficiently evaluates all variables together. The paper provides a detailed explanation for how to use the SAS macro and how to read the SAS outputs that provide useful insights for subsequent linear regression, logistic regression, or scorecard development.
Read the paper (PDF).
Alec Zhixiao Lin, PayPal Credit
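For readers unfamiliar with the two quantities named in Paper 3242-2015, the sketch below computes the classic two-group weight of evidence (WOE) and information value (IV) in Python. It does not reproduce the paper's revised formulas for continuous outcomes, which the abstract does not state. The function name `woe_iv`, the sign convention (log of the positive share over the negative share), and the bin counts are assumptions for illustration.

```python
import math


def woe_iv(bins):
    """Classic two-group WOE and IV.

    bins: list of (n_positive, n_negative) counts per bin of a predictor;
          every count is assumed to be greater than zero.
    Returns (list of WOE values per bin, total IV).
    """
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)
    woes = []
    iv = 0.0
    for p, n in bins:
        share_pos = p / total_pos  # share of all positives in this bin
        share_neg = n / total_neg  # share of all negatives in this bin
        woe = math.log(share_pos / share_neg)
        woes.append(woe)
        iv += (share_pos - share_neg) * woe
    return woes, iv


# Hypothetical two-bin predictor that separates the groups well
woes, iv = woe_iv([(80, 20), (20, 80)])
print(woes, iv)
```

A predictor whose bins carry the same share of positives and negatives yields WOE values of zero and an IV of zero, so a larger IV indicates a stronger contrast between the two groups.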
Paper SAS1911-2015:
Equivalence and Noninferiority Testing Using SAS/STAT® Software
Proving difference is the point of most statistical testing. In contrast, the point of equivalence and noninferiority tests is to prove that results are substantially the same, or at least not appreciably worse. An equivalence test can show that a new treatment, one that is less expensive or causes fewer side effects, can replace a standard treatment. A noninferiority test can show that a faster manufacturing process creates no more product defects or industrial waste than the standard process. This paper reviews familiar and new methods for planning and analyzing equivalence and noninferiority studies in the POWER, TTEST, and FREQ procedures in SAS/STAT® software. Techniques that are discussed range from Schuirmann's classic method of two one-sided tests (TOST) for demonstrating similar normal or lognormal means in bioequivalence studies, to Farrington and Manning's noninferiority score test for showing that an incidence rate (such as a rate of mortality, side effects, or product defects) is no worse. Real-world examples from clinical trials, drug development, and industrial process design are included.
Read the paper (PDF).
John Castelloe, SAS
Donna Watts, SAS
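The logic of two one-sided tests (TOST) mentioned in the abstract of Paper SAS1911-2015 can be sketched in a few lines of Python. This sketch swaps in a large-sample normal (z) approximation to stay within the standard library; Schuirmann's method as used in practice relies on t tests, and the paper itself uses SAS/STAT procedures. The function name `tost_z`, the equivalence bounds, and the input values are hypothetical.

```python
from statistics import NormalDist


def tost_z(diff, se, lower, upper, alpha=0.05):
    """Two one-sided z tests for equivalence of a mean difference.

    Equivalence is concluded only if BOTH one-sided nulls are rejected:
      H0a: true difference <= lower    H0b: true difference >= upper
    Returns (overall p-value, equivalence decision).
    """
    std_normal = NormalDist()
    p_lower = 1.0 - std_normal.cdf((diff - lower) / se)  # test against lower bound
    p_upper = std_normal.cdf((diff - upper) / se)        # test against upper bound
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical estimates with equivalence bounds of -1 and 1
p_close, equivalent_close = tost_z(diff=0.1, se=0.2, lower=-1.0, upper=1.0)
p_far, equivalent_far = tost_z(diff=0.9, se=0.2, lower=-1.0, upper=1.0)
print(p_close, equivalent_close)
print(p_far, equivalent_far)
```

Taking the larger of the two one-sided p-values reflects that the estimate must be shown to lie above the lower bound and below the upper bound at the same time.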
Paper 3384-2015:
Exploring Number Theory Using Base SAS®
Long before analysts began mining large data sets, mathematicians sought truths hidden within the set of natural numbers. This exploration was formalized into the mathematical subfield known as number theory. Though this discipline has proven fruitful for many applied fields, number theory delights in numerical truth for its own sake. The austere and abstract beauty of number theory prompted nineteenth-century mathematician Carl Friedrich Gauss to dub it 'The Queen of Mathematics.' This session and the related paper demonstrate that the analytical power of the SAS® engine is well-suited for exploring concepts in number theory. In Base SAS®, students and educators will find a powerful arsenal for exploring, illustrating, and visualizing the following: prime numbers, perfect numbers, the Euclidean algorithm, the prime number theorem, Euler's totient function, Chebyshev's theorem, the Chinese remainder theorem, Gauss circle problem, and more! The paper delivers custom SAS procedures and code segments that generate data sets relevant to the exploration of topics commonly found in elementary number theory texts. The efficiency of these algorithms is discussed and an emphasis is placed on structuring data sets to maximize flexibility and ease in exploring new conjectures and illustrating known theorems. Last, the power of SAS plotting is put to use in unexpected ways to visualize and convey number-theoretic facts. This session and the related paper are geared toward the academic user or SAS user who welcomes and revels in mathematical diversions.
Read the paper (PDF).
Matthew Duchnowski, Educational Testing Service
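Three of the topics listed in the abstract of Paper 3384-2015 (the Euclidean algorithm, Euler's totient function, and perfect numbers) are small enough to sketch directly. The Python below is an illustration only and is unrelated to the paper's Base SAS code. The function names are hypothetical, and the brute-force approach favors clarity over efficiency.

```python
def gcd(a, b):
    """Euclidean algorithm: replace (a, b) with (b, a mod b) until b is 0."""
    while b:
        a, b = b, a % b
    return a


def totient(n):
    """Euler's totient: count of 1 <= k <= n that are coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def is_perfect(n):
    """True if n equals the sum of its proper divisors."""
    return n > 1 and sum(d for d in range(1, n) if n % d == 0) == n


perfect_below_500 = [n for n in range(1, 500) if is_perfect(n)]
print(gcd(48, 18), totient(10), perfect_below_500)
```

Building a table of such values for a range of n is the kind of data set that can then be plotted to explore conjectures visually.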
F
Paper 3419-2015:
Forest Plotting Analysis Macro %FORESTPLOT
A powerful tool for visually analyzing regression analysis is the forest plot. Model estimates, ratios, and rates with confidence limits are graphically stacked vertically in order to show how they overlap with each other and to show values of significance. The ability to see whether two values are significantly different from each other or whether a covariate has a significant meaning on its own is made much simpler in a forest plot rather than sifting through numbers in a report table. The amount of data preparation needed in order to build a high-quality forest plot in SAS® can be tremendous because the programmer needs to run analyses, extract the estimates to be plotted, structure the estimates in a format conducive to generating a forest plot, and then run the correct plotting procedure or create a graph template using the Graph Template Language (GTL). While some SAS procedures can produce forest plots using Output Delivery System (ODS) Graphics automatically, the plots are not generally publication-ready and are difficult to customize even if the programmer is familiar with GTL. The macro %FORESTPLOT is designed to perform all of the steps of building a high-quality forest plot in order to save time for both experienced and inexperienced programmers, and is currently set up to perform regression analyses common to the clinical oncology research areas, Cox proportional hazards and logistic, as well as calculate Kaplan-Meier event-free rates. To improve flexibility, the user can specify a pre-built data set to transform into a forest plot if the automated analysis options of the macro do not fit the user's needs.
Read the paper (PDF).
Jeffrey Meyers, Mayo Clinic
Qian Shi, Mayo Clinic
G
Paper 2787-2015:
GEN_OMEGA2: A SAS® Macro for Computing the Generalized Omega-Squared Effect Size Associated with Analysis of Variance Models
Effect sizes are strongly encouraged to be reported in addition to statistical significance and should be considered in evaluating the results of a study. The choice of an effect size for ANOVA models can be confusing because indices might differ depending on the research design as well as the magnitude of the effect. Olejnik and Algina (2003) proposed the generalized eta-squared and omega-squared effect sizes, which are comparable across a wide variety of research designs. This paper provides a SAS® macro for computing the generalized omega-squared effect size associated with analysis of variance models by using data from PROC GLM ODS tables. The paper provides the macro programming language, as well as results from an executed example of the macro.
Read the paper (PDF).
Anh Kellermann, University of South Florida
Yi-hsin Chen, USF
Jeffrey Kromrey, University of South Florida
Thanh Pham, USF
Patrice Rasmussen, USF
Patricia Rodriguez de Gil, University of South Florida
Jeanine Romano, USF
Paper SAS4121-2015:
Getting Started with Logistic Regression in SAS
This presentation provides a brief introduction to logistic regression analysis in SAS. Learn differences between Linear Regression and Logistic Regression, including ordinary least squares versus maximum likelihood estimation. Learn to: understand LOGISTIC procedure syntax, use continuous and categorical predictors, and interpret output from ODS Graphics.
Danny Modlin, SAS
Paper SAS4140-2015:
Getting Started with Mixed Models in Business
For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now they are gaining favor in business statistics to give better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn about how predictions based on a mixed model differ from predictions in ordinary regression, and see examples of mixed models with business data.
Catherine Truxillo, SAS
H
Paper 3431-2015:
How Latent Analyses within Survey Data Can Be Valuable Additions to Any Regression Model
This study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Three different analyses for latent variable discovery are briefly reviewed and explored. The procedures explored in this paper are PROC LCA, PROC LTA, PROC CATMOD, PROC FACTOR, PROC TRAJ, and PROC SURVEYLOGISTIC. The analyses defined through these procedures are latent profile analyses, latent class analyses, and latent transition analyses. The latent variables are included in three separate regression models. The effect of the latent variables on the fit and use of the regression model compared to a similar model using observed data is briefly reviewed. The data used for this study was obtained via the National Longitudinal Study of Adolescent Health, a study distributed and collected by Add Health. Data was analyzed using SAS® 9.3. This paper is intended for any level of SAS® user. This paper is also aimed at an audience with a background in behavioral science or statistics.
Read the paper (PDF).
Deanna Schreiber-Gregory, National University
Paper 3252-2015:
How to Use SAS® for GMM Logistic Regression Models for Longitudinal Data with Time-Dependent Covariates
In longitudinal data, it is important to account for the correlation due to repeated measures and time-dependent covariates. Generalized method of moments can be used to estimate the coefficients in longitudinal data, although there are currently limited procedures in SAS® to produce GMM estimates for correlated data. In a recent paper, Lalonde, Wilson, and Yin provided a GMM model for estimating the coefficients in this type of data. SAS PROC IML was used to generate equations that needed to be solved to determine which estimating equations to use. In addition, this study extended classifications of moment conditions to include a type IV covariate. Two data sets were evaluated using this method, including re-hospitalization rates from a Medicare database as well as body mass index and future morbidity rates among Filipino children. Both examples contain binary responses, repeated measures, and time-dependent covariates. However, while this technique is useful, it is tedious and can also be complicated when determining the matrices necessary to obtain the estimating equations. We provide a concise and user-friendly macro to fit GMM logistic regression models with extended classifications.
Read the paper (PDF).
Katherine Cai, Arizona State University
I
Paper 3295-2015:
Imputing Missing Data using SAS®
Missing data is an unfortunate reality of statistics. However, there are various ways to estimate and deal with missing data. This paper explores the pros and cons of traditional imputation methods versus maximum likelihood estimation as well as single versus multiple imputation. These differences are displayed through comparing parameter estimates of a known data set and simulating random missing data of different severity. In addition, this paper uses PROC MI and PROC MIANALYZE and shows how to use these procedures in a longitudinal data set.
Read the paper (PDF).
Christopher Yim, Cal Poly San Luis Obispo
Paper 3052-2015:
Introduce a Linear Regression Model by Using the Variable Transformation Method
This paper explains how to build a linear regression model using the variable transformation method. It also covers testing the assumptions required for linear modeling and testing the fit of a linear model. This paper is intended for analysts who have limited exposure to building linear models. This paper uses the REG, GLM, CORR, UNIVARIATE, and GPLOT procedures.
Read the paper (PDF). | Download the data file (ZIP).
Nancy Hu, Discover
Paper SAS1742-2015:
Introducing the HPGENSELECT Procedure: Model Selection for Generalized Linear Models and More
Generalized linear models are highly useful statistical tools in a broad array of business applications and scientific fields. How can you select a good model when numerous models that have different regression effects are possible? The HPGENSELECT procedure, which was introduced in SAS/STAT® 12.3, provides forward, backward, and stepwise model selection for generalized linear models. In SAS/STAT 14.1, the HPGENSELECT procedure also provides the LASSO method for model selection. You can specify common distributions in the family of generalized linear models, such as the Poisson, binomial, and multinomial distributions. You can also specify the Tweedie distribution, which is important in ratemaking by the insurance industry and in scientific applications. You can run the HPGENSELECT procedure in single-machine mode on the server where SAS/STAT is installed. With a separate license for SAS® High-Performance Statistics, you can also run the procedure in distributed mode on a cluster of machines that distribute the data and the computations. This paper shows you how to use the HPGENSELECT procedure both for model selection and for fitting a single model. The paper also explains the differences between the HPGENSELECT procedure and the GENMOD procedure.
Read the paper (PDF).
Gordon Johnston, SAS
Bob Rodriguez, SAS
J
Paper 3020-2015:
Jeffreys Interval for One-Sample Proportion with SAS/STAT® Software
This paper introduces the Jeffreys interval for a one-sample proportion using SAS® software. It compares the credible interval from a Bayesian approach with the confidence interval from a frequentist approach. Different ways to calculate the Jeffreys interval are presented using PROC FREQ, the QUANTILE function, a SAS program of the random walk Metropolis sampler, and PROC MCMC.
Read the paper (PDF).
Wu Gong, The Children's Hospital of Philadelphia
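As a brief illustration of the Jeffreys interval named in the entry above, the standard construction can be written out as follows. This is a sketch from the general statistical literature, not taken from the paper itself; the symbols x (number of successes), n (number of trials), and alpha (one minus the credibility level) are notation assumed here.

```latex
% Jeffreys prior for a binomial proportion p, and the resulting posterior
p \sim \mathrm{Beta}\!\left(\tfrac12, \tfrac12\right),
\qquad
p \mid x \sim \mathrm{Beta}\!\left(x + \tfrac12,\; n - x + \tfrac12\right)

% Equal-tailed 100(1-\alpha)\% Jeffreys interval: posterior quantiles
L = F^{-1}_{\mathrm{Beta}}\!\left(\tfrac{\alpha}{2};\; x + \tfrac12,\; n - x + \tfrac12\right),
\qquad
U = F^{-1}_{\mathrm{Beta}}\!\left(1 - \tfrac{\alpha}{2};\; x + \tfrac12,\; n - x + \tfrac12\right)

% Usual boundary convention: L = 0 when x = 0, and U = 1 when x = n
```

Because both limits are quantiles of a beta distribution, the interval can be obtained either from a closed-form quantile function or by sampling from the posterior, which is consistent with the range of approaches the abstract lists.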
K
Paper 2480-2015:
Kaplan-Meier Survival Plotting Macro %NEWSURV
The research areas of pharmaceuticals and oncology clinical trials greatly depend on time-to-event endpoints such as overall survival and progression-free survival. One of the best graphical displays of these analyses is the Kaplan-Meier curve, which can be simple to generate with the LIFETEST procedure but difficult to customize. Journal articles generally prefer that statistics such as median time-to-event, number of patients, and time-point event-free rate estimates be displayed within the graphic itself, and this was previously difficult to do without an external program such as Microsoft Excel. The macro %NEWSURV takes advantage of the Graph Template Language (GTL) that was added with the SG graphics engine to create this level of customizability without the need for back-end manipulation. Taking this one step further, the macro was improved to be able to generate a lattice of multiple unique Kaplan-Meier curves for side-by-side comparisons or for condensing figures for publications. This paper describes the functionality of the macro and describes how the key elements of the macro work.
Read the paper (PDF).
Jeffrey Meyers, Mayo Clinic
L
Paper 3297-2015:
Lasso Regularization for Generalized Linear Models in Base SAS® Using Cyclical Coordinate Descent
The cyclical coordinate descent method is a simple algorithm that has been used for fitting generalized linear models with lasso penalties by Friedman et al. (2007). The coordinate descent algorithm can be implemented in Base SAS® to perform efficient variable selection and shrinkage for GLMs with the L1 penalty (the lasso).
Read the paper (PDF).
Robert Feyerharm, Beacon Health Options
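To make the cyclical coordinate descent idea in the entry above concrete, here is a minimal sketch for the simplest case, a Gaussian response with squared-error loss. It is written in Python rather than Base SAS and is not the paper's implementation; the function names `soft_threshold` and `lasso_coordinate_descent`, the column centering, and the stopping rule are all assumptions made for illustration. The objective minimized is (1/(2n)) * sum of squared residuals + lam * sum of |beta_j|, with an unpenalized intercept.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, max_sweeps=1000, tol=1e-12):
    """Lasso for a linear model via cyclical coordinate descent (illustrative).

    X is a list of rows, y a list of responses, lam the L1 penalty weight.
    Returns (intercept, list_of_coefficients).
    """
    n, p = len(X), len(X[0])
    # Center y and each column of X so the intercept can be recovered at the end.
    y_mean = sum(y) / n
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    col_ms = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]

    beta = [0.0] * p
    resid = [yi - y_mean for yi in y]  # residuals for the current beta

    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_ms[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the partial residual that
            # excludes coordinate j's current contribution.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n
            rho += col_ms[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_ms[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break  # no coordinate moved appreciably in a full sweep

    beta0 = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return beta0, beta


if __name__ == "__main__":
    # Toy data: y depends on the first column only.
    X_demo = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
    y_demo = [1, 3, 5, 7, 9, 11]
    print(lasso_coordinate_descent(X_demo, y_demo, lam=1.0))
```

Each coordinate update is a one-dimensional problem with a closed-form solution, which is why the method needs only loops and arithmetic. Extending the sketch to other generalized linear models would require wrapping these updates in an iteratively reweighted least squares loop, which is not shown here.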
M
Paper 2400-2015:
Modeling Effect Modification and Higher-Order Interactions: A Novel Approach for Repeated Measures Design Using the LSMESTIMATE Statement in SAS® 9.4
Effect modification occurs when the association between a predictor of interest and the outcome is differential across levels of a third variable--the modifier. Effect modification is statistically tested as the interaction effect between the predictor and the modifier. In repeated measures studies (with more than two time points), higher-order (three-way) interactions must be considered to test effect modification by adding time to the interaction terms. Custom fitting and constructing these repeated measures models are difficult and time consuming, especially with respect to estimating post-fitting contrasts. With the advancement of the LSMESTIMATE statement in SAS®, a simplified approach can be used to custom test for higher-order interactions with post-fitting contrasts within a mixed model framework. This paper provides a simulated example with tips and techniques for using an application of the nonpositional syntax of the LSMESTIMATE statement to test effect modification in repeated measures studies. This approach, which is applicable to exploring modifiers in randomized controlled trials (RCTs), goes beyond the treatment effect on outcome to a more functional understanding of the factors that can enhance, reduce, or change this relationship. Using this technique, we can easily identify differential changes for specific subgroups of individuals or patients that subsequently impact treatment decision making. We provide examples of conventional approaches to higher-order interaction and post-fitting tests using the ESTIMATE statement and compare and contrast this to the nonpositional syntax of the LSMESTIMATE statement. The merits and limitations of this approach are discussed.
Read the paper (PDF). | Download the data file (ZIP).
Pronabesh DasMahapatra, PatientsLikeMe Inc.
Ryan Black, NOVA Southeastern University
Paper 3430-2015:
Multilevel Models for Categorical Data Using SAS® PROC GLIMMIX: The Basics
Multilevel models (MLMs) are frequently used in social and health sciences where data are typically hierarchical in nature. However, the commonly used hierarchical linear models (HLMs) are appropriate only when the outcome of interest is normally distributed. When you are dealing with outcomes that are not normally distributed (binary, categorical, ordinal), a transformation and an appropriate error distribution for the response variable need to be incorporated into the model. Therefore, hierarchical generalized linear models (HGLMs) need to be used. This paper provides an introduction to specifying HGLMs using PROC GLIMMIX, following the structure of the primer for HLMs previously presented by Bell, Ene, Smiley, and Schoeneberger (2013). A brief introduction into the field of multilevel modeling and HGLMs with both dichotomous and polytomous outcomes is followed by a discussion of the model-building process and appropriate ways to assess the fit of these models. Next, the paper provides a discussion of PROC GLIMMIX statements and options as well as concrete examples of how PROC GLIMMIX can be used to estimate (a) two-level organizational models with a dichotomous outcome and (b) two-level organizational models with a polytomous outcome. These examples use data from High School and Beyond (HS&B), a nationally representative longitudinal study of American youth. For each example, narrative explanations accompany annotated examples of the GLIMMIX code and corresponding output.
Read the paper (PDF).
Mihaela Ene, University of South Carolina
Bethany Bell, University of South Carolina
Genine Blue, University of South Carolina
Elizabeth Leighton, University of South Carolina
Paper 2900-2015:
Multiple Ways to Detect Differential Item Functioning in SAS®
Differential item functioning (DIF), as an assessment tool, has been widely used in quantitative psychology, educational measurement, business management, insurance, and health care. The purpose of DIF analysis is to detect response differences of items in questionnaires, rating scales, or tests across different subgroups (for example, gender) and to ensure the fairness and validity of each item for those subgroups. The goal of this paper is to demonstrate several ways to conduct DIF analysis by using different SAS® procedures (PROC FREQ, PROC LOGISTIC, PROC GENMOD, PROC GLIMMIX, and PROC NLMIXED) and their applications. There are three general methods to examine DIF: generalized Mantel-Haenszel (MH), logistic regression, and item response theory (IRT). The SAS® System provides flexible procedures for all these approaches. There are two types of DIF: uniform DIF, which remains consistent across ability levels, and non-uniform DIF, which varies across ability levels. Generalized MH is a nonparametric method and is often used to detect uniform DIF while the other two are parametric methods and examine both uniform and non-uniform DIF. In this study, I first describe the underlying theories and mathematical formulations for each method. Then I show the SAS statements, input data format, and SAS output for each method, followed by a detailed demonstration of the differences among the three methods. Specifically, PROC FREQ is used to calculate generalized MH only for dichotomous items. PROC LOGISTIC and PROC GENMOD are used to detect DIF by using logistic regression. PROC NLMIXED and PROC GLIMMIX are used to examine DIF by applying an exploratory item response theory model. Finally, I use SAS/IML® to call two R packages (that is, difR and lordif) to conduct DIF analysis and then compare the results between SAS procedures and R packages. An example data set, the Verbal Aggression assessment, which includes 316 subjects and 24 items, is used in this study.
Following the general DIF analysis, the male group is used as the reference group, and the female group is used as the focal group. All the analyses are conducted by SAS® 9.3 and R 2.15.3. The paper closes with the conclusion that the SAS System provides different flexible and efficient ways to conduct DIF analysis. However, it is essential for SAS users to understand the underlying theories and assumptions of different DIF methods and apply them appropriately in their DIF analyses.
Read the paper (PDF).
Yan Zhang, Educational Testing Service
P
Paper 3516-2015:
Piecewise Linear Mixed Effects Models Using SAS
Evaluation of the impact of critical or high-risk events or periods in longitudinal studies of growth might provide clues to the long-term effects of life events and efficacies of preventive and therapeutic interventions. Conventional linear longitudinal models typically involve a single growth profile to represent linear changes in an outcome variable across time, which sometimes does not fit the empirical data. The piecewise linear mixed-effects models allow different linear functions of time corresponding to the pre- and post-critical time point trends. This presentation shows: 1) how to perform piecewise linear mixed effects models using SAS step by step, in the context of a clinical trial with two-arm interventions and a predictive covariate of interest; 2) how to obtain the slopes and corresponding p-values for intervention and control groups during pre- and post-critical periods, conditional on different values of the predictive covariate; and 3) how to make meaningful comparisons and present results in a scientific manuscript. A SAS macro to generate the summary tables assisting the interpretation of the results is also provided.
Qinlei Huang, St Jude Children's Research Hospital
R
Paper 1341-2015:
Random vs. Fixed Effects: Which Technique More Effectively Addresses Selection Bias in Observational Studies
Retrospective case-control studies are frequently used to evaluate health care programs when it is not feasible to randomly assign members to a respective cohort. Without randomization, observational studies are more susceptible to selection bias where the characteristics of the enrolled population differ from those of the entire population. When the participant sample is different from the comparison group, the measured outcomes are likely to be biased. Given this issue, this paper discusses how propensity score matching and random effects techniques can be used to reduce the impact selection bias has on observational study outcomes. All results shown are drawn from an ROI analysis using a participant (cases) versus non-participant (controls) observational study design for a fitness reimbursement program aiming to reduce health care expenditures of participating members.
Read the paper (PDF). | Download the data file (ZIP).
Jess Navratil-Strawn, Optum
S
Paper SAS1940-2015:
SAS/STAT® 14.1: Methods for Massive, Missing, or Multifaceted Data
The latest release of SAS/STAT® software brings you powerful techniques that will make a difference in your work, whether your data are massive, missing, or somewhere in the middle. New imputation software for survey data adds to an expansive array of methods in SAS/STAT for handling missing data, as does the production version of the GEE procedure, which provides the weighted generalized estimating equation approach for longitudinal studies with dropouts. An improved quadrature method in the GLIMMIX procedure gives you accelerated performance for certain classes of models. The HPSPLIT procedure provides a rich set of methods for statistical modeling with classification and regression trees, including cross validation and graphical displays. The HPGENSELECT procedure adds support for spline effects and lasso model selection for generalized linear models. And new software implements generalized additive models by using an approach that handles large data easily. Other updates include key functionality for Bayesian analysis and pharmaceutical applications.
Read the paper (PDF).
Maura Stokes, SAS
Bob Rodriguez, SAS
Paper 4400-2015:
SAS® Analytics plus Warren Buffett's Wisdom Beats Berkshire Hathaway! Huh?
Individual investors face a daunting challenge. They must select a portfolio of securities comprised of a manageable number of individual stocks, bonds and/or mutual funds. An investor might initiate her portfolio selection process by choosing the number of unique securities to hold in her portfolio. This is both a practical matter and a matter of risk management. It is practical because there are tens of thousands of actively traded securities from which to choose and it is impractical for an individual investor to own every available security. It is also a risk management measure because investible securities bring with them the potential of financial loss -- to the point of becoming valueless in some cases. Increasing the number of securities in a portfolio decreases the probability that an investor will suffer drastically from corporate bankruptcy, for instance. However, holding too many securities in a portfolio can restrict performance. After deciding the number of securities to hold, the investor must determine which securities she will include in her portfolio and what proportion of available cash she will allocate to each security. Once her portfolio is constructed, the investor must manage the portfolio over time. This generally entails periodically reassessing the proportion of each security to maintain as time advances, but may also involve the elimination of some securities and the initiation of positions in new securities. This paper introduces an analytically driven method for portfolio security selection based on minimizing the mean correlation of returns across the portfolio. It also introduces a method for determining the proportion of each security that should be maintained within the portfolio. The methods for portfolio selection and security weighting described herein work in conjunction to maximize expected portfolio return, while minimizing the probability of loss over time. 
This involves a re-visioning of Harry Markowitz's Nobel Prize winning concept known as the Efficient Frontier. Resultant portfolios are assessed via Monte Carlo simulation and results are compared to the Standard & Poor's 500 Index and Warren Buffett's Berkshire Hathaway, which has a well-established history of beating the Standard & Poor's 500 Index over a long period. To those familiar with Dr. Markowitz's Modern Portfolio Theory this paper may appear simply as a repackaging of old ideas. It is not.
Read the paper (PDF).
Bruce Bedford, Oberweis Dairy
Paper 3150-2015:
Statistical Evaluation of the Doughnut Clustering Method for Product Affinity Segmentation
Product affinity segmentation is a powerful technique for marketers and sales professionals to gain a good understanding of customers' needs, preferences, and purchase behavior. Performing product affinity segmentation is quite challenging in practice because product-level data usually has high skewness, high kurtosis, and a large percentage of zero values. The Doughnut Clustering method has been shown to be effective using real data, and was presented at SAS® Global Forum 2013 in the paper titled "Product Affinity Segmentation Using the Doughnut Clustering Approach." However, the Doughnut Clustering method is not a panacea for addressing the product affinity segmentation problem. There is a clear need for a comprehensive evaluation of this method in order to be able to develop generic guidelines for practitioners about when to apply it. In this paper, we meet the need by evaluating the Doughnut Clustering method on simulated data with different levels of skewness, kurtosis, and percentage of zero values. We developed a five-step approach based on Fleishman's power method to generate synthetic data with prescribed parameters. Subsequently, we designed and conducted a set of experiments to run the Doughnut Clustering method as well as the traditional K-means method as a benchmark on simulated data. We draw conclusions on the performance of the Doughnut Clustering method by comparing the clustering validity metric (the ratio of between-cluster variance to within-cluster variance) as well as the relative proportion of cluster sizes against those of K-means.
Read the paper (PDF). | Download the data file (ZIP).
Darius Baer, SAS
Goutam Chakraborty, Oklahoma State University
Paper 1521-2015:
Sums of Squares: The Basics and a Surprise
Most 'Design of Experiment' textbooks cover Type I, Type II, and Type III sums of squares, but many researchers and statisticians fall into the habit of using one type mindlessly. This breakout session reviews the basics and illustrates the importance of the choice of type as well as the variable definitions in PROC GLM and PROC REG.
Read the paper (PDF).
Sheila Barron, University of Iowa
Michelle Mengeling, Comprehensive Access & Delivery Research & Evaluation-CADRE, Iowa City VA Health Care System
T
Paper SAS1387-2015:
Ten Tips for Simulating Data with SAS®
Data simulation is a fundamental tool for statistical programmers. SAS® software provides many techniques for simulating data from a variety of statistical models. However, not all techniques are equally efficient. An efficient simulation can run in seconds, whereas an inefficient simulation might require days to run. This paper presents 10 techniques that enable you to write efficient simulations in SAS. Examples include how to simulate data from a complex distribution and how to use simulated data to approximate the sampling distribution of a statistic.
Read the paper (PDF). | Download the data file (ZIP).
Rick Wicklin, SAS
Paper SAS1745-2015:
The SEM Approach to Longitudinal Data Analysis Using the CALIS Procedure
Researchers often use longitudinal data analysis to study the development of behaviors or traits. For example, they might study how an elderly person's cognitive functioning changes over time or how a therapeutic intervention affects a certain behavior over a period of time. This paper introduces the structural equation modeling (SEM) approach to analyzing longitudinal data. It describes various types of latent curve models and demonstrates how you can use the CALIS procedure in SAS/STAT® software to fit these models. Specifically, the paper covers basic latent curve models, such as unconditional and conditional models, as well as more complex models that involve multivariate responses and latent factors. All illustrations use real data that were collected in a study that looked at maternal stress and the relationship between mothers and their preterm infants. This paper emphasizes the practical aspects of longitudinal data analysis. In addition to illustrating the program code, it shows how you can interpret the estimation results and revise the model appropriately. The final section of the paper discusses the advantages and disadvantages of the SEM approach to longitudinal data analysis.
Read the paper (PDF).
Xinming An, SAS
Yiu-Fai Yung, SAS
U
Paper 2740-2015:
Using Heat Maps to Compare Clusters of Ontario DWI Drivers
SAS® PROC FASTCLUS generates five clusters for the group of repeat clients of Ontario's Remedial Measures program. Heat map tables are shown for selected variables such as demographics, scales, factor, and drug use to visualize the difference between clusters.
Rosely Flam-Zalcman, CAMH
Robert Mann, CAMH
Rita Thomas, CAMH
Paper 3208-2015:
Using SAS/STAT® Software to Implement a Multivariate Adaptive Outlier Detection Approach to Distinguish Outliers from Extreme Values
Hawkins (1980) defines an outlier as an observation that deviates so much from other observations as to arouse the suspicion that it was generated by a different mechanism. To identify data outliers, a classic multivariate outlier detection approach implements the Robust Mahalanobis Distance Method by splitting the distribution of distance values into two subsets (within-the-norm and out-of-the-norm), with the threshold value usually set to the 97.5% quantile of the Chi-Square distribution with p (number of variables) degrees of freedom; items whose distance values are beyond it are labeled out-of-the-norm. This threshold value is an arbitrary number, however, and it might flag as out-of-the-norm a number of items that are actually extreme values of the baseline distribution rather than outliers. Therefore, it is desirable to identify an additional threshold, a cutoff point that divides the set of out-of-the-norm points into two subsets--extreme values and outliers. One way to do this--in particular for larger databases--is to increase the threshold value to another arbitrary number, but this approach requires taking into consideration the size of the data set since size affects the threshold separating outliers from extreme values. A 2003 article by Gervini (Journal of Multivariate Analysis) proposes an adaptive threshold that increases with the number of items n if the data is clean but remains bounded if there are outliers in the data. In 2005 Filzmoser, Garrett, and Reimann (Computers & Geosciences) built on Gervini's contribution to derive by simulation a relationship between the number of items n, the number of variables in the data p, and a critical ancillary variable for the determination of outlier thresholds. This paper implements the Gervini adaptive threshold value estimator by using PROC ROBUSTREG and the SAS® Chi-Square functions CINV and PROBCHI, available in the SAS/STAT® environment.
It also provides data simulations to illustrate the reliability and the flexibility of the method in distinguishing true outliers from extreme values.
Read the paper (PDF).
Paulo Macedo, Integrity Management Services, LLC
Paper 3644-2015:
Using and Understanding the LSMEANS and LSMESTIMATE Statements
The concept of least squares means, or population marginal means, seems to confuse a lot of people. We explore least squares means as implemented by the LSMEANS statement in SAS®, beginning with the basics. Particular emphasis is paid to the effect of alternative parameterizations (for example, whether binary variables are in the CLASS statement) and the effect of the OBSMARGINS option. We use examples to show how to mimic LSMEANS using ESTIMATE statements and the advantages of the relatively new LSMESTIMATE statement. The basics of estimability are discussed, including how to get around the dreaded non-estimable messages. Emphasis is put on using the STORE statement and PROC PLM to test hypotheses without having to redo all the model calculations. This material is appropriate for all levels of SAS experience, but some familiarity with linear models is assumed.
Read the paper (PDF). | Watch the recording.
David Pasta, ICON Clinical Research
Paper 1335-2015:
Using the GLIMMIX and GENMOD Procedures to Analyze Longitudinal Data from a Department of Veterans Affairs Multisite Randomized Controlled Trial
Many SAS® procedures can be used to analyze longitudinal data. This study employed a multisite randomized controlled trial design to demonstrate the effectiveness of two SAS procedures, GLIMMIX and GENMOD, to analyze longitudinal data from five Department of Veterans Affairs Medical Centers (VAMCs). Older male veterans (n = 1222) seen in VAMC primary care clinics were randomly assigned to two behavioral health models, integrated (n = 605) and enhanced referral (n = 617). Data was collected at baseline and at 3-, 6-, and 12-month follow-up. A mixed-effects repeated measures model was used to examine the dependent variable, problem drinking, which was defined as both a count and a dichotomous outcome from baseline to 12-month follow-up. Sociodemographics and depressive symptoms were included as covariates. First, bivariate analyses included general linear model and chi-square tests to examine covariates by group and group by problem drinking outcomes. All significant covariates were included in the GLIMMIX and GENMOD models. Then, multivariate analysis included mixed models with generalized estimating equations (GEEs). The effect of group, time, and the interaction effect of group by time were examined after controlling for covariates. Multivariate results were inconsistent for GLIMMIX and GENMOD using Lognormal, Gaussian, Weibull, and Gamma distributions. SAS is a powerful statistical program for analyzing data from longitudinal studies.
Read the paper (PDF).
Abbas Tavakoli, University of South Carolina/College of Nursing
Marlene Al-Barwani, University of South Carolina
Sue Levkoff, University of South Carolina
Selina McKinney, University of South Carolina
Nikki Wooten, University of South Carolina
Paper SAS1855-2015:
Using the PHREG Procedure to Analyze Competing-Risks Data
Competing risks arise in studies in which individuals are subject to a number of potential failure events and the occurrence of one event might impede the occurrence of other events. For example, after a bone marrow transplant, a patient might experience a relapse or might die while in remission. You can use one of the standard methods of survival analysis, such as the log-rank test or Cox regression, to analyze competing-risks data, whereas other methods, such as the product-limit estimator, might yield biased results. An increasingly common practice of assessing the probability of a failure in competing-risks analysis is to estimate the cumulative incidence function, which is the probability subdistribution function of failure from a specific cause. This paper discusses two commonly used regression approaches for evaluating the relationship of the covariates to the cause-specific failure in competing-risks data. One approach models the cause-specific hazard, and the other models the cumulative incidence. The paper shows how to use the PHREG procedure in SAS/STAT® software to fit these models.
Read the paper (PDF).
Ying So, SAS
Paper 3376-2015:
Using the SAS-PIRT Macro for Estimating the Parameters of Polytomous Items
Polytomous items have been widely used in educational and psychological settings. As a result, the demand for statistical programs that estimate the parameters of polytomous items has been increasing. For this purpose, Samejima (1969) proposed the graded response model (GRM), in which category characteristic curves are characterized by the difference of the two adjacent boundary characteristic curves. In this paper, we show how the SAS-PIRT macro (a SAS® macro written in SAS/IML®) was developed based on the GRM and how it performs in recovering the parameters of polytomous items using simulated data.
Read the paper (PDF).
Sung-Hyuck Lee, ACT, Inc.
W
Paper 3600-2015:
When Two Are Better Than One: Fitting Two-Part Models Using SAS
In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling health care costs, some patients have zero costs, and the distribution of positive costs is often extremely right-skewed. When modeling charitable donations, many potential donors give nothing, and the majority of donations are relatively small with a few very large donors. In the analysis of count data, there are also times where there are more zeros than would be expected using standard methodology, or cases where the zeros might differ substantially from the non-zeros, such as number of cavities a patient has at a dentist appointment or number of children born to a mother. If data has such structure, and ordinary least squares methods are used, then predictions and estimation might be inaccurate. The two-part model gives us a flexible and useful modeling framework in many situations. Methods for fitting the models with SAS® software are illustrated.
Read the paper (PDF).
Laura Kapitula, Grand Valley State University
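The two-part structure described in the entry above can be summarized with the usual decomposition of the mean of a nonnegative outcome with a point mass at zero. This is a generic sketch, not the paper's own notation; the symbols Y (outcome) and x (covariates) are assumed here.

```latex
% Part 1: probability that the outcome is nonzero (for example, a logistic model)
\pi(x) = \Pr(Y > 0 \mid x)

% Part 2: mean of the outcome among the nonzero cases
% (for example, a gamma or log-normal model for positive costs)
\mu_{+}(x) = \mathrm{E}[\,Y \mid Y > 0,\; x\,]

% Overall conditional mean combines the two parts
\mathrm{E}[\,Y \mid x\,] = \pi(x)\,\mu_{+}(x)
```

Fitting the two parts separately lets each use a distribution suited to its piece of the data, which is what allows the approach to handle both the excess zeros and the skewed positive values.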