SAS/STAT Papers A-Z

A
Session 8000-2016:
A SAS® Macro for Geographically Weighted Negative Binomial Regression
Geographically Weighted Negative Binomial Regression (GWNBR) was developed by Silva and Rodrigues (2014). It is a generalization of the Geographically Weighted Poisson Regression (GWPR) proposed by Nakaya and others (2005) and of the Poisson and negative binomial regressions. This paper presents a SAS® macro, written in SAS/IML® software, that estimates the GWNBR model and shows how to use the GMAP procedure to draw the maps.
Read the paper (PDF) | Download the data file (ZIP)
Alan Silva, University of Brasilia
Thais Rodrigues, University of Brasilia
Session 9280-2016:
An Application of the DEA Optimization Methodology to Optimize the Collection Direction Capacity of a Bank
In the Collection Direction of a well-recognized Colombian financial institution, there was no methodology for determining the optimal number of collection agents needed to improve the collection task and encourage more customers to commit to the minimum monthly payment on their debt. The objective of this paper is to apply the Data Envelopment Analysis (DEA) optimization methodology to determine the optimal number of agents to maximize the monthly collection in the bank. We show that the results can have a positive impact on credit portfolio behavior and reduce the collection management cost. The DEA optimization methodology has been successfully used in various fields to solve multi-criteria optimization problems, but it is not commonly used in the financial sector, mostly because it requires specialized software, such as SAS® Enterprise Guide®. In this paper, we present PROC OPTMODEL and show how to formulate the optimization problem, program the SAS® code, and adequately process the available data.
Read the paper (PDF) | Download the data file (ZIP)
Miguel Díaz, Scotiabank - Colpatria
Oscar Javier Cortés Arrigui, Scotiabank - Colpatria
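A minimal sketch of an input-oriented CCR DEA model in PROC OPTMODEL, using hypothetical data for five collection units with one input (number of agents) and one output (monthly collection); the set names, variable names, and values are illustrative assumptions, not the paper's data.

proc optmodel;
   set DMU = 1..5;                              /* five collection units (hypothetical)    */
   num agents{DMU} = [8 10 12 9 11];            /* input: number of collection agents      */
   num collected{DMU} = [120 150 200 140 180];  /* output: monthly collection (millions)   */
   num k = 1;                                   /* unit currently under evaluation         */
   var theta >= 0;                              /* efficiency score of unit k              */
   var lambda{DMU} >= 0;                        /* weights defining the reference frontier */
   min Efficiency = theta;
   con UseNoMoreInput: sum{j in DMU} lambda[j]*agents[j]    <= theta*agents[k];
   con ProduceAtLeast: sum{j in DMU} lambda[j]*collected[j] >= collected[k];
   solve;
   print theta lambda;
quit;

Looping k over all units (for example, with a macro %DO loop) returns an efficiency score for every unit; units with theta = 1 define the efficient frontier.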
Session 12000-2016:
Analysis of Grades for University Students Using Administrative Data and the IRT Procedure
The world's capacity to store and analyze data has increased in ways that would have been inconceivable just a couple of years ago. Due to this development, large-scale data are collected by governments. Until recently, this was for purely administrative purposes. This study used comprehensive data files on education. The purpose of this study was to examine compulsory courses for a bachelor's degree in the economics program at the University of Copenhagen. The difficulty and use of the grading scale were compared across the courses by using the new IRT procedure, which was introduced in SAS/STAT® 13.1. Further, the latent ability traits that were estimated for all students in the sample by PROC IRT are used as predictors in a logistic regression model. The hypothesis of interest is that students who have a lower ability trait will have a greater probability of dropping out of the university program compared to successful students. Administrative data from one cohort of students in the economics program at the University of Copenhagen were used (n=236). Three unidimensional Item Response Theory models, two dichotomous and one polytomous, were introduced. It turns out that the polytomous Graded Response model does the best job of fitting the data. The findings suggest that in order to receive the highest possible grade, the highest level of student ability is needed for the course exam in the first-year course Descriptive Economics A. In contrast, the third-year course Econometrics C is the easiest course in which to receive a top grade. In addition, this study found that as estimated student ability decreases, the probability of a student dropping out of the bachelor's degree program increases drastically. However, contrary to expectations, some students with high ability levels also end up dropping out.
Read the paper (PDF)
Sara Armandi, University of Copenhagen
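A minimal sketch of the two-step analysis described above, with hypothetical data set and variable names: calibrate a graded response model with PROC IRT, keep the estimated ability scores, and use them to predict dropout in PROC LOGISTIC.

/* Step 1: graded response model; ability estimates are written to the OUT= data set */
proc irt data=grades out=ability;
   model course1-course10 / resfunc=graded;
run;

/* Step 2: latent ability as a predictor of dropout (factor score name assumed) */
proc logistic data=ability;
   model dropout(event='1') = _Factor1;
run;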
Session 11702-2016:
Analyzing Non-Normal Binomial and Categorical Response Variables under Varying Data Conditions
When dealing with non-normal categorical response variables, logistic regression is a robust method for modeling the relationship between categorical outcomes and different predictors without assuming a linear relationship between them. Within such models, the categorical outcome might be binary, multinomial, or ordinal, and predictors might be continuous or categorical. Another complexity that might be added to such studies is when data is longitudinal, such as when outcomes are collected at multiple follow-up times. Learning about modeling such data within any statistical method is beneficial because it enables researchers to look at changes over time. This study looks at several methods of modeling binary and categorical response variables within regression models by using real-world data. Starting with the simplest case of binary outcomes through ordinal outcomes, this study looks at different modeling options within SAS® and includes longitudinal cases for each model. To assess binary outcomes, the current study models binary data in the absence and presence of correlated observations under regular logistic regression and mixed logistic regression. To assess multinomial outcomes, the current study uses multinomial logistic regression. When responses are ordered, ordinal logistic regression is required, as it allows for interpretations based on inherent rankings. Different logit functions for this model include the cumulative logit, adjacent-category logit, and continuation ratio logit. Each of these models is also considered for longitudinal (panel) data using methods such as mixed models and Generalized Estimating Equations (GEE). The final consideration, which cannot be addressed by GEE, is the conditional logit to examine bias due to omitted explanatory variables at the cluster level. Different procedures for the aforementioned models within SAS® 9.4 are explored and their strengths and limitations are specified for applied researchers facing similar data characteristics. These procedures include PROC LOGISTIC, PROC GLIMMIX, PROC GENMOD, PROC NLMIXED, and PROC PHREG.
Read the paper (PDF)
Niloofar Ramezani, University of Northern Colorado
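A minimal sketch of two of the models discussed, using assumed data set and variable names: a cumulative logit model for an ordinal outcome in PROC LOGISTIC, and the same outcome with a subject-level random intercept for repeated measures in PROC GLIMMIX.

/* Ordinal outcome: cumulative logit model */
proc logistic data=visits;
   class treatment;
   model severity = treatment age / link=clogit;
run;

/* Longitudinal version: random intercept per subject */
proc glimmix data=visits;
   class id treatment;
   model severity = treatment age time / dist=multinomial link=cumlogit solution;
   random intercept / subject=id;
run;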
Session 10663-2016:
Applying Frequentist and Bayesian Logistic Regression to MOOC data in SAS®: a Case Study
Massive Open Online Courses (MOOC) have attracted increasing attention in educational data mining research areas. MOOC platforms provide free higher education courses to Internet users worldwide. However, MOOCs have high enrollment but notoriously low completion rates. The goal of this study is to apply frequentist and Bayesian logistic regression to investigate whether and how students' engagement, intentions, education levels, and other demographics are conducive to course completion in MOOC platforms. The original data used in this study came from an online eight-week course titled Big Data in Education, taught within the Coursera platform (MOOC) by Teachers College, Columbia University. The data sets for analysis were created from three different sources--clickstream data, a pre-course survey, and homework assignment files. The SAS system provides multiple procedures to perform logistic regression, with each procedure having different features and functions. In this study, we apply two approaches--frequentist and Bayesian logistic regression--to MOOC data. PROC LOGISTIC is used for the frequentist approach, and PROC GENMOD is used for Bayesian analysis. The results obtained from the two approaches are compared. All the statistical analyses are conducted in SAS® 9.3. Our preliminary results show that MOOC students with higher course engagement and higher motivation are more likely to complete the MOOC course.
Read the paper (PDF) | View the e-poster or slides (PDF)
Yan Zhang, Educational Testing Service
Ryan Baker, Columbia University
Yoav Bergner, Educational Testing Service
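A minimal sketch of the two approaches, with hypothetical variable names standing in for the engagement, intention, and demographic predictors.

/* Frequentist logistic regression */
proc logistic data=mooc;
   class education intention / param=ref;
   model completed(event='1') = engagement education intention;
run;

/* Bayesian logistic regression via the BAYES statement */
proc genmod data=mooc descending;
   class education intention / param=ref;
   model completed = engagement education intention / dist=binomial link=logit;
   bayes seed=1234 nmc=20000 outpost=posterior;
run;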
C
Session 8620-2016:
Competing-Risks Analyses: An Overview of Regression Models
Competing-risks analyses are methods for analyzing the time to a terminal event (such as death or failure) and its cause or type. The cumulative incidence function CIF(j, t) is the probability of death by time t from cause j. New options in the LIFETEST procedure provide for nonparametric estimation of the CIF from event times and their associated causes, allowing for right-censoring when the event and its cause are not observed. Cause-specific hazard functions that are derived from the CIFs are the analogs of the hazard function when only a single cause is present. Death by one cause precludes occurrence of death by any other cause, because an individual can die only once. Incorporating explanatory variables in hazard functions provides an approach to assessing their impact on overall survival and on the CIF. This semiparametric approach can be analyzed in the PHREG procedure. The Fine-Gray model defines a subdistribution hazard function that has an expanded risk set, which consists of individuals at risk of the event by any cause at time t, together with those who experienced the event before t from any cause other than the cause of interest j. Finally, with additional assumptions a full parametric analysis is also feasible. We illustrate the application of these methods with empirical data sets.
Read the paper (PDF) | Watch the recording
Joseph Gardiner, Michigan State University
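A minimal sketch of the nonparametric and Fine-Gray analyses, with an assumed data set in which status = 0 marks censoring, 1 the event of interest, and 2 a competing cause.

/* Nonparametric CIF estimates, compared across groups */
proc lifetest data=study plots=cif;
   time t*status(0) / eventcode=1;
   strata group;
run;

/* Fine-Gray subdistribution hazard model */
proc phreg data=study;
   class group;
   model t*status(0) = group age / eventcode=1;
run;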
Session 10220-2016:
Correcting the Quasi-complete Separation Issue in Logistic Regression Models
Quasi-complete separation is a commonly detected issue in logit/probit models. It occurs when a dependent variable separates an independent variable or a combination of several independent variables to a certain degree. In other words, levels in a categorical variable or values in a numeric variable are separated by groups of discrete outcome variables. Most of the time, this happens in categorical independent variable(s). Quasi-complete separation can cause convergence failures in logistic regression, which consequently result in inflated coefficient estimates and standard errors. Therefore, it can potentially yield biased results. This paper provides a comprehensive review of various approaches for correcting quasi-complete separation in binary logistic regression models and presents hands-on examples from the healthcare insurance industry. First, it introduces the concept of quasi-complete separation and how to diagnose the issue. Then, it provides step-by-step guidelines for fixing the problem, from a straightforward data configuration approach to a complicated statistical modeling approach such as the EXACT method, the FIRTH method, and so on.
Read the paper (PDF)
Xinghe Lu, Amerihealth Caritas
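A minimal sketch of two of the corrections reviewed, with hypothetical data set and variable names.

/* Firth's penalized likelihood with profile-likelihood confidence limits */
proc logistic data=claims;
   class plan_type / param=ref;
   model fraud_flag(event='1') = plan_type age_group / firth clparm=pl;
run;

/* Exact conditional logistic regression for the separated predictor */
proc logistic data=claims;
   class plan_type / param=ref;
   model fraud_flag(event='1') = plan_type age_group;
   exact plan_type / estimate=both;
run;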
Session 11845-2016:
Credit Card Interest and Transactional Income Prediction with an Account-Level Transition Panel Data Model Using SAS/STAT® and SAS/ETS® Software
Credit card profitability prediction is a complex problem because of the variety of card holders' behavior patterns and the different sources of interest and transactional income. Each consumer account can move to a number of states such as inactive, transactor, revolver, delinquent, or defaulted. This paper i) describes an approach to credit card account-level profitability estimation based on the multistate and multistage conditional probabilities models and different types of income estimation, and ii) compares methods for the most efficient and accurate estimation. We use application, behavioral, card state dynamics, and macroeconomic characteristics, and their combinations as predictors. We use different types of logistic regression such as multinomial logistic regression, ordered logistic regression, and multistage conditional binary logistic regression with the LOGISTIC procedure for states transition probability estimation. The state transition probabilities are used as weights for interest rate and non-interest income models (which one is applied depends on the account state). Thus, the scoring model is split according to the customer behavior segment and the source of generated income. The total income consists of interest and non-interest income. Interest income is estimated with the credit limit utilization rate models. We test and compare five proportion models with the NLMIXED, LOGISTIC, and REG procedures in SAS/STAT® software. Non-interest income depends on the probability of being in a particular state, the two-stage model of conditional probability to make a point-of-sales transaction (POS) or cash withdrawal (ATM), and the amount of income generated by this transaction. We use the LOGISTIC procedure for conditional probability prediction and the GLIMMIX and PANEL procedures for direct amount estimation with pooled and random-effect panel data. The validation results confirm that traditional techniques can be effectively applied to complex tasks with many parameters and multilevel business logic. The model is used in credit limit management, risk prediction, and client behavior analytics.
Read the paper (PDF)
Denys Osipenko, the University of Edinburgh Business School
Session 7500-2016:
Customer Lifetime Value Modeling
Customer lifetime value (LTV) estimation involves two parts: the survival probabilities and profit margins. This article describes the estimation of those probabilities using discrete-time logistic hazard models. The estimation of profit margins is based on linear regression. In scenarios where outliers are present among the margins, we suggest applying robust regression with PROC ROBUSTREG.
Read the paper (PDF) | View the e-poster or slides (PDF)
Vadim Pliner, Verizon Wireless
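A minimal sketch of the robust margin regression, with assumed variable names.

proc robustreg data=margins method=m;
   model profit_margin = tenure monthly_revenue / diagnostics leverage;
run;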
E
Session 11683-2016:
Equivalence Tests
Motivated by the frequent need for equivalence tests in clinical trials, this presentation provides insights into tests for equivalence. We summarize and compare equivalence tests for different study designs, including one-parameter problems, designs with paired observations, and designs with multiple treatment arms. Power and sample size estimations are discussed. We also provide examples to implement the methods by using the TTEST, ANOVA, GLM, and MIXED procedures.
Read the paper (PDF)
Fei Wang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
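A minimal sketch of two of the designs covered, using assumed data sets and equivalence bounds.

/* Paired design: equivalence bounds of +/- 0.2 on the mean difference */
proc ttest data=paired_trial tost(-0.2, 0.2);
   paired test_product*reference_product;
run;

/* Two-arm design on the ratio scale (lognormal data), 0.80-1.25 bounds */
proc ttest data=pk_trial dist=lognormal tost(0.8, 1.25);
   class treatment;
   var auc;
run;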
Session 9760-2016:
Evaluation of PROC IRT Procedure for Item Response Modeling
The experimental item response theory procedure (PROC IRT), included in the recently released SAS/STAT® 13.1 and 13.2, enables item response modeling and trait estimation in SAS®. PROC IRT enables you to perform item parameter calibration and latent trait estimation for a wide spectrum of educational and psychological research. This paper evaluates the performance of PROC IRT in terms of item parameter recovery under various testing conditions. The pros and cons of PROC IRT versus BILOG-MG 3.0 are presented. For practitioners of IRT models, the development of IRT-related analysis in SAS is inspiring: it offers a strong option for the growing population of IRT users. A shift to SAS can be beneficial based on several features of SAS: its flexibility in data management, its power in data analysis, its convenient output delivery, and its increasing richness in graphical presentation. It is critical to ensure the quality of item parameter calibration and trait estimation before you continue with other components, such as test scoring, test form construction, IRT equating, and so on.
View the e-poster or slides (PDF)
Yi-Fang Wu, ACT, Inc.
Session 8441-2016:
exSPLINE That: Explaining Geographic Variation in Insurance Pricing
Generalized linear models (GLMs) are commonly used to model rating factors in insurance pricing. The integration of territory rating and geospatial variables poses a unique challenge to the traditional GLM approach. Generalized additive models (GAMs) offer a flexible alternative based on GLM principles with a relaxation of the linear assumption. We explore two approaches for incorporating geospatial data in a pricing model using a GAM-based framework. The ability to incorporate new geospatial data and improve traditional approaches to territory ratemaking results in further market segmentation and a better match of price to risk. Our discussion highlights the use of the high-performance GAMPL procedure, which is new in SAS/STAT® 14.1 software. With PROC GAMPL, we can incorporate the geographic effects of geospatial variables on target loss outcomes. We illustrate two approaches. In our first approach, we begin by modeling the predictors as regressors in a GLM, and subsequently model the residuals as part of a GAM based on location coordinates. In our second approach, we model all inputs as covariates within a single GAM. Our discussion compares the two approaches and demonstrates visualization of model outputs.
Read the paper (PDF)
Carol Frigo, State Farm
Kelsey Osterloo, State Farm Insurance
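A minimal sketch of the two approaches, with hypothetical policy-level variable names; the gamma distribution choice is an assumption.

/* Approach 1: GLM for the rating factors, then a spatial GAM on its residuals */
proc genmod data=policies;
   class territory;
   model loss_cost = territory vehicle_age / dist=gamma link=log;
   output out=glm_out resdev=dev_resid;
run;

proc gampl data=glm_out;
   model dev_resid = spline(latitude longitude);
run;

/* Approach 2: a single GAM with parametric terms plus a bivariate spatial spline */
proc gampl data=policies;
   class territory;
   model loss_cost = param(territory vehicle_age) spline(latitude longitude) / dist=gamma;
run;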
Session 12487-2016:
Extracting Useful Information From the Google Ngram Data Set: A General Method to Take the Growth of the Scientific Literature into Account
Recent years have seen the birth of a powerful tool for companies and scientists: the Google Ngram data set, built from millions of digitized books. It can be, and has been, used to learn about past and present trends in the use of words over the years. This is an invaluable asset from a business perspective, mostly because of its potential application in marketing. The choice of words has a major impact on the success of a marketing campaign and an analysis of the Google Ngram data set can validate or even suggest the choice of certain words. It can also be used to predict the next buzzwords in order to improve marketing on social media or to help measure the success of previous campaigns. The Google Ngram data set is a gift for scientists and companies, but it has to be used with a lot of care. False conclusions can easily be drawn from straightforward analysis of the data. It contains only a limited number of variables, which makes it difficult to extract valuable information from it. Through a detailed example, this paper shows that it is essential to account for the disparity in the genre of the books used to construct the data set. This paper argues that for the years after 1950, the data set has been constructed using a much higher proportion of scientific books than for the years before. An ingenious method is developed to approximate, for each year, this unknown proportion of books coming from the scientific literature. A statistical model accounting for that change in proportion is then presented. This model is used to analyze the trend in the use of common words of the scientific literature in the 20th century. Results suggest that a naive analysis of the trends in the data can be misleading.
Read the paper (PDF)
Aurélien Nicosia, Université Laval
Thierry Duchesne, Universite Laval
Samuel Perreault, Université Laval
F
Session 7860-2016:
Finding and Evaluating Multiple Candidate Models for Logistic Regression
Logistic regression models are commonly used in direct marketing and consumer finance applications. In this context, the paper discusses two topics about the fitting and evaluation of logistic regression models. Topic 1 is a comparison of two methods for finding multiple candidate models. The first method is the familiar best subsets approach. Best subsets is then compared to a proposed new method that combines the models produced by backward and forward selection, plus the predictors considered by backward and forward. This second method uses the SAS® procedure HPLOGISTIC with selection of models by the Schwarz Bayes criterion (SBC). Topic 2 is a discussion of model evaluation statistics to measure predictive accuracy and goodness-of-fit in support of the choice of a final model. Base SAS® and SAS/STAT® software are used.
Read the paper (PDF) | Download the data file (ZIP)
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
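A minimal sketch of the second method's selection step, with assumed data set and predictor names.

proc hplogistic data=modeling;
   model respond(event='1') = x1-x25;
   selection method=backward(select=sbc choose=sbc) details=all;
run;

Running the same step with method=forward and pooling the models and predictors considered by both directions gives the candidate set the paper describes.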
Session SAS4720-2016:
Fitting Multilevel Hierarchical Mixed Models Using PROC NLMIXED
Hierarchical nonlinear mixed models are complex models that occur naturally in many fields. The NLMIXED procedure's ability to fit linear or nonlinear models with standard or general distributions enables you to fit a wide range of such models. SAS/STAT® 13.2 enhanced PROC NLMIXED to support multiple RANDOM statements, enabling you to fit nested multilevel mixed models. This paper uses an example to illustrate the new functionality.
Read the paper (PDF)
Raghavendra Kurada, SAS
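A minimal sketch of a two-level nested model using the multiple RANDOM statement support, with hypothetical school/classroom data.

proc nlmixed data=scores;
   parms b0=0 b1=1 s2_school=1 s2_class=1 s2_e=1;
   mu = b0 + b1*ses + u_school + u_class;
   model score ~ normal(mu, s2_e);
   random u_school ~ normal(0, s2_school) subject=school;
   random u_class  ~ normal(0, s2_class)  subject=class(school);
run;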
Session SAS5601-2016:
Fitting Your Favorite Mixed Models with PROC MCMC
The popular MIXED, GLIMMIX, and NLMIXED procedures in SAS/STAT® software fit linear, generalized linear, and nonlinear mixed models, respectively. These procedures take the classical approach of maximizing the likelihood function to estimate model parameters. The flexible MCMC procedure in SAS/STAT can fit these same models by taking a Bayesian approach. Instead of maximizing the likelihood function, PROC MCMC draws samples (using a variety of sampling algorithms) to approximate the posterior distributions of model parameters. Similar to the mixed modeling procedures, PROC MCMC provides estimation, inference, and prediction. This paper describes how to use the MCMC procedure to fit Bayesian mixed models and compares the Bayesian approach to how the classical models would be fit with the familiar mixed modeling procedures. Several examples illustrate the approach in practice.
Read the paper (PDF)
Gordon Brown, SAS
Fang K. Chen, SAS
Maura Stokes, SAS
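A minimal sketch of a Bayesian random-intercept linear mixed model in PROC MCMC, with assumed data set, variable names, and prior choices.

proc mcmc data=trial nmc=20000 seed=27513 outpost=post;
   parms beta0 0 beta1 0;
   parms s2u 1 s2e 1;
   prior beta: ~ normal(0, var=1e6);
   prior s2:   ~ igamma(0.01, scale=0.01);
   random u ~ normal(0, var=s2u) subject=clinic;   /* clinic-level random intercept */
   mu = beta0 + beta1*dose + u;
   model y ~ normal(mu, var=s2e);
run;

The classical counterpart would be a PROC MIXED run with a RANDOM intercept / SUBJECT=clinic statement.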
L
Session 5500-2016:
Latent Class Analysis Using the LCA Procedure
This paper presents the use of latent class analysis (LCA) to identify a set of mutually exclusive latent classes of individuals based on their responses to a set of categorical, observed variables. The LCA procedure, a user-defined SAS® procedure for conducting LCA and LCA with covariates, is demonstrated using follow-up data on substance use from Monitoring the Future panel data, a nationally representative sample of high school seniors who are followed at selected time points during adulthood. The demonstration includes guidance on data management prior to analysis, PROC LCA syntax requirements and options, and interpretation of output.
Read the paper (PDF)
Patricia Berglund, University of Michigan
Session 1720-2016:
Limit of Detection (LoD) Estimation Using Parametric Curve Fitting to (Hit) Rate Data: The LOD_EST SAS® Macro
The Limit of Detection (LoD) is defined as the lowest concentration or amount of material, target, or analyte that is consistently detectable (for polymerase chain reaction [PCR] quantitative studies, in at least 95% of the samples tested). In practice, the estimation of the LoD uses a parametric curve fit to a set of panel member (PM1, PM2, PM3, and so on) data where the responses are binary. Typically, the parametric curve fit to the percent detection levels takes on the form of a probit or logistic distribution. The SAS® PROBIT procedure can be used to fit a variety of distributions, including both the probit and logistic. We introduce the LOD_EST SAS macro that takes advantage of the SAS PROBIT procedure's strengths and returns an information-rich graphic as well as a percent detection table with associated 95% exact (Clopper-Pearson) confidence intervals for the hit rates at each level.
Read the paper (PDF) | Download the data file (ZIP)
Jesse Canchola, Roche Molecular Systems, Inc.
Pari Hemyari, Roche Molecular Systems, Inc.
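A minimal sketch of the parametric fit and the exact hit-rate intervals the macro reports, using assumed panel-member data (hits out of trials at each log concentration, plus a one-record-per-replicate version with a binary detected flag).

/* Logistic (or probit) fit; the INVERSECL table gives the concentration, with
   fiducial limits, at each detection probability level, including 0.95 */
proc probit data=panel;
   model hits/trials = logconc / d=logistic lackfit inversecl;
run;

/* Exact (Clopper-Pearson) confidence limits for the hit rate at each level */
proc freq data=panel_long;
   by panel_member;
   tables detected / binomial(level='1');
   exact binomial;
run;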
M
Session 10761-2016:
Medicare Fraud Analytics Using Cluster Analysis: How PROC FASTCLUS Can Refine the Identification of Peer Comparison Groups
Although limited to a small fraction of health care providers, the existence and magnitude of fraud in health insurance programs requires the use of fraud prevention and detection procedures. Data mining methods are used to uncover odd billing patterns in large databases of health claims history. Efficient fraud discovery can involve the preliminary step of deploying automated outlier detection techniques in order to classify identified outliers as potential fraud before an in-depth investigation. An essential component of the outlier detection procedure is the identification of proper peer comparison groups to classify providers as within-the-norm or outliers. This study refines the concept of peer comparison group within the provider category and considers the possibility of distinct billing patterns associated with medical or surgical procedure codes identifiable by the Berenson-Eggers Type of Service (BETOS) system. The BETOS system covers all HCPCS codes (Healthcare Common Procedure Coding System); assigns a HCPCS code to only one BETOS code; consists of readily understood clinical categories; and consists of categories that permit objective assignment (Centers for Medicare & Medicaid Services, CMS). The study focuses on the specialty General Practice and involves two steps: first, the identification of clusters of similar BETOS-based billing patterns; and second, the assessment of the effectiveness of these peer comparison groups in identifying outliers. The working data set is a sample of the summary of 2012 data of physicians active in health care government programs made publicly available by the CMS through its website. The analysis uses PROC FASTCLUS, the SAS® cubic clustering criterion approach, to find the optimal number of clusters in the data. It also uses PROC ROBUSTREG to implement a multivariate adaptive threshold outlier detection method.
Read the paper (PDF) | Download the data file (ZIP)
Paulo Macedo, Integrity Management Services
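A minimal sketch of the clustering step, with hypothetical BETOS-share variables summarizing each provider's billing profile.

proc fastclus data=gp_profiles maxclusters=5 maxiter=100 out=clusters outstat=stats;
   var betos_share1-betos_share12;   /* percent of billing in each BETOS category */
run;

Rerunning with a range of MAXCLUSTERS= values and comparing the cubic clustering criterion printed in the output is one way to choose the number of peer comparison groups.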
P
Session 11663-2016:
PROC GLIMMIX as a Teaching and Planning Tool for Experiment Design
In our book SAS for Mixed Models, we state that 'the majority of modeling problems are really design problems.' Graduate students and even relatively experienced statistical consultants can find translating a study design into a useful model to be a challenge. Generalized linear mixed models (GLMMs) exacerbate the challenge, because they accommodate complex designs, complex error structures, and non-Gaussian data. This talk covers strategies that have proven successful in design of experiment courses and in consulting sessions. In addition, GLMM methods can be extremely useful in planning experiments. This talk discusses methods to implement precision and power analysis to help choose between competing plausible designs and to assess the adequacy of proposed designs.
Read the paper (PDF) | Download the data file (ZIP)
Walter Stroup, University of Nebraska, Lincoln
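A minimal sketch of the exemplary-data-set approach to power that the talk discusses, assuming a randomized complete block design with anticipated treatment means stored in a variable mu and anticipated variance components held at their planning values.

/* Fit the proposed model to the exemplary data; variances are held fixed */
proc glimmix data=exemplary;
   class block trt;
   model mu = trt;
   random intercept / subject=block;
   parms (0.25) (1) / hold=1,2;
   ods output tests3=ftest;
run;

/* Convert the resulting F statistic into approximate power */
data power;
   set ftest;
   alpha = 0.05;
   ncp   = numdf*fvalue;                     /* noncentrality parameter        */
   fcrit = finv(1-alpha, numdf, dendf, 0);   /* critical value under the null  */
   power = 1 - probf(fcrit, numdf, dendf, ncp);
run;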
Session 7861-2016:
Probability Density for Repeated Events
In customer relationship management (CRM) or consumer finance it is important to predict the time of repeated events. These repeated events might be a purchase, service visit, or late payment. Specifically, the goal is to find the probability density for the time to first event, the probability density for the time to second event, and so on. Two approaches are presented and contrasted. One approach uses discrete time hazard modeling (DTHM). The second, a distinctly different approach, uses a multinomial logistic model (MLM). Both DTHM and MLM satisfy an important consistency criterion, which is to provide probability densities whose average across all customers equals the empirical population average. DTHM requires fewer models if the number of repeated events is small. MLM requires fewer models if the time horizon is small. SAS® code is provided through a simulation in order to illustrate the two approaches.
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
Q
Session 5620-2016:
Quantile Regression versus Ordinary Least Squares Regression
Regression is used to examine the relationship between one or more explanatory (independent) variables and an outcome (dependent) variable. Ordinary least squares regression models the effect of explanatory variables on the average value of the outcome. Sometimes, we are more interested in modeling the median value or some other quantile (for example, the 10th or 90th percentile). Or, we might wonder if a relationship between explanatory variables and outcome variable is the same for the entire range of the outcome: Perhaps the relationship at small values of the outcome is different from the relationship at large values of the outcome. This presentation illustrates quantile regression, comparing it to ordinary least squares regression. Topics covered will include: a graphical explanation of quantile regression, its assumptions and advantages, using the SAS® QUANTREG procedure, and interpretation of the procedure's output.
Read the paper (PDF) | Watch the recording
Ruth Croxford, Institute for Clinical Evaluative Sciences
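A minimal sketch of fitting several quantiles at once, with assumed variable names.

proc quantreg data=costs ci=sparsity;
   model cost = age comorbidity_count / quantile=0.10 0.50 0.90 plot=quantplot;
run;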
R
Session 8240-2016:
Real-Time Predictive Modeling of Key Quality Characteristics Using Regularized Regression: SAS® Procedures GLMSELECT and LASSO
This paper describes the merging of designed experiments and regularized regression to find the significant factors to use for a real-time predictive model. The real-time predictive model is used in a Weyerhaeuser modified fiber mill to estimate the value of several key quality characteristics. The modified fiber product is accepted or rejected based on the model prediction or the predicted values' deviation from lab tests. To develop the model, a designed experiment was needed at the mill. The experiment was planned by a team of engineers, managers, operators, lab techs, and a statistician. The data analysis used the actual values of the process variables manipulated in the designed experiment. There were instances of the set point not being achieved or maintained for the duration of the run. The lab tests of the key quality characteristics were used as the responses. The experiment was designed with JMP® 64-bit Edition 11.2.0 and estimated the two-way interactions of the process variables. It was thought that one of the process variables was curvilinear and this effect was also made estimable. The LASSO method, as implemented in the SAS® GLMSELECT procedure, was used to analyze the data after cleaning and validating the results. Cross validation was used as the variable selection operator. The resulting prediction model passed all the predetermined success criteria. The prediction model is currently being used in real time at the mill for product quality acceptance.
Read the paper (PDF)
Jon Lindenauer, Weyerhaeuser
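A minimal sketch of the LASSO fit with cross validation as the selection criterion, using hypothetical process-variable names in place of the mill's.

proc glmselect data=doe_results plots=coefficients;
   model strength = temp|pressure|speed @2 speed*speed
         / selection=lasso(choose=cv stop=none) cvmethod=random(5);
run;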
Session 8920-2016:
Reducing and Screening Redundant Variables in Logistic Regression Models Using SAS/STAT® Software and SAS® Enterprise Miner™
Logistic regression is one of the most widely used regression models in statistics. Its dependent variable is a discrete variable with at least two levels. It measures the relationship between categorical dependent variables and independent variables and predicts the likelihood of having the event associated with an outcome variable. Variable reduction and screening are the techniques that reduce the redundant independent variables and filter out the independent variables with less predictive power. For several reasons, variable reduction and screening are important and critical especially when there are hundreds of covariates available that could possibly fit into the model. These techniques decrease the model estimation time, help avoid the occurrence of non-convergence, and can generate an unbiased inference of the outcome. This paper reviews various approaches for variable reduction, including traditional methods starting from univariate single logistic regressions and methods based on variable clustering techniques. It also presents hands-on examples using SAS/STAT® software and illustrates three selection nodes available in SAS® Enterprise Miner™: variable selection node, variable clustering node, and decision tree node.
Read the paper (PDF)
Xinghe Lu, Amerihealth Caritas
Session 5621-2016:
Restricted Cubic Spline Regression: A Brief Introduction
Sometimes, the relationship between an outcome (dependent) variable and the explanatory (independent) variable(s) is not linear. Restricted cubic splines are a way of testing the hypothesis that the relationship is not linear or summarizing a relationship that is too non-linear to be usefully summarized by a linear relationship. Restricted cubic splines are just a transformation of an independent variable. Thus, they can be used not only in ordinary least squares regression, but also in logistic regression, survival analysis, and so on. The range of values of the independent variable is split up, with knots defining the end of one segment and the start of the next. Separate curves are fit to each segment. Overall, the splines are defined so that the resulting fitted curve is smooth and continuous. This presentation describes when splines might be used, how the splines are defined, choice of knots, and interpretation of the regression results.
Read the paper (PDF) | Watch the recording
Ruth Croxford, Institute for Clinical Evaluative Sciences
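A minimal sketch of a restricted (natural) cubic spline supplied through the EFFECT statement, here inside PROC LOGISTIC with assumed variable names; the same EFFECT syntax works in the other procedures that support it.

proc logistic data=cohort;
   effect spl_age = spline(age / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model death(event='1') = spl_age sex;
run;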
Session 10621-2016:
Risk Adjustment Methods in Value-Based Reimbursement Strategies
Value-based reimbursement is the emerging strategy in the US healthcare system. The premise of value-based care is simple in concept--high quality and low cost provide the greatest value to patients and the various parties that fund their coverage. The basic equation for value is equally simple to compute: value=quality/cost. However, there are significant challenges to measuring it accurately. Error or bias in measuring value could result in the failure of this strategy to ultimately improve the healthcare system. This session discusses various methods and issues with risk adjustment in a value-based reimbursement model. Risk adjustment is an essential tool for ensuring that fair comparisons are made when deciding what health services and health providers have high value. The goal of this presentation is to give analysts an overview of risk adjustment and to provide guidance for when, why, and how to use risk adjustment when quantifying performance of health services and healthcare providers on both cost and quality. Statistical modeling approaches are reviewed and practical issues with developing and implementing the models are discussed. Real-world examples are also provided.
Read the paper (PDF)
Daryl Wansink, Conifer Value Based Care
S
Session 7640-2016:
Simulation of Imputation Effects under Different Assumptions
Missing data is something that we cannot always prevent. Data can be missing because subjects refuse to answer a sensitive question, perhaps out of fear of embarrassment. Researchers often assume that their data are missing completely at random or missing at random. Unfortunately, we cannot test whether the mechanism condition is satisfied because the missing values cannot be observed. Alternatively, we can run simulations in SAS® to observe the behavior of missing data under different assumptions: missing completely at random, missing at random, and being ignorable. We compare the effects of imputation methods after setting a chosen set of variables of interest to missing. The idea is to see how the substituted values in a data set affect further analyses. This lets the audience decide which method(s) would be best for approaching a data set with missing data.
View the e-poster or slides (PDF)
Danny Rithy, California Polytechnic State University
Soma Roy, California Polytechnic State University
Session 11773-2016:
Statistical Comparisons of Disease Prevalence Rates Using the Bootstrap Procedure
Disease prevalence is one of the most basic measures of the burden of disease in the field of epidemiology. As an estimate of the total number of cases of disease in a given population, prevalence is a standard in public health analysis. The prevalence of diseases in a given area is also frequently at the core of governmental policy decisions, charitable organization funding initiatives, and countless other aspects of everyday life. However, all too often, prevalence estimates are restricted to descriptive estimates of population characteristics when they could have a much wider application through the use of inferential statistics. As an estimate based on a sample from a population, disease prevalence can vary based on random fluctuations in that sample rather than true differences in the population characteristic. Statistical inference uses a known distribution of this sampling variation to perform hypothesis tests, calculate confidence intervals, and perform other advanced statistical methods. However, there is no agreed-upon sampling distribution of the prevalence estimate. In cases where the sampling distribution of an estimate is unknown, statisticians frequently rely on the bootstrap re-sampling procedure first given by Efron in 1979. This procedure relies on the computational power of software to generate repeated pseudo-samples similar in structure to an original, real data set. These multiple samples allow for the construction of confidence intervals and statistical tests to make statistical determinations and comparisons using the estimated prevalence. In this paper, we use the bootstrapping capabilities of SAS® 9.4 to compare statistically the difference between two given prevalence rates. We create a bootstrap analog to the two-sample t test to compare prevalence rates from two states, despite the fact that the sampling distribution of these estimates is unknown.
Read the paper (PDF)
Matthew Dutton, Florida A&M University
Charlotte Baker, Florida A&M University
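A minimal sketch of the resampling step and the per-replicate prevalence estimates, with assumed data set and variable names.

/* 1,000 bootstrap resamples, drawn with replacement at the original sample size */
proc surveyselect data=state_sample out=boot seed=20160418
                  method=urs samprate=1 reps=1000 outhits;
run;

/* Prevalence by state within each replicate */
proc means data=boot noprint nway;
   by replicate;
   class state;
   var disease_flag;
   output out=boot_prev mean=prevalence;
run;

Differencing the two states' prevalences within each replicate and taking the 2.5th and 97.5th percentiles of those differences yields the bootstrap analog of the two-sample test.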
Session SAS4900-2016:
Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT® Software
The increasing size and complexity of data in research and business applications require a more versatile set of tools for building explanatory and predictive statistical models. In response to this need, SAS/STAT® software continues to add new methods. This presentation takes you on a high-level tour of five recent enhancements: new effect selection methods for regression models with the GLMSELECT procedure, model selection for generalized linear models with the HPGENSELECT procedure, model selection for quantile regression with the HPQUANTSELECT procedure, construction of generalized additive models with the GAMPL procedure, and building classification and regression trees with the HPSPLIT procedure. For each of these approaches, the presentation reviews its key concepts, uses a basic example to illustrate its benefits, and guides you to information that will help you get started.
Read the paper (PDF)
Bob Rodriguez, SAS
Session SAS3520-2016:
Survey Data Imputation with PROC SURVEYIMPUTE
Big data, small data--but what about when you have no data? Survey data commonly include missing values due to nonresponse. Adjusting for nonresponse in the analysis stage might lead different analysts to use different, and inconsistent, adjustment methods. To provide the same complete data to all the analysts, you can impute the missing values by replacing them with reasonable nonmissing values. Hot-deck imputation, the most commonly used survey imputation method, replaces the missing values of a nonrespondent unit by the observed values of a respondent unit. In addition to performing traditional cell-based hot-deck imputation, the SURVEYIMPUTE procedure, new in SAS/STAT® 14.1, also performs more modern fully efficient fractional imputation (FEFI). FEFI is a variation of hot-deck imputation in which all potential donors in a cell contribute their values. Filling in missing values is only a part of PROC SURVEYIMPUTE. The real trick is to perform analyses of the filled-in data that appropriately account for the imputation. PROC SURVEYIMPUTE also creates a set of replicate weights that are adjusted for FEFI. Thus, if you use the imputed data from PROC SURVEYIMPUTE along with the replicate methods in any of the survey analysis procedures--SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, or SURVEYPHREG--you can be confident that inferences account not only for the survey design, but also for the imputation. This paper discusses different approaches for handling nonresponse in surveys, introduces PROC SURVEYIMPUTE, and demonstrates its use with real-world applications. It also discusses connections with other ways of handling missing values in SAS/STAT.
Read the paper (PDF)
Pushpal Mukhopadhyay, SAS
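A minimal sketch of a FEFI run, with assumed data set and variable names; the output data set carries the imputation-adjusted weights and jackknife replicate weights for use with the WEIGHT and REPWEIGHTS statements of the survey analysis procedures.

proc surveyimpute data=survey method=fefi varmethod=jackknife seed=1586;
   class education income_band;
   var education income_band;     /* variables with missing values to impute */
   cells region;                  /* imputation cells                        */
   weight samp_wt;
   output out=imputed;
run;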
T
Session 6541-2016:
The Baker's Dozen: What Every Biostatistician Needs to Know
It's impossible to know all of SAS® or all of statistics. There will always be some technique that you don't know. However, there are a few techniques that anyone in biostatistics should know. If you can calculate those with SAS, life is all the better. In this session you will learn how to compute and interpret a baker's dozen of these techniques, including several statistics that are frequently confused. The following statistics are covered: prevalence, incidence, sensitivity, specificity, attributable fraction, population attributable fraction, risk difference, relative risk, odds ratio, Fisher's exact test, number needed to treat, and McNemar's test. With these 13 tools in their tool chest, even nonstatisticians or statisticians who are not specialists will be able to answer many common questions in biostatistics. The fact that each of these can be computed with a few statements in SAS makes the situation all the sweeter. Bring your own doughnuts.
Read the paper (PDF)
AnnMaria De Mars, 7 Generation Games
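Several of the thirteen statistics come straight out of PROC FREQ; a minimal sketch with assumed two-by-two data.

/* Relative risk, odds ratio, risk difference, chi-square, and Fisher's exact test */
proc freq data=cohort;
   tables exposure*disease / relrisk riskdiff chisq;
   exact fisher;
run;

/* McNemar's test for paired (before/after) binary responses */
proc freq data=paired;
   tables before*after / agree;
run;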
Session 5181-2016:
The Use of Statistical Sampling in Auditing Health-Care Insurance Claim Payments
This paper is a primer on the practice of designing, selecting, and making inferences on a statistical sample, where the goal is to estimate the magnitude of error in a book value total. Although the concepts and syntax are presented through the lens of an audit of health-care insurance claim payments, they generalize to other contexts. After presenting the fundamental measures of uncertainty that are associated with sample-based estimates, we outline a few methods to estimate the sample size necessary to achieve a targeted precision threshold. The benefits of stratification are also explained. Finally, we compare several viable estimators to quantify the book value discrepancy, making note of the scenarios where one might be preferred over the others.
Read the paper (PDF)
Taylor Lewis, U.S. Office of Personnel Management
Julie Johnson, OPM - Office of the Inspector General
Christine Muha, U.S. Office of Personnel Management
Session SAS6403-2016:
Tips and Strategies for Mixed Modeling with SAS/STAT® Procedures
Inherently, mixed modeling with SAS/STAT® procedures (such as GLIMMIX, MIXED, and NLMIXED) is computationally intensive. Therefore, considerable memory and CPU time can be required. The default algorithms in these procedures might fail to converge for some data sets and models. This encore presentation of a paper from SAS Global Forum 2012 provides recommendations for circumventing memory problems and reducing execution times for your mixed-modeling analyses. This paper also shows how the new HPMIXED procedure can be beneficial for certain situations, as with large sparse mixed models. Lastly, the discussion focuses on the best way to interpret and address common notes, warnings, and error messages that can occur with the estimation of mixed models in SAS® software.
Read the paper (PDF) | Watch the recording
Phil Gibbs, SAS
U
Session 3160-2016:
Using Administrative Databases for Research: Propensity Score Methods to Adjust for Bias
Health care and other programs collect large amounts of information in order to administer the program. A health insurance plan, for example, probably records every physician service, hospital and emergency department visit, and prescription medication--information that is collected in order to make payments to the various providers (hospitals, physicians, pharmacists). Although the data are collected for administrative purposes, these databases can also be used to address research questions, including questions that are unethical or too expensive to answer using randomized experiments. However, when subjects are not randomly assigned to treatment groups, we worry about assignment bias--the possibility that the people in one treatment group were healthier, smarter, more compliant, etc., than those in the other group, biasing the comparison of the two treatments. Propensity score methods are one way to adjust the comparisons, enabling research using administrative data to mimic research using randomized controlled trials. In this presentation, I explain what the propensity score is, how it is used to compare two treatments, and how to implement a propensity score analysis using SAS®. Two methods using propensity scores are presented: matching and inverse probability weighting. Examples are drawn from health services research using the administrative databases of the Ontario Health Insurance Plan, the single payer for medically necessary care for the 13 million residents of Ontario, Canada. However, propensity score methods are not limited to health care. They can be used to examine the impact of any nonrandomized program, as long as there is enough information to calculate a credible propensity score.
Read the paper (PDF)
Ruth Croxford, Institute for Clinical Evaluative Sciences
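A minimal sketch of the two methods presented, estimating the propensity score with PROC LOGISTIC and then applying inverse probability weighting; data set and variable names are assumed.

/* Step 1: propensity score model */
proc logistic data=cohort;
   class sex region / param=ref;
   model treated(event='1') = age sex region comorbidity_count;
   output out=ps_data p=pscore;
run;

/* Step 2: inverse probability of treatment weights */
data ps_weighted;
   set ps_data;
   if treated = 1 then iptw = 1/pscore;
   else iptw = 1/(1 - pscore);
run;

/* Step 3: weighted outcome comparison with robust (sandwich) standard errors */
proc genmod data=ps_weighted descending;
   class patient_id;
   model outcome = treated / dist=binomial link=logit;
   weight iptw;
   repeated subject=patient_id;
run;

Matching on the estimated propensity score is the alternative to weighting; either way, balance of the covariates between treatment groups should be checked after adjustment.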
Session 1680-2016:
Using GENMOD to Analyze Correlated Data on Military Health System Beneficiaries Receiving Behavioral Health Care in South Carolina Health Care System
Many SAS® procedures can be used to analyze large amounts of correlated data. This study was a secondary analysis of data obtained from the South Carolina Revenue and Fiscal Affairs Office (RFA). The data includes medical claims from all health care systems in South Carolina (SC). This study used the SAS procedure GENMOD to analyze a large amount of correlated data about Military Health System (MHS) beneficiaries who received behavioral health care in South Carolina health care systems from 2005 to 2014. Behavioral health (BH) was defined by Major Diagnostic Category (MDC) 19 (mental disorders and diseases) and 20 (alcohol/drug use). MDCs are formed by dividing all possible principal diagnoses from the International Classification of Diseases (ICD-9) codes into 25 mutually exclusive diagnostic categories. The sample included a total of 6,783 BH visits and 4,827 unique adult and child patients, including military service members, veterans, and their adult and child dependents who have MHS insurance coverage. PROC GENMOD fit a multivariate GEE model with type of BH visit (mental health or substance abuse) as the dependent variable, and with gender, race group, age group, and discharge year as predictors. Hospital ID was used in the REPEATED statement with different correlation structures. Gender was significant with both the independent correlation structure (p = .0001) and the exchangeable structure (p = .0003). However, age group was significant using the independent correlation structure (p = .0160), but non-significant using the exchangeable correlation structure (p = .0584). SAS is a powerful statistical program for analyzing large amounts of correlated data with categorical outcomes.
Read the paper (PDF) | View the e-poster or slides (PDF)
Abbas Tavakoli, University of South Carolina
Jordan Brittingham, USC/ Arnold School of Public Health
Nikki R. Wooten, USC/College of Social Work
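A minimal sketch of the GEE model described, with assumed variable names and hospital ID as the clustering unit.

proc genmod data=bh_visits descending;
   class hospital_id gender race_group age_group disch_year;
   model visit_type = gender race_group age_group disch_year / dist=binomial link=logit type3;
   repeated subject=hospital_id / type=exch corrw;
run;

Swapping type=exch for type=ind (or another structure) reproduces the comparison of working correlation assumptions reported in the abstract.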
Session 11886-2016:
Using PROC HPSPLIT for Prediction with Classification and Regression
The HPSPLIT procedure, a powerful new addition to SAS/STAT®, fits classification and regression trees and uses the fitted trees for interpretation of data structures and for prediction. This presentation shows the use of PROC HPSPLIT to analyze a number of data sets from different subject areas, with a focus on prediction among subjects and across landscapes. A critical component of these analyses is the use of graphical tools to select a model, interpret the fitted trees, and evaluate the accuracy of the trees. The use of different kinds of cross validation for model selection and model assessment is also discussed.
Read the paper (PDF)
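A minimal sketch of a classification tree with a validation partition, using assumed data set and variable names.

proc hpsplit data=train seed=123;
   class default region;
   model default = income debt_ratio region;
   grow entropy;
   prune costcomplexity;
   partition fraction(validate=0.3);
run;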
Session 2101-2016:
Using the Kaplan-Meier Product-Limit Estimator to Adjust NFL Yardage Averages
Average yards per reception, as well as number of touchdowns, are commonly used to rank National Football League (NFL) players. However, scoring touchdowns lowers the player's average since it stops the play and therefore prevents the player from gaining more yardage. If yardages are tabulated in a life table, then yardage from a touchdown play is denoted as a right-censored observation. This paper discusses the application of the SAS/STAT® Kaplan-Meier product-limit estimator to adjust these averages. Using 15 seasons of NFL receiving data, the relationship between touchdown rates and average yards per reception is compared, before and after adjustment. The modification of adjustments when a player incurred a 2-point safety during the season is also discussed.
Read the paper (PDF)
Keith Curtis, USAA
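A minimal sketch of the product-limit adjustment, with one record per reception and a flag marking touchdown catches as right-censored yardage; names are assumed.

proc lifetest data=receptions outsurv=km;
   time yards*touchdown(1);   /* touchdown = 1 means the gain is censored */
   strata player;
run;

The mean yardage reported by PROC LIFETEST for each player serves as the touchdown-adjusted average yards per reception.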
Session 3820-2016:
Using the SURVEYSELECT Procedure to Draw Stratified Cluster Samples with Unequally Sized Clusters
The SURVEYSELECT procedure is useful for sample selection for a wide variety of applications. This paper presents the application of PROC SURVEYSELECT to a complex scenario involving a state-wide educational testing program, namely, drawing interdependent stratified cluster samples of schools for the field-testing of test questions, which we call items. These stand-alone field tests are given to only small portions of the testing population and as such, a stratified procedure was used to ensure representativeness of the field-test samples. As the field test statistics for these items are evaluated for use in future operational tests, an efficient procedure is needed to sample schools, while satisfying pre-defined sampling criteria and targets. This paper provides an adaptive sampling application and then generalizes the methodology, as much as possible, for potential use in more industries.
Read the paper (PDF) | View the e-poster or slides (PDF)
Chuanjie Liao, Pearson
Brian Patterson, Pearson
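A minimal sketch of one stratified selection of schools with probability proportional to enrollment; the stratum variable, per-stratum sample sizes, and size measure are assumptions, not the program's actual targets.

proc surveyselect data=schools out=fieldtest_sample method=pps
                  n=(30 25 40) seed=2016;
   strata region;       /* three strata assumed, one sample size per stratum */
   size enrollment;
run;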
W
Session 8540-2016:
Working with PROC SPP, PROC GMAP and PROC GINSIDE to Produce Nice Maps
This paper shows how to create maps from the output of the SAS/STAT® SPP procedure, using the same map layer as the data. The predicted surface can be clipped to the map boundary with PROC GINSIDE, and the final plot can be drawn with PROC GMAP. The result is a map that looks better than the default plot produced by PROC SPP.
Read the paper (PDF) | View the e-poster or slides (PDF)
Alan Silva, University of Brasilia