Although the statistical foundations of predictive analytics overlap substantially across industries, business objectives, data availability, and regulations differ among the property and casualty insurance, life insurance, banking, pharmaceutical, and genetics industries. A common process used in property and casualty insurance companies with large data sets is introduced, including data acquisition, data preparation, variable creation, variable selection, model building (also known as fitting), model validation, and model testing. The variable selection and model validation stages are described in more detail, and some successful models used in insurance companies are introduced. Base SAS®, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process.
Mei Najim, Sedgwick
This paper presents a SAS® macro for generating random numbers from skew-normal and skew-t distributions, as well as the quantiles of these distributions. The results are similar to those generated by the sn package in R.
Alan Silva, University of Brasilia
Paulo Henrique Dourado da Silva, Banco do Brasil
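Not the macro itself, but as a rough illustration of one standard way to generate skew-normal variates, the following DATA step sketch uses Azzalini's stochastic representation X = delta*|U0| + sqrt(1 - delta**2)*U1, where U0 and U1 are independent standard normals and delta = alpha/sqrt(1 + alpha**2); the data set name and the value of alpha are assumptions.

   data skewnorm;
      call streaminit(2016);
      alpha = 4;                                      /* shape (skewness) parameter, assumed */
      delta = alpha / sqrt(1 + alpha**2);
      do i = 1 to 10000;
         u0 = rand("Normal");
         u1 = rand("Normal");
         x  = delta*abs(u0) + sqrt(1 - delta**2)*u1;  /* SN(alpha) variate */
         output;
      end;
      keep x;
   run;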
Geographically Weighted Negative Binomial Regression (GWNBR) was developed by Silva and Rodrigues (2014). It is a generalization of the Geographically Weighted Poisson Regression (GWPR) proposed by Nakaya and others (2005) and of the Poisson and negative binomial regressions. This paper presents a SAS® macro, coded in SAS/IML® software, to estimate the GWNBR model and shows how to use the GMAP procedure to draw the maps.
Alan Silva, University of Brasilia
Thais Rodrigues, University of Brasilia
Longitudinal and repeated measures data are seen in nearly all fields of analysis. Examples of this data include weekly lab test results of patients or test scores of children from the same class. Statistics students and analysts alike might be overwhelmed when it comes to repeated measures or longitudinal data analyses. They might try to educate themselves by diving into text books or taking semester-long or intensive weekend courses, resulting in even more confusion. Some might try to ignore the repeated nature of data and take shortcuts such as analyzing all data as independent observations or analyzing summary statistics such as averages or changes from first to last points, ignoring all the data in between. This hands-on presentation introduces longitudinal and repeated measures analyses without heavy emphasis on theory. Students in the workshop will have the opportunity to get hands-on experience graphing longitudinal and repeated measures data. They will learn how to approach these analyses with tools like PROC MIXED and PROC GENMOD. Emphasis will be on continuous outcomes, but categorical outcomes will briefly be covered.
Leanne Goldstein, City of Hope
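For readers new to these tools, a minimal repeated measures sketch with PROC MIXED might look like the following; the data set, variables, and the AR(1) covariance structure are illustrative assumptions rather than the workshop's own example.

   proc mixed data=labs;
      class patient week treatment;
      model result = treatment week treatment*week / solution;
      repeated week / subject=patient type=ar(1);   /* within-patient correlation over weeks */
   run;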
Colpatria, as a part of Scotiabank in Colombia, currently has several methodologies that enable us to view the customer from a risk perspective. However, the current trend in the financial sector is toward a global vision that involves aspects of risk as well as profitability and utility. As part of the business strategies to develop cross-selling and customer profitability under conditions of risk, it is necessary to create a customer value index that scores each customer according to different groups of key business variables describing that customer's profitability and risk. To generate the customer value index, we propose constructing a synthetic index using principal component analysis and multiple factor analysis.
Ivan Atehortua, Colpatria
Diana Flórez, Colpatria
When dealing with non-normal categorical response variables, logistic regression is the robust method to use for modeling the relationship between categorical outcomes and different predictors without assuming a linear relationship between them. Within such models, the categorical outcome might be binary, multinomial, or ordinal, and predictors might be continuous or categorical. Another complexity that might be added to such studies is when data is longitudinal, such as when outcomes are collected at multiple follow-up times. Learning about modeling such data within any statistical method is beneficial because it enables researchers to look at changes over time. This study looks at several methods of modeling binary and categorical response variables within regression models by using real-world data. Starting with the simplest case of binary outcomes through ordinal outcomes, this study looks at different modeling options within SAS® and includes longitudinal cases for each model. To assess binary outcomes, the current study models binary data in the absence and presence of correlated observations under regular logistic regression and mixed logistic regression. To assess multinomial outcomes, the current study uses multinomial logistic regression. When responses are ordered, using ordinal logistic regression is required as it allows for interpretations based on inherent rankings. Different logit functions for this model include the cumulative logit, adjacent-category logit, and continuation ratio logit. Each of these models is also considered for longitudinal (panel) data using methods such as mixed models and Generalized Estimating Equations (GEE). The final consideration, which cannot be addressed by GEE, is the conditional logit to examine bias due to omitted explanatory variables at the cluster level. Different procedures for the aforementioned within SAS® 9.4 are explored and their strengths and limitations are specified for applied researchers finding similar data characteristics. These procedures include PROC LOGISTIC, PROC GLIMMIX, PROC GENMOD, PROC NLMIXED, and PROC PHREG.
Niloofar Ramezani, University of Northern Colorado
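As a hedged sketch of two of the models discussed, the first call below fits a cumulative logit model for an ordinal outcome with PROC LOGISTIC, and the second fits its GEE counterpart for repeated measurements with PROC GENMOD (data set and variable names are hypothetical; GENMOD supports only the independent working correlation for ordinal responses).

   proc logistic data=study;
      class trt / param=ref;
      model severity = trt age / link=clogit;      /* cumulative logit for an ordinal outcome */
   run;

   proc genmod data=study_long;
      class id trt visit;
      model severity = trt visit / dist=multinomial link=cumlogit;
      repeated subject=id / type=ind;              /* GEE for repeated ordinal responses */
   run;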
Bayesian inference for complex hierarchical models with smoothing splines is typically intractable, requiring approximate inference methods for use in practice. Markov Chain Monte Carlo (MCMC) is the standard method for generating samples from the posterior distribution. However, for large or complex models, MCMC can be computationally intensive, or even infeasible. Mean Field Variational Bayes (MFVB) is a fast deterministic alternative to MCMC. It provides an approximating distribution that has minimum Kullback-Leibler distance to the posterior. Unlike MCMC, MFVB efficiently scales to arbitrarily large and complex models. We derive MFVB algorithms for Gaussian semiparametric multilevel models and implement them in SAS/IML® software. To improve speed and memory efficiency, we use block decomposition to streamline the estimation of the large sparse covariance matrix. Through a series of simulations and real data examples, we demonstrate that the inference obtained from MFVB is comparable to that of PROC MCMC. We also provide practical demonstrations of how to estimate additional posterior quantities of interest from MFVB either directly or via Monte Carlo simulation.
Jason Bentley, The University of Sydney
Cathy Lee, University of Technology Sydney
Competing-risks analyses are methods for analyzing the time to a terminal event (such as death or failure) and its cause or type. The cumulative incidence function CIF(j, t) is the probability of death by time t from cause j. New options in the LIFETEST procedure provide for nonparametric estimation of the CIF from event times and their associated causes, allowing for right-censoring when the event and its cause are not observed. Cause-specific hazard functions that are derived from the CIFs are the analogs of the hazard function when only a single cause is present. Death by one cause precludes occurrence of death by any other cause, because an individual can die only once. Incorporating explanatory variables in hazard functions provides an approach to assessing their impact on overall survival and on the CIF. This semiparametric approach can be analyzed in the PHREG procedure. The Fine-Gray model defines a subdistribution hazard function that has an expanded risk set, which consists of individuals at risk of the event by any cause at time t, together with those who experienced the event before t from any cause other than the cause of interest j. Finally, with additional assumptions a full parametric analysis is also feasible. We illustrate the application of these methods with empirical data sets.
Joseph Gardiner, Michigan State University
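A minimal sketch of the two pieces described above, with hypothetical data set and variable names: nonparametric CIF estimation in PROC LIFETEST and the Fine-Gray subdistribution hazard model in PROC PHREG, where cause 1 is the cause of interest and 0 denotes censoring.

   proc lifetest data=events plots=cif;
      time t*cause(0) / eventcode=1;               /* CIF for cause 1 */
      strata group;
   run;

   proc phreg data=events;
      class group;
      model t*cause(0) = group age / eventcode=1;  /* Fine-Gray model */
   run;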
Quasi-complete separation is a commonly detected issue in logit/probit models. Quasi-complete separation occurs when a dependent variable separates an independent variable or a combination of several independent variables to a certain degree. In other words, levels in a categorical variable or values in a numeric variable are separated by groups of discrete outcome variables. Most of the time, this happens in categorical independent variable(s). Quasi-complete separation can cause convergence failures in logistic regression, which consequently result in inflated coefficient estimates and standard errors. Therefore, it can potentially yield biased results. This paper provides a comprehensive review of various approaches for correcting quasi-complete separation in binary logistic regression models and presents hands-on examples from the healthcare insurance industry. First, it introduces the concept of quasi-complete separation and how to diagnose the issue. Then, it provides step-by-step guidelines for fixing the problem, from a straightforward data configuration approach to more involved statistical modeling approaches such as the EXACT method, the FIRTH method, and so on.
Xinghe Lu, Amerihealth Caritas
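As a rough illustration of two of the remedies mentioned (all names are hypothetical), the first call requests Firth's penalized likelihood and the second an exact conditional analysis in PROC LOGISTIC.

   proc logistic data=claims;
      class plan_type / param=ref;
      model denied(event='1') = plan_type age / firth clparm=pl;   /* Firth penalized likelihood */
   run;

   proc logistic data=claims;
      class plan_type / param=ref;
      model denied(event='1') = plan_type;
      exact plan_type / estimate=both;                             /* exact conditional estimates */
   run;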
Introduction: Cycling is on the rise in many urban areas across the United States. The number of cyclist fatalities is also increasing, by 19% in the last 3 years. Given the broad-ranging personal and public health benefits of cycling, it is important to understand factors that are associated with these traffic-related deaths. There are more distracted drivers on the road than ever before, but the question remains to what extent these drivers are affecting cycling fatality rates. Methods: This paper uses the Fatality Analysis Reporting System (FARS) data to examine factors related to cyclist death when the driver is distracted. We use a novel machine learning approach, adaptive LASSO, to determine the relevant features and estimate their effects. Results: If a cyclist makes an improper action at or just before the time of the crash, the likelihood that the driver of the vehicle was distracted decreases. At the same time, if the driver is speeding or has failed to obey a traffic sign and fatally hits a cyclist, the likelihood that the driver was also distracted increases. Being distracted is related to other risky driving practices when cyclists are fatally injured. Environmental factors such as weather and road condition did not affect the likelihood that a driver was distracted when a cyclist fatality occurred.
Lysbeth Floden, University of Arizona
Melanie Bell, Department of Epidemiology & Biostatistics, University of Arizona
Patrick O'Connor, University of Arizona
This paper presents an application based on predictive analytics and feature-extraction techniques to develop an alternative method for diagnosing obstructive sleep apnea (OSA). Our method reduces the time and cost associated with the gold standard, polysomnography (PSG), which is operated manually, by automatically determining the severity of a patient's OSA via classification models that use the time series from a one-lead electrocardiogram (ECG). The data is from Dr. Thomas Penzel of Philipps-University, Germany, and can be downloaded at www.physionet.org. The selected data consists of 10 overnight ECG recordings (7 OSA patients and 3 controls) and non-overlapping minute-by-minute OSA episode annotations (apnea and non-apnea states), for a total of 4,998 events (2,532 non-apnea and 2,466 apnea minutes). This paper highlights a nonlinear decomposition technique, wavelet analysis (WA) in SAS/IML® software, used to extract the information about OSA symptoms carried by the ECG, resulting in useful predictor signals. Then, spectral and cross-spectral analyses via PROC SPECTRA are used to quantify important patterns of those signals as numeric features, namely power spectral density (PSD), cross power spectral density (CPSD), and coherency, so that the machine learning techniques in SAS® Enterprise Miner™ can differentiate OSA states. To eliminate variations due to body build, age, gender, and health condition, we normalize each feature by the corresponding feature of the original signal (that is, the ratio of the PSD of the ECG's wavelet decomposition to the PSD of the ECG). Moreover, because different OSA symptoms occur at different times, we account for this by including features from adjacent minutes in the analysis and selecting only the important ones with a decision tree model. The best classification result in the validation data (70:30 split), obtained from a random forest model, is 96.83% accuracy, 96.39% sensitivity, and 97.26% specificity. The results suggest that our method compares well with the gold standard.
Woranat Wongdhamma, Oklahoma State University
The new SAS® High-Performance Statistics procedures were developed to respond to the growth of big data and computing capabilities. Although there is some documentation regarding how to use these new high-performance (HP) procedures, relatively little has been disseminated regarding the specific conditions under which users can expect performance improvements. This paper serves as a practical guide to getting started with HP procedures in SAS®. The paper describes the differences between key HP procedures (HPGENSELECT, HPLMIXED, HPLOGISTIC, HPNLMOD, HPREG, HPCORR, HPIMPUTE, and HPSUMMARY) and their legacy counterparts, both in terms of capability and performance, with a particular focus on differences in the real time required to execute. Simulations were conducted to generate data sets that varied in the number of observations (10,000; 50,000; 100,000; 500,000; 1,000,000; and 10,000,000) and the number of variables (50, 100, 500, and 1,000) to create these comparisons.
Diep Nguyen, University of South Florida
Sean Joo, University of South Florida
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
Jessica Montgomery, University of South Florida
Patricia Rodríguez de Gil, University of South Florida
Yan Wang, University of South Florida
Motivated by the frequent need for equivalence tests in clinical trials, this presentation provides insights into tests for equivalence. We summarize and compare equivalence tests for different study designs, including one-parameter problems, designs with paired observations, and designs with multiple treatment arms. Power and sample size estimations are discussed. We also provide examples to implement the methods by using the TTEST, ANOVA, GLM, and MIXED procedures.
Fei Wang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
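A minimal sketch of an equivalence test for a paired design using the TOST option in PROC TTEST; the data set name, variable names, and equivalence margins are assumptions.

   proc ttest data=crossover tost(-0.2, 0.2);   /* two one-sided tests against margins -0.2 and 0.2 */
      paired after*before;
   run;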
Generalized linear models (GLMs) are commonly used to model rating factors in insurance pricing. The integration of territory rating and geospatial variables poses a unique challenge to the traditional GLM approach. Generalized additive models (GAMs) offer a flexible alternative based on GLM principles with a relaxation of the linear assumption. We explore two approaches for incorporating geospatial data in a pricing model using a GAM-based framework. The ability to incorporate new geospatial data and improve traditional approaches to territory ratemaking results in further market segmentation and a better match of price to risk. Our discussion highlights the use of the high-performance GAMPL procedure, which is new in SAS/STAT® 14.1 software. With PROC GAMPL, we can incorporate the geographic effects of geospatial variables on target loss outcomes. We illustrate two approaches. In our first approach, we begin by modeling the predictors as regressors in a GLM, and subsequently model the residuals as part of a GAM based on location coordinates. In our second approach, we model all inputs as covariates within a single GAM. Our discussion compares the two approaches and demonstrates visualization of model outputs.
Carol Frigo, State Farm
Kelsey Osterloo, State Farm Insurance
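A hedged sketch of the second approach (a single GAM) in PROC GAMPL, with latitude and longitude entering as a bivariate spline surface alongside parametric rating factors; the data set, variables, and gamma distribution are assumptions.

   proc gampl data=policies;
      class vehicle_use;
      model loss_cost = param(vehicle_use driver_age)
                        spline(latitude longitude)      /* bivariate spline for the geographic effect */
                        / dist=gamma;
   run;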
The use of logistic models for independent binary data has relied first on asymptotic theory and later on exact distributions for small samples, as discussed by Troxler, Lalonde, and Wilson (2011). While the use of exact analyses for logistic models with dependent data is not common, when presented it is usually for the case of one-stage clustering. We present a SAS® macro that allows the testing of hypotheses using exact methods in the case of one-stage and two-stage clustering for small samples. The accuracy of the method and the results are compared to results obtained using an R program.
Kyle Irimata, Arizona State University
Jeffrey Wilson, Arizona State University
The popular MIXED, GLIMMIX, and NLMIXED procedures in SAS/STAT® software fit linear, generalized linear, and nonlinear mixed models, respectively. These procedures take the classical approach of maximizing the likelihood function to estimate model parameters. The flexible MCMC procedure in SAS/STAT can fit these same models by taking a Bayesian approach. Instead of maximizing the likelihood function, PROC MCMC draws samples (using a variety of sampling algorithms) to approximate the posterior distributions of model parameters. Like the mixed modeling procedures, PROC MCMC provides estimation, inference, and prediction. This paper describes how to use the MCMC procedure to fit Bayesian mixed models and compares the Bayesian approach to how the classical models would be fit with the familiar mixed modeling procedures. Several examples illustrate the approach in practice.
Gordon Brown, SAS
Fang K. Chen, SAS
Maura Stokes, SAS
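A minimal sketch of a Bayesian random-intercept model in PROC MCMC, roughly parallel to what PROC MIXED would fit classically; the data set, variables, priors, and chain length are illustrative assumptions.

   proc mcmc data=growth nmc=20000 seed=27 outpost=post;
      parms beta0 0 beta1 0;
      parms s2e 1 s2u 1;
      prior beta: ~ normal(0, var=1e6);
      prior s2e s2u ~ igamma(shape=0.01, scale=0.01);
      random u ~ normal(0, var=s2u) subject=id;     /* random intercept for each subject */
      mu = beta0 + beta1*time + u;
      model y ~ normal(mu, var=s2e);
   run;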
SAS® Forecast Server provides an excellent tool for forecasting time series across a variety of scenarios and industries. Given sufficient data, SAS Forecast Server can provide insights into seasonality, trend, and the effects of input variables and events. In some business cases, there might not be sufficient data within individual series to produce these complex models. This paper presents an approach to forecasting such short time series. Using sufficiently long time series as a training data set, forecasted shapes and volumes are created through clustering and attribute-based regressions, allowing new series to receive forecasts based on their characteristics. These forecasts are passed into SAS Forecast Server through preprocessing. Production of artificial history and input variables, matched with a custom model added to the standard model repository, allows future values of the artificial variable to be realized as the forecast. This process not only relieves the need for manual entry of overrides, but also allows SAS Forecast Server to choose alternative models as additional observations are provided through incremental loads.
Cobey Abramowski, SAS
Christian Haxholdt, SAS Institute
Chris Houck, SAS Institute
Benjamin Perryman, SAS Institute
There are many methods to randomize participants in randomized control trials. If it is important to have approximately balanced groups throughout the course of the trial, simple randomization is not a suitable method. Perhaps the most common alternative method that provides balance is the blocked randomization method. A less well-known method called the treatment adaptive randomized design also achieves balance. This paper shows you how to generate an entire randomization sequence to randomize participants in a two-group clinical trial using the adaptive biased coin randomization design (ABCD), prior to recruiting any patients. Such a sequence could be used in a central randomization server. A unique feature of this method allows the user to determine the extent to which imbalance is permitted to occur throughout the sequence while retaining the probabilistic nature that is essential to randomization. Properties of sequences generated by the ABCD approach are compared to those generated by simple randomization, a variant of simple randomization that ensures balance at the end of the sequence, and by blocked randomization.
Gary Foster, St Joseph's Healthcare
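As a generic illustration only, a DATA step can generate a biased-coin sequence in which the probability of assigning the next participant to group A shrinks as A becomes over-represented; the allocation function and the imbalance parameter a below are assumptions for illustration and are not necessarily the ABCD rule used in the paper.

   data sequence;
      call streaminit(2016);
      a = 2;                                   /* imbalance-sensitivity parameter (assumed) */
      nA = 0; nB = 0;
      do subject = 1 to 200;
         d = nA - nB;                          /* current imbalance */
         if d = 0 then pA = 0.5;
         else if d > 0 then pA = 1/(d**a + 1);            /* A ahead: favor B */
         else pA = abs(d)**a/(abs(d)**a + 1);             /* B ahead: favor A */
         if rand("Uniform") < pA then do; group = "A"; nA + 1; end;
         else do; group = "B"; nB + 1; end;
         output;
      end;
      keep subject group pA;
   run;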
There are many ways to customize data presentation and graphics for an area as specialized as oncology. This paper touches on various approaches to presenting oncological data and on ways that each of the graphics can be customized and individualized based on the needs of a study. The topics range from using the SGPLOT and SGPANEL procedures, which offer a simple but visually effective way to present stratified information, to using a complex integration of a series of HIGHLOW plots to depict the duration of patient responses on a clinical trial. We also take a detailed look at how to generate specialized and customized waterfall plots used to visualize tumor growth patterns, as well as linear graphics that offer a different approach to viewing the same endpoint. In addition to specifying the syntax to graph the data, instruction will be given on how to construct the data sets needed to achieve the desired graphics.
Nora Ruel, City of Hope
In studies where randomization is not possible, imbalance in baseline covariates (confounding by indication) is a fundamental concern. Propensity score matching (PSM) is a popular method to minimize this potential bias, matching individuals who received treatment to those who did not, to reduce the imbalance in pre-treatment covariate distributions. PSM methods continue to advance, as computing resources expand. Optimal matching, which selects the set of matches that minimizes the average difference in propensity scores between mates, has been shown to outperform less computationally intensive methods. However, many find the implementation daunting. SAS/IML® software allows the integration of optimal matching routines that execute in R, e.g. the R optmatch package. This presentation walks through performing optimal PSM in SAS® through implementing R functions, assessing whether covariate trimming is necessary prior to PSM. It covers the propensity score analysis in SAS, the matching procedure, and the post-matching assessment of covariate balance using SAS/STAT® 13.2 and SAS/IML procedures.
Lucy D'Agostino McGowan, Vanderbilt University
Robert Greevy, Department of Biostatistics, Vanderbilt University
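A hedged sketch of the workflow: export the analysis data set from SAS/IML to R, run optimal pair matching on an estimated propensity score with the optmatch package, and bring the matched data back (requires SAS configured with the RLANG option; all data set, variable, and object names are assumptions).

   proc iml;
      call ExportDataSetToR("work.cohort", "dat");        /* send SAS data to R */
      submit / R;
         library(optmatch)
         ps <- glm(treat ~ age + sex + comorbid, family = binomial, data = dat)
         dat$match_id <- as.character(pairmatch(ps, data = dat))   # optimal 1:1 matching
      endsubmit;
      call ImportDataSetFromR("work.matched", "dat");     /* matched data back to SAS */
   quit;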
The Limit of Detection (LoD) is defined as the lowest concentration or amount of material, target, or analyte that is consistently detectable (for polymerase chain reaction [PCR] quantitative studies, in at least 95% of the samples tested). In practice, the estimation of the LoD uses a parametric curve fit to a set of panel member (PM1, PM2, PM3, and so on) data where the responses are binary. Typically, the parametric curve fit to the percent detection levels takes on the form of a probit or logistic distribution. The SAS® PROBIT procedure can be used to fit a variety of distributions, including both the probit and logistic. We introduce the LOD_EST SAS macro that takes advantage of the SAS PROBIT procedure's strengths and returns an information-rich graphic as well as a percent detection table with associated 95% exact (Clopper-Pearson) confidence intervals for the hit rates at each level.
Jesse Canchola, Roche Molecular Systems, Inc.
Pari Hemyari, Roche Molecular Systems, Inc.
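The macro wraps this kind of fit; a minimal sketch of the underlying PROC PROBIT call on panel-member hit-rate data, with hypothetical names, is:

   proc probit data=panel log10;
      model hits/trials = concentration / d=logistic inversecl;   /* logistic fit with inverse confidence limits */
      output out=fitted p=phat;
   run;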
Retailers need critical information about the expected inventory pattern over the life of a product to make pricing, replenishment, and staffing decisions. Hotels rely on booking curves to set rates and staffing levels for future dates. This paper explores a linear state space approach to understanding these industry challenges, applying the SAS/ETS® SSM procedure. We also use the SAS/ETS SIMILARITY procedure to provide additional insight. These advanced techniques help us quantify the relationship between the current inventory level and all previous inventory levels (in the retail case). In the hospitality example, we can evaluate how current total bookings relate to historical booking levels. Applying these procedures can produce valuable new insights about the nature of the retail inventory cycle and the hotel booking curve.
Beth Cubbage, SAS
Markov chain Monte Carlo (MCMC) algorithms are an essential tool in Bayesian statistics for sampling from various probability distributions. Many users prefer to use an existing procedure to code these algorithms, while others prefer to write an algorithm from scratch. We demonstrate the various capabilities in SAS® software to satisfy both of these approaches. In particular, we first illustrate the ease of using the MCMC procedure to define a structure. Then we step through the process of using SAS/IML® to write an algorithm from scratch, with examples of a Gibbs sampler and a Metropolis-Hastings random walk.
Chelsea Lofland, University of California, Santa Cruz
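A small self-contained example of the from-scratch approach: a random-walk Metropolis sampler for a standard normal target written in SAS/IML; the target, proposal scale, and chain length are illustrative assumptions.

   proc iml;
      call randseed(4321);
      niter = 10000;
      x = j(niter, 1, 0);                             /* the chain, started at 0 */
      e = 0;  u = 0;
      do i = 2 to niter;
         call randgen(e, "Normal", 0, 0.5);           /* random-walk proposal step */
         prop = x[i-1] + e;
         logratio = -0.5*(prop##2) + 0.5*(x[i-1]##2); /* log density ratio for a N(0,1) target */
         call randgen(u, "Uniform");
         if log(u) < logratio then x[i] = prop;
         else x[i] = x[i-1];
      end;
      accept = mean( x[2:niter] ^= x[1:(niter-1)] );  /* acceptance rate */
      print accept;
   quit;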
Although limited to a small fraction of health care providers, the existence and magnitude of fraud in health insurance programs require the use of fraud prevention and detection procedures. Data mining methods are used to uncover odd billing patterns in large databases of health claims history. Efficient fraud discovery can involve the preliminary step of deploying automated outlier detection techniques in order to classify identified outliers as potential fraud before an in-depth investigation. An essential component of the outlier detection procedure is the identification of proper peer comparison groups to classify providers as within-the-norm or outliers. This study refines the concept of a peer comparison group within the provider category and considers the possibility of distinct billing patterns associated with medical or surgical procedure codes identifiable by the Berenson-Eggers Type of Service (BETOS) system. The BETOS system covers all HCPCS (Healthcare Common Procedure Coding System) codes; assigns each HCPCS code to only one BETOS code; consists of readily understood clinical categories; and consists of categories that permit objective assignment (Centers for Medicare & Medicaid Services, CMS). The study focuses on the specialty of General Practice and involves two steps: first, the identification of clusters of similar BETOS-based billing patterns; and second, the assessment of the effectiveness of these peer comparison groups in identifying outliers. The working data set is a sample of the summary of 2012 data on physicians active in health care government programs, made publicly available by the CMS through its website. The analysis uses PROC FASTCLUS and the SAS® cubic clustering criterion approach to find the optimal number of clusters in the data. It also uses PROC ROBUSTREG to implement a multivariate adaptive threshold outlier detection method.
Paulo Macedo, Integrity Management Services
In our book SAS for Mixed Models, we state that 'the majority of modeling problems are really design problems.' Graduate students and even relatively experienced statistical consultants can find translating a study design into a useful model to be a challenge. Generalized linear mixed models (GLMMs) exacerbate the challenge, because they accommodate complex designs, complex error structures, and non-Gaussian data. This talk covers strategies that have proven successful in design of experiment courses and in consulting sessions. In addition, GLMM methods can be extremely useful in planning experiments. This talk discusses methods to implement precision and power analysis to help choose between competing plausible designs and to assess the adequacy of proposed designs.
Walter Stroup, University of Nebraska, Lincoln
Regression is used to examine the relationship between one or more explanatory (independent) variables and an outcome (dependent) variable. Ordinary least squares regression models the effect of explanatory variables on the average value of the outcome. Sometimes, we are more interested in modeling the median value or some other quantile (for example, the 10th or 90th percentile). Or, we might wonder if a relationship between explanatory variables and outcome variable is the same for the entire range of the outcome: Perhaps the relationship at small values of the outcome is different from the relationship at large values of the outcome. This presentation illustrates quantile regression, comparing it to ordinary least squares regression. Topics covered will include: a graphical explanation of quantile regression, its assumptions and advantages, using the SAS® QUANTREG procedure, and interpretation of the procedure's output.
Ruth Croxford, Institute for Clinical Evaluative Sciences
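A minimal sketch of the PROC QUANTREG call discussed, with hypothetical data set and variable names; the QUANTILE= option lists the quantiles to be modeled.

   proc quantreg data=costs;
      class region;
      model cost = age region / quantile=0.1 0.5 0.9;   /* 10th, 50th, and 90th percentiles */
   run;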
This paper describes the merging of designed experiments and regularized regression to find the significant factors to use in a real-time predictive model. The real-time predictive model is used in a Weyerhaeuser modified fiber mill to estimate the values of several key quality characteristics. The modified fiber product is accepted or rejected based on the model prediction or the predicted values' deviation from lab tests. To develop the model, a designed experiment was needed at the mill. The experiment was planned by a team of engineers, managers, operators, lab techs, and a statistician. The data analysis used the actual values of the process variables manipulated in the designed experiment, because there were instances of the set point not being achieved or maintained for the duration of the run. The lab tests of the key quality characteristics were used as the responses. The experiment was designed with JMP® 64-bit Edition 11.2.0 and estimated the two-way interactions of the process variables. One of the process variables was thought to have a curvilinear effect, and this effect was also made estimable. The LASSO method, as implemented in the SAS® GLMSELECT procedure, was used to analyze the data after cleaning and validating the results. Cross validation was used as the model selection criterion. The resulting prediction model passed all the predetermined success criteria and is currently being used in real time in the mill for product quality acceptance.
Jon Lindenauer, Weyerhaeuser
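A rough sketch of the kind of GLMSELECT call described, with hypothetical variable names: LASSO selection with five-fold cross validation used to choose the final model, two-way interactions made available via the bar operator, and a quadratic term for the suspected curvilinear variable.

   proc glmselect data=millruns plots=coefficients;
      model strength = x1 | x2 | x3 | x4 @2  x3*x3
            / selection=lasso(stop=none choose=cv) cvmethod=random(5);
   run;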
Logistic regression is one of the most widely used regression models in statistics. Its dependent variable is a discrete variable with at least two levels. It measures the relationship between categorical dependent variables and independent variables and predicts the likelihood of having the event associated with an outcome variable. Variable reduction and screening are techniques that remove redundant independent variables and filter out independent variables with less predictive power. For several reasons, variable reduction and screening are important and critical, especially when there are hundreds of covariates that could possibly enter the model. These techniques decrease model estimation time, help avoid non-convergence, and can yield unbiased inference about the outcome. This paper reviews various approaches to variable reduction, including traditional methods starting from univariate single logistic regressions and methods based on variable clustering techniques. It also presents hands-on examples using SAS/STAT® software and illustrates three selection nodes available in SAS® Enterprise Miner™: the Variable Selection node, the Variable Clustering node, and the Decision Tree node.
Xinghe Lu, Amerihealth Caritas
Sometimes, the relationship between an outcome (dependent) variable and the explanatory (independent) variable(s) is not linear. Restricted cubic splines are a way of testing the hypothesis that the relationship is not linear or summarizing a relationship that is too non-linear to be usefully summarized by a linear relationship. Restricted cubic splines are just a transformation of an independent variable. Thus, they can be used not only in ordinary least squares regression, but also in logistic regression, survival analysis, and so on. The range of values of the independent variable is split up, with knots defining the end of one segment and the start of the next. Separate curves are fit to each segment. Overall, the splines are defined so that the resulting fitted curve is smooth and continuous. This presentation describes when splines might be used, how the splines are defined, choice of knots, and interpretation of the regression results.
Ruth Croxford, Institute for Clinical Evaluative Sciences
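A hedged sketch of a restricted (natural) cubic spline built with the EFFECT statement, shown here inside PROC LOGISTIC; the data set, variables, knot percentiles, and choice of procedure are assumptions.

   proc logistic data=study;
      class sex / param=ref;
      effect age_spl = spline(age / naturalcubic basis=tpf(noint)
                                    knotmethod=percentilelist(5 25 50 75 95));
      model event(event='1') = age_spl sex;
   run;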
Longitudinal data with time-dependent covariates is not readily analyzed as there are inherent, complex correlations due to the repeated measurements on the sampling unit and the feedback process between the covariates in one time period and the response in another. A generalized method of moments (GMM) logistic regression model (Lalonde, Wilson, and Yin 2014) is one method for analyzing such correlated binary data. While GMM can account for the correlation due to both of these factors, it is imperative to identify the appropriate estimating equations in the model. Cai and Wilson (2015) developed a SAS® macro using SAS/IML® software to fit GMM logistic regression models with extended classifications. In this paper, we expand the use of this macro to allow for continuous responses and as many repeated time points and predictors as possible. We demonstrate the use of the macro through two examples, one with binary response and another with continuous response.
Katherine Cai, Arizona State University
Jeffrey Wilson, Arizona State University
Missing data is something that we cannot always prevent. Data can be missing because subjects refuse to answer a sensitive question or fear embarrassment. Researchers often assume that their data are missing completely at random or missing at random. Unfortunately, we cannot test whether the assumed missingness mechanism holds because the missing values themselves cannot be observed. Alternatively, we can run simulations in SAS® to observe the behavior of missing data under different assumptions: missing completely at random, missing at random, and nonignorable missingness. We compare the effects of imputation methods when we set a group of variables of interest to missing. The idea is to see how the substituted values in a data set affect further analyses. This lets the audience decide which method(s) would be best to use when a data set has missing data.
Danny Rithy, California Polytechnic State University
Soma Roy, California Polytechnic State University
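For readers who want to try one of the approaches, a minimal multiple-imputation sketch with PROC MI and PROC MIANALYZE follows; the data set, variables, and number of imputations are assumptions.

   proc mi data=simdata nimpute=20 seed=2016 out=mi_out;
      var x1 x2 y;
   run;

   proc reg data=mi_out outest=est covout noprint;
      model y = x1 x2;
      by _imputation_;
   run;

   proc mianalyze data=est;
      modeleffects Intercept x1 x2;
   run;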
The increasing size and complexity of data in research and business applications require a more versatile set of tools for building explanatory and predictive statistical models. In response to this need, SAS/STAT® software continues to add new methods. This presentation takes you on a high-level tour of five recent enhancements: new effect selection methods for regression models with the GLMSELECT procedure, model selection for generalized linear models with the HPGENSELECT procedure, model selection for quantile regression with the HPQUANTSELECT procedure, construction of generalized additive models with the GAMPL procedure, and building classification and regression trees with the HPSPLIT procedure. For each of these approaches, the presentation reviews its key concepts, uses a basic example to illustrate its benefits, and guides you to information that will help you get started.
Bob Rodriguez, SAS
It's impossible to know all of SAS® or all of statistics. There will always be some technique that you don't know. However, there are a few techniques that anyone in biostatistics should know. If you can calculate those with SAS, life is all the better. In this session you will learn how to compute and interpret a baker's dozen of these techniques, including several statistics that are frequently confused. The following statistics are covered: prevalence, incidence, sensitivity, specificity, attributable fraction, population attributable fraction, risk difference, relative risk, odds ratio, Fisher's exact test, number needed to treat, and McNemar's test. With these 13 tools in their tool chest, even nonstatisticians or statisticians who are not specialists will be able to answer many common questions in biostatistics. The fact that each of these can be computed with a few statements in SAS makes the situation all the sweeter. Bring your own doughnuts.
AnnMaria De Mars, 7 Generation Games
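Several of these statistics do come straight out of PROC FREQ; a minimal sketch for a 2x2 cohort table and a paired table, with hypothetical data set and variable names, is:

   proc freq data=cohort;
      tables exposure*disease / relrisk riskdiff chisq fisher;   /* odds ratio, relative risk, risk difference, Fisher's exact */
   run;

   proc freq data=paired;
      tables test1*test2 / agree;                                /* McNemar's test */
   run;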
VBA has been described as a glue language and has been widely used to exchange data between Microsoft products such as Excel, Word, and PowerPoint. How to trigger a VBA macro from SAS® via DDE has been widely discussed in recent years. However, using SAS to send parameters to a VBA macro has seldom been reported. This paper provides a solution for this problem. Copying Excel tables into PowerPoint using a combination of SAS and VBA is illustrated as an example. The SAS program rapidly scans all Excel files contained in one folder, passes the file information to VBA as parameters, and triggers the VBA macro to write PowerPoint files in a loop. As a result, a batch of PowerPoint files can be generated with just one mouse click.
Zhu Yanrong, Medtronic
Inherently, mixed modeling with SAS/STAT® procedures (such as GLIMMIX, MIXED, and NLMIXED) is computationally intensive. Therefore, considerable memory and CPU time can be required. The default algorithms in these procedures might fail to converge for some data sets and models. This encore presentation of a paper from SAS Global Forum 2012 provides recommendations for circumventing memory problems and reducing execution times for your mixed-modeling analyses. This paper also shows how the new HPMIXED procedure can be beneficial for certain situations, as with large sparse mixed models. Lastly, the discussion focuses on the best way to interpret and address common notes, warnings, and error messages that can occur with the estimation of mixed models in SAS® software.
Phil Gibbs, SAS
Health care and other programs collect large amounts of information in order to administer the program. A health insurance plan, for example, probably records every physician service, hospital and emergency department visit, and prescription medication--information that is collected in order to make payments to the various providers (hospitals, physicians, pharmacists). Although the data are collected for administrative purposes, these databases can also be used to address research questions, including questions that are unethical or too expensive to answer using randomized experiments. However, when subjects are not randomly assigned to treatment groups, we worry about assignment bias--the possibility that the people in one treatment group were healthier, smarter, more compliant, etc., than those in the other group, biasing the comparison of the two treatments. Propensity score methods are one way to adjust the comparisons, enabling research using administrative data to mimic research using randomized controlled trials. In this presentation, I explain what the propensity score is, how it is used to compare two treatments, and how to implement a propensity score analysis using SAS®. Two methods using propensity scores are presented: matching and inverse probability weighting. Examples are drawn from health services research using the administrative databases of the Ontario Health Insurance Plan, the single payer for medically necessary care for the 13 million residents of Ontario, Canada. However, propensity score methods are not limited to health care. They can be used to examine the impact of any nonrandomized program, as long as there is enough information to calculate a credible propensity score.
Ruth Croxford, Institute for Clinical Evaluative Sciences
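A hedged sketch of the inverse probability weighting approach: estimate the propensity score with PROC LOGISTIC, build the weights in a DATA step, and fit a weighted outcome model with empirical standard errors in PROC GENMOD (all names are assumptions).

   proc logistic data=cohort;
      class sex / param=ref;
      model treated(event='1') = age sex comorbidity_count;
      output out=ps p=pscore;                      /* estimated propensity score */
   run;

   data ipw;
      set ps;
      if treated = 1 then wt = 1/pscore;
      else wt = 1/(1 - pscore);
   run;

   proc genmod data=ipw;
      class id;
      model outcome(event='1') = treated / dist=binomial link=logit;
      weight wt;
      repeated subject=id / type=ind;              /* empirical (sandwich) standard errors */
   run;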
Many SAS® procedures can be used to analyze large amounts of correlated data. This study was a secondary analysis of data obtained from the South Carolina Revenue and Fiscal Affairs Office (RFA). The data include medical claims from all health care systems in South Carolina (SC). This study used the GENMOD procedure to analyze a large amount of correlated data about Military Health System (MHS) beneficiaries who received behavioral health care in South Carolina health care systems from 2005 to 2014. Behavioral health (BH) was defined by Major Diagnostic Category (MDC) 19 (mental disorders and diseases) and 20 (alcohol/drug use). MDCs are formed by dividing all possible principal diagnoses from the International Classification of Diseases (ICD-9) codes into 25 mutually exclusive diagnostic categories. The sample included a total of 6,783 BH visits and 4,827 unique adult and child patients, including military service members, veterans, and their adult and child dependents who have MHS insurance coverage. PROC GENMOD fit a multivariable GEE model with type of BH visit (mental health or substance abuse) as the dependent variable and with gender, race group, age group, and discharge year as predictors. Hospital ID was used in the REPEATED statement with different correlation structures. Gender was significant with both the independent correlation structure (p = .0001) and the exchangeable structure (p = .0003). However, age group was significant using the independent correlation structure (p = .0160) but non-significant using the exchangeable correlation structure (p = .0584). SAS is a powerful statistical program for analyzing large amounts of correlated data with categorical outcomes.
Abbas Tavakoli, University of South Carolina
Jordan Brittingham, USC/Arnold School of Public Health
Nikki R. Wooten, USC/College of Social Work
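A hedged sketch of the kind of GEE model described, with hypothetical data set and variable names:

   proc genmod data=bh_visits;
      class hospital_id gender race_group age_group discharge_year;
      model visit_type(event='MH') = gender race_group age_group discharge_year
            / dist=binomial link=logit;
      repeated subject=hospital_id / type=exch corrw;   /* exchangeable working correlation */
   run;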
The objective of this study is to use the GLM procedure in SAS® to solve a complex linkage problem with multiple test forms in educational research. Typically, the ABSORB statement in the GLM procedure makes this task relatively easy to implement. Note that for educational assessments, applying one-dimensional combinations of two-parameter logistic (2PL) models (Hambleton, Swaminathan, and Rogers 1991, ch. 1) and generalized partial credit models (Muraki 1997) to a large-scale, high-stakes testing program with very frequent administrations requires a practical approach to linking test forms. Haberman (2009) suggested a pragmatic solution of simultaneous linking to solve this challenging linking problem, in which many separately calibrated test forms are linked by the use of least squares methods. In SAS, the GLM procedure can be used to implement this algorithm by using the ABSORB statement for the variable that specifies administrations, as long as the data are sorted by order of administration. This paper presents the use of SAS to examine the application of this proposed methodology to a simple case of real data.
Lili Yao, Educational Testing Service
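A minimal sketch of the ABSORB approach, with hypothetical data set and variable names; the data must be sorted by the absorbed administration variable.

   proc sort data=item_scores;
      by admin;
   run;

   proc glm data=item_scores;
      absorb admin;                     /* absorbs administration effects without building their design columns */
      class item;
      model score = item / solution;
   run;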
In health care and epidemiological research, there is a growing need for basic graphical output that is clear, easy to interpret, and easy to create. SAS® 9.3 has a very clear and customizable graphic called a Radar Graph, yet it can only display the unique responses of one variable and would not be useful for multiple binary variables. In this paper we describe a way to display multiple binary variables for a single population on a single radar graph. Then we convert our method into a macro with as few parameters as possible to make this procedure available for everyday users.
Kevin Sundquist, Columbia University Medical Center
Jacob E Julian, Columbia University Medical Center
Faith Parsons, Columbia University Medical Center
Average yards per reception, as well as the number of touchdowns, is commonly used to rank National Football League (NFL) players. However, scoring a touchdown lowers a player's average because it stops the play and therefore prevents the player from gaining more yardage. If yardages are tabulated in a life table, then yardage from a touchdown play can be treated as a right-censored observation. This paper discusses the application of the SAS/STAT® Kaplan-Meier product-limit estimator to adjust these averages. Using 15 seasons of NFL receiving data, the relationship between touchdown rates and average yards per reception is compared before and after adjustment. The modification of the adjustment when a player incurred a 2-point safety during the season is also discussed.
Keith Curtis, USAA
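A minimal sketch of the adjustment, treating touchdown receptions as right-censored yardage; the data set and variable names are assumptions.

   proc lifetest data=receptions;
      time yards*touchdown(1);    /* touchdown plays (touchdown=1) are right-censored */
      strata player;
   run;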
Each night on the news we hear the level of the Dow Jones Industrial Average along with the 'first difference,' which is today's price-weighted average minus yesterday's. It is that series of first differences that excites or depresses us each night, as it reflects whether stocks made or lost money that day. Furthermore, the differences form the data series that has the most addressable statistical features. In particular, the differences satisfy the stationarity requirement, which justifies standard distributional results such as asymptotically normal distributions of parameter estimates. Differencing arises in many practical time series because they seem to have what are called 'unit roots,' which mathematically indicate the need to take differences. In 1976, Dickey and Fuller developed the first well-known tests to decide whether differencing is needed. These tests are part of the ARIMA procedure in SAS/ETS® software, in addition to many other time series analysis products. I'll review a little of what it was like to do the development and the required computing back then, say a little about why this is an important issue, and focus on examples.
David Dickey, NC State University
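A minimal sketch of requesting augmented Dickey-Fuller unit root tests in PROC ARIMA and then working with the first differences; the data set, variable, and model orders are assumptions.

   proc arima data=djia;
      identify var=close stationarity=(adf=(0,1,2));   /* ADF tests with 0, 1, and 2 augmenting lags */
      identify var=close(1);                           /* first differences */
      estimate p=1 q=1;
   run;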
This paper shows how to create maps from the output of the SAS/STAT® SPP procedure by using the same map layer as the data. The clipping can be done with PROC GINSIDE, and the final plot can be drawn with PROC GMAP. The result is a map that is nicer than the plot produced by PROC SPP.
Alan Silva, University of Brasilia
SAS/IML® 14.1 enables you to author, install, and call packages. A package consists of SAS/IML source code, documentation, data sets, and sample programs. Packages provide a simple way to share SAS/IML functions. An expert who writes a statistical analysis in SAS/IML can create a package and upload it to the SAS/IML File Exchange. A nonexpert can download the package, install it, and immediately start using it. Packages provide a standard and uniform mechanism for sharing programs, which benefits both experts and nonexperts. Packages are very popular with users of other statistical software, such as R. This paper describes how SAS/IML programmers can construct, upload, download, and install packages. They're not wrapped in brown paper or tied up with strings, but they'll soon be a few of your favorite things!
Rick Wicklin, SAS
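A hedged sketch of the package workflow in SAS/IML 14.1; the package name and the path to the downloaded file are hypothetical placeholders.

   proc iml;
      package install "C:\Downloads\MyPackage.zip";   /* one-time installation from a downloaded file */
      package load MyPackage;                         /* load the package's modules */
      package help MyPackage;                         /* display its documentation */
   quit;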