SAS/STAT Papers A-Z

A
Session 1223-2017:
A SAS® Macro for Covariate Specification in Linear, Logistic, and Survival Regression
Specifying the functional form of a covariate is a fundamental part of developing a regression model. The choice to include a variable as continuous, categorical, or as a spline can be determined by model fit. This paper offers an efficient and user-friendly SAS® macro (%SPECI) to help analysts determine how best to specify the appropriate functional form of a covariate in linear, logistic, and survival analysis models. For each model, our macro provides a graphical and statistical single-page comparison report of the covariate as a continuous, categorical, and restricted cubic spline variable so that users can easily compare and contrast results. The report includes the residual plot and the distribution of the covariate. You can also include other covariates in the model for multivariable adjustment. The output displays the likelihood ratio statistic, the Akaike information criterion (AIC), and other model-specific statistics. The %SPECI macro is demonstrated using an example data set. The macro uses the REG, LOGISTIC, PHREG, REPORT, and SGPLOT procedures, among others, in SAS® 9.4.
Read the paper (PDF) | Download the data file (ZIP)
Sai Liu, Stanford University
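For readers who want to try the comparison that %SPECI automates, a minimal sketch of the survival case is shown here, assuming a hypothetical data set COHORT with follow-up time FUTIME, event indicator DEATH, covariate AGE, and a pre-binned version AGE_GRP (all names are assumptions). The three fits can be compared through the AIC values in their Model Fit Statistics tables.

   /* Continuous form */
   proc phreg data=cohort;
      model futime*death(0) = age;
   run;

   /* Categorical form (AGE_GRP is a binned version of AGE) */
   proc phreg data=cohort;
      class age_grp / param=ref;
      model futime*death(0) = age_grp;
   run;

   /* Restricted cubic spline form */
   proc phreg data=cohort;
      effect age_spl = spline(age / naturalcubic basis=tpf(noint)
                              knotmethod=percentiles(5));
      model futime*death(0) = age_spl;
   run;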
Session SAS0478-2017:
Advanced Hierarchical Modeling with the MCMC Procedure
Hierarchical models, also known as random-effects models, are widely used for data that consist of collections of units and are hierarchically structured. Bayesian methods offer flexibility in modeling assumptions that enable you to develop models that capture the complex nature of real-world data. These flexible modeling techniques include choice of likelihood functions or prior distributions, regression structure, multiple levels of observational units, and so on. This paper shows how you can fit these complex, multilevel hierarchical models by using the MCMC procedure in SAS/STAT® software. PROC MCMC easily handles models that go beyond the single-level random-effects model, which typically assumes the normal distribution for the random effects and estimates regression coefficients. This paper shows how you can use PROC MCMC to fit hierarchical models that have varying degrees of complexity, from frequently encountered conditional independent models to more involved cases of modeling intricate interdependence. Examples include multilevel models for single and multiple outcomes, nested and non-nested models, autoregressive models, and Cox regression models with frailty. Also discussed are repeated measurement models, latent class models, spatial models, and models with nonnormal random-effects prior distributions.
Read the paper (PDF)
Fang Chen, SAS
Maura Stokes, SAS
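As a flavor of the syntax the paper builds on, here is a minimal sketch of a single-level random-intercept (random-effects) model in PROC MCMC, assuming a hypothetical data set SCHOOLS with outcome SCORE, covariate SES, and grouping variable SCHOOL; the paper's multilevel and nonnormal examples extend this same pattern.

   proc mcmc data=schools nmc=20000 seed=27513 outpost=post;
      parms beta0 0 beta1 0;                     /* fixed effects */
      parms s2u 1 s2e 1;                         /* variance components */
      prior beta: ~ normal(0, var=1e6);
      prior s2u s2e ~ igamma(0.01, scale=0.01);
      random u ~ normal(0, var=s2u) subject=school;   /* school-level random intercept */
      mu = beta0 + beta1*ses + u;
      model score ~ normal(mu, var=s2e);
   run;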
Session SAS0339-2017:
An Oasis of Serenity in a Sea of Chaos: Automating the Management of Your UNIX/Linux Multi-tiered SAS® Services
UNIX and Linux SAS® administrators, have you ever been greeted by one of these statements as you walk into the office before you have gotten your first cup of coffee? Power outage! SAS servers are down. I cannot access my reports. Have you frantically tried to restart the SAS servers to avoid loss of productivity and missed one of the steps in the process, causing further delays while other work continues to pile up? If you have had this experience, you understand the benefit to be gained from a utility that automates the management of these multi-tiered deployments. Until recently, there was no method for automatically starting and stopping multi-tiered services in an orchestrated fashion. Instead, you had to use time-consuming manual procedures to manage SAS services. These procedures were also prone to human error, which could result in corrupted services and additional time lost, debugging and resolving issues injected by this process. To address this challenge, SAS Technical Support created the SAS Local Services Management (SAS_lsm) utility, which provides automated, orderly management of your SAS® multi-tiered deployments. The intent of this paper is to demonstrate the deployment and usage of the SAS_lsm utility. Now, go grab a coffee, and let's see how SAS_lsm can make life less chaotic.
Read the paper (PDF)
Clifford Meyers, SAS
Session 1131-2017:
Application of Survival Analysis for Predicting Customer Churn with Recency, Frequency, and Monetary
Customer churn is an important area of concern that affects not just the growth of your company, but also the profit. Conventional survival analysis can provide a customer's likelihood to churn in the near term, but it does not take into account the lifetime value of the higher-risk churn customers you are trying to retain. Not all customers are equally important to your company. Recency, frequency, and monetary (RFM) analysis can help companies identify customers that are most important and most likely to respond to a retention offer. In this paper, we use the IML and PHREG procedures to combine the RFM analysis and survival analysis in order to determine the optimal number of higher-risk and higher-value customers to retain.
Read the paper (PDF)
Bo Zhang, IBM
Liwei Wang, Pharmaceutical Product Development Inc
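A minimal sketch of the survival piece, assuming a hypothetical CUSTOMERS table with observed tenure in months (TENURE), a churn indicator (CHURN), and RFM scores; PROFILES is an assumed data set of RFM profiles for which survival curves are requested. The RFM segmentation and retention-value optimization described in the paper would be layered on top of this.

   proc phreg data=customers;
      model tenure*churn(0) = recency frequency monetary / risklimits;
      baseline out=surv_pred covariates=profiles survival=s_hat;
   run;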
C
Session 0829-2017:
Choose Carefully! An Assessment of Different Sample Designs on Estimates of Official Statistics
Designing a survey is a meticulous process involving a number of steps and many complex choices. For most survey researchers, the choice of a probability or non-probability sample is somewhat simple. However, throughout the sample design process, there are more complex choices, each of which can introduce bias into the survey estimates. For example, the sampling statistician must decide whether to stratify the frame. And, if so, he has to decide how many strata to use, whether to explicitly stratify, and how a stratum should be defined. He also has to decide whether to use clusters. And, if so, how to define a cluster and what the ideal cluster size should be. The factors affecting these choices, along with the impact of different sample designs on survey estimates, are explored in this paper. The SURVEYSELECT procedure in SAS/STAT® 14.1 is used to select a number of samples based on different designs using data from Jamaica's 2011 Population and Housing Census. Census results are assumed to be equal to the true population parameter. The estimates from each selected sample are evaluated against this parameter to assess the impact of different sample designs on point estimates. Design-adjusted survey estimates are computed using the SURVEYMEANS and SURVEYFREQ procedures in SAS/STAT 14.1. The resultant variances are evaluated to determine the sample design that yields the most precise estimates.
Read the paper (PDF)
Leesha Delatie-Budair, Statistical Institute of Jamaica
Jessica Campbell, Statistical Institute of Jamaica
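A hedged sketch of one of the designs compared in the paper: a stratified cluster sample drawn with PROC SURVEYSELECT and analyzed with design-adjusted PROC SURVEYMEANS. The frame, stratum, cluster, and analysis variable names are assumptions, and the SamplingWeight variable that SURVEYSELECT writes to the output data set for an equal-probability design is used as the analysis weight.

   proc surveyselect data=census_frame out=ed_sample seed=2011
                     method=srs sampsize=50;    /* 50 clusters per stratum */
      strata parish;
      cluster enum_district;
   run;

   proc surveymeans data=ed_sample mean clm;
      strata parish;
      cluster enum_district;
      weight samplingweight;
      var hh_income;
   run;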
Session 0288-2017:
Continuous Predictors in Regression Analyses
This presentation discusses the options for including continuous covariates in regression models. In his book, 'Clinical Prediction Models,' Ewout Steyerberg presents a hierarchy of procedures for continuous predictors, starting with dichotomizing the variable and moving to modeling the variable using restricted cubic splines or using a fractional polynomial model. This presentation discusses all of the choices, with a focus on the last two. Restricted cubic splines express the relationship between the continuous covariate and the outcome using a set of cubic polynomials, which are constrained to meet at pre-specified points, called knots. Between the knots, each curve can take on the shape that best describes the data. A fractional polynomial model is another flexible method for modeling a relationship that is possibly nonlinear. In this model, polynomials with noninteger and negative powers are considered, along with the more conventional square and cubic polynomials, and the small subset of powers that best fits the data is selected. The presentation describes and illustrates these methods at an introductory level intended to be useful to anyone who is familiar with regression analyses.
Read the paper (PDF)
Ruth Croxford, Institute for Clinical Evaluative Sciences
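As a companion to the discussion of restricted cubic splines, a minimal sketch using the EFFECT statement in PROC LOGISTIC, with knots placed at percentiles often recommended for five-knot splines; the data set and variable names are hypothetical.

   proc logistic data=cohort;
      class sex / param=ref;
      effect bmi_rcs = spline(bmi / naturalcubic basis=tpf(noint)
                              knotmethod=percentilelist(5 27.5 50 72.5 95));
      model death(event='1') = bmi_rcs age sex;
   run;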
Session SAS0687-2017:
Convergence of Big Data, the Cloud, and Analytics: A Docker Toolbox for the Data Scientist
Learn how SAS® works with analytics, big data, and the cloud all in one product: SAS® Analytics for Containers. This session describes the architecture of containers running in the public, private, or hybrid cloud. The reference architecture also shows how SAS leverages the distributed compute of Hadoop. Topics include how SAS products such as Base SAS®, SAS/STAT® software, and SAS/GRAPH® software can all run in a container in the cloud. This paper discusses how to work with a SAS container running in a variety of Infrastructure as a Service (IaaS) models, including Amazon Web Services and OpenStack cloud. Additional topics include provisioning web-browser-based clients via Jupyter Notebooks and SAS® Studio to provide data scientists with the tool of their choice. A customer use case is discussed that describes how SAS Analytics for Containers enables an IT department to meet the ad hoc, compute-intensive, and scaling demands of the organization. An exciting differentiator for the data scientist is the ability to send some or all of the analytic workload to run inside their Hadoop cluster by using the SAS accelerators for Hadoop. Doing so enables data scientists to dive inside the data lake and harness the power of all the data.
Read the paper (PDF)
Donna De Capite, SAS
D
Session 0837-2017:
Data Science Rex: How Data Science Is Evolving (or Facing Extinction) across the Academic Landscape
The discipline of data science has seen an unprecedented evolution from primordial darkness to becoming the academic equivalent of an apex predator on university campuses across the country. But, survival of the discipline is not guaranteed. This session explores the genetic makeup of programs that are likely to survive, the genetic makeup of those that are likely to become extinct, and the role that the business community plays in that evolutionary process.
Read the paper (PDF)
Jennifer Priestley, Kennesaw State University
Session 1472-2017:
Differential Item Functioning Using SAS®: An Item Response Theory Approach for Graded Responses
Until recently, psychometric analyses of test data within the Item Response Theory (IRT) framework were conducted using specialized, commercial software. However, with the inclusion of the IRT procedure in the suite of SAS® statistical tools, SAS users can explore the psychometric properties of test items using modern test theory, or IRT. Considering the item as the unit of analysis, the relationship between test items and the constructs they measure can be modeled as a function of an unobservable, or latent, variable. This latent variable or trait (for example, ability or proficiency) varies in the population. However, when examinees who have the same trait level do not have the same probability of answering correctly or endorsing an item, the item is said to be functioning differentially, or exhibiting differential item functioning (DIF) (Thissen, Steinberg, and Wainer, 2012). This study introduces the implementation of PROC IRT for conducting a DIF analysis for graded responses, using Samejima's graded response model (GRM; Samejima, 1969, 2010). The effectiveness of PROC IRT for the evaluation of DIF items is assessed in terms of the Type I error and statistical power of the likelihood ratio test for testing DIF in graded responses.
Patricia Rodríguez de Gil, University of South Florida
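A minimal sketch of fitting Samejima's graded response model with PROC IRT, assuming hypothetical ordinal items ITEM1-ITEM10; the DIF analysis itself would proceed by comparing constrained and group-specific calibrations with a likelihood ratio test, as described in the paper.

   proc irt data=survey_items;
      var item1-item10;
      model item1-item10 / resfunc=graded;
   run;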
Session 0957-2017:
Does Factor Indeterminacy Matter in Multidimensional Item Response Theory?
This paper illustrates proper applications of multidimensional item response theory (MIRT), which is available in SAS® PROC IRT. MIRT combines item response theory (IRT) modeling and factor analysis when the instrument carries two or more latent traits. Although it might seem convenient to accomplish two tasks simultaneously by using one procedure, users should be cautious of misinterpretations. This illustration uses the 2012 Program for International Student Assessment (PISA) data set collected by the Organisation for Economic Co-operation and Development (OECD). Because there are two known sub-domains in the PISA test (reading and math), PROC IRT was programmed to adopt a two-factor solution. In addition, the loading plot, dual plot, item difficulty/discrimination plot, and test information function plot in JMP® were used to examine the psychometric properties of the PISA test. When the reading and math items were analyzed in SAS MIRT, seven to ten latent factors were suggested. At first glance, these results are puzzling because ideally all items should load onto two factors. However, when the psychometric attributes yielded from a two-parameter IRT analysis are examined, it is evident that both the reading and math test items are well written. It is concluded that even if factor indeterminacy is present, it is advisable to evaluate the test's psychometric soundness based on IRT, because content validity can supersede construct validity.
Read the paper (PDF) | View the e-poster or slides (PDF)
Chong Ho Yu, Azusa Pacific University
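A hedged sketch of the kind of two-factor MIRT specification the paper examines, assuming hypothetical reading items R1-R12 and math items M1-M12 from the PISA file.

   proc irt data=pisa nfactor=2;
      var r1-r12 m1-m12;
      model r1-r12 m1-m12 / resfunc=twop;
   run;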
E
Session SAS0374-2017:
Estimating Causal Effects from Observational Data with the CAUSALTRT Procedure
Randomized control trials have long been considered the gold standard for establishing causal treatment effects. Can causal effects be reasonably estimated from observational data too? In observational studies, you observe treatment T and outcome Y without controlling confounding variables that might explain the observed associations between T and Y. Estimating the causal effect of treatment T therefore requires adjustments that remove the effects of the confounding variables. The new CAUSALTRT (causal-treat) procedure in SAS/STAT® 14.2 enables you to estimate the causal effect of a treatment decision by modeling either the treatment assignment T or the outcome Y, or both. Specifically, modeling the treatment leads to the inverse probability weighting methods, and modeling the outcome leads to the regression methods. Combined modeling of the treatment and outcome leads to doubly robust methods that can provide unbiased estimates for the treatment effect even if one of the models is misspecified. This paper reviews the statistical methods that are implemented in the CAUSALTRT procedure and includes examples of how you can use this procedure to estimate causal effects from observational data. This paper also illustrates some other important features of the CAUSALTRT procedure, including bootstrap resampling, covariate balance diagnostics, and statistical graphics.
Read the paper (PDF)
Michael Lamm, SAS
Yiu-Fai Yung, SAS
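A minimal sketch of a doubly robust analysis with PROC CAUSALTRT, assuming a hypothetical data set with binary treatment TREAT, outcome WEIGHT_LOSS, and confounders AGE, EXERCISE, and BASELINE_BMI; the PSMODEL statement models the treatment assignment and the MODEL statement models the outcome.

   proc causaltrt data=observed method=aipw;
      class exercise;
      psmodel treat = age exercise baseline_bmi;          /* treatment model */
      model weight_loss = age exercise baseline_bmi;      /* outcome model */
   run;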
Session 0767-2017:
Estimation Strategies Involving Pooled Survey Data
Pooling two or more cross-sectional survey data sets (such as stacking the data sets on top of one another) is a strategy often used by researchers for one of two purposes: (1) to more efficiently conduct significance tests on point estimate changes observed over time or (2) to increase the sample size in hopes of improving the precision of a point estimate. The latter purpose is especially common when making inferences on a subgroup, or domain, of the target population insufficiently represented by a single survey data set. Using data from the National Survey of Family Growth (NSFG), the aim of this paper is to walk through a series of practical estimation objectives that can be tackled by analyzing data from two or more pooled survey data sets. Where applicable, we comment on the resulting interpretive nuances.
Read the paper (PDF)
Taylor Lewis, George Mason University
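A hedged sketch of the basic pooling step, assuming two hypothetical NSFG public-use files with stratum variable SEST, cluster variable SECU, and final weight WGT (check the codebook for the actual names); dividing the weights by the number of pooled files is one common convention, and DOMAIN restricts inference to a subgroup.

   data nsfg_pooled;
      set nsfg_a nsfg_b;
      wgt_pooled = wgt / 2;     /* average the weights across the two pooled periods */
   run;

   proc surveymeans data=nsfg_pooled mean clm;
      strata sest;
      cluster secu;
      weight wgt_pooled;
      domain agegrp;
      var evermarried;
   run;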
Session SAS0462-2017:
Evaluating Predictive Accuracy of Survival Models with PROC PHREG
Model validation is an important step in the model building process because it provides opportunities to assess the reliability of models before their deployment. Predictive accuracy measures the ability of the models to predict future risks, and significant developments have been made in recent years in the evaluation of survival models. SAS/STAT® 14.2 includes updates to the PHREG procedure with a variety of techniques to calculate overall concordance statistics and time-dependent receiver operating characteristic (ROC) curves for right-censored data. This paper describes how to use these criteria to validate and compare fitted survival models and presents examples to illustrate these applications.
Read the paper (PDF)
Changbin Guo, SAS
Ying So, SAS
Woosung Jang, SAS
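A minimal sketch of the new assessment options, assuming a hypothetical TRIAL data set; CONCORDANCE=HARRELL requests an overall concordance statistic, and the ROCOPTIONS suboption shown here requests time-dependent ROC curves at selected time points (consult the SAS/STAT 14.2 PHREG documentation for the full set of suboptions).

   ods graphics on;
   proc phreg data=trial concordance=harrell plots=roc rocoptions(at=12 24 36);
      model months*status(0) = age stage treatment;
   run;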
F
Session 0902-2017:
Fitting Complex Statistical Models with NLMIXED and MCMC Procedures
SAS/STAT® software has several procedures that estimate parameters from generalized linear models designed for both continuous and discrete response data (including proportions and counts). Procedures such as LOGISTIC, GENMOD, GLIMMIX, and FMM, among others, offer a flexible range of analysis options to work with data from a variety of distributions and also with correlated or clustered data. SAS® procedures can also model zero-inflated and truncated distributions. This paper demonstrates how statements from PROC NLMIXED can be written to match the output results from these procedures, including the LS-means. The flexible programming statements of PROC NLMIXED are also needed in other situations, such as zero-inflated or hurdle models, truncated counts, or proportions (including legitimate zeros) that have random effects, and for probability distributions not available elsewhere. A useful application of these coding techniques is that programming statements from NLMIXED can often be transferred directly into PROC MCMC with little or no modification to perform analyses of these various types of complex models from a Bayesian perspective.
Read the paper (PDF)
Robin High, University of Nebraska Medical Center
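A minimal example of the matching exercise the paper describes, for a plain Poisson regression with hypothetical variables Y and X: the NLMIXED parameterization reproduces the GENMOD estimates.

   /* GENMOD fit for reference */
   proc genmod data=counts;
      model y = x / dist=poisson link=log;
   run;

   /* The same model written out in NLMIXED */
   proc nlmixed data=counts;
      parms b0=0 b1=0;
      eta = b0 + b1*x;
      mu  = exp(eta);
      model y ~ poisson(mu);
   run;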
Session 0202-2017:
Fitting a Flexible Model for Longitudinal Count Data Using the NLMIXED Procedure
Longitudinal count data arise when a subject's outcomes are measured repeatedly over time. Repeated measures count data have an inherent within-subject correlation that is commonly modeled with random effects in the standard Poisson regression. A Poisson regression model with random effects is easily fit in SAS® using existing options in the NLMIXED procedure. This model allows for overdispersion via the nature of the repeated measures; however, departures from equidispersion can also exist due to the underlying count process mechanism. We present an extension of the cross-sectional COM-Poisson (CMP) regression model established by Sellers and Shmueli (2010), a generalized regression model for count data in light of inherent data dispersion, to incorporate random effects for the analysis of longitudinal count data. We detail how to fit the CMP longitudinal model via a user-defined log-likelihood function in PROC NLMIXED. We demonstrate the flexibility of the CMP longitudinal model via simulated and real data examples.
Read the paper (PDF)
Darcy Morris, U.S. Census Bureau
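A hedged sketch of the general idea, not the author's exact code: a CMP model with a random intercept written as a user-defined log likelihood in PROC NLMIXED, with the infinite-series normalizing constant truncated at 100 terms. The data set and variable names are assumptions.

   proc nlmixed data=visits qpoints=5;
      parms b0=0 b1=0 lognu=0 s2u=1;
      nu     = exp(lognu);                 /* dispersion parameter */
      loglam = b0 + b1*time + u;           /* linear predictor with random intercept */
      /* truncated-series approximation of the CMP normalizing constant */
      z = 0;
      do j = 0 to 100;
         z = z + exp(j*loglam - nu*lgamma(j+1));
      end;
      ll = y*loglam - nu*lgamma(y+1) - log(z);
      model y ~ general(ll);
      random u ~ normal(0, s2u) subject=id;
   run;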
Session SAS0525-2017:
Five Things You Should Know about Quantile Regression
The increasing complexity of data in research and business analytics requires versatile, robust, and scalable methods of building explanatory and predictive statistical models. Quantile regression meets these requirements by fitting conditional quantiles of the response with a general linear model that assumes no parametric form for the conditional distribution of the response; it gives you information that you would not obtain directly from standard regression methods. Quantile regression yields valuable insights in applications such as risk management, where answers to important questions lie in modeling the tails of the conditional distribution. Furthermore, quantile regression is capable of modeling the entire conditional distribution; this is essential for applications such as ranking the performance of students on standardized exams. This expository paper explains the concepts and benefits of quantile regression, and it introduces you to the appropriate procedures in SAS/STAT® software.
Read the paper (PDF)
Robert Rodriguez, SAS
Yonggang Yao, SAS
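A minimal example of fitting several conditional quantiles at once with PROC QUANTREG, using hypothetical exam data; the quantile-process plot shows how each coefficient varies across the distribution of the response.

   ods graphics on;
   proc quantreg data=exams ci=sparsity;
      model score = study_hours family_income
            / quantile=0.1 0.25 0.5 0.75 0.9 plot=quantplot;
   run;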
Session 0881-2017:
Fuzzy Matching and Predictive Models for Acquisition of New Customers
The acquisition of new customers is fundamental to the success of every business. Data science methods can greatly improve the effectiveness of acquiring prospective customers and can contribute to the profitability of business operations. In a business-to-business (B2B) setting, a predictive model might target business prospects as individual firms listed by, for example, Dun & Bradstreet. A typical acquisition model can be defined using a binary response with the values categorizing a firm's customers and non-customers, for which it is then necessary to identify which of the prospects are actually customers of the firm. The methods of fuzzy logic, for example, based on the distance between strings, might help in matching customers' names and addresses with the overall universe of prospects. However, two errors can occur: false positives (when the prospect is incorrectly classified as a firm's customer), and false negatives (when the prospect is incorrectly classified as a non-customer). In the current practice of building acquisition models, these errors are typically ignored. In this presentation, we assess how these errors affect the performance of the predictive model as measured by a lift. In order to improve the model's performance, we suggest using a pre-determined sample of correct matches and calibrating the model's predicted probabilities based on actual take-up rates. The presentation is illustrated with real B2B data and includes elements of the SAS® code that was used in the research.
Read the paper (PDF)
Daniel Marinescu, Concordia University
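One hedged way to implement the string-distance matching the paper alludes to is a generalized edit distance computed with COMPGED in PROC SQL; the tables, columns, and the cutoff of 200 are assumptions, and in practice the Cartesian join would be restricted by a blocking key such as postal code.

   proc sql;
      create table candidate_matches as
      select p.prospect_id, c.customer_id,
             compged(upcase(p.company_name), upcase(c.company_name)) as ged_cost
      from prospects as p, customers as c    /* block this join by ZIP or state at scale */
      where calculated ged_cost le 200
      order by p.prospect_id, ged_cost;
   quit;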
G
Session 1128-2017:
Geospatial Analysis: Linear, Nonlinear, or Both?
An important component of insurance pricing is the insured location and the associated riskiness of that location. Recently, we have experienced a large increase in the availability of external risk classification variables and associated risk factors by geospatial location. As additional geospatial data becomes available, it is prudent for insurers to take advantage of the new information to better match price to risk. Generalized additive models using penalized likelihood (GAMPL) have been explored as a way to incorporate new location-based information. This type of model can leverage the new geospatial information and incorporate it with traditional insurance rating variables in a regression-based model for rating. In our method, we propose a local regression model in conjunction with our GAMPL model. Our discussion demonstrates the use of the LOESS procedure as well as the GAMPL procedure in a combined solution. Both procedures are in SAS/STAT® software. We discuss in detail how we built a local regression model and used the predictions from this model as an offset into a generalized additive model. We compare the results of the combined approach to results of each model individually.
Read the paper (PDF)
Kelsey Osterloo, State Farm Insurance Company
Angela Wu, State Farm Insurance Company
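A hedged sketch of the two-stage idea with hypothetical insurance variables: PROC LOESS smooths the response over latitude and longitude, and the local-regression prediction is then carried into PROC GAMPL (here simply as another model term; the paper itself feeds it in as an offset).

   proc loess data=policies;
      model pure_premium = latitude longitude / smooth=0.3;
      ods output OutputStatistics=geo_fit;
   run;

   /* OutputStatistics rows line up one-to-one with the input (assuming no missing values) */
   data policies_geo;
      merge policies geo_fit(keep=pred rename=(pred=geo_pred));
   run;

   proc gampl data=policies_geo;
      class vehicle_use;
      model claim_count = param(vehicle_use driver_age geo_pred) spline(vehicle_age)
                          / dist=poisson;
   run;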
Session 1529-2017:
Getting Started with Bayesian Analytics
This presentation gives a brief introduction to Bayesian analysis within SAS®. Participants will learn the difference between Bayesian and classical statistics and be introduced to PROC MCMC.
Danny Modlin, SAS
Session 1530-2017:
Getting Started with Machine Learning
Machine learning algorithms have been available in SAS® software since 1979. This session provides practical examples of machine learning applications. The evolution of machine learning at SAS is illustrated with examples ranging from nearest-neighbor discriminant analysis in SAS/STAT® PROC DISCRIM to advanced predictive modeling in SAS® Enterprise Miner™. Machine learning techniques addressed include memory-based reasoning, decision trees, neural networks, and gradient boosting algorithms.
Terry Woodfield, SAS
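As an illustration of the earliest of these techniques, a minimal nearest-neighbor discriminant analysis with PROC DISCRIM, using hypothetical training and scoring data sets and k=5.

   proc discrim data=train testdata=prospects testout=scored
                method=npar k=5;
      class segment;
      var recency frequency monetary;
   run;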
Session 1527-2017:
Getting Started with Multilevel Modeling
In this presentation, you will learn the basics of working with nested data (such as students within classes, customers within households, or patients within clinics) through the use of multilevel models. Multilevel models can accommodate correlation among nested units through random intercepts and slopes, and they generalize easily to two, three, or more levels of nesting. These models represent a statistically efficient and powerful way to test your key hypotheses while accounting for the hierarchical nesting of the design. The GLIMMIX procedure is used to demonstrate analyses in SAS®.
Catherine Truxillo, SAS
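A minimal two-level sketch with PROC GLIMMIX, assuming hypothetical students nested within classrooms nested within schools and a binary outcome PASSED; the two RANDOM statements give each school and each classroom-within-school its own intercept.

   proc glimmix data=scores method=laplace;
      class school classroom;
      model passed(event='1') = ses hours_studied / dist=binary solution;
      random intercept / subject=school;
      random intercept / subject=classroom(school);
   run;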
H
Session SAS2009-2017:
Hands-On Workshop: Statistical Analysis using SAS® University Edition
This workshop provides hands-on experience performing statistical analysis with the Statistics tasks in SAS Studio. Workshop participants will learn to perform statistical analyses using tasks, evaluate which tasks are ideal for different kinds of analyses, edit the generated code, and customize a task.
Danny Modlin, SAS
M
Session 0764-2017:
Multi-Group Calibration in SAS®: The IRT Procedure and SAS/IML®
In item response theory (IRT), the distribution of examinees' abilities is needed to estimate item parameters. However, specifying the ability distribution is difficult, if not impossible, because examinees' abilities are latent variables. Therefore, IRT estimation programs typically assume that abilities follow a standard normal distribution. One problem with estimating item parameters in two separate computer runs is that item parameter estimates obtained from two groups that differ in ability level end up on different scales. There are several methods that can be used to place the item parameter estimates on a common scale, one of which is multi-group calibration. This method is also called concurrent calibration because all items are calibrated concurrently in a single computer run. There are two ways to implement multi-group calibration in SAS®: (1) using PROC IRT, or (2) writing an algorithm from scratch using SAS/IML®. The purpose of this study is threefold. First, the accuracy of the item parameter estimates is evaluated using a simulation study. Second, the item parameter estimates are compared to those produced by the item calibration program flexMIRT. Finally, the advantages and disadvantages of using these two approaches to conduct multi-group calibration are discussed.
View the e-poster or slides (PDF)
Kyung Yong Kim, University of Iowa
Seohee Park, University of Iowa
Jinah Choi, University of Iowa
Hongwook Seo, ACT
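A minimal sketch of the PROC IRT route to multi-group (concurrent) calibration, assuming a stacked data set of two forms identified by a hypothetical grouping variable FORM; the GROUP statement calibrates the groups in a single run so that the item parameter estimates share a common scale.

   proc irt data=forms_ab;
      var item1-item40;
      group form;
      model item1-item40 / resfunc=twop;
   run;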
N
Session 0970-2017:
Not So Simple: Intervals You Can Have Confidence In with Real Survey Data
Confidence intervals are critical to understanding your survey data. If your intervals are too narrow, you might inadvertently judge a result to be statistically significant when it is not. While many familiar SAS® procedures, such as PROC MEANS and PROC REG, provide statistical tests, they rely on the assumption that the data comes from a simple random sample. However, almost no real-world survey uses such sampling. Learn how to use the SURVEYMEANS procedure and its SURVEY cousins to estimate confidence intervals and perform significance tests that account for the structure of the underlying survey, including the replicate weights now supplied by some statistical agencies. Learn how to extract the results you need from the flood of output that these procedures deliver.
Read the paper (PDF)
David Vandenbroucke, U.S Department of Housing and Urban Development
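A hedged sketch of replicate-weight variance estimation in PROC SURVEYMEANS, assuming a hypothetical file whose full-sample weight is WEIGHT and whose 160 replicate weights are REPWGT1-REPWGT160 (the names and the BRR method are assumptions; follow the agency's documentation for your survey).

   proc surveymeans data=ahs varmethod=brr mean clm;
      weight weight;
      repweights repwgt1-repwgt160;
      var monthly_housing_cost;
   run;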
P
Session 0406-2017:
PROC LOGISTIC: Using New SAS® 9.4 Features for Cumulative Logit Models with Partial Proportional Odds
Multicategory logit models extend the techniques of logistic regression to response variables with three or more categories. For ordinal response variables, a cumulative logit model assumes that the effect of an explanatory variable is identical for all modeled logits (known as the assumption of proportional odds). Past research supports the finding that as the sample size and number of predictors increase, it is unlikely that proportional odds can be assumed across all predictors. An emerging method to effectively model this relationship uses a partial proportional odds model, fit with unique parameter estimates at each level of the modeled relationship only for the predictors in which proportionality cannot be assumed. First used in SAS/STAT® 12.1, PROC LOGISTIC in SAS® 9.4 now extends this functionality for variable selection methods in a manner in which all equal and unequal slope parameters are available for effect selection. Previously, the statistician was required to assess predictor non-proportionality a priori through likelihood tests or subjectively through graphical diagnostics. Following a review of statistical methods and limitations of other commercially available software to model data exhibiting non-proportional odds, a public-use data set is used to examine the new functionality in PROC LOGISTIC using stepwise variable selection methods. Model diagnostics and the improvement in prediction compared to a general cumulative model are noted.
Read the paper (PDF) | Download the data file (ZIP) | View the e-poster or slides (PDF)
Paul Hilliard, Educational Testing Service (ETS)
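A minimal sketch of the feature discussed, with hypothetical survey variables: the UNEQUALSLOPES option lets the stepwise selection consider both equal-slope and unequal-slope versions of each effect in the cumulative logit model.

   proc logistic data=survey;
      class region gender / param=ref;
      model satisfaction = region gender income age
            / unequalslopes selection=stepwise slentry=0.05 slstay=0.05;
   run;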
Session SAS0332-2017:
Propensity Score Methods for Causal Inference with the PSMATCH Procedure
In a randomized study, subjects are randomly assigned to either a treated group or a control group. Random assignment ensures that the distribution of the covariates is the same in both groups and that the treatment effect can be estimated by directly comparing the outcomes for the subjects in the two groups. In contrast, subjects in an observational study are not randomly assigned. In order to establish causal interpretations of the treatment effects in observational studies, special statistical approaches that adjust for the covariate confounding are required to obtain unbiased estimation of causal treatment effects. One strategy for correctly estimating the treatment effect is based on the propensity score, which is the conditional probability of the treatment assignment given the observed covariates. Prior to the analysis, you use propensity scores to adjust the data by weighting observations, stratifying subjects that have similar propensity scores, or matching treated subjects to control subjects. This paper reviews propensity score methods for causal inference and introduces the PSMATCH procedure, which is new in SAS/STAT® 14.2. The procedure provides methods of weighting, stratification, and matching. Matching methods include greedy matching, matching with replacement, and optimal matching. The procedure assesses covariate balance by comparing distributions between the adjusted treated and control groups.
Read the paper (PDF)
Yang Yuan, SAS
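A minimal PROC PSMATCH sketch with hypothetical study variables: the PSMODEL statement estimates the propensity score, MATCH performs 1:1 greedy matching on the logit of the propensity score within a caliper, ASSESS reports covariate balance, and OUTPUT writes the matched sample.

   proc psmatch data=cohort region=allobs;
      class treat gender;
      psmodel treat(treated='1') = age gender bmi comorbidity_score;
      match method=greedy(k=1) distance=lps caliper=0.25;
      assess lps var=(age bmi comorbidity_score);
      output out(obs=match)=matched matchid=match_id;
   run;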
S
Session 1005-2017:
SAS® Macros for Computing the Mediated Effect in the Pretest-Posttest Control Group Design
Mediation analysis is a statistical technique for investigating the extent to which a mediating variable transmits the relation of an independent variable to a dependent variable. Because it is useful in many fields, there have been rapid developments in statistical mediation methods. The most cutting-edge statistical mediation analysis focuses on the causal interpretation of mediated effect estimates. Cause-and-effect inferences are particularly challenging in mediation analysis because of the difficulty of randomizing subjects to levels of the mediator (MacKinnon, 2008). The focus of this paper is how incorporating longitudinal measures of the mediating and outcome variables aids in the causal interpretation of mediated effects. This paper provides useful SAS® tools for designing adequately powered studies to detect the mediated effect. Three SAS macros were developed using the powerful but easy-to-use REG, CALIS, and SURVEYSELECT procedures to do the following: (1) implement popular statistical models for estimating the mediated effect in the pretest-posttest control group design; (2) conduct a prospective power analysis for determining the required sample size for detecting the mediated effect; and (3) conduct a retrospective power analysis for studies that have already been conducted, where the sample size required to detect an observed effect is desired. We demonstrate the use of these three macros with an example.
Read the paper (PDF)
David MacKinnon, Arizona State University
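For orientation, a hedged sketch of a pretest-posttest mediation model in PROC CALIS with hypothetical variables (TREAT, mediator M_PRE/M_POST, outcome Y_PRE/Y_POST): the mediated effect is the product of the a and b paths, and EFFPART partitions the total effect of TREAT on Y_POST into direct and indirect components.

   proc calis data=trial;
      path
         m_post <- treat   = a,      /* treatment -> posttest mediator */
         y_post <- m_post  = b,      /* mediator  -> posttest outcome  */
         y_post <- treat   = cp,     /* direct effect                  */
         m_post <- m_pre,
         y_post <- y_pre;
      effpart treat -> y_post;
   run;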
Session SAS0521-2017:
Step Up Your Statistical Practice with Today's SAS/STAT® Software
Has the rapid pace of SAS/STAT® releases left you unaware of powerful enhancements that could make a difference in your work? Are you still using PROC REG rather than PROC GLMSELECT to build regression models? Do you understand how the GENMOD procedure compares with the newer GEE and HPGENSELECT procedures? Have you grasped the distinction between PROC PHREG and PROC ICPHREG? This paper will increase your awareness of modern alternatives to well-established tools in SAS/STAT by using succinct, high-level comparisons rather than detailed descriptions to explain the relative benefits of procedures and methods. The paper focuses on alternatives in the areas of regression modeling, mixed models, generalized linear models, and survival analysis. When you see the advantages of these newer tools, you will want to put them into practice. This paper points you to helpful resources for getting started.
Read the paper (PDF)
Robert Rodriguez, SAS
Phil Gibbs, SAS
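As one example of the modern alternatives, a minimal PROC GLMSELECT run that replaces a PROC REG workflow with LASSO selection tuned by five-fold cross validation (data set and variable names are hypothetical).

   ods graphics on;
   proc glmselect data=train plots=coefficients seed=1;
      class region;
      model y = x1-x20 region
            / selection=lasso(choose=cv stop=none) cvmethod=random(5);
   run;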
Session 0876-2017:
Streamline Your Workflow: Integrating SAS®, LaTeX, and R into a Single Reproducible Document
There is an industry-wide push toward making workflows seamless and reproducible. Incorporating reproducibility into the workflow has many benefits; among them are increased transparency, time savings, and accuracy. We walk through how to seamlessly integrate SAS®, LaTeX, and R into a single reproducible document. We also discuss best practices for general principles such as literate programming and version control.
Read the paper (PDF)
Lucy D'Agostino McGowan, Vanderbilt University
T
Session 0802-2017:
The Truth Is Out There: Leveraging Census Data Using PROC SURVEYLOGISTIC
The advent of robust and thorough data collection has resulted in the term big data. With Census data becoming richer, more nationally representative, and voluminous, we need methodologies that are designed to handle the manifold survey designs that Census data sets implement. The relatively nascent PROC SURVEYLOGISTIC, introduced as an experimental procedure in SAS®9 and fully supported since SAS 9.1, addresses some of these methodologies, including clusters, strata, and replicate weights. PROC SURVEYLOGISTIC handles data that is not a straightforward random sample. Using Census data sets, this paper provides examples highlighting the appropriate use of survey weights to calculate various estimates, as well as the calculation and interpretation of odds ratios for interactions between categorical variables when predicting a binary outcome.
Read the paper (PDF)
Richard Dirmyer, Rochester Institute of Technology
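A hedged sketch of a design-adjusted model in PROC SURVEYLOGISTIC with hypothetical Census-style design variables (stratum STRAT, cluster CLUSTER_ID, person weight PWGTP); odds ratios for the EDUC*SEX interaction can then be built from the reference-coded parameter estimates.

   proc surveylogistic data=acs;
      strata strat;
      cluster cluster_id;
      weight pwgtp;
      class educ sex / param=ref;
      model employed(event='1') = educ sex educ*sex age;
   run;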
U
Session 0229-2017:
Using SAS® to Estimate SE, SP, PPV, NPV, and Other Statistics of Chemical Mass Casualty Triage
Chemical incidents involving irritant chemicals such as chlorine pose a significant threat to life and require rapid assessment. Data from the 'Validating Triage for Chemical Mass Casualty Incidents: A First Step' R01 grant were used to determine the most predictive signs and symptoms (S/S) for a chlorine mass casualty incident. SAS® 9.4 was used to estimate sensitivity, specificity, positive and negative predictive values, and other statistics of irritant gas syndrome agent S/S for two existing systems designed to assist emergency responders in hazardous material incidents: the Wireless Information System for Emergency Responders (WISER) and the CHEMM Intelligent Syndrome Tool (CHEMM-IST). The results for WISER showed that the sensitivity was .72 to 1.0; the specificity, .25 to .47; and the positive predictive value and negative predictive value were .04 to .87 and .33 to 1.0, respectively. The results for CHEMM-IST showed that the sensitivity was .84 to .97; the specificity, .29 to .45; and the positive predictive value and negative predictive value were .18 to .42 and .86 to .97, respectively.
Read the paper (PDF) | View the e-poster or slides (PDF)
Abbas Tavakoli, University of South Carolina
Joan Culley, University of South Carolina
Jane Richter, University of South Carolina
Sara Donevant, University of South Carolina
Jean Craig, Medical University of South Carolina
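One hedged way to reproduce this kind of calculation in SAS is to treat sensitivity and specificity as binomial proportions within the exposed and non-exposed groups; the variable names below are assumptions.

   proc sort data=triage; by exposed; run;

   /* Within the exposed group, the proportion flagged positive estimates sensitivity;
      within the non-exposed group, the proportion flagged negative estimates specificity */
   proc freq data=triage;
      by exposed;
      tables flagged / binomial(level='1');
      exact binomial;
   run;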
Session 0869-2017:
Using a Population Average Model to Investigate the Success of a Customer Retention Strategy
In many healthcare settings, patients are like customers: they have a choice. One example is whether to participate in a procedure. In population-based screening, in which the goal is to reduce deaths, the success of a program hinges on the patient's choice to accept and comply with the procedure. As in many other industries, this relies not only on the program's ability to attract new eligible patients to attend for the first time, but also on its ability to retain existing customers. The success of a new customer retention strategy within a breast screening environment is examined by applying a population-averaged model (also known as a marginal model), which uses generalized estimating equations (GEEs) to account for the lack of independence of the observations. Arguments for why a population-averaged model was applied instead of a mixed effects model (or random effects model) are provided. This business case provides a great introductory session for people to better understand the difference between mixed effects and marginal models, and it illustrates how to implement a population-averaged model within SAS® by using the GENMOD procedure.
Read the paper (PDF)
Colleen McGahan, BC CANCER AGENCY
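A minimal sketch of the population-averaged (GEE) model in PROC GENMOD with hypothetical screening variables: the REPEATED statement declares the repeated screens within each woman and an exchangeable working correlation.

   proc genmod data=screening;
      class woman_id strategy;
      model returned(event='1') = strategy age_group prior_screens
            / dist=binomial link=logit;
      repeated subject=woman_id / type=exch corrw;
   run;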
Session 0231-2017:
Using a SAS® Macro to Calculate Kappa and 95% CI for Several Pairs of Nurses of Chemical Triage
It is often necessary to assess multi-rater agreement for multiple observation categories in case-controlled studies. The Kappa statistic is one of the most common agreement measures for categorical data. The purpose of this paper is to show an approach for using SAS® 9.4 procedures and the SAS® Macro Language to estimate Kappa with 95% CIs for pairs of nurses who used two different triage systems during a computer-simulated chemical mass casualty incident (MCI). Data from the 'Validating Triage for Chemical Mass Casualty Incidents: A First Step' R01 grant were used to assess the performance of a typical hospital triage system, the Emergency Severity Index (ESI), compared with an Irritant Gas Syndrome Agent (IGSA) triage algorithm being developed from this grant, to quickly prioritize the treatment of victims of IGSA incidents. Six different pairs of nurses used ESI triage, and seven pairs of nurses used the IGSA triage prototype to assess 25 patients exposed to an IGSA and 25 patients not exposed. Of the 13 pairs of nurses in this study, two pairs were randomly selected to illustrate the use of the SAS Macro Language for this paper. If the data were not square for two nurses, a square-form table for the observers was created using pseudo-observations. A weight of 1 for real observations and a weight of .0000000001 for pseudo-observations were assigned. Several macros were used to reduce programming. In this paper, we show only the results of one pair of nurses for ESI.
Read the paper (PDF) | View the e-poster or slides (PDF)
Abbas Tavakoli, University of South Carolina
Joan Culley, University of South Carolina
Jane Richter, University of South Carolina
Sara Donevant, University of South Carolina
Jean Craig, Medical University of South Carolina
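A hedged sketch of the core computation for one pair of raters, wrapped in a small macro as the paper suggests; RATINGS_1, NURSE_A, NURSE_B, and WT (1 for real observations, .0000000001 for the pseudo-observations that square the table) are hypothetical names.

   %macro pair_kappa(pair);
      proc freq data=ratings_&pair;
         tables nurse_a*nurse_b / agree;   /* AGREE reports Kappa with its 95% CI */
         weight wt;
         test kappa;
      run;
   %mend pair_kappa;

   %pair_kappa(1)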
Session 0854-2017:
Using the LOGISTIC or SURVEYLOGISTIC Procedure and Weighting of Public-Use Data in the Classroom
The rapidly evolving informatics capabilities of the past two decades have resulted in amazing new data-based opportunities. Large public-use data sets are now available for easy download and utilization in the classroom. Days of classroom exercises based on static, clean, easily maneuverable samples of 100 or fewer are over. Instead, we have large and messy real-world data at our fingertips, allowing for educational opportunities not available in years past. There are now hundreds of public-use data sets available for download and analysis in the classroom. Many of these sources are survey-based and require an understanding of weighting techniques. These techniques are necessary for proper variance estimation, allowing for sound inferences through statistical analysis. This example uses the California Health Interview Survey to present and compare weighted and non-weighted results using the SURVEYLOGISTIC procedure.
Read the paper (PDF)
Tyler Smith, National University
Besa Smith, Analydata
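A minimal illustration of the weighted versus unweighted contrast, with hypothetical CHIS-style variable names; the weight and replicate-weight specifications are assumptions and should be matched to the survey's documentation.

   /* Unweighted: treats the survey as if it were a simple random sample */
   proc logistic data=chis;
      class age_group / param=ref;
      model diabetes(event='1') = age_group bmi;
   run;

   /* Weighted: jackknife replicate-weight variance estimation */
   proc surveylogistic data=chis varmethod=jackknife;
      weight rakedw0;
      repweights rakedw1-rakedw80 / jkcoefs=1;
      class age_group / param=ref;
      model diabetes(event='1') = age_group bmi;
   run;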
Session 0923-2017:
Using the SYMPUT Function to Automatically Choose Reference for Bivariate Cox Proportional Models
Bivariate Cox proportional hazards models are used when we test the association between a single covariate and the outcome. The test is repeated for each covariate of interest. SAS® uses the last category as the default reference. This raises problems when we want to keep using 0 as our reference for each covariate. The reference group can be changed in the CLASS statement, but if a format is associated with a covariate, we have to use the corresponding formatted value instead of the raw numeric value. This problem becomes even worse when we have to repeat the test and manually enter the reference every single time. This presentation demonstrates one way of fixing the problem using the macro facility and the SYMPUT routine.
Read the paper (PDF) | Download the data file (ZIP)
Zhongjie Cai, University of Southern California
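A hedged sketch of the approach (not the author's exact macro), using hypothetical data set and variable names: VFORMAT looks up the format attached to the covariate, PUTN formats the value 0 with it, and SYMPUTX stores the result so the CLASS statement can use it as the reference.

   %macro bicox(var);
      /* Capture the formatted value of 0 for &var */
      data _null_;
         set analysis(obs=1);
         call symputx('refval', strip(putn(0, vformat(&var))));
      run;

      proc phreg data=analysis;
         class &var(ref="&refval") / param=ref;
         model time*event(0) = &var;
      run;
   %mend bicox;

   %bicox(smoking)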
W
Session 0785-2017:
Weight of Evidence Coding for the Cumulative Logit Model
Weight of evidence (WOE) coding of a nominal or discrete variable X is widely used when preparing predictors for usage in binary logistic regression models. The concept of WOE is extended to ordinal logistic regression for the case of the cumulative logit model. If the target (dependent) variable has L levels, then L-1 WOE variables are needed to recode X. The appropriate setting for implementing WOE coding is the cumulative logit model with partial proportional odds. As in the binary case, it is important to bin X to achieve parsimony before the WOE coding. SAS® code to perform this binning is discussed. An example is given that shows the implementation of WOE coding and binning for a cumulative logit model with the target variable having three levels.
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions
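A hedged sketch of computing the two WOE variables for a three-level target with PROC SQL, assuming a training table TRAIN with binned predictor X_BIN and ordinal target Y coded 1, 2, 3; bins with zero counts in either piece of a split would need smoothing before taking logs.

   /* One WOE variable per cumulative split: Y<=1 vs Y>1, and Y<=2 vs Y>2 */
   proc sql;
      create table woe_map as
      select x_bin,
             log( (sum(y<=1) / (select sum(y<=1) from train)) /
                  (sum(y> 1) / (select sum(y> 1) from train)) ) as woe1,
             log( (sum(y<=2) / (select sum(y<=2) from train)) /
                  (sum(y> 2) / (select sum(y> 2) from train)) ) as woe2
      from train
      group by x_bin;
   quit;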
Session 0883-2017:
What Statisticians Should Know about Machine Learning
In the last few years, machine learning and statistical learning methods have gained increasing popularity among data scientists and analysts. Statisticians have sometimes been reluctant to embrace these methodologies, partly due to a lack of familiarity, and partly due to concerns with interpretability and usability. In fact, statisticians have a lot to gain by using these modern, highly computational tools. For certain types of problems, machine learning methods can be much more accurate predictors than traditional methods for regression and classification, and some of these methods are particularly well suited for the analysis of big and wide data. Many of these methods have origins in statistics or at the boundary of statistics and computer science, and some are already well established in statistical procedures, including LASSO and elastic net for model selection and cross validation for model evaluation. In this talk, I go through some examples illustrating the application of machine learning methods and interpretation of their results, and show how these methods are similar to, and differ from, traditional statistical methodology for the same problems.
Read the paper (PDF)
D. Cutler, Utah State University