Statistician Papers A-Z

Paper 1876-2014:
A Mental Health and Risk Behavior Analysis of American Youth Using PROC FACTOR and SURVEYLOGISTIC
The current study looks at recent health trends and behavior analyses of youth in America. Data used in this analysis was provided by the Center for Disease Control and Prevention and gathered using the Youth Risk Behavior Surveillance System (YRBSS). A factor analysis was performed to identify and define latent mental health and risk behavior variables. A series of logistic regression analyses were then performed using the risk behavior and demographic variables as potential contributing factors to each of the mental health variables. Mental health variables included disordered eating and depression/suicidal ideation data, while the risk behavior variables included smoking, consumption of alcohol and drugs, violence, vehicle safety, and sexual behavior data. Implications derived from the results of this research are a primary focus of this study. Risks and benefits of using a factor analysis with logistic regression in social science research will also be discussed in depth. Results included reporting differences between the years of 1991 and 2011. All results are discussed in relation to current youth health trend issues. Data was analyzed using SAS® 9.3.
Deanna Schreiber-Gregory, North Dakota State University
Paper 1442-2014:
A Risk Score Calculator for Short-Term Morbidity Following Hip Fracture Surgery
Hip fractures are a common source of morbidity and mortality among the elderly. While multiple prior studies have identified risk factors for poor outcomes, few studies have presented a validated method for stratifying patient risk. The purpose of this study was to develop a simple risk score calculator tool predictive of 30-day morbidity after hip fracture. To achieve this, we prospectively queried a database maintained by The American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) to identify all cases of hip fracture between 2005 and 2010, based on primary Current Procedural Terminology (CPT) codes. Patient demographics, comorbidities, laboratory values, and operative characteristics were compared in a univariate analysis, and a multivariate logistic regression analysis was then used to identify independent predictors of 30-day morbidity. Weighted values were assigned to each independent risk factor and were used to create predictive models of 30-day complication risk. The models were internally validated with randomly partitioned 80%/20% cohort groups. We hypothesized that significant predictors of morbidity could be identified and used in a predictive model for a simple risk score calculator. All analyses are performed via SAS® software.
Yubo Gao, University of Iowa Hospitals and Clinics
Paper 1657-2014:
A SAS® Macro for Complex Sample Data Analysis Using Generalized Linear Models
This paper shows users how they can use a SAS® macro named %SURVEYGLM to incorporate information about survey design to Generalized Linear Models (GLM). The R function %svyglm (Lumley, 2004) was used to verify the suitability of the %SURVEYGLM macro estimates. The results show that estimates are closer than the R function and that new distributions can be easily added to the algorithm.
Paulo Dourado, University of Brasilia
Alan Silva, Universidade de Brasilia
Paper 1461-2014:
A SAS® Macro to Diagnose Influential Subjects in Longitudinal Studies
Influence analysis in statistical modeling looks for observations that unduly influence the fitted model. Cook s distance is a standard tool for influence analysis in regression. It works by measuring the difference in the fitted parameters as individual observations are deleted. You can apply the same idea to examining influence of groups of observations for example, the multiple observations for subjects in longitudinal or clustered data but you need to adapt it to the fact that different subjects can have different numbers of observations. Such an adaptation is discussed by Zhu, Ibrahim, and Cho (2012), who generalize the subject size factor as the so-called degree of perturbation, and correspondingly generalize Cook s distances as the scaled Cook s distance. This paper presents the %SCDMixed SAS® macro, which implements these ideas for analyzing influence in mixed models for longitudinal or clustered data. The macro calculates the degree of perturbation and scaled Cook s distance measures of Zhu et al. (2012) and presents the results with useful tabular and graphical summaries. The underlying theory is discussed, as well as some of the programming tricks useful for computing these influence measures efficiently. The macro is demonstrated using both simulated and real data to show how you can interpret its results for analyzing influence in your longitudinal modeling.
Grant Schneider, The Ohio State University
Randy Tobias, SAS Institute
Paper SAS400-2014:
An Introduction to Bayesian Analysis with SAS/STAT® Software
The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS® has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is specifically designed for complex Bayesian modeling (not discussed here). This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then describes the Bayesian capabilities provided in four procedures(the GENMOD, PHREG, FMM, and LIFEREG procedures) including the available prior distributions, posterior summary statistics, and convergence diagnostics. Various sampling methods that are used to sample from the posterior distributions are also discussed. The second part of the paper describes how to use the GENMOD and PHREG procedures to perform Bayesian analyses for real-world examples and how to take advantage of the Bayesian framework to address scientific questions.
Fang Chen, SAS
Funda Gunes, SAS
Maura Stokes, SAS
Paper 1842-2014:
An Investigation of the Kolmogorov-Smirnov Nonparametric Test Using SAS®
The Kolmogorov-Smirnov (K-S) test is one of the most useful and general nonparametric methods for comparing two samples. It is sensitive to all types of differences between two populations (shift, scale, shape, and so on). In this paper, we will present a thorough investigation into the K-S test including, derivation of the formal test procedure, practical demonstration of the test, large sample approximation of the test, and ease of use in SAS® using the NPAR1WAY procedure.
Tison Bolen, Cardinal Health
Lisa Conley, Cardinal Health
Jason Greenfield, Cardinal Health
Dawit Mulugeta, Cardinal Health
Paper 1278-2014:
Analysis of Data with Overdispersion Using SAS®
Overdispersion (extra variation) arises in binomial, multinomial, or count data when variances are larger than those allowed by the binomial, multinomial, or Poisson model. This phenomenon is caused by clustering of the data, lack of independence, or both. As pointed out by McCullagh and Nelder (1989), Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception. Several approaches are found for handling overdispersed data, namely quasi-likelihood and likelihood models, generalized estimating equations, and generalized linear mixed models. Some classical likelihood models are presented. Among them are the beta-binomial, binomial cluster (a.k.a. random clumped binomial), negative-binomial, zero-inflated Poisson, zero-inflated negative-binomial, hurdle Poisson, and the hurdle negative-binomial. We focus on how these approaches or models can be implemented in a practical way using, when appropriate, the procedures GLIMMIX, GENMOD, FMM, COUNTREG, NLMIXED, and SURVEYLOGISTIC. Some real data set examples are discussed in order to illustrate these applications. We also provide some guidance on how to analyze generalized linear overdispersion mixed models and possible scenarios where we might encounter them.
Jorge Morel, Procter and Gamble
Paper 1260-2014:
Analyzing Data from Experiments in Which the Treatment Groups Have Different Hierarchical Structures
In randomized experiments, it is generally assumed that the hierarchical structures and variances are the same in the treatment and control groups. In some situations, however, these structures and variance components can differ. Consider a randomized experiment in which individuals randomized to the treatment condition are further assigned to clusters in which the intervention is administered, but no such clustering occurs in the control condition. Such a structure can occur, for example, when the individuals in the treatment condition are randomly assigned to group therapy sessions or to mathematics tutoring groups; individuals in the control condition do not receive group therapy or mathematics tutoring and therefore do not have that level of clustering. In this example, individuals in the treatment condition have a hierarchical structure, but individuals in the control condition do not. If the therapists or tutors differ in efficacy, the clustering in the treatment condition induces an extra source of variability in the data that needs to be accounted for in the analysis. We show how special features of SAS® PROC MIXED and PROC GLIMMIX can be used to analyze data in which one or more treatment groups have a hierarchical structure that differs from that in the control group. We also discuss how to code variables in order to increase the computational efficiency for estimating parameters from these designs.
Sharon Lohr, Westat
Peter Schochet, Mathematica Policy Research
Paper SAS279-2014:
Analyzing Interval-Censored Data with the ICLIFETEST Procedure
SAS/STAT® 13.1 includes the new ICLIFETEST procedure, which is specifically designed for analyzing interval-censored data. This type of data is frequently found in studies where the event time of interest is known to have occurred not at a specific time but only within a certain time period. PROC ICLIFETEST performs nonparametric survival analysis of interval-censored data and is a counterpart to PROC LIFETEST, which handles right-censored data. With similar syntax, you use PROC ICLIFETEST to estimate the survival function and to compare the survival functions of different populations. This paper introduces you to the ICLIFETEST procedure and presents examples that illustrate how you can use it to perform analyses of interval-censored data.
Changbin Guo, SAS
Gordon Johnston, SAS
Ying So, SAS
Paper SAS026-2014:
Analyzing Multilevel Models with the GLIMMIX Procedure
Hierarchical data are common in many fields, from pharmaceuticals to agriculture to sociology. As data sizes and sources grow, information is likely to be observed on nested units at multiple levels, calling for the multilevel modeling approach. This paper describes how to use the GLIMMIX procedure in SAS/STAT® to analyze hierarchical data that have a wide variety of distributions. Examples are included to illustrate the flexibility that PROC GLIMMIX offers for modeling within-unit correlation, disentangling explanatory variables at different levels, and handling unbalanced data. Also discussed are enhanced weighting options, new in SAS/STAT 13.1, for both the MODEL and RANDOM statements. These weighting options enable PROC GLIMMIX to handle weights at different levels. PROC GLIMMIX uses a pseudolikelihood approach to estimate parameters, and it computes robust standard error estimators. This new feature is applied to an example of complex survey data that are collected from multistage sampling and have unequal sampling probabilities.
Min Zhu, SAS
Paper SAS019-2014:
Case-Level Residual Analysis in the CALIS Procedure
This paper demonstrates the new case-level residuals in the CALIS procedure and how they differ from classic residuals in structural equation modeling (SEM). Residual analysis has a long history in statistical modeling for finding unusual observations in the sample data. However, in SEM, case-level residuals are considerably more difficult to define because of 1) latent variables in the analysis and 2) the multivariate nature of these models. Historically, residual analysis in SEM has been confined to residuals obtained as the difference between the sample and model-implied covariance matrices. Enhancements to the CALIS procedure in SAS/STAT® 12.1 enable users to obtain case-level residuals as well. This enables a more complete residual and influence analysis. Several examples showing mean/covariance residuals and case-level residuals are presented.
Catherine Truxillo, SAS
Paper 1661-2014:
Challenges of Processing Questionnaire Data from Collection to SDTM to ADaM and Solutions Using SAS®
Often in a clinical trial, measures are needed to describe pain, discomfort, or physical constraints that are visible but not measurable through lab tests or other vital signs. In these cases, researchers turn to questionnaires to provide documentation of improvement or statistically meaningful change in support safety and efficacy hypotheses. For example, in studies (like Parkinson s studies) where pain or depression are serious non-motor symptoms of the disease, these questionnaires provide primary endpoints for analysis. Questionnaire data presents unique challenges in both collection and analysis in the world of CDISC standards. The questions are usually aggregated into scale scores, as the underlying questions by themselves provide little additional usefulness. SAS® is a powerful tool for extraction of the raw data from the collection databases and transposition of columns into a basic data structure in SDTM, which is vertical. The data is then processed further as per the instructions in the Statistical Analysis Plan (SAP). This involves translation of the originally collected values into sums, and the values of some questions need to be reversed. Missing values can be computed as means of the remaining questions. These scores are then saved as new rows in the ADaM (analysis-ready) data sets. This paper describes the types of questionnaires, how data collection takes place, the basic CDISC rules for storing raw data in SDTM, and how to create analysis data sets with derived records using ADaM standards, while maintaining traceability to the original question.
Karin LaPann, PRA International
Terek Peterson
Paper 1798-2014:
Comparison of Five Analytic Techniques for Two-Group, Pre-Post Repeated Measures Designs Using SAS®
There has been debate regarding which method to use to analyze repeated measures continuous data when the design includes only two measurement times. Five different techniques can be applied and give similar results when there is little to no correlation between pre- and post-test measurements and when data at each time point are complete: 1) analysis of variance on the difference between pre- and post-test, 2) analysis of covariance on the differences between pre- and post-test controlling for pre-test, 3) analysis of covariance on post-test controlling for pre-test, 4) multiple analysis of variance on post- test and pre-test, and 5) repeated measures analysis of variance. However, when there is missing data or if a moderate to high correlation between pre- and post-test measures exists under an intent-to-treat analysis framework, bias is introduced in the tests for the ANOVA, ANCOVA, and MANOVA techniques. A comparison of Type III sum of squares, F-tests, and p-values for a complete case and an intent-to-treat analysis are presented. The analysis using a complete case data set shows that all five methods produce similar results except for the repeated measures ANOVA due to a moderate correlation between pre- and post-test measures. However, significant bias is introduced for the tests using the intent-to-treat data set.
J. Madison Hyer, Georgia Regents University
Jennifer Waller, Georgia Regents University
Paper SAS403-2014:
Consumer Research Tools
The big questions in consumer research lead to statistical methods appropriate to them. 'What do consumers say?' is all about analyzing surveys and finding relationships between preferences and background attributes. 'What do consumers think? is about looking at higher-level structures like preference mappings that can be derived from ratings. 'What will consumers pay?' is about conducting choice experiments to pin down the way consumers trade off among features and with prices, with the willingness to pay. 'How do you trigger purchases?' is about experiments that determine which interventions work, and how to target them to potential consumers, with uplift modeling. The SAS product JMP® version 11 was released last fall with a new group of modeling tools to address these and other questions in consumer research. Traditionally JMP has specialized in engineering tools, but consumer research is an important part of engineering, in product planning, to make sure you produce the products with the attributes consumers want.
John Sall, SAS
Paper 1285-2014:
Customer Profiling for Marketing Strategies in a Healthcare Environment
In this new era of healthcare reform, health insurance companies have heightened their efforts to pinpoint who their customers are, what their characteristics are, what they look like today, and how this impacts business in today s and tomorrow s healthcare environment. The passing of the Healthcare Reform policies led insurance companies to focus and prioritize their projects on understanding who the members in their current population were. The goal was to provide an integrated single view of the customer that could be used for retention, increased market share, balancing population risk, improving customer relations, and providing programs to meet the members' needs. By understanding the customer, a marketing strategy could be built for each customer segment classification, as predefined by specific attributes. This paper describes how SAS® was used to perform the analytics that were used to characterize their insured population. The high-level discussion of the project includes regression modeling, customer segmentation, variable selection, and propensity scoring using claims, enrollment, and third-party psychographic data.
MaryAnne DePesquo, BlueCross BlueShield of Arizona
Paper 1603-2014:
Data Coarsening and Data Swapping Algorithms
With increased concern about privacy and simultaneous pressure to make survey data available, statistical disclosure control (SDC) treatments are performed on survey microdata to reduce disclosure risk prior to dissemination to the public. This situation is all the more problematic in the push to provide data online for immediate user query. Two SDC approaches are data coarsening, which reduces the information collected, and data swapping, which is used to adjust data values. Data coarsening includes recodes, top-codes and variable suppression. Challenges related to creating a SAS® macro for data coarsening include providing flexibility for conducting different coarsening approaches, and keeping track of the changes to the data so that variable and value labels can be assigned correctly. Data swapping includes selecting target records for swapping, finding swapping partners, and swapping data values for the target variables. With the goal of minimizing the impact on resulting estimates, challenges for data swapping are to find swapping partners that are close matches in terms of both unordered categorical and ordered categorical variables. Such swapping partners ensure that enough change is made to the target variables, that data consistency between variables is retained, and that the pool of potential swapping partners is controlled. An example is presented using each algorithm.
Sixia Chen, Westat
Mamadou Diallo, Westat
Amita Gopinath, Westat
Katie Hubbell, Westat
Tom Krenzke, Westat
Paper 1872-2014:
Efficiency Estimation Using a Hybrid of Data Envelopment Analysis and Linear Regression
Literature suggests two main approaches, parametric and non-parametric, for constructing efficiency frontiers on which efficiency scores of other units can be based. Parametric functions can be either deterministic or stochastic in nature. However, when multiple inputs and outputs are encountered, Data Envelopment Analysis (DEA), a non-parametric approach, is a powerful tool used for decades in measurement of productivity/efficiency with a wide range of applications. Both approaches have advantages and limitations. This paper attempts to further explore and validate a hybrid approach, taking the best of both the DEA and the parametric approach, in order to estimate efficiency of Decision Making Units (DMUs) in an even better way.
John Dilip Raj, GE
Paper 1594-2014:
Generating Dynamic Tables Using PROC SQL and PROC TABULATE
PROC TABULATE is the most widely used reporting tool in SAS®, along with PROC REPORT. Any kind of report with the desired statistics can be produced by PROC TABULATE. When we need to report some summary statistics like mean, median, and range in the heading, either we have to edit it outside SAS in word processing software or enter it manually. In this paper, we discuss how we can automate this to be dynamic by using PROC SQL and some simple macros.
Lovedeep Gondara, BC Cancer Agency
Paper SAS2203-2014:
Getting Started with Mixed Models
This introductory presentation is intended for an audience new to mixed models who wants to get an overview of this useful class of models. Learn about mixed models as an extension of ordinary regression models, and see several examples of mixed models in social, agricultural, and pharmaceutical research.
Catherine Truxillo, SAS
Paper SAS2206-2014:
Getting Started with Poisson Regression Modeling
When the dependent variable is a count, Poisson regression is a natural choice of distribution for fitting a regression model. This presentation is intended for an audience experienced in linear regression modeling, but new to Poisson regression modeling. Learn the basics of this useful distribution and see some examples where it is appropriate. Tips for identifying problems with fitting a Poisson regression model and some helpful alternatives are provided.
Chris Daman, SAS
Marc Huber
Paper SAS2205-2014:
Getting Started with Survey Procedures
Analyzing data from a complex probability survey involves weighting observations so that inferences are correct. This introductory presentation is intended for an audience new to analyzing survey data. Learn the essentials of using the SURVEYxx procedures in SAS/STAT®.
Chris Daman, SAS
Bob Lucas, SAS
Paper SAS2221-2014:
Getting Started with the SAS/IML® Language
Do you need a statistic that is not computed by any SAS® procedure? Reach for the SAS/IML® language! Many statistics are naturally expressed in terms of matrices and vectors. For these, you need a matrix-vector language. This hands-on workshop introduces the SAS/IML language to experienced SAS programmers. The workshop focuses on statements that create and manipulate matrices, read and write data sets, and control the program flow. You will learn how to write user-defined functions, interact with other SAS procedures, and recognize efficient programming techniques. Programs are written using the SAS/IML® Studio development environment. This course covers Chapters 2 4 of Statistical Programming with SAS/IML Software (Wicklin, 2010).
Rick Wicklin, SAS
Paper 1275-2014:
Got Randomness? SAS® for Mixed and Generalized Linear Mixed Models
It is not uncommon to find models with random components like location, clinic, teacher, etc., not just the single error term we think of in ordinary regression. This paper uses several examples to illustrate the underlying ideas. In addition, the response variable might be Poisson or binary rather than normal, thus taking us into the realm of generalized linear mixed models, These too will be illustrated with examples.
David Dickey, NC State University
Paper 1658-2014:
Healthcare Services Data Distribution, Transformation, and Model Fitting
Healthcare services data on products and services come in different shapes and forms. Data cleaning, characterization, massaging, and transformation are essential precursors to any statistical model-building efforts. In addition, data size, quality, and distribution influence model selection, model life cycle, and the ease with which business insights are extracted from data. Analysts need to examine data characteristics and determine the right data transformation and methods of analysis for valid interpretation of results. In this presentation, we demonstrate the common data distribution types for a typical healthcare services industry such as Cardinal Health and their salient features. In addition, we use Base SAS® and SAS/STAT® for data transformation of both the response (Y) and the explanatory (X) variables in four combinations [RR (Y and X as row data), TR (only Y transformed), RT (only X transformed), and TT (Y and X transformed)] and the practical significance of interpreting linear, logistic, and completely randomized design model results using the original and the transformed data values for decision-making processes. The reality of dealing with diverse forms of data, the ramification of data transformation, and the challenge of interpreting model results of transformed data are discussed. Our analysis showed that the magnitude of data variability is an overriding factor to the success of data transformation and the subsequent tasks of model building and interpretation of model parameters. Although data transformation provided some benefits, it complicated analysis and subsequent interpretation of model results.
Tison Bolen, Cardinal Health
Lisa Conley, Cardinal Health
Jason Greenfield, Cardinal Health
Dawit Mulugeta, Cardinal Health
Paper SAS369-2014:
I Want the Latest and Greatest! The Top Five Things You Need to Know about Migration
Determining what, when, and how to migrate SAS® software from one major version to the next is a common challenge. SAS provides documentation and tools to help make the assessment, planning, and eventual deployment go smoothly. We describe some of the keys to making your migration a success, including the effective use of the SAS® Migration Utility, both in the analysis mode and the execution mode. This utility is responsible for analyzing each machine in an existing environment, surfacing product-specific migration information, and creating packages to migrate existing product configurations to later versions. We show how it can be used to simplify each step of the migration process, including recent enhancements to flag product version compatibility and incompatibility.
Josh Hames, SAS
Gerry Nelson, SAS
Paper SAS276-2014:
Introducing SAS® Decision Management
Organizations today make numerous decisions within their businesses that affect almost every aspect of their daily operations. Many of these decisions are now automatically generated by sophisticated enterprise decision management systems. These decisions include what offers to make to customers, sales transaction processing, payment processing, call center interactions, industrial maintenance, transportation scheduling, and thousands of other applications that all have a significant impact on the business bottom line. Concurrently, many of these same companies have developed or are now developing analytics that provide valuable insight into their customers, their products, and their markets. Unfortunately, many of the decision systems cannot maximize the power of analytics in the business processes at the point where the decisions are made. SAS® Decision Manager is a new product that integrates analytical models with business rules and deploys them to operational systems where the decisions are made. Analytically driven decisions can be monitored, assessed, and improved over time. This paper describes the new product and its use and shows how models and business rules can be joined into a decision process and deployed to either batch processes or to real-time web processes that can be consumed by business applications.
Charlotte Crain, SAS
David Duling, SAS
Steve Sparano, SAS
Paper 1492-2014:
Introduction to Frailty Models
This session introduces frailty models and their use in biostatistics to model time-to-event or survival data. The session uses examples to review situations in which a frailty model is a reasonable modeling option, to describe which SAS® procedures can be used to fit frailty models, and to discuss the advantages and disadvantages of frailty models compared to other modeling options.
John Amrhein, McDougall Scientific Ltd.
Paper SAS384-2014:
Is Nonlinear Regression Throwing You a Curve? New Diagnostic and Inference Tools in the NLIN Procedure
The NLIN procedure fits a wide variety of nonlinear models. However, some models can be so nonlinear that standard statistical methods of inference are not trustworthy. That s when you need the diagnostic and inferential features that were added to PROC NLIN in SAS/STAT® 9.3, 12.1, and 13.1. This paper presents these features and explains how to use them. Examples demonstrate how to use parameter profiling and confidence curves to identify the nonlinearcharacteristics of the model parameters. They also demonstrate how to use the bootstrap method to study the sampling distribution of parameter estimates and to make more accurate statistical inferences. This paper highlights how measures of nonlinearity help you diagnose models and decide on potential reparameterization. It also highlights how multithreading is used to tame the large number of nonlinear optimizations that are required for these features.
Biruk Gebremariam, SAS
Paper SAS060-2014:
Making Comparisons Fair: How LS-Means Unify the Analysis of Linear Models
How do you compare group responses when the data are unbalanced or when covariates come into play? Simple averages will not do, but LS-means are just the ticket. Central to postfitting analysis in SAS/STAT® linear modeling procedures, LS-means generalize the simple average for unbalanced data and complicated models. They play a key role both in standard treatment comparisons and Type III tests and in newer techniques such as sliced interaction effects and diffograms. This paper reviews the definition of LS-means, focusing on their interpretation as predicted population marginal means, and it illustrates their broad range of use with numerous examples.
Weijie Cai, SAS
Paper 1739-2014:
Medical Scoring for Breast Cancer Recurrence
Breast cancer is the most common cancer among females globally. After being diagnosed and treated for breast cancer, patients fear the recurrence of breast cancer. Breast cancer recurrence (BCR) can be defined as the return of breast cancer after primary treatment, and it can recur within the first three to five years. BCR studies have been conducted mostly in developed countries such as the United States, Japan, and Canada. Thus, the primary aim of this study is to investigate the feasibility of building a medical scorecard to assess the risk of BCR among Malaysian women. The medical scorecard was developed using data from 454 out of 1,149 patients who were diagnosed and underwent treatment at the Department of Surgery, Hospital Kuala Lumpur from 2006 until 2011. The outcome variable is a binary variable with two values: 1 (recurrence) and 0 (remission). Based on the availability of data, only 13 categorical predictors were identified and used in this study. The predictive performance of the Breast Cancer Recurrence scorecard (BCR scorecard) model was compared to the standard logistic regression (LR) model. Both the BCR scorecard and LR model were developed using SAS® Enterprise Miner 7.1. From this exploratory study, although the BCR scorecard model has better predictive ability with a lower misclassification rate (18%) compared to the logistic regression model (23%), the sensitivity of the BCR scorecard model is still low, possibly due to the small sample size and small number of risk factors. Five important risk factors were identified: histological type, race, stage, tumor size, and vascular invasion in predicting recurrence status.
Nor Aina Emran, Hospital Kuala Lumpur
Nurul Husna Jamian, Universiti Teknologi Mara (UiTM)
Yap Bee Wah, Universiti Teknologi Mara
Paper 1467-2014:
Missing Data: Overview, Likelihood, Weighted Estimating Equations, and Multiple Imputation
In applied statistical practice, incomplete measurement sequences are the rule rather than the exception. Fortunately, in a large variety of settings, the stochastic mechanism governing the incompleteness can be ignored without hampering inferences about the measurement process. While ignorability only requires the relatively general missing at random assumption for likelihood and Bayesian inferences, this result cannot be invoked when non-likelihood methods are used. We will first sketch the framework used for contemporary missing-data analysis. Apart from revisiting some of the simpler but problematic methods, attention will be paid to direct likelihood and multiple imputation. Because popular non-likelihood-based methods do not enjoy the ignorability property in the same circumstances as likelihood and Bayesian inferences, weighted versions have been proposed. This holds true in particular for generalized estimating equations (GEE). Even so-called doubly-robust versions have been derived. Apart from GEE, also pseudo-likelihood based strategies can be adapted appropriately. We describe a suite of corrections to the standard form of pseudo-likelihood, to ensure its validity under missingness at random. Our corrections follow both single and double robustness ideas, and is relatively simple to apply.
Geert Molenberghs, Universiteit Hasselt & KU Leuven
Paper 1300-2014:
Model Variable Selection Using Bootstrap Decision Tree
Bootstrapped Decision Tree is a variable selection method used to identify and eliminate unintelligent variables from a large number of initial candidate variables. Candidates for subsequent modeling are identified by selecting variables consistently appearing at the top of decision trees created using a random sample of all possible modeling variables. The technique is best used to reduce hundreds of potential fields to a short list of 30 50 fields to be used in developing a model. This method for variable selection has recently become available in JMP® under the name BootstrapForest; this paper presents an implementation in Base SAS®9. The method does accept but does not require a specific outcome to be modeled and will therefore work for nearly any type of model, including segmentation, MCMC, multiple discrete choice, in addition to standard logistic regression. Keywords: Bootstrapped Decision Tree, Variable Selection
David Corliss, Magnify Analytic Solutions
Paper 1304-2014:
Modeling Fractional Outcomes with SAS®
For most practitioners, ordinary least square (OLS) regression with a Gaussian distributional assumption might be the top choice for modeling fractional outcomes in many business problems. However, it is conceptually flawed to assume a Gaussian distribution for a response variable in the [0, 1] interval. In this paper, several modeling methodologies for fractional outcomes with their implementations in SAS® are discussed through a data analysis exercise in predicting corporate financial leverage ratios. Various empirical and conceptual methods for the model evaluation and comparison are also discussed throughout the example. This paper provides a comprehensive survey about how to model fractional outcomes.
Jason Xin, SAS
wensui liu, Fifth Third Bancorp
Paper 1528-2014:
Multivariate Ratio and Regression Estimators
This paper considers the %MRE macro for estimating multivariate ratio estimates. Also, we use PROC REG to estimate multivariate regression estimates and to show that regression estimates are superior to the ratio estimates.
Alan Silva, Universidade de Brasilia
Paper 1295-2014:
PD_Calibrate Macro
PD_Calibrate is a macro that standardizes the calibration of our predictive credit-scoring models at Nykredit. The macro is activated with an input data set, variables, anchor point, specification of method, number of buckets, kink-value, and so on. The output consists of graphs, HTML, and two data sets containing key values for the model being calibrated and values for the use of graphics.
Keld Asnæs, Nykredit a/s
Jesper Michelsen, Nykredit
Paper 1902-2014:
Plotting Differences Among LS-means in Generalized Linear Models
The effectiveness of visual interpretation of the differences between pairs of LS-means in a generalized linear model includes the graph's ability to display four inferential and two perceptual tasks. Among the types of graphs which display some or all of these tasks are the forest plot, the mean-mean scatter plot (diffogram), and closely related to it, the mean-mean multiple comparison (MMC) plot. These graphs provide essential visual perspectives for interpretation of the differences among pairs of LS-means from a generalized linear model (GLM). The diffogram is a graphical option now available through ODS statistical graphics with linear model procedures such as GLIMMIX. Through combining ODS output files of the LS-means and their differences, the SGPLOT procedure can efficiently produce forest and MMC plots.
Robin High, University of Nebraska Medical Center
Paper SAS030-2014:
Power and Sample Size for MANOVA and Repeated Measures with the GLMPOWER Procedure
Power analysis helps you plan a study that has a controlled probability of detecting a meaningful effect, giving you conclusive results with maximum efficiency. SAS/STAT® provides two procedures for performing sample size and power computations: the POWER procedure provides analyses for a wide variety of different statistical tests, and the GLMPOWER procedure focuses on power analysis for general linear models. In SAS/STAT 13.1, the GLMPOWER procedure has been updated to enable power analysis for multivariate linear models and repeated measures studies. Much of the syntax is similar to the syntax of the GLM procedure, including both the new MANOVA and REPEATED statements and the existing MODEL and CONTRAST statements. In addition, PROC GLMPOWER offers flexible yet parsimonious options for specifying the covariance. One such option is the two-parameter linear exponent autoregressive (LEAR) correlation structure, which includes other common structures such as AR(1), compound symmetry, and first-order moving average as special cases. This paper reviews the new repeated measures features of PROC GLMPOWER, demonstrates their use in several examples, and discusses the pros and cons of the MANOVA and repeated measures approaches.
John Castelloe, SAS
Paper SAS156-2014:
Putting on the Ritz: New Ways to Style Your ODS Graphics to the Max
Do you find it difficult to dress up your graphs for your reports or presentations? SAS® 9.4 introduced new capabilities in ODS Graphics that give you the ability to style your graphs without creating or modifying ODS styles. Some of the new capabilities include the following: a new option for controling how ODS styles are applied graph syntax for overriding ODS style attributes for grouped plots the ability to define font glyphs and images as plot markers enhanced attribute map support In this presentation, we discuss these new features in detail, showing examples in the context of Graph Template Language and ODS Graphics procedures.
Dan Heath, SAS
Paper 1886-2014:
Recommending News Articles Using Cosine Similarity Function
Predicting news articles that customers are likely to view/read next provides a distinct advantage to news sites. Collaborative filtering is a widely used technique for the same. This paper details an approach within collaborative filtering that uses the cosine similarity function to achieve this purpose. The paper further details two different approaches, customized targeting and article level targeting, that can be used in marketing campaigns. Please note that this presentation connects with Session ID 1887. Session ID 1887 happens immediately following this session
John Dilip Raj, GE
Ledalla Venkata Naga Rajendra, GE
Qing Wang, Warwick Business School
Paper 1887-2014:
Recommending TV Programs Using Correlation
Personalized recommender systems are being used in many industries to increase customer engagement. In the TV industry, this is primarily used to increase viewership, which in turn increases market share, revenue, and profit. This paper attempts to develop a recommender system using the correlation procedure under collaborative filtering methodology. The only data requirement for this recommendation system would be past viewership of customers for a given time period. Please note that this session connects with Session ID 1886. Session ID 1886 happens immediately prior to this session
John Dilip Raj, GE
Ledalla Venkata Naga Rajendra, GE
Qing Wang, Warwick Business School
Paper 1502-2014:
Regression Analysis of Duration and Severity Data: New Capabilities with SAS® Software
Duration and severity data arise in several fields including biostatistics, demography, economics, engineering, and sociology. SAS® procedures LIFETEST, LIFEREG. and PHREG are the workhorses for analysis of time to event data in applications in biostatistics. Similar methods apply to the magnitude or severity of a random event, where the outcome might be right, left, or interval censored and/or, right or left truncated. All combinations of types of censoring and truncation could be present in the data set. Regression models such as the accelerated failure time model, the Cox model, and the non-homogeneous Poisson model have extensions to address time-varying covariates in the analysis of clustered outcomes, multivariate outcomes of mixed types, and recurrent events. We present an overview of new capabilities that are available in the procedures QLIM, QUANTLIFE, RELIABILITY, and SEVERITY with examples illustrating their application using empirical data sets drawn from easily accessible sources.
Joseph Gardiner, Michigan State University
Paper 1650-2014:
Risk Factors and Outcome of Spinal Epidural Abscess from Incident Hemodialysis Patients from the United States Renal Data System between 2005 and 2008
Spinal epidural abscess (SEA) is a serious complication in hemodialysis (HD) patients, yet there is little medical literature that discusses it. This analysis identified risk factors and co-morbidities associated with SEA, as well as risk factors for mortality following the diagnosis. All incident HD cases from the United States Renal Data System for calendar years 2005 2008 were queried for a diagnosis of SEA. Potential clinical covariates, survival, and risk factors were recovered using ICD-9 diagnosis codes. Log-binomial regressions were performed using PROC GENMOD to assess the relative risks, and Cox regression models were run using PROC PHREG to estimate hazard ratios for mortality. For the 4-year study period, 660/355084 (0.19%) HD patients were identified with SEA, the largest cohort to date. Older age (RR=1.625), infectious comorbidities including bacteremia (RR=7.7976), methicillin-resistant Staphylococcus aureus infection (RR=2.6507), hepatitis C (RR=1.545), and non-infectious factors including diabetes (RR=1.514) and presence of vascular catheters (RR=1.348) were identified as significant risk factors for SEA. SEA in HD patients was associated with an increased risk of death (HR=1.20). Older age (HR=2.269), the presence of dialysis catheters (HR=1.884), cirrhosis (HR=1.715), decubitus ulcers (HR=1.669), bacteremia (HR=1.407), and total parenteral nutrition (HR=1.376) constitute the greatest risk factors for death after SEA diagnosis and thus necessitate a comprehensive approach to management.
Stephanie Baer, Georgia Regents University and Augusta VAMC
Rhonda Colombo, Georgia Regents University
Lu Huber, Georgia Regents University
Chan Jin, Georgia Regents University
N. Stanley Nahman Jr., Georgia Regents University and Augusta VAMC
Jennifer White, Georgia Regents University
Paper SAS136-2014:
Risk-Based Monitoring of Clinical Trials Using JMP® Clinical
Guidelines from the International Conference on Harmonisation (ICH) suggest that clinical trial data should be actively monitored to ensure data quality. Traditional interpretation of this guidance has often led to 100 percent source data verification (SDV) of respective case report forms through on-site monitoring. Such monitoring activities can also identify deficiencies in site training and uncover fraudulent behavior. However, such extensive on-site review is time-consuming, expensive and, as is true for any manual effort, limited in scope and prone to error. In contrast, risk-based monitoring makes use of central computerized review of clinical trial data and site metrics to determine whether sites should receive more extensive quality review through on-site monitoring visits. We demonstrate a risk-based monitoring solution within JMP® Clinical to assess clinical trial data quality. Further, we describe a suite of tools used for identifying potentially fraudulent data at clinical sites. Data from a clinical trial of patients who experienced an aneurysmal subarachnoid hemorrhage provide illustration.
Richard Zink, SAS
Paper SAS181-2014:
SAS/STAT® 13.1 Round-Up
SAS/STAT® 13.1 brings valuable new techniques to all sectors of the audience forSAS statistical software. Updates for survival analysis include nonparametricmethods for interval censoring and models for competing risks. Multipleimputation methods are extended with the addition of sensitivity analysis.Bayesian discrete choice models offer a modern approach for consumer research.Path diagrams are a welcome addition to structural equation modeling, and itemresponse models are available for educational assessment. This paper providesoverviews and introductory examples for each of the new focus areas in SAS/STAT13.1. The paper also provides a sneak preview of the follow-up release,SAS/STAT 13.2, which brings additional strategies for missing data analysis andother important updates to statistical customers.
Bob Rodriguez, SAS
Maura Stokes, SAS
Paper SAS004-2014:
SAS® Predictive Asset Maintenance: Find Out Why Before It's Too Late!
Are you wondering what is causing your valuable machine asset to fail? What could those drivers be, and what is the likelihood of failure? Do you want to be proactive rather than reactive? Answers to these questions have arrived with SAS® Predictive Asset Maintenance. The solution provides an analytical framework to reduce the amount of unscheduled downtime and optimize maintenance cycles and costs. An all new (R&D-based) version of this offering is now available. Key aspects of this paper include: Discussing key business drivers for and capabilities of SAS Predictive Asset Maintenance. Detailed analysis of the solution, including: Data model Explorations Data selections Path I: analysis workbench maintenance analysis and stability monitoring Path II: analysis workbench JMP®, SAS® Enterprise Guide®, and SAS® Enterprise Miner Analytical case development using SAS Enterprise Miner, SAS® Model Manager, and SAS® Data Integration Studio SAS Predictive Asset Maintenance Portlet for reports A realistic business example in the oil and gas industry is used.
George Habek, SAS
Paper SAS1525-2014:
SAS® Workshop: High-Performance Analytics
This workshop provides hands-on experience using SAS® Enterprise Miner high-performance nodes. Workshop participants will do the following: learn the similarities and differences between high-performance nodes and standard nodes build a project flow using high-performance nodes extract and save a score code for model deployment
Bob Lucas, SAS
Jeff Thompson, SAS
Paper 1321-2014:
Scatterplots: Basics, Enhancements, Problems, and Solutions
The scatter plot is a basic tool for examining the relationship between two variables. While the basic plot is good, enhancements can make it better. In addition, there might be problems of overplotting. In this paper, I cover ways to create basic and enhanced scatter plots and to deal with overplotting.
Peter Flom, Peter Flom Consulting
Paper 1279-2014:
Selecting Peer Institutions with Cluster Analysis
Universities strive to be competitive in the quality of education as well as cost of attendance. Peer institutions are selected to make comparisons pertaining to academics, costs, and revenues. These comparisons lead to strategic decisions and long-range planning to meet goals. The process of finding comparable institutions could be completed with cluster analysis, a statistical technique. Cluster analysis places universities with similar characteristics into groups or clusters. A process to determine peer universities will be illustrated using PROC STANDARD, PROC FASTCLUS, and PROC CLUSTER.
Diana Suhr, University of Northern Colorado
Paper SAS270-2014:
Sensitivity Analysis in Multiple Imputation for Missing Data
Multiple imputation, a popular strategy for dealing with missing values, usually assumes that the data are missing at random (MAR). That is, for a variable X, the probability that an observation is missing depends only on the observed values of other variables, not on the unobserved values of X. It is important to examine the sensitivity of inferences to departures from the MAR assumption, because this assumption cannot be verified using the data. The pattern-mixture model approach to sensitivity analysis models the distribution of a response as the mixture of a distribution of the observed responses and a distribution of the missing responses. Missing values can then be imputed under a plausible scenario for which the missing data are missing not at random (MNAR). If this scenario leads to a conclusion different from that of inference under MAR, then the MAR assumption is questionable. This paper reviews the concepts of multiple imputation and explains how you can apply the pattern-mixture model approach in the MI procedure by using the MNAR statement, which is new in SAS/STAT® 13.1. You can specify a subset of the observations to derive the imputation model, which is used for pattern imputation based on control groups in clinical trials. You can also adjust imputed values by using specified shift and scale parameters for a set of selected observations, which are used for sensitivity analysis with a tipping-point approach.
Yang Yuan, SAS
Paper 1645-2014:
Speed Dating: Looping Through a Table Using Dates
Have you ever needed to use dates as values to loop through a table? For example, how many events occurred by 1, 2 , 3 & n months ahead? Maybe you just changed the dates manually and re-ran the query n times? This is a common need in economic and behavioral sciences. This presentation demonstrates how to create a table of dates that can be used with SAS® macro variables to loop through a table. Using this dates table in combination with the SAS DO loop ensures accuracy and saves time.
Scott Fawver, Arch Mortgage Insurance Company
Paper 1305-2014:
The Commonly Used Statistical Methods to Control The Probability of an Overall Type I Error and the Application in Clinical Trial
In a clinical study, we often set up multiple hypotheses with regard to the cost of getting study result. However, the multiplicity problem arises immediately when they are performed in a univariate manner. Some methods to control the rate of the overall type I error are applied widely, and they are discussed in this paper, except the methodology, we will introduce its application in one study case and provide the SAS® code.
Lixiang Yao, icon
Paper 1882-2014:
Using PROC MCMC for Bayesian Item Response Modeling
The new Markov chain Monte Carlo (MCMC) procedure introduced in SAS/STAT® 9.2 and further exploited in SAS/STAT® 9.3 enables Bayesian computations to run efficiently with SAS®. The MCMC procedure allows one to carry out complex statistical modeling within Bayesian frameworks under a wide spectrum of scientific research; in psychometrics, for example, the estimation of item and ability parameters is a kind. This paper describes how to use PROC MCMC for Bayesian inferences of item and ability parameters under a variety of popular item response models. This paper also covers how the results from SAS PROC MCMC are different from or similar to the results from WinBUGS. For those who are interested in the Bayesian approach to item response modeling, it is exciting and beneficial to shift to SAS, based on its flexibility of data managements and its power of data analysis. Using the resulting item parameter estimates, one can continue to test form constructions, test equatings, etc., with all these test development processes being accomplished with SAS!
Yi-Fang Wu, Department of Educational Measurement and Statistics, Iowa Testing Programs, University of Iowa
Paper SAS214-2014:
When Do You Schedule Preventative Maintenance? Multivariate Time Series in SAS/ETS®
Expensive physical capital must be regularly maintained for optimal efficiency and long-term insurance against damage. The maintenance process usually consists of constantly monitoring high-frequency sensor data and performing corrective maintenance when the expected values do not match the actual values. An economic system can also be thought of as a system that requires constant monitoring and occasional maintenance in the form of monetary or fiscal policy. This paper shows how to use the SSM procedure in SAS/ETS® to make forecasts of expected values by using high-frequency multivariate time series. The paper also demonstrates the functionality of the new SASEFRED interface engine in SAS/ETS.
Xilong Chen, SAS
Kenneth Sanford, SAS
Rajesh Selukar, SAS
back to top