SAS Global Forum 2014 Proceedings

One of the first lessons that SAS^® programmers learn on the job is that numeric and character variables do not play well together, and that type mismatches are one of the more common source of errors in their otherwise flawless SAS programs. Luckily, converting variables from one type to another in SAS (that is, casting) is not difficult, requiring only the judicious use of either the input() or put() function. There remains, however, the danger of data being lost in the conversion process. This type of error is most likely to occur in cases of character-to-numeric variable conversion, most especially when the user does not fully understand the data contained in the data set. This paper will review the basics of data storage for character and numeric variables in SAS, the use of formats and informats for conversions, and how to ensure accurate type conversion of even high-precision numeric values.

Hip fractures are a common source of morbidity and mortality among the elderly. While multiple prior studies have identified risk factors for poor outcomes, few studies have presented a validated method for stratifying patient risk. The purpose of this study was to develop a simple risk score calculator tool predictive of 30-day morbidity after hip fracture. To achieve this, we prospectively queried a database maintained by The American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) to identify all cases of hip fracture between 2005 and 2010, based on primary Current Procedural Terminology (CPT) codes. Patient demographics, comorbidities, laboratory values, and operative characteristics were compared in a univariate analysis, and a multivariate logistic regression analysis was then used to identify independent predictors of 30-day morbidity. Weighted values were assigned to each independent risk factor and were used to create predictive models of 30-day complication risk. The models were internally validated with randomly partitioned 80%/20% cohort groups. We hypothesized that significant predictors of morbidity could be identified and used in a predictive model for a simple risk score calculator. All analyses are performed via SAS^® software.

Influence analysis in statistical modeling looks for observations that unduly influence the fitted model. Cook s distance is a standard tool for influence analysis in regression. It works by measuring the difference in the fitted parameters as individual observations are deleted. You can apply the same idea to examining influence of groups of observations for example, the multiple observations for subjects in longitudinal or clustered data but you need to adapt it to the fact that different subjects can have different numbers of observations. Such an adaptation is discussed by Zhu, Ibrahim, and Cho (2012), who generalize the subject size factor as the so-called degree of perturbation, and correspondingly generalize Cook s distances as the scaled Cook s distance. This paper presents the %SCDMixed SAS^® macro, which implements these ideas for analyzing influence in mixed models for longitudinal or clustered data. The macro calculates the degree of perturbation and scaled Cook s distance measures of Zhu et al. (2012) and presents the results with useful tabular and graphical summaries. The underlying theory is discussed, as well as some of the programming tricks useful for computing these influence measures efficiently. The macro is demonstrated using both simulated and real data to show how you can interpret its results for analyzing influence in your longitudinal modeling.

Stepwise regression includes regression models in which the predictive variables are selected by an automated algorithm. The stepwise method involves two approaches: backward elimination and forward selection. Currently, SAS^® has three procedures capable of performing stepwise regression: REG, LOGISTIC, and GLMSELECT. PROC REG handles the linear regression model, but does not support a CLASS statement. PROC LOGISTIC handles binary responses and allows for logit, probit, and complementary log-log link functions. It also supports a CLASS statement. The GLMSELECT procedure performs selections in the framework of general linear models. It allows for a variety of model selection methods, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). PROC GLMSELECT also supports a CLASS statement. We present a stepwise algorithm for generalized linear mixed models for both marginal and conditional models. We illustrate the algorithm using data from a longitudinal epidemiology study aimed to investigate parents beliefs, behaviors, and feeding practices that associate positively or negatively with indices of sleep quality.

Finding groups with similar attributes is at the core of knowledge discovery. To this end, Cluster Analysis automatically locates groups of similar observations. Despite successful applications, many practitioners are uncomfortable with the degree of automation in Cluster Analysis, which causes intuitive knowledge to be ignored. This is more true in text mining applications since individual words have meaning beyond the data set. Discovering groups with similar text is extremely insightful. However, blind applications of clustering algorithms ignore intuition and hence are unable to group similar text categories. The challenge is to integrate the power of clustering algorithms with the knowledge of experts. We demonstrate how SAS/STAT^® 9.2 procedures and the SAS^® Macro Language are used to ensemble the opinion of domain experts with multiple clustering models to arrive at a consensus. The method has been successfully applied to a large data set with structured attributes and unstructured opinions. The result is the ability to discover observations with similar attributes and opinions by capturing the wisdom of the crowds whether man or model.

This paper expands upon A Multilevel Model Primer Using SAS^® PROC MIXED in which we presented an overview of estimating two- and three-level linear models via PROC MIXED. However, in our earlier paper, we, for the most part, relied on simple options available in PROC MIXED. In this paper, we present a more advanced look at common PROC MIXED options used in the analysis of social and behavioral science data, as well introduce users to two different SAS macros previously developed for use with PROC MIXED: one to examine model fit (MIXED_FIT) and the other to examine distributional assumptions (MIXED_DX). Specific statistical options presented in the current paper include (a) PROC MIXED statement options for estimating statistical significance of variance estimates (COVTEST, including problems with using this option) and estimation methods (METHOD =), (b) MODEL statement option for degrees of freedom estimation (DDFM =), and (c) RANDOM statement option for specifying the variance/covariance structure to be used (TYPE =). Given the importance of examining model fit, we also present methods for estimating changes in model fit through an illustration of the SAS macro MIXED_FIT. Likewise, the SAS macro MIXED_DX is introduced to remind users to examine distributional assumptions associated with two-level linear models. To maintain continuity with the 2013 introductory PROC MIXED paper, thus providing users with a set of comprehensive guides for estimating multilevel models using PROC MIXED, we use the same real-world data sources that we used in our earlier primer paper.

The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS^® has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is specifically designed for complex Bayesian modeling (not discussed here). This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then describes the Bayesian capabilities provided in four procedures(the GENMOD, PHREG, FMM, and LIFEREG procedures) including the available prior distributions, posterior summary statistics, and convergence diagnostics. Various sampling methods that are used to sample from the posterior distributions are also discussed. The second part of the paper describes how to use the GENMOD and PHREG procedures to perform Bayesian analyses for real-world examples and how to take advantage of the Bayesian framework to address scientific questions.

SAS^® and SAS^® Enterprise Miner^™ have provided advanced data mining and machine learning capabilities for years beginning long before the current buzz. Moreover, SAS has continually incorporated advances in machine learning research into its classification, prediction, and segmentation procedures. SAS Enterprise Miner now includes many proven machine learning algorithms in its high-performance environment and is introducing new leading-edge scalable technologies. This paper provides an overview of machine learning and presents several supervised and unsupervised machine learning examples that use SAS Enterprise Miner. So, come back to the future to see machine learning in action with SAS!

Overdispersion (extra variation) arises in binomial, multinomial, or count data when variances are larger than those allowed by the binomial, multinomial, or Poisson model. This phenomenon is caused by clustering of the data, lack of independence, or both. As pointed out by McCullagh and Nelder (1989), Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception. Several approaches are found for handling overdispersed data, namely quasi-likelihood and likelihood models, generalized estimating equations, and generalized linear mixed models. Some classical likelihood models are presented. Among them are the beta-binomial, binomial cluster (a.k.a. random clumped binomial), negative-binomial, zero-inflated Poisson, zero-inflated negative-binomial, hurdle Poisson, and the hurdle negative-binomial. We focus on how these approaches or models can be implemented in a practical way using, when appropriate, the procedures GLIMMIX, GENMOD, FMM, COUNTREG, NLMIXED, and SURVEYLOGISTIC. Some real data set examples are discussed in order to illustrate these applications. We also provide some guidance on how to analyze generalized linear overdispersion mixed models and possible scenarios where we might encounter them.

In randomized experiments, it is generally assumed that the hierarchical structures and variances are the same in the treatment and control groups. In some situations, however, these structures and variance components can differ. Consider a randomized experiment in which individuals randomized to the treatment condition are further assigned to clusters in which the intervention is administered, but no such clustering occurs in the control condition. Such a structure can occur, for example, when the individuals in the treatment condition are randomly assigned to group therapy sessions or to mathematics tutoring groups; individuals in the control condition do not receive group therapy or mathematics tutoring and therefore do not have that level of clustering. In this example, individuals in the treatment condition have a hierarchical structure, but individuals in the control condition do not. If the therapists or tutors differ in efficacy, the clustering in the treatment condition induces an extra source of variability in the data that needs to be accounted for in the analysis. We show how special features of SAS^® PROC MIXED and PROC GLIMMIX can be used to analyze data in which one or more treatment groups have a hierarchical structure that differs from that in the control group. We also discuss how to code variables in order to increase the computational efficiency for estimating parameters from these designs.

SAS/STAT^® 13.1 includes the new ICLIFETEST procedure, which is specifically designed for analyzing interval-censored data. This type of data is frequently found in studies where the event time of interest is known to have occurred not at a specific time but only within a certain time period. PROC ICLIFETEST performs nonparametric survival analysis of interval-censored data and is a counterpart to PROC LIFETEST, which handles right-censored data. With similar syntax, you use PROC ICLIFETEST to estimate the survival function and to compare the survival functions of different populations. This paper introduces you to the ICLIFETEST procedure and presents examples that illustrate how you can use it to perform analyses of interval-censored data.

Hierarchical data are common in many fields, from pharmaceuticals to agriculture to sociology. As data sizes and sources grow, information is likely to be observed on nested units at multiple levels, calling for the multilevel modeling approach. This paper describes how to use the GLIMMIX procedure in SAS/STAT^® to analyze hierarchical data that have a wide variety of distributions. Examples are included to illustrate the flexibility that PROC GLIMMIX offers for modeling within-unit correlation, disentangling explanatory variables at different levels, and handling unbalanced data. Also discussed are enhanced weighting options, new in SAS/STAT 13.1, for both the MODEL and RANDOM statements. These weighting options enable PROC GLIMMIX to handle weights at different levels. PROC GLIMMIX uses a pseudolikelihood approach to estimate parameters, and it computes robust standard error estimators. This new feature is applied to an example of complex survey data that are collected from multistage sampling and have unequal sampling probabilities.

A central component of discussions of healthcare reform in the U.S. are estimations of healthcare cost and use at the national or state level, as well as for subpopulation analyses for individuals with certain demographic properties or medical conditions. For example, a striking but persistent observation is that just 1% of the U.S. population accounts for more than 20% of total healthcare costs, and 5% account for almost 50% of total costs. In addition to descriptions of specific data sources underlying this type of observation, we demonstrate how to use SAS^® to generate these estimates and to extend the analysis in various ways; that is, to investigate costs for specific subpopulations. The goal is to provide SAS programmers and healthcare analysts with sufficient data-source background and analytic resources to independently conduct analyses on a wide variety of topics in healthcare research. For selected examples, such as the estimates above, we concretely show how to download the data from federal web sites, replicate published estimates, and extend the analysis. An added plus is that most of the data sources we describe are available as free downloads.

Sampling is widely used in different fields for quality control, population monitoring, and modeling. However, the purposes of sampling might be justified by the business scenario, such as legal or compliance needs. This paper uses one probability sampling method, stratified sampling, combined with quality control review business cost to determine an optimized procedure of sampling that satisfies both statistical selection criteria and business needs. The first step is to determine the total number of strata by grouping the strata with a small number of sample units, using box-and-whisker plots outliers as a whole. Then, the cost to review the sample in each stratum is justified by a corresponding business counter-party, which is the human working hour. Lastly, using the determined number of strata and sample review cost, optimal allocation of predetermined total sample population is applied to allocate the sample into different strata.

Many different neuroscience researchers have explored how various parts of the brain are connected, but no one has performed association mining using brain data. In this study, we used SAS^® Enterprise Miner^™ 7.1 for association mining of brain data collected by a 14-channel EEG device. An application of the association mining technique is presented in this novel context of brain activities and by linking our results to theories of cognitive neuroscience. The brain waves were collected while a user processed information about Facebook, the most well-known social networking site. The data was cleaned using Independent Component Analysis via an open source MATLAB package. Next, by applying the LORETA algorithm, activations at every fraction of the second were recorded. The data was codified into transactions to perform association mining. Results showing how various parts of brain get excited while processing the information are reported. This study provides preliminary insights into how brain wave data can be analyzed by widely available data mining techniques to enhance researcher s understanding of brain activation patterns.

In our previous work, we often needed to perform large numbers of repetitive and data-driven post-campaign analyses to evaluate the performance of marketing campaigns in terms of customer response. These routine tasks were usually carried out manually by using Microsoft Excel, which was tedious, time-consuming, and error-prone. In order to improve the work efficiency and analysis accuracy, we managed to automate the analysis process with SAS^® programming and replace the manual Excel work. Through the use of SAS macro programs and other advanced skills, we successfully automated the complicated data-driven analyses with high efficiency and accuracy. This paper presents and illustrates the creative analytical ideas and programming skills for developing the automatic analysis process, which can be extended to apply in a variety of business intelligence and analytics fields.

This paper kicks off a project to write a comprehensive book of best practices for documenting SAS^® projects. The presenter s existing documentation styles are explained. The presenter wants to discuss and gather current best practices used by the SAS user community. The presenter shows documentation styles at three different levels of scope. The first is a style used for project documentation, the second a style for program documentation, and the third a style for variable documentation. This third style enables researchers to repeat the modeling in SAS research, in an alternative language, or conceptually.

There is an ever-increasing number of study designs and analysis of clinical trials using Bayesian frameworks to interpret treatment effects. Many research scientists prefer to understand the power and probability of taking a new drug forward across the whole range of possible true treatment effects, rather than focusing on one particular value to power the study. Examples are used in this paper to show how to compute Bayesian probabilities using the SAS/STAT^® MIXED procedure and UNIVARIATE procedure. Particular emphasis is given to the application on efficacy analysis, including the comparison of new drugs to placebos and to standard drugs on the market.

The emerging discipline of data governance encompasses data quality assurance, data access and use policy, security risks and privacy protection, and longitudinal management of an organization s data infrastructure. In the interests of forestalling another bureaucratic solution to data governance issues, this presentation features database programming tools that provide rapid access to big data and make selective access to and restructuring of metadata practical.

The Affordable Care Act (ACA) contains provisions that have stimulated interest in analytics among health care providers, especially those provisions that address quality of outcomes. High Impact Technologies (HIT) has been addressing these issues since before passage of the ACA and has a Health Care Data Model recognized by Gartner and implemented at several health care providers. Recently, HIT acquired SAS^® Visual Analytics, and this paper reports our successful efforts to use SAS Visual Analytics for visually exploring Big Data for health care providers. Health care providers can suffer significant financial penalties for readmission rates above a certain threshold and other penalties related to quality of care. We have been able to use SAS Visual Analytics, coupled with our experience gained from implementing the HIT Healthcare Data Model at a number of Healthcare providers, to identify clinical measures that are significant predictors for readmission. As a result, we can help health care providers reduce the rate of 30-day readmissions.

Inference of variance components in linear mixed effect models (LMEs) is not always straightforward. I introduce and describe a flexible SAS^® macro (%COVTEST) that uses the likelihood ratio test (LRT) to test covariance parameters in LMEs by means of the parametric bootstrap. Users must supply the null and alternative models (as macro strings), and a data set name. The macro calculates the observed LRT statistic and then simulates data under the null model to obtain an empirical p-value. The macro also creates graphs of the distribution of the simulated LRT statistics. The program takes advantage of processing accomplished by PROC MIXED and some SAS/IML^® functions. I demonstrate the syntax and mechanics of the macro using three examples.

A case control study is in its most basic form comparing a case series to a matched control series and are commonly implemented in the field of public health. While matching is intended to eliminate confounding, the main potential benefit of matching in case control studies is a gain in efficiency. There are many known methods for selecting potential match or matches (in case of 1:n studies) per case, the most prominent being distance-based approach and matching on propensity scores. In this paper, we will go through both and compare their results and will present a macro capable of performing both.

This paper demonstrates the new case-level residuals in the CALIS procedure and how they differ from classic residuals in structural equation modeling (SEM). Residual analysis has a long history in statistical modeling for finding unusual observations in the sample data. However, in SEM, case-level residuals are considerably more difficult to define because of 1) latent variables in the analysis and 2) the multivariate nature of these models. Historically, residual analysis in SEM has been confined to residuals obtained as the difference between the sample and model-implied covariance matrices. Enhancements to the CALIS procedure in SAS/STAT^® 12.1 enable users to obtain case-level residuals as well. This enables a more complete residual and influence analysis. Several examples showing mean/covariance residuals and case-level residuals are presented.

The graphical display of the individual data is important in understanding the raw data and the relationship between the variables in the data set. You can explore your data to ensure statistical assumptions hold by detecting and excluding outliers if they exist. Since you can visualize what actually happens to individual subjects, you can make your conclusions more convincing in statistical analysis and interpretation of the results. SAS^® provides many tools for creating graphs of individual data. In some cases, multiple tools need to be combined to make a specific type of graph that you need. Examples are used in this paper to show how to create graphs of individual data using the SAS^® ODS Graphics procedures (SG procedures).

There has been debate regarding which method to use to analyze repeated measures continuous data when the design includes only two measurement times. Five different techniques can be applied and give similar results when there is little to no correlation between pre- and post-test measurements and when data at each time point are complete: 1) analysis of variance on the difference between pre- and post-test, 2) analysis of covariance on the differences between pre- and post-test controlling for pre-test, 3) analysis of covariance on post-test controlling for pre-test, 4) multiple analysis of variance on post- test and pre-test, and 5) repeated measures analysis of variance. However, when there is missing data or if a moderate to high correlation between pre- and post-test measures exists under an intent-to-treat analysis framework, bias is introduced in the tests for the ANOVA, ANCOVA, and MANOVA techniques. A comparison of Type III sum of squares, F-tests, and p-values for a complete case and an intent-to-treat analysis are presented. The analysis using a complete case data set shows that all five methods produce similar results except for the repeated measures ANOVA due to a moderate correlation between pre- and post-test measures. However, significant bias is introduced for the tests using the intent-to-treat data set.

When submitting clinical data to the Food and Drug Administration (FDA), besides the usual trials results, we need to submit the information that helps the FDA to understand the data. The FDA has required the CDISC Case Report Tabulation Data Definition Specification (Define-XML), which is based on the CDISC Operational Data Model (ODM), for submissions using Study Data Tabulation Model (SDTM). Electronic submission to the FDA is therefore a process of following the guidelines from CDISC and FDA. This paper illustrates how to create an FDA guidance compliant define.xml v2 from metadata by using SAS^®.

The first table in many research journal articles is a statistical comparison of demographic traits across study groups. It might not be exciting, but it s necessary. And although SAS^® calculates these numbers with ease, it is a time-consuming chore to transfer these results into a journal-ready table. Introducing the time-saving deluxe %MAKETABLE SAS macro it does the heavy work for you. It creates a Microsoft Word table of up to four comparative groups reporting t-tests, chi-square, ANOVA, or median test results, including a p-value. You specify only a one-line macro call for each line in the table, and the macro takes it from there. The result is a tidily formatted journal-ready Word table that you can easily include in a manuscript, report, or Microsoft PowerPoint presentation. For statisticians and researchers needing to summarize group comparisons in a table, this macro saves time and relieves you from the drudgery of trying to make your output neat and pretty. And after all, isn t that what we want computing to do for us?

In this new era of healthcare reform, health insurance companies have heightened their efforts to pinpoint who their customers are, what their characteristics are, what they look like today, and how this impacts business in today s and tomorrow s healthcare environment. The passing of the Healthcare Reform policies led insurance companies to focus and prioritize their projects on understanding who the members in their current population were. The goal was to provide an integrated single view of the customer that could be used for retention, increased market share, balancing population risk, improving customer relations, and providing programs to meet the members' needs. By understanding the customer, a marketing strategy could be built for each customer segment classification, as predefined by specific attributes. This paper describes how SAS^® was used to perform the analytics that were used to characterize their insured population. The high-level discussion of the project includes regression modeling, customer segmentation, variable selection, and propensity scoring using claims, enrollment, and third-party psychographic data.

Debugging SAS^® code contained in a macro can be frustrating because the SAS error messages refer only to the line in the SAS log where the macro was invoked. This can make it difficult to pinpoint the problem when the macro contains a large amount of SAS code. Using a macro that contains one small DATA step, this paper shows how to use the MPRINT and MFILE options along with the fileref MPRINT to write just the SAS code generated by a macro to a file. The 'de-macroified' SAS code can be easily executed and debugged.

Your company s chronically overloaded SAS^® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS^® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS^® Display Manager on a server machine, runs counter to the expectation of the SAS grid your users are now supposed to switch to SAS^® Enterprise Guide^® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play a hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, and your users get to keep their favorite SAS Display Manager.

In evaluation instruments and tests, individual items are often collected using an ordinal measurement or Likert type scale. Typically measures such as Cronbach s alpha are estimated using the standard Pearson correlation. Gadderman and Zumbo (2012) illustrate how using the standard Pearson correlations may yield biased estimates of reliability when the data are ordinal and present methodology for using the polychoric correlation in reliability estimates as an alternative. This session shows how to implement the methods of Gadderman and Zumbo using SAS^® software. An example will be presented that incorporates these methods in the estimation of the reliability of an active learning post-occupancy evaluation instrument developed by Steelcase Education Solutions researchers.

Logistic regression is a powerful technique for predicting the outcome of a categorical response variable and is used in a wide range of disciplines. Until recently, however, this methodology was available only for data that were collected using a simple random sample. Thanks to the work of statisticians such as Binder (1983), logistic modeling has been extended to data that are collected from a complex survey design that includes strata, clusters, and weights. Through examples, this paper provides guidance on how to use PROC SURVEYLOGISTIC to apply logistic regression modeling techniques to data that are collected from a complex survey design. The examples relate to calculating odds ratios for models with interactions, scoring data sets, and producing ROC curves. As an extension of these techniques, a final example shows how to fit a Generalized Estimating Equations (GEE) logit model.

Each month, our project team delivers updated 5-Star ratings for 15,700+ nursing homes across the United States to Centers for Medicare and Medicaid Services. There is a wealth of data (and processing) behind the ratings, and this data is longitudinal in nature. A prior paper in this series, 'Programming the Provider Previews: Extreme SAS^® Reporting,' discussed one aspect of the processing involved in maintaining the Nursing Home Compare website. This paper will discuss two other aspects of our processing: creating an annual data Compendium and extending the 5-star processing to accommodate several different output formats for different purposes. Products used include Base SAS^®, SAS/STAT^®, ODS Graphics procedures, and SAS/GRAPH^®. New annotate facilities in both SAS/GRAPH and the ODS Graphics procedures will be discussed. This paper and presentation will be of most interest to SAS programmers with medium to advanced SAS skills.

For almost two decades, Western Kentucky University's Office of Institutional Research (WKU-IR) has used SAS^® to help shape the future of the institution by providing faculty and administrators with information they can use to make a difference in the lives of their students. This presentation provides specific examples of how WKU-IR has shaped the policies and practices of our institution and discusses how WKU-IR moved from a support unit to a key strategic partner. In addition, the presentation covers the following topics: How the WKU Office of Institutional Research developed over time; Why WKU abandoned reactive reporting for a more accurate, convenient system using SAS^® Enterprise Intelligence Suite for Education; How WKU shifted from investigating what happened to predicting outcomes using SAS^® Enterprise Miner^™ and SAS^® Text Miner; How the office keeps the system relevant and utilized by key decision makers; What the office has accomplished and key plans for the future.

PROC TABULATE is the most widely used reporting tool in SAS^®, along with PROC REPORT. Any kind of report with the desired statistics can be produced by PROC TABULATE. When we need to report some summary statistics like mean, median, and range in the heading, either we have to edit it outside SAS in word processing software or enter it manually. In this paper, we discuss how we can automate this to be dynamic by using PROC SQL and some simple macros.

This paper shares our experience integrating two leading data analytics and Geographic Information Systems (GIS) software products SAS^® and ArcGIS to provide integrated reporting capabilities. SAS is a powerful tool for data manipulation and statistical analysis. ArcGIS is a powerful tool for analyzing data spatially and presenting complex cartographic representations. Combining statistical data analytics and GIS provides increased insight into data and allows for new and creative ways of visualizing the results. Although products exist to facilitate the sharing of data between SAS and ArcGIS, there are no ready-made solutions for integrating the output of these two tools in a dynamic and automated way. Our approach leverages the individual strengths of SAS and ArcGIS, as well as the report delivery infrastructure of SAS^® Information Delivery Portal.

This introductory presentation is intended for an audience new to mixed models who wants to get an overview of this useful class of models. Learn about mixed models as an extension of ordinary regression models, and see several examples of mixed models in social, agricultural, and pharmaceutical research.

For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now, they are gaining favor in business statistics for giving better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn about how predictions based on a mixed model differ from predictions in ordinary regression and see examples of mixed models with business data.

When the dependent variable is a count, Poisson regression is a natural choice of distribution for fitting a regression model. This presentation is intended for an audience experienced in linear regression modeling, but new to Poisson regression modeling. Learn the basics of this useful distribution and see some examples where it is appropriate. Tips for identifying problems with fitting a Poisson regression model and some helpful alternatives are provided.

Analyzing data from a complex probability survey involves weighting observations so that inferences are correct. This introductory presentation is intended for an audience new to analyzing survey data. Learn the essentials of using the SURVEYxx procedures in SAS/STAT^®.

Do you need a statistic that is not computed by any SAS^® procedure? Reach for the SAS/IML^® language! Many statistics are naturally expressed in terms of matrices and vectors. For these, you need a matrix-vector language. This hands-on workshop introduces the SAS/IML language to experienced SAS programmers. The workshop focuses on statements that create and manipulate matrices, read and write data sets, and control the program flow. You will learn how to write user-defined functions, interact with other SAS procedures, and recognize efficient programming techniques. Programs are written using the SAS/IML^® Studio development environment. This course covers Chapters 2 4 of Statistical Programming with SAS/IML Software (Wicklin, 2010).

It is not uncommon to find models with random components like location, clinic, teacher, etc., not just the single error term we think of in ordinary regression. This paper uses several examples to illustrate the underlying ideas. In addition, the response variable might be Poisson or binary rather than normal, thus taking us into the realm of generalized linear mixed models, These too will be illustrated with examples.

This paper illustrates some SAS^® graphs that can be useful for variable selection in predictive modeling. Analysts are often confronted with hundreds of candidate variables available for use in predictive models, and this paper illustrates some simple SAS graphs that are easy to create and that are useful for visually evaluating candidate variables for inclusion or exclusion in predictive models. The graphs illustrated in this paper are bar charts with confidence intervals using the GCHART procedure and comparative histograms using the UNIVARIATE procedure. The graphs can be used for most combinations of categorical or continuous target variables with categorical or continuous input variables. This paper assumes the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression.

Healthcare services data on products and services come in different shapes and forms. Data cleaning, characterization, massaging, and transformation are essential precursors to any statistical model-building efforts. In addition, data size, quality, and distribution influence model selection, model life cycle, and the ease with which business insights are extracted from data. Analysts need to examine data characteristics and determine the right data transformation and methods of analysis for valid interpretation of results. In this presentation, we demonstrate the common data distribution types for a typical healthcare services industry such as Cardinal Health and their salient features. In addition, we use Base SAS^® and SAS/STAT^® for data transformation of both the response (Y) and the explanatory (X) variables in four combinations [RR (Y and X as row data), TR (only Y transformed), RT (only X transformed), and TT (Y and X transformed)] and the practical significance of interpreting linear, logistic, and completely randomized design model results using the original and the transformed data values for decision-making processes. The reality of dealing with diverse forms of data, the ramification of data transformation, and the challenge of interpreting model results of transformed data are discussed. Our analysis showed that the magnitude of data variability is an overriding factor to the success of data transformation and the subsequent tasks of model building and interpretation of model parameters. Although data transformation provided some benefits, it complicated analysis and subsequent interpretation of model results.

The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS^® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.

Determining what, when, and how to migrate SAS^® software from one major version to the next is a common challenge. SAS provides documentation and tools to help make the assessment, planning, and eventual deployment go smoothly. We describe some of the keys to making your migration a success, including the effective use of the SAS^® Migration Utility, both in the analysis mode and the execution mode. This utility is responsible for analyzing each machine in an existing environment, surfacing product-specific migration information, and creating packages to migrate existing product configurations to later versions. We show how it can be used to simplify each step of the migration process, including recent enhancements to flag product version compatibility and incompatibility.

This session introduces frailty models and their use in biostatistics to model time-to-event or survival data. The session uses examples to review situations in which a frailty model is a reasonable modeling option, to describe which SAS^® procedures can be used to fit frailty models, and to discuss the advantages and disadvantages of frailty models compared to other modeling options.

The NLIN procedure fits a wide variety of nonlinear models. However, some models can be so nonlinear that standard statistical methods of inference are not trustworthy. That s when you need the diagnostic and inferential features that were added to PROC NLIN in SAS/STAT^® 9.3, 12.1, and 13.1. This paper presents these features and explains how to use them. Examples demonstrate how to use parameter profiling and confidence curves to identify the nonlinearcharacteristics of the model parameters. They also demonstrate how to use the bootstrap method to study the sampling distribution of parameter estimates and to make more accurate statistical inferences. This paper highlights how measures of nonlinearity help you diagnose models and decide on potential reparameterization. It also highlights how multithreading is used to tame the large number of nonlinear optimizations that are required for these features.

Item response theory (IRT) is concerned with accurate test scoring and development of test items. You design test items to measure various types of abilities (such as math ability), traits (such as extroversion), or behavioral characteristics (such as purchasing tendency). Responses to test items can be binary (such as correct or incorrect responses in ability tests) or ordinal (such as degree of agreement on Likert scales). Traditionally, IRT models have been used to analyze these types of data in psychological assessments and educational testing. With the use of IRT models, you can not only improve scoring accuracy but also economize test administrations by adaptively using only the discriminative items. These features might explain why in recent years IRT models have become increasingly popular in many other fields, such as medical research, health sciences, quality-of-life research, and even marketing research. This paper describes a variety of IRT models, such as the Rasch model, two-parameter model, and graded response model, and demonstrates their application by using real-data examples. It also shows how to use the IRT procedure, which is new in SAS/STAT^® 13.1, to calibrate items, interpret item characteristics, and score respondents. Finally, the paper explains how the application of IRT models can help improve test scoring and develop better tests. You will see the value in applying item response theory, possibly in your own organization!

The soaring number of publicly available data sets across disciplines have allowed for increased access to real-life data for use in both research and educational settings. These data often leverage cost-effective complex sampling designs including stratification and clustering, which allow for increased efficiency in survey data collection and analyses. Weighting becomes a necessary component in these survey data in order to properly calculate variance estimates and arrive at sound inferences through statistical analysis. Generally speaking, these weights are included with the variables provided in the public use data, though an explanation for how and when to use these weights is often lacking. This paper presents an analysis using the California Health Interview Survey to compare weighted and non-weighted results using SAS^® PROC LOGISTIC and PROC SURVEYLOGISTIC.

How do you compare group responses when the data are unbalanced or when covariates come into play? Simple averages will not do, but LS-means are just the ticket. Central to postfitting analysis in SAS/STAT^® linear modeling procedures, LS-means generalize the simple average for unbalanced data and complicated models. They play a key role both in standard treatment comparisons and Type III tests and in newer techniques such as sliced interaction effects and diffograms. This paper reviews the definition of LS-means, focusing on their interpretation as predicted population marginal means, and it illustrates their broad range of use with numerous examples.

Email is an important marketing channel for digital marketers. We can stay connected with our subscribers and attract them with relevant content as long as they are still subscribed to our email communication. In this session, we are planning to discuss why it's important to manage opt-out risk; how did we predict opt-out risk; and how do we proactively manage opt-out using the models we developed.

One of the most common questions about logistic regression is How do I know if my model fits the data? There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-squared) and goodness of fit tests (like the Pearson chi-square). This presentation looks first at R-squared measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.

In applied statistical practice, incomplete measurement sequences are the rule rather than the exception. Fortunately, in a large variety of settings, the stochastic mechanism governing the incompleteness can be ignored without hampering inferences about the measurement process. While ignorability only requires the relatively general missing at random assumption for likelihood and Bayesian inferences, this result cannot be invoked when non-likelihood methods are used. We will first sketch the framework used for contemporary missing-data analysis. Apart from revisiting some of the simpler but problematic methods, attention will be paid to direct likelihood and multiple imputation. Because popular non-likelihood-based methods do not enjoy the ignorability property in the same circumstances as likelihood and Bayesian inferences, weighted versions have been proposed. This holds true in particular for generalized estimating equations (GEE). Even so-called doubly-robust versions have been derived. Apart from GEE, also pseudo-likelihood based strategies can be adapted appropriately. We describe a suite of corrections to the standard form of pseudo-likelihood, to ensure its validity under missingness at random. Our corrections follow both single and double robustness ideas, and is relatively simple to apply.

Bootstrapped Decision Tree is a variable selection method used to identify and eliminate unintelligent variables from a large number of initial candidate variables. Candidates for subsequent modeling are identified by selecting variables consistently appearing at the top of decision trees created using a random sample of all possible modeling variables. The technique is best used to reduce hundreds of potential fields to a short list of 30 50 fields to be used in developing a model. This method for variable selection has recently become available in JMP^® under the name BootstrapForest; this paper presents an implementation in Base SAS^®9. The method does accept but does not require a specific outcome to be modeled and will therefore work for nearly any type of model, including segmentation, MCMC, multiple discrete choice, in addition to standard logistic regression. Keywords: Bootstrapped Decision Tree, Variable Selection

For most practitioners, ordinary least square (OLS) regression with a Gaussian distributional assumption might be the top choice for modeling fractional outcomes in many business problems. However, it is conceptually flawed to assume a Gaussian distribution for a response variable in the [0, 1] interval. In this paper, several modeling methodologies for fractional outcomes with their implementations in SAS^® are discussed through a data analysis exercise in predicting corporate financial leverage ratios. Various empirical and conceptual methods for the model evaluation and comparison are also discussed throughout the example. This paper provides a comprehensive survey about how to model fractional outcomes.

Predicting loss given default (LGD) is playing an increasingly crucial role in quantitative credit risk modeling. In this paper, we propose to apply mixed effects models to predict corporate bonds LGD, as well as other widely used LGD models. The empirical results show that mixed effects models are able to explain the unobservable heterogeneity and to make better predictions compared with linear regression and fractional response regression. All the statistical models are performed in SAS/STAT^®, SAS^® 9.2, using specifically PROC REG and PROC NLMIXED, and the model evaluation metrics are calculated in PROC IML. This paper gives a detailed description on how to use PROC NLMIXED to build and estimate generalized linear models and mixed effects models.

While survey researchers make great attempts to standardize their questionnaires including the usage of ratings scales in order to collect unbiased data, respondents are still prone to introducing their own interpretation and bias to their responses. This bias can potentially affect the understanding of commonly investigated drivers of customer satisfaction and limit the quality of the recommendations made to management. One such problem is scale use heterogeneity, in which respondents do not employ a panoramic view of the entire scale range as provided, but instead focus on parts of the scale in giving their responses. Studies have found that bias arising from this phenomenon was especially prevalent in multinational research, e.g., respondents of some cultures being inclined to use only the neutral points of the scale. Moreover, personal variability in response tendencies further complicates the issue for researchers. This paper describes an implementation that uses a Bayesian hierarchical model to capture the distribution of heterogeneity while incorporating the information present in the data. More specifically, SAS^® PROC MCMC is used to carry out a comprehensive modeling strategy of ratings data that account for individual level scale usage. Key takeaways include an assessment of differences between key driver analyses that ignore this phenomenon versus the one that results from our implementation. Managerial implications are also emphasized in light of the prevalent use of more simplistic approaches.

This paper considers the %MRE macro for estimating multivariate ratio estimates. Also, we use PROC REG to estimate multivariate regression estimates and to show that regression estimates are superior to the ratio estimates.

When viewing and working with SAS^® data sets especially wide ones it s often instinctive to rearrange the variables (columns) into some intuitive order. The RETAIN statement is one of the most commonly cited methods used for ordering variables. Though RETAIN can perform this task, its use as an ordering clause can cause a host of easily missed problems due to its intended function of retaining values across DATA step iterations. This risk is especially great for the more novice SAS programmer. Instead, two equally effective and less risky ways to order data set variables are recommended, namely, the FORMAT and SQL SELECT statements.

Part of being a good analyst and statistician is being able to understand the output of a statistical test in SAS^®. P-values are ubiquitous in statistical output as well as medical literature and can be the deciding factor in whether a paper gets published. This shows a somewhat dictatorial side of them. But do we really know what they mean? In a democratic process, people vote for another person to represent them, their values, and their opinions. In this sense, the sample of research subjects, their characteristics, and their experience, are combined and represented to a certain degree by the p-value. This paper discusses misconceptions about and misinterpretations of the p-value, as well as how things can go awry in calculating a p-value. Alternatives to p-values are described, with advantages and disadvantages of each. Finally, some thoughts about p-value interpretation are given. To disarm the dictator, we need to understand what the democratic p-value can tell us about what it represents&.and what it doesn't. This presentation is aimed at beginning to intermediate SAS statisticians and analysts working with SAS/STAT^®.

PD_Calibrate is a macro that standardizes the calibration of our predictive credit-scoring models at Nykredit. The macro is activated with an input data set, variables, anchor point, specification of method, number of buckets, kink-value, and so on. The output consists of graphs, HTML, and two data sets containing key values for the model being calibrated and values for the use of graphics.

PROC TABULATE is a powerful tool for creating tabular summary reports. Its advantages, over PROC REPORT, are that it requires less code, allows for more convenient table construction, and uses syntax that makes it easier to modify a table s structure. However, its inability to compute the sum, difference, product, and ratio of column sums has hindered its use in many circumstances. This paper illustrates and discusses some creative approaches and methods for overcoming these limitations, enabling users to produce needed reports and still enjoy the simplicity and convenience of PROC TABULATE. These methods and skills can have prominent applications in a variety of business intelligence and analytics fields.

The effectiveness of visual interpretation of the differences between pairs of LS-means in a generalized linear model includes the graph's ability to display four inferential and two perceptual tasks. Among the types of graphs which display some or all of these tasks are the forest plot, the mean-mean scatter plot (diffogram), and closely related to it, the mean-mean multiple comparison (MMC) plot. These graphs provide essential visual perspectives for interpretation of the differences among pairs of LS-means from a generalized linear model (GLM). The diffogram is a graphical option now available through ODS statistical graphics with linear model procedures such as GLIMMIX. Through combining ODS output files of the LS-means and their differences, the SGPLOT procedure can efficiently produce forest and MMC plots.

Power analysis helps you plan a study that has a controlled probability of detecting a meaningful effect, giving you conclusive results with maximum efficiency. SAS/STAT^® provides two procedures for performing sample size and power computations: the POWER procedure provides analyses for a wide variety of different statistical tests, and the GLMPOWER procedure focuses on power analysis for general linear models. In SAS/STAT 13.1, the GLMPOWER procedure has been updated to enable power analysis for multivariate linear models and repeated measures studies. Much of the syntax is similar to the syntax of the GLM procedure, including both the new MANOVA and REPEATED statements and the existing MODEL and CONTRAST statements. In addition, PROC GLMPOWER offers flexible yet parsimonious options for specifying the covariance. One such option is the two-parameter linear exponent autoregressive (LEAR) correlation structure, which includes other common structures such as AR(1), compound symmetry, and first-order moving average as special cases. This paper reviews the new repeated measures features of PROC GLMPOWER, demonstrates their use in several examples, and discusses the pros and cons of the MANOVA and repeated measures approaches.

The use of predictive models in healthcare has steadily increased over the decades. Statistical models now are assumed to be a necessary component in population health management. This session will review practical considerations in the choice of models to develop, criteria for assessing the utility of the models for production, and challenges with incorporating the models into business process flows. Specific examples of models will be provided based upon work by the Health Economics team at Blue Cross Blue Shield of North Carolina.

Sparse data sets are common in applications of text and data mining, social network analysis, and recommendation systems. In SAS^® software, sparse data sets are usually stored in the coordinate list (COO) transactional format. Two major drawbacks are associated with this sparse data representation: First, most SAS procedures are designed to handle dense data and cannot consume data that are stored transactionally. In that case, the options for analysis are significantly limited. Second, a sparse data set in transactional format is hard to store and process in distributed systems. Most techniques require that all transactions for a particular object be kept together; this assumption is violated when the transactions of that object are distributed to different nodes of the grid. This paper presents some different ideas about how to package all transactions of an object into a single row. Approaches include storing the sparse matrix densely, doing variable selection, doing variable extraction, and compressing the transactions into a few text variables by using Base64 encoding. These simple but effective techniques enable you to store and process your sparse data in better ways. This paper demonstrates how to use SAS^® Text Miner procedures to process sparse data sets and generate output data sets that are easy to store and can be readily processed by traditional SAS modeling procedures. The output of the system can be safely stored and distributed in any grid environment.

In a good clinical study, statisticians and various stakeholders are interested in assessing and isolating the effect of non-study drugs. One common practice in clinical trials is that clinical investigators follow the protocol to taper certain concomitant medications in an attempt to prevent or resolve adverse reactions and/or to minimize the number of subject withdrawals due to lack of efficacy or adverse event. To assess the impact of those tapering medicines during study is of high interest to clinical scientists and the study statistician. This paper presents the challenges and caveats of assessing the impact of tapering a certain type of concomitant medications using SAS^® 9.3 based on a hypothetical case. The paper also presents the advantages of visual graphs in facilitating communications between clinical scientists and the study statistician.

Duration and severity data arise in several fields including biostatistics, demography, economics, engineering, and sociology. SAS^® procedures LIFETEST, LIFEREG. and PHREG are the workhorses for analysis of time to event data in applications in biostatistics. Similar methods apply to the magnitude or severity of a random event, where the outcome might be right, left, or interval censored and/or, right or left truncated. All combinations of types of censoring and truncation could be present in the data set. Regression models such as the accelerated failure time model, the Cox model, and the non-homogeneous Poisson model have extensions to address time-varying covariates in the analysis of clustered outcomes, multivariate outcomes of mixed types, and recurrent events. We present an overview of new capabilities that are available in the procedures QLIM, QUANTLIFE, RELIABILITY, and SEVERITY with examples illustrating their application using empirical data sets drawn from easily accessible sources.

SAS/STAT^® 13.1 brings valuable new techniques to all sectors of the audience forSAS statistical software. Updates for survival analysis include nonparametricmethods for interval censoring and models for competing risks. Multipleimputation methods are extended with the addition of sensitivity analysis.Bayesian discrete choice models offer a modern approach for consumer research.Path diagrams are a welcome addition to structural equation modeling, and itemresponse models are available for educational assessment. This paper providesoverviews and introductory examples for each of the new focus areas in SAS/STAT13.1. The paper also provides a sneak preview of the follow-up release,SAS/STAT 13.2, which brings additional strategies for missing data analysis andother important updates to statistical customers.

Are you wondering what is causing your valuable machine asset to fail? What could those drivers be, and what is the likelihood of failure? Do you want to be proactive rather than reactive? Answers to these questions have arrived with SAS^® Predictive Asset Maintenance. The solution provides an analytical framework to reduce the amount of unscheduled downtime and optimize maintenance cycles and costs. An all new (R&D-based) version of this offering is now available. Key aspects of this paper include: Discussing key business drivers for and capabilities of SAS Predictive Asset Maintenance. Detailed analysis of the solution, including: Data model Explorations Data selections Path I: analysis workbench maintenance analysis and stability monitoring Path II: analysis workbench JMP^®, SAS^® Enterprise Guide^®, and SAS^® Enterprise Miner^™ Analytical case development using SAS Enterprise Miner, SAS^® Model Manager, and SAS^® Data Integration Studio SAS Predictive Asset Maintenance Portlet for reports A realistic business example in the oil and gas industry is used.

SAS^® has a number of procedures for smoothing scatter plots. In this tutorial, we review the nonparametric technique called LOESS, which estimates local regression surfaces. We review the LOESS procedure and then compare it to a parametric regression methodology that employs restricted cubic splines to fit nonlinear patterns in the data. Not only do these two methods fit scatterplot data, but they can also be used to fit multivariate relationships.

The scatter plot is a basic tool for examining the relationship between two variables. While the basic plot is good, enhancements can make it better. In addition, there might be problems of overplotting. In this paper, I cover ways to create basic and enhanced scatter plots and to deal with overplotting.

Universities strive to be competitive in the quality of education as well as cost of attendance. Peer institutions are selected to make comparisons pertaining to academics, costs, and revenues. These comparisons lead to strategic decisions and long-range planning to meet goals. The process of finding comparable institutions could be completed with cluster analysis, a statistical technique. Cluster analysis places universities with similar characteristics into groups or clusters. A process to determine peer universities will be illustrated using PROC STANDARD, PROC FASTCLUS, and PROC CLUSTER.

Multiple imputation, a popular strategy for dealing with missing values, usually assumes that the data are missing at random (MAR). That is, for a variable X, the probability that an observation is missing depends only on the observed values of other variables, not on the unobserved values of X. It is important to examine the sensitivity of inferences to departures from the MAR assumption, because this assumption cannot be verified using the data. The pattern-mixture model approach to sensitivity analysis models the distribution of a response as the mixture of a distribution of the observed responses and a distribution of the missing responses. Missing values can then be imputed under a plausible scenario for which the missing data are missing not at random (MNAR). If this scenario leads to a conclusion different from that of inference under MAR, then the MAR assumption is questionable. This paper reviews the concepts of multiple imputation and explains how you can apply the pattern-mixture model approach in the MI procedure by using the MNAR statement, which is new in SAS/STAT^® 13.1. You can specify a subset of the observations to derive the imputation model, which is used for pattern imputation based on control groups in clinical trials. You can also adjust imputed values by using specified shift and scale parameters for a set of selected observations, which are used for sensitivity analysis with a tipping-point approach.

One beautiful graph provides visual clarity of data summaries reported in tables and listings. Waterfall graphs show, at a glance, the increase or decrease of data analysis results from various industries. The introduction of SAS^® 9.2 ODS Statistical Graphics enables SAS^® programmers to produce high-quality results with less coding effort. Also, SAS programmers can create sophisticated graphs in stylish custom layouts using the SAS^® 9.3 Graph Template Language and ODS style template. This poster presents two sets of example waterfall graphs in the setting of clinical trials using SAS^® 9.3 and later. The first example displays colorful graphs using new SAS 9.3 options. The second example displays simple graphs with gray-scale color coding and patterns. SAS programmers of all skill levels can create these graphs on UNIX or Windows.

Systematic reviews have become increasingly important in healthcare, particularly when there is a need to compare new treatment options and to justify clinical effectiveness versus cost. This paper describes a method in SAS/STAT^® 9.2 for computing weighted averages and weighted standard deviations of clinical variables across treatment options while correctly using these summary measures to make accurate statistical inference. The analyses of data from systematic reviews typically involve computations of weighted averages and comparisons across treatment groups. However, the application of the TTEST procedure does not currently take into account weighted standard deviations when computing p-values. The use of a default non-weighted standard deviation can lead to incorrect statistical inference. This paper introduces a method for computing correct p-values using weighted averages and weighted standard deviations. Given a data set containing variables for three treatment options, we want to make pairwise comparisons of three independent treatments. This is done by creating two temporary data sets using PROC MEANS, which yields the weighted means and weighted standard deviations. Subsequently, we then perform a t-test on each temporary data set.The resultant data sets containing all comparisons of each treatment options are merged and then transposed to obtain the necessary statistics. The resulting output provides pairwise comparisons of each treatment option and uses the weighted standard deviations to yield the correct p-values in a desired format. This method allows the use of correct weighted standard deviations using PROC MEANS and PROC TTEST in summarizing data from a systematic review while providing correct p-values.

Global businesses must react to daily changes in market conditions over multiple geographies and industries. Consuming reputable daily economic reports assists in understanding these changing conditions, but requires both a significant human time commitment and a subjective assessment of each topic area of interest. To combat these constraints, Dow's Advanced Analytics team has constructed a process to calculate sentence-level topic frequency and sentiment scoring from unstructured economic reports. Daily topic sentiment scores are aggregated to weekly and monthly intervals and used as exogenous variables to model external economic time series data. These models serve to both validate the relationship between our sentiment scoring process and also as near-term forecasts where daily or weekly variables are unavailable. This paper will first describe our process of using SAS^® Text Miner to import and discover economic topics and sentiment from unstructured economic reports. The next section describes sentiment variable selection techniques that use SAS/STAT^®, SAS/ETS^®, and SAS^® Enterprise Miner^™ to generate similarity measures to economic indices. Our process then uses ARIMAX modeling in SAS^® Forecast Studio to create economic index forecasts with topic sentiments. Finally, we show how the sentiment model components are used as a matrix of economic key performance indicators by topic and geography.

Identifying claim fraud using predictive analytics represents a unique challenge. 1. Predictive analytics generally requires that you have a target variable which can be analyzed. Fraud is unique in this regard in that there is a lot of fraud that has occurred historically that has not been identified. Therefore, the definition of the target variable is difficult. 2.There is also a natural assumption that the past will bear some resemblance to the future. In the case of fraud, methods of defrauding insurance companies change quickly and can make the analysis of a historical database less valuable for identifying future fraud. 3. In an underlying database of claims that may have been determined to be fraudulent by an insurance company, there is many times an inconsistency between different claim adjusters regarding which claims are referred for investigation. This inconsistency can lead to erroneous model results due to data that is not homogenous. This paper will demonstrate how analytics can be used in several ways to help identify fraud: 1. More consistent referral of suspicious claims 2. Better identification of new types of suspicious claims 3. Incorporating claim adjuster insight into the analytics results. As part of this paper, we will demonstrate the application of several approaches to fraud identification: 1. Clustering 2. Association analysis 3. PRIDIT (Principal Component Analysis of RIDIT scores).

The independent means t-test is commonly used for testing the equality of two population means. However, this test is very sensitive to violations of the population normality and homogeneity of variance assumptions. In such situations, Yuen s (1974) trimmed t-test is recommended as a robust alternative. The purpose of this paper is to provide a SAS^® macro that allows easy computation of Yuen s symmetric trimmed t-test. The macro output includes a table with trimmed means for each of two groups, Winsorized variance estimates, degrees of freedom, and obtained value of t (with two-tailed p-value). In addition, the results of a simulation study are presented and provide empirical comparisons of the Type I error rates and statistical power of the independent samples t-test, Satterthwaite s approximate t-test, and the trimmed t-test when the assumptions of normality and homogeneity of variance are violated.

Epidemic modeling is an increasingly important tool in the study of infectious diseases. As technology advances and more and more parameters and data are incorporated into models, it is easy for programs to get bogged down and become unacceptably slow. The use of arrays for importing real data and collecting generated model results in SAS^® can help to streamline the process so results can be obtained and analyzed more efficiently. This paper describes a stochastic mathematical model for transmission of influenza among residents and healthcare workers in long-term care facilities (LTCFs) in New Mexico. The purpose of the model was to determine to what extent herd immunity among LTCF residents could be induced by varying the vaccine coverage among LTCF healthcare workers. Using arrays in SAS made it possible to efficiently incorporate real surveillance data into the model while also simplifying analyses of the results, which ultimately held important implications for LTCF policy and practice.

Regression is a helpful statistical tool for showing relationships between two or more variables. However, many users can find the barrage of numbers at best unhelpful, and at worst undecipherable. Using the shipments and inventories historical data from the U.S. Census Bureau's office of Manufacturers' Shipments, Inventories, and Orders (M3), we can create a graphical representation of two time series with PROC GPLOT and map out reported and expected results. By combining this output with results from PROC REG, we are able to highlight problem areas that might need a second look. The resulting graph shows which dates have abnormal relationships between our two variables and presents the data in an easy-to-use format that even users unfamiliar with SAS^® can interpret. This graph is ideal for analysts finding problematic areas such as outliers and trend-breakers or for managers to quickly discern complications and the effect they have on overall results.

The new Markov chain Monte Carlo (MCMC) procedure introduced in SAS/STAT^® 9.2 and further exploited in SAS/STAT^® 9.3 enables Bayesian computations to run efficiently with SAS^®. The MCMC procedure allows one to carry out complex statistical modeling within Bayesian frameworks under a wide spectrum of scientific research; in psychometrics, for example, the estimation of item and ability parameters is a kind. This paper describes how to use PROC MCMC for Bayesian inferences of item and ability parameters under a variety of popular item response models. This paper also covers how the results from SAS PROC MCMC are different from or similar to the results from WinBUGS. For those who are interested in the Bayesian approach to item response modeling, it is exciting and beneficial to shift to SAS, based on its flexibility of data managements and its power of data analysis. Using the resulting item parameter estimates, one can continue to test form constructions, test equatings, etc., with all these test development processes being accomplished with SAS!

Existing health literacy assessment tools developed for research purposes have constraints that limit their utility for clinical practice. The measurement of health literacy in clinical practice can be impractical due to the time requirements of existing assessment tools. Single Item Literacy Screener (SILS) items, which are self-administered brief screening questions, have been developed to address this constraint. We developed a model to predict limited health literacy that consists of two SILS and demographic information (for example, age, race, and education status) using a sample of patients in a St. Louis emergency department. In this paper, we validate this prediction model in a separate sample of patients visiting a primary care clinic in St. Louis. Using the prediction model developed in the previous study, we use SAS/STAT^® software to validate this model based on three goodness of fit criteria: rescaled R-squared, AIC, and BIC. We compare models using two different measures of health literacy, Newest Vital Sign (NVS) and Rapid Assessment of Health Literacy in Medicine Revised (REALM-R). We evaluate the prediction model by examining the concordance, area under the ROC curve, sensitivity, specificity, kappa, and gamma statistics. Preliminary results show 69% concordance when comparing the model results to the REALM-R and 66% concordance when comparing to the NVS. Our conclusion is that validating a prediction model for inadequate health literacy would provide a feasible way to assess health literacy in fast-paced clinical settings. This would allow us to reach patients with limited health literacy with educational interventions and better meet their information needs.

The Affordable Care Act that is being implemented now is expected to fundamentally reshape the health care industry. All current participants--providers, subscribers, and payers--will operate differently under a new set of key performance indicators (KPIs). This paper uses public data and SAS^® software to establish a baseline for the health care industry today so that structural changes can be measured in the future to establish the impact of the new laws.

Health plans use wide-ranging interventions based on criteria set by nationally recognized organizations (for example, NCQA and CMS) to change health-related behavior in large populations. Evaluation of these interventions has become more important with the increased need to report patient-centered quality of care outcomes. Findings from evaluations can detect successful intervention elements and identify at-risk patients for further targeted interventions. This paper describes how SAS^® was applied to evaluate the effectiveness of a patient-directed intervention designed to increase medication adherence and a health plan s CMS Part D Star Ratings. Topics covered include querying data warehouse tables, merging pharmacy and eligibility claims, manipulating data to create outcome variables, and running statistical tests to measure pre-post intervention differences.

Comprehensive cancer centers have been mandated to engage communities in their work; thus, measurement of community engagement is a priority area. Siteman Cancer Center s Program for the Elimination of Cancer Disparities (PECaD) projects seek to align with 11 Engagement Principles (EP) previously developed in the literature. Participants in a PECaD pilot project were administered a survey with questions on community engagement in order to evaluate how well the project aligns with the EPs. Internal consistency is examined using PROC CORR with the ALPHA option to calculate Cronbach s alpha for questions that relate to the same EP. This allows items that have a lack of internal consistency to be identified and to be edited or removed from the assessment. EP-specific scores are developed on quantity and quality scales. Lack of internal consistency was found for six of the 16 EP s examined items (alpha<.70). After editing the items, all EP question groups had strong internal consistency (alpha>.85). There was a significant positive correlation between quantity and quality scores (r=.918, P<.001). Average EP-specific scores ranged from 6.87 to 8.06; this suggests researchers adhered to the 11 EPs between sometime and most of the time on the quantity scale and between good and very good on the quality scale. Examining internal consistency is necessary to develop measures that accurately determine how well PECaD projects align with EPs. Using SAS^® to determine internal consistency is an integral step in the development of community engagement scores.

Researchers often rely on self-report for survey based studies. The accuracy of this self-reported data is often unknown, particularly in a medical setting that serves an under-insured patient population with varying levels of health literacy. We recruited participants from the waiting room of a St. Louis primary care safety net clinic to participate in a survey investigating the relationship between health environments and health outcomes. The survey included questions regarding personal and family history of chronic disease (diabetes, heart disease, and cancer) as well as BMI and self-perceived weight. We subsequently accessed the participant s electronic medical record (EMR) and collected physician-reported data on the same variables. We calculated concordance rates between participant answers and information gathered from EMRs using McNemar s chi-squared test. Logistic regression was then performed to determine the demographic predictors of concordance. Three hundred thirty-two patients completed surveys as part of the pilot phase of the study; 64% female, 58% African American, 4% Hispanic, 15% with less than high school level education, 76% annual household income less than $20,000, and 29% uninsured. Preliminary findings suggest an 82-94% concordance rate between self-reported and medical record data across outcomes, with the exception of family history of cancer (75%) and heart disease (42%). Our conclusion is that determining the validity of the self-reported data in the pilot phase influences whether self-reported personal and family history of disease and BMI are appropriate for use in this patient population.

The world's first wind resource assessment buoy, residing in Lake Michigan, uses a pulsing laser wind sensor to accurately measure wind speed, direction, and turbulence offshore up to wind turbine hub-height and across the blade span every second. Understanding wind behavior would be tedious and fatiguing with such large data sets. However, SAS/GRAPH^® 9.4 helps the user grasp wind characteristics over time and at different altitudes by exploring the data visually. This paper covers graphical approaches to evaluate wind speed validity, seasonal wind speed variation, and storm systems to inform engineers on the candidacy of Lake Michigan offshore wind farms.

Missing observations caused by dropouts or skipped visits present a problem in studies of longitudinal data. When the analysis is restricted to complete cases and the missing data depend on previous responses, the generalized estimating equation (GEE) approach, which is commonly used when the population-average effect is of primary interest, can lead to biased parameter estimates. The new GEE procedure in SAS/STAT^® 13.2 implements a weighted GEE method, which provides consistent parameter estimates when the dropout mechanism is correctly specified. When none of the data are missing, the method is identical to the usual GEE approach, which is available in the GENMOD procedure. This paper reviews the concepts and statistical methods. Examples illustrate how you can apply the GEE procedure to incomplete longitudinal data.

Over the last year, the SAS^® Enterprise Miner^™ development team has made numerous and wide-ranging enhancements and improvements. New utility nodes that save data, integrate better with open-source software, and register models make your routine tasks easier. The area of time series data mining has three new nodes. There are also new models for Bayesian network classifiers, generalized linear models (GLMs), support vector machines (SVMs), and more.