SAS Global Forum 2014 Proceedings

One of the first lessons that SAS^® programmers learn on the job is that numeric and character variables do not play well together, and that type mismatches are one of the more common sources of errors in their otherwise flawless SAS programs. Luckily, converting variables from one type to another in SAS (that is, casting) is not difficult, requiring only the judicious use of either the input() or put() function. There remains, however, the danger of data being lost in the conversion process. This type of error is most likely to occur in cases of character-to-numeric variable conversion, especially when the user does not fully understand the data contained in the data set. This paper will review the basics of data storage for character and numeric variables in SAS, the use of formats and informats for conversions, and how to ensure accurate type conversion of even high-precision numeric values.
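
The precision risk described above can be sketched outside SAS. The following Python fragment is an illustration only, not the paper's method: it assumes that the numeric type is an 8-byte floating-point value (as Python's float is), and the helper names converts_exactly and safe_to_numeric are hypothetical. It checks whether a string of digits survives a character-to-numeric conversion unchanged.

```python
def converts_exactly(digits: str) -> bool:
    """Return True if an integer digit string survives conversion to an 8-byte float."""
    return int(float(digits)) == int(digits)


def safe_to_numeric(digits: str):
    """Convert only when the conversion is lossless; otherwise keep the character value."""
    return float(digits) if converts_exactly(digits) else digits


# Hypothetical identifier values: a short one, 2**53, and a 17-digit one.
for value in ["1234", "9007199254740992", "12345678901234567"]:
    print(value, "lossless" if converts_exactly(value) else "would lose digits")
```

A check of this kind before conversion lets long identifiers stay as character data instead of being silently rounded.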

Most marketers today are trying to use Facebook's network of 1.1 billion plus registered users for social media marketing. Local television stations and newspapers are no exception. This paper investigates what makes a post effective. A Facebook page that is owned by a brand has fans, or people who like the page and follow the stories posted on that page. The posts on a brand page, however, do not appear on all the fans' News Feeds. This is determined by EdgeRank, a Facebook proprietary algorithm that determines what content users see and how it's prioritized on their News Feed. If marketers can understand how EdgeRank works, then they can develop more impactful posts and ultimately produce more effective social marketing using Facebook. The objective of this paper is to find the characteristics of a Facebook post that enhance the efficacy of a news outlet's page among its fans, using Facebook Power Ratio as the target variable. Power Ratio, a surrogate for EdgeRank, was developed by experts at Frank N. Magid Associates, a research-based media consulting firm. Seventeen variables that describe the characteristics of a post were extracted from more than 8,000 posts, which were encoded by 10 media experts at Magid. Numerous models were built and compared to predict Power Ratio. The most useful model is a polynomial regression whose top three factors are whether a post asks fans to like the post, the content category of a post (including news, weather, etc.), and the number of fans of the page.

Many different neuroscience researchers have explored how various parts of the brain are connected, but no one has performed association mining using brain data. In this study, we used SAS^® Enterprise Miner^™ 7.1 for association mining of brain data collected by a 14-channel EEG device. An application of the association mining technique is presented in this novel context of brain activity, and our results are linked to theories of cognitive neuroscience. The brain waves were collected while a user processed information about Facebook, the most well-known social networking site. The data was cleaned using Independent Component Analysis via an open source MATLAB package. Next, by applying the LORETA algorithm, activations at every fraction of a second were recorded. The data was codified into transactions to perform association mining. Results showing how various parts of the brain get excited while processing the information are reported. This study provides preliminary insights into how brain wave data can be analyzed by widely available data mining techniques to enhance researchers' understanding of brain activation patterns.
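
As a rough illustration of what association mining over transactions computes, the Python sketch below evaluates the support and confidence of a rule. It is not the SAS Enterprise Miner implementation used in the study. The region names and transactions are invented for the example, and treating each transaction as the set of regions active in one time slice is an assumption.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions: regions active in the same time slice.
transactions = [
    {"frontal", "parietal"},
    {"frontal"},
    {"frontal", "parietal", "occipital"},
    {"occipital"},
]

print(support(transactions, {"frontal", "parietal"}))
print(confidence(transactions, {"frontal"}, {"parietal"}))
```

Rules with high support and confidence indicate regions that tend to be active together.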

In our previous work, we often needed to perform large numbers of repetitive and data-driven post-campaign analyses to evaluate the performance of marketing campaigns in terms of customer response. These routine tasks were usually carried out manually by using Microsoft Excel, which was tedious, time-consuming, and error-prone. In order to improve the work efficiency and analysis accuracy, we managed to automate the analysis process with SAS^® programming and replace the manual Excel work. Through the use of SAS macro programs and other advanced skills, we successfully automated the complicated data-driven analyses with high efficiency and accuracy. This paper presents and illustrates the creative analytical ideas and programming skills for developing the automatic analysis process, which can be extended to apply in a variety of business intelligence and analytics fields.

This paper kicks off a project to write a comprehensive book of best practices for documenting SAS^® projects. The presenter's existing documentation styles are explained. The presenter wants to discuss and gather current best practices used by the SAS user community. The presenter shows documentation styles at three different levels of scope. The first is a style used for project documentation, the second a style for program documentation, and the third a style for variable documentation. This third style enables researchers to repeat the modeling in SAS research, in an alternative language, or conceptually.

The Affordable Care Act (ACA) contains provisions that have stimulated interest in analytics among health care providers, especially those provisions that address quality of outcomes. High Impact Technologies (HIT) has been addressing these issues since before passage of the ACA and has a Health Care Data Model recognized by Gartner and implemented at several health care providers. Recently, HIT acquired SAS^® Visual Analytics, and this paper reports our successful efforts to use SAS Visual Analytics for visually exploring Big Data for health care providers. Health care providers can suffer significant financial penalties for readmission rates above a certain threshold and other penalties related to quality of care. We have been able to use SAS Visual Analytics, coupled with our experience gained from implementing the HIT Health Care Data Model at a number of health care providers, to identify clinical measures that are significant predictors for readmission. As a result, we can help health care providers reduce the rate of 30-day readmissions.

The big questions in consumer research lead to statistical methods appropriate to them. 'What do consumers say?' is all about analyzing surveys and finding relationships between preferences and background attributes. 'What do consumers think?' is about looking at higher-level structures like preference mappings that can be derived from ratings. 'What will consumers pay?' is about conducting choice experiments to pin down the way consumers trade off among features and with prices, with the willingness to pay. 'How do you trigger purchases?' is about experiments that determine which interventions work, and how to target them to potential consumers, with uplift modeling. The SAS product JMP^® version 11 was released last fall with a new group of modeling tools to address these and other questions in consumer research. Traditionally JMP has specialized in engineering tools, but consumer research is an important part of engineering, in product planning, to make sure you produce products with the attributes consumers want.

Patient safety in a neonatal intensive care unit (NICU), as in any hospital unit, is critically dependent on appropriate staffing. We used SAS^® Simulation Studio to create a discrete-event simulation model of a specific NICU that can be used to predict the number of nurses needed per shift. This model incorporates the complexities inherent in determining staffing needs, including variations in patient acuity, referral patterns, and length of stay. To build our model, the group first estimated probability distributions for the number and type of patients admitted each day to the unit. Using both internal and published data, the team also estimated distributions for various NICU-specific patient morbidities, including type and timing of each morbidity event and its temporal effect on a patient's acuity. We then built a simulation model that samples from these input distributions and simulates the flow of individual patients through the NICU (consisting of critical-care and step-down beds) over a one-year time period. The general basis of our model represents a method that can be applied to any unit in any hospital, thereby providing clinicians and administrators with a tool to rigorously and quantitatively support staffing decisions. With additional refinements, the use of such a model over time can provide significant benefits in both patient safety and operational efficiency.

Your company's chronically overloaded SAS^® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS^® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS^® Display Manager on a server machine, runs counter to the expectation of the SAS grid: your users are now supposed to switch to SAS^® Enterprise Guide^® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play a hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, and your users get to keep their favorite SAS Display Manager.

New York City boasts a wide variety of cuisine owing to the rich tourism and the vibrant immigrant population. The quality of food and hygiene maintained at the restaurants serving different cuisines has a direct impact on the people dining in them. The objective of this paper is to build a model that predicts the grade of the restaurants in New York City. It also provides deeper statistical insights into the distribution of restaurants, cuisine categories, grades, criticality of violations, etc., and concludes with the sequence analysis performed on the complete set of violations recorded for the restaurants at different time periods over the years 2012 and 2013. The data for 2013 is used to test the model. The data set consists of 15 variables that capture restaurant location-specific and violation details. The target is an ordinal variable with three levels, A, B, and C, in descending order of the quality representation. Various SAS^® Enterprise Miner^™ models, including logistic regression, decision trees, neural networks, and ensemble models, are built and compared using validation misclassification rate. The stepwise regression model appears to be the best model, with prediction accuracy of 75.33%. The regression model is trained at step 3. The number of critical violations at 8.5 gives the root node for the split of the target levels, and the rest of the tree splits are guided by the predictor variables such as number of critical and non-critical violations, number of critical violations for the year 2011, cuisine group, and the borough.

The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS^® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.

By understanding the actual gambling behavior of an individual over the Internet, we develop markers that identify behavioral patterns, which in turn can be used to predict the level of gambling risk to which a subscriber is prone. The data set contains 4,056 subscribers. Using SAS^® Enterprise Miner^™ 12.1, a set of models is run to predict which subscriber is likely to become a high-risk Internet gambler. The data contains 114 variables such as first active date and first active product used on the website as well as the characteristics of the game such as fixed odds, poker, casino, games, etc. Other measures of a subscriber's data such as money put at stake and what odds are being bet are also included. These variables provide a comprehensive view of a subscriber's behavior while gambling over the website. The target variable is modeled as a binary variable, 0 indicating a risky gambler and 1 indicating a controlled gambler. The data is a typical example of real-world data with many missing values and hence had to be transformed, imputed, and then later considered for analysis. The model comparison algorithm of SAS Enterprise Miner 12.1 was used to determine the best model. The Stepwise Regression model performs the best among a set of 25 models, which were run using over 100 permutations of each model. The Stepwise Regression model predicts a high-risk Internet gambler at an accuracy of 69.63% with variables such as wk4frequency and wk3frequency of bets.

In business environments, a common obstacle to effective data-informed decision making occurs when key stakeholders are reluctant to embrace statistically derived predicted values or forecasts. If concerns regarding model inputs, underlying assumptions, and limitations are not addressed, decision makers might choose to trust their gut and reject the insight offered by a statistical model. This presentation explores methods for converting potential critics into partners by proactively involving them in the modeling process and by incorporating simple inputs derived from expert judgment, focus groups, market research, or other directional qualitative sources. Techniques include biasing historical data, what-if scenario testing, and Monte Carlo simulations.

PROC TABULATE is a powerful tool for creating tabular summary reports. Its advantages over PROC REPORT are that it requires less code, allows for more convenient table construction, and uses syntax that makes it easier to modify a table's structure. However, its inability to compute the sum, difference, product, and ratio of column sums has hindered its use in many circumstances. This paper illustrates and discusses some creative approaches and methods for overcoming these limitations, enabling users to produce needed reports and still enjoy the simplicity and convenience of PROC TABULATE. These methods and skills can have prominent applications in a variety of business intelligence and analytics fields.

The use of predictive models in healthcare has steadily increased over the decades. Statistical models now are assumed to be a necessary component in population health management. This session will review practical considerations in the choice of models to develop, criteria for assessing the utility of the models for production, and challenges with incorporating the models into business process flows. Specific examples of models will be provided based upon work by the Health Economics team at Blue Cross Blue Shield of North Carolina.

Over the years, there has been a growing concern about consumption of tobacco among youth. But no concrete studies have been done to find what exactly leads children to start consuming tobacco. This study is an attempt to figure out the potential reasons for the same. Through our analysis, we have also tried to build a model to predict whether a child would smoke next year or not. This study is based on the 2011 National Youth Tobacco Survey data of 18,867 observations. In order to prepare data for insightful analysis, imputation operations were performed on the data using tree-based imputation methods. From a pool of 197 variables, 48 key variables were selected using variable selection methods, partial least squares, and decision tree models. Logistic Regression and Decision Tree models were built to predict whether a child would smoke in the next year or not. Comparing the models using Misclassification rate as the selection criterion, we found that the Stepwise Logistic Regression Model outperformed other models with a Validation Misclassification of 0.028497, 47.19% Sensitivity, and 95.80% Specificity. Factors such as company of friends, cigarette brand ads, accessibility to tobacco products, and passive smoking turned out to be the most important predictors in determining a child smoker. After this study, we could outline some important findings, such as that the odds of a child taking up smoking are 2.17 times higher when his close friends are also smoking.
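
For readers unfamiliar with the three evaluation measures named above, the following Python sketch shows how misclassification rate, sensitivity, and specificity are derived from a confusion matrix. The counts are invented for illustration and are not the study's data. The function name classification_metrics is hypothetical.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute standard binary-classifier measures from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "misclassification": (fp + fn) / total,  # share of all cases predicted wrongly
        "sensitivity": tp / (tp + fn),           # share of actual positives detected
        "specificity": tn / (tn + fp),           # share of actual negatives detected
    }


# Hypothetical counts: 50 actual smokers, 100 actual non-smokers.
metrics = classification_metrics(tp=40, fn=10, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the three measures use different denominators, a model can have a low overall misclassification rate and still detect only a modest share of the rarer class.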

In healthcare, we often express our analytics results as being "adjusted." For example, you might have read a study in which the authors reported the data as age-adjusted or risk-adjusted. The concept of adjustment is widely used in program evaluation, comparing quality indicators across providers and systems, forecasting incidence rates, and in cost-effectiveness research. In order to make reasonable comparisons across time, place, or population, we need to account for small sample sizes and case-mix variation; in other words, we need to level the playing field and account for differences in health status and for uniqueness in a given population. If you are new to healthcare, what it really means to adjust the data in order to make comparisons might not be obvious. In this paper, we explore the methods by which we control for potentially confounding variables in our data. We do so through a series of examples from the healthcare literature in both primary care and health insurance. In this survey of methods, we discuss the concepts of rates and how they can be adjusted for demographic strata (such as age, gender, and race), as well as health risk factors such as case mix.
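
One common way to adjust a rate for demographic strata is direct standardization, in which each stratum-specific rate is weighted by that stratum's share of a reference population. The Python sketch below illustrates the arithmetic. The abstract does not state which adjustment methods the paper covers, so the choice of direct standardization is an assumption, and all counts are invented.

```python
def crude_rate(events, population):
    """Overall rate, ignoring the stratum mix."""
    return sum(events) / sum(population)


def directly_adjusted_rate(events, population, standard_population):
    """Weight each stratum-specific rate by the stratum's share of a standard population."""
    total_standard = sum(standard_population)
    return sum(
        (e / p) * (s / total_standard)
        for e, p, s in zip(events, population, standard_population)
    )


# Hypothetical two-stratum example (younger, older).
events = [10, 30]               # events observed in each stratum
population = [1000, 1000]       # people at risk in each stratum
standard = [3000, 1000]         # reference population with a younger mix

print(crude_rate(events, population))
print(directly_adjusted_rate(events, population, standard))
```

In this invented example the adjusted rate is lower than the crude rate because the reference population gives more weight to the lower-risk younger stratum.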

Are you wondering what is causing your valuable machine asset to fail? What could those drivers be, and what is the likelihood of failure? Do you want to be proactive rather than reactive? Answers to these questions have arrived with SAS^® Predictive Asset Maintenance. The solution provides an analytical framework to reduce the amount of unscheduled downtime and optimize maintenance cycles and costs. An all new (R&D-based) version of this offering is now available. Key aspects of this paper include a discussion of key business drivers for and capabilities of SAS Predictive Asset Maintenance, and a detailed analysis of the solution, including: data model; explorations; data selections; Path I: analysis workbench maintenance analysis and stability monitoring; Path II: analysis workbench JMP^®, SAS^® Enterprise Guide^®, and SAS^® Enterprise Miner^™; analytical case development using SAS Enterprise Miner, SAS^® Model Manager, and SAS^® Data Integration Studio; and the SAS Predictive Asset Maintenance Portlet for reports. A realistic business example in the oil and gas industry is used.

One in every four people dies of heart disease in the United States, and stress is an important factor which contributes towards a cardiac event. As the condition of the heart gradually worsens with age, the factors that lead to a myocardial infarction when the patients are subjected to stress are analyzed. The data used for this project was obtained from a survey conducted through the Department of Biostatistics at Vanderbilt University. The objective of this poster is to predict the chance of survival of a patient after a cardiac event. Then by using decision trees, neural networks, regression models, bootstrap decision trees, and ensemble models, we predict the target which is modeled as a binary variable, indicating whether a person is likely to survive or die. The top 15 models, each with an accuracy of over 70%, were considered. The model will give important survival characteristics of a patient which include his history with diabetes, smoking, hypertension, and angioplasty.

With the smartphone and mobile app market developing so rapidly, expectations about the effectiveness of mobile applications are high. Marketers and app developers need to analyze the huge amount of data available well before the app release, not only to better market the app, but also to avoid costly mistakes. The purpose of this poster is to build models to predict the success rate of an app to be released in a particular category. Data has been collected for 540 Android apps under the Top free newly released apps category from https://play.google.com/store. The SAS^® Enterprise Miner^™ Text Mining node and SAS^® Sentiment Analysis Studio are used to parse and tokenize the collected customer reviews and also to calculate the average customer sentiment score for each app. Linear regression, neural, and auto-neural network models have been built to predict the rank of an app by considering average rating, number of installations, total number of reviews, number of 1-5 star ratings, app size, category, content rating, and average customer sentiment score as independent variables. A linear regression model with the least Average Squared Error is selected as the best model, and number of installations and app maturity content are considered significant model variables. App category, user reviews, and average customer sentiment score are also considered important variables in deciding the success of an app. The poster summarizes the app success trends across various factors and also introduces a new SAS^® macro %getappdata, which we have developed for web crawling and text parsing.

Applying models to analyze sports data has always been done by teams across the globe. The film Moneyball has generated much hype about how a sports team can use data and statistics to build a winning team. The objective of this poster is to use the model comparison algorithm of SAS^® Enterprise Miner^™ to pick the best model that can predict the outcome of a soccer game. It is hence important to determine which factors influence the results of a game. The data set used contains input variables about a team's offensive and defensive abilities, and the outcome of a game is modeled as a target variable. Using SAS Enterprise Miner, multinomial regression, neural networks, decision trees, ensemble models, and gradient boosting models are built. Over 100 different versions of these models are run. The data contains statistics from the 2012-13 English Premier League season. The competition has 20 teams playing each other in a home and away format. The season has a total of 380 games; the first 283 games are used to predict the outcome of the last 97 games. The target variable is treated as both a nominal and an ordinal variable with 3 levels for home win, away win, and tie. The gradient boosting model is the winning model, which seems to predict games with 65% accuracy and identifies factors such as goals scored and ball possession as more important compared to fouls committed or red cards received.
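
The game counts quoted above follow from the home and away format, and the train/test division is chronological. The short Python sketch below reproduces that arithmetic. The function names are hypothetical, and the game identifiers are placeholders, not the actual fixtures.

```python
def double_round_robin_games(teams: int) -> int:
    """Each team hosts every other team once: teams * (teams - 1) games."""
    return teams * (teams - 1)


def chronological_split(games, n_train: int):
    """Use the earliest n_train games for training and the remainder for testing."""
    return games[:n_train], games[n_train:]


season = list(range(double_round_robin_games(20)))  # placeholder game ids in date order
train, test = chronological_split(season, 283)
print(len(season), len(train), len(test))
```

Splitting by date instead of at random keeps later games out of the training data, which matches the abstract's description of using the first 283 games to predict the last 97.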