SAS Global Forum 2017 Proceedings

Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a big data conference, data scientists from several companies were invited to participate in tackling this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach using SAS^® analytical methods was superior to other big data approaches in investigating these trends.

Read the paper (PDF)

This poster shows how to predict a past-due amount using traditional and machine learning techniques: logistic analysis, k-nearest neighbors, and random forest. The data set that was analyzed is about real-world commerce. It contains 305 categories of financial information from more than 11,787,287 unique businesses, from 2006 to 2014. The big challenge is how to handle big and noisy real-world data sets. The first step of any model-building exercise is to define the outcome. A common prediction method in the financial services industry is to use binary outcomes, such as Good and Bad. For our research problem, we reduced past-due amounts into two cases, Good and Bad. Next, we built a two-stage model using the logistic regression method; that is, the first stage predicts the likelihood of a Bad outcome, and the second predicts a past-due amount, given a Bad outcome. Logistic analysis, a traditional statistical technique, is commonly used for prediction and classification in the financial services industry. However, for analyzing big, noisy, or complex data sets, machine learning techniques are typically preferred to detect hard-to-discern patterns. To compare the two kinds of techniques, we use predictive accuracy, ROC index, sensitivity, and specificity as criteria.
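
Three of the comparison criteria named above (predictive accuracy, sensitivity, and specificity) can be computed directly from the Good/Bad predictions. The following is a minimal Python sketch, not the poster's SAS code; the function name and the toy labels are hypothetical, and "Bad" is assumed to be the positive class.

```python
def classification_metrics(actual, predicted, positive="Bad"):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted Bad, actually Bad
            else:
                fp += 1  # predicted Bad, actually Good
        else:
            if a == positive:
                fn += 1  # predicted Good, actually Bad
            else:
                tn += 1  # predicted Good, actually Good
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical labels, for illustration only.
actual = ["Bad", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"]
predicted = ["Bad", "Good", "Good", "Good", "Bad", "Bad", "Good", "Good"]
metrics = classification_metrics(actual, predicted)
```

Sensitivity is the share of truly Bad accounts that the model flags, and specificity is the share of truly Good accounts that it leaves alone. The ROC index is not shown because it requires predicted probabilities rather than hard labels.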

A propensity score is the probability that an individual will be assigned to a condition or group, given a set of baseline covariates when the assignment is made. For example, the type of drug treatment given to a patient in a real-world setting might be non-randomly based on the patient's age, gender, geographic location, and socioeconomic status when the drug is prescribed. Propensity scores are used in many different types of observational studies to reduce selection bias. Subjects assigned to different groups are matched based on these propensity score probabilities, rather than matched based on the values of individual covariates. Although the underlying statistical theory behind the use of propensity scores is complex, implementing propensity score matching with SAS^® is relatively straightforward. An output data set of each subject's propensity score can be generated with SAS using PROC LOGISTIC. And, a generalized SAS macro can generate optimized N:1 propensity score matching of subjects assigned to different groups using the radius method. Matching can be optimized either for the number of matches within the maximum allowable radius or by the closeness of the matches within the radius. This presentation provides the general PROC LOGISTIC syntax to generate propensity scores, provides an overview of different propensity score matching techniques, and discusses how to use the SAS macro for optimized propensity score matching using the radius method.
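
The idea of N:1 matching within a radius can be illustrated with a short Python sketch. This is a simple greedy matcher without replacement, not the optimized SAS macro that the presentation describes. The function name, the subject IDs, and the scores are hypothetical, and the propensity scores are assumed to have been estimated already (for example, by a logistic model).

```python
def radius_match(treated, controls, radius=0.05, max_matches=2):
    """Greedy N:1 matching without replacement.

    treated, controls: dicts mapping subject id -> propensity score.
    Returns a dict mapping each treated id to a list of matched control
    ids, closest first, all within `radius` of the treated score.
    """
    available = dict(controls)  # controls not yet used
    matches = {}
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        # Candidate controls inside the radius, ordered by distance.
        candidates = sorted(
            (abs(c_score - t_score), c_id)
            for c_id, c_score in available.items()
            if abs(c_score - t_score) <= radius
        )
        chosen = [c_id for _, c_id in candidates[:max_matches]]
        for c_id in chosen:
            del available[c_id]  # each control is used at most once
        matches[t_id] = chosen
    return matches


# Hypothetical propensity scores, for illustration only.
treated = {"T1": 0.30, "T2": 0.62}
controls = {"C1": 0.28, "C2": 0.33, "C3": 0.37, "C4": 0.60, "C5": 0.90}
matches = radius_match(treated, controls, radius=0.05, max_matches=2)
```

A greedy pass like this depends on the order in which treated subjects are processed. An optimized matcher would instead choose the assignment that maximizes the number of matches or minimizes the total distance within the radius.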

Read the paper (PDF)

As the open-source community has been taking the technology world by storm, especially in the big data space, large corporations such as SAS, IBM, and Oracle have been working to embrace this new, quickly evolving ecosystem to continue to foster innovation and to remain competitive. For example, SAS, IBM, and others have aligned with the Open Data Platform initiative and are continuing to build out Hadoop and Spark solutions. And, Oracle has partnered with Cloudera to create the Big Data Appliance. This movement challenges companies that are consuming these products to select the right products and support partners. The hybrid approach using all tools available seems to be the methodology chosen by most successful companies. West Corporation, an Omaha-based provider of technology-enabled communication solutions, is no exception. West has been working with SAS for 10 years in the ETL, BI, and advanced analytics space, and West began its Hadoop journey a year ago. This paper focuses on how West data teams use both technologies to improve customer experience in the interactive voice response (IVR) system by storing massive semi-structured call logs in HDFS and by building models that predict a caller's intent to route the caller more efficiently and to reduce customer effort using familiar SAS code and the very user-friendly SAS^® Enterprise Miner.

Read the paper (PDF)

A man with one watch always knows what time it is...but a man with two watches is never sure. Contrary to this adage, load forecasters at electric utilities would gladly wear an armful of watches. With only one model to choose from, it is certain that some forecasts will be wrong. But with multiple models, forecasters can have confidence about periods when the forecasts agree and can focus their attention on periods when the predictions diverge. Having a second opinion is preferred, and that's one of the six classic rules for forecasters as per Dr. Tao Hong of the University of North Carolina at Charlotte. Dr. Hong is the premier thought leader and practitioner in the field of energy forecasting. This presentation discusses Dr. Hong's six rules, how they relate to the increasingly complex problem of forecasting electricity consumption, and the role that predictive analytics plays.

Read the paper (PDF)

Disseminating data to potential collaborators can be essential in the development of models, algorithms, and innovative research opportunities. However, it is often time-consuming to get approval to access sensitive data such as health data. An alternative to sharing the real data is to use synthetic data, which has similar properties to the original data but does not disclose sensitive information. The collaborators can use the synthetic data to make preliminary models or to work out bugs in their code while waiting to get approval to access the original data. A data owner can also use the synthetic data to crowdsource solutions from the public through competitions like Kaggle and then test those solutions on the original data. This paper implements a method that generates fully synthetic data in a way that matches the statistical moments of the true data up to a specified moment order as a SAS^® macro. Variables in the synthetic data set are of the same data type as the true data (for example, integer, binary, continuous). The implementation uses the linear programming solver within a column generation algorithm and the mixed integer linear programming solver from the OPTMODEL procedure in SAS/OR^® software. The COFOR statement in PROC OPTMODEL automatically parallelizes a portion of the algorithm. This paper demonstrates the method by using the Sashelp.Heart data set to generate fully synthetic data copies.

Read the paper (PDF)

The hospital Medicare readmission rate has become a key indicator for measuring the quality of health care in the US. This rate is currently used by major health-care stakeholders including the Centers for Medicare and Medicaid Services (CMS), the Agency for Healthcare Research and Quality (AHRQ), and the National Committee for Quality Assurance (NCQA) (Fan and Sarfarazi, 2014). Although many papers have been written about how to calculate readmissions, this paper provides updated code that includes ICD-10 (International Classification of Diseases) code and offers a novel and comprehensive approach using SAS^® DATA step options and PROC SQL. We discuss: 1) de-identifying patient data; 2) calculating sequential admissions; 3) subsetting criteria required to report for CMS 30-day readmissions. In addition, this paper demonstrates: 1) using the output delivery system (ODS) to create a labeled and de-identified data set; 2) macro variables to examine data quality; 3) summary statistics for further reporting and analysis.
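
The step of calculating sequential admissions and flagging 30-day readmissions can be sketched in a few lines of Python. This is a simplified illustration, not the paper's DATA step and PROC SQL code: the function name and the records are hypothetical, and the CMS subsetting criteria (such as exclusions for planned readmissions or transfers) are deliberately left out.

```python
from datetime import date


def flag_30_day_readmissions(stays, window_days=30):
    """Number each patient's stays and flag readmissions.

    stays: list of (patient_id, admit_date, discharge_date) tuples.
    A stay is flagged when it begins 0 to `window_days` days after the
    same patient's previous discharge.
    """
    out = []
    prev_patient = None
    prev_discharge = None
    seq = 0
    for pid, admit, discharge in sorted(stays, key=lambda s: (s[0], s[1])):
        if pid != prev_patient:
            seq = 1      # first stay for this patient
            gap = None   # no previous discharge to compare with
        else:
            seq += 1
            gap = (admit - prev_discharge).days
        out.append({
            "patient": pid,
            "admit": admit,
            "seq": seq,
            "days_since_discharge": gap,
            "readmit_30": gap is not None and 0 <= gap <= window_days,
        })
        prev_patient, prev_discharge = pid, discharge
    return out


# Hypothetical, already de-identified stays, for illustration only.
stays = [
    ("A", date(2016, 1, 5), date(2016, 1, 10)),
    ("A", date(2016, 2, 1), date(2016, 2, 4)),
    ("A", date(2016, 6, 1), date(2016, 6, 3)),
    ("B", date(2016, 3, 1), date(2016, 3, 2)),
]
result = flag_30_day_readmissions(stays)
flags = [row["readmit_30"] for row in result]
sequence = [row["seq"] for row in result]
```

Carrying the previous discharge date forward within each patient plays the same role that a retained or lagged variable plays in a sorted DATA step.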

Read the paper (PDF) | View the e-poster or slides (PDF)

This presentation gives you the tools to begin using propensity scoring in SAS^® to answer research questions involving observational data. It is for both those attendees who have never used propensity scores and those who have a basic understanding of propensity scores but are unsure how to begin using them in SAS. It provides a brief introduction to the concept of propensity scores, and then turns its attention to giving you the tips and resources you need to get started. The presentation walks you through how the code in the book 'Analysis of Observational Health Care Data Using SAS^®', which was published by SAS Institute, is used to answer how a particular health care intervention impacted a health care outcome. It details how propensity scores are created and how propensity score matching is used to balance covariates between treated and untreated observations. With this case study in hand, you will feel confident that you have the tools necessary to begin answering some of your own research questions using propensity scores.

Read the paper (PDF)

Specifying the functional form of a covariate is a fundamental part of developing a regression model. The choice to include a variable as continuous, categorical, or as a spline can be determined by model fit. This paper offers an efficient and user-friendly SAS^® macro (%SPECI) to help analysts determine how best to specify the appropriate functional form of a covariate in a linear, logistic, or survival analysis model. For each model, our macro provides a graphical and statistical single-page comparison report of the covariate as a continuous, categorical, and restricted cubic spline variable so that users can easily compare and contrast results. The report includes the residual plot and distribution of the covariate. You can also include other covariates in the model for multivariable adjustment. The output displays the likelihood ratio statistic, the Akaike Information Criterion (AIC), as well as other model-specific statistics. The %SPECI macro is demonstrated using an example data set. The macro uses PROC REG, PROC LOGISTIC, PROC PHREG, PROC REPORT, PROC SGPLOT, and other procedures in SAS^® 9.4.

Read the paper (PDF) | Download the data file (ZIP)

Visualization is a critical part of turning data into knowledge. A customized graph is essential to make data visualization meaningful, powerful, and interpretable. Furthermore, customizing grouped data into a desired layout with specific requirements such as clusters, colors, symbols, and patterns for each group can be challenging. This paper provides a start-from-scratch, step-by-step solution to create a customized graph for grouped data using SAS^® Graph Template Language (GTL). By analyzing the data and target graph with the available tools and options that GTL provides, this paper demonstrates that GTL is a powerful and flexible tool to create a customized, complex graph.

Read the paper (PDF)

Hierarchical models, also known as random-effects models, are widely used for data that consist of collections of units and are hierarchically structured. Bayesian methods offer flexibility in modeling assumptions that enable you to develop models that capture the complex nature of real-world data. These flexible modeling techniques include choice of likelihood functions or prior distributions, regression structure, multiple levels of observational units, and so on. This paper shows how you can fit these complex, multilevel hierarchical models by using the MCMC procedure in SAS/STAT^® software. PROC MCMC easily handles models that go beyond the single-level random-effects model, which typically assumes the normal distribution for the random effects and estimates regression coefficients. This paper shows how you can use PROC MCMC to fit hierarchical models that have varying degrees of complexity, from frequently encountered conditional independent models to more involved cases of modeling intricate interdependence. Examples include multilevel models for single and multiple outcomes, nested and non-nested models, autoregressive models, and Cox regression models with frailty. Also discussed are repeated measurement models, latent class models, spatial models, and models with nonnormal random-effects prior distributions.

Read the paper (PDF)

Technology plays an integral role in every aspect of daily life. As a result, educators should leverage technology-based learning to ensure that students are provided with authentic, engaging, and meaningful learning experiences (Pringle, Dawson, and Ritzhaupt, 2015). The significance and value of computer science understanding continue to increase. A major resource that can be credited with spreading support for computer science is the site Code.org. Its mission is to enable every student in every school to have the opportunity to learn computer science (https://code.org/about). Two years ago, our mentor partnered with Code.org to conduct workshops within the Charlotte, NC area to educate teachers on how to teach computer science activities and concepts in their classrooms. We had the opportunity to assist during the workshops to provide student perspectives and opinions. As we looked back on the workshops, we wondered: How are the teachers who attended the workshops implementing the concepts they were taught? After each workshop, a survey was distributed to the attendees to receive workshop feedback and to follow up. We collected the data from the surveys sent to participants and analyzed it using SAS^® University Edition. The survey results indicated that the workshops were beneficial and that the educators had implemented a concept that they learned. We believe that computer science activity implementations will assist students across the curriculum.

View the e-poster or slides (PDF)

To determine whether there is a correlation between the repetitiveness of a song's lyrics and its popularity, the top 10 songs from the Billboard Hot 100 songs chart from 2006 to 2015 were collected. Song lyrics were assessed to determine the count of the top 10 words used. Word counts were used to predict the number of weeks the song was on the chart. The prediction model was analyzed to determine the quality of the model and whether word count was a significant predictor of a song's popularity. To investigate whether song lyrics are becoming more simplistic over time, several tests were performed to see whether the average word count has been changing over the years. All analysis was completed in SAS^® using various procedures.
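
Counting the most frequent words in a lyric, the basic input to this kind of analysis, can be sketched in Python as follows. The poster's analysis was done in SAS; the function name, the tokenization rule (lowercase letters and apostrophes), and the sample lyric here are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_word_counts(lyrics, n=10):
    """Return the n most frequent words in a lyric as (word, count) pairs."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(words).most_common(n)


# Made-up lyric, for illustration only.
lyric = "La la la, hey hey, la la love"
top_two = top_word_counts(lyric, n=2)
```

The counts of the top words can then serve as predictors of the number of weeks on the chart in a regression model.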

View the e-poster or slides (PDF)

Significant growth of the Internet has created an enormous volume of unstructured text data. In recent years, the amount of this type of data that is available for analysis has exploded. While the amount of textual data is increasing rapidly, an ability to obtain key pieces of information from such data in a fast, flexible, and efficient way is still posing challenges. This paper introduces SAS^® Contextual Analysis In-Database Scoring for Hadoop, which integrates SAS^® Contextual Analysis with the SAS^® Embedded Process. SAS^® Contextual Analysis enables users to customize their text analytics models in order to realize the value of their text-based data. The SAS^® Embedded Process enables users to take advantage of SAS^® Scoring Accelerator for Hadoop to run scoring models. By using these key SAS^® technologies, the overall experience of analyzing unstructured text data can be greatly improved. The paper also provides guidelines and examples on how to publish and run category, concept, and sentiment models for text analytics in Hadoop.

Read the paper (PDF)

Data is generated every second. The term big data refers to the volume, variety, and velocity of data that is being produced. Big data is now woven into every sector, and its size and complexity have left organizations struggling to create, manipulate, and manage it. This research identifies and reviews a range of big data techniques within SAS^®, highlighting the fundamental opportunities that SAS provides for overcoming a variety of business challenges. Insurance is a data-dependent industry. This research focuses on understanding what SAS can offer to insurance companies and how it could interact with existing customer databases and online, user-generated content. A range of data sources have been identified for this purpose. The research demonstrates how models can be built based on existing relationships found in past data and then used to identify prospective customers. Principal component analysis, cluster analysis, and neural networks are all considered. You will learn how these techniques can be used to help capture valuable insight, create firm relationships, and support customer feedback. Whether it is prescriptive, predictive, descriptive, or diagnostic analytics, harnessing big data can add background and depth, providing insurance companies with a more complete story. You will see that you can reduce the complexity and dimensionality of data, provide actionable intelligence, and essentially make more informed business decisions.

Read the paper (PDF)

Machine learning is in high demand. Whether you are a citizen data scientist who wants to work interactively or you are a hands-on data scientist who wants to code, you have access to the latest analytic techniques with SAS^® Visual Data Mining and Machine Learning on SAS^® Viya. This offering surfaces in-memory machine learning techniques such as gradient boosting, factorization machines, neural networks, and much more through its interactive visual interface, SAS^® Studio tasks, procedures, and a Python client. Learn about this multi-faceted new product and see it in action.

Read the paper (PDF)

This presentation illustrates new technology that has been added to the SAS^® Customer Intelligence 360 analytics testing suite for digital campaign marketing. SAS Customer Intelligence 360 now has a multivariate testing tool (MVT). In digital marketing, MVT has become an increasingly popular process by which multiple components of a campaign can be tested in a live environment with the goal of finding an optimal mix, which drives a defined response metric. In simple terms, MVT is equivalent to running numerous A/B tests simultaneously. In theory, MVT can test the effectiveness of limitless combinations of factors. The number of factor-level combinations determines the test duration and the number of samples required to make statistically sound predictions for all permutations of a full factorial design. SAS^® applies experimental design analytics, driven by an interactive process that considers constraints and control cell definition, to guide the user toward an optimal reduced design of the test that can still adequately predict all factor-level combinations, given available resources. The results of the analysis enable the marketer to compare the responses for observed factor-level combinations with the predicted responses for untested combinations.
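
To see why the number of factor-level combinations drives the size of a test, the following Python sketch enumerates a full factorial design. The campaign factors and their levels are hypothetical and are not taken from the presentation.

```python
from itertools import product
from math import prod

# Hypothetical campaign factors and levels, for illustration only.
factors = {
    "headline": ["A", "B", "C"],
    "image": ["photo", "illustration"],
    "button_color": ["blue", "green"],
    "offer": ["10% off", "free shipping", "none"],
}

# The full factorial size is the product of the level counts.
n_combinations = prod(len(levels) for levels in factors.values())

# Every combination of one level per factor.
full_factorial = list(product(*factors.values()))
```

Adding one more two-level factor would double the number of cells, which is why a reduced design that tests only some cells and predicts the rest becomes attractive as factors accumulate.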

Read the paper (PDF)

Data from an extensive survey conducted by the National Center for Education Statistics (NCES) is used for predicting qualified secondary school teachers across public schools in the U.S. The sample data includes socioeconomic data at the county level, which is used as a predictor for hiring a qualified teacher. The resultant model is used to score other regions and is presented on a heat map of the U.S. The survey family of procedures that SAS^® offers, such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC, is used in the analyses since the data involves replicate weights. In looking at residuals from a logistic regression, since all the outcomes (observed values) are either 0 or 1, the residuals do not necessarily follow the normal distribution that is so often assumed in residual analysis. Furthermore, in dealing with survey data, the weights of the observations must be accounted for, as these affect the variance of the observations. To adjust for this, rather than looking at the difference between the observed and predicted values, the difference between the expected and actual counts is calculated by using the weights on each observation, and the predicted probability from the logistic model for the observation. Three types of residuals are analyzed: Pearson, Deviance, and Studentized residuals. The purpose is to identify which type of residual best satisfies the assumption of normality when investigating residuals from a logistic regression.
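
For reference, the textbook unweighted forms of two of these residuals can be written out in Python. This sketch does not reproduce the paper's weight-adjusted calculation. It assumes a binary outcome y (0 or 1) and a predicted probability p strictly between 0 and 1, and the function names are hypothetical.

```python
import math


def pearson_residual(y, p):
    """(observed - predicted) scaled by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1 - p))


def deviance_residual(y, p):
    """Signed square root of the observation's deviance contribution."""
    sign = 1 if y == 1 else -1
    likelihood = p if y == 1 else 1 - p  # probability of the observed outcome
    return sign * math.sqrt(-2 * math.log(likelihood))


# A predicted probability of 0.8 for an observed 1 and an observed 0.
r_pearson_hit = pearson_residual(1, 0.8)
r_pearson_miss = pearson_residual(0, 0.8)
r_deviance_hit = deviance_residual(1, 0.8)
r_deviance_miss = deviance_residual(0, 0.8)
```

Because y takes only two values, each residual can take only two values for a given p, which is one reason normality is not guaranteed. Studentized residuals additionally require each observation's leverage and are not shown.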

Read the paper (PDF)

Chronic obstructive pulmonary disease (COPD) is the third leading cause of death in the US. An estimated 24 million people suffer from COPD, and the medical cost associated with it stands at a whopping $36 billion. Besides the emotional and physical impact, a patient with COPD bears a severe economic burden to pay for the medication. Hospitals are subjected to heavy penalties for high re-admissions. Identifying the best medicine combinations to treat COPD benefits patients and hospitals. This paper deals with analyzing the effectiveness of three popular drugs prescribed for COPD patients in terms of mortality rates and re-admission within 30 days of discharge. The data from Cerner Health Facts consists of over 1 million anonymized patient records collected in a real-world health environment. Base SAS^® is used to perform statistical analysis and data processing; re-admission of patients is analyzed using a lag function. The preliminary results show a re-admission rate of 5.96% and a mortality rate of 3.3% among all patients. The odds ratios computed using logistic regression indicate that the odds of mortality are 2.4 times higher for patients using Symbicort than for patients using Spiriva and Advair. This paper also uses text mining of social media, drug portals, and blogs to gauge the sentiments of patients using these drugs. The results obtained through sentiment analysis are then compared with the statistical analysis to determine the effectiveness of drugs prescribed to the COPD patients.

Read the paper (PDF)

Sovereign risk rating and country risk rating are conceptually distinct in that the former captures the risk of a country defaulting on its commercial debt obligations using economic variables, while the latter covers the downside of a country's business environment, including political and social variables alongside economic variables. Through this paper we would like to understand the differences between these risk approaches in assessing a country's creditworthiness by statistically examining the predictive power of political and social variables in determining country risk. To do this, we wish to build two models, the first with economic variables as regressors (sovereign risk model) and the second with economic, political, and social variables as regressors (country risk model), to compare the predictive power of regressors and model performance metrics between the two models. Each will be an OLS regression model with the country risk rating obtained from S&P as the target variable. With a general assumption that economic variables are driven by political processes and social factors, we would like to see if the second model has better predictive power. The economic, political, and social indicators that will be used as independent variables in the model will be obtained from World Bank Open Data, and the target variable (country risk rating) will be obtained from S&P country risk ratings data.

View the e-poster or slides (PDF)

Customer churn is an important area of concern that affects not just the growth of your company, but also the profit. Conventional survival analysis can provide a customer's likelihood to churn in the near term, but it does not take into account the lifetime value of the higher-risk churn customers you are trying to retain. Not all customers are equally important to your company. Recency, frequency, and monetary (RFM) analysis can help companies identify customers that are most important and most likely to respond to a retention offer. In this paper, we use the IML and PHREG procedures to combine the RFM analysis and survival analysis in order to determine the optimal number of higher-risk and higher-value customers to retain.

Read the paper (PDF)

The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable model that applies text analytics techniques to the publicly available CFPB data. Specifically, we use SAS^® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS^® Visual Analytics and SAS^® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.

Read the paper (PDF)

The use of telematics data within the insurance industry is becoming prevalent as insurers use this data to give discounts, categorize drivers, and provide feedback to improve customers' driving. The data captured through in-vehicle or mobile devices includes acceleration, braking, speed, mileage, and many other events. Data elements are analyzed to determine high-risk events such as rapid acceleration, hard braking, quick turning, and so on. The time between these successive high-risk events is a function of the mileage driven and time in the telematics program. Our discussion highlights how we treated these high-risk events as recurrent events and analyzed them using the RELIABILITY procedure within SAS/QC^® software. The RELIABILITY procedure is used to determine a nonparametric mean cumulative function (MCF) of high-risk events. We illustrate the use of the MCF for identifying and categorizing average driver behavior versus individual driver behavior. We also discuss the use of the MCF to evaluate how a loss event or driver feedback can affect future driving behavior.
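
The nonparametric MCF estimate itself is simple to state: at each event time, add the number of events divided by the number of drivers still under observation. The Python sketch below illustrates that calculation only and is not the RELIABILITY procedure, which also provides confidence limits and plots. The function name, the driver IDs, the mileages, and the convention that a driver counts as at risk through the end of observation are assumptions.

```python
def mean_cumulative_function(event_times, end_times):
    """Nonparametric MCF estimate for recurrent events.

    event_times: dict mapping driver id -> list of event times
                 (for example, miles driven at each high-risk event).
    end_times:   dict mapping driver id -> last observed time.
    Returns a list of (time, mcf) pairs at each distinct event time.
    """
    all_events = sorted(t for times in event_times.values() for t in times)
    mcf = 0.0
    curve = []
    for t in sorted(set(all_events)):
        n_events = all_events.count(t)
        # Drivers still under observation at time t.
        at_risk = sum(1 for end in end_times.values() if end >= t)
        mcf += n_events / at_risk
        curve.append((t, mcf))
    return curve


# Hypothetical drivers, with time measured in miles.
event_times = {"D1": [100, 300], "D2": [200], "D3": [400]}
end_times = {"D1": 500, "D2": 250, "D3": 500}
curve = mean_cumulative_function(event_times, end_times)
```

The final value of the curve estimates the average number of high-risk events per driver by 400 miles. An individual driver's own cumulative count can be compared against this curve to categorize that driver's behavior.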

Read the paper (PDF)

Machine learning predictive modeling algorithms are governed by hyperparameters that have no clear defaults agreeable to a wide range of applications. A few examples of quantities that must be prescribed for these algorithms are the depth of a decision tree, number of trees in a random forest, number of hidden layers and neurons in each layer in a neural network, and degree of regularization to prevent overfitting. Not only do ideal settings for the hyperparameters dictate the performance of the training process, but more importantly they govern the quality of the resulting predictive models. Recent efforts to move from a manual or random adjustment of these parameters have included rough grid search and intelligent numerical optimization strategies. This paper presents an automatic tuning implementation that uses SAS/OR^® local search optimization for tuning hyperparameters of modeling algorithms in SAS^® Visual Data Mining and Machine Learning. The AUTOTUNE statement in the NNET, TREESPLIT, FOREST, and GRADBOOST procedures defines tunable parameters, default ranges, user overrides, and validation schemes to avoid overfitting. Given the inherent expense of training numerous candidate models, the paper addresses efficient distributed and parallel paradigms for training and tuning in SAS^® Viya. It also presents sample tuning results that demonstrate improved model accuracy over default configurations and offers recommendations for efficient and effective model tuning.

Read the paper (PDF)

The singular spectrum analysis (SSA) method of time series analysis applies nonparametric techniques to decompose time series into principal components. SSA is particularly valuable for long time series, in which patterns (such as trends and cycles) are difficult to visualize and analyze. An important step in SSA is determining the spectral groupings; this step can be automated by analyzing the w-correlations of the spectral components. This paper provides an introduction to singular spectrum analysis and demonstrates how to use SAS/ETS^® software to perform it. To illustrate, monthly data on temperatures in the United States over the last century are analyzed to discover significant patterns.

Read the paper (PDF)

Data that are gathered in modern data collection processes are often large and contain geographic information that enables you to examine how spatial proximity affects the outcome of interest. For example, in real estate economics, the price of a housing unit is likely to depend on the prices of housing units in the same neighborhood or nearby neighborhoods, either because of their locations or because of some unobserved characteristics that these neighborhoods share. Understanding spatial relationships and being able to represent them in a compact form are vital to extracting value from big data. This paper describes how to glean analytical insights from big data and discover their big value by using spatial econometric methods in SAS/ETS^® software.

Read the paper (PDF)

As in any analytical process, data sampling is a definitive step for unstructured data analysis. Sampling is of paramount importance if your data is fed from social media reservoirs such as Twitter, Facebook, Amazon, and Reddit, where information proliferation happens minute by minute. Unless you have a sophisticated analytical engine and robust physical servers to handle billions of pieces of data, you can't use all your data for analysis without sampling. So, how do you sample textual data? The standard method is to generate either a simple random sample, or a stratified random sample if a stratification variable exists in the data. Neither of these two methods can reliably produce a representative sample of documents from the population data simply because the process does not encompass a step to ensure that the distribution of terms between the population and sample sets remains similar. This shortcoming can cause the supervised or unsupervised learning to yield inaccurate results. If the generated sample is not representative of the population data, it is difficult to train and validate categories or sub-categories for those rare events during taxonomy development. In this paper, we show you new methods for sampling text data. We rely on a term-by-document matrix and SAS^® macros to generate representative samples. Using these methods helps generate sufficient samples to train and validate every category in rule-based modeling approaches using SAS^® Contextual Analysis.

Read the paper (PDF)

A Bayesian network is a directed acyclic graphical model that represents probability relationships and conditional independence structure between random variables. SAS^® Enterprise Miner implements a Bayesian network primarily as a classification tool; it includes naïve Bayes, tree-augmented naïve Bayes, Bayesian-network-augmented naïve Bayes, parent-child Bayesian network, and Markov blanket Bayesian network classifiers. The HPBNET procedure uses a score-based approach and a constraint-based approach to model network structures. This paper compares the performance of Bayesian network classifiers to other popular classification methods, such as classification tree, neural network, logistic regression, and support vector machines. The paper also shows some real-world applications of the implemented Bayesian network classifiers and a useful visualization of the results.

Read the paper (PDF)

Digital transformation and analytics for incumbents isn't a question of choice or strategy. It's a question of business survival. Go analytics!

Designing a survey is a meticulous process involving a number of steps and many complex choices. For most survey researchers, the choice of a probability or non-probability sample is somewhat simple. However, throughout the sample design process, there are more complex choices, each of which can introduce bias into the survey estimates. For example, the sampling statistician must decide whether to stratify the frame and, if so, how many strata to use, whether to stratify explicitly, and how to define a stratum. The statistician must also decide whether to use clusters and, if so, how to define a cluster and what the ideal cluster size should be. The factors affecting these choices, along with the impact of different sample designs on survey estimates, are explored in this paper. The SURVEYSELECT procedure in SAS/STAT^® 14.1 is used to select a number of samples based on different designs using data from Jamaica's 2011 Population and Housing Census. Census results are assumed to be equal to the true population parameter. The estimates from each selected sample are evaluated against this parameter to assess the impact of different sample designs on point estimates. Design-adjusted survey estimates are computed using the SURVEYMEANS and SURVEYFREQ procedures in SAS/STAT 14.1. The resultant variances are evaluated to determine the sample design that yields the most precise estimates.

Read the paper (PDF)

"It takes months to find a customer and only seconds to lose one" (Unknown). Though the Business-to-Business (B2B) churn problem might not be as common as Business-to-Consumer (B2C) churn, it has become crucial for companies to address this effectively as well. Using statistical methods to predict churn is the first step in the process of retaining customers, which also includes model evaluation, prescriptive analytics (including outreach optimization), and performance reporting. Providing visibility into model and treatment performance enables the Data and Ops teams to tune models and adjust treatment strategy. West Corporation's Center for Data Science (CDS) has partnered with one of its lines of business in order to measure and prevent B2B customer churn. CDS has coupled firmographic and demographic data with internal CRM and past outreach data to build a Propensity to Churn model using SAS^®. CDS has provided the churn model output to an internal Client Success Team (CST), which focuses on high-risk/high-value customers in order to understand and provide resolution to any potential concerns that might be expressed by such customers. Furthermore, CDS automated weekly performance reporting using SAS and Microsoft Excel that not only focuses on model statistics, but also on CST actions and impact. This paper focuses on all of the steps involved in the churn-prevention process, including building and reviewing the model, treatment design and implementation, as well as performance reporting.

A/B testing is a form of statistical hypothesis testing on two business options (A and B) to determine which is more effective in the modern Internet age. The challenge for startups or new product businesses leveraging A/B testing is twofold: a small number of customers and poor understanding of their responses. This paper shows you how to use the IML and POWER procedures to deal with the reassessment of sample size for adaptive multiple business stage designs based on conditional power arguments, using the data observed at the previous business stage.
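As a rough illustration of a conditional power argument, the sketch below uses the textbook B-value formulation for a one-sided test on a normally distributed test statistic. It is an assumption that this resembles the paper's setup; the function name, the default significance level, and the "current trend" default for the drift are illustrative choices, and the paper itself works with the IML and POWER procedures.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def conditional_power(z_interim, info_fraction, alpha=0.025, drift=None):
    """Probability of crossing the final one-sided critical value, given the
    interim z-statistic observed at `info_fraction` of the planned information.

    drift=None assumes the trend observed so far continues; drift=0.0
    evaluates conditional power under the null hypothesis.
    """
    if not 0.0 < info_fraction < 1.0:
        raise ValueError("info_fraction must lie strictly between 0 and 1")
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)
    b_value = z_interim * sqrt(info_fraction)   # B(t) = Z(t) * sqrt(t)
    if drift is None:
        drift = b_value / info_fraction         # current-trend estimate
    remaining = 1.0 - info_fraction
    # B(1) - B(t) is Normal(drift * remaining, remaining)
    shortfall = z_crit - b_value - drift * remaining
    return 1.0 - _STD_NORMAL.cdf(shortfall / sqrt(remaining))
```

When the conditional power at an interim stage falls below a target, a design of this kind would typically increase the sample size for the next stage.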

Read the paper (PDF)

Epidemiologists and other health scientists are often tasked with solving health problems but find collecting original data prohibitive for a multitude of reasons. For this reason, it is common to instead use secondary data such as that from emergency departments (ED) or inpatient hospital stays. In order to use some of these secondary data sets to study problems over time, it is necessary to link them together using common identifiers and still keep all the unique information about each ED visit or hospitalization. This paper discusses a method that was used to combine five years' worth of individual ED visits and five years' worth of individual hospitalizations to create a single and (much) larger data set for longitudinal analysis.

Read the paper (PDF)

Regarding a human disease network, most studies have estimated the associations of disorders primarily with gene or protein information. Those studies, however, have some difficulties in the data because of the massive volume of data and the huge computational cost. Instead, we constructed a human disease network that can describe the associations between diseases, using the claim data of Korean health insurance. Through several statistical analyses, we show the applicability and suitability of the disease network. Furthermore, we develop a statistical model that can predict a prevalence rate for dementia by using significant associations of the network in a statistical perspective.

Read the paper (PDF)

This presentation discusses the options for including continuous covariates in regression models. In his book, 'Clinical Prediction Models,' Ewout Steyerberg presents a hierarchy of procedures for continuous predictors, starting with dichotomizing the variable and moving to modeling the variable using restricted cubic splines or using a fractional polynomial model. This presentation discusses all of the choices, with a focus on the last two. Restricted cubic splines express the relationship between the continuous covariate and the outcome using a set of cubic polynomials, which are constrained to meet at pre-specified points, called knots. Between the knots, each curve can take on the shape that best describes the data. A fractional polynomial model is another flexible method for modeling a relationship that is possibly nonlinear. In this model, polynomials with noninteger and negative powers are considered, along with the more conventional square and cubic polynomials, and the small subset of powers that best fits the data is selected. The presentation describes and illustrates these methods at an introductory level intended to be useful to anyone who is familiar with regression analyses.
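The restricted cubic spline construction described above can be sketched as follows. This uses the common truncated-power form in which the cubic terms are combined so that the fitted curve is linear beyond the outer knots; the normalization that some references apply is omitted, and the function name and knot values are illustrative rather than taken from the presentation.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots. Each s_j is a
    combination of truncated cubics chosen so that the cubic and quadratic
    terms cancel beyond the last knot, leaving a linear tail.
    """
    if len(knots) < 3 or list(knots) != sorted(set(knots)):
        raise ValueError("need at least three strictly increasing knots")

    def cube_plus(u):
        return max(u, 0.0) ** 3

    last, second_last = knots[-1], knots[-2]
    gap = last - second_last
    basis = [float(x)]
    for knot in knots[:-2]:
        term = (
            cube_plus(x - knot)
            - cube_plus(x - second_last) * (last - knot) / gap
            + cube_plus(x - last) * (second_last - knot) / gap
        )
        basis.append(term)
    return basis
```

The resulting columns can be entered into an ordinary regression model in place of the raw covariate; with k knots the spline uses k - 1 regression coefficients.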

Read the paper (PDF)

Computer and video games are complex these days. Events in video games are in some cases recorded automatically in text files, creating a history or story of game play. There are countable items in these event records that can be used as data for statistics and other types of modeling. This E-Poster shows you how to statistically analyze text files for video game events using SAS^®. Two games are analyzed. EVE Online, a massive multi-user online role-playing spaceship game, is one. The other game is Ingress, a cell phone game that combines exercise with a GPS and real-world environments. In both examples, the techniques involve parsing large amounts of text data to examine recurring patterns in text that describe events in the game play.

View the e-poster or slides (PDF)

Detection and adjustment of structural breaks are an important step in modeling time series and panel data. In some cases, such as studying the impact of a new policy or an advertising campaign, structural break analysis might even be the main goal of a data analysis project. In other cases, the adjustment of structural breaks is a necessary step to achieve other analysis objectives, such as obtaining accurate forecasts and effective seasonal adjustment. Structural breaks can occur in a variety of ways during the course of a time series. For example, a series can have an abrupt change in its trend, its seasonal pattern, or its response to a regressor. The SSM procedure in SAS/ETS^® software provides a comprehensive set of tools for modeling different types of sequential data, including univariate and multivariate time series data and panel data. These tools include options for easy detection and adjustment of a wide variety of structural breaks. This paper shows how you can use the SSM procedure to detect and adjust structural breaks in many different modeling scenarios. Several real-world data sets are used in the examples. The paper also includes a brief review of the structural break detection facilities of other SAS/ETS procedures, such as the ARIMA, AUTOREG, and UCM procedures.

Read the paper (PDF)

For all healthcare systems, considerable attention and resources are directed at gauging and improving patient satisfaction. Dignity Health has made considerable efforts in improving most areas of patient satisfaction. However, improving metrics around physician interaction with patients has been challenging. Failure to improve these publicly reported scores can result in reimbursement penalties, damage to Dignity's brand and an increased risk of patient harm. One possible way to improve these scores is to better identify the physicians that present the best opportunity for positive change. Currently, the survey tool mandated by the Centers for Medicare and Medicaid Services (CMS), the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS), has three questions centered on patient experience with providers, specifically concerning listening, respect, and clarity of conversation. For purposes of relating patient satisfaction scores to physicians, Dignity Health has assigned scores based on the attending physician at discharge. By conducting a manual record review, it was determined that this method rarely corresponds to the manual review (PPV = 20.7%, 95% CI: 9.9% -38.4%). Using a variety of SAS^® tools and predictive modeling programs, we developed a logistic regression model that had better agreement with chart abstractors (PPV = 75.9%, 95% CI: 57.9% - 87.8%). By attributing providers based on this predictive model, opportunities for improvement can be more accurately targeted, resulting in improved patient satisfaction and outcomes while protecting fiscal health.

Read the paper (PDF)

SAS^® has a very efficient and powerful way to get distances between an event and a customer. Using the tables and code located at http://support.sas.com/rnd/datavisualization/mapsonline/html/geocode.html#street, you can load latitude and longitude to addresses that you have for your events and customers. Once you have the tables downloaded from SAS, and you have run the code to get them into SAS data sets, this paper helps guide you through the rest using PROC GEOCODE and the GEODIST function. This can help you determine to whom to market an event. And, you can see how far a client is from one of your facilities.
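To make the distance step concrete, here is a minimal sketch of computing the distance between an event and each customer once both have latitude and longitude. It uses the haversine great-circle formula with a mean Earth radius, which is a common approximation and is not claimed to match the method that the GEODIST function uses internally. The function names and sample coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def customers_within(event, customers, radius_km):
    """Names of customers whose (lat, lon) lies within radius_km of the event."""
    return [
        name
        for name, (lat, lon) in customers.items()
        if haversine_km(event[0], event[1], lat, lon) <= radius_km
    ]


if __name__ == "__main__":
    event = (35.0, -97.0)  # hypothetical event location
    customers = {"near": (35.1, -97.1), "far": (40.0, -80.0)}
    print(customers_within(event, customers, radius_km=50))
```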

Read the paper (PDF) | View the e-poster or slides (PDF)

This paper illustrates proper applications of multidimensional item response theory (MIRT), which is available in SAS^® PROC IRT. MIRT combines item response theory (IRT) modeling and factor analysis when the instrument carries two or more latent traits. Although it might seem convenient to accomplish two tasks simultaneously by using one procedure, users should be cautious of misinterpretations. This illustration uses the 2012 Program for International Student Assessment (PISA) data set collected by the Organisation for Economic Co-operation and Development (OECD). Because there are two known sub-domains in the PISA test (reading and math), PROC IRT was programmed to adopt a two-factor solution. In addition, the loading plot, dual plot, item difficulty/discrimination plot, and test information function plot in JMP^® were used to examine the psychometric properties of the PISA test. When reading and math items were analyzed in SAS MIRT, seven to 10 latent factors were suggested. At first glance, these results are puzzling because ideally all items should be loaded into two factors. However, when the psychometric attributes yielded from a two-parameter IRT analysis are examined, it is evident that both the reading and math test items are well written. It is concluded that even if factor indeterminacy is present, it is advisable to evaluate its psychometric soundness based on IRT because content validity can supersede construct validity.

Read the paper (PDF) | View the e-poster or slides (PDF)

Learn neat new (and not so new) methods for joining text, graphs, and tables in a single document. This paper shows how you can create a report that includes all three with a single solution: SAS^®. The text portion is taken from a Microsoft Word document and joined with output from the GPLOT and REPORT procedures.

Read the paper (PDF)

Join this breakout session hosted by the Customer Value Management team from Saudi Telecommunications Company to understand the journey we took with SAS^® to evolve from simple below-the-line campaign communication to advanced customer value management (CVM). Learn how the team leveraged SAS tools ranging from SAS^® Enterprise Miner to SAS^® Customer Intelligence Suite in order to gain a deeper understanding of customers and move toward targeting customers with the right offer at the right time through the right channel.

Read the paper (PDF)

Customer feedback is a critical aspect of businesses in today's world as it is invaluable in determining what customers like and dislike about the business's service. This loop of regularly listening to customers' voice through survey comments and improving services based on it leads to better business, and more importantly, to an enhancement in customer experience. The challenge is to classify and analyze these unstructured text comments to gain insights and to focus on areas of improvement. The purpose of this paper is to illustrate how text mining in SAS^® Enterprise Miner 14.1 helped one of our clients, a leading financial services company, convert their customers' problems into opportunities. The customers' feedback pertaining to their experience with an Interactive Voice Response (IVR) system is collected by an enterprise feedback management (EFM) company. The comments are then split into two groups, which helps us differentiate customer opinions. This grouping is based on customers who have given a rating of 0-6 and a rating of 9-10 on a Likert scale of 0-10 (10 being extremely satisfied) in the survey questionnaire. Text mining is performed on both these groups, and an algorithm creates clusters that are subsequently used to segment customers based on opinions they are interested in voicing. Furthermore, sentiment scores are calculated for each one of the segments. The scores classify the polarity of customer feedback and prioritize the problems the client needs to focus on.

Read the paper (PDF)

Randomized control trials have long been considered the gold standard for establishing causal treatment effects. Can causal effects be reasonably estimated from observational data too? In observational studies, you observe treatment T and outcome Y without controlling confounding variables that might explain the observed associations between T and Y. Estimating the causal effect of treatment T therefore requires adjustments that remove the effects of the confounding variables. The new CAUSALTRT (causal-treat) procedure in SAS/STAT^® 14.2 enables you to estimate the causal effect of a treatment decision by modeling either the treatment assignment T or the outcome Y, or both. Specifically, modeling the treatment leads to the inverse probability weighting methods, and modeling the outcome leads to the regression methods. Combined modeling of the treatment and outcome leads to doubly robust methods that can provide unbiased estimates for the treatment effect even if one of the models is misspecified. This paper reviews the statistical methods that are implemented in the CAUSALTRT procedure and includes examples of how you can use this procedure to estimate causal effects from observational data. This paper also illustrates some other important features of the CAUSALTRT procedure, including bootstrap resampling, covariate balance diagnostics, and statistical graphics.
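The three families of estimators mentioned above can be written down compactly. The sketch below shows textbook forms of an inverse probability weighting estimator and an augmented (doubly robust) estimator of the average treatment effect, assuming the propensity scores and outcome-model predictions have already been fitted elsewhere. It is not a reproduction of PROC CAUSALTRT, and all function and argument names are hypothetical.

```python
def ipw_ate(treatment, outcome, propensity):
    """Inverse probability weighting estimate of the average treatment effect.

    treatment: 0/1 indicators; propensity: estimated P(T = 1 | covariates).
    """
    n = len(outcome)
    treated = sum(t * y / e for t, y, e in zip(treatment, outcome, propensity))
    control = sum(
        (1 - t) * y / (1 - e) for t, y, e in zip(treatment, outcome, propensity)
    )
    return (treated - control) / n


def aipw_ate(treatment, outcome, propensity, pred_treated, pred_control):
    """Augmented IPW (doubly robust) estimate.

    pred_treated / pred_control: outcome-model predictions of Y under
    treatment and under control for every subject.
    """
    total = 0.0
    for t, y, e, m1, m0 in zip(
        treatment, outcome, propensity, pred_treated, pred_control
    ):
        # Regression contrast plus weighted residual corrections.
        total += (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
    return total / len(outcome)
```

The augmented form starts from the regression contrast and corrects it with weighted residuals, which is why it can remain consistent when only one of the two models is correctly specified.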

Read the paper (PDF)

Pooling two or more cross-sectional survey data sets (such as stacking the data sets on top of one another) is a strategy often used by researchers for one of two purposes: (1) to more efficiently conduct significance tests on point estimate changes observed over time or (2) to increase the sample size in hopes of improving the precision of a point estimate. The latter purpose is especially common when making inferences on a subgroup, or domain, of the target population insufficiently represented by a single survey data set. Using data from the National Survey of Family Growth (NSFG), the aim of this paper is to walk through a series of practical estimation objectives that can be tackled by analyzing data from two or more pooled survey data sets. Where applicable, we comment on the resulting interpretive nuances.

Read the paper (PDF)

Student growth percentile (SGP) is one of the most widely used score metrics for measuring a student's academic growth. Using longitudinal data, SGP describes a student's growth as the relative standing among students who had a similar level of academic achievement in previous years. Although several models for SGP estimation have been introduced, and some models have been implemented with R, no studies have yet described using SAS^®. As a result, this research describes various types of SGP models and demonstrates how practitioners can use SAS procedures to fit these models. Specifically, this study covers three types of statistical models for SGP: (1) a quantile regression-based model, (2) a conditional cumulative density function-based model, and (3) a multidimensional item response theory-based model. Each of the three models partly uses procedures in SAS, such as PROC QUANTREG, PROC LOGISTIC, PROC TRANSREG, PROC IRT, or PROC MCMC, for its computation. The program code is illustrated using a simulated longitudinal data set over two consecutive years, which is generated by SAS/IML^®. In addition, the interpretation of the estimation results and the advantages and disadvantages of implementing these three approaches in SAS are discussed.

View the e-poster or slides (PDF)

Model validation is an important step in the model building process because it provides opportunities to assess the reliability of models before their deployment. Predictive accuracy measures the ability of the models to predict future risks, and significant developments have been made in recent years in the evaluation of survival models. SAS/STAT^® 14.2 includes updates to the PHREG procedure with a variety of techniques to calculate overall concordance statistics and time-dependent receiver operator characteristic (ROC) curves for right-censored data. This paper describes how to use these criteria to validate and compare fitted survival models and presents examples to illustrate these applications.
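For readers unfamiliar with concordance for right-censored data, the following sketch computes a Harrell-type concordance index by direct pair counting. It is a plain illustration of the idea and not the PHREG implementation; the function name and the convention that a higher risk score means earlier expected failure are assumptions made here.

```python
def harrell_c(times, events, risk_scores):
    """Harrell-type concordance index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event and
    fails strictly before subject j's recorded time. The pair is concordant
    when i also has the higher risk score; tied scores count one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of the comparable pairs.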

Read the paper (PDF)

Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time bachelor's degree seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS^® Visual Analytics, where interactive reports allowed users to easily identify which students were at a high risk for attrition or at risk of not graduating on time.

Read the paper (PDF)

With an aim to improve rural healthcare, Oklahoma State University (OSU) Center for Health Systems Innovation (CHSI) conducted a study with primary care clinics (n=35) in rural Oklahoma to identify possible impediments to clinic workflows. The study entailed semi-structured personal interviews (n=241) and administered an online survey using an iPad (n=190). Respondents encompassed all consenting clinic constituents (physicians, nurses, practice managers, schedulers). Quantitative data from surveys revealed that electronic medical records (EMRs) are well accepted and contributed to increasing workflow efficiency. However, the qualitative data from interviews revealed that there are IT-related barriers like Internet connectivity, hardware problems, and inefficiencies in information systems. Interview responses identified six IT-related response categories (computer, connectivity, EMR-related, fax, paperwork, and phone calls) that routinely affect clinic workflow. These categories together account for more than 50% of all the routine workflow-related problems faced by the clinics. Text mining was performed on transcribed interviews using SAS^® Text Miner to validate these six categories and to further identify concept linking for a quantifiable insight. Two variables (Redundancy Reduction and Idle Time Generation) were derived from survey questions with low scores of -129 and -64, respectively, out of 384. Finally, ANOVA was run using SAS^® Enterprise Guide^® 6.1 to determine whether the six qualitative categories affect the two quantitative variables differently.

Read the paper (PDF)

Traditional analytical modeling, with roots in statistical techniques, works best on structured data. Structured data enables you to impose certain standards and formats in which to store the data values. For example, a variable indicating gas mileage in miles per gallon should always be a number (for example, 25). However, with unstructured data analysis, the free-form text no longer limits you to expressing this information in only one way (25 mpg, twenty-five mpg, and 25M/G). The nuances of language, context, and subjectivity of text make it more complex to fit generalized models. Although statistical methods using supervised learning prove efficient and effective in some cases, sometimes you need a different approach. These situations are when rule-based models with Natural Language Processing capabilities can add significant value. In what context would you choose a rule-based modeling versus a statistical approach? How do you assess the tradeoffs of choosing a rule-based modeling approach with higher interpretability versus a statistical model that is black-box in nature? How can we develop rule-based models that optimize model performance without compromising accuracy? How can we design, construct, and maintain a complex rule-based model? What is a data-driven approach to rule writing? What are the common pitfalls to avoid? In this paper, we discuss all these questions based on our experiences working with SAS^® Contextual Analysis and SAS^® Sentiment Analysis.
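The gas mileage example above lends itself to a small rule-based sketch. The regular expression below is a hypothetical hand-written rule, not a rule from SAS Contextual Analysis; it recognizes the three spellings mentioned (25 mpg, twenty-five mpg, and 25M/G) and normalizes each to a number. The supported number words are deliberately limited.

```python
import re

UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

# A number written as digits, or as a tens word with an optional unit word,
# followed by one of several spellings of the unit.
MPG_RULE = re.compile(
    r"\b(?:(?P<digits>\d+(?:\.\d+)?)"
    r"|(?P<tens>" + "|".join(TENS) + r")(?:-(?P<unit>" + "|".join(UNITS) + r"))?)"
    r"\s*(?:mpg|m/g|miles per gallon)\b",
    re.IGNORECASE,
)


def extract_mpg(text):
    """Return every gas-mileage mention in `text` as a float."""
    values = []
    for match in MPG_RULE.finditer(text):
        if match.group("digits"):
            values.append(float(match.group("digits")))
        else:
            tens = TENS[match.group("tens").lower()]
            unit = match.group("unit")
            values.append(float(tens + (UNITS[unit.lower()] if unit else 0)))
    return values
```

Each new spelling requires another explicit alternative in the rule, which illustrates both the interpretability and the maintenance cost of rule-based models.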

Read the paper (PDF)

Factorization machines are a new type of model that is well suited to very high-cardinality, sparsely observed transactional data. This paper presents the new FACTMAC procedure, which implements factorization machines in SAS^® Visual Data Mining and Machine Learning. This powerful and flexible model can be thought of as a low-rank approximation of a matrix or a tensor, and it can be efficiently estimated when most of the elements of that matrix or tensor are unknown. Thanks to a highly parallel stochastic gradient descent optimization solver, PROC FACTMAC can quickly handle data sets that contain tens of millions of rows. The paper includes examples that show you how to use PROC FACTMAC to recommend movies to users based on tens of millions of past ratings, predict whether fine food will be highly rated by connoisseurs, restore heavily damaged high-resolution images, and discover shot styles that best fit individual basketball players.
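To show what the low-rank structure means in practice, the sketch below evaluates a generic second-order factorization machine prediction, once with the direct pairwise sum and once with the well-known linear-time rearrangement. It illustrates the model form only, under the assumption of dense Python lists for inputs; it does not reproduce PROC FACTMAC or its solver, and the parameter names are hypothetical.

```python
def fm_predict_naive(x, bias, weights, factors):
    """bias + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, computed directly."""
    n = len(x)
    result = bias + sum(w * xi for w, xi in zip(weights, x))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(factors[i], factors[j]))
            result += dot * x[i] * x[j]
    return result


def fm_predict(x, bias, weights, factors):
    """Same prediction in time linear in the number of features, using
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n = len(x)
    k = len(factors[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(factors[i][f] * x[i] for i in range(n))
        s_sq = sum((factors[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return bias + sum(w * xi for w, xi in zip(weights, x)) + pairwise
```

Because each feature carries only a short factor vector, interactions between feature pairs that never co-occur in the training data can still be estimated.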

Read the paper (PDF)

Credit card fraud. Loan fraud. Online banking fraud. Money laundering. Terrorism financing. Identity theft. The strains that modern criminals are placing on financial and government institutions demand new approaches to detecting and fighting crime. Traditional methods of analyzing large data sets on a periodic, batch basis are no longer sufficient. SAS^® Event Stream Processing provides a framework and run-time architecture for building and deploying analytical models that run continuously on streams of incoming data, which can come from virtually any source: message queues, databases, files, TCP/IP sockets, and so on. SAS^® Visual Scenario Designer is a powerful tool for developing, testing, and deploying aggregations, models, and rule sets that run in the SAS^® Event Stream Processing Engine. This session explores the technology architecture, data flow, tools, and methodologies that are required to build a solution based on SAS Visual Scenario Designer that enables organizations to fight crime in real time.

Read the paper (PDF)

SAS/STAT^® software has several procedures that estimate parameters from generalized linear models designed for both continuous and discrete response data (including proportions and counts). Procedures such as LOGISTIC, GENMOD, GLIMMIX, and FMM, among others, offer a flexible range of analysis options to work with data from a variety of distributions and also with correlated or clustered data. SAS^® procedures can also model zero-inflated and truncated distributions. This paper demonstrates how statements from PROC NLMIXED can be written to match the output results from these procedures, including the LS-means. The flexible programming statements of PROC NLMIXED are also needed in other situations, such as zero-inflated or hurdle models, truncated counts, or proportions (including legitimate zeros) that have random effects, and for probability distributions not available elsewhere. A useful application of these coding techniques is that programming statements from NLMIXED can often be directly transferred into PROC MCMC with little or no modification to perform analyses from a Bayesian perspective with these various types of complex models.

Read the paper (PDF)

Cumulative logistic regression models are used to predict an ordinal response. They have the assumption of proportional odds. Proportional odds means that the coefficients for each predictor category must be consistent or have parallel slopes across all levels of the response. This paper uses a sample data set to demonstrate how to test the proportional odds assumption. It shows how to use the UNEQUALSLOPES option when the assumption is violated. A cumulative logistic regression model is built, and then the performance of the model on a test set is compared to the performance of a generalized multinomial model. This shows the utility and necessity of the UNEQUALSLOPES option when building a cumulative logistic regression model. The procedures shown are produced using SAS^® Enterprise Guide^® 7.1.
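The difference between the proportional odds model and its unequal-slopes relaxation can be seen by computing category probabilities directly. The sketch below assumes the parameterization logit P(Y <= j | x) = alpha_j + beta . x, with one slope vector shared across cut-points under proportional odds and one slope vector per cut-point otherwise. The function names and coefficient values are illustrative and are not output from any SAS procedure.

```python
from math import exp


def _logistic(z):
    return 1.0 / (1.0 + exp(-z))


def _category_probs(cumulative):
    """Convert P(Y <= 1), ..., P(Y <= J-1) into P(Y = 1), ..., P(Y = J)."""
    bounds = [0.0] + list(cumulative) + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def proportional_odds_probs(x, intercepts, beta):
    """Category probabilities when every cut-point shares the slopes `beta`."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return _category_probs([_logistic(a + eta) for a in intercepts])


def unequal_slopes_probs(x, intercepts, betas):
    """Category probabilities with a separate slope vector per cut-point."""
    cumulative = [
        _logistic(a + sum(b * xi for b, xi in zip(beta_j, x)))
        for a, beta_j in zip(intercepts, betas)
    ]
    return _category_probs(cumulative)
```

With unequal slopes the cumulative curves are no longer parallel and can cross for some covariate values, in which case a computed category probability becomes negative; this is one reason to check fitted probabilities when relaxing the assumption.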

Read the paper (PDF)

Longitudinal count data arise when a subject's outcomes are measured repeatedly over time. Repeated measures count data have an inherent within subject correlation that is commonly modeled with random effects in the standard Poisson regression. A Poisson regression model with random effects is easily fit in SAS^® using existing options in the NLMIXED procedure. This model allows for overdispersion via the nature of the repeated measures; however, departures from equidispersion can also exist due to the underlying count process mechanism. We present an extension of the cross-sectional COM-Poisson (CMP) regression model established by Sellers and Shmueli (2010) (a generalized regression model for count data in light of inherent data dispersion) to incorporate random effects for analysis of longitudinal count data. We detail how to fit the CMP longitudinal model via a user-defined log-likelihood function in PROC NLMIXED. We demonstrate the model flexibility of the CMP longitudinal model via simulated and real data examples.
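A user-defined log-likelihood for the COM-Poisson distribution needs the log of its probability mass function, P(Y = y) = lambda^y / (y!)^nu / Z(lambda, nu), where Z is an infinite sum. The sketch below evaluates it with a truncated sum; the truncation length is an assumption that suits moderate lambda and nu > 0, and the random-effects part of the longitudinal model is not shown.

```python
from math import exp, lgamma, log


def cmp_log_normalizer(lam, nu, max_terms=500):
    """log Z(lam, nu) = log sum_{j>=0} lam^j / (j!)^nu, truncated at max_terms
    and summed with the log-sum-exp trick for numerical stability."""
    log_terms = [j * log(lam) - nu * lgamma(j + 1) for j in range(max_terms)]
    peak = max(log_terms)
    return peak + log(sum(exp(t - peak) for t in log_terms))


def cmp_logpmf(y, lam, nu):
    """Log probability of count y under COM-Poisson(lam, nu).

    nu = 1 gives the Poisson distribution, nu > 1 underdispersion,
    and nu < 1 overdispersion.
    """
    return y * log(lam) - nu * lgamma(y + 1) - cmp_log_normalizer(lam, nu)
```

In a regression setting, lambda would typically be linked to covariates (and to a subject-level random effect in the longitudinal case) through a log link, and the subject's log-likelihood contribution is the sum of `cmp_logpmf` over that subject's repeated counts.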

Read the paper (PDF)

The increasing complexity of data in research and business analytics requires versatile, robust, and scalable methods of building explanatory and predictive statistical models. Quantile regression meets these requirements by fitting conditional quantiles of the response with a general linear model that assumes no parametric form for the conditional distribution of the response; it gives you information that you would not obtain directly from standard regression methods. Quantile regression yields valuable insights in applications such as risk management, where answers to important questions lie in modeling the tails of the conditional distribution. Furthermore, quantile regression is capable of modeling the entire conditional distribution; this is essential for applications such as ranking the performance of students on standardized exams. This expository paper explains the concepts and benefits of quantile regression, and it introduces you to the appropriate procedures in SAS/STAT^® software.
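The idea of fitting a conditional quantile rests on the check (pinball) loss, which penalizes under- and over-prediction asymmetrically. The sketch below, with hypothetical function names, defines that loss and shows that the constant minimizing it over a sample is a sample quantile; a regression version would minimize the same loss over linear predictors instead of constants.

```python
def check_loss(residual, tau):
    """Pinball loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def total_check_loss(values, prediction, tau):
    """Sum of check losses when every value is predicted by one constant."""
    return sum(check_loss(v - prediction, tau) for v in values)


def best_constant(values, tau):
    """Observed value that minimizes the total check loss.

    Searching the observed values is enough because the loss is piecewise
    linear with kinks only at the data points.
    """
    return min(values, key=lambda c: total_check_loss(values, c, tau))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 100.0]
    print(best_constant(data, 0.5))  # the median, unaffected by the outlier
```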

Read the paper (PDF)

In fantasy football, it is a relatively common strategy to rotate which team's defense a player uses based on some combination of favorable/unfavorable player matchups, recent performance, and projection of expected points. However, there is danger in this strategy because defensive scoring volatility is high, and any team has the possibility of turning in a statistically bad performance in a given week. This paper uses data mining techniques to identify which National Football League (NFL) teams give up high numbers of defensive penalties on a week-to-week basis, and to what degree those high-penalty games correlate with poor team defensive fantasy scores. Examining penalty count and penalty yards allowed totals, we can narrow down which teams are consistently hurt by poor technique and measure how games with high penalty totals correlate with the corresponding fantasy football scores. By doing so, we seek to find which teams should be avoided in fantasy football due to their likelihood of poor performance.

The acquisition of new customers is fundamental to the success of every business. Data science methods can greatly improve the effectiveness of acquiring prospective customers and can contribute to the profitability of business operations. In a business-to-business (B2B) setting, a predictive model might target business prospects as individual firms listed by, for example, Dun & Bradstreet. A typical acquisition model can be defined using a binary response with the values categorizing a firm's customers and non-customers, for which it is then necessary to identify which of the prospects are actually customers of the firm. The methods of fuzzy logic, for example, based on the distance between strings, might help in matching customers' names and addresses with the overall universe of prospects. However, two errors can occur: false positives (when the prospect is incorrectly classified as a firm's customer), and false negatives (when the prospect is incorrectly classified as a non-customer). In the current practice of building acquisition models, these errors are typically ignored. In this presentation, we assess how these errors affect the performance of the predictive model as measured by lift. In order to improve the model's performance, we suggest using a pre-determined sample of correct matches and calibrating the model's predicted probabilities based on actual take-up rates. The presentation is illustrated with real B2B data and includes elements of SAS^® code that was used in the research.
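String-distance matching of the kind mentioned above can be sketched with the Levenshtein edit distance. This is an illustrative stand-in and not the matching logic used in the presentation; the normalization steps, the similarity score, and the 0.8 threshold are all assumptions. Lowering the threshold trades false negatives for false positives, which is the error trade-off the abstract describes.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1] after light normalization."""
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate, or None when no
    candidate reaches the (hypothetical) similarity threshold."""
    if not candidates:
        return None
    scored = [(similarity(name, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0])
    return (candidate, score) if score >= threshold else None
```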

Read the paper (PDF)

The analysis of longitudinal data requires a model that correctly accounts for both the inherent correlation amongst the responses as a result of the repeated measurements, as well as the feedback between the responses and predictors at different time points. Lalonde, Wilson, and Yin (2013) developed an approach based on generalized method of moments (GMM) for identifying and using valid moment conditions to account for time-dependent covariates in longitudinal data with binary outcomes. However, the model developed using this approach does not provide information about the specific relationships that exist across time points. We present a SAS^® macro that extends the work of Lalonde, Wilson, and Yin by using valid moment conditions to estimate and evaluate the relationships between the response and predictors at different time periods. The performance of this method is compared to previously established results.

Read the paper (PDF)

An important component of insurance pricing is the insured location and the associated riskiness of that location. Recently, we have experienced a large increase in the availability of external risk classification variables and associated risk factors by geospatial location. As additional geospatial data becomes available, it is prudent for insurers to take advantage of the new information to better match price to risk. Generalized additive models using penalized likelihood (GAMPL) have been explored as a way to incorporate new location-based information. This type of model can leverage the new geospatial information and incorporate it with traditional insurance rating variables in a regression-based model for rating. In our method, we propose a local regression model in conjunction with our GAMPL model. Our discussion demonstrates the use of the LOESS procedure as well as the GAMPL procedure in a combined solution. Both procedures are in SAS/STAT^® software. We discuss in detail how we built a local regression model and used the predictions from this model as an offset into a generalized additive model. We compare the results of the combined approach to results of each model individually.

Read the paper (PDF)

When creating statistical models that include multiple covariates (for example, Cox proportional hazards models or multiple linear regression), it is important to identify which variables are categorical and which are continuous for proper analysis and interpretation in SAS^®. Categorical variables, regardless of SAS data type, should be added in the MODEL statement with an additional CLASS statement. In larger models containing many continuous or categorical variables, it is easy to overlook variables that should be added to the CLASS statement. To solve this problem, we have created a macro that uses simple input from the model variables, with PROC CONTENTS and additional logic checks, to create the necessary CLASS statement and to run the desired model. With this macro, variables are evaluated on multiple conditions to see whether they should be considered class variables. Then, they are added automatically to the CLASS statement.

Read the paper (PDF) | View the e-poster or slides (PDF)

The presentation will give a brief introduction to Bayesian Analysis within SAS. Participants will learn the difference between Bayesian and Classical Statistics and be introduced to PROC MCMC.

Machine learning algorithms have been available in SAS software since 1979. This session provides practical examples of machine learning applications. The evolution of machine learning at SAS is illustrated with examples ranging from nearest-neighbor discriminant analysis in SAS/STAT PROC DISCRIM to advanced predictive modeling in SAS Enterprise Miner. Machine learning techniques addressed include memory-based reasoning, decision trees, neural networks, and gradient boosting algorithms.

In this presentation you will learn the basics of working with nested data, such as students within classes, customers within households, or patients within clinics through the use of multilevel models. Multilevel models can accommodate correlation among nested units through random intercepts and slopes, and generalize easily to 2, 3, or more levels of nesting. These models represent a statistically efficient and powerful way to test your key hypotheses while accounting for the hierarchical nesting of the design. The GLIMMIX procedure is used to demonstrate analyses in SAS.

Getting Started with ARIMA Models will introduce the basic features of time series variation and the model components used to accommodate them: stationary (ARMA), trend and seasonal (the 'I' in ARIMA), and exogenous (input variable related). The Identify, Estimate, and Forecast framework for building ARIMA models is illustrated with two demonstrations.

This workshop provides hands-on experience with using SAS Enterprise Miner. Workshop participants will learn to do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.

This workshop provides hands-on experience with SAS Viya Data Mining and Machine Learning through the programming interface to SAS Viya. Workshop participants will learn how to start and stop a CAS session; move data into CAS; prepare data for machine learning; use SAS Studio tasks for supervised learning; and evaluate the results of analyses.

This workshop provides hands-on experience performing statistical analysis with the Statistics tasks in SAS Studio. Workshop participants will learn to perform statistical analyses using tasks, evaluate which tasks are ideal for different kinds of analyses, edit the generated code, and customize a task.

This workshop provides hands-on experience using SAS^® Text Miner. For a collection of documents, workshop participants will learn how to: read and convert documents for use by SAS Text Miner; retrieve information from the collection using query features of the software; identify the dominant themes and concepts in the collection; and classify documents having pre-assigned categories.

When modeling time series data, we often use a LAG of the dependent variable. The LAG function works great for this, until you try to predict out into the future and need the model's predicted value from one record as an independent value for a future record. This paper examines exactly how the LAG function works, and explains why it doesn't in this case. It also explains how to create a hash object that will accomplish a LAG of any value, how to load the initial data, how to add predicted values to the hash object, and how to extract those values when needed as an independent variable for future observations.

Read the paper (PDF)
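
The idea of feeding a predicted value back in as the lag for the next record can be sketched as follows. This is a minimal illustration in Python, where a dictionary stands in for the SAS hash object; the one-lag model, its coefficients, and the function name are assumptions made for the example, not the paper's code.

```python
def forecast_with_lag(history, intercept, slope, horizon):
    """Extend a series using y[t] = intercept + slope * y[t-1] (hypothetical model)."""
    # Load the observed data into a lookup table keyed by time index.
    values = dict(enumerate(history))
    for t in range(len(history), len(history) + horizon):
        # The lag is retrieved by key, so it can be an observed value
        # or a prediction stored on an earlier pass through the loop.
        lagged = values[t - 1]
        values[t] = intercept + slope * lagged
    return [values[t] for t in range(len(history) + horizon)]


print(forecast_with_lag([10.0, 12.0], intercept=1.0, slope=0.5, horizon=3))
# prints [10.0, 12.0, 7.0, 4.5, 3.25]
```

Because each prediction is written to the table before the next record is processed, the lookup at `t - 1` always succeeds, which is the behavior that a plain lag over the input records cannot provide for future periods.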

How would you answer this question? Most of us struggle to articulate the value of the tools, techniques, and teams we use when using analytics. How do you help the new director understand the value of SAS^® to you, your job, and the company? In this interactive session, you will discover the components that make up total cost of ownership (TCO) as they apply to the analytics lifecycle. What should you consider when you evaluate total cost of ownership and why should you measure it? How can you help your management team understand the value that SAS provides?

Read the paper (PDF)

Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one based on a methodology from so many candidates is usually a big challenge. This paper presents a methodology based on the CANSLIM theorem and the momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in the CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS^®. This paper presents methods for applying the logic and presenting the results graphically through SAS. All SAS users, especially those who work directly on capital market business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.

Read the paper (PDF)

The traditional view is that a utility's long-term forecast must have a standard against which it is judged. Weather normalization is one of the industry-standard practices that utilities use to assess the efficacy of a forecasting solution. While recent advances in probabilistic load forecasting techniques are proving to be a methodology that brings many benefits to a forecast, many utilities still require the benchmarking process to determine the accuracy of their long-term forecasts. Due to climatological volatility and the potentially large annual variances in temperature, humidity, and other relevant weather variables, most utilities create normalized weather profiles through various processes in order to estimate what is traditionally called a weather normalized load profile. However, new research shows that due to the nonlinear response of electric demand to weather variations, a simple normal weather profile in many cases might not equate to a normal load. In this paper, we introduce a probabilistic approach to deriving normalized load profiles and monthly peak and energy through a process we label load normalization against the effects of weather. We compare it with the traditional weather normalization process to quantify the costs and benefits of using such a process. The proposed method has been successfully deployed at utilities for their long-term operation and planning purposes, and risk management.

Read the paper (PDF)

What if you had analytics near the edge for your Internet of Things (IoT) devices that would tell you whether a piece of equipment is operating within its normal range? And what if those same analytics could help you intelligently determine what data you should keep and what data should be filtered at the edge? This session focuses on classifying streaming data near the edge by showcasing a demo that implements a single-class classification model within a gateway device. The model identifies observations that are normal and abnormal to help determine possible machine issues and preventative maintenance opportunities. The classification also helps to provide a method for filtering data at the edge by capturing all abnormal data but taking only a sample of the normal operating data. The model is developed using SAS^® Viya and implemented on a gateway device using SAS^® Event Stream Processing. By using a single-class classification technique, the demo also illustrates how to avoid issues with binary classification that would require failure observations in order to build an accurate model. Problems that this demo addresses include: identifying potential and future equipment failures in near real time; filtering sensor data near the edge to prevent unnecessary transport and storage of less valuable data; and building a classification model for failure that doesn't require observations relating to failures.

Read the paper (PDF)

This paper describes an effective real-time contextual marketing system based on a successful case implemented in a private communication company in Chile. Implementing real-time cases is becoming a major challenge due to stronger competition, which generates an increase of churn and higher operational costs, among other issues. All of these can have an enormous effect on revenue and profit. A set of predictive machine learning models can help to improve response rates of outbound campaigns, but it's not enough to be more proactive in this business. Our real-time system for contextual marketing uses the two SAS^® solutions: SAS^® Event Stream Processing and SAS^® Real-Time Decision Manager, which are connected in cascade. In this configuration, SAS Event Stream Processing can read massive amounts of data from call detail records (CDRs) and antennas, and SAS Real-Time Decision Manager receives the resulting golden events, which trigger the right responses. Time elapsed from the detection of a golden event until a response is processed is approximately 5 seconds. Since implementing seven use cases of this real-time system, the results show an average augmentation in revenue of two million dollars in a testing period of four months, thus returning the investment in a short-term period. The implementation of this system has changed the way Telefónica Chile generates value from big data. Moreover, an outstanding, long-term working relationship between Telefónica Chile and SAS has been started.

Read the paper (PDF)

This presentation demonstrates that applying the fast Fourier transform (FFT) to smart meter data can provide enhanced customer segmentation and discovery. The FFT is a mathematical method for transforming a function of time into a function of frequency. It's widely used in analyzing sound but is also relevant for utilities. Advanced Metering Infrastructure (AMI) refers to the full measurement and collection system that includes meters at the customer site and communication networks between the customer and the utility. With the inception of AMI, utilities experienced an explosion of data that provides vast analytical opportunities to improve reliability, customer satisfaction, and safety. However, the data explosion comes with its own challenges. The first challenge is the volume. Consider that just 20,000 customers with AMI data can reach over 300 GB of data per year. Simply aggregating the data from minutes to hours or even days can skew results and not provide accurate segmentations. The second challenge is the bad data that is being collected. Outliers caused by missing or incorrect reads, outages, or other factors must be addressed. FFT can eliminate this noise. The proposed framework is expected to identify various customer segments that could be used for demand response programs. The framework also has the potential to investigate diversion or fraud or failing meters (revenue protection), which is a big problem for many utilities.
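
The time-to-frequency transformation described above can be sketched as follows. This is a minimal illustration in Python on synthetic hourly load data, not the presenters' implementation. To stay self-contained it uses a naive discrete Fourier transform, which yields the same coefficients that an FFT computes much faster; the data and function names are assumptions made for the example.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier coefficient (naive O(n^2) transform)."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n)]


def dominant_period(signal):
    """Period, in samples, of the strongest non-constant frequency component."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    # Skip k = 0 (the mean level) and the mirrored upper half of the spectrum.
    k = max(range(1, n // 2 + 1), key=lambda i: mags[i])
    return n / k


# Synthetic hourly load for 10 days: base load, a daily cycle, a weaker 12-hour cycle.
load = [1.0 + 0.5 * math.sin(2 * math.pi * h / 24)
        + 0.1 * math.sin(2 * math.pi * h / 12) for h in range(240)]

print(dominant_period(load))  # prints 24.0
```

A customer's spectrum (here dominated by the 24-hour cycle) could then serve as a compact set of features for segmentation, in place of the raw interval reads.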

SAS^® Visual Analytics has two offerings, SAS^® Visual Statistics and SAS^® Visual Data Mining and Machine Learning, that provide knowledge workers and data scientists an interactive interface for data partition, data exploration, feature engineering, and rapid modeling. These offerings are powered by the SAS^® Viya platform, thus enabling big data and big analytic problems to be solved. This paper focuses on the steps a user would perform during an interactive modeling session.

Read the paper (PDF)

Interrupted time series analysis (ITS) is a tool that can be used to help Learning Healthcare Systems evaluate programs in settings where randomization is not feasible. Interrupted time series is a statistical method to assess repeated snapshots over regular intervals of time before and after a system-level intervention or program is implemented. This method can be used by Learning Healthcare Systems to evaluate programs aimed at improving patient outcomes in real-world, clinical settings. In practice, the number of patients and the timing of observations are restricted. This presentation describes a program that helps statisticians identify optimal segments of time within a fixed population size for an interrupted time series analysis. A macro creates simulations based on DO loops to calculate power to detect changes over time due to system-level interventions. Parameters used in the macro are sample size, number of subjects in each time frame in each year, number of intervals in a year, and the probability of the event before and after the intervention. The macro gives the user the ability to specify different assumptions that result in design options that yield varying power based on the number of patients in each time interval given the fixed parameters. The output from the macro can help stakeholders understand necessary parameters to help determine the optimal evaluation design.

Read the paper (PDF)
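
Simulation-based power estimation of the kind this abstract describes can be sketched as follows. This is a deliberately simplified illustration in Python, not the authors' macro: it replaces the full interrupted time series model with a pre/post comparison of event proportions using a two-sided z-test at the 0.05 level, and all parameter names and defaults are assumptions made for the example.

```python
import math
import random


def simulate_power(n_per_interval, intervals_per_period, p_before, p_after,
                   n_sims=500, seed=2017):
    """Share of simulated studies in which the pre/post change is detected."""
    rng = random.Random(seed)
    n = n_per_interval * intervals_per_period  # subjects observed in each period
    rejections = 0
    for _ in range(n_sims):
        # Simulate event counts before and after the intervention.
        before = sum(rng.random() < p_before for _ in range(n))
        after = sum(rng.random() < p_after for _ in range(n))
        pooled = (before + after) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        # Two-sided z-test for a difference in proportions, alpha = 0.05.
        if se > 0 and abs(before - after) / n / se > 1.96:
            rejections += 1
    return rejections / n_sims
```

Calling the function over a grid of `n_per_interval` and `intervals_per_period` values would give the kind of design comparison the abstract mentions, with each cell showing the estimated power for one design.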

Statistical analysis is like detective work, and a data set is like the crime scene. The data set contains unorganized clues and patterns that can, with proper analysis, ultimately lead to meaningful conclusions. Using SAS^® tools, a statistical analyst (like any good crime scene investigator) performs a preliminary analysis of the data set through visualization and descriptive statistics. Based on the preliminary analysis, followed by a detailed analysis, both the crime scene investigator (CSI) and the statistical analyst (SA) can use scientific or analytical tools to answer the key questions: What happened? What were the causes and effects? Why did this happen? Will it happen again? Applying the CSI analogy, this paper presents an example case study using a two-step process to investigate a big-data crime scene. Part I shows the general procedures that are used to identify clues and patterns and to obtain preliminary insights from those clues. Part II narrows the focus on the specific statistical analyses that provide answers to different questions.

Read the paper (PDF)

Throughout history, the phrase "know thyself" has been the aspiration of many. The trend of wearable technologies has certainly provided the opportunity to collect personal data. These technologies enable individuals to "know thyself" on a more sophisticated level. Specifically, wearable technologies that can track a patient's medical profile in a web-based environment, such as continuous blood glucose monitors, are saving lives. The main goal for diabetics is to replicate the functions of the pancreas in a manner that allows them to live a normal, functioning lifestyle. Many diabetics have access to a visual analytics website to track their blood glucose readings. However, the displays are often unreadable and overloaded with information. By analyzing the readings from the glucose monitor and insulin pump with SAS^®, diabetics can parse their own information into simpler, more readable graphs. This presentation demonstrates the ease of creating these visualizations. This is beneficial not only for diabetics, but also for the doctors who prescribe the necessary basal and bolus levels of insulin for a patient's insulin pump.

View the e-poster or slides (PDF)

Data is your friend. This presentation discusses the use of data for quality improvement (QI). Measurement over time is integral to quality improvement, and statistical process control charts (also known as Shewhart or SPC charts) are a good way to learn from the way measures change over time in response to our improvement efforts. The presentation explains what an SPC chart is, how to choose the correct type of chart, how to create and update a chart using SAS^®, and how to learn from the chart. The examples come from QI projects in health care, and the material is based on the Institute for Healthcare Improvement's Model for Improvement. However, the material is applicable to other fields, including manufacturing and business. The presentation is intended for people newly considering a QI project, people who want to graph their data and need help with getting started, and anyone interested in interpreting SPC charts created by someone else.

Read the paper (PDF)

Linear regression, which is widely used, can be improved by the inclusion of a penalty parameter. This helps reduce variance (at the cost of a slight increase in bias) and improves prediction accuracy and model interpretability. The regularized model is implemented on a sample data set, and recommendations for practice are included.

View the e-poster or slides (PDF)
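
The effect of a penalty parameter on a regression coefficient can be sketched as follows. This is a minimal illustration in Python using a ridge (squared-coefficient) penalty with a single centered predictor, for which the estimate has the closed form Sxy / (Sxx + penalty). The poster does not say which penalty it uses, so the ridge form and the function name are assumptions made for the example.

```python
def ridge_slope(x, y, penalty):
    """Ridge estimate of the slope for one predictor, after centering x and y."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    # penalty = 0 gives ordinary least squares; larger values shrink toward 0.
    return sxy / (sxx + penalty)


x = [-2, -1, 0, 1, 2]
y = [-4, -2, 0, 2, 4]
print(ridge_slope(x, y, penalty=0))   # prints 2.0
print(ridge_slope(x, y, penalty=10))  # prints 1.0
```

The shrinkage from 2.0 to 1.0 is the added bias; in noisy data the same shrinkage lowers the variance of the estimate, which is the trade-off the abstract refers to.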

Every organization, from the most mature to a day-one start-up, needs to grow organically. A deep understanding of internal customer and operational data is the single biggest catalyst to develop and sustain the data. Advanced analytics and big data directly feed into this, and there are best practices that any organization (across the entire growth curve) can adopt to drive success. Analytics teams can be drivers of growth. But to be truly effective, key best practices need to be implemented. These practices include in-the-weeds details, like the approach to data hygiene, as well as strategic practices, like team structure and model governance. When executed poorly, business leadership and the analytics team are unable to communicate with each other: they talk past each other and do not work together toward a common goal. When executed well, the analytics team is part of the business solution, aligned with the needs of business decision-makers, and drives the organization forward. Through our engagements, we have discovered best practices in three key areas, all of which are critical to analytics team effectiveness: 1) Data Hygiene, 2) Complex Statistical Modeling, and 3) Team Collaboration.

Read the paper (PDF)

In the increasingly competitive environment for banks and credit unions, every potential advantage should be pursued. One of these advantages is to market additional products to your existing customers rather than to new customers, since your existing customers already know (and hopefully trust) you, and you have so much data on them. But how can this best be done? How can you market the right products to the right customers at the right time? Predictive analytics can do this by forecasting which customers have the highest chance of purchasing a given financial product. This paper provides a step-by-step overview of a relatively simple but comprehensive approach to maximize cross-sell opportunities among your customers. We first prepare the data for a statistical analysis. With some basic predictive analytics techniques, we can then identify those members who have the highest chance of buying a financial product. For each of these members, we can also gain insight into why they would purchase, thus suggesting the best way to market to them. We then make suggestions to improve the model for better accuracy.

Read the paper (PDF)

Meta-analysis is a method for combining multiple independent studies on the same subject or question, producing a single large study with increased accuracy and enhanced ability to detect overall trends and smaller effects. This is done by treating the results of each study as a single observation and performing analysis on the set, while controlling for differences between individual studies. These differences can be treated as either fixed or random effects, depending on context. This paper demonstrates the process and techniques used in meta-analysis using human trafficking studies. This problem has seen increasing interest in the past few years, and there are now a number of localized studies for one state or a metropolitan area. This meta-analysis combines these to begin development of a comprehensive analytic understanding of human trafficking across the United States. Both fixed and random effects are described. All elements of this analysis were performed using SAS^® University Edition.

Read the paper (PDF)
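
The difference between fixed-effect and random-effects pooling can be sketched as follows. This is a minimal illustration in Python, not the paper's SAS code: it uses inverse-variance weighting for the fixed-effect estimate and the DerSimonian-Laird estimator of between-study variance for the random-effects estimate. The choice of estimator and the function names are assumptions made for the example.

```python
def fixed_effect(effects, variances):
    """Inverse-variance pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1.0 / total


def random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate, its variance, and tau^2."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q measures how much the studies disagree with each other.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    # Between-study variance is added to every study's own variance.
    pooled, var = fixed_effect(effects, [v + tau2 for v in variances])
    return pooled, var, tau2
```

For two hypothetical studies with effects 0.2 and 0.4 and variance 0.01 each, both models pool to 0.3, but the random-effects variance (0.01) is twice the fixed-effect variance (0.005) because the studies disagree more than sampling error alone would explain.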

Many practitioners of machine learning are familiar with support vector machines (SVMs) for solving binary classification problems. Two established methods of using SVMs in multinomial classification are the one-versus-all approach and the one-versus-one approach. This paper describes how to use SAS^® software to implement these two methods of multinomial classification, with emphasis on both training the model and scoring new data. A variety of data sets are used to illustrate the pros and cons of each method.

Read the paper (PDF)
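
How the two decomposition schemes turn a multinomial problem into binary ones can be sketched as follows. This is a minimal illustration in Python that covers only the bookkeeping around the binary classifiers (which problems to train, and how to combine their outputs when scoring); the binary SVMs themselves are not implemented, and all function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def one_vs_all_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def predict_one_vs_all(scores):
    """Pick the class whose binary classifier gives the largest decision score."""
    return max(scores, key=scores.get)


def predict_one_vs_one(pair_winners):
    """Pick the class that wins the most pairwise contests."""
    return Counter(pair_winners).most_common(1)[0][0]


classes = ["bus", "train", "tram", "ferry"]
print(len(one_vs_all_tasks(classes)))  # prints 4
print(len(one_vs_one_tasks(classes)))  # prints 6
```

With K classes, one-versus-all trains K classifiers on the full data, while one-versus-one trains K(K-1)/2 classifiers on smaller pairwise subsets. This sketch does not define how ties in the pairwise vote are broken.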

This paper presents a case study in which social media posts by individuals related to public transport companies in the United Kingdom were collected from social media sites such as Twitter and Facebook and also from forums using SAS^® and Python. The posts were then further processed by SAS^® Text Miner and SAS^® Visual Analytics to retrieve brand names, means of public transport (underground, trains, buses), and any mentioned attributes. Relevant concepts and topics are identified using text mining techniques and visualized using concept maps and word clouds. Later, we aim to identify and categorize sentiments against public transport in the corpus of the posts. Finally, we create an association map/mind-map of the different service dimensions/topics and the brands of public transport, using correspondence analysis.

Read the paper (PDF)

Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss Given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level that are built on these models and that are used to help in achieving business targets, setting risk-sensitive pricing, capital planning, optimizing Return on Equity/Risk Adjusted Return on Capital (ROE/RAROC), managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate, and their predictive power can drop dramatically. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results, without human intervention) can result in a huge error, which might have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS^® tools: SAS^® Model Monitoring Microservice, SAS^® Model Manager, dashboards, and SAS^® Visual Analytics.

Read the paper (PDF)

Prince Niccolo Machiavelli said things on the order of, "The promise given was a necessity of the past: the word broken is a necessity of the present." His utilitarian philosophy can be summed up by the phrase, "The ends justify the means." As a personality trait, Machiavellianism is characterized by the drive to pursue one's own goals at the cost of others. In 1970, Richard Christie and Florence L. Geis created the MACH-IV test to assign a MACH score to an individual, using 20 Likert-scaled questions. The purpose of this study was to build a regression model that can be used to predict the MACH score of an individual using fewer factors. Such a model could be useful in screening processes where personality is considered, such as in job screening, offender profiling, or online dating. The research was conducted on a data set from an online personality test similar to the MACH-IV test. It was hypothesized that a statistically significant model exists that can predict an average MACH score for individuals with similar factors. This hypothesis was accepted.

View the e-poster or slides (PDF)

This paper establishes the conceptualization of the dimension of the shopping cart (or market basket) on apparel retail websites. It analyzes how the cart dimension (describing anonymous shoppers) and the customer dimension (describing non-anonymous shoppers) impact merchandise return behavior. Five data-mining techniques (namely logistic regression, decision tree, neural network, gradient boosting, and support vector machine) are used for predicting the likelihood of merchandise return. The target variable is a dichotomous response variable: return vs not return. The primary input variables are conceptualized as constituents of the cart dimension, derived from engineering merchandise-related variables such as item style, item size, and item color, as well as free-shipping-related thresholds. By further incorporating the constituents of the customer dimension such as tenure, loyalty membership, and purchase histories, the predictive accuracy of the model built using each of the five data-mining techniques was found to improve substantially. This research also highlights the relative importance of the constituents of the cart and customer dimensions governing the likelihood of merchandise return. Recommendations for possible applications and research areas are provided.

Read the paper (PDF)

Dynamic social networks can be used to monitor the constantly changing nature of interactions and relationships between people and groups. The size and complexity of modern dynamic networks can make this task extremely challenging. Using the combination of SAS/IML^®, SAS/QC^®, and R, we propose a fast approach to monitor dynamic social networks. A discrepancy score at edge level was developed to measure the unusualness of the observed social network. Then, multivariate and univariate change-point detection methods were applied on the aggregated discrepancy score to identify the edges and vertices that have experienced changes. Stochastic block model (SBM) networks were simulated to demonstrate this method using SAS/IML and R. PROC SHEWHART and PROC CUSUM in SAS/QC and PROC SGRENDER heat maps were applied on the aggregated discrepancy score to monitor the dynamic social network. The combination of SAS/IML, SAS/QC, and R makes an ideal toolset for monitoring dynamic social networks.

View the e-poster or slides (PDF)

In item response theory (IRT), the distribution of examinees' abilities is needed to estimate item parameters. However, specifying the ability distribution is difficult, if not impossible, because examinees' abilities are latent variables. Therefore, IRT estimation programs typically assume that abilities follow a standard normal distribution. When estimating item parameters using two separate computer runs, one problem with this approach is that it causes item parameter estimates obtained from two groups that differ in ability level to be on different scales. There are several methods that can be used to place the item parameter estimates on a common scale, one of which is multi-group calibration. This method is also called concurrent calibration because all items are calibrated concurrently with a single computer run. There are two ways to implement multi-group calibration in SAS^®: 1) using PROC IRT, and 2) writing an algorithm from scratch using SAS/IML^®. The purpose of this study is threefold. First, the accuracy of the item parameter estimates is evaluated using a simulation study. Second, the item parameter estimates are compared to those produced by the item calibration program flexMIRT. Finally, the advantages and disadvantages of using these two approaches to conduct multi-group calibration are discussed.

View the e-poster or slides (PDF)

Multicollinearity can be briefly described as the phenomenon in which two or more identified predictor variables in a multiple regression model are highly correlated. The presence of this phenomenon can have a negative impact on the analysis as a whole and can severely limit the conclusions of the research study. This paper reviews and provides examples of the different ways in which multicollinearity can affect a research project, and tells how to detect multicollinearity and how to reduce it once it is found. In order to demonstrate the effects of multicollinearity and how to combat it, this paper explores the proposed techniques by using the Behavioral Risk Factor Surveillance System data set. This paper is intended for any level of SAS^® user. It is also written for an audience with a background in behavioral science or statistics.

Read the paper (PDF)
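
One common way to detect multicollinearity, the variance inflation factor (VIF), can be sketched as follows. This is a minimal illustration in Python restricted to two predictors, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between them. The abstract does not say which diagnostics the paper uses, so the choice of VIF, the data, and the function names are assumptions made for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    # Undefined (division by zero) when the predictors are perfectly correlated.
    return 1.0 / (1.0 - r * r)


# Uncorrelated predictors: no inflation.
print(vif_two_predictors([-1, 0, 1], [1, -2, 1]))  # prints 1.0
```

A VIF near 1 indicates little shared information between the predictors; values above roughly 10 are a widely used rule of thumb for problematic collinearity.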

Recent advances in computing technology, monitoring systems, and data collection mechanisms have prompted renewed interest in multivariate time series analysis. In contrast to univariate time series models, which focus on temporal dependencies of individual variables, multivariate time series models also exploit the interrelationships between different series, thus often yielding improved forecasts. This paper focuses on cointegration and long memory, two phenomena that require careful consideration and are observed in time series data sets from several application areas, such as finance, economics, and computer networks. Cointegration of time series implies a long-run equilibrium between the underlying variables, and long memory is a special type of dependence in which the impact of a series' past values on its future values dies out slowly with the increasing lag. Two examples illustrate how you can use the new features of the VARMAX procedure in SAS/ETS^® 14.1 and 14.2 to glean important insights and obtain improved forecasts for multivariate time series. One example examines cointegration by using the Granger causality tests and the vector error correction models, which are the techniques frequently applied in the Federal Reserve Board's Comprehensive Capital Analysis and Review (CCAR), and the other example analyzes the long-memory behavior of US inflation rates.

Read the paper (PDF) | Download the data file (ZIP)

Data must be of moderate size when you estimate parameters with machine learning. For data with huge numbers of records like healthcare big data (for example, receipt data) or super multi-dimensional data like genome big data, it is important to follow a procedure in which data is cleaned first and then selection of data or variables for modeling is performed. Big data often consists of macroscopic and microscopic groups. With these groups, it is possible to increase the accuracy of estimation by following the above procedure in which data is cleaned from a macro perspective and the selection of data or variables for modeling is performed from a micro perspective. This kind of step-wise procedure can be expected to help reduce bias. We also propose a new analysis algorithm with N-stage machine learning. For simplicity, we assume N = 2. Note that different machine learning approaches should be applied; that is, a random forest method is used at the first stage for data cleaning, and an elastic net method is used for the selection of data or variables. For programming N-stage machine learning, we use the LUA procedure, which is not only efficient but also enables an easily readable iteration algorithm to be developed. Note that we use well-known machine learning methods that are implementable with SAS^® 9.4, SAS^® In-Memory Statistics, and so on.

Read the paper (PDF)

Confidence intervals are critical to understanding your survey data. If your intervals are too narrow, you might inadvertently judge a result to be statistically significant when it is not. While many familiar SAS^® procedures, such as PROC MEANS and PROC REG, provide statistical tests, they rely on the assumption that the data comes from a simple random sample. However, almost no real-world survey uses such sampling. Learn how to use the SURVEYMEANS procedure and its SURVEY cousins to estimate confidence intervals and perform significance tests that account for the structure of the underlying survey, including the replicate weights now supplied by some statistical agencies. Learn how to extract the results you need from the flood of output that these procedures deliver.

Read the paper (PDF)

As a data scientist, you need analytical tools and algorithms, whether commercial or open source, and you have some favorites. But how do you decide when to use what? And how can you integrate their use to your maximum advantage? This presentation provides several best practices for deploying both SAS^® and open-source analytical tools to increase productivity and efficiency in your enterprise ecosystem. See an example of a marketing analysis using SAS and R algorithms in SAS^® Enterprise Miner to develop a predictive model, and then operationalize that model for performance monitoring and in-database scoring. Also learn about using Python and SAS integration for developing predictive models from a Jupyter Notebook environment. Seeing these cases will help you decide how to improve your analytics with similar integration of SAS and open source.

Read the paper (PDF)

Many communication channels exist for customers to engage with businesses, yet an interactive voice response (IVR) system remains the most critical of them. This is because IVR acts as the front end to consumer interaction and is the most effective method for customers to do business with companies in order to resolve their issues before talking to an agent. If the IVR interface is not designed properly, customers can be stuck in an endless loop of pressing buttons that can lead to consumer annoyance. The bottom line is: An IVR system should be set up to quickly resolve as many routine inbound inquiries as possible and to allow customers to speak to an agent when necessary. In order to accomplish this, the IVR interface has to be optimized so that it is fully effective and provides a great customer experience. This paper demonstrates how SAS^® tools helped optimize the IVR system of a book publishing company. The data set used in this study was obtained from a telecom services company and contained IVR logs of more than 300,000 calls with 1.4 million observations. To gain insights into customer behaviors, path analysis was performed on this data using SAS^® Enterprise Miner, and obstacles faced by customers were identified. This helped in determining underperforming prompts, and analysis using SAS procedures was conducted on such prompts. Prompt tuning was recommended, and new self-service areas were identified that avoid transfers and can save clients thousands of dollars in investments in call centers.

Read the paper (PDF)

People typically invest in more than one stock to help diversify their risk. These stock portfolios are a collection of assets that each have their own inherent risk. If you know the future risk of each of the assets, you can optimize how much of each asset to keep in the portfolio. The real challenge is trying to evaluate the potential future risk of these assets. Different techniques provide different forecasts, which can drastically change the optimal allocation of assets. This talk presents a case study of portfolio optimization in three different scenarios: historical standard deviation estimation, capital asset pricing model (CAPM), and GARCH-based volatility modeling. The structure and results of these three approaches are discussed.
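
To illustrate why different risk forecasts can change the optimal allocation, here is a minimal Python sketch of the closed-form minimum-variance weights for a two-asset portfolio. It is not the case study from the talk: the volatility and correlation figures are hypothetical, the function name is an assumption, and short positions are allowed.

```python
def min_variance_weights(sigma1, sigma2, rho):
    """Weights (w1, w2) that minimize the variance of a two-asset portfolio.

    sigma1, sigma2: forecast standard deviations of the two assets
    rho: forecast correlation between their returns
    Assumes the denominator is nonzero (the assets are not perfectly
    correlated with equal volatility).
    """
    cov = rho * sigma1 * sigma2
    w1 = (sigma2 ** 2 - cov) / (sigma1 ** 2 + sigma2 ** 2 - 2.0 * cov)
    return w1, 1.0 - w1

# Two hypothetical forecasts for the same pair of assets.
historical = min_variance_weights(sigma1=0.10, sigma2=0.20, rho=0.0)
model_based = min_variance_weights(sigma1=0.18, sigma2=0.20, rho=0.0)

print("Weights under forecast A:", historical)
print("Weights under forecast B:", model_based)
```

Raising the forecast volatility of the first asset moves weight toward the second, which is the sensitivity the abstract describes.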

Read the paper (PDF)

Optimizing delivery routes and efficiently using delivery drivers are examples of classic problems in Operations Research, such as the Traveling Salesman Problem. In this paper, Oberweis and Zencos collaborate to describe how to leverage SAS/OR^® procedures to solve these problems and optimize delivery routes for a retail delivery service. Oberweis Dairy specializes in home delivery service that delivers premium dairy products directly to customers' homes. Because freshness is critical to delivering an excellent customer experience, Oberweis is especially motivated to optimize their delivery logistics. As Oberweis works to develop an expanding footprint and a growing business, Zencos is helping to ensure that delivery routes are optimized and delivery drivers are used efficiently.

Read the paper (PDF)

The analytical data life cycle consists of four stages: data exploration, preparation, model development, and model deployment. Traditionally, these stages can consume 80% of the time and resources within your organization. With innovative techniques such as in-database and in-memory processing, managing data and analytics can be streamlined, with an increase in performance, economics, and governance. This session explores how you can optimize the analytical data life cycle with some best practices and tips using SAS^® and Teradata.

Outliers, such as unusual, violated, unexpected, or rare events, have been studied intensively by researchers and practitioners because of their impact on estimated statistics and developed models. Today, some business disciplines focus primarily on outliers such as credit defaults, operational risks, quality nonconformities, fraud, or even the results of marketing initiatives in highly competitive environments with response rates of a couple of percent or less. This paper discusses the importance of detecting, isolating, and categorizing business outliers to discover their root causes and to monitor them dynamically. Detecting outliers by addressing not only extreme values or multivariable densities, but also distributions, patterns, clusters, combinations of items, and sequences of events, creates opportunities for business improvement. SAS^® Enterprise Miner can be used to perform such detections. Creating special business segments or running specialized outlier-oriented data mining processes, such as decision trees, allows for isolation of business-important outliers, which are normally masked in traditional statistical techniques. This process, combined with 'What-If' scenario generation, prepares businesses for possible future surges even when no outliers of a specific type currently exist. Furthermore, analyzing some specific outliers may play a role in assessing business stability under corresponding stress tests.

Read the paper (PDF)

Are secondary schools in the United States hiring enough qualified math teachers? In which regions is there a disparity of qualified teachers? Data from an extensive survey conducted by the National Center for Education Statistics (NCES) was used for predicting qualified secondary school teachers across public schools in the US. The three criteria examined to determine whether a teacher is qualified to teach a given subject are: (1) whether the teacher has a degree in the subject he or she is teaching; (2) whether he or she has a teaching certification in the subject; and (3) whether he or she has five years of experience in the subject. A qualified teacher is defined as one who has all three of the previous qualifications. The sample data included socioeconomic data at the county level, which was used as predictors for hiring a qualified teacher. Data such as the number of students on free or reduced lunch at the school was used to assign schools as high-needs or low-needs schools. Other socioeconomic factors included were the income and education levels of working adults within a given school district. Some of the results show that schools with higher-needs students (a school that has more than 40% of the students on some form of reduced lunch program) have less-qualified teachers. The resultant model is used to score other regions and is presented on a heat map of the US. SAS^® procedures such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC are used.

View the e-poster or slides (PDF)

Research using electronic health records (EHR) is emerging, but questions remain about its completeness, due in part to physicians' time to enter data in all fields. This presentation demonstrates the use of SAS^® Enterprise Miner to predict completeness of clinical data using claims data as the standard 'source of truth' against which to compare it. A method for assessing and predicting the completeness of clinical data is presented using the tools and techniques from SAS Enterprise Miner. Some of the topics covered include: tips for preparing your sample data set for use in SAS Enterprise Miner; tips for preparing your sample data set for modeling, including effective use of the Input Data, Data Partition, Filter, and Replacement nodes; and building predictive models using Stat Explore, Decision Tree, Regression, and Model Compare nodes.

View the e-poster or slides (PDF)

The most commonly reported model evaluation metric is the accuracy. This metric can be misleading when the data are imbalanced. In such cases, other evaluation metrics should be considered in addition to the accuracy. This study reviews alternative evaluation metrics for assessing the effectiveness of a model in highly imbalanced data. We used credit card clients in Taiwan as a case study. The data set contains 30,000 instances (22.12% risky and 77.88% non-risky) assessing the likelihood of a customer defaulting on a payment. Three different techniques were used during the model building process. The first technique involved down-sampling the majority class in the training subset. The second used the original imbalanced data, whereas prior probabilities were set to account for oversampling in the third technique. The same sets of predictive models were then built for each technique, after which the evaluation metrics were computed. The results suggest that model evaluation metrics might reveal more about distribution of classes than they do about the actual performance of models when the data are imbalanced. Moreover, some of the predictive models were identified to be very sensitive to imbalance. The final decision in model selection should consider a combination of different measures instead of relying on one measure. To minimize imbalance-biased estimates of performance, we recommend reporting both the obtained metric values and the degree of imbalance in the data.
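
The point about accuracy can be made concrete with the class proportions quoted above. The following Python sketch is an illustration, not a model from the study: it scores a trivial classifier that labels every one of the 30,000 instances as non-risky. The function name is an assumption.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

n = 30000
risky = round(n * 0.2212)   # 22.12% of instances are risky
non_risky = n - risky       # the remaining 77.88%

# A classifier that always predicts "non-risky" never flags a risky client,
# so every risky client is a false negative.
accuracy, sensitivity, specificity = confusion_metrics(
    tp=0, fp=0, tn=non_risky, fn=risky
)

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f}")
```

The accuracy simply reproduces the majority-class share, while sensitivity shows that no risky client is detected, which is why a combination of measures is recommended.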

Read the paper (PDF)

Airbnb is the world's largest home-sharing company and has over 800,000 listings in more than 34,000 cities and 190 countries. Therefore, the pricing of their property, done by the Airbnb hosts, is crucial to the business. Setting low prices during a high-demand period might hinder profits, while setting high prices during a low-demand period might result in no bookings at all. In this paper, we suggest a price recommendation methodology for Airbnb hosts that helps in overcoming the problems of overpricing and underpricing. Through this methodology, we try to identify key factors related to Airbnb pricing: factors influential in determining a price for a property; the relation between the price of a property and the frequency of its booking; and similarities among successful and profitable properties. The constraints outlined in the analysis were entered into SAS^® optimization procedures to achieve a best possible price. As a part of this methodology, we built a scraping tool to get details of New York City host user data along with their metrics. Using this data, we build a pricing model to predict the optimal price of an Airbnb home.

Read the paper (PDF)

In a randomized study, subjects are randomly assigned to either a treated group or a control group. Random assignment ensures that the distribution of the covariates is the same in both groups and that the treatment effect can be estimated by directly comparing the outcomes for the subjects in the two groups. In contrast, subjects in an observational study are not randomly assigned. In order to establish causal interpretations of the treatment effects in observational studies, special statistical approaches that adjust for the covariate confounding are required to obtain unbiased estimation of causal treatment effects. One strategy for correctly estimating the treatment effect is based on the propensity score, which is the conditional probability of the treatment assignment given the observed covariates. Prior to the analysis, you use propensity scores to adjust the data by weighting observations, stratifying subjects that have similar propensity scores, or matching treated subjects to control subjects. This paper reviews propensity score methods for causal inference and introduces the PSMATCH procedure, which is new in SAS/STAT^® 14.2. The procedure provides methods of weighting, stratification, and matching. Matching methods include greedy matching, matching with replacement, and optimal matching. The procedure assesses covariate balance by comparing distributions between the adjusted treated and control groups.
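
Of the adjustments listed, weighting is the simplest to sketch. The Python example below is not PROC PSMATCH and does not reproduce its options. It assumes the propensity scores are already estimated, uses hypothetical data, and computes a normalized inverse-probability-weighted estimate of the average treatment effect.

```python
def ipw_ate(treated, outcome, pscore):
    """Normalized inverse-probability-weighted average treatment effect.

    treated: 1 for treated subjects, 0 for controls
    outcome: observed outcome for each subject
    pscore:  estimated probability of treatment given covariates,
             assumed to lie strictly between 0 and 1
    """
    weight_t = sum_t = weight_c = sum_c = 0.0
    for t, y, e in zip(treated, outcome, pscore):
        if t:
            w = 1.0 / e            # treated subjects weighted by 1 / e
            weight_t += w
            sum_t += w * y
        else:
            w = 1.0 / (1.0 - e)    # controls weighted by 1 / (1 - e)
            weight_c += w
            sum_c += w * y
    return sum_t / weight_t - sum_c / weight_c

# Hypothetical subjects: treatment indicator, outcome, propensity score.
treated = [1, 1, 1, 0, 0, 0]
outcome = [5.0, 6.0, 7.0, 4.0, 4.5, 6.0]
pscore = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2]

print("IPW estimate of the treatment effect:", ipw_ate(treated, outcome, pscore))
```

Subjects who received an unlikely assignment for their covariates get larger weights, which is how weighting rebalances the two groups.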

Read the paper (PDF)

Predictive analytics has been evolving in property and casualty insurance for the past two decades. This paper first provides a high-level overview of predictive analytics in each of the following core business operations in the property and casualty (P&C) insurance industry: marketing, underwriting, actuarial pricing, actuarial reserving, and claims. Then, a common P&C insurance predictive modeling technical process in SAS^® dealing with large data sets is introduced. The steps of this process include data acquisition, data preparation, variable creation, variable selection, model building (also known as model fitting), model validation, model testing, and so on. Finally, some successful models are introduced. Base SAS^®, SAS/STAT^® software, SAS^® Enterprise Guide^®, and SAS^® Enterprise Miner are presented as the main tools for this process. This predictive modeling process could be tweaked or directly used in many other industries as the statistical foundations of predictive analytics have large overlaps across P&C insurance, health care, life insurance, banking, pharmaceutical, genetics industries, and so on. This paper is intended for any level of SAS^® user or business people from different industries who are interested in learning about general predictive analytics.

Read the paper (PDF)

Conferences for SAS^® programming are replete with the newest software capabilities and clever programming techniques. However, discussion about quality control (QC) is lacking. QC is fundamental to ensuring both correct results and sound interpretation of data. It is not industry specific, and it simply makes sense. Most QC procedures are a function of regulatory requirements, industry standards, and corporate philosophies. Good QC goes well beyond just reviewing results, and should also consider the underlying data. It should be driven by a thoughtful consideration of relevance and impact. While programmers strive to produce correct results, it is no wonder that programming mistakes are common despite rigid QC processes in an industry where expedited deliverables and a lean workforce are the norm. This leads to a lack of trust in team members and an overall increase in resource requirements as these errors are corrected, particularly when SAS programming is outsourced. Is it possible to produce results with a high degree of accuracy, even when time and budget are limited? Thorough QC is easy to overlook in a high-pressure environment with increased expectations of workload and expedited deliverables. Does this suggest that QC programming is becoming a lost art, or does it simply suggest that we need to evolve with technology? The focus of the presentation is to review the who, what, when, how, why, and where of QC programming implementation.

Read the paper (PDF)

This session introduces how Equifax uses SAS^® to develop a streamlined model automation process, including development and performance monitoring. The tool increases modeling efficiency and accuracy of the model, reduces error, and generates insights. You can integrate more advanced analytics tools in a later phase. The process can apply in any given business problem in risk and marketing, which helps leaders to make precise and accurate business decisions.

A random forest is an ensemble of decision trees that often produces more accurate results than a single decision tree. The predictions of the individual trees in the forest are averaged to produce a final prediction. The question now arises whether a better or more accurate final prediction can be obtained by a more intelligent use of the trees in the forest. In particular, in the way random forests are currently defined, every tree contributes the same fraction to the final result (for example, if there are 50 trees, each tree contributes 1/50th to the final result). This ignores model uncertainty, as less accurate trees are treated exactly like more accurate trees. Replacing averaging with Bayesian Model Averaging will give better trees the opportunity to contribute more to the final result, which might lead to more accurate predictions. However, there are several complications to this approach that have to be resolved, such as the computation of an SBC value for a decision tree. Two novel approaches to solving this problem are presented, and the results are compared to those obtained with the standard random forest approach.
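
The contrast between equal weighting and Bayesian Model Averaging can be sketched as follows. This Python example is not either of the paper's two approaches. It assumes an SBC value is already available for each tree (the hard part, per the abstract) and uses the standard approximation that posterior model weights are proportional to exp(-SBC / 2). The tree predictions and SBC values are hypothetical.

```python
import math

def bma_weights(sbc_values):
    """Approximate posterior model weights from SBC values (lower is better)."""
    best = min(sbc_values)
    # Subtract the minimum before exponentiating for numerical stability.
    raw = [math.exp(-0.5 * (sbc - best)) for sbc in sbc_values]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_prediction(predictions, weights):
    """Combine per-tree predictions using the given weights."""
    return sum(p * w for p, w in zip(predictions, weights))

# Hypothetical predictions of three trees for one observation,
# and a hypothetical SBC value for each tree.
tree_predictions = [0.70, 0.55, 0.20]
tree_sbc = [100.0, 102.0, 110.0]

equal = [1.0 / len(tree_predictions)] * len(tree_predictions)
bma = bma_weights(tree_sbc)

print("Equal-weight prediction:", weighted_prediction(tree_predictions, equal))
print("BMA-weight prediction:  ", weighted_prediction(tree_predictions, bma))
```

With these numbers the tree with the lowest SBC dominates the weighted prediction, whereas equal weighting lets the weakest tree pull the average down.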

Read the paper (PDF)

AdaBoost (or Adaptive Boosting) is a machine learning method that builds a series of decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. I discuss the AdaBoost methodology and introduce the extension called Real AdaBoost, which is so similar to stepwise weight of evidence logistic regression (SWOELR) that it might offer a framework with which we can understand the power of the SWOELR approach. I discuss the advantages of Real AdaBoost, including variable interaction and adaptive, stage-wise binning, and demonstrate a SAS^® macro that uses Real AdaBoost to generate predictive models.

Read the paper (PDF)

Binary logistic regression models are widely used in CRM (customer relationship management) or credit risk modeling. In these models, it is common to use nominal, ordinal, or discrete (NOD) predictors. NOD predictors typically are binned (reducing the number of their levels) before usage in a logistic model. The primary purpose of binning is to obtain parsimony without greatly reducing the strength of association of the predictor X to the binary target Y. In this paper, two SAS^® macros are discussed. The %NOD_BIN macro bins predictors with nominal values (and ordinal and discrete values) by collapsing levels to maximize information value (IV). The %ORDINAL_BIN macro is applied to predictors that are ordered and in which collapsing can occur only for levels that are adjacent in the ordering of X. The %ORDINAL_BIN macro finds all possible binning solutions by complete enumeration. Solutions are ranked by IV, and monotonic solutions are identified.
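
Information value, the quantity that the binning maximizes, has a standard definition that can be sketched in a few lines. The Python example below is not the %NOD_BIN or %ORDINAL_BIN macro. It uses hypothetical counts, assumes every level has at least one good and one bad observation, and shows how collapsing two adjacent levels changes the IV.

```python
import math

def information_value(levels):
    """IV of a binned predictor; levels is a list of (goods, bads) counts."""
    total_good = sum(goods for goods, _ in levels)
    total_bad = sum(bads for _, bads in levels)
    iv = 0.0
    for goods, bads in levels:
        pct_good = goods / total_good
        pct_bad = bads / total_bad
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv

def collapse_adjacent(levels, i):
    """Merge level i with level i + 1 by adding their counts."""
    merged = (levels[i][0] + levels[i + 1][0], levels[i][1] + levels[i + 1][1])
    return levels[:i] + [merged] + levels[i + 2:]

# Hypothetical (goods, bads) counts for four ordered levels of X.
levels = [(100, 10), (80, 20), (50, 30), (20, 40)]

iv_before = information_value(levels)
iv_after = information_value(collapse_adjacent(levels, 0))

print("IV with four levels: ", iv_before)
print("IV after collapsing the first two levels:", iv_after)
```

Collapsing never increases IV, so a binning search looks for merges that give up as little IV as possible in exchange for fewer levels.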

Read the paper (PDF)

Mediation analysis is a statistical technique for investigating the extent to which a mediating variable transmits the relation of an independent variable to a dependent variable. Because it is useful in many fields, there have been rapid developments in statistical mediation methods. The most cutting-edge statistical mediation analysis focuses on the causal interpretation of mediated effect estimates. Cause-and-effect inferences are particularly challenging in mediation analysis because of the difficulty of randomizing subjects to levels of the mediator (MacKinnon, 2008). The focus of this paper is how incorporating longitudinal measures of the mediating and outcome variables aids in the causal interpretation of mediated effects. This paper provides useful SAS^® tools for designing adequately powered studies to detect the mediated effect. Three SAS macros were developed using the powerful but easy-to-use REG, CALIS, and SURVEYSELECT procedures to do the following: (1) implement popular statistical models for estimating the mediated effect in the pretest-posttest control group design; (2) conduct a prospective power analysis for determining the required sample size for detecting the mediated effect; and (3) conduct a retrospective power analysis for studies that have already been conducted, when the sample size required to detect an observed effect is desired. We demonstrate the use of these three macros with an example.

Read the paper (PDF)

Web Intelligence is a business objects web-based application used by the FDA for accessing and querying data files, and ultimately creating reports from multiple databases. The system allows querying of different databases using common business terms, and in the case of the FDA's Center for Food Safety and Applied Nutrition (CFSAN), careful review of dietary supplement information. However, in order to create timely and efficient reports for detection of safety signals leading to adverse events, a more efficient system is needed to obtain and visually display the data. Using SAS^® Visual Analytics and SAS^® Enterprise Guide^® can assist with timely extraction of data from multiple databases commonly used by CFSAN and create a more user friendly interface for management's review to help make key decisions in prevention of adverse events for the public.

Read the paper (PDF)

The purpose of this paper is to show a SAS^® macro named %SURVEYGENMOD developed in a SAS/IML^® procedure as an upgrade of macro %SURVEYGLM developed by Silva and Silva (2014) to deal with complex survey design in generalized linear models (GLMs). The new capabilities are the inclusion of negative binomial distribution, zero-inflated Poisson (ZIP) model, zero-inflated negative binomial (ZINB) model, and the possibility to get estimates for domains. The R function svyglm (Lumley, 2004) and Stata software were used as background, and the results showed that estimates generated by the %SURVEYGENMOD macro are close to the R function and Stata software.

Read the paper (PDF)

With the 2014 launch of SAS^® University Edition, the reach of SAS^® was greatly expanded to educators, students, researchers, non-profits, and the curious, who for the first time could use a full version of Base SAS^® software for free. Because SAS University Edition allows a maximum of two CPUs, however, performance is curtailed sharply from more substantial SAS environments that can benefit from parallel and distributed processing, such as environments that implement SAS^® Grid Manager, Teradata, or Hadoop solutions. Even when comparing performance of SAS University Edition against the most straightforward implementation of the SAS windowing environment, the SAS windowing environment demonstrates greater performance when run on the same computer. With parallel processing and distributed computing becoming the status quo in SAS production environments, SAS University Edition will unfortunately fall behind counterpart SAS solutions if it cannot harness parallel processing best practices and performance. To curb this disparity, this session introduces groundbreaking programmatic methods that enable commodity hardware to be networked so that multiple instances of SAS University Edition can communicate and work collectively to divide and conquer complex tasks. With parallel processing facilitated, a SAS practitioner can now harness an endless number of computers to produce blitzkrieg solutions with the SAS University Edition that rival the performance of more costly, complex infrastructure.

Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second-level learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS^® 9.4 and SAS^® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and naïve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.

Read the paper (PDF)

Has the rapid pace of SAS/STAT^® releases left you unaware of powerful enhancements that could make a difference in your work? Are you still using PROC REG rather than PROC GLMSELECT to build regression models? Do you understand how the GENMOD procedure compares with the newer GEE and HPGENSELECT procedures? Have you grasped the distinction between PROC PHREG and PROC ICPHREG? This paper will increase your awareness of modern alternatives to well-established tools in SAS/STAT by using succinct, high-level comparisons rather than detailed descriptions to explain the relative benefits of procedures and methods. The paper focuses on alternatives in the areas of regression modeling, mixed models, generalized linear models, and survival analysis. When you see the advantages of these newer tools, you will want to put them into practice. This paper points you to helpful resources for getting started.

Read the paper (PDF)

There is an industry-wide push toward making workflows seamless and reproducible. Incorporating reproducibility into the workflow has many benefits; among them are increased transparency, time savings, and accuracy. We walk through how to seamlessly integrate SAS^®, LaTeX, and R into a single reproducible document. We also discuss best practices for general principles such as literate programming and version control.

Read the paper (PDF)

Survival analysis differs from other types of statistical analysis, including graphical summaries and regression modeling procedures, because data is almost always censored. The purpose of this project is to apply survival analysis techniques in SAS^® to practical survival data, aiming to understand the effects of gender and age on lung cancer patient survival at different cancer sites. Results show that both gender and age are significant variables in predicting lung cancer patient survival using the Cox proportional hazards model. Females have better survival than males when other variables in the model are fixed (p-value 0.0254). Moreover, the hazard of patients who are over 65 is 1.385 times that of patients who are under 65 (p-value 0.0145).

View the e-poster or slides (PDF)

Have you ever used a control chart to assess the variation in a process? Did you wonder how you could modify the chart to tell a more complete story about the process? This paper explains how you can use the SHEWHART procedure in SAS/QC^® software to make the following enhancements: display multiple sets of control limits that visualize the evolution of the process, visualize stratified variation, explore within-subgroup variation with box-and-whisker plots, and add information that improves the interpretability of the chart. The paper begins by reviewing the basics of control charts and then illustrates the enhancements with examples drawn from real-world quality improvement efforts.

Read the paper (PDF)

Temporal text mining (TTM) is the discovery of temporal patterns in documents that are collected over time. It involves discovery of latent themes, construction of a thematic evolution graph, and analysis of thematic patterns. This paper uses text mining and time series analysis techniques to explore Don Quixote de la Mancha, a two-volume master work of Western literature. First, it uses singular value decomposition in SAS^® Text Miner to discover 25 key themes that characterize the two volumes. Then it treats the chapters of the two books as time-ordered documents and creates a semiautomated visual summary of the two volumes. It also explores the trajectory of individual themes over the course of the chapters and identifies episodes, recurring themes, and climaxes. Finally, it uses time series clustering in SAS^® Enterprise Miner to group chapters that have similar themes and to group themes that have similar trajectories. The TTM methods demonstrated in this paper lend themselves to business applications such as monitoring changes in customer sentiment and summarizing research and legislative trends.

Read the paper (PDF)

This project describes a method for classifying movie genres based on synopsis text data using two approaches: term frequency-inverse document frequency (tf-idf) and the C4.5 decision tree. By comparing the performance of the classifiers under different parameter settings, the strengths of this method and its potential improvements for substantial text analysis are also interpreted. The results show that these two approaches are powerful tools for identifying movie genres.

Read the paper (PDF) | View the e-poster or slides (PDF)

Propensity to Buy models comprise one of the most widely used techniques in supporting business strategy for customer segmentation and targeting. Some of the key challenges every data scientist faces in building predictive models are the utilization of all known predictor variables, uncovering any unknown signals, and adjusting for latent variable errors. Often, the business demands inclusion of certain variables based on a previous understanding of process dynamics. To meet such client requirements, these inputs are forced into the model, resulting in either a complex model with too many inputs or a fragile model that might decay faster than expected. West Corporation's Center for Data Science (CDS) has found a workaround to strike a balance between meeting client requirements and building a robust model by using clustering techniques. A leading telecom services provider uses West's SMS Outbound Notification Platform to notify their customers about an upcoming Pay-Per-View event. As part of the modeling process, the client identified a few variables as key business drivers, and CDS used those variables to build clusters, which were then used as inputs to the predictive model. In doing so, not only were all the effects of the client-mandated variables captured successfully, but the number of inputs to the model was also reduced, making it parsimonious. This paper illustrates how West has used clustering in the data preparation process and built a robust model.

Socioeconomic status (SES) is a major contributor to health disparities in the United States. Research suggests that those with a low SES versus a high SES are more likely to have lower life expectancy; participate in unhealthy behaviors such as smoking and alcohol consumption; experience higher rates of depression, childhood obesity, and ADHD; and experience problems accessing appropriate health care. Interpreting SES can be difficult due to the complexity of data, multiple data sources, and the large number of socioeconomic and demographic measures available. When SES is expanded to include additional social determinants of health (SDOH) such as language barriers and transportation barriers to care; access to employment and affordable housing; adequate nutrition, family support and social cohesion; health literacy; crime and violence; quality of housing; and other environmental conditions, the ability to measure and interpret the concept becomes even more difficult. This paper presents an approach to measuring SES and SDOH using publicly available data. Various statistical modeling techniques are used to define state-specific composite SES scores at the local-area level (ZIP Code and Census Tract). Once developed, the SES/SDOH models are applied to health care claims data to evaluate the relationship between health services utilization, cost, and social factors. The analysis includes a discussion of the potential impact of social factors on population risk adjustment.

Read the paper (PDF)

For a freshman at a large university, life can be fun as well as stressful. The choices a freshman makes while in college might impact his or her overall health. In order to examine the overall health and different behaviors of students at Oklahoma State University, a survey was conducted among the freshman students. The survey focused on capturing psychological and environmental factors, diet, exercise, and alcohol and drug use among students. A total of 795 out of 1,036 freshman students completed the survey, which included around 270 questions that covered the range of issues mentioned above. An exploratory factor analysis identified 26 factors. For example, two factors that relate to the behavior of students under stress are eating and relaxing. Further understanding the variables that contribute to alcohol and drug use might help the university in planning appropriate interventions and preventions. Factor analysis with Cronbach's alpha provided insight into a more defined set of variables to help address these types of issues. We used SAS^® to do factor analysis as well as to create different clusters of students with unique characteristics, and we profiled these clusters.
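Cronbach's alpha, mentioned above as a reliability check on the factors, can be computed directly from item scores with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below is only an illustration: the function name and the tiny response matrix are hypothetical and are not taken from the survey.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of respondent scores per survey item (equal lengths)."""
    k = len(items)
    # Total score per respondent across all items in the scale.
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical responses: three items answered by five students.
stress_items = [
    [2, 3, 4, 5, 3],
    [1, 3, 4, 4, 2],
    [2, 2, 5, 5, 3],
]
alpha = cronbach_alpha(stress_items)
```

Values close to 1 suggest that the items grouped under a factor measure the same underlying construct consistently.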

Read the paper (PDF)

In the 2015-2016 season of the National Basketball Association (NBA), the Golden State Warriors achieved a record-breaking 73 regular-season wins. This accomplishment would not have been possible without their reigning Most Valuable Player (MVP) champion Stephen Curry and his historic shooting performance. Shattering his previous NBA record of 286 three-point shots made during the 2014-2015 regular season, he accrued an astounding 402 in the next season. With an increased emphasis on the advantages of the three-point shot and guard-heavy offenses in the NBA today, organizations are naturally eager to investigate player statistics related to shooting at long ranges, especially for the best of shooters. Furthermore, the addition of more advanced data-collecting entities such as SportVU creates an incredible opportunity for data analysis, moving beyond simply using aggregated box scores. This work uses quantile regression within SAS^® 9.4 to explore the relationships between the three-point shot and other relevant advanced statistics, including some SportVU player-tracking data, for the top percentile of three-point shooters from the 2015-2016 NBA regular season.

View the e-poster or slides (PDF)

The advent of robust and thorough data collection has resulted in the term big data. With Census data becoming richer, more nationally representative, and voluminous, we need methodologies that are designed to handle the manifold survey designs that Census data sets implement. The relatively nascent PROC SURVEYLOGISTIC, an experimental procedure in SAS^®9 and fully supported in SAS 9.1, addresses some of these methodologies, including clusters, strata, and replicate weights. PROC SURVEYLOGISTIC handles data that is not a straightforward random sample. Using Census data sets, this paper provides examples highlighting the appropriate use of survey weights to calculate various estimates, as well as the calculation and interpretation of odds ratios between categorical variable interactions when predicting a binary outcome.

Read the paper (PDF)

Many organizations need to analyze large numbers of time series that have time-varying or frequency-varying properties (or both). The time-varying properties can include time-varying trends, and the frequency-varying properties can include time-varying periodic cycles. Time-frequency analysis simultaneously analyzes both time and frequency; it is particularly useful for monitoring time series that contain several signals of differing frequency. These signals are commonplace in data that are associated with the internet of things. This paper introduces techniques for large-scale time-frequency analysis and uses SAS^® Forecast Server and SAS/ETS^® software to demonstrate these techniques.

Read the paper (PDF)

Predictive analytics is powerful in its ability to predict likely future behavior, and is widely used in marketing. However, the old adage "timing is everything" continues to hold true: the right customer and the right offer at the wrong time is less than optimal. In life, timing matters a great deal, but predictive analytics seldom takes timing into account explicitly. We should be able to do better. Financial service consumption changes often have precursor changes in behavior, and a behavior change can lead to multiple subsequent consumption changes. One way to improve our awareness of the customer situation is to detect significant events that have meaningful consequences that warrant proactive outreach. This session presents a simple time series approach to event detection that has proven to work successfully. Real case studies are discussed to illustrate the approach and implementation. Adoption of this practice can augment and enhance predictive analytics practice to elevate our game to the next level.
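The abstract does not specify the detection rule, so the following is only one simple way such time series event detection might look: flag an observation when it departs from the mean of a trailing window by more than a chosen number of standard deviations. The function name, window length, threshold, and balance series are all hypothetical.

```python
from statistics import mean, stdev

def detect_events(series, window=6, threshold=3.0):
    """Return indices whose value departs sharply from the trailing window."""
    events = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sd = mean(history), stdev(history)
        # Skip flat histories, where a standard deviation of 0 gives no scale.
        if sd > 0 and abs(series[i] - mu) > threshold * sd:
            events.append(i)
    return events

# Hypothetical monthly balances for one customer, with a jump in month 6.
balances = [100, 102, 98, 101, 99, 100, 250, 101]
events = detect_events(balances)
```

A flagged index could then serve as the trigger for proactive outreach rather than waiting for the next scheduled model score.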

Read the paper (PDF)

Public water supplies contain disease-causing microorganisms in the water or distribution ducts. To kill off these pathogens, a disinfectant, such as chlorine, is added to the water. Chlorine is the most widely used disinfectant in all US water treatment facilities. Chlorine is known to be one of the most powerful disinfectants to restrict harmful pathogens from reaching the consumer. In the interest of obtaining a better understanding of what variables affect the levels of chlorine in the water, this presentation analyzed a particular set of water samples randomly collected from locations in Orange County, Florida. Thirty water samples were collected and their chlorine level, temperature, and pH were recorded. A linear regression analysis was performed on the data collected with several qualitative and quantitative variables. Water storage time, temperature, time of day, location, pH, and dissolved oxygen level were the independent variables collected from each water sample. All data collected was analyzed using various SAS^® procedures. Partial residual plots were used to determine possible relationships between the chlorine level and the independent variables. A stepwise selection was used to eliminate possible insignificant predictors. From there, several possible models for the data were selected. F-tests were conducted to determine which of the models appeared to be the most useful.

View the e-poster or slides (PDF)

Trends in predictive modeling, specifically in machine learning, are moving toward automated approaches where well-known predictive algorithms are applied and the best one is chosen according to some evaluation metric. This approach's efficacy relies on the underlying data preparation in general, and on data preprocessing in particular. Commonly used data preprocessing techniques include missing value imputation, outlier detection and treatment, functional transformation, discretization, and nominal grouping. Excluding toy problems, the composition of the best data preprocessing step depends on the modeling task at hand. This necessitates an iterative generation and evaluation of predictive pipelines, which consist of a mix of data preprocessing techniques and predictive algorithms. This is a combinatorial problem that can be a bottleneck in the analytics workflow. In this paper, we discuss the SAS^® Cloud Analytic Services (CAS) actions in SAS^® Viya that can be used to effect this end-to-end predictive pipeline in a scalable way, with special emphasis on CAS actions for data exploration, preprocessing, and feature transformation. In addition, we discuss how the whole process can be automated.

As part of the Talanx Group, HDI Insurance has been one of the leading insurers in Brazil. Recently HDI Brazil implemented an innovative and integrated solution to prevent fraud in the Auto Claims process based on SAS^® Fraud Framework and SAS^® Real-time Decision Manager. A car fix or a refund is approved immediately after the claim registration for those customers who have no suspicious information. On the other hand, the high-scored claims are checked by the inspectors using SAS^® Social Network Analysis. In terms of analytics, the solution has a hybrid approach working with predictive models, business rules, anomalies, and network relationship. The main benefits are a reduction in the amount of fraud, more accuracy in determining the claims to be investigated, a decrease in the false-positive rate, and the use of a relationship network to investigate suspicious connections.

Read the paper (PDF)

This paper discusses a specific example of using graph analytics or social network analysis (SNA) in predictive modeling in the life insurance industry. The methods of social network analysis are applied to agents that share compensation, and the results are used to derive input variables for a model to predict the likelihood of certain behavior by insurance agents. Both SAS^® code and SAS^® Enterprise Miner are used to illustrate implementing different graph analytical methods. This paper assumes that the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression and, for some sections, with SAS Enterprise Miner.

Read the paper (PDF)

Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between the SAS^® data sets. In projects that span multiple years (e.g., longitudinal studies), there are usually thousands of new variables introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables. However, they differ only in a digit or two that usually signifies the year number of the project. So, every year, there is this extensive task of comparing thousands of new variables to older variables for the sake of carrying forward key database elements corresponding to the previously defined variables. These elements can include the length of the variable, data type, format, discrete or continuous flag, and so on. In our SAS program, hash objects are efficiently used to cut down not only time, but also the number of DATA and PROC steps used to accomplish the task. Clean and lean code is much easier to understand. A macro is used to create the data set containing new and older variables. For a specific new variable, the FIND method in hash objects is used in a loop to find the match to the most recent older variable. What was taking around a dozen PROC SQL steps is now a single DATA step using hash tables.
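The matching logic described above can be pictured with an in-memory hash lookup. The sketch below uses a Python dictionary in place of a SAS hash object and assumes a hypothetical naming scheme in which the year is a one- or two-digit suffix (for example, `INCOME06`); the variable names and metadata fields are invented for illustration.

```python
import re

# Assumed naming scheme: a stem followed by a one- or two-digit year.
YEAR_SUFFIX = re.compile(r"^(.*?)(\d{1,2})$")

def split_name(name):
    """Split 'INCOME06' into ('INCOME', 6); return (name, None) if no suffix."""
    m = YEAR_SUFFIX.match(name)
    if m is None:
        return name, None
    return m.group(1), int(m.group(2))

def build_lookup(old_vars):
    """Map each stem to (year, metadata) of its most recent older variable."""
    lookup = {}
    for name, meta in old_vars.items():
        stem, year = split_name(name)
        if year is None:
            continue
        if stem not in lookup or year > lookup[stem][0]:
            lookup[stem] = (year, meta)
    return lookup

def carry_forward(new_names, lookup):
    """Attach the most recent older metadata to each new variable, if any."""
    result = {}
    for name in new_names:
        stem, _ = split_name(name)
        hit = lookup.get(stem)  # single keyed lookup, like a hash FIND
        result[name] = hit[1] if hit else None
    return result

old_vars = {
    "INCOME05": {"length": 8, "type": "num", "format": "DOLLAR10."},
    "INCOME06": {"length": 8, "type": "num", "format": "DOLLAR12."},
    "SMOKE06": {"length": 1, "type": "char", "format": "$YN."},
}
matched = carry_forward(["INCOME07", "SMOKE07", "BMI07"], build_lookup(old_vars))
```

Because every new variable is resolved with one keyed lookup, the whole comparison runs in a single pass, which mirrors the move from many SQL joins to one DATA step.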

Read the paper (PDF)

This paper uses a simulation comparison to evaluate quantile approximation methods in terms of their practical usefulness and potential applicability in an operational risk context. A popular method in modeling the aggregate loss distribution in risk and insurance is the Loss Distribution Approach (LDA). Many banks currently use the LDA for estimating regulatory capital for operational risk. The aggregate loss distribution is a compound distribution resulting from a random sum of losses, where the losses are distributed according to some severity distribution and the number (of losses) distributed according to some frequency distribution. In order to estimate the regulatory capital, an extreme quantile of the aggregate loss distribution has to be estimated. A number of numerical approximation techniques have been proposed to approximate the extreme quantiles of the aggregate loss distribution. We use PROC SEVERITY to fit various severity distributions to simulated samples of individual losses from a preselected severity distribution. The accuracy of the approximations obtained is then evaluated against a Monte Carlo approximation of the extreme quantiles of the compound distribution resulting from the preselected severity distribution. We find that the second-order perturbative approximation, a closed-form approximation, performs very well at the extreme quantiles and over a wide range of distributions and is very easy to implement.
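The Monte Carlo benchmark described above can be sketched as follows: simulate many aggregate losses as random sums and read off an empirical extreme quantile. The Poisson frequency, lognormal severity, parameter values, and function names here are assumptions chosen for illustration; the paper's preselected distributions may differ, and the perturbative approximation itself is not reproduced.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def aggregate_loss_quantile(level, lam, mu, sigma, n_sims=20000, seed=1):
    """Empirical quantile of a compound Poisson-lognormal aggregate loss."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        n_losses = poisson(rng, lam)  # frequency draw
        # Aggregate loss = random sum of individual severities.
        totals.append(sum(rng.lognormvariate(mu, sigma) for _ in range(n_losses)))
    totals.sort()
    index = max(0, min(n_sims - 1, math.ceil(level * n_sims) - 1))
    return totals[index]

# Hypothetical parameters: about 5 losses per year, lognormal(0, 1) severities.
q999 = aggregate_loss_quantile(0.999, lam=5, mu=0.0, sigma=1.0, n_sims=5000)
```

Extreme quantiles such as 99.9% rest on only a handful of simulated tail observations, which is one reason closed-form approximations are attractive as an alternative.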

Read the paper (PDF)

A Middle Eastern company is responsible for daily distribution of over 230 million liters of oil products. For this distribution network, a failure scenario is defined as occurring when oil transport is interrupted or slows down, and/or when product demands fluctuate outside the normal range. Under all failure scenarios, the company plans to provide additional transport capacity at minimum cost so as to meet all point-to-point product demands. Currently, the company uses a wait-and-see strategy, which carries a high operating cost and depends on the availability of third-party transportation. This paper describes the use of the OPTMODEL procedure to implement a mixed integer programming model to model and solve this problem. Experimental results are provided to demonstrate the utility of this approach. It was discovered that larger instances of the problem, with greater numbers of potential failure scenarios, can become computationally expensive. In order to efficiently handle such instances of the problem, we have also implemented a Benders decomposition algorithm in PROC OPTMODEL.

Read the paper (PDF)

Access to care for Medicaid beneficiaries is a topic of frequent study and debate. Section 1202 of the Affordable Care Act (ACA) requires states to raise Medicaid primary care payment rates to Medicare levels in 2013 and 2014. The federal government paid 100% of the increase. This program was designed to encourage primary care providers to participate in Medicaid, since this has long been a challenge for Medicaid. Whether this fee increase has increased access to primary care providers is still debated. Using SAS^®, we evaluated whether Medicaid patients have a higher incidence of non-urgent visits to local emergency departments (ED) than do patients with other payment sources. The National Hospital Ambulatory Medical Care Survey (NHAMCS) data set, obtained from the Centers for Disease Control (CDC), was selected, since it contains data relating to hospital emergency departments. This emergency room data, for years 2003-2011, was analyzed by diagnosis, expected payment method, reason for the visit, region, and year. To evaluate whether the ED visits were considered urgent or non-urgent, we used the NYU Billings algorithm for classifying ED utilization (NYU Wagner 2015). Three models were used for the analyses: Binary Classification, Multi-Classification, and Regression. In addition to finding no regional differences, decision trees and SAS^® Visual Analytics revealed that Medicaid patients do not have a higher rate of non-emergent visits when compared to other payment types.

Read the paper (PDF)

One of the research goals in public health is to estimate the burden of diseases on the US population. We describe burden of disease by analyzing the statistical association of various diseases with hospitalizations, emergency department (ED) visits, ambulatory/outpatient (doctors' offices) visits, and deaths. In this short paper, we discuss the use of large, nationally representative databases, such as those offered by the National Center for Health Statistics (NCHS) or the Agency for Healthcare Research and Quality (AHRQ), to produce reliable estimates of diseases for studies. In this example, we use SAS^® and SUDAAN to analyze the Nationwide Emergency Department Sample (NEDS), offered by AHRQ, to estimate ED visits for hand, foot, and mouth disease (HFMD) in children less than five years old.

Read the paper (PDF) | View the e-poster or slides (PDF)

Chemical incidents involving irritant chemicals such as chlorine pose a significant threat to life and require rapid assessment. Data from the "Validating Triage for Chemical Mass Casualty Incidents: A First Step" R01 grant was used to determine the most predictive signs and symptoms (S/S) for a chlorine mass casualty incident. SAS^® 9.4 was used to estimate sensitivity, specificity, positive and negative predictive values, and other statistics of irritant gas syndrome agent S/S for two existing systems designed to assist emergency responders in hazardous material incidents: Wireless Information System for Emergency Responders (WISER) and CHEMM Intelligent Syndrome Tool (CHEMM-IST). The results for WISER showed the sensitivity was .72 to 1.0; specificity .25 to .47; and the positive predictive value and negative predictive value were .04 to .87 and .33 to 1.0, respectively. The results for CHEMM-IST showed the sensitivity was .84 to .97; specificity .29 to .45; and the positive predictive value and negative predictive value were .18 to .42 and .86 to .97, respectively.
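The four statistics reported above all derive from a 2x2 table that crosses a tool's prediction with the true exposure status. The short sketch below shows the standard definitions; the function name and the counts are hypothetical and do not come from the study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 diagnostic statistics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # exposed cases correctly flagged
        "specificity": tn / (tn + fp),  # unexposed cases correctly cleared
        "ppv": tp / (tp + fp),          # flagged cases that are truly exposed
        "npv": tn / (tn + fn),          # cleared cases that are truly unexposed
    }

# Hypothetical counts for one sign or symptom.
metrics = diagnostic_metrics(tp=84, fp=60, fn=16, tn=40)
```

With these made-up counts, sensitivity is high while specificity is low, the same qualitative pattern the abstract reports for both tools.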

Read the paper (PDF) | View the e-poster or slides (PDF)

Transformation of raw data into sensible and useful information for prediction purposes is a priceless skill nowadays. Vast amounts of data, easily accessible at each step in a process, give us a great opportunity to use it for countless applications. Unfortunately, not all of the valuable data is available for processing using classical data mining techniques. What happens if textual data is also used to create the analytical base table (ABT)? The goal of this study is to investigate whether scoring models that also use textual data are significantly better than models that include only quantitative data. This thesis is focused on estimating the probability of default (PD) for the social lending platform kokos.pl. The same methods used in banks are used to evaluate the accuracy of reported PDs. Data used for analysis is gathered directly from the platform via the API. This paper describes in detail the steps of the data mining process that is built using SAS^® Enterprise Miner. The results of the study support the thesis that models with a properly conducted text-mining process have better classification quality than models without text variables. Therefore, the use of this data mining approach is recommended when input data includes text variables.

Read the paper (PDF)

Panel data, which are collected on a set (panel) of individuals over several time points, are ubiquitous in economics and other analytic fields because their structure allows for individuals to act as their own control groups. The PANEL procedure in SAS/ETS^® software models panel data that have a continuous response, and it provides many options for estimating regression coefficients and their standard errors. Some of the available estimation methods enable you to estimate a dynamic model by using a lagged dependent variable as a regressor, thus capturing the autoregressive nature of the underlying process. Including lagged dependent variables introduces correlation between the regressors and the residual error, which necessitates using instrumental variables. This paper guides you through the process of using the typical estimation method for this situation, the generalized method of moments (GMM), and the process of selecting the optimal set of instrumental variables for your model. Your goal is to achieve unbiased, consistent, and efficient parameter estimates that best represent the dynamic nature of the model.

Read the paper (PDF)

In many healthcare settings, patients are like customers: they have a choice. One example is whether to participate in a procedure. In population-based screening in which the goal is to reduce deaths, the success of a program hinges on the patient's choice to accept and comply with the procedure. As in many other industries, this relies not only on the program attracting new eligible patients to attend for the first time, but also on the ability of the program to retain existing customers. The success of a new customer retention strategy within a breast screening environment is examined by applying a population-averaged model (also known as a marginal model), which uses generalized estimating equations (GEEs) to account for the lack of independence of the observations. Arguments for why a population-averaged model was applied instead of a mixed effects model (or random effects model) are provided. This business case provides a great introductory session for people to better understand the difference between mixed effects and marginal models, and illustrates how to implement a population-averaged model within SAS^® by using the GENMOD procedure.

Read the paper (PDF)

The challenge is to assign outbound calling agents in a telemarketing campaign to geographic districts. The districts have a variable number of leads, and each agent needs to be assigned entire districts with the total number of leads being as close as possible to a specified number for each of the agents (usually, but not always, an equal number). In addition, there are constraints concerning the distribution of assigned districts across time zones, in order to maximize productivity and availability. The SAS/OR^® CLP procedure solves the problem by formulating the challenge as a constraint satisfaction problem (CSP). Our use of PROC CLP places the actual leads within a specified percentage of the target number.

Read the paper (PDF)

Graphs are mathematical structures capable of representing networks of objects and their relationships. Clustering is an area in graph theory where objects are split into groups based on their connections. Depending on the application domain, object clusters have various meanings (for example, in market basket analysis, clusters are families of products that are frequently purchased together). This paper provides a SAS^® macro featuring PROC OPTGRAPH, which enables the transformation of transactional data, or any data with a many-to-many relationship between two entities, into graph data, allowing for the generation and application of the co-occurrence graph and the probability graph.
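To make the transformation from transactional data to graph data concrete, the sketch below builds a co-occurrence graph from (basket, item) pairs in plain Python. It does not use PROC OPTGRAPH, and the "probability graph" shown is one plausible reading (the share of baskets containing one item that also contain the other), which may differ from the paper's definition. All names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_graph(transactions):
    """transactions: (basket_id, item) pairs. Returns {(item_a, item_b): count}."""
    baskets = defaultdict(set)
    for basket_id, item in transactions:
        baskets[basket_id].add(item)
    edges = Counter()
    for items in baskets.values():
        # One undirected edge per unordered pair of items in the same basket.
        for pair in combinations(sorted(items), 2):
            edges[pair] += 1
    return edges

def probability_graph(transactions):
    """Directed edge (a, b) weighted by the share of a's baskets that contain b."""
    edges = cooccurrence_graph(transactions)
    # Count baskets per item, ignoring duplicate rows.
    item_counts = Counter(item for _, item in set(transactions))
    probs = {}
    for (a, b), n in edges.items():
        probs[(a, b)] = n / item_counts[a]
        probs[(b, a)] = n / item_counts[b]
    return probs

# Hypothetical market basket data.
sales = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "bread"),
]
edges = cooccurrence_graph(sales)
probs = probability_graph(sales)
```

The resulting edge list is the kind of input a graph clustering routine would consume to find families of products purchased together.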

Read the paper (PDF)

Bivariate Cox proportional hazards models are used when we test the association between a single covariate and the outcome. The test repeats for each covariate of interest. SAS^® uses the last category as the default reference. This raises problems when we want to keep using 0 as our reference for each covariate. The reference group can be changed in the CLASS statement. But, if a format is associated with a covariate, we have to use the corresponding format instead of raw numeric data. This problem becomes even worse when we have to repeat the test and manually enter the reference every single time. This presentation demonstrates one way of fixing the problem using the MACRO function and SYMPUT function.

Read the paper (PDF) | Download the data file (ZIP)

Weight of evidence (WOE) coding of a nominal or discrete variable X is widely used when preparing predictors for use in binary logistic regression models. The concept of WOE is extended to ordinal logistic regression for the case of the cumulative logit model. If the target (dependent) variable has L levels, then L-1 WOE variables are needed to recode X. The appropriate setting for implementing WOE coding is the cumulative logit model with partial proportional odds. As in the binary case, it is important to bin X to achieve parsimony before the WOE coding. SAS^® code to perform this binning is discussed. An example is given that shows the implementation of WOE coding and binning for a cumulative logit model with the target variable having three levels.
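One way to picture the L-1 WOE variables is to apply the familiar binary WOE formula at each cumulative cut of the ordered target. The sketch below assumes that construction (target at or below the cut versus above it), which is an interpretation rather than the paper's confirmed definition. It also assumes every bin of X has nonzero counts on both sides of every cut; empty cells would need separate handling. The function name and data are hypothetical.

```python
import math
from collections import defaultdict

def cumulative_woe(records, levels):
    """records: (bin_label, target_level) pairs; levels: ordered target levels.

    Returns {cut_index: {bin_label: woe}} with len(levels) - 1 cuts.
    """
    rank = {level: i for i, level in enumerate(levels)}
    woe = {}
    for cut in range(len(levels) - 1):
        low = defaultdict(int)   # counts with target at or below the cut
        high = defaultdict(int)  # counts with target above the cut
        for bin_label, target in records:
            if rank[target] <= cut:
                low[bin_label] += 1
            else:
                high[bin_label] += 1
        low_total, high_total = sum(low.values()), sum(high.values())
        # Binary WOE per bin: log of the ratio of the two column shares.
        woe[cut] = {
            b: math.log((low[b] / low_total) / (high[b] / high_total))
            for b in set(low) | set(high)
        }
    return woe

# Hypothetical binned predictor with a three-level target A < B < C.
records = (
    [("x1", "A")] * 2 + [("x1", "B")] + [("x1", "C")]
    + [("x2", "A")] + [("x2", "B")] + [("x2", "C")] * 2
)
woe = cumulative_woe(records, ["A", "B", "C"])
```

With three target levels, the result holds two WOE mappings, one per cut, matching the L-1 count stated above.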

Read the paper (PDF)

In the last few years, machine learning and statistical learning methods have gained increasing popularity among data scientists and analysts. Statisticians have sometimes been reluctant to embrace these methodologies, partly due to a lack of familiarity, and partly due to concerns with interpretability and usability. In fact, statisticians have a lot to gain by using these modern, highly computational tools. For certain types of problems, machine learning methods can be much more accurate predictors than traditional methods for regression and classification, and some of these methods are particularly well suited for the analysis of big and wide data. Many of these methods have origins in statistics or at the boundary of statistics and computer science, and some are already well established in statistical procedures, including LASSO and elastic net for model selection and cross validation for model evaluation. In this talk, I go through some examples illustrating the application of machine learning methods and interpretation of their results, and show how these methods are similar to, and differ from, traditional statistical methodology for the same problems.

Read the paper (PDF)

We are at a tipping point for credit risk modeling. To meet the technical and regulatory challenges of IFRS 9 and stress testing, and to strengthen model risk management, CBS aims to create an integrated, end-to-end, tools-based solution across the model lifecycle, with strong governance and controls and an improved scenario testing and forecasting capability. SAS has been chosen as the technology partner to enable CBS to meet these aims. A new predictive analytics platform combining well-known tools such as SAS^® Enterprise Miner, SAS^® Model Manager, and SAS^® Data Management alongside SAS^® Model Implementation Platform powered by SAS^® High-Performance Risk is being deployed. Driven by technology, CBS has also considered the operating model for credit risk, restructuring resources around the new technology with clear lines of accountability, and has incorporated a dedicated data engineering function within the risk modeling team. CBS is creating a culture of collaboration across credit risk that supports the development of technology-led, innovative solutions that not only meet regulatory and model risk management requirements but that set a platform for the effective use of predictive analytics enterprise-wide.

In 2013, the Centers for Medicare & Medicaid Services (CMS) changed the pharmacy mail-order member-acquisition process so that Humana Pharmacy may only call a member with cost savings greater than $2.00 to educate the member on the potential savings and instruct the member to call back. The Rx Education call center asked for analytics work to help prioritize member outreach, improve conversions, and decrease the number of members who are unable to be contacted. After a year of contacting members using this additional insight, the conversions-after-agreement rate rose from 71.5% to 77.5% and the unable-to-contact rate fell from 30.7% to 17.4%. This case study takes you on an analytics journey from the initial problem diagnosis and analytics solution through refinements and test-and-learn campaigns.