Data Mining, Predictive Modeling, Machine Learning, and Text Analytics Papers A-Z

A
Session 1066-2017:
A Big-Data Challenge: Visualizing Social Media Trends about Cancer Using SAS® Text Miner
Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a big data conference, data scientists from several companies were invited to participate in tackling this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach using SAS® analytical methods was superior to other big data approaches in investigating these trends.
Read the paper (PDF)
Scott Koval, Pinnacle Solutions, Inc
Yijie Li, Pinnacle Solutions, Inc
Mia Lyst, Pinnacle Solutions, Inc
Session 1256-2017:
A Comparison of Machine Learning Methods and Logistic Analysis for the Prediction of Past-Due Amount
This poster shows how to predict a past-due amount using traditional and machine learning techniques: logistic analysis, k-nearest neighbors, and random forest. The data set that was analyzed is about real-world commerce. It contains 305 categories of financial information from more than 11,787,287 unique businesses, from 2006 to 2014. The big challenge is how to handle big, noisy, real-world data sets. The first step of any model-building exercise is to define the outcome. A common prediction method in the financial services industry is to use binary outcomes, such as Good and Bad. For our research problem, we reduced past-due amounts into two cases, Good and Bad. Next, we built a two-stage model using the logistic regression method; that is, the first stage predicts the likelihood of a Bad outcome, and the second predicts a past-due amount, given a Bad outcome. Logistic analysis as a traditional statistical technique is commonly used for prediction and classification in the financial services industry. However, for analyzing big, noisy, or complex data sets, machine learning techniques are typically preferred to detect hard-to-discern patterns. To compare the techniques, we use predictive accuracy, the ROC index, sensitivity, and specificity as criteria.
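As a rough illustration of the two-stage approach described above, the following sketch (with hypothetical data set and variable names) fits a logistic model for the Good/Bad outcome and then a regression for the past-due amount among the Bad cases:

    /* Stage 1: likelihood of a Bad outcome (names are illustrative only) */
    proc logistic data=work.train;
       class industry / param=ref;
       model bad_flag(event='1') = industry revenue years_in_business;
       output out=work.stage1 p=p_bad;
    run;

    /* Stage 2: past-due amount, conditional on a Bad outcome */
    proc reg data=work.train(where=(bad_flag=1));
       model pastdue_amt = revenue years_in_business;
    run;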
Jie Hao, Kennesaw State University
Peter Eberhardt, Fernwood Consulting Group Inc.
Session 0823-2017:
A Hadoop Journey along the SAS® Road
As the open-source community has been taking the technology world by storm, especially in the big data space, large corporations such as SAS, IBM, and Oracle have been working to embrace this new, quickly evolving ecosystem to continue to foster innovation and to remain competitive. For example, SAS, IBM, and others have aligned with the Open Data Platform initiative and are continuing to build out Hadoop and Spark solutions. And, Oracle has partnered with Cloudera to create the Big Data Appliance. This movement challenges companies that are consuming these products to select the right products and support partners. The hybrid approach using all tools available seems to be the methodology chosen by most successful companies. West Corporation, an Omaha-based provider of technology-enabled communication solutions, is no exception. West has been working with SAS for 10 years in the ETL, BI, and advanced analytics space, and West began its Hadoop journey a year ago. This paper focuses on how West data teams use both technologies to improve customer experience in the interactive voice response (IVR) system by storing massive semi-structured call logs in HDFS and by building models that predict a caller's intent to route the caller more efficiently and to reduce customer effort using familiar SAS code and the user-friendly SAS® Enterprise Miner™.
Read the paper (PDF)
Sumit Sukhwani, West Corporation
Krutharth Peravalli, West Corporation
Amit Gautam, West Corporation
Session SAS0634-2017:
An Efficient Way to Deploy and Run Text Analytics Models in Hadoop
Significant growth of the Internet has created an enormous volume of unstructured text data. In recent years, the amount of this type of data that is available for analysis has exploded. While the amount of textual data is increasing rapidly, an ability to obtain key pieces of information from such data in a fast, flexible, and efficient way is still posing challenges. This paper introduces SAS® Contextual Analysis In-Database Scoring for Hadoop, which integrates SAS® Contextual Analysis with the SAS® Embedded Process. SAS® Contextual Analysis enables users to customize their text analytics models in order to realize the value of their text-based data. The SAS® Embedded Process enables users to take advantage of SAS® Scoring Accelerator for Hadoop to run scoring models. By using these key SAS® technologies, the overall experience of analyzing unstructured text data can be greatly improved. The paper also provides guidelines and examples on how to publish and run category, concept, and sentiment models for text analytics in Hadoop.
Read the paper (PDF)
Seung Lee, SAS
Xu Yang, SAS
Saratendu Sethi, SAS
Session SAS1492-2017:
An Overview of SAS® Visual Data Mining and Machine Learning on SAS® Viya™
Machine learning is in high demand. Whether you are a citizen data scientist who wants to work interactively or you are a hands-on data scientist who wants to code, you have access to the latest analytic techniques with SAS® Visual Data Mining and Machine Learning on SAS® Viya™. This offering surfaces in-memory machine learning techniques such as gradient boosting, factorization machines, neural networks, and much more through its interactive visual interface, SAS® Studio tasks, procedures, and a Python client. Learn about this multi-faceted new product and see it in action.
Read the paper (PDF)
Jonathan Wexler, SAS
Susan Haller, SAS
Session 1370-2017:
Analyzing the Predictive Power of Political and Social Factors in Determining Country Risk
Sovereign risk rating and country risk rating are conceptually distinct in that the former captures the risk of a country defaulting on its commercial debt obligations using economic variables, while the latter covers the downside of a country's business environment, including political and social variables alongside economic variables. Through this paper, we seek to understand the differences between these risk approaches in assessing a country's creditworthiness by statistically examining the predictive power of political and social variables in determining country risk. To do this, we build two models: the first with economic variables as regressors (sovereign risk model) and the second with economic, political, and social variables as regressors (country risk model), and we compare the predictive power of the regressors and the model performance metrics between the two. This will be an OLS regression model with the country risk rating obtained from S&P as the target variable. With the general assumption that economic variables are driven by political processes and social factors, we would like to see whether the second model has better predictive power. The economic, political, and social indicators data that will be used as independent variables in the model will be obtained from World Bank Open Data, and the target variable (country risk rating) will be obtained from S&P country risk ratings data.
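A minimal sketch of the planned comparison, with hypothetical World Bank indicator names, uses two labeled MODEL statements in PROC REG so that fit statistics for the sovereign risk and country risk specifications can be compared side by side:

    proc reg data=work.country_year;
       sovereign: model risk_rating = gdp_growth inflation debt_to_gdp;
       country:   model risk_rating = gdp_growth inflation debt_to_gdp
                                      political_stability rule_of_law;
    run;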
View the e-poster or slides (PDF)
Bhuvaneswari Yallabandi, Oklahoma State University
Vishwanath Srivatsa Kolar Bhaskara, Oklahoma State University
Session SAS0282-2017:
Applying Text Analytics and Machine Learning to Assess Consumer Financial Complaints
The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable application of text analytics techniques to the publicly available CFPB data. Specifically, we use SAS® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS® Visual Analytics and SAS® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.
Read the paper (PDF)
Tom Sabo, SAS
Session SAS0514-2017:
Automated Hyperparameter Tuning for Effective Machine Learning
Machine learning predictive modeling algorithms are governed by hyperparameters that have no clear defaults agreeable to a wide range of applications. A few examples of quantities that must be prescribed for these algorithms are the depth of a decision tree, number of trees in a random forest, number of hidden layers and neurons in each layer in a neural network, and degree of regularization to prevent overfitting. Not only do ideal settings for the hyperparameters dictate the performance of the training process, but more importantly they govern the quality of the resulting predictive models. Recent efforts to move from a manual or random adjustment of these parameters have included rough grid search and intelligent numerical optimization strategies. This paper presents an automatic tuning implementation that uses SAS/OR® local search optimization for tuning hyperparameters of modeling algorithms in SAS® Visual Data Mining and Machine Learning. The AUTOTUNE statement in the NNET, TREESPLIT, FOREST, and GRADBOOST procedures defines tunable parameters, default ranges, user overrides, and validation schemes to avoid overfitting. Given the inherent expense of training numerous candidate models, the paper addresses efficient distributed and parallel paradigms for training and tuning in SAS® Viya™. It also presents sample tuning results that demonstrate improved model accuracy over default configurations and offers recommendations for efficient and effective model tuning.
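A minimal sketch of the AUTOTUNE statement in PROC GRADBOOST appears below; the CAS table and variable names are hypothetical, and tuning options beyond the bare statement are omitted:

    proc gradboost data=mycas.train;
       target bad_flag / level=nominal;
       input x1-x20 / level=interval;
       autotune;   /* search the tunable hyperparameters instead of accepting defaults */
    run;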
Read the paper (PDF)
Patrick Koch, SAS
Brett Wujek, SAS
Oleg Golovidov, SAS
Steven Gardner, SAS
B
Session SAS0761-2017:
Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis
As in any analytical process, data sampling is a definitive step for unstructured data analysis. Sampling is of paramount importance if your data is fed from social media reservoirs such as Twitter, Facebook, Amazon, and Reddit, where information proliferation happens minute by minute. Unless you have a sophisticated analytical engine and robust physical servers to handle billions of pieces of data, you can't use all your data for analysis without sampling. So, how do you sample textual data? The standard method is to generate either a simple random sample, or a stratified random sample if a stratification variable exists in the data. Neither of these two methods can reliably produce a representative sample of documents from the population data simply because the process does not encompass a step to ensure that the distribution of terms between the population and sample sets remains similar. This shortcoming can cause the supervised or unsupervised learning to yield inaccurate results. If the generated sample is not representative of the population data, it is difficult to train and validate categories or sub-categories for those rare events during taxonomy development. In this paper, we show you new methods for sampling text data. We rely on a term-by-document matrix and SAS® macros to generate representative samples. Using these methods helps generate sufficient samples to train and validate every category in rule-based modeling approaches using SAS® Contextual Analysis.
Read the paper (PDF)
Murali Pagolu, SAS
Session SAS0474-2017:
Building Bayesian Network Classifiers Using the HPBNET Procedure
A Bayesian network is a directed acyclic graphical model that represents probability relationships and conditional independence structure between random variables. SAS® Enterprise Miner implements a Bayesian network primarily as a classification tool; it includes naïve Bayes, tree-augmented naïve Bayes, Bayesian-network-augmented naïve Bayes, parent-child Bayesian network, and Markov blanket Bayesian network classifiers. The HPBNET procedure uses a score-based approach and a constraint-based approach to model network structures. This paper compares the performance of Bayesian network classifiers to other popular classification methods, such as classification tree, neural network, logistic regression, and support vector machines. The paper also shows some real-world applications of the implemented Bayesian network classifiers and a useful visualization of the results.
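A bare-bones call to the HPBNET procedure might look like the sketch below; the data set and variables are hypothetical, and the STRUCTURE= option shown here (assumed to request a tree-augmented naïve Bayes network) should be checked against the procedure documentation:

    proc hpbnet data=work.train structure=TAN;   /* STRUCTURE= value is an assumption */
       target default_flag;
       input x1-x5;
    run;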
Read the paper (PDF)
Ye Liu, SAS
Weihua Shi, SAS
Wendy Czika, SAS
C
Session 1484-2017:
Can Incumbents Take the Digital Curve?!
Digital transformation and analytics for incumbents aren't a question of choice or strategy. They're a question of business survival. Go analytics!
Liav Geffen, Harel Insurance & Finance
Session SAS1414-2017:
Churn Prevention in the Telecom Services Industry: A Systematic Approach to Prevent B2B Churn Using SAS®
"It takes months to find a customer and only seconds to lose one" (Unknown). Though the Business-to-Business (B2B) churn problem might not be as common as Business-to-Consumer (B2C) churn, it has become crucial for companies to address this effectively as well. Using statistical methods to predict churn is the first step in the process of retaining customers, which also includes model evaluation, prescriptive analytics (including outreach optimization), and performance reporting. Providing visibility into model and treatment performance enables the Data and Ops teams to tune models and adjust treatment strategy. West Corporation's Center for Data Science (CDS) has partnered with one of its lines of business in order to measure and prevent B2B customer churn. CDS has coupled firmographic and demographic data with internal CRM and past outreach data to build a Propensity to Churn model using SAS®. CDS has provided the churn model output to an internal Client Success Team (CST), who focuses on high-risk/high-value customers in order to understand and provide resolution to any potential concerns that might be expressed by such customers. Furthermore, CDS automated weekly performance reporting using SAS and Microsoft Excel that not only focuses on model statistics, but also on CST actions and impact. This paper focuses on all of the steps involved in the churn-prevention process, including building and reviewing the model, treatment design and implementation, as well as performance reporting.
Krutharth Peravalli, West Corporation
Dmitriy Khots, West Corporation
Session 1312-2017:
Construction of a Disease Network and a Prediction Model for Dementia
Most studies of human disease networks have estimated the associations between disorders primarily from gene or protein information. Those studies, however, face difficulties because of the massive volume of data and the huge computational cost. Instead, we constructed a human disease network that can describe the associations between diseases, using claims data from Korean health insurance. Through several statistical analyses, we show the applicability and suitability of the disease network. Furthermore, we develop a statistical model that can predict the prevalence rate of dementia by using statistically significant associations in the network.
Read the paper (PDF)
Jinwoo Cho, Sung Kyun Kwan University
Session 1183-2017:
Creating Personal Game Statistics for Video Game Events: Examples from EVE Online and Ingress
Computer and video games are complex these days. Events in video games are in some cases recorded automatically in text files, creating a history or story of game play. There are countable items in these event records that can be used as data for statistics and other types of modeling. This E-Poster shows you how to statistically analyze text files for video game events using SAS®. Two games are analyzed. EVE Online, a massive multi-user online role-playing spaceship game, is one. The other game is Ingress, a cell phone game that combines exercise with a GPS and real-world environments. In both examples, the techniques involve parsing large amounts of text data to examine recurring patterns in text that describe events in the game play.
View the e-poster or slides (PDF)
Peter Timusk, Statistics Canada
D
Session 1089-2017:
Developing a Predictive Model of Physician Attribution of Patient Satisfaction Surveys
For all healthcare systems, considerable attention and resources are directed at gauging and improving patient satisfaction. Dignity Health has made considerable efforts in improving most areas of patient satisfaction. However, improving metrics around physician interaction with patients has been challenging. Failure to improve these publicly reported scores can result in reimbursement penalties, damage to Dignity's brand, and an increased risk of patient harm. One possible way to improve these scores is to better identify the physicians that present the best opportunity for positive change. Currently, the survey tool mandated by the Centers for Medicare and Medicaid Services (CMS), the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS), has three questions centered on patient experience with providers, specifically concerning listening, respect, and clarity of conversation. For purposes of relating patient satisfaction scores to physicians, Dignity Health has assigned scores based on the attending physician at discharge. A manual record review determined that this attribution method rarely corresponds with the reviewers' findings (PPV = 20.7%, 95% CI: 9.9% to 38.4%). Using a variety of SAS® tools and predictive modeling programs, we developed a logistic regression model that had better agreement with chart abstractors (PPV = 75.9%, 95% CI: 57.9% to 87.8%). By attributing providers based on this predictive model, opportunities for improvement can be more accurately targeted, resulting in improved patient satisfaction and outcomes while protecting fiscal health.
Read the paper (PDF)
Ken Ferrell, Dignity Health
Session 0801-2017:
Don't Let Your Annual Report Be Such a Manual Report: Combining Text, Graphs, and Tables in One Doc
Learn neat new (and not so new) methods for joining text, graphs, and tables in a single document. This paper shows how you can create a report that includes all three with a single solution: SAS®. The text portion is taken from a Microsoft Word document and joined with output from the GPLOT and REPORT procedures.
Read the paper (PDF)
Ben Cochran, The Bedford Group, Inc.
E
Session 1139-2017:
Enabling Advanced Customer Value Management with SAS®
Join this breakout session hosted by the Customer Value Management team from Saudi Telecommunications Company to understand the journey we took with SAS® to evolve from simple below-the-line campaign communication to advanced customer value management (CVM). Learn how the team leveraged SAS tools ranging from SAS® Enterprise Miner to SAS® Customer Intelligence Suite in order to gain a deeper understanding of customers and move toward targeting customers with the right offer at the right time through the right channel.
Read the paper (PDF)
Noorulain Malik, Saudi Telecom
Session 1057-2017:
Enhancing Customer Experience through Text Analysis of Survey Comments
Customer feedback is a critical aspect of businesses in today's world as it is invaluable in determining what customers like and dislike about the business's service. This loop of regularly listening to customers' voice through survey comments and improving services based on it leads to better business, and more importantly, to an enhancement in customer experience. The challenge is to classify and analyze these unstructured text comments to gain insights and to focus on areas of improvement. The purpose of this paper is to illustrate how text mining in SAS® Enterprise Miner™ 14.1 helped one of our clients, a leading financial services company, convert their customers' problems into opportunities. The customers' feedback pertaining to their experience with an Interactive Voice Response (IVR) system is collected by an enterprise feedback management (EFM) company. The comments are then split into two groups, which helps us differentiate customer opinions. This grouping is based on customers who have given a rating of 0 to 6 and a rating of 9 to 10 on a Likert scale of 0 to 10 (10 being extremely satisfied) in the survey questionnaire. Text mining is performed on both these groups, and an algorithm creates clusters that are subsequently used to segment customers based on opinions they are interested in voicing. Furthermore, sentiment scores are calculated for each one of the segments. The scores classify the polarity of customer feedback and prioritize the problems the client needs to focus on.
Read the paper (PDF)
Vinoth Kumar Raja, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
Session 0788-2017:
Examining Higher Education Performance Metrics with SAS® Enterprise Miner™ and SAS® Visual Analytics
Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time bachelor's degree seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS® Visual Analytics, where interactive reports allowed users to easily identify which students were at a high risk for attrition or at risk of not graduating on time.
Read the paper (PDF)
Taylor Blaetz, Western Kentucky University
Tuesdi Helbig, Western Kentucky University
Gina Huff, Western Kentucky University
Matt Bogard, Western Kentucky University
Session SAS0587-2017:
Exploring the Art and Science of SAS® Text Analytics: Best Practices in Developing Rule-Based Models
Traditional analytical modeling, with roots in statistical techniques, works best on structured data. Structured data enables you to impose certain standards and formats in which to store the data values. For example, a variable indicating gas mileage in miles per gallon should always be a number (for example, 25). However, with unstructured data, free-form text no longer limits this information to a single form of expression; the same value might appear as 25 mpg, twenty-five mpg, or 25M/G. The nuances of language, context, and subjectivity of text make it more complex to fit generalized models. Although statistical methods using supervised learning prove efficient and effective in some cases, sometimes you need a different approach. These situations are when rule-based models with Natural Language Processing capabilities can add significant value. In what context would you choose a rule-based modeling versus a statistical approach? How do you assess the tradeoffs of choosing a rule-based modeling approach with higher interpretability versus a statistical model that is black-box in nature? How can we develop rule-based models that optimize model performance without compromising accuracy? How can we design, construct, and maintain a complex rule-based model? What is a data-driven approach to rule writing? What are the common pitfalls to avoid? In this paper, we discuss all these questions based on our experiences working with SAS® Contextual Analysis and SAS® Sentiment Analysis.
Read the paper (PDF)
Murali Pagolu, SAS
Cheyanne Baird, SAS
Christina Engelhardt, SAS
F
Session SAS0388-2017:
Factorization Machines: A New Tool for Sparse Data
Factorization machines are a new type of model that is well suited to very high-cardinality, sparsely observed transactional data. This paper presents the new FACTMAC procedure, which implements factorization machines in SAS® Visual Data Mining and Machine Learning. This powerful and flexible model can be thought of as a low-rank approximation of a matrix or a tensor, and it can be efficiently estimated when most of the elements of that matrix or tensor are unknown. Thanks to a highly parallel stochastic gradient descent optimization solver, PROC FACTMAC can quickly handle data sets that contain tens of millions of rows. The paper includes examples that show you how to use PROC FACTMAC to recommend movies to users based on tens of millions of past ratings, predict whether fine food will be highly rated by connoisseurs, restore heavily damaged high-resolution images, and discover shot styles that best fit individual basketball players.
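A hedged sketch of a PROC FACTMAC call for a ratings problem follows; the CAS table, variable names, and option values are illustrative rather than taken from the paper:

    proc factmac data=mycas.ratings nfactors=10 maxiter=20;
       target rating / level=interval;
       input userid itemid / level=nominal;
       output out=mycas.scored copyvars=(rating);
    run;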
Read the paper (PDF)
Jorge Silva, SAS
Ray Wright, SAS
Session SAS0686-2017:
Fighting Crime in Real Time with SAS® Visual Scenario Designer
Credit card fraud. Loan fraud. Online banking fraud. Money laundering. Terrorism financing. Identity theft. The strains that modern criminals are placing on financial and government institutions demand new approaches to detecting and fighting crime. Traditional methods of analyzing large data sets on a periodic, batch basis are no longer sufficient. SAS® Event Stream Processing provides a framework and run-time architecture for building and deploying analytical models that run continuously on streams of incoming data, which can come from virtually any source: message queues, databases, files, TCP/IP sockets, and so on. SAS® Visual Scenario Designer is a powerful tool for developing, testing, and deploying aggregations, models, and rule sets that run in the SAS® Event Stream Processing Engine. This session explores the technology architecture, data flow, tools, and methodologies that are required to build a solution based on SAS Visual Scenario Designer that enables organizations to fight crime in real time.
Read the paper (PDF)
John Shipway, SAS
Session 1422-2017:
Flags Flying: Avoiding Consistently Penalized Defenses in NFL Fantasy Football
In fantasy football, it is a relatively common strategy to rotate which team's defense a player uses based on some combination of favorable/unfavorable player matchups, recent performance, and projection of expected points. However, there is danger in this strategy because defensive scoring volatility is high, and any team has the possibility of turning in a statistically bad performance in a given week. This paper uses data mining techniques to identify which National Football League (NFL) teams give up high numbers of defensive penalties on a week-to-week basis, and to what degree those high-penalty games correlate with poor team defensive fantasy scores. Examining penalty count and penalty yards allowed totals, we can narrow down which teams are consistently hurt by poor technique and find the correlation between games with high penalty totals and the corresponding fantasy football scores. By doing so, we seek to find which teams should be avoided in fantasy football due to their likelihood of poor performance.
Robert Silverman, Franklin & Marshall College
Session 0881-2017:
Fuzzy Matching and Predictive Models for Acquisition of New Customers
The acquisition of new customers is fundamental to the success of every business. Data science methods can greatly improve the effectiveness of acquiring prospective customers and can contribute to the profitability of business operations. In a business-to-business (B2B) setting, a predictive model might target business prospects as individual firms listed by, for example, Dun & Bradstreet. A typical acquisition model can be defined using a binary response with the values categorizing a firm's customers and non-customers, for which it is then necessary to identify which of the prospects are actually customers of the firm. The methods of fuzzy logic, for example, based on the distance between strings, might help in matching customers' names and addresses with the overall universe of prospects. However, two errors can occur: false positives (when the prospect is incorrectly classified as a firm's customer), and false negatives (when the prospect is incorrectly classified as a non-customer). In the current practice of building acquisition models, these errors are typically ignored. In this presentation, we assess how these errors affect the performance of the predictive model as measured by a lift. In order to improve the model's performance, we suggest using a pre-determined sample of correct matches and calibrating the model's predicted probabilities based on actual take-up rates. The presentation is illustrated with real B2B data and includes elements of SAS® code that was used in the research.
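One way to implement the string-distance matching described above is with the COMPGED function (SPEDIS is a similar alternative); the data sets, blocking key, and thresholds below are hypothetical and would need tuning against a hand-labeled sample of true matches:

    proc sql;
       create table work.candidate_pairs as
       select p.prospect_id, c.customer_id,
              compged(upcase(p.company_name), upcase(c.company_name)) as name_dist,
              compged(upcase(p.address), upcase(c.address))           as addr_dist
       from work.prospects as p, work.customers as c
       where p.zip = c.zip;   /* block on ZIP so the pairwise comparison stays manageable */
    quit;

    data work.matches;
       set work.candidate_pairs;
       match_flag = (name_dist <= 100 and addr_dist <= 200);   /* illustrative cutoffs */
    run;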
Read the paper (PDF)
Daniel Marinescu, Concordia University
G
Session 1530-2017:
Getting Started with Machine Learning
Machine learning algorithms have been available in SAS software since 1979. This session provides practical examples of machine learning applications. The evolution of machine learning at SAS is illustrated with examples ranging from nearest-neighbor discriminant analysis in the SAS/STAT DISCRIM procedure to advanced predictive modeling in SAS Enterprise Miner. Machine learning techniques addressed include memory-based reasoning, decision trees, neural networks, and gradient boosting algorithms.
Terry Woodfield, SAS
H
Session 2000-2017:
Hands-On Workshop: Data Mining using SAS® Enterprise Miner™
This workshop provides hands-on experience with using SAS Enterprise Miner. Workshop participants will learn to do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.
Carlos Andre Reis Pinheiro, SAS
Session SAS2008-2017:
Hands-On Workshop: Python and CAS Integration on SAS® Viya™
Jay Laramore, SAS
Session SAS2006-2017:
Hands-On Workshop: SAS® Visual Data Mining and Machine Learning on SAS® Viya™
This workshop provides hands-on experience with SAS Viya Data Mining and Machine Learning through the programming interface to SAS Viya. Workshop participants will learn how to start and stop a CAS session; move data into CAS; prepare data for machine learning; use SAS Studio tasks for supervised learning; and evaluate the results of analyses.
Carlos Andre Reis Pinheiro, SAS
Session SAS2007-2017:
Hands-On Workshop: SAS® Visual Statistics on SAS® Viya™
Andy Ravenna, SAS
Session SAS2004-2017:
Hands-On Workshop: Text Mining using SAS® Text Miner
This workshop provides hands-on experience using SAS® Text Miner. For a collection of documents, workshop participants will learn how to: read and convert documents for use by SAS Text Miner; retrieve information from the collection using query features of the software; identify the dominant themes and concepts in the collection; and classify documents having pre-assigned categories.
Terry Woodfield, SAS
Session 1082-2017:
Hash Objects: When LAGging Behind Just Doesn't Work
When modeling time series data, we often use a LAG of the dependent variable. The LAG function works great for this, until you try to predict out into the future and need the model's predicted value from one record as an independent value for a future record. This paper examines exactly how the LAG function works, and explains why it doesn't in this case. It also explains how to create a hash object that will accomplish a LAG of any value, how to load the initial data, how to add predicted values to the hash object, and how to extract those values when needed as an independent variable for future observations.
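A minimal sketch of the idea, with hypothetical table and variable names: the hash is keyed by period, preloaded with history, and updated with each new prediction so that the next row can look up its own "lag":

    data work.forecast;
       if _n_ = 1 then do;
          if 0 then set work.history(keep=period value);   /* define host variables */
          declare hash h(dataset: 'work.history');         /* preload actual history */
          h.defineKey('period');
          h.defineData('value');
          h.defineDone();
       end;
       set work.future_periods;                            /* one row per period to forecast */
       call missing(value);
       rc = h.find(key: period - 1);                       /* the "lag": last period's value */
       lag_value = ifn(rc = 0, value, .);
       predicted = 0.8 * lag_value + 5;                    /* stand-in for the real model equation */
       rc = h.replace(key: period, data: predicted);       /* make it available to the next row */
    run;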
Read the paper (PDF)
Andrea Wainwright-Zimmerman, Experis
Session SAS0239-2017:
Help! My New Director of Analytics Wants to Get Rid of SAS®! What Can I Do?
How would you answer this question? Most of us struggle to articulate the value of the tools, techniques, and teams we use when using analytics. How do you help the new director understand the value of SAS® to you, your job, and the company? In this interactive session, you will discover the components that make up total cost of ownership (TCO) as they apply to the analytics lifecycle. What should you consider when you evaluate total cost of ownership and why should you measure it? How can you help your management team understand the value that SAS provides?
Read the paper (PDF)
Melodie Rush, SAS
Session 0340-2017:
How to Use SAS® to Filter Stock for Trade
Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one based on a methodology from so many candidates is usually a big challenge. This paper presents a methodology based on the CANSLIM theorem and the momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in the CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS®. This paper presents methods for applying the logic and presenting it graphically through SAS. All SAS users, especially those who work directly on capital market business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.
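As an illustration of the momentum side of such a screen (not the full CANSLIM logic), a simple DATA step can flag symbols whose trailing 20-day return exceeds a threshold; the names and the 10% cutoff are hypothetical:

    proc sort data=work.prices;          /* one close per symbol per trading day */
       by symbol date;
    run;

    data work.momentum;
       set work.prices;
       by symbol;
       close_20d_ago = lag20(close);
       if first.symbol then count = 0;
       count + 1;
       if count > 20 then mom_20d = close / close_20d_ago - 1;  /* guard against values from the prior symbol */
       buy_candidate = (mom_20d > 0.10);
    run;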
Read the paper (PDF)
Brian Shen, Merlin Clinical Service LLC
I
Session SAS0645-2017:
Identifying Abnormal Equipment Behavior and Filtering Data near the Edge for IoT Applications
What if you had analytics near the edge for your Internet of Things (IoT) devices that would tell you whether a piece of equipment is operating within its normal range? And what if those same analytics could help you intelligently determine what data you should keep and what data should be filtered at the edge? This session focuses on classifying streaming data near the edge by showcasing a demo that implements a single-class classification model within a gateway device. The model identifies observations that are normal and abnormal to help determine possible machine issues and preventative maintenance opportunities. The classification also helps to provide a method for filtering data at the edge by capturing all abnormal data but taking only a sample of the normal operating data. The model is developed using SAS® Viya and implemented on a gateway device using SAS® Event Stream Processing. By using a single-class classification technique, the demo also illustrates how to avoid issues with binary classification that would require failure observations in order to build an accurate model. Problems that this demo addresses include: identifying potential and future equipment failures in near real time; filtering sensor data near the edge to prevent unnecessary transport and storage of less valuable data; and building a classification model for failure that doesn't require observations relating to failures.
Read the paper (PDF)
Ryan Gillespie, SAS
Robert Moreira, SAS
Session 1349-2017:
Inference from Smart Meter Data Using the Fourier Transform
This presentation demonstrates that applying the fast Fourier transform (FFT) to smart meter data can provide enhanced customer segmentation and discovery. The FFT is a mathematical method for transforming a function of time into a function of frequency. It is widely used in analyzing sound but is also relevant for utilities. Advanced Metering Infrastructure (AMI) refers to the full measurement and collection system that includes meters at the customer site and communication networks between the customer and the utility. With the inception of AMI, utilities experienced an explosion of data that provides vast analytical opportunities to improve reliability, customer satisfaction, and safety. However, the data explosion comes with its own challenges. The first challenge is the volume. Consider that just 20,000 customers with AMI data can reach over 300 GB of data per year. Simply aggregating the data from minutes to hours or even days can skew results and not provide accurate segmentations. The second challenge is the bad data that is being collected. Outliers caused by missing or incorrect reads, outages, or other factors must be addressed. FFT can eliminate this noise. The proposed framework is expected to identify various customer segments that could be used for demand response programs. The framework also has the potential to investigate diversion or fraud or failing meters (revenue protection), which is a big problem for many utilities.
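One hedged way to move interval-load data into the frequency domain in SAS is PROC SPECTRA (SAS/ETS), which computes the periodogram via the FFT; the table and variable names below are hypothetical:

    proc sort data=work.ami_reads;
       by customer_id read_datetime;
    run;

    proc spectra data=work.ami_reads out=work.spectrum p s adjmean;
       by customer_id;
       var kwh;
    run;
    /* Dominant daily and weekly frequencies per customer can then feed the segmentation step. */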
Tom Anderson, SAS
Prasenjit Shil, Ameren
Session SAS0539-2017:
Interactive Modeling in SAS® Visual Analytics
SAS® Visual Analytics has two offerings, SAS® Visual Statistics and SAS® Visual Data Mining and Machine Learning, that provide knowledge workers and data scientists an interactive interface for data partition, data exploration, feature engineering, and rapid modeling. These offerings are powered by the SAS® Viya platform, thus enabling big data and big analytic problems to be solved. This paper focuses on the steps a user would perform during an interactive modeling session.
Read the paper (PDF)
Don Chapman, SAS
Jonathan Wexler, SAS
K
Session 1002-2017:
Know Thyself: Diabetes Trend Analysis
Throughout history, the phrase know thyself has been the aspiration of many. The trend of wearable technologies has certainly provided the opportunity to collect personal data. These technologies enable individuals to know thyself on a more sophisticated level. Specifically, wearable technologies that can track a patient's medical profile in a web-based environment, such as continuous blood glucose monitors, are saving lives. The main goal for diabetics is to replicate the functions of the pancreas in a manner that allows them to live a normal, functioning lifestyle. Many diabetics have access to a visual analytics website to track their blood glucose readings. However, they often are unreadable and overloaded with information. Analyzing these readings from the glucose monitor and insulin pump with SAS®, diabetics can parse their own information into more simplified and readable graphs. This presentation demonstrates the ease in creating these visualizations. Not only is this beneficial for diabetics, but also for the doctors that prescribe the necessary basal and bolus levels of insulin for a patient s insulin pump.
View the e-poster or slides (PDF)
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
L
Session 1430-2017:
Linear Model Regularization
Linear regression, which is widely used, can be improved by the inclusion of a penalizing parameter. This helps reduce variance (at the cost of a slight increase in bias) and improves prediction accuracy and model interpretability. The regularization model is implemented on the sample data set, and recommendations for practical use are included.
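A minimal sketch of penalized linear regression in SAS uses PROC GLMSELECT with the LASSO; the data set and variable names are hypothetical:

    proc glmselect data=work.sample;
       partition fraction(validate=0.3);
       model y = x1-x20 / selection=lasso(choose=validate stop=none);
    run;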
View the e-poster or slides (PDF)
Shashank Hebbar, Kennesaw State University
Lili Zhang, Kennesaw State University
Dhiraj Gharana, Kennesaw State University
M
Session 0941-2017:
Maximizing Cross-Sell Opportunities with Predictive Analytics for Financial Institutions
In the increasingly competitive environment for banks and credit unions, every potential advantage should be pursued. One of these advantages is to market additional products to your existing customers rather than to new customers, since your existing customers already know (and hopefully trust) you, and you have so much data on them. But how can this best be done? How can you market the right products to the right customers at the right time? Predictive analytics can do this by forecasting which customers have the highest chance of purchasing a given financial product. This paper provides a step-by-step overview of a relatively simple but comprehensive approach to maximize cross-sell opportunities among your customers. We first prepare the data for a statistical analysis. With some basic predictive analytics techniques, we can then identify those members who have the highest chance of buying a financial product. For each of these members, we can also gain insight into why they would purchase, thus suggesting the best way to market to them. We then make suggestions to improve the model for better accuracy.
Read the paper (PDF)
Nate Derby
Session SAS0434-2017:
Methods of Multinomial Classification Using Support Vector Machines
Many practitioners of machine learning are familiar with support vector machines (SVMs) for solving binary classification problems. Two established methods of using SVMs in multinomial classification are the one-versus-all approach and the one-versus-one approach. This paper describes how to use SAS® software to implement these two methods of multinomial classification, with emphasis on both training the model and scoring new data. A variety of data sets are used to illustrate the pros and cons of each method.
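The one-versus-all idea can be sketched as a macro loop that recodes the target to a binary flag for each class level and fits one SVM per level; the PROC HPSVM call is kept minimal, and the data set, variables, and class levels are hypothetical (scoring new data and assigning the class with the highest predicted probability is omitted here):

    %macro ova(levels=setosa versicolor virginica);
       %local i lev;
       %do i = 1 %to %sysfunc(countw(&levels));
          %let lev = %scan(&levels, &i);
          data work.bin;
             set work.train;
             target_bin = (species = "&lev");   /* 1 for this class, 0 for all others */
          run;
          proc hpsvm data=work.bin;
             input x1-x4 / level=interval;
             target target_bin;
          run;
       %end;
    %mend ova;
    %ova;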
Read the paper (PDF)
Ralph Abbey, SAS
Taiping He, SAS
Tao Wang, SAS
Session 1264-2017:
Mind-Map the Gap: Sentiment Analysis of Public Transport
This paper presents a case study in which social media posts by individuals related to public transport companies in the United Kingdom were collected from social media sites such as Twitter and Facebook and also from forums using SAS® and Python. The posts were then further processed by SAS® Text Miner and SAS® Visual Analytics to retrieve brand names, means of public transport (underground, trains, buses), and any mentioned attributes. Relevant concepts and topics are identified using text mining techniques and visualized using concept maps and word clouds. Later, we aim to identify and categorize sentiment toward public transport in the corpus of the posts. Finally, we create an association map/mind-map of the different service dimensions/topics and the brands of public transport, using correspondence analysis.
Read the paper (PDF)
Tamas Bosznay, Amadeus Software Limited
Session 0820-2017:
Model Risk: Learning from Others' Mistakes
Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss Given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level that are built on these models and that are used to help in achieving business targets, setting risk-sensitive pricing, capital planning, optimizing Return on Equity/Risk Adjusted Return on Capital (ROE/RAROC), managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate, and their predictive power can drop dramatically. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results, without human intervention) can result in a huge error, which might have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS® tools: SAS® Model Monitoring Microservice, SAS® Model Manager, dashboards, and SAS® Visual Analytics.
Read the paper (PDF)
Boaz Galinson, Bank Leumi
Session 1014-2017:
Modeling the Merchandise Return Behavior of Anonymous and Non-Anonymous Online Apparel Retail Shoppers
This paper establishes the conceptualization of the dimension of the shopping cart (or market basket) on apparel retail websites. It analyzes how the cart dimension (describing anonymous shoppers) and the customer dimension (describing non-anonymous shoppers) impact merchandise return behavior. Five data-mining techniques (logistic regression, decision tree, neural network, gradient boosting, and support vector machine) are used for predicting the likelihood of merchandise return. The target variable is a dichotomous response variable: return versus not return. The primary input variables are conceptualized as constituents of the cart dimension, derived from engineering merchandise-related variables such as item style, item size, and item color, as well as free-shipping-related thresholds. By further incorporating the constituents of the customer dimension such as tenure, loyalty membership, and purchase histories, the predictive accuracy of the model built using each of the five data-mining techniques was found to improve substantially. This research also highlights the relative importance of the constituents of the cart and customer dimensions governing the likelihood of merchandise return. Recommendations for possible applications and research areas are provided.
Read the paper (PDF)
Sunny Lam, ANN Inc.
O
Session SAS0747-2017:
Open Your Mind: Use Cases for SAS® and Open-Source Analytics
As a data scientist, you need analytical tools and algorithms, whether commercial or open source, and you have some favorites. But how do you decide when to use what? And how can you integrate their use to your maximum advantage? This presentation provides several best practices for deploying both SAS® and open-source analytical tools to increase productivity and efficiency in your enterprise ecosystem. See an example of a marketing analysis using SAS and R algorithms in SAS® Enterprise Miner to develop a predictive model, and then operationalize that model for performance monitoring and in-database scoring. Also learn about using Python and SAS integration for developing predictive models from a Jupyter Notebook environment. Seeing these cases will help you decide how to improve your analytics with similar integration of SAS and open source.
Read the paper (PDF)
Tuba Islam, SAS
Session 1303-2017:
Optimizing the Analytical Data Life Cycle
The analytical data life cycle consists of four stages: data exploration, data preparation, model development, and model deployment. Traditionally, these stages can consume 80% of the time and resources within your organization. With innovative techniques such as in-database and in-memory processing, managing data and analytics can be streamlined, with an increase in performance, economics, and governance. This session explores how you can optimize the analytical data life cycle with some best practices and tips using SAS® and Teradata.
Tho Nguyen, Teradata
David Hare, SAS
Session 0372-2017:
Outline Outliers: Adding a Business Sense
Outliers, such as unusual, unexpected, or rare events, have received intensive attention from researchers and practitioners because of their impact on estimated statistics and developed models. Today, some business disciplines focus primarily on outliers, such as defaults of credit, operational risks, quality nonconformities, fraud, or even the results of marketing initiatives in highly competitive environments with low response rates of a couple percent or even less. This paper discusses the importance of detecting, isolating, and categorizing business outliers to discover their root causes and to monitor them dynamically. Addressing not only extreme values or multivariable densities when detecting outliers, but also distributions, patterns, clusters, combinations of items, and sequences of events, opens opportunities for business improvement. SAS® Enterprise Miner can be used to perform such detections. Creating special business segments or running specialized outlier-oriented data mining processes, such as decision trees, allows for isolation of business-important outliers, which are normally masked in traditional statistical techniques. This process, combined with 'what-if' scenario generation, prepares businesses for possible future surges even when no outliers of a specific type are currently present. Furthermore, analyzing some specific outliers can play a role in assessing business stability under corresponding stress tests.
Read the paper (PDF)
Alex Glushkovsky, BMO Financial Group
P
Session 1305-2017:
Predicting the Completeness of Clinical Data through Claims Data Matching Using SAS® Enterprise Miner™
Research using electronic health records (EHR) is emerging, but questions remain about its completeness, due in part to the time required for physicians to enter data in all fields. This presentation demonstrates the use of SAS® Enterprise Miner to predict completeness of clinical data using claims data as the standard 'source of truth' against which to compare it. A method for assessing and predicting the completeness of clinical data is presented using the tools and techniques from SAS Enterprise Miner. Some of the topics covered include: tips for preparing your sample data set for use in SAS Enterprise Miner; tips for preparing your sample data set for modeling, including effective use of the Input Data, Data Partition, Filter, and Replacement nodes; and building predictive models using the StatExplore, Decision Tree, Regression, and Model Compare nodes.
View the e-poster or slides (PDF)
Catherine Olson, Optum
Thomas Horstman, Optum
Session 0942-2017:
Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data
The most commonly reported model evaluation metric is the accuracy. This metric can be misleading when the data are imbalanced. In such cases, other evaluation metrics should be considered in addition to the accuracy. This study reviews alternative evaluation metrics for assessing the effectiveness of a model in highly imbalanced data. We used credit card clients in Taiwan as a case study. The data set contains 30,000 instances (22.12% risky and 77.88% non-risky) assessing the likelihood of a customer defaulting on a payment. Three different techniques were used during the model building process. The first technique involved down-sampling the majority class in the training subset. The second used the original imbalanced data whereas prior probabilities were set to account for oversampling in the third technique. The same sets of predictive models were then built for each technique after which the evaluation metrics were computed. The results suggest that model evaluation metrics might reveal more about distribution of classes than they do about the actual performance of models when the data are imbalanced. Moreover, some of the predictive models were identified to be very sensitive to imbalance. The final decision in model selection should consider a combination of different measures instead of relying on one measure. To minimize imbalance-biased estimates of performance, we recommend reporting both the obtained metric values and the degree of imbalance in the data.
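The first and third techniques can be sketched as follows, with hypothetical names: PROC SURVEYSELECT down-samples the majority class in the training subset, and the PRIOREVENT= option on the SCORE statement of PROC LOGISTIC adjusts scored probabilities back toward the true event rate:

    proc sort data=work.train;
       by default_flag;
    run;

    proc surveyselect data=work.train out=work.train_ds method=srs
                      samprate=(0.284 1.0) seed=20170415;   /* keep all events, ~28% of non-events */
       strata default_flag;
    run;

    proc logistic data=work.train_ds;
       model default_flag(event='1') = x1-x10;
       score data=work.valid out=work.scored priorevent=0.2212;   /* original event rate */
    run;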
Read the paper (PDF)
Josephine Akosa, Oklahoma State University
Session 1003-2017:
Property and Casualty Insurance Predictive Analytics in SAS®
Predictive analytics has been evolving in property and casualty insurance for the past two decades. This paper first provides a high-level overview of predictive analytics in each of the following core business operations in the property and casualty (P&C) insurance industry: marketing, underwriting, actuarial pricing, actuarial reserving, and claims. Then, a common P&C insurance predictive modeling technical process in SAS® dealing with large data sets is introduced. The steps of this process include data acquisition, data preparation, variable creation, variable selection, model building (also known as model fitting), model validation, model testing, and so on. Finally, some successful models are introduced. Base SAS®, SAS/STAT® software, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process. This predictive modeling process could be tweaked or directly used in many other industries, as the statistical foundations of predictive analytics overlap substantially across the P&C insurance, health care, life insurance, banking, pharmaceutical, and genetics industries, among others. This paper is intended for SAS® users of any level and for business people from different industries who are interested in learning about general predictive analytics.
Read the paper (PDF)
Mei Najim, Gallagher Bassett
Q
Session 0844-2017:
Quickly Tackle Business Problems with Automated Model Development, Ensuring Accuracy and Performance
This session introduces how Equifax uses SAS® to develop a streamlined model automation process, including development and performance monitoring. The tool increases modeling efficiency and model accuracy, reduces error, and generates insights. You can integrate more advanced analytics tools in a later phase. The process can be applied to any business problem in risk or marketing, helping leaders make precise and accurate business decisions.
Vickey Chang, Equifax
R
Session 1323-2017:
Real AdaBoost: Boosting for Credit Scorecards and Similarity to WOE Logistic Regression
AdaBoost (or Adaptive Boosting) is a machine learning method that builds a series of decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. I discuss the AdaBoost methodology and introduce the extension called Real AdaBoost, which is so similar to stepwise weight of evidence logistic regression (SWOELR) that it might offer a framework with which we can understand the power of the SWOELR approach. I discuss the advantages of Real AdaBoost, including variable interaction and adaptive, stage-wise binning, and demonstrate a SAS® macro that uses Real AdaBoost to generate predictive models.
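The key quantity in Real AdaBoost is each tree's contribution of half the log-odds of its predicted probability, which is what makes the final score resemble a sum of WOE-style terms; a toy two-stage version with hypothetical scored data sets:

    data work.boosted;
       merge work.scored_tree1(rename=(p_event=p1))
             work.scored_tree2(rename=(p_event=p2));
       by id;
       f1 = 0.5 * log(p1 / (1 - p1));   /* stage contribution = half log-odds */
       f2 = 0.5 * log(p2 / (1 - p2));
       score = f1 + f2;                 /* additive, analogous to summing binned WOE terms */
    run;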
Read the paper (PDF)
Paul Edwards, ScotiaBank
S
Session 0969-2017:
SAS® Macros for Binning Predictors with a Binary Target
Binary logistic regression models are widely used in CRM (customer relationship management) or credit risk modeling. In these models, it is common to use nominal, ordinal, or discrete (NOD) predictors. NOD predictors typically are binned (reducing the number of their levels) before usage in a logistic model. The primary purpose of binning is to obtain parsimony without greatly reducing the strength of association of the predictor X to the binary target Y. In this paper, two SAS® macros are discussed. The %NOD_BIN macro bins predictors with nominal values (and ordinal and discrete values) by collapsing levels to maximize information value (IV). The %ORDINAL_BIN macro is applied to predictors that are ordered and in which collapsing can occur only for levels that are adjacent in the ordering of X. The %ORDINAL_BIN macro finds all possible binning solutions by complete enumeration. Solutions are ranked by IV, and monotonic solutions are identified.
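For reference, the information value that both macros maximize can be computed for an already-binned predictor with a short PROC SQL step; the data set and variable names are hypothetical, and the good/bad convention here matches IV = sum((pct_good - pct_bad) * WOE):

    proc sql;
       create table work.iv as
       select bin,
              sum(target = 0) / (select sum(target = 0) from work.train) as pct_good,
              sum(target = 1) / (select sum(target = 1) from work.train) as pct_bad,
              log(calculated pct_good / calculated pct_bad)              as woe,
              (calculated pct_good - calculated pct_bad) * calculated woe as iv_part
       from work.train
       group by bin;

       select sum(iv_part) as information_value from work.iv;
    quit;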
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions
Session 0990-2017:
SAS® Visual Analytics to Inform FDA of Potential Safety Issues for CFSAN-Regulated Products
Web Intelligence is a BusinessObjects web-based application used by the FDA for accessing and querying data files, and ultimately creating reports from multiple databases. The system allows querying of different databases using common business terms, and in the case of the FDA's Center for Food Safety and Applied Nutrition (CFSAN), careful review of dietary supplement information. However, in order to create timely and efficient reports for detection of safety signals leading to adverse events, a more efficient system is needed to obtain and visually display the data. Using SAS® Visual Analytics and SAS® Enterprise Guide® can assist with timely extraction of data from multiple databases commonly used by CFSAN and create a more user-friendly interface for management's review to help make key decisions in prevention of adverse events for the public.
Read the paper (PDF)
Manuel Kavekos, ORISE
Session SAS0437-2017:
Stacked Ensemble Models for Improved Prediction Accuracy
Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second-level learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS® 9.4 and SAS® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and naïve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.
Read the paper (PDF)
Funda Gunes, SAS
Russ Wolfinger, SAS
Pei-Yi Tan, SAS
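A heavily simplified sketch of the stacking idea follows: two base learners are trained, and their predictions on a holdout set become the inputs to a second-level logistic regression. The paper itself uses out-of-fold predictions and a much richer set of learners; the data set names, variable names, generated score-code variable (e.g. P_target1), and file path below are assumptions.

   /* base learner 1: logistic regression; base learner 2: decision tree */
   proc logistic data=train noprint outmodel=base1;
      model target(event='1') = x1-x20;
   run;
   proc hpsplit data=train;
      class target;
      model target = x1-x20;
      code file='base2_tree.sas';              /* DATA step score code */
   run;

   /* score the holdout with both base learners */
   proc logistic inmodel=base1;
      score data=holdout out=stack1(rename=(p_1=p_base1));
   run;
   data stack2;
      set stack1;
      %include 'base2_tree.sas';               /* adds the tree posterior, e.g. P_target1 */
   run;

   /* second-level (meta) model fit on the base-learner predictions */
   proc logistic data=stack2;
      model target(event='1') = p_base1 p_target1;
   run;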
T
Session SAS0523-2017:
Temporal Text Mining: A Thematic Exploration of Don Quixote
Temporal text mining (TTM) is the discovery of temporal patterns in documents that are collected over time. It involves discovery of latent themes, construction of a thematic evolution graph, and analysis of thematic patterns. This paper uses text mining and time series analysis techniques to explore Don Quixote de la Mancha, a two-volume master work of Western literature. First, it uses singular value decomposition in SAS® Text Miner to discover 25 key themes that characterize the two volumes. Then it treats the chapters of the two books as time-ordered documents and creates a semiautomated visual summary of the two volumes. It also explores the trajectory of individual themes over the course of the chapters and identifies episodes, recurring themes, and climaxes. Finally, it uses time series clustering in SAS® Enterprise Miner to group chapters that have similar themes and to group themes that have similar trajectories. The TTM methods demonstrated in this paper lend themselves to business applications such as monitoring changes in customer sentiment and summarizing research and legislative trends.
Read the paper (PDF)
Ray Wright, SAS
Session 0811-2017:
Text Mining of Movie Synopsis by SAS® Enterprise Miner™
This project describes a method for classifying movie genres from synopsis text data using two approaches: term frequency-inverse document frequency (TF-IDF) weighting and a C4.5 decision tree. By comparing the performance of the classifiers under different parameter settings, the strengths of the method for substantial text analysis are also interpreted. The results show that both approaches are powerful for identifying movie genres.
Read the paper (PDF) | View the e-poster or slides (PDF)
Yiyun Zhou, Kennesaw State University
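For reference, the TF-IDF weighting mentioned above can be computed in a few lines once the synopses have been tokenized into a term-level table. This sketch assumes a hypothetical table SYNOPSIS_TERMS with one row per movie and term and a raw count; none of these names come from the paper.

   proc sql;
      create table tfidf as
      select t.movie_id,
             t.term,
             t.term_count * log(n.n_docs / d.doc_freq) as tf_idf
      from synopsis_terms as t,
           (select term, count(distinct movie_id) as doc_freq
              from synopsis_terms group by term) as d,
           (select count(distinct movie_id) as n_docs from synopsis_terms) as n
      where t.term = d.term;
   quit;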
Session SAS1407-2017:
The Benefit of Using Clustering as Input to a Propensity to Buy Predictive Model
Propensity to Buy models comprise one of the most widely used techniques in supporting business strategy for customer segmentation and targeting. Some of the key challenges every data scientist faces in building predictive models are the utilization of all known predictor variables, uncovering any unknown signals, and adjusting for latent variable errors. Often, the business demands inclusion of certain variables based on a previous understanding of process dynamics. To meet such client requirements, these inputs are forced into the model, resulting in either a complex model with too many inputs or a fragile model that might decay faster than expected. West Corporation's Center for Data Science (CDS) has found a workaround that strikes a balance between meeting client requirements and building a robust model by using clustering techniques. A leading telecom services provider uses West's SMS Outbound Notification Platform to notify its customers about an upcoming Pay-Per-View event. As part of the modeling process, the client identified a few variables as key business drivers, and CDS used those variables to build clusters, which were then used as inputs to the predictive model. In doing so, not only were the effects of the client-mandated variables captured successfully, but the number of inputs to the model was also reduced, making it parsimonious. This paper illustrates how West has used clustering in the data preparation process and built a robust model.
Krutharth Peravalli, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
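A minimal sketch of the approach described above, with hypothetical data set and variable names: the client-mandated variables are clustered with PROC FASTCLUS, and the resulting cluster assignment replaces them as a single input to the propensity model.

   /* build clusters from the client-mandated variables */
   proc fastclus data=customers maxclusters=5 out=customers_clus;
      var mandated_var1 mandated_var2 mandated_var3;
   run;

   /* use the cluster assignment (variable CLUSTER) as a model input */
   proc logistic data=customers_clus;
      class cluster / param=ref;
      model purchased(event='1') = cluster other_input1 other_input2;
   run;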
Session 1061-2017:
The Rise of Chef Curry: Studying Advanced Basketball Metrics with Quantile Regression in SAS®
In the 2015-2016 season of the National Basketball Association (NBA), the Golden State Warriors achieved a record-breaking 73 regular-season wins. This accomplishment would not have been possible without their reigning Most Valuable Player (MVP) champion Stephen Curry and his historic shooting performance. Shattering his previous NBA record of 286 three-point shots made during the 2014-2015 regular season, he accrued an astounding 402 in the next season. With an increased emphasis on the advantages of the three-point shot and guard-heavy offenses in the NBA today, organizations are naturally eager to investigate player statistics related to shooting at long ranges, especially for the best of shooters. Furthermore, the addition of more advanced data-collecting entities such as SportVU creates an incredible opportunity for data analysis, moving beyond simply using aggregated box scores. This work uses quantile regression within SAS® 9.4 to explore the relationships between the three-point shot and other relevant advanced statistics, including some SportVU player-tracking data, for the top percentile of three-point shooters from the 2015-2016 NBA regular season.
View the e-poster or slides (PDF)
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
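As a flavor of the technique, quantile regression for an upper quantile of three-point makes can be fit with PROC QUANTREG roughly as follows. The data set and predictor names are illustrative assumptions, not the paper's SportVU variables.

   proc quantreg data=nba_shooters;
      /* model the 75th and 90th percentiles rather than the mean */
      model three_pt_made = avg_shot_distance touches_per_game catch_shoot_pct
            / quantile=0.75 0.9;
   run;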
Session 1489-2017:
Timing is Everything: Detecting Important Behavior Triggers
Predictive analytics is powerful in its ability to predict likely future behavior, and it is widely used in marketing. However, the old adage that timing is everything continues to hold true: the right customer and the right offer at the wrong time is less than optimal. In life, timing matters a great deal, but predictive analytics seldom takes timing into account explicitly. We should be able to do better. Financial service consumption changes often have precursor changes in behavior, and a behavior change can lead to multiple subsequent consumption changes. One way to improve our awareness of the customer situation is to detect significant events that have meaningful consequences and warrant proactive outreach. This session presents a simple time series approach to event detection that has proven to work successfully. Real case studies are discussed to illustrate the approach and implementation. Adoption of this practice can augment and enhance a predictive analytics practice, elevating our game to the next level.
Read the paper (PDF)
Daymond Ling, Seneca College
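The session does not publish its code, but the kind of event detection described above can be sketched as a simple trailing-average comparison. The data set, variable names, and the 30% threshold below are assumptions made only for illustration.

   /* flag a month whose balance shifts sharply versus the trailing 3-month average */
   data events;
      set balances;                      /* sorted by customer_id and month */
      by customer_id;
      if first.customer_id then seq = 0;
      seq + 1;
      b1 = lag1(balance); b2 = lag2(balance); b3 = lag3(balance);
      if seq > 3 then do;                /* need three prior in-group months */
         prior_avg = mean(b1, b2, b3);
         if prior_avg > 0 then
            trigger = (abs(balance - prior_avg) / prior_avg > 0.30);
      end;
      drop b1 b2 b3;
   run;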
Session SAS0763-2017:
Toward End-to-End Automated Machine Learning in SAS® Viya™
Trends in predictive modeling, specifically in machine learning, are moving toward automated approaches where well-known predictive algorithms are applied and the best one is chosen according to some evaluation metric. This approach's efficacy relies on the underlying data preparation in general, and on data preprocessing in particular. Commonly used data preprocessing techniques include missing value imputation, outlier detection and treatment, functional transformation, discretization, and nominal grouping. Excluding toy problems, the composition of the best data preprocessing step depends on the modeling task at hand. This necessitates an iterative generation and evaluation of predictive pipelines, which consist of a mix of data preprocessing techniques and predictive algorithms. This is a combinatorial problem that can be a bottleneck in the analytics workflow. In this paper, we discuss the SAS® Cloud Analytic Services (CAS) actions in SAS® Viya that can be used to effect this end-to-end predictive pipeline in a scalable way, with special emphasis on CAS actions for data exploration, preprocessing, and feature transformation. In addition, we discuss how the whole process can be automated.
Biruk Gebremariam, SAS
U
Session 1509-2017:
Using Analytics to Prevent Fraud Gives HDI Fast and Real-Time Approval for Claims
As part of the Talanx Group, HDI Insurance has been one of the leading insurers in Brazil. Recently, HDI Brazil implemented an innovative and integrated solution to prevent fraud in the auto claims process, based on SAS® Fraud Framework and SAS® Real-Time Decision Manager. A car repair or a refund is approved immediately after claim registration for customers with no suspicious information. On the other hand, high-scoring claims are checked by inspectors using SAS® Social Network Analysis. In terms of analytics, the solution takes a hybrid approach, working with predictive models, business rules, anomaly detection, and relationship networks. The main benefits are a reduction in the amount of fraud, more accuracy in determining the claims to be investigated, a decrease in the false-positive rate, and the use of a relationship network to investigate suspicious connections.
Read the paper (PDF)
Rayani Melega, HDI SEGUROS
Session 1169-2017:
Using Graph Analytics for Predictive Modeling in Life Insurance
This paper discusses a specific example of using graph analytics or social network analysis (SNA) in predictive modeling in the life insurance industry. The methods of social network analysis are applied to agents that share compensation, and the results are used to derive input variables for a model to predict the likelihood of certain behavior by insurance agents. Both SAS® code and SAS® Enterprise Miner are used to illustrate implementing different graph analytical methods. This paper assumes that the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression, and, in some sections, familiarity with SAS Enterprise Miner.
Read the paper (PDF)
Robert Moore, Thrivent Financial
Session 1248-2017:
Using SAS® to Analyze Emergency Department Visits: Medicaid Patients Compared to Other Pay Sources
Access to care for Medicaid beneficiaries is a topic of frequent study and debate. Section 1202 of the Affordable Care Act (ACA) requires states to raise Medicaid primary care payment rates to Medicare levels in 2013 and 2014. The federal government paid 100% of the increase. This program was designed to encourage primary care providers to participate in Medicaid, since this has long been a challenge for Medicaid. Whether this fee increase has increased access to primary care providers is still debated. Using SAS®, we evaluated whether Medicaid patients have a higher incidence of non-urgent visits to local emergency departments (ED) than do patients with other payment sources. The National Hospital Ambulatory Medical Care Survey (NHAMCS) data set, obtained from the Centers for Disease Control (CDC), was selected, since it contains data relating to hospital emergency departments. This emergency room data, for years 2003–2011, was analyzed by diagnosis, expected payment method, reason for the visit, region, and year. To evaluate whether the ED visits were considered urgent or non-urgent, we used the NYU Billings algorithm for classifying ED utilization (NYU Wagner 2015). Three models were used for the analyses: Binary Classification, Multi-Classification, and Regression. In addition to finding no regional differences, decision trees and SAS® Visual Analytics revealed that Medicaid patients do not have a higher rate of non-emergent visits when compared to other payment types.
Read the paper (PDF)
Bradley Casselman, CSA
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
Session 0484-2017:
Using Text Analysis to Improve the Quality of Scoring Models with SAS® Enterprise Miner™
Transformation of raw data into sensible and useful information for prediction purposes is a priceless skill nowadays. Vast amounts of data, easily accessible at each step in a process, give us a great opportunity to use them for countless applications. Unfortunately, not all of the valuable data is available for processing using classical data mining techniques. What happens if textual data is also used to create the analytical base table (ABT)? The goal of this study is to investigate whether scoring models that also use textual data are significantly better than models that include only quantitative data. This thesis is focused on estimating the probability of default (PD) for the social lending platform kokos.pl. The same methods used in banks are used to evaluate the accuracy of reported PDs. Data used for the analysis is gathered directly from the platform via the API. This paper describes in detail the steps of the data mining process that is built using SAS® Enterprise Miner™. The results of the study support the thesis that models with a properly conducted text-mining process have better classification quality than models without text variables. Therefore, the use of this data mining approach is recommended when input data includes text variables.
Read the paper (PDF)
Piotr Malaszek, SCS Expert
Session 1065-2017:
Using the OPTGRAPH Procedure: Transformation of Transactional Data into Graph for Cluster Analysis
Graphs are mathematical structures capable of representing networks of objects and their relationships. Clustering is an area in graph theory where objects are split into groups based on their connections. Depending on the application domain, object clusters have various meanings (for example, in market basket analysis, clusters are families of products that are frequently purchased together). This paper provides a SAS® macro featuring PROC OPTGRAPH, which enables the transformation of transactional data, or any data with a many-to-many relationship between two entities, into graph data, allowing for the generation and application of the co-occurrence graph and the probability graph.
Read the paper (PDF)
Linh Le, Kennesaw State University
Jennifer Priestley, Kennesaw State University
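The core transformation the macro performs, turning a many-to-many transactional table into weighted co-occurrence links, can be sketched with a PROC SQL self-join. The table and variable names here are assumptions; the resulting link table is what would then be handed to PROC OPTGRAPH.

   proc sql;
      create table item_links as
      select a.item as from_item,
             b.item as to_item,
             count(*) as weight              /* transactions containing both items */
      from baskets as a
           inner join baskets as b
           on a.transaction_id = b.transaction_id
      where a.item < b.item                  /* each unordered pair once */
      group by a.item, b.item;
   quit;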
W
Session 0785-2017:
Weight of Evidence Coding for the Cumulative Logit Model
Weight of evidence (WOE) coding of a nominal or discrete variable X is widely used when preparing predictors for usage in binary logistic regression models. The concept of WOE is extended to ordinal logistic regression for the case of the cumulative logit model. If the target (dependent) variable has L levels, then L-1 WOE variables are needed to recode X. The appropriate setting for implementing WOE coding is the cumulative logit model with partial proportional odds. As in the binary case, it is important to bin X to achieve parsimony before the WOE coding. SAS® code to perform this binning is discussed. An example is given that shows the implementation of WOE coding and binning for a cumulative logit model with the target variable having three levels.
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions
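To illustrate the extension described above for a three-level ordinal target Y (1 < 2 < 3), the two cumulative splits of Y yield two WOE recodings of a binned predictor X_BIN, which then enter a cumulative logit model. The data set and variable names are assumptions, and the partial proportional odds refinement discussed in the paper is omitted here.

   proc sql;
      create table woe_map as
      select x_bin,
             log( (sum(y<=1)/(select sum(y<=1) from ds)) /
                  (sum(y> 1)/(select sum(y> 1) from ds)) ) as woe1,
             log( (sum(y<=2)/(select sum(y<=2) from ds)) /
                  (sum(y> 2)/(select sum(y> 2) from ds)) ) as woe2
      from ds
      group by x_bin;

      create table ds_woe as
      select d.*, m.woe1, m.woe2
      from ds as d inner join woe_map as m on d.x_bin = m.x_bin;
   quit;

   /* cumulative logit model on the two WOE variables */
   proc logistic data=ds_woe;
      model y = woe1 woe2;
   run;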
Session 0883-2017:
What Statisticians Should Know about Machine Learning
In the last few years, machine learning and statistical learning methods have gained increasing popularity among data scientists and analysts. Statisticians have sometimes been reluctant to embrace these methodologies, partly due to a lack of familiarity, and partly due to concerns with interpretability and usability. In fact, statisticians have a lot to gain by using these modern, highly computational tools. For certain types of problems, machine learning methods can be much more accurate predictors than traditional methods for regression and classification, and some of these methods are particularly well suited for the analysis of big and wide data. Many of these methods have origins in statistics or at the boundary of statistics and computer science, and some are already well established in statistical procedures, including LASSO and elastic net for model selection and cross validation for model evaluation. In this talk, I go through some examples illustrating the application of machine learning methods and interpretation of their results, and show how these methods are similar to, and differ from, traditional statistical methodology for the same problems.
Read the paper (PDF)
D. Cutler, Utah State University
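As one concrete example of the model selection methods mentioned above, LASSO selection with cross validation can be run in PROC GLMSELECT along these lines (the data set and variable names are assumptions):

   proc glmselect data=analysis;
      model y = x1-x50 / selection=lasso(choose=cv stop=none)
                         cvmethod=random(10);   /* 10-fold cross validation */
   run;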
Session 1483-2017:
Why Credit Risk Needs Advanced Analytics: A Journey from Base SAS® to SAS® High-Performance Risk
We are at a tipping point for credit risk modeling. To meet the technical and regulatory challenges of IFRS 9 and stress testing, and to strengthen model risk management, CBS aims to create an integrated, end-to-end, tools-based solution across the model lifecycle, with strong governance and controls and an improved scenario testing and forecasting capability. SAS has been chosen as the technology partner to enable CBS to meet these aims. A new predictive analytics platform combining well-known tools such as SAS® Enterprise Miner™, SAS® Model Manager, and SAS® Data Management alongside SAS® Model Implementation Platform powered by SAS® High-Performance Risk is being deployed. Driven by technology, CBS has also considered the operating model for credit risk, restructuring resources around the new technology with clear lines of accountability, and has incorporated a dedicated data engineering function within the risk modeling team. CBS is creating a culture of collaboration across credit risk that supports the development of technology-led, innovative solutions that not only meet regulatory and model risk management requirements but that set a platform for the effective use of predictive analytics enterprise-wide.
Chris Arthur-McGuire
Z
Session 0935-2017:
Zeroing In on Effective Member Communication: An Rx Education Study
In 2013, the Centers for Medicare & Medicaid Services (CMS) changed the pharmacy mail-order member-acquisition process so that Humana Pharmacy may call a member only when the potential cost savings are greater than $2.00, to educate the member on the potential savings and instruct the member to call back. The Rx Education call center asked for analytics work to help prioritize member outreach, improve conversions, and decrease the number of members who cannot be contacted. After a year of contacting members using this additional insight, the conversions-after-agreement rate rose from 71.5% to 77.5%, and the unable-to-contact rate fell from 30.7% to 17.4%. This case study takes you on an analytics journey from the initial problem diagnosis and analytics solution through subsequent refinements and test-and-learn campaigns.
Read the paper (PDF)
Brian Mitchell, Humana Inc.