Data Mining, Predictive Modeling, Machine Learning, and Text Analytics Papers A-Z

A
Session 1066-2017:
A Big-Data Challenge: Visualizing Social Media Trends about Cancer Using SAS® Text Miner
Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a big data conference, data scientists from several companies were invited to participate in tackling this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach using SAS® analytical methods was superior to other big data approaches in investigating these trends.
Read the paper (PDF)
Scott Koval, Pinnacle Solutions, Inc
Yijie Li, Pinnacle Solutions, Inc
Mia Lyst, Pinnacle Solutions, Inc
Session 1256-2017:
A Comparison of Machine Learning Methods and Logistic Analysis for the Prediction of Past-Due Amount
This poster shows how to predict a past-due amount using traditional and machine learning techniques: logistic analysis, k-nearest neighbors, and random forest. The data set that was analyzed is about real-world commerce. It contains 305 categories of financial information from more than 11,787,287 unique businesses, from 2006 to 2014. The big challenge is how to handle big, noisy, real-world data sets. The first step of any model-building exercise is to define the outcome. A common prediction method in the financial services industry is to use binary outcomes, such as Good and Bad. For our research problem, we reduced past-due amounts into two cases, Good and Bad. Next, we built a two-stage model using the logistic regression method; that is, the first stage predicts the likelihood of a Bad outcome, and the second predicts a past-due amount, given a Bad outcome. Logistic analysis as a traditional statistical technique is commonly used for prediction and classification in the financial services industry. However, for analyzing big, noisy, or complex data sets, machine learning techniques are typically preferred to detect hard-to-discern patterns. To compare the techniques, we use predictive accuracy, the ROC index, sensitivity, and specificity as criteria.
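As a rough illustration of the two-stage approach described above, the following sketch (with hypothetical data set and variable names) fits a logistic model for the Good/Bad outcome and then a regression for the past-due amount among the Bad cases:

    /* Stage 1: likelihood of a Bad outcome (names are illustrative only) */
    proc logistic data=work.train;
       class industry / param=ref;
       model bad_flag(event='1') = industry revenue years_in_business;
       output out=work.stage1 p=p_bad;
    run;

    /* Stage 2: past-due amount, conditional on a Bad outcome */
    proc reg data=work.train(where=(bad_flag=1));
       model pastdue_amt = revenue years_in_business;
    run;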
Jie Hao, Kennesaw State University
Peter Eberhardt, Fernwood Consulting Group Inc.
Session 0823-2017:
A Hadoop Journey along the SAS® Road
As the open-source community has been taking the technology world by storm, especially in the big data space, large corporations such as SAS, IBM, and Oracle have been working to embrace this new, quickly evolving ecosystem to continue to foster innovation and to remain competitive. For example, SAS, IBM, and others have aligned with the Open Data Platform initiative and are continuing to build out Hadoop and Spark solutions. And, Oracle has partnered with Cloudera to create the Big Data Appliance. This movement challenges companies that are consuming these products to select the right products and support partners. The hybrid approach using all tools available seems to be the methodology chosen by most successful companies. West Corporation, an Omaha-based provider of technology-enabled communication solutions, is no exception. West has been working with SAS for 10 years in the ETL, BI, and advanced analytics space, and West began its Hadoop journey a year ago. This paper focuses on how West data teams use both technologies to improve customer experience in the interactive voice response (IVR) system by storing massive semi-structured call logs in HDFS and by building models that predict a caller's intent to route the caller more efficiently and to reduce customer effort using familiar SAS code and the user-friendly SAS® Enterprise Miner™.
Read the paper (PDF)
Sumit Sukhwani, West Corporation
Krutharth Peravalli, West Corporation
Amit Gautam, West Corporation
Session SAS0634-2017:
An Efficient Way to Deploy and Run Text Analytics Models in Hadoop
Significant growth of the Internet has created an enormous volume of unstructured text data. In recent years, the amount of this type of data that is available for analysis has exploded. While the amount of textual data is increasing rapidly, an ability to obtain key pieces of information from such data in a fast, flexible, and efficient way is still posing challenges. This paper introduces SAS® Contextual Analysis In-Database Scoring for Hadoop, which integrates SAS® Contextual Analysis with the SAS® Embedded Process. SAS® Contextual Analysis enables users to customize their text analytics models in order to realize the value of their text-based data. The SAS® Embedded Process enables users to take advantage of SAS® Scoring Accelerator for Hadoop to run scoring models. By using these key SAS® technologies, the overall experience of analyzing unstructured text data can be greatly improved. The paper also provides guidelines and examples on how to publish and run category, concept, and sentiment models for text analytics in Hadoop.
Read the paper (PDF)
Seung Lee, SAS
Xu Yang, SAS
Saratendu Sethi, SAS
Session SAS1492-2017:
An Overview of SAS® Visual Data Mining and Machine Learning on SAS® Viya™
Machine learning is in high demand. Whether you are a citizen data scientist who wants to work interactively or you are a hands-on data scientist who wants to code, you have access to the latest analytic techniques with SAS® Visual Data Mining and Machine Learning on SAS® Viya™. This offering surfaces in-memory machine learning techniques such as gradient boosting, factorization machines, neural networks, and much more through its interactive visual interface, SAS® Studio tasks, procedures, and a Python client. Learn about this multi-faceted new product and see it in action.
Read the paper (PDF)
Jonathan Wexler, SAS
Susan Haller, SAS
Session 1370-2017:
Analyzing the Predictive Power of Political and Social Factors in Determining Country Risk
Sovereign risk rating and country risk rating are conceptually distinct in that the former captures the risk of a country defaulting on its commercial debt obligations using economic variables, while the latter covers the downside of a country's business environment, including political and social variables alongside economic variables. Through this paper, we seek to understand the differences between these risk approaches in assessing a country's creditworthiness by statistically examining the predictive power of political and social variables in determining country risk. To do this, we build two models: the first with economic variables as regressors (sovereign risk model) and the second with economic, political, and social variables as regressors (country risk model), and we compare the predictive power of the regressors and the model performance metrics between the two. This will be an OLS regression model with the country risk rating obtained from S&P as the target variable. With the general assumption that economic variables are driven by political processes and social factors, we would like to see whether the second model has better predictive power. The economic, political, and social indicators data that will be used as independent variables in the model will be obtained from World Bank Open Data, and the target variable (country risk rating) will be obtained from S&P country risk ratings data.
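A minimal sketch of the planned comparison, with hypothetical World Bank indicator names, uses two labeled MODEL statements in PROC REG so that fit statistics for the sovereign risk and country risk specifications can be compared side by side:

    proc reg data=work.country_year;
       sovereign: model risk_rating = gdp_growth inflation debt_to_gdp;
       country:   model risk_rating = gdp_growth inflation debt_to_gdp
                                      political_stability rule_of_law;
    run;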
View the e-poster or slides (PDF)
Bhuvaneswari Yallabandi, Oklahoma State University
Vishwanath Srivatsa Kolar Bhaskara, Oklahoma State University
Session SAS0282-2017:
Applying Text Analytics and Machine Learning to Assess Consumer Financial Complaints
The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable application of text analytics techniques to the publicly available CFPB data. Specifically, we use SAS® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS® Visual Analytics and SAS® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.
Read the paper (PDF)
Tom Sabo, SAS
Session SAS0514-2017:
Automated Hyperparameter Tuning for Effective Machine Learning
Machine learning predictive modeling algorithms are governed by hyperparameters that have no clear defaults agreeable to a wide range of applications. A few examples of quantities that must be prescribed for these algorithms are the depth of a decision tree, number of trees in a random forest, number of hidden layers and neurons in each layer in a neural network, and degree of regularization to prevent overfitting. Not only do ideal settings for the hyperparameters dictate the performance of the training process, but more importantly they govern the quality of the resulting predictive models. Recent efforts to move from a manual or random adjustment of these parameters have included rough grid search and intelligent numerical optimization strategies. This paper presents an automatic tuning implementation that uses SAS/OR® local search optimization for tuning hyperparameters of modeling algorithms in SAS® Visual Data Mining and Machine Learning. The AUTOTUNE statement in the NNET, TREESPLIT, FOREST, and GRADBOOST procedures defines tunable parameters, default ranges, user overrides, and validation schemes to avoid overfitting. Given the inherent expense of training numerous candidate models, the paper addresses efficient distributed and parallel paradigms for training and tuning in SAS® Viya™. It also presents sample tuning results that demonstrate improved model accuracy over default configurations and offers recommendations for efficient and effective model tuning.
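A minimal sketch of the AUTOTUNE statement in PROC GRADBOOST appears below; the CAS table and variable names are hypothetical, and tuning options beyond the bare statement are omitted:

    proc gradboost data=mycas.train;
       target bad_flag / level=nominal;
       input x1-x20 / level=interval;
       autotune;   /* search the tunable hyperparameters instead of accepting defaults */
    run;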
Read the paper (PDF)
Patrick Koch, SAS
Brett Wujek, SAS
Oleg Golovidov, SAS
Steven Gardner, SAS
B
Session SAS0761-2017:
Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis
As in any analytical process, data sampling is a definitive step for unstructured data analysis. Sampling is of paramount importance if your data is fed from social media reservoirs such as Twitter, Facebook, Amazon, and Reddit, where information proliferation happens minute by minute. Unless you have a sophisticated analytical engine and robust physical servers to handle billions of pieces of data, you can't use all your data for analysis without sampling. So, how do you sample textual data? The standard method is to generate either a simple random sample, or a stratified random sample if a stratification variable exists in the data. Neither of these two methods can reliably produce a representative sample of documents from the population data simply because the process does not encompass a step to ensure that the distribution of terms between the population and sample sets remains similar. This shortcoming can cause the supervised or unsupervised learning to yield inaccurate results. If the generated sample is not representative of the population data, it is difficult to train and validate categories or sub-categories for those rare events during taxonomy development. In this paper, we show you new methods for sampling text data. We rely on a term-by-document matrix and SAS® macros to generate representative samples. Using these methods helps generate sufficient samples to train and validate every category in rule-based modeling approaches using SAS® Contextual Analysis.
Read the paper (PDF)
Murali Pagolu, SAS
Session SAS0474-2017:
Building Bayesian Network Classifiers Using the HPBNET Procedure
A Bayesian network is a directed acyclic graphical model that represents probability relationships and conditional independence structure between random variables. SAS® Enterprise Miner implements a Bayesian network primarily as a classification tool; it includes naïve Bayes, tree-augmented naïve Bayes, Bayesian-network-augmented naïve Bayes, parent-child Bayesian network, and Markov blanket Bayesian network classifiers. The HPBNET procedure uses a score-based approach and a constraint-based approach to model network structures. This paper compares the performance of Bayesian network classifiers to other popular classification methods, such as classification tree, neural network, logistic regression, and support vector machines. The paper also shows some real-world applications of the implemented Bayesian network classifiers and a useful visualization of the results.
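A bare-bones call to the HPBNET procedure might look like the sketch below; the data set and variables are hypothetical, and the STRUCTURE= option shown here (assumed to request a tree-augmented naïve Bayes network) should be checked against the procedure documentation:

    proc hpbnet data=work.train structure=TAN;   /* STRUCTURE= value is an assumption */
       target default_flag;
       input x1-x5;
    run;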
Read the paper (PDF)
Ye Liu, SAS
Weihua Shi, SAS
Wendy Czika, SAS
C
Session 1484-2017:
Can Incumbents Take the Digital Curve?!
Digital transformation and analytics for incumbents aren't a question of choice or strategy. They're a question of business survival. Go analytics!
Liav Geffen, Harel Insurance & Finance
Session SAS1414-2017:
Churn Prevention in the Telecom Services Industry: A Systematic Approach to Prevent B2B Churn Using SAS®
"It takes months to find a customer and only seconds to lose one" (Unknown). Though the Business-to-Business (B2B) churn problem might not be as common as Business-to-Consumer (B2C) churn, it has become crucial for companies to address this effectively as well. Using statistical methods to predict churn is the first step in the process of retaining customers, which also includes model evaluation, prescriptive analytics (including outreach optimization), and performance reporting. Providing visibility into model and treatment performance enables the Data and Ops teams to tune models and adjust treatment strategy. West Corporation's Center for Data Science (CDS) has partnered with one of its lines of business in order to measure and prevent B2B customer churn. CDS has coupled firmographic and demographic data with internal CRM and past outreach data to build a Propensity to Churn model using SAS®. CDS has provided the churn model output to an internal Client Success Team (CST), who focuses on high-risk/high-value customers in order to understand and provide resolution to any potential concerns that might be expressed by such customers. Furthermore, CDS automated weekly performance reporting using SAS and Microsoft Excel that not only focuses on model statistics, but also on CST actions and impact. This paper focuses on all of the steps involved in the churn-prevention process, including building and reviewing the model, treatment design and implementation, as well as performance reporting.
Krutharth Peravalli, West Corporation
Dmitriy Khots, West Corporation
Session 1312-2017:
Construction of a Disease Network and a Prediction Model for Dementia
Most studies of human disease networks have estimated the associations between disorders primarily from gene or protein information. Those studies, however, face difficulties because of the massive volume of data and the huge computational cost. Instead, we constructed a human disease network that can describe the associations between diseases, using claims data from Korean health insurance. Through several statistical analyses, we show the applicability and suitability of the disease network. Furthermore, we develop a statistical model that can predict the prevalence rate of dementia by using statistically significant associations in the network.
Read the paper (PDF)
Jinwoo Cho, Sung Kyun Kwan University
Session 1183-2017:
Creating Personal Game Statistics for Video Game Events: Examples from EVE Online and Ingress
Computer and video games are complex these days. Events in video games are in some cases recorded automatically in text files, creating a history or story of game play. There are countable items in these event records that can be used as data for statistics and other types of modeling. This E-Poster shows you how to statistically analyze text files for video game events using SAS®. Two games are analyzed. EVE Online, a massive multi-user online role-playing spaceship game, is one. The other game is Ingress, a cell phone game that combines exercise with a GPS and real-world environments. In both examples, the techniques involve parsing large amounts of text data to examine recurring patterns in text that describe events in the game play.
View the e-poster or slides (PDF)
Peter Timusk, Statistics Canada
D
Session 1089-2017:
Developing a Predictive Model of Physician Attribution of Patient Satisfaction Surveys
For all healthcare systems, considerable attention and resources are directed at gauging and improving patient satisfaction. Dignity Health has made considerable efforts in improving most areas of patient satisfaction. However, improving metrics around physician interaction with patients has been challenging. Failure to improve these publicly reported scores can result in reimbursement penalties, damage to Dignity's brand, and an increased risk of patient harm. One possible way to improve these scores is to better identify the physicians that present the best opportunity for positive change. Currently, the survey tool mandated by the Centers for Medicare and Medicaid Services (CMS), the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS), has three questions centered on patient experience with providers, specifically concerning listening, respect, and clarity of conversation. For purposes of relating patient satisfaction scores to physicians, Dignity Health has assigned scores based on the attending physician at discharge. A manual record review determined that this attribution method rarely corresponds with the reviewers' findings (PPV = 20.7%, 95% CI: 9.9% to 38.4%). Using a variety of SAS® tools and predictive modeling programs, we developed a logistic regression model that had better agreement with chart abstractors (PPV = 75.9%, 95% CI: 57.9% to 87.8%). By attributing providers based on this predictive model, opportunities for improvement can be more accurately targeted, resulting in improved patient satisfaction and outcomes while protecting fiscal health.
Read the paper (PDF)
Ken Ferrell, Dignity Health
Session 0801-2017:
Don't Let Your Annual Report Be Such a Manual Report: Combining Text, Graphs, and Tables in One Doc
Learn neat new (and not so new) methods for joining text, graphs, and tables in a single document. This paper shows how you can create a report that includes all three with a single solution: SAS®. The text portion is taken from a Microsoft Word document and joined with output from the GPLOT and REPORT procedures.
Read the paper (PDF)
Ben Cochran, The Bedford Group, Inc.
E
Session 1139-2017:
Enabling Advanced Customer Value Management with SAS®
Join this breakout session hosted by the Customer Value Management team from Saudi Telecommunications Company to understand the journey we took with SAS® to evolve from simple below-the-line campaign communication to advanced customer value management (CVM). Learn how the team leveraged SAS tools ranging from SAS® Enterprise Miner to SAS® Customer Intelligence Suite in order to gain a deeper understanding of customers and move toward targeting customers with the right offer at the right time through the right channel.
Read the paper (PDF)
Noorulain Malik, Saudi Telecom
Session 1057-2017:
Enhancing Customer Experience through Text Analysis of Survey Comments
Customer feedback is a critical aspect of businesses in today's world as it is invaluable in determining what customers like and dislike about the business's service. This loop of regularly listening to customers' voice through survey comments and improving services based on it leads to better business, and more importantly, to an enhancement in customer experience. The challenge is to classify and analyze these unstructured text comments to gain insights and to focus on areas of improvement. The purpose of this paper is to illustrate how text mining in SAS® Enterprise Miner™ 14.1 helped one of our clients, a leading financial services company, convert their customers' problems into opportunities. The customers' feedback pertaining to their experience with an Interactive Voice Response (IVR) system is collected by an enterprise feedback management (EFM) company. The comments are then split into two groups, which helps us differentiate customer opinions. This grouping is based on customers who have given a rating of 0 to 6 and a rating of 9 to 10 on a Likert scale of 0 to 10 (10 being extremely satisfied) in the survey questionnaire. Text mining is performed on both these groups, and an algorithm creates clusters that are subsequently used to segment customers based on opinions they are interested in voicing. Furthermore, sentiment scores are calculated for each one of the segments. The scores classify the polarity of customer feedback and prioritize the problems the client needs to focus on.
Read the paper (PDF)
Vinoth Kumar Raja, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
Session 0788-2017:
Examining Higher Education Performance Metrics with SAS® Enterprise Miner™ and SAS® Visual Analytics
Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time bachelor's degree seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS® Visual Analytics, where interactive reports allowed users to easily identify which students were at a high risk for attrition or at risk of not graduating on time.
Read the paper (PDF)
Taylor Blaetz, Western Kentucky University
Tuesdi Helbig, Western Kentucky University
Gina Huff, Western Kentucky University
Matt Bogard, Western Kentucky University
Session SAS0587-2017:
Exploring the Art and Science of SAS® Text Analytics: Best Practices in Developing Rule-Based Models
Traditional analytical modeling, with roots in statistical techniques, works best on structured data. Structured data enables you to impose certain standards and formats in which to store the data values. For example, a variable indicating gas mileage in miles per gallon should always be a number (for example, 25). However, with unstructured data, free-form text no longer limits this information to a single form of expression; the same value might appear as 25 mpg, twenty-five mpg, or 25M/G. The nuances of language, context, and subjectivity of text make it more complex to fit generalized models. Although statistical methods using supervised learning prove efficient and effective in some cases, sometimes you need a different approach. These situations are when rule-based models with Natural Language Processing capabilities can add significant value. In what context would you choose a rule-based modeling versus a statistical approach? How do you assess the tradeoffs of choosing a rule-based modeling approach with higher interpretability versus a statistical model that is black-box in nature? How can we develop rule-based models that optimize model performance without compromising accuracy? How can we design, construct, and maintain a complex rule-based model? What is a data-driven approach to rule writing? What are the common pitfalls to avoid? In this paper, we discuss all these questions based on our experiences working with SAS® Contextual Analysis and SAS® Sentiment Analysis.
Read the paper (PDF)
Murali Pagolu, SAS
Cheyanne Baird, SAS
Christina Engelhardt, SAS
F
Session SAS0388-2017:
Factorization Machines: A New Tool for Sparse Data
Factorization machines are a new type of model that is well suited to very high-cardinality, sparsely observed transactional data. This paper presents the new FACTMAC procedure, which implements factorization machines in SAS® Visual Data Mining and Machine Learning. This powerful and flexible model can be thought of as a low-rank approximation of a matrix or a tensor, and it can be efficiently estimated when most of the elements of that matrix or tensor are unknown. Thanks to a highly parallel stochastic gradient descent optimization solver, PROC FACTMAC can quickly handle data sets that contain tens of millions of rows. The paper includes examples that show you how to use PROC FACTMAC to recommend movies to users based on tens of millions of past ratings, predict whether fine food will be highly rated by connoisseurs, restore heavily damaged high-resolution images, and discover shot styles that best fit individual basketball players.
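A hedged sketch of a PROC FACTMAC call for a ratings problem follows; the CAS table, variable names, and option values are illustrative rather than taken from the paper:

    proc factmac data=mycas.ratings nfactors=10 maxiter=20;
       target rating / level=interval;
       input userid itemid / level=nominal;
       output out=mycas.scored copyvars=(rating);
    run;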
Read the paper (PDF)
Jorge Silva, SAS
Ray Wright, SAS
Session SAS0686-2017:
Fighting Crime in Real Time with SAS® Visual Scenario Designer
Credit card fraud. Loan fraud. Online banking fraud. Money laundering. Terrorism financing. Identity theft. The strains that modern criminals are placing on financial and government institutions demand new approaches to detecting and fighting crime. Traditional methods of analyzing large data sets on a periodic, batch basis are no longer sufficient. SAS® Event Stream Processing provides a framework and run-time architecture for building and deploying analytical models that run continuously on streams of incoming data, which can come from virtually any source: message queues, databases, files, TCP/IP sockets, and so on. SAS® Visual Scenario Designer is a powerful tool for developing, testing, and deploying aggregations, models, and rule sets that run in the SAS® Event Stream Processing Engine. This session explores the technology architecture, data flow, tools, and methodologies that are required to build a solution based on SAS Visual Scenario Designer that enables organizations to fight crime in real time.
Read the paper (PDF)
John Shipway, SAS
Session 1422-2017:
Flags Flying: Avoiding Consistently Penalized Defenses in NFL Fantasy Football
In fantasy football, it is a relatively common strategy to rotate which team's defense a player uses based on some combination of favorable/unfavorable player matchups, recent performance, and projection of expected points. However, there is danger in this strategy because defensive scoring volatility is high, and any team has the possibility of turning in a statistically bad performance in a given week. This paper uses data mining techniques to identify which National Football League (NFL) teams give up high numbers of defensive penalties on a week-to-week basis, and to what degree those high-penalty games correlate with poor team defensive fantasy scores. Examining penalty count and penalty yards allowed totals, we can narrow down which teams are consistently hurt by poor technique and find the correlation between games with high penalty totals and the corresponding fantasy football scores. By doing so, we seek to find which teams should be avoided in fantasy football due to their likelihood of poor performance.
Robert Silverman, Franklin & Marshall College
Session 0881-2017:
Fuzzy Matching and Predictive Models for Acquisition of New Customers
The acquisition of new customers is fundamental to the success of every business. Data science methods can greatly improve the effectiveness of acquiring prospective customers and can contribute to the profitability of business operations. In a business-to-business (B2B) setting, a predictive model might target business prospects as individual firms listed by, for example, Dun & Bradstreet. A typical acquisition model can be defined using a binary response with the values categorizing a firm's customers and non-customers, for which it is then necessary to identify which of the prospects are actually customers of the firm. The methods of fuzzy logic, for example, based on the distance between strings, might help in matching customers' names and addresses with the overall universe of prospects. However, two errors can occur: false positives (when the prospect is incorrectly classified as a firm's customer), and false negatives (when the prospect is incorrectly classified as a non-customer). In the current practice of building acquisition models, these errors are typically ignored. In this presentation, we assess how these errors affect the performance of the predictive model as measured by a lift. In order to improve the model's performance, we suggest using a pre-determined sample of correct matches and calibrating the model's predicted probabilities based on actual take-up rates. The presentation is illustrated with real B2B data and includes elements of SAS® code that was used in the research.
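One way to implement the string-distance matching described above is with the COMPGED function (SPEDIS is a similar alternative); the data sets, blocking key, and thresholds below are hypothetical and would need tuning against a hand-labeled sample of true matches:

    proc sql;
       create table work.candidate_pairs as
       select p.prospect_id, c.customer_id,
              compged(upcase(p.company_name), upcase(c.company_name)) as name_dist,
              compged(upcase(p.address), upcase(c.address))           as addr_dist
       from work.prospects as p, work.customers as c
       where p.zip = c.zip;   /* block on ZIP so the pairwise comparison stays manageable */
    quit;

    data work.matches;
       set work.candidate_pairs;
       match_flag = (name_dist <= 100 and addr_dist <= 200);   /* illustrative cutoffs */
    run;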
Read the paper (PDF)
Daniel Marinescu, Concordia University
G
Session 1530-2017:
Getting Started with Machine Learning
Machine learning algorithms have been available in SAS software since 1979. This session provides practical examples of machine learning applications. The evolution of machine learning at SAS is illustrated with examples ranging from nearest-neighbor discriminant analysis in the SAS/STAT DISCRIM procedure to advanced predictive modeling in SAS Enterprise Miner. Machine learning techniques addressed include memory-based reasoning, decision trees, neural networks, and gradient boosting algorithms.
Terry Woodfield, SAS
H
Session 2000-2017:
Hands-On Workshop: Data Mining using SAS® Enterprise Miner™
This workshop provides hands-on experience with using SAS Enterprise Miner. Workshop participants will learn to do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.
Carlos Andre Reis Pinheiro, SAS
Session SAS2008-2017:
Hands-On Workshop: Python and CAS Integration on SAS® Viya™
Jay Laramore, SAS
Session SAS2006-2017:
Hands-On Workshop: SAS® Visual Data Mining and Machine Learning on SAS® Viya™
This workshop provides hands-on experience with SAS Viya Data Mining and Machine Learning through the programming interface to SAS Viya. Workshop participants will learn how to start and stop a CAS session; move data into CAS; prepare data for machine learning; use SAS Studio tasks for supervised learning; and evaluate the results of analyses.
Carlos Andre Reis Pinheiro, SAS
Session SAS2007-2017:
Hands-On Workshop: SAS® Visual Statistics on SAS® Viya™
Andy Ravenna, SAS
Session SAS2004-2017:
Hands-On Workshop: Text Mining using SAS® Text Miner
This workshop provides hands-on experience using SAS® Text Miner. For a collection of documents, workshop participants will learn how to: read and convert documents for use by SAS Text Miner; retrieve information from the collection using query features of the software; identify the dominant themes and concepts in the collection; and classify documents having pre-assigned categories.
Terry Woodfield, SAS
Session 1082-2017:
Hash Objects: When LAGging Behind Just Doesn't Work
When modeling time series data, we often use a LAG of the dependent variable. The LAG function works great for this, until you try to predict out into the future and need the model's predicted value from one record as an independent value for a future record. This paper examines exactly how the LAG function works, and explains why it doesn't in this case. It also explains how to create a hash object that will accomplish a LAG of any value, how to load the initial data, how to add predicted values to the hash object, and how to extract those values when needed as an independent variable for future observations.
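A minimal sketch of the idea, with hypothetical table and variable names: the hash is keyed by period, preloaded with history, and updated with each new prediction so that the next row can look up its own "lag":

    data work.forecast;
       if _n_ = 1 then do;
          if 0 then set work.history(keep=period value);   /* define host variables */
          declare hash h(dataset: 'work.history');         /* preload actual history */
          h.defineKey('period');
          h.defineData('value');
          h.defineDone();
       end;
       set work.future_periods;                            /* one row per period to forecast */
       call missing(value);
       rc = h.find(key: period - 1);                       /* the "lag": last period's value */
       lag_value = ifn(rc = 0, value, .);
       predicted = 0.8 * lag_value + 5;                    /* stand-in for the real model equation */
       rc = h.replace(key: period, data: predicted);       /* make it available to the next row */
    run;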
Read the paper (PDF)
Andrea Wainwright-Zimmerman, Experis
Session SAS0239-2017:
Help! My New Director of Analytics Wants to Get Rid of SAS®! What Can I Do?
How would you answer this question? Most of us struggle to articulate the value of the tools, techniques, and teams we use when using analytics. How do you help the new director understand the value of SAS® to you, your job, and the company? In this interactive session, you will discover the components that make up total cost of ownership (TCO) as they apply to the analytics lifecycle. What should you consider when you evaluate total cost of ownership and why should you measure it? How can you help your management team understand the value that SAS provides?
Read the paper (PDF)
Melodie Rush, SAS
Session 0340-2017:
How to Use SAS® to Filter Stock for Trade
Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one based on a methodology from so many candidates is usually a big challenge. This paper presents a methodology based on the CANSLIM theorem and the momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in the CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS®. This paper presents methods for applying the logic and presenting it graphically through SAS. All SAS users, especially those who work directly on capital market business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.
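As an illustration of the momentum side of such a screen (not the full CANSLIM logic), a simple DATA step can flag symbols whose trailing 20-day return exceeds a threshold; the names and the 10% cutoff are hypothetical:

    proc sort data=work.prices;          /* one close per symbol per trading day */
       by symbol date;
    run;

    data work.momentum;
       set work.prices;
       by symbol;
       close_20d_ago = lag20(close);
       if first.symbol then count = 0;
       count + 1;
       if count > 20 then mom_20d = close / close_20d_ago - 1;  /* guard against values from the prior symbol */
       buy_candidate = (mom_20d > 0.10);
    run;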
Read the paper (PDF)
Brian Shen, Merlin Clinical Service LLC
I
Session SAS0645-2017:
Identifying Abnormal Equipment Behavior and Filtering Data near the Edge for IoT Applications
What if you had analytics near the edge for your Internet of Things (IoT) devices that would tell you whether a piece of equipment is operating within its normal range? And what if those same analytics could help you intelligently determine what data you should keep and what data should be filtered at the edge? This session focuses on classifying streaming data near the edge by showcasing a demo that implements a single-class classification model within a gateway device. The model identifies observations that are normal and abnormal to help determine possible machine issues and preventative maintenance opportunities. The classification also helps to provide a method for filtering data at the edge by capturing all abnormal data but taking only a sample of the normal operating data. The model is developed using SAS® Viya and implemented on a gateway device using SAS® Event Stream Processing. By using a single-class classification technique, the demo also illustrates how to avoid issues with binary classification that would require failure observations in order to build an accurate model. Problems that this demo addresses include: identifying potential and future equipment failures in near real time; filtering sensor data near the edge to prevent unnecessary transport and storage of less valuable data; and building a classification model for failure that doesn't require observations relating to failures.
Read the paper (PDF)
Ryan Gillespie, SAS
Robert Moreira, SAS
Session 1349-2017:
Inference from Smart Meter Data Using the Fourier Transform
This presentation demonstrates that applying the fast Fourier transform (FFT) to smart meter data can provide enhanced customer segmentation and discovery. The FFT is a mathematical method for transforming a function of time into a function of frequency. It is widely used in analyzing sound but is also relevant for utilities. Advanced Metering Infrastructure (AMI) refers to the full measurement and collection system that includes meters at the customer site and communication networks between the customer and the utility. With the inception of AMI, utilities experienced an explosion of data that provides vast analytical opportunities to improve reliability, customer satisfaction, and safety. However, the data explosion comes with its own challenges. The first challenge is the volume. Consider that just 20,000 customers with AMI data can reach over 300 GB of data per year. Simply aggregating the data from minutes to hours or even days can skew results and not provide accurate segmentations. The second challenge is the bad data that is being collected. Outliers caused by missing or incorrect reads, outages, or other factors must be addressed. FFT can eliminate this noise. The proposed framework is expected to identify various customer segments that could be used for demand response programs. The framework also has the potential to investigate diversion or fraud or failing meters (revenue protection), which is a big problem for many utilities.
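One hedged way to move interval-load data into the frequency domain in SAS is PROC SPECTRA (SAS/ETS), which computes the periodogram via the FFT; the table and variable names below are hypothetical:

    proc sort data=work.ami_reads;
       by customer_id read_datetime;
    run;

    proc spectra data=work.ami_reads out=work.spectrum p s adjmean;
       by customer_id;
       var kwh;
    run;
    /* Dominant daily and weekly frequencies per customer can then feed the segmentation step. */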
Tom Anderson, SAS
Prasenjit Shil, Ameren
Session SAS0539-2017:
Interactive Modeling in SAS® Visual Analytics
SAS® Visual Analytics has two offerings, SAS® Visual Statistics and SAS® Visual Data Mining and Machine Learning, that provide knowledge workers and data scientists an interactive interface for data partition, data exploration, feature engineering, and rapid modeling. These offerings are powered by the SAS® Viya platform, thus enabling big data and big analytic problems to be solved. This paper focuses on the steps a user would perform during an interactive modeling session.
Read the paper (PDF)
Don Chapman, SAS
Jonathan Wexler, SAS
K
Session 1002-2017:
Know Thyself: Diabetes Trend Analysis
Throughout history, the phrase know thyself has been the aspiration of many. The trend of wearable technologies has certainly provided the opportunity to collect personal data. These technologies enable individuals to know thyself on a more sophisticated level. Specifically, wearable technologies that can track a patient's medical profile in a web-based environment, such as continuous blood glucose monitors, are saving lives. The main goal for diabetics is to replicate the functions of the pancreas in a manner that allows them to live a normal, functioning lifestyle. Many diabetics have access to a visual analytics website to track their blood glucose readings. However, they often are unreadable and overloaded with information. Analyzing these readings from the glucose monitor and insulin pump with SAS®, diabetics can parse their own information into more simplified and readable graphs. This presentation demonstrates the ease in creating these visualizations. Not only is this beneficial for diabetics, but also for the doctors that prescribe the necessary basal and bolus levels of insulin for a patient s insulin pump.
View the e-poster or slides (PDF)
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
L
Session 1430-2017:
Linear Model Regularization
Linear regression, which is widely used, can be improved by the inclusion of a penalizing parameter. This helps reduce variance (at the cost of a slight increase in bias) and improves prediction accuracy and model interpretability. The regularization model is implemented on the sample data set, and recommendations for practical use are included.
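A minimal sketch of penalized linear regression in SAS uses PROC GLMSELECT with the LASSO; the data set and variable names are hypothetical:

    proc glmselect data=work.sample;
       partition fraction(validate=0.3);
       model y = x1-x20 / selection=lasso(choose=validate stop=none);
    run;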
View the e-poster or slides (PDF)
Shashank Hebbar, Kennesaw State University
Lili Zhang, Kennesaw State University
Dhiraj Gharana, Kennesaw State University
M
Session 0941-2017:
Maximizing Cross-Sell Opportunities with Predictive Analytics for Financial Institutions
In the increasingly competitive environment for banks and credit unions, every potential advantage should be pursued. One of these advantages is to market additional products to your existing customers rather than to new customers, since your existing customers already know (and hopefully trust) you, and you have so much data on them. But how can this best be done? How can you market the right products to the right customers at the right time? Predictive analytics can do this by forecasting which customers have the highest chance of purchasing a given financial product. This paper provides a step-by-step overview of a relatively simple but comprehensive approach to maximize cross-sell opportunities among your customers. We first prepare the data for a statistical analysis. With some basic predictive analytics techniques, we can then identify those members who have the highest chance of buying a financial product. For each of these members, we can also gain insight into why they would purchase, thus suggesting the best way to market to them. We then make suggestions to improve the model for better accuracy.
Read the paper (PDF)
Nate Derby
Session SAS0434-2017:
Methods of Multinomial Classification Using Support Vector Machines
Many practitioners of machine learning are familiar with support vector machines (SVMs) for solving binary classification problems. Two established methods of using SVMs in multinomial classification are the one-versus-all approach and the one-versus-one approach. This paper describes how to use SAS® software to implement these two methods of multinomial classification, with emphasis on both training the model and scoring new data. A variety of data sets are used to illustrate the pros and cons of each method.
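The one-versus-all idea can be sketched as a macro loop that recodes the target to a binary flag for each class level and fits one SVM per level; the PROC HPSVM call is kept minimal, and the data set, variables, and class levels are hypothetical (scoring new data and assigning the class with the highest predicted probability is omitted here):

    %macro ova(levels=setosa versicolor virginica);
       %local i lev;
       %do i = 1 %to %sysfunc(countw(&levels));
          %let lev = %scan(&levels, &i);
          data work.bin;
             set work.train;
             target_bin = (species = "&lev");   /* 1 for this class, 0 for all others */
          run;
          proc hpsvm data=work.bin;
             input x1-x4 / level=interval;
             target target_bin;
          run;
       %end;
    %mend ova;
    %ova;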
Read the paper (PDF)
Ralph Abbey, SAS
Taiping He, SAS
Tao Wang, SAS
Session 1264-2017:
Mind-Map the Gap: Sentiment Analysis of Public Transport
This paper presents a case study in which social media posts by individuals related to public transport companies in the United Kingdom were collected from social media sites such as Twitter and Facebook and also from forums using SAS® and Python. The posts were then further processed by SAS® Text Miner and SAS® Visual Analytics to retrieve brand names, means of public transport (underground, trains, buses), and any mentioned attributes. Relevant concepts and topics are identified using text mining techniques and visualized using concept maps and word clouds. Later, we aim to identify and categorize sentiment toward public transport in the corpus of the posts. Finally, we create an association map/mind-map of the different service dimensions/topics and the brands of public transport, using correspondence analysis.
Read the paper (PDF)
Tamas Bosznay, Amadeus Software Limited
Session 0820-2017:
Model Risk: Learning from Others' Mistakes
Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss Given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level that are built on these models and that are used to help in achieving business targets, setting risk-sensitive pricing, capital planning, optimizing Return on Equity/Risk Adjusted Return on Capital (ROE/RAROC), managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate, and their predictive power can drop dramatically. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results, without human intervention) can result in a huge error, which might have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS® tools: SAS® Model Monitoring Microservice, SAS® Model Manager, dashboards, and SAS® Visual Analytics.
Read the paper (PDF)
Boaz Galinson, Bank Leumi
Session 1014-2017:
Modeling the Merchandise Return Behavior of Anonymous and Non-Anonymous Online Apparel Retail Shoppers
This paper establishes the conceptualization of the dimension of the shopping cart (or market basket) on apparel retail websites. It analyzes how the cart dimension (describing anonymous shoppers) and the customer dimension (describing non-anonymous shoppers) impact merchandise return behavior. Five data-mining techniques (logistic regression, decision tree, neural network, gradient boosting, and support vector machine) are used for predicting the likelihood of merchandise return. The target variable is a dichotomous response variable: return versus not return. The primary input variables are conceptualized as constituents of the cart dimension, derived from engineering merchandise-related variables such as item style, item size, and item color, as well as free-shipping-related thresholds. By further incorporating the constituents of the customer dimension such as tenure, loyalty membership, and purchase histories, the predictive accuracy of the model built using each of the five data-mining techniques was found to improve substantially. This research also highlights the relative importance of the constituents of the cart and customer dimensions governing the likelihood of merchandise return. Recommendations for possible applications and research areas are provided.
Read the paper (PDF)
Sunny Lam, ANN Inc.
O
Session SAS0747-2017:
Open Your Mind: Use Cases for SAS® and Open-Source Analytics
As a data scientist, you need analytical tools and algorithms, whether commercial or open source, and you have some favorites. But how do you decide when to use what? And how can you integrate their use to your maximum advantage? This presentation provides several best practices for deploying both SAS® and open-source analytical tools to increase productivity and efficiency in your enterprise ecosystem. See an example of a marketing analysis using SAS and R algorithms in SAS® Enterprise Miner to develop a predictive model, and then operationalize that model for performance monitoring and in-database scoring. Also learn about using Python and SAS integration for developing predictive models from a Jupyter Notebook environment. Seeing these cases will help you decide how to improve your analytics with similar integration of SAS and open source.
Read the paper (PDF)
Tuba Islam, SAS
Session 1303-2017:
Optimizing the Analytical Data Life Cycle
The analytical data life cycle consists of four stages: data exploration, data preparation, model development, and model deployment. Traditionally, these stages can consume 80% of the time and resources within your organization. With innovative techniques such as in-database and in-memory processing, managing data and analytics can be streamlined, with an increase in performance, economics, and governance. This session explores how you can optimize the analytical data life cycle with some best practices and tips using SAS® and Teradata.
Tho Nguyen, Teradata
David Hare, SAS
Session 0372-2017:
Outline Outliers: Adding a Business Sense
Outliers, such as unusual, unexpected, or rare events, have received intensive attention from researchers and practitioners because of their impact on estimated statistics and developed models. Today, some business disciplines focus primarily on outliers, such as defaults of credit, operational risks, quality nonconformities, fraud, or even the results of marketing initiatives in highly competitive environments with low response rates of a couple percent or even less. This paper discusses the importance of detecting, isolating, and categorizing business outliers to discover their root causes and to monitor them dynamically. Addressing not only extreme values or multivariable densities when detecting outliers, but also distributions, patterns, clusters, combinations of items, and sequences of events, opens opportunities for business improvement. SAS® Enterprise Miner can be used to perform such detections. Creating special business segments or running specialized outlier-oriented data mining processes, such as decision trees, allows for isolation of business-important outliers, which are normally masked in traditional statistical techniques. This process, combined with 'what-if' scenario generation, prepares businesses for possible future surges even when no outliers of a specific type are currently present. Furthermore, analyzing some specific outliers can play a role in assessing business stability under corresponding stress tests.
Read the paper (PDF)
Alex Glushkovsky, BMO Financial Group
P
Session 1305-2017:
Predicting the Completeness of Clinical Data through Claims Data Matching Using SAS® Enterprise Miner™
Research using electronic health records (EHR) is emerging, but questions remain about its completeness, due in part to the time required for physicians to enter data in all fields. This presentation demonstrates the use of SAS® Enterprise Miner to predict completeness of clinical data using claims data as the standard 'source of truth' against which to compare it. A method for assessing and predicting the completeness of clinical data is presented using the tools and techniques from SAS Enterprise Miner. Some of the topics covered include: tips for preparing your sample data set for use in SAS Enterprise Miner; tips for preparing your sample data set for modeling, including effective use of the Input Data, Data Partition, Filter, and Replacement nodes; and building predictive models using the StatExplore, Decision Tree, Regression, and Model Compare nodes.
View the e-poster or slides (PDF)
Catherine Olson, Optum
Thomas Horstman, Optum
Session 0942-2017:
Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data
The most commonly reported model evaluation metric is the accuracy. This metric can be misleading when the data are imbalanced. In such cases, other evaluation metrics should be considered in addition to the accuracy. This study reviews alternative evaluation metrics for assessing the effectiveness of a model in highly imbalanced data. We used credit card clients in Taiwan as a case study. The data set contains 30,000 instances (22.12% risky and 77.88% non-risky) assessing the likelihood of a customer defaulting on a payment. Three different techniques were used during the model building process. The first technique involved down-sampling the majority class in the training subset. The second used the original imbalanced data whereas prior probabilities were set to account for oversampling in the third technique. The same sets of predictive models were then built for each technique after which the evaluation metrics were computed. The results suggest that model evaluation metrics might reveal more about distribution of classes than they do about the actual performance of models when the data are imbalanced. Moreover, some of the predictive models were identified to be very sensitive to imbalance. The final decision in model selection should consider a combination of different measures instead of relying on one measure. To minimize imbalance-biased estimates of performance, we recommend reporting both the obtained metric values and the degree of imbalance in the data.
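The first and third techniques can be sketched as follows, with hypothetical names: PROC SURVEYSELECT down-samples the majority class in the training subset, and the PRIOREVENT= option on the SCORE statement of PROC LOGISTIC adjusts scored probabilities back toward the true event rate:

    proc sort data=work.train;
       by default_flag;
    run;

    proc surveyselect data=work.train out=work.train_ds method=srs
                      samprate=(0.284 1.0) seed=20170415;   /* keep all events, ~28% of non-events */
       strata default_flag;
    run;

    proc logistic data=work.train_ds;
       model default_flag(event='1') = x1-x10;
       score data=work.valid out=work.scored priorevent=0.2212;   /* original event rate */
    run;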
Read the paper (PDF)
Josephine Akosa, Oklahoma State University
Session 1003-2017:
Property and Casualty Insurance Predictive Analytics in SAS®
Predictive analytics has been evolving in property and casualty insurance for the past two decades. This paper first provides a high-level overview of predictive analytics in each of the following core business operations in the property and casualty (P&C) insurance industry: marketing, underwriting, actuarial pricing, actuarial reserving, and claims. Then, a common P&C insurance predictive modeling technical process in SAS® dealing with large data sets is introduced. The steps of this process include data acquisition, data preparation, variable creation, variable selection, model building (also known as model fitting), model validation, model testing, and so on. Finally, some successful models are introduced. Base SAS®, SAS/STAT® software, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process. This predictive modeling process could be tweaked or directly used in many other industries, as the statistical foundations of predictive analytics overlap substantially across the P&C insurance, health care, life insurance, banking, pharmaceutical, and genetics industries, among others. This paper is intended for SAS® users of any level and for business people from different industries who are interested in learning about general predictive analytics.
Read the paper (PDF)
Mei Najim, Gallagher Bassett
Q
Session 0844-2017:
Quickly Tackle Business Problems with Automated Model Development, Ensuring Accuracy and Performance
This session introduces how Equifax uses SAS® to develop a streamlined model automation process, including development and performance monitoring. The tool increases modeling efficiency and model accuracy, reduces error, and generates insights. You can integrate more advanced analytics tools in a later phase. The process can be applied to any business problem in risk or marketing, helping leaders make precise and accurate business decisions.
Vickey Chang, Equifax
R
Session 1323-2017:
Real AdaBoost: Boosting for Credit Scorecards and Similarity to WOE Logistic Regression
AdaBoost (or Adaptive Boosting) is a machine learning method that builds a series of decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. I discuss the AdaBoost methodology and introduce the extension called Real AdaBoost, which is so similar to stepwise weight of evidence logistic regression (SWOELR) that it might offer a framework with which we can understand the power of the SWOELR approach. I discuss the advantages of Real AdaBoost, including variable interaction and adaptive, stage-wise binning, and demonstrate a SAS® macro that uses Real AdaBoost to generate predictive models.
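The key quantity in Real AdaBoost is each tree's contribution of half the log-odds of its predicted probability, which is what makes the final score resemble a sum of WOE-style terms; a toy two-stage version with hypothetical scored data sets:

    data work.boosted;
       merge work.scored_tree1(rename=(p_event=p1))
             work.scored_tree2(rename=(p_event=p2));
       by id;
       f1 = 0.5 * log(p1 / (1 - p1));   /* stage contribution = half log-odds */
       f2 = 0.5 * log(p2 / (1 - p2));
       score = f1 + f2;                 /* additive, analogous to summing binned WOE terms */
    run;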
Read the paper (PDF)
Paul Edwards, ScotiaBank
S
Session 0969-2017:
SAS® Macros for Binning Predictors with a Binary Target
Binary logistic regression models are widely used in CRM (customer relationship management) or credit risk modeling. In these models, it is common to use nominal, ordinal, or discrete (NOD) predictors. NOD predictors typically are binned (reducing the number of their levels) before usage in a logistic model. The primary purpose of binning is to obtain parsimony without greatly reducing the strength of association of the predictor X to the binary target Y. In this paper, two SAS® macros are discussed. The %NOD_BIN macro bins predictors with nominal values (and ordinal and discrete values) by collapsing levels to maximize information value (IV). The %ORDINAL_BIN macro is applied to predictors that are ordered and in which collapsing can occur only for levels that are adjacent in the ordering of X. The %ORDINAL_BIN macro finds all possible binning solutions by complete enumeration. Solutions are ranked by IV, and monotonic solutions are identified.
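For reference, the information value that both macros maximize can be computed for an already-binned predictor with a short PROC SQL step; the data set and variable names are hypothetical, and the good/bad convention here matches IV = sum((pct_good - pct_bad) * WOE):

    proc sql;
       create table work.iv as
       select bin,
              sum(target = 0) / (select sum(target = 0) from work.train) as pct_good,
              sum(target = 1) / (select sum(target = 1) from work.train) as pct_bad,
              log(calculated pct_good / calculated pct_bad)              as woe,
              (calculated pct_good - calculated pct_bad) * calculated woe as iv_part
       from work.train
       group by bin;

       select sum(iv_part) as information_value from work.iv;
    quit;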
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions
Session 0990-2017:
SAS® Visual Analytics to Inform FDA of Potential Safety Issues for CFSAN-Regulated Products
Web Intelligence is a BusinessObjects web-based application used by the FDA for accessing and querying data files, and ultimately creating reports from multiple databases. The system allows querying of different databases using common business terms, and in the case of the FDA's Center for Food Safety and Applied Nutrition (CFSAN), careful review of dietary supplement information. However, in order to create timely and efficient reports for detection of safety signals leading to adverse events, a more efficient system is needed to obtain and visually display the data. Using SAS® Visual Analytics and SAS® Enterprise Guide® can assist with timely extraction of data from multiple databases commonly used by CFSAN and create a more user-friendly interface for management's review to help make key decisions in prevention of adverse events for the public.
Read the paper (PDF)
Manuel Kavekos, ORISE
Session SAS0437-2017:
Stacked Ensemble Models for Improved Prediction Accuracy
Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second-level learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS® 9.4 and SAS® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and naïve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.
Read the paper (PDF)
Funda Gunes, SAS
Russ Wolfinger, SAS
Pei-Yi Tan, SAS
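A heavily simplified sketch of the stacking idea follows: two base learners are trained, and their predictions on a holdout set become the inputs to a second-level logistic regression. The paper itself uses out-of-fold predictions and a much richer set of learners; the data set names, variable names, generated score-code variable (e.g. P_target1), and file path below are assumptions.

   /* base learner 1: logistic regression; base learner 2: decision tree */
   proc logistic data=train noprint outmodel=base1;
      model target(event='1') = x1-x20;
   run;
   proc hpsplit data=train;
      class target;
      model target = x1-x20;
      code file='base2_tree.sas';              /* DATA step score code */
   run;

   /* score the holdout with both base learners */
   proc logistic inmodel=base1;
      score data=holdout out=stack1(rename=(p_1=p_base1));
   run;
   data stack2;
      set stack1;
      %include 'base2_tree.sas';               /* adds the tree posterior, e.g. P_target1 */
   run;

   /* second-level (meta) model fit on the base-learner predictions */
   proc logistic data=stack2;
      model target(event='1') = p_base1 p_target1;
   run;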
T
Session SAS0523-2017:
Temporal Text Mining: A Thematic Exploration of Don Quixote
Temporal text mining (TTM) is the discovery of temporal patterns in documents that are collected over time. It involves discovery of latent themes, construction of a thematic evolution graph, and analysis of thematic patterns. This paper uses text mining and time series analysis techniques to explore Don Quixote de la Mancha, a two-volume master work of Western literature. First, it uses singular value decomposition in SAS® Text Miner to discover 25 key themes that characterize the two volumes. Then it treats the chapters of the two books as time-ordered documents and creates a semiautomated visual summary of the two volumes. It also explores the trajectory of individual themes over the course of the chapters and identifies episodes, recurring themes, and climaxes. Finally, it uses time series clustering in SAS® Enterprise Miner to group chapters that have similar themes and to group themes that have similar trajectories. The TTM methods demonstrated in this paper lend themselves to business applications such as monitoring changes in customer sentiment and summarizing research and legislative trends.
Read the paper (PDF)
Ray Wright, SAS
Session 0811-2017:
Text Mining of Movie Synopsis by SAS® Enterprise Miner™
This project describes a method for classifying movie genres from synopsis text data using two approaches: term frequency-inverse document frequency (TF-IDF) weighting and a C4.5 decision tree. By comparing the performance of the classifiers under different parameter settings, the strengths of the method for substantial text analysis are also interpreted. The results show that both approaches are powerful for identifying movie genres.
Read the paper (PDF) | View the e-poster or slides (PDF)
Yiyun Zhou, Kennesaw State University
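For reference, the TF-IDF weighting mentioned above can be computed in a few lines once the synopses have been tokenized into a term-level table. This sketch assumes a hypothetical table SYNOPSIS_TERMS with one row per movie and term and a raw count; none of these names come from the paper.

   proc sql;
      create table tfidf as
      select t.movie_id,
             t.term,
             t.term_count * log(n.n_docs / d.doc_freq) as tf_idf
      from synopsis_terms as t,
           (select term, count(distinct movie_id) as doc_freq
              from synopsis_terms group by term) as d,
           (select count(distinct movie_id) as n_docs from synopsis_terms) as n
      where t.term = d.term;
   quit;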
Session SAS1407-2017:
The Benefit of Using Clustering as Input to a Propensity to Buy Predictive Model
Propensity to Buy models comprise one of the most widely used techniques in supporting business strategy for customer segmentation and targeting. Some of the key challenges every data scientist faces in building predictive models are the utilization of all known predictor variables, uncovering any unknown signals, and adjusting for latent variable errors. Often, the business demands inclusion of certain variables based on a previous understanding of process dynamics. To meet such client requirements, these inputs are forced into the model, resulting in either a complex model with too many inputs or a fragile model that might decay faster than expected. West Corporation's Center for Data Science (CDS) has found a workaround that strikes a balance between meeting client requirements and building a robust model by using clustering techniques. A leading telecom services provider uses West's SMS Outbound Notification Platform to notify its customers about an upcoming Pay-Per-View event. As part of the modeling process, the client identified a few variables as key business drivers, and CDS used those variables to build clusters, which were then used as inputs to the predictive model. In doing so, not only were the effects of the client-mandated variables captured successfully, but the number of inputs to the model was also reduced, making it parsimonious. This paper illustrates how West has used clustering in the data preparation process and built a robust model.
Krutharth Peravalli, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
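A minimal sketch of the approach described above, with hypothetical data set and variable names: the client-mandated variables are clustered with PROC FASTCLUS, and the resulting cluster assignment replaces them as a single input to the propensity model.

   /* build clusters from the client-mandated variables */
   proc fastclus data=customers maxclusters=5 out=customers_clus;
      var mandated_var1 mandated_var2 mandated_var3;
   run;

   /* use the cluster assignment (variable CLUSTER) as a model input */
   proc logistic data=customers_clus;
      class cluster / param=ref;
      model purchased(event='1') = cluster other_input1 other_input2;
   run;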
Session 1061-2017:
The Rise of Chef Curry: Studying Advanced Basketball Metrics with Quantile Regression in SAS®
In the 2015-2016 season of the National Basketball Association (NBA), the Golden State Warriors achieved a record-breaking 73 regular-season wins. This accomplishment would not have been possible without their reigning Most Valuable Player (MVP) champion Stephen Curry and his historic shooting performance. Shattering his previous NBA record of 286 three-point shots made during the 2014-2015 regular season, he accrued an astounding 402 in the next season. With an increased emphasis on the advantages of the three-point shot and guard-heavy offenses in the NBA today, organizations are naturally eager to investigate player statistics related to shooting at long ranges, especially for the best of shooters. Furthermore, the addition of more advanced data-collecting entities such as SportVU creates an incredible opportunity for data analysis, moving beyond simply using aggregated box scores. This work uses quantile regression within SAS® 9.4 to explore the relationships between the three-point shot and other relevant advanced statistics, including some SportVU player-tracking data, for the top percentile of three-point shooters from the 2015-2016 NBA regular season.
View the e-poster or slides (PDF)
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
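As a flavor of the technique, quantile regression for an upper quantile of three-point makes can be fit with PROC QUANTREG roughly as follows. The data set and predictor names are illustrative assumptions, not the paper's SportVU variables.

   proc quantreg data=nba_shooters;
      /* model the 75th and 90th percentiles rather than the mean */
      model three_pt_made = avg_shot_distance touches_per_game catch_shoot_pct
            / quantile=0.75 0.9;
   run;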
Session 1489-2017:
Timing is Everything: Detecting Important Behavior Triggers
Predictive analytics is powerful in its ability to predict likely future behavior, and it is widely used in marketing. However, the old adage that timing is everything continues to hold true: the right customer and the right offer at the wrong time is less than optimal. In life, timing matters a great deal, but predictive analytics seldom takes timing into account explicitly. We should be able to do better. Financial service consumption changes often have precursor changes in behavior, and a behavior change can lead to multiple subsequent consumption changes. One way to improve our awareness of the customer situation is to detect significant events that have meaningful consequences and warrant proactive outreach. This session presents a simple time series approach to event detection that has proven to work successfully. Real case studies are discussed to illustrate the approach and implementation. Adoption of this practice can augment and enhance a predictive analytics practice, elevating our game to the next level.
Read the paper (PDF)
Daymond Ling, Seneca College
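The session does not publish its code, but the kind of event detection described above can be sketched as a simple trailing-average comparison. The data set, variable names, and the 30% threshold below are assumptions made only for illustration.

   /* flag a month whose balance shifts sharply versus the trailing 3-month average */
   data events;
      set balances;                      /* sorted by customer_id and month */
      by customer_id;
      if first.customer_id then seq = 0;
      seq + 1;
      b1 = lag1(balance); b2 = lag2(balance); b3 = lag3(balance);
      if seq > 3 then do;                /* need three prior in-group months */
         prior_avg = mean(b1, b2, b3);
         if prior_avg > 0 then
            trigger = (abs(balance - prior_avg) / prior_avg > 0.30);
      end;
      drop b1 b2 b3;
   run;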
Session SAS0763-2017:
Toward End-to-End Automated Machine Learning in SAS® Viya™
Trends in predictive modeling, specifically in machine learning, are moving toward automated approaches where well-known predictive algorithms are applied and the best one is chosen according to some evaluation metric. This approach's efficacy relies on the underlying data preparation in general, and on data preprocessing in particular. Commonly used data preprocessing techniques include missing value imputation, outlier detection and treatment, functional transformation, discretization, and nominal grouping. Excluding toy problems, the composition of the best data preprocessing step depends on the modeling task at hand. This necessitates an iterative generation and evaluation of predictive pipelines, which consist of a mix of data preprocessing techniques and predictive algorithms. This is a combinatorial problem that can be a bottleneck in the analytics workflow. In this paper, we discuss the SAS® Cloud Analytic Services (CAS) actions in SAS® Viya that can be used to effect this end-to-end predictive pipeline in a scalable way, with special emphasis on CAS actions for data exploration, preprocessing, and feature transformation. In addition, we discuss how the whole process can be automated.
Biruk Gebremariam, SAS
U
Session 1509-2017:
Using Analytics to Prevent Fraud Gives HDI Fast and Real-Time Approval for Claims
As part of the Talanx Group, HDI Insurance has been one of the leading insurers in Brazil. Recently, HDI Brazil implemented an innovative and integrated solution to prevent fraud in the auto claims process, based on SAS® Fraud Framework and SAS® Real-Time Decision Manager. A car repair or a refund is approved immediately after claim registration for customers with no suspicious information. On the other hand, high-scoring claims are checked by inspectors using SAS® Social Network Analysis. In terms of analytics, the solution takes a hybrid approach, working with predictive models, business rules, anomaly detection, and relationship networks. The main benefits are a reduction in the amount of fraud, more accuracy in determining the claims to be investigated, a decrease in the false-positive rate, and the use of a relationship network to investigate suspicious connections.
Read the paper (PDF)
Rayani Melega, HDI SEGUROS
Session 1169-2017:
Using Graph Analytics for Predictive Modeling in Life Insurance
This paper discusses a specific example of using graph analytics or social network analysis (SNA) in predictive modeling in the life insurance industry. The methods of social network analysis are applied to agents that share compensation, and the results are used to derive input variables for a model to predict the likelihood of certain behavior by insurance agents. Both SAS® code and SAS® Enterprise Miner are used to illustrate implementing different graph analytical methods. This paper assumes that the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression, and, in some sections, familiarity with SAS Enterprise Miner.
Read the paper (PDF)
Robert Moore, Thrivent Financial
Session 1248-2017:
Using SAS® to Analyze Emergency Department Visits: Medicaid Patients Compared to Other Pay Sources
Access to care for Medicaid beneficiaries is a topic of frequent study and debate. Section 1202 of the Affordable Care Act (ACA) requires states to raise Medicaid primary care payment rates to Medicare levels in 2013 and 2014. The federal government paid 100% of the increase. This program was designed to encourage primary care providers to participate in Medicaid, since this has long been a challenge for Medicaid. Whether this fee increase has increased access to primary care providers is still debated. Using SAS®, we evaluated whether Medicaid patients have a higher incidence of non-urgent visits to local emergency departments (ED) than do patients with other payment sources. The National Hospital Ambulatory Medical Care Survey (NHAMCS) data set, obtained from the Centers for Disease Control (CDC), was selected, since it contains data relating to hospital emergency departments. This emergency room data, for years 2003–2011, was analyzed by diagnosis, expected payment method, reason for the visit, region, and year. To evaluate whether the ED visits were considered urgent or non-urgent, we used the NYU Billings algorithm for classifying ED utilization (NYU Wagner 2015). Three models were used for the analyses: Binary Classification, Multi-Classification, and Regression. In addition to finding no regional differences, decision trees and SAS® Visual Analytics revealed that Medicaid patients do not have a higher rate of non-emergent visits when compared to other payment types.
Read the paper (PDF)
Bradley Casselman, CSA
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
Session 0484-2017:
Using Text Analysis to Improve the Quality of Scoring Models with SAS® Enterprise Miner™
Transformation of raw data into sensible and useful information for prediction purposes is a priceless skill nowadays. Vast amounts of data, easily accessible at each step in a process, give us a great opportunity to use them for countless applications. Unfortunately, not all of the valuable data is available for processing using classical data mining techniques. What happens if textual data is also used to create the analytical base table (ABT)? The goal of this study is to investigate whether scoring models that also use textual data are significantly better than models that include only quantitative data. This thesis is focused on estimating the probability of default (PD) for the social lending platform kokos.pl. The same methods used in banks are used to evaluate the accuracy of reported PDs. Data used for the analysis is gathered directly from the platform via the API. This paper describes in detail the steps of the data mining process that is built using SAS® Enterprise Miner™. The results of the study support the thesis that models with a properly conducted text-mining process have better classification quality than models without text variables. Therefore, the use of this data mining approach is recommended when input data includes text variables.
Read the paper (PDF)
Piotr Malaszek, SCS Expert
Session 1065-2017:
Using the OPTGRAPH Procedure: Transformation of Transactional Data into Graph for Cluster Analysis
Graphs are mathematical structures capable of representing networks of objects and their relationships. Clustering is an area in graph theory where objects are split into groups based on their connections. Depending on the application domain, object clusters have various meanings (for example, in market basket analysis, clusters are families of products that are frequently purchased together). This paper provides a SAS® macro featuring PROC OPTGRAPH, which enables the transformation of transactional data, or any data with a many-to-many relationship between two entities, into graph data, allowing for the generation and application of the co-occurrence graph and the probability graph.
Read the paper (PDF)
Linh Le, Kennesaw State University
Jennifer Priestley, Kennesaw State University
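The core transformation the macro performs, turning a many-to-many transactional table into weighted co-occurrence links, can be sketched with a PROC SQL self-join. The table and variable names here are assumptions; the resulting link table is what would then be handed to PROC OPTGRAPH.

   proc sql;
      create table item_links as
      select a.item as from_item,
             b.item as to_item,
             count(*) as weight              /* transactions containing both items */
      from baskets as a
           inner join baskets as b
           on a.transaction_id = b.transaction_id
      where a.item < b.item                  /* each unordered pair once */
      group by a.item, b.item;
   quit;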
W
Session 0785-2017:
Weight of Evidence Coding for the Cumulative Logit Model
Weight of evidence (WOE) coding of a nominal or discrete variable X is widely used when preparing predictors for usage in binary logistic regression models. The concept of WOE is extended to ordinal logistic regression for the case of the cumulative logit model. If the target (dependent) variable has L levels, then L-1 WOE variables are needed to recode X. The appropriate setting for implementing WOE coding is the cumulative logit model with partial proportional odds. As in the binary case, it is important to bin X to achieve parsimony before the WOE coding. SAS® code to perform this binning is discussed. An example is given that shows the implementation of WOE coding and binning for a cumulative logit model with the target variable having three levels.
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions
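To illustrate the extension described above for a three-level ordinal target Y (1 < 2 < 3), the two cumulative splits of Y yield two WOE recodings of a binned predictor X_BIN, which then enter a cumulative logit model. The data set and variable names are assumptions, and the partial proportional odds refinement discussed in the paper is omitted here.

   proc sql;
      create table woe_map as
      select x_bin,
             log( (sum(y<=1)/(select sum(y<=1) from ds)) /
                  (sum(y> 1)/(select sum(y> 1) from ds)) ) as woe1,
             log( (sum(y<=2)/(select sum(y<=2) from ds)) /
                  (sum(y> 2)/(select sum(y> 2) from ds)) ) as woe2
      from ds
      group by x_bin;

      create table ds_woe as
      select d.*, m.woe1, m.woe2
      from ds as d inner join woe_map as m on d.x_bin = m.x_bin;
   quit;

   /* cumulative logit model on the two WOE variables */
   proc logistic data=ds_woe;
      model y = woe1 woe2;
   run;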
Session 0883-2017:
What Statisticians Should Know about Machine Learning
In the last few years, machine learning and statistical learning methods have gained increasing popularity among data scientists and analysts. Statisticians have sometimes been reluctant to embrace these methodologies, partly due to a lack of familiarity, and partly due to concerns with interpretability and usability. In fact, statisticians have a lot to gain by using these modern, highly computational tools. For certain types of problems, machine learning methods can be much more accurate predictors than traditional methods for regression and classification, and some of these methods are particularly well suited for the analysis of big and wide data. Many of these methods have origins in statistics or at the boundary of statistics and computer science, and some are already well established in statistical procedures, including LASSO and elastic net for model selection and cross validation for model evaluation. In this talk, I go through some examples illustrating the application of machine learning methods and interpretation of their results, and show how these methods are similar to, and differ from, traditional statistical methodology for the same problems.
Read the paper (PDF)
D. Cutler, Utah State University
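As one concrete example of the model selection methods mentioned above, LASSO selection with cross validation can be run in PROC GLMSELECT along these lines (the data set and variable names are assumptions):

   proc glmselect data=analysis;
      model y = x1-x50 / selection=lasso(choose=cv stop=none)
                         cvmethod=random(10);   /* 10-fold cross validation */
   run;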
Session 1483-2017:
Why Credit Risk Needs Advanced Analytics: A Journey from Base SAS® to SAS® High-Performance Risk
We are at a tipping point for credit risk modeling. To meet the technical and regulatory challenges of IFRS 9 and stress testing, and to strengthen model risk management, CBS aims to create an integrated, end-to-end, tools-based solution across the model lifecycle, with strong governance and controls and an improved scenario testing and forecasting capability. SAS has been chosen as the technology partner to enable CBS to meet these aims. A new predictive analytics platform combining well-known tools such as SAS® Enterprise Miner™, SAS® Model Manager, and SAS® Data Management alongside SAS® Model Implementation Platform powered by SAS® High-Performance Risk is being deployed. Driven by technology, CBS has also considered the operating model for credit risk, restructuring resources around the new technology with clear lines of accountability, and has incorporated a dedicated data engineering function within the risk modeling team. CBS is creating a culture of collaboration across credit risk that supports the development of technology-led, innovative solutions that not only meet regulatory and model risk management requirements but that set a platform for the effective use of predictive analytics enterprise-wide.
Chris Arthur-McGuire
Z
Session 0935-2017:
Zeroing In on Effective Member Communication: An Rx Education Study
In 2013, the Centers for Medicare & Medicaid Services (CMS) changed the pharmacy mail-order member-acquisition process so that Humana Pharmacy may call a member only when the potential cost savings are greater than $2.00, to educate the member on the potential savings and instruct the member to call back. The Rx Education call center asked for analytics work to help prioritize member outreach, improve conversions, and decrease the number of members who cannot be contacted. After a year of contacting members using this additional insight, the conversions-after-agreement rate rose from 71.5% to 77.5%, and the unable-to-contact rate fell from 30.7% to 17.4%. This case study takes you on an analytics journey from the initial problem diagnosis and analytics solution through subsequent refinements and test-and-learn campaigns.
Read the paper (PDF)
Brian Mitchell, Humana Inc.