Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a big data conference, data scientists from several companies were invited to tackle this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach, using SAS® analytical methods, was superior to other big data approaches in investigating these trends.
Scott Koval, Pinnacle Solutions, Inc
Yijie Li, Pinnacle Solutions, Inc
Mia Lyst, Pinnacle Solutions, Inc
With the increasing amount of educational data, educational data mining has become more and more important for uncovering the hidden patterns within institutional data so as to support institutional decision making (Luan 2012). However, only very limited studies have been done on educational data mining for institutional decision support. At the University of Connecticut (UCONN), organic chemistry is a required course for undergraduate students in a STEM discipline, and it has a very high DFW rate (D=Drop, F=Failure, W=Withdraw). In Fall 2014, for example, the average DFW rate for the Organic Chemistry lectures at UCONN was 24%, with over 1,200 students enrolled in the class. In this study, undergraduate students enrolled during School Year 2010-2011 were used to build the model. The purpose of this study was to predict student success so as to improve the education quality at our institution. The Sample, Explore, Modify, Model, and Assess (SEMMA) method introduced by SAS was applied to develop the predictive model. Freshman SAT scores, campus, semester GPA, financial aid, and other factors were used to predict students' performance in this course. In the predictive modeling process, several modeling techniques (decision tree, neural network, ensemble models, and logistic regression) were compared with each other in order to find an optimal one for our institution.
Youyou Zheng, University of Connecticut
Thanuja Sakruti, University of Connecticut
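The following is a minimal sketch of one of the candidate models compared in the study above: a logistic regression predicting course success. The data set and variable names (student_train, student_valid, sat_total, sem_gpa, aid_flag, campus, passed) are hypothetical placeholders.

    /* Candidate logistic regression for the SEMMA Model step.        */
    /* All data set and variable names are hypothetical placeholders. */
    proc logistic data=student_train descending;
       class aid_flag campus / param=ref;
       model passed = sat_total sem_gpa aid_flag campus;
       score data=student_valid out=student_scored;  /* for comparison */
    run;

In the study itself, the analogous decision tree, neural network, and ensemble models would be fit on the same data partition and compared on validation misclassification.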
As the open-source community has been taking the technology world by storm, especially in the big data space, large corporations such as SAS, IBM, and Oracle have been working to embrace this new, quickly evolving ecosystem to continue to foster innovation and to remain competitive. For example, SAS, IBM, and others have aligned with the Open Data Platform initiative and are continuing to build out Hadoop and Spark solutions. And, Oracle has partnered with Cloudera to create the Big Data Appliance. This movement challenges companies that consume these products to select the right products and support partners. The hybrid approach, using all available tools, seems to be the methodology chosen by most successful companies. West Corporation, an Omaha-based provider of technology-enabled communication solutions, is no exception. West has been working with SAS for 10 years in the ETL, BI, and advanced analytics space, and West began its Hadoop journey a year ago. This paper focuses on how West data teams use both technologies to improve the customer experience in the interactive voice response (IVR) system by storing massive semi-structured call logs in HDFS and by building models that predict a caller's intent in order to route the caller more efficiently and reduce customer effort, using familiar SAS code and the user-friendly SAS® Enterprise Miner™.
Sumit Sukhwani, West Corporation
Krutharth Peravalli, West Corporation
Amit Gautam, West Corporation
UNIX and Linux SAS® administrators, have you ever been greeted by one of these statements as you walk into the office before you have gotten your first cup of coffee? "Power outage! SAS servers are down." "I cannot access my reports." Have you frantically tried to restart the SAS servers to avoid loss of productivity and missed one of the steps in the process, causing further delays while other work continues to pile up? If you have had this experience, you understand the benefit to be gained from a utility that automates the management of these multi-tiered deployments. Until recently, there was no method for automatically starting and stopping multi-tiered services in an orchestrated fashion. Instead, you had to use time-consuming manual procedures to manage SAS services. These procedures were also prone to human error, which could result in corrupted services and additional time lost debugging and resolving issues introduced by this process. To address this challenge, SAS Technical Support created the SAS Local Services Management (SAS_lsm) utility, which provides automated, orderly management of your SAS® multi-tiered deployments. The intent of this paper is to demonstrate the deployment and usage of the SAS_lsm utility. Now, go grab a coffee, and let's see how SAS_lsm can make life less chaotic.
Clifford Meyers, SAS
Manufacturers of any product, from toys to medicine to automobiles, must create items that are, above all else, safe to use. Not only is this essential to long-term brand value and corporate success, but it is also required by law. Although perfection is the goal, defects are bound to occur, especially in advanced products such as automobiles. An automobile is the largest purchase most people make, next to a house, and when something that costs tens of thousands of dollars runs into problems, you tend to remember. Recalls in part reflect growing pains after decades of consolidation in the auto industry, and many believe that they are the culmination of years of neglect by manufacturers and the agencies that regulate them. For several reasons, largely stricter laws, heavier fines, and more cautious car makers, automakers are acting earlier and more often in issuing recalls, and in the past 20 years the number of voluntarily recalled vehicles has steadily grown. The automotive-recall landscape changed dramatically in 2000 with the passage of the federal TREAD Act. Before that, federal law required that automakers issue a recall only when a consumer reported a problem; TREAD requires that companies identify potential problems and promptly notify the NHTSA. This study helps automobile manufacturers understand customers who are talking about defects in their cars and be proactive in recalling the product at the right time, before the government acts.
Prathap Maniyur, Fractal Analytics
Mansi Bhat, Deloitte
Prashanth Nayak, Worldlink
Sovereign risk rating and country risk rating are conceptually distinct: the former captures the risk of a country defaulting on its commercial debt obligations using economic variables, while the latter covers the downside of a country's business environment, including political and social variables alongside economic variables. Through this paper, we would like to understand the differences between these risk approaches in assessing a country's creditworthiness by statistically examining the predictive power of political and social variables in determining country risk. To do this, we build two models: the first with economic variables as regressors (sovereign risk model) and the second with economic, political, and social variables as regressors (country risk model), comparing the predictive power of the regressors and the model performance metrics between the two. This is an OLS regression model with the country risk rating obtained from S&P as the target variable. Under the general assumption that economic variables are driven by political processes and social factors, we would like to see whether the second model has better predictive power. The economic, political, and social indicators used as independent variables in the model are obtained from World Bank Open Data, and the target variable (country risk rating) is obtained from S&P country risk ratings data.
Bhuvaneswari Yallabandi, Oklahoma State University
Vishwanath Srivatsa Kolar Bhaskara, Oklahoma State University
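A minimal sketch of the two nested OLS specifications described above, assuming a hypothetical input table country_panel with placeholder variable names; PROC REG accepts both models in one step:

    /* Sovereign risk model (economic regressors only) versus        */
    /* country risk model (adds political and social regressors).    */
    /* country_panel and all variable names are hypothetical.        */
    proc reg data=country_panel;
       sovereign: model sp_rating = gdp_growth inflation debt_to_gdp;
       country:   model sp_rating = gdp_growth inflation debt_to_gdp
                                    polit_stability rule_of_law
                                    school_enrollment;
    run;

Comparing the fit statistics of the two MODEL statements (R-square, adjusted R-square) indicates whether the political and social variables add predictive power.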
A Bayesian network is a directed acyclic graphical model that represents probability relationships and conditional independence structure between random variables. SAS® Enterprise Miner implements a Bayesian network primarily as a classification tool; it includes naïve Bayes, tree-augmented naïve Bayes, Bayesian-network-augmented naïve Bayes, parent-child Bayesian network, and Markov blanket Bayesian network classifiers. The HPBNET procedure uses a score-based approach and a constraint-based approach to model network structures. This paper compares the performance of Bayesian network classifiers to other popular classification methods, such as classification tree, neural network, logistic regression, and support vector machines. The paper also shows some real-world applications of the implemented Bayesian network classifiers and a useful visualization of the results.
Ye Liu, SAS
Weihua Shi, SAS
Wendy Czika, SAS
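As a minimal sketch of the HPBNET procedure discussed above, the following fits a tree-augmented naïve Bayes classifier; the data set and variable names are hypothetical, and the option values shown are illustrative:

    /* Tree-augmented naive Bayes with PROC HPBNET.                  */
    /* bank_train and all variable names are hypothetical.           */
    proc hpbnet data=bank_train structure=TAN maxparents=2;
       target default;
       input age balance / level=INT;   /* interval inputs */
       input job housing / level=NOM;   /* nominal inputs  */
       output network=net_out;
    run;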
Rapid advances in technology have empowered musicians all across the globe to share their music easily, resulting in intensified competition in the music industry. For this reason, musicians and record labels need to be aware of factors that can influence the popularity of their songs. The focus of our study is to determine how themes, topics, and terms within song lyrics have changed over time and how these changes might have influenced the popularity of songs. Moreover, we plan to run time series analysis on the numeric attributes of Billboard Top 100 songs in order to determine the appropriate combination of relevant attributes that influences a song's popularity. The findings of our study can potentially benefit musicians and record labels in understanding the necessary lyrical construction, overall themes, and topics that might enable a song to reach the highest chart position on the Billboard Top 100. The Billboard Top 100 is an optimal source of data, as it is an objective measure of popularity. Our data has been collected from open sources. Our data set consists of all 334,784 Billboard Top 100 observations for the years 1955-2015, with metadata covering all 26,869 unique songs that have appeared on the chart for that period. Our expanding lyric data set currently contains 18,002 of those songs, which were used to conduct our analysis. SAS® Enterprise Miner and SAS® Sentiment Analysis Studio were the primary tools of our analysis.
Jayant Sharma, Oklahoma State University
John Harden, Sandia National Laboratories
Session SAS1414-2017:
Churn Prevention in the Telecom Services Industry: A Systematic Approach to Prevent B2B Churn Using SAS®
"It takes months to find a customer and only seconds to lose one" (Unknown). Though the Business-to-Business (B2B) churn problem might not be as common as Business-to-Consumer (B2C) churn, it has become crucial for companies to address it effectively as well. Using statistical methods to predict churn is the first step in the process of retaining customers, which also includes model evaluation, prescriptive analytics (including outreach optimization), and performance reporting. Providing visibility into model and treatment performance enables the Data and Ops teams to tune models and adjust treatment strategy. West Corporation's Center for Data Science (CDS) has partnered with one of its lines of business in order to measure and prevent B2B customer churn. CDS has coupled firmographic and demographic data with internal CRM and past outreach data to build a Propensity to Churn model using SAS®. CDS has provided the churn model output to an internal Client Success Team (CST), which focuses on high-risk/high-value customers in order to understand and provide resolution to any potential concerns that such customers might express. Furthermore, CDS automated weekly performance reporting using SAS and Microsoft Excel that focuses not only on model statistics, but also on CST actions and impact. This paper covers all of the steps involved in the churn-prevention process, including building and reviewing the model, treatment design and implementation, and performance reporting.
Krutharth Peravalli, West Corporation
Dmitriy Khots, West Corporation
Most studies of human disease networks have estimated the associations between disorders primarily from gene or protein information. Those studies, however, face difficulties because of the massive volume of data and the huge computational cost. Instead, we constructed a human disease network that describes the associations between diseases using claims data from Korean health insurance. Through several statistical analyses, we show the applicability and suitability of the disease network. Furthermore, we develop a statistical model that predicts the prevalence rate of dementia by using the significant associations of the network from a statistical perspective.
Jinwoo Cho, Sung Kyun Kwan University
Applying solutions that recommend products to final customers in e-commerce is already a known practice: crossing consumer profile information with consumer behavior tends to generate results that are more than satisfactory for the business. Natura's challenge was to create the same type of solution for its sales representatives in the platform used for ordering. The sales representatives are not buying for their own consumption, but rather are ordering according to the demands of their customers. That is the difference: in this case, the analyst does not have information about the behavior or preferences of the final client. By creating a basket-product concept for its sales representatives, Natura developed a new solution: an algorithm using association analysis (market basket), implemented directly in the sales platform using SAS® Real-Time Decision Manager. Measuring the results in indication conversions (products added to orders), the amount brought in by the new solution was 53% higher than for indications that used random suggestions, and 38% higher than for those that used business rules.
Francisco Pigato, Natura
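A minimal sketch of the support, confidence, and lift computation that underlies such market basket rules, using a hypothetical transaction table orders(order_id, product):

    /* Pairwise association metrics from a hypothetical orders table. */
    proc sql;
       create table item_freq as
       select product, count(distinct order_id) as n
       from orders group by product;

       create table pair_freq as
       select a.product as lhs, b.product as rhs,
              count(distinct a.order_id) as n_both
       from orders a, orders b
       where a.order_id = b.order_id and a.product ne b.product
       group by 1, 2;

       create table rules as
       select p.lhs, p.rhs,
              p.n_both / l.n as confidence,          /* P(rhs | lhs) */
              (p.n_both / l.n) /
              (r.n / (select count(distinct order_id) from orders))
                 as lift
       from pair_freq p, item_freq l, item_freq r
       where p.lhs = l.product and p.rhs = r.product;
    quit;

Rules with high confidence and lift above 1 are the candidate real-time suggestions; in production, the Association node in SAS® Enterprise Miner computes these metrics directly.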
Customer feedback is a critical aspect of business in today's world, as it is invaluable in determining what customers like and dislike about the business's service. This loop of regularly listening to the customer's voice through survey comments and improving services based on it leads to better business and, more importantly, to an enhanced customer experience. The challenge is to classify and analyze these unstructured text comments to gain insights and to focus on areas of improvement. The purpose of this paper is to illustrate how text mining in SAS® Enterprise Miner 14.1 helped one of our clients, a leading financial services company, convert their customers' problems into opportunities. The customers' feedback about their experience with an interactive voice response (IVR) system is collected by an enterprise feedback management (EFM) company. The comments are then split into two groups, which helps us differentiate customer opinions. This grouping is based on customers who have given a rating of 0-6 versus a rating of 9-10 on a Likert scale of 0-10 (10 being extremely satisfied) in the survey questionnaire. Text mining is performed on both groups, and an algorithm creates clusters that are consequently used to segment customers based on the opinions they are interested in voicing. Furthermore, sentiment scores are calculated for each of the segments. The scores classify the polarity of customer feedback and prioritize the problems the client needs to focus on.
Vinoth Kumar Raja, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
Session 1068-2017:
Establishing an Agile, Self-Service Environment to Empower Agile Analytic Capabilities
Creating an environment that enables and empowers self-service and agile analytic capabilities requires extensive collaboration and agreement between IT and the business. Business and IT users struggle to know which version of the data is valid, where they should get the data from, and how to combine and aggregate all the data sources to apply analytics and deliver results in a timely manner. All the while, IT struggles to supply the business with more and more data that is becoming available through many different data sources such as the Internet, sensors, the Internet of Things, and others. In addition, once they start trying to join and aggregate all the different types of data, the manual coding can be very complicated and tedious, can demand extra resources and processing, and can add overhead to the system. If IT enables agile analytics in a data lab, it can alleviate many of these issues, increase productivity, and deliver an effective self-service environment for all users. This self-service environment using SAS® analytics in Teradata has decreased the time required to prepare the data and develop the statistical data model, and delivered results in minutes rather than days or even weeks. This session discusses how you can enable agile analytics in a data lab, leverage SAS analytics in Teradata to increase performance, and learn how hundreds of organizations have adopted this concept to deliver self-service capabilities in a streamlined process.
Bob Matsey, Teradata
David Hare, SAS
Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time, bachelor's degree-seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS® Visual Analytics, where interactive reports allowed users to easily identify which students were at high risk of attrition or of not graduating on time.
Taylor Blaetz, Western Kentucky University
Tuesdi Helbig, Western Kentucky University
Gina Huff, Western Kentucky University
Matt Bogard, Western Kentucky University
Session 2000-2017:
Hands-On Workshop: Data Mining using SAS® Enterprise Miner™
This workshop provides hands-on experience with using SAS Enterprise Miner. Workshop participants will learn to do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.
Carlos Andre Reis Pinheiro, SAS
How would you answer this question? Most of us struggle to articulate the value of the tools, techniques, and teams we use for analytics. How do you help a new director understand the value of SAS® to you, your job, and the company? In this interactive session, you will discover the components that make up total cost of ownership (TCO) as they apply to the analytics lifecycle. What should you consider when you evaluate total cost of ownership, and why should you measure it? How can you help your management team understand the value that SAS provides?
Melodie Rush, SAS
Session 1349-2017:
Inference from Smart Meter Data Using the Fourier Transform
This presentation demonstrates that applying the fast Fourier transform (FFT) to smart meter data can provide enhanced customer segmentation and discovery. The FFT is a mathematical method for transforming a function of time into a function of frequency. It is widely used in analyzing sound but is also relevant for utilities. Advanced Metering Infrastructure (AMI) refers to the full measurement and collection system that includes meters at the customer site and communication networks between the customer and the utility. With the inception of AMI, utilities experienced an explosion of data that provides vast analytical opportunities to improve reliability, customer satisfaction, and safety. However, the data explosion comes with its own challenges. The first challenge is the volume: just 20,000 customers with AMI data can generate over 300 GB of data per year, and simply aggregating the data from minutes to hours or even days can skew results and yield inaccurate segmentations. The second challenge is bad data: outliers caused by missing or incorrect reads, outages, or other factors must be addressed, and FFT can eliminate this noise. The proposed framework is expected to identify various customer segments that could be used for demand response programs. The framework also has the potential to investigate diversion, fraud, or failing meters (revenue protection), which is a big problem for many utilities.
Tom Anderson, SAS
Prasenjit Shil, Ameren
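A minimal sketch of the frequency-domain transformation described above, using PROC SPECTRA from SAS/ETS; the input table ami_reads and the variable kwh are hypothetical:

    /* Periodogram and spectral density of an interval load series.  */
    /* ami_reads and kwh are hypothetical names.                     */
    proc spectra data=ami_reads out=spec p s adjmean;
       var kwh;
    run;

The OUT= data set contains frequency (FREQ), periodogram (P_01), and spectral density (S_01) columns, which can then feed clustering for segmentation instead of skew-prone time aggregates.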
Many practitioners of machine learning are familiar with support vector machines (SVMs) for solving binary classification problems. Two established methods of using SVMs in multinomial classification are the one-versus-all approach and the one-versus-one approach. This paper describes how to use SAS® software to implement these two methods of multinomial classification, with emphasis on both training the model and scoring new data. A variety of data sets are used to illustrate the pros and cons of each method.
Ralph Abbey, SAS
Taiping He, SAS
Tao Wang, SAS
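A minimal sketch of the one-versus-all scheme, with hypothetical data set and variable names; the exact statements available depend on your SAS Enterprise Miner release:

    /* One-versus-all: fit one binary SVM per class level.           */
    /* train, x1-x4, and class_var are hypothetical names.           */
    %macro ova(levels);
       %do i = 1 %to %sysfunc(countw(&levels));
          %let lev = %scan(&levels, &i);
          data train_bin;
             set train;
             length is_target $3;
             is_target = ifc(class_var = "&lev", "yes", "no");
          run;
          proc hpsvm data=train_bin;
             input x1-x4 / level=interval;
             target is_target;
             code file="svm_&lev..sas";  /* DATA step score code */
          run;
       %end;
    %mend ova;
    %ova(setosa versicolor virginica);

To score new data, run each generated score-code file in a DATA step and assign the class whose posterior probability is largest.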
This paper establishes the conceptualization of the dimension of the shopping cart (or market basket) on apparel retail websites. It analyzes how the cart dimension (describing anonymous shoppers) and the customer dimension (describing non-anonymous shoppers) impact merchandise return behavior. Five data-mining techniques, namely logistic regression, decision tree, neural network, gradient boosting, and support vector machine, are used for predicting the likelihood of merchandise return. The target variable is a dichotomous response: return versus not return. The primary input variables are conceptualized as constituents of the cart dimension, derived from engineering merchandise-related variables such as item style, item size, and item color, as well as free-shipping-related thresholds. By further incorporating constituents of the customer dimension such as tenure, loyalty membership, and purchase history, the predictive accuracy of the model built with each of the five data-mining techniques was found to improve substantially. This research also highlights the relative importance of the constituents of the cart and customer dimensions governing the likelihood of merchandise return. Recommendations for possible applications and research areas are provided.
Sunny Lam, ANN Inc.
Many communication channels exist for customers to engage with businesses, yet the interactive voice response (IVR) system remains the most critical of them. This is because the IVR acts as the front end of consumer interaction and is the most effective method for customers to do business with companies and resolve their issues before talking to an agent. If the IVR interface is not designed properly, customers can be stuck in an endless loop of pressing buttons, which leads to annoyance. The bottom line is: an IVR system should be set up to quickly resolve as many routine inbound inquiries as possible and to allow customers to speak to an agent when necessary. In order to accomplish this, the IVR interface has to be optimized so that it is fully effective and provides a great customer experience. This paper demonstrates how SAS® tools helped optimize the IVR system of a book publishing company. The data set used in this study was obtained from a telecom services company and contained IVR logs of more than 300,000 calls with 1.4 million observations. To gain insights into customer behavior, path analysis was performed on this data using SAS® Enterprise Miner, and the obstacles faced by customers were identified. This helped in determining underperforming prompts, and analysis using SAS procedures was conducted on such prompts. Prompt tuning was recommended, and new self-service areas were identified that avoid transfers and can save clients thousands of dollars in call center investments.
Padmashri Janarthanam, University of Nebraska Omaha
Vinoth Kumar Raja, West Corporation
Outliers, such as unusual, unexpected, or rare events and violations, have received intensive attention from researchers and practitioners because of their impact on estimated statistics and developed models. Today, some business disciplines focus primarily on outliers: credit defaults, operational risks, quality nonconformities, fraud, or even the results of marketing initiatives in highly competitive environments with response rates of a couple percent or less. This paper discusses the importance of detecting, isolating, and categorizing business outliers to discover their root causes and to monitor them dynamically. Detecting outliers not only by extreme values or multivariate densities, but also by distributions, patterns, clusters, combinations of items, and sequences of events, creates opportunities for business improvement. SAS® Enterprise Miner can be used to perform such detections. Creating special business segments or running specialized outlier-oriented data mining processes, such as decision trees, allows business-important outliers, which are normally masked by traditional statistical techniques, to be isolated. This process, combined with what-if scenario generation, prepares businesses for possible future surges even when no outliers of a specific type are currently present. Furthermore, analyzing some specific outliers may play a role in assessing business stability under corresponding stress tests.
Alex Glushkovsky, BMO Financial Group
Research using electronic health records (EHRs) is emerging, but questions remain about their completeness, due in part to the time required for physicians to enter data in all fields. This presentation demonstrates the use of SAS® Enterprise Miner to predict the completeness of clinical data, using claims data as the standard 'source of truth' against which to compare it. A method for assessing and predicting the completeness of clinical data is presented using the tools and techniques of SAS Enterprise Miner. Topics covered include: tips for preparing your sample data set for use in SAS Enterprise Miner; tips for preparing your sample data set for modeling, including effective use of the Input Data, Data Partition, Filter, and Replacement nodes; and building predictive models using the StatExplore, Decision Tree, Regression, and Model Compare nodes.
Catherine Olson, Optum
Thomas Horstman, Optum
The most commonly reported model evaluation metric is accuracy. This metric can be misleading when the data are imbalanced; in such cases, other evaluation metrics should be considered in addition to accuracy. This study reviews alternative evaluation metrics for assessing the effectiveness of a model on highly imbalanced data. We used credit card clients in Taiwan as a case study. The data set contains 30,000 instances (22.12% risky and 77.88% non-risky) assessing the likelihood of a customer defaulting on a payment. Three different techniques were used during the model building process. The first technique involved down-sampling the majority class in the training subset. The second used the original imbalanced data, whereas in the third, prior probabilities were set to account for oversampling. The same sets of predictive models were then built for each technique, after which the evaluation metrics were computed. The results suggest that when the data are imbalanced, model evaluation metrics might reveal more about the distribution of classes than about the actual performance of models. Moreover, some of the predictive models were identified as being very sensitive to imbalance. The final decision in model selection should consider a combination of different measures instead of relying on one. To minimize imbalance-biased estimates of performance, we recommend reporting both the obtained metric values and the degree of imbalance in the data.
Josephine Akosa, Oklahoma State University
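A minimal sketch of the alternative metrics such a study compares, computed here from hypothetical confusion-matrix counts:

    /* Evaluation metrics from hypothetical confusion-matrix counts. */
    data metrics;
       tp = 950;  fn = 377;    /* risky (minority) class     */
       fp = 412;  tn = 4261;   /* non-risky (majority) class */
       accuracy    = (tp + tn) / (tp + tn + fp + fn);
       sensitivity = tp / (tp + fn);      /* recall on minority */
       specificity = tn / (tn + fp);
       precision   = tp / (tp + fp);
       f1    = 2 * precision * sensitivity / (precision + sensitivity);
       gmean = sqrt(sensitivity * specificity);
       put accuracy= sensitivity= specificity= precision= f1= gmean=;
    run;

On imbalanced data, accuracy can look strong while sensitivity, F1, and the G-mean expose a model that rarely catches the minority class.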
Predictive analytics has been evolving in property and casualty (P&C) insurance for the past two decades. This paper first provides a high-level overview of predictive analytics in each of the following core business operations in the P&C insurance industry: marketing, underwriting, actuarial pricing, actuarial reserving, and claims. Then, a common P&C insurance predictive modeling technical process in SAS® for dealing with large data sets is introduced. The steps of this process include data acquisition, data preparation, variable creation, variable selection, model building (also known as model fitting), model validation, and model testing. Finally, some successful models are introduced. Base SAS®, SAS/STAT® software, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process. This predictive modeling process could be tweaked or used directly in many other industries, as the statistical foundations of predictive analytics overlap substantially across the P&C insurance, health care, life insurance, banking, pharmaceutical, and genetics industries, among others. This paper is intended for SAS® users of any level and for business people from different industries who are interested in learning about general predictive analytics.
Mei Najim, Gallagher Bassett
A random forest is an ensemble of decision trees that often produces more accurate results than a single decision tree. The predictions of the individual trees in the forest are averaged to produce a final prediction. The question arises whether a more accurate final prediction can be obtained by a more intelligent use of the trees in the forest. In particular, as random forests are currently defined, every tree contributes the same fraction to the final result (for example, if there are 50 trees, each tree contributes 1/50th). This ignores model uncertainty, as less accurate trees are treated exactly like more accurate trees. Replacing averaging with Bayesian model averaging gives better trees the opportunity to contribute more to the final result, which might lead to more accurate predictions. However, several complications of this approach have to be resolved, such as the computation of an SBC value for a decision tree. Two novel approaches to solving this problem are presented, and the results are compared to those obtained with the standard random forest approach.
Tiny Du Toit, North-West University
Andre De Waal, SAS
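A minimal sketch of the weighting scheme Bayesian model averaging implies, with w_i proportional to exp(-SBC_i/2), over a hypothetical table tree_sbc(tree_id, sbc) of per-tree SBC values:

    /* Shift by the minimum SBC for numerical stability, then        */
    /* normalize the weights. tree_sbc is a hypothetical table.      */
    proc sql noprint;
       select min(sbc) into :sbcmin from tree_sbc;
    quit;

    data raw;
       set tree_sbc;
       w_raw = exp(-(sbc - &sbcmin) / 2);
    run;

    proc sql;
       create table bma_weights as
       select tree_id, w_raw / (select sum(w_raw) from raw) as weight
       from raw;
    quit;

The final forest prediction is then the weighted, rather than equal-weight, average of the individual tree predictions.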
AdaBoost (or Adaptive Boosting) is a machine learning method that builds a series of decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. I discuss the AdaBoost methodology and introduce the extension called Real AdaBoost, which is so similar to stepwise weight of evidence logistic regression (SWOELR) that it might offer a framework with which we can understand the power of the SWOELR approach. I discuss the advantages of Real AdaBoost, including variable interaction and adaptive, stage-wise binning, and demonstrate a SAS® macro that uses Real AdaBoost to generate predictive models.
Paul Edwards, ScotiaBank
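For reference, a sketch of the discrete AdaBoost round that Real AdaBoost generalizes, with labels y_i in {-1, +1}:

    \epsilon_m = \sum_i w_i \,\mathbf{1}\{h_m(x_i) \neq y_i\}, \qquad
    \alpha_m   = \tfrac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}, \qquad
    w_i \leftarrow \frac{w_i\, e^{-\alpha_m y_i h_m(x_i)}}{Z_m}

Real AdaBoost replaces the discrete vote h_m with a half log-odds score built from each tree's class probability estimate p_m(x):

    f_m(x) = \tfrac{1}{2}\ln\frac{p_m(x)}{1-p_m(x)}, \qquad
    F(x)   = \sum_m f_m(x)

which is where the resemblance to SWOELR comes from: each f_m is a binned log-odds (weight-of-evidence-like) transform added stage-wise to the score.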
This project describes a method for classifying movie genres from synopsis text data using two approaches: term frequency-inverse document frequency (tf-idf) and a C4.5 decision tree. By comparing the performance of the classifiers under different parameter settings, the strengths of this method for substantial text analysis are also interpreted. The results show that both approaches are effective at identifying movie genres.
Yiyun Zhou, Kennesaw State University
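A minimal sketch of the tf-idf weighting used above, tfidf = tf × log(N/df), over a hypothetical term table term_counts(doc_id, term, tf):

    /* tf-idf from a hypothetical term_counts(doc_id, term, tf) table. */
    proc sql;
       create table tfidf as
       select t.doc_id, t.term,
              t.tf * log(
                 (select count(distinct doc_id) from term_counts)
                 / d.df) as tfidf
       from term_counts t,
            (select term, count(distinct doc_id) as df
             from term_counts group by term) d
       where t.term = d.term;
    quit;

Terms that appear in nearly every synopsis get a weight near zero, while genre-specific terms are up-weighted for the classifier.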
Session SAS1407-2017:
The Benefit of Using Clustering as Input to a Propensity to Buy Predictive Model
Propensity to Buy models comprise one of the most widely used techniques in supporting business strategy for customer segmentation and targeting. Some of the key challenges every data scientist faces in building predictive models are the utilization of all known predictor variables, uncovering any unknown signals, and adjusting for latent variable errors. Often, the business demands the inclusion of certain variables based on a previous understanding of process dynamics. To meet such client requirements, these inputs are forced into the model, resulting in either a complex model with too many inputs or a fragile model that might decay faster than expected. West Corporation's Center for Data Science (CDS) has found a workaround that strikes a balance between meeting client requirements and building a robust model by using clustering techniques. A leading telecom services provider uses West's SMS Outbound Notification Platform to notify its customers about upcoming Pay-Per-View events. As part of the modeling process, the client identified a few variables as key business drivers, and CDS used those variables to build clusters, which were then used as inputs to the predictive model. In doing so, not only were all the effects of the client-mandated variables captured successfully, but the number of inputs to the model was also reduced, making it parsimonious. This paper illustrates how West has used clustering in the data preparation process and built a robust model.
Krutharth Peravalli, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
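A minimal sketch of the cluster-then-model pattern described above, with hypothetical data set and variable names throughout:

    /* Step 1: cluster the client-mandated variables.                */
    /* cust_train and all variable names are hypothetical.           */
    proc fastclus data=cust_train maxclusters=5 out=cust_clus;
       var tenure monthly_spend past_ppv_buys;
    run;

    /* Step 2: use the cluster assignment as one class input in the  */
    /* propensity model instead of the raw mandated variables.       */
    proc logistic data=cust_clus descending;
       class cluster / param=ref;
       model bought = cluster channel_pref days_since_notify;
    run;

The CLUSTER variable written by PROC FASTCLUS carries the mandated variables' joint effect in a single input, keeping the model parsimonious.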
This paper discusses a specific example of using graph analytics or social network analysis (SNA) in predictive modeling in the life insurance industry. The methods of social network analysis are applied to agents that share compensation, and the results are used to derive input variables for a model to predict the likelihood of certain behavior by insurance agents. Both SAS® code and SAS® Enterprise Miner are used to illustrate implementing different graph analytical methods. This paper assumes that the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression, and, in some sections, familiarity with SAS Enterprise Miner.
Robert Moore, Thrivent Financial
Access to care for Medicaid beneficiaries is a topic of frequent study and debate. Section 1202 of the Affordable Care Act (ACA) requires states to raise Medicaid primary care payment rates to Medicare levels in 2013 and 2014. The federal government paid 100% of the increase. This program was designed to encourage primary care providers to participate in Medicaid, since this has long been a challenge for Medicaid. Whether this fee increase has increased access to primary care providers is still debated. Using SAS®, we evaluated whether Medicaid patients have a higher incidence of non-urgent visits to local emergency departments (ED) than do patients with other payment sources. The National Hospital Ambulatory Medical Care Survey (NHAMCS) data set, obtained from the Centers for Disease Control (CDC), was selected, since it contains data relating to hospital emergency departments. This emergency room data, for years 2003-2011, was analyzed by diagnosis, expected payment method, reason for the visit, region, and year. To evaluate whether the ED visits were considered urgent or non-urgent, we used the NYU Billings algorithm for classifying ED utilization (NYU Wagner 2015). Three models were used for the analyses: Binary Classification, Multi-Classification, and Regression. In addition to finding no regional differences, decision trees and SAS® Visual Analytics revealed that Medicaid patients do not have a higher rate of non-emergent visits when compared to other payment types.
Bradley Casselman, CSA
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
Transformation of raw data into sensible and useful information for prediction purposes is a priceless skill nowadays. Vast amounts of data, easily accessible at each step in a process, give us a great opportunity to use them for countless applications. Unfortunately, not all valuable data are available for processing using classical data mining techniques. What happens if textual data is also used to create the analytical base table (ABT)? The goal of this study is to investigate whether scoring models that also use textual data are significantly better than models that include only quantitative data. This thesis focuses on estimating the probability of default (PD) for the social lending platform kokos.pl. The same methods used in banks are used to evaluate the accuracy of the reported PDs. Data used for the analysis are gathered directly from the platform via the API. This paper describes in detail the steps of the data mining process built using SAS® Enterprise Miner™. The results of the study support the thesis that models with a properly conducted text-mining process have better classification quality than models without text variables. Therefore, the use of this data mining approach is recommended when input data include text variables.
Piotr Malaszek, SCS Expert
In industrial systems, vibration signals are the most important measurements for indicating asset health. Based on these measurements, an engineer with expert knowledge of the assets, the industrial process, and vibration monitoring can perform spectral analysis to identify failure modes. However, this is still a manual process that depends heavily on the experience and knowledge of the engineer analyzing the vibration data. Moreover, when measurements are performed continuously, it becomes impossible to act on this data in real time. The objective of this paper is to examine the use of analytics to perform vibration spectral analysis in real time to predict asset failures. The first step in this approach is to translate engineering knowledge and features into analytic features in order to perform predictive modeling. This process involves converting the time signal into the frequency domain by applying a fast Fourier transform (FFT). Based on the specific design characteristics of the asset, it is possible to derive the relevant features of the vibration signal to predict asset failures. This approach is illustrated using a bearing data set available from the Prognostics Data Repository of the National Aeronautics and Space Administration (NASA). Modeling is done using R and is integrated within SAS® Asset Performance Analytics. In essence, this approach helps engineers make better data-driven decisions, and it shows the strength of combining engineering knowledge with advanced analytics.
Adriaan Van Horenbeek, SAS
Increasingly, customers are using social media and other Internet-based applications such as review sites and discussion boards to voice their opinions and express their sentiments about brands. Such spontaneous and unsolicited customer feedback can provide brand managers with valuable insights about competing brands. There is a general consensus that listening to and reacting to the voice of the customer is a vital component of brand management. However, the unstructured, qualitative, and textual nature of customer data that is obtained from customers poses significant challenges for data scientists and business analysts. In this paper, we propose a methodology that can help brand managers visualize the competitive structure of a market based on an analysis of customer perceptions and sentiments that are obtained from blogs, discussion boards, review sites, and other similar sources. The brand map is designed to graphically represent the association of product features with brands, thus helping brand managers assess a brand's true strengths and weaknesses based on the voice of customers. Our multi-stage methodology uses the principles of topic modeling and sentiment analysis in text mining. The results of text mining are analyzed using correspondence analysis to graphically represent the differentiating attributes of each brand. We empirically demonstrate the utility of our methodology by using data collected from Edmunds.com, a popular review site for car buyers.
Praveen Kumar Kotekal, Oklahoma State University
Amit K Ghosh, Cleveland State University
Goutam Chakraborty, Oklahoma State University
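A minimal sketch of the correspondence analysis step described above, assuming a hypothetical brand-by-attribute frequency table brand_attr(brand, attribute, n) built from the text mining output:

    /* Two-dimensional correspondence analysis of brand-attribute    */
    /* mention counts. brand_attr and its columns are hypothetical.  */
    proc corresp data=brand_attr out=coords dimens=2;
       tables brand, attribute;
       weight n;
    run;

Plotting the row (brand) and column (attribute) coordinates from the OUT= data set yields the perceptual brand map.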