SAS® and SAS® Enterprise Miner™ have provided advanced data mining and machine learning capabilities for years, beginning long before the current buzz. Moreover, SAS has continually incorporated advances in machine learning research into its classification, prediction, and segmentation procedures. SAS Enterprise Miner now includes many proven machine learning algorithms in its high-performance environment and is introducing new leading-edge scalable technologies. This paper provides an overview of machine learning and presents several supervised and unsupervised machine learning examples that use SAS Enterprise Miner. So, come back to the future to see machine learning in action with SAS!
Patrick Hall, SAS
Jared Dean, SAS
Ilknur Kaynar Kabul, SAS
Jorge Silva, SAS
The proliferation of textual data in business is overwhelming. Unstructured textual data is constantly being generated via call center logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, and so on. While the amount of textual data is increasing rapidly, businesses' ability to summarize, understand, and make sense of such data for making better business decisions remains challenging. This presentation takes a quick look at how to organize and analyze textual data in order to extract insightful customer intelligence from a large collection of documents and to use such information to improve business operations and performance. Multiple case studies using real data are presented to demonstrate business applications of text analytics and sentiment mining using SAS® Text Miner and SAS® Sentiment Analysis Studio. While SAS® products are used as tools for demonstration only, the topics and theories covered are generic (not tool specific).
Goutam Chakraborty, Oklahoma State University
Murali Pagolu, SAS
The power of social media has increased to such an extent that businesses that fail to monitor consumer responses on social networking sites are now clearly at a disadvantage. In this paper, we aim to provide some insights on the impact of the Digital Rights Management (DRM) policies of Microsoft and the release of Xbox One on their customers' reactions. We have conducted preliminary research to compare the basic text mining capabilities of SAS® and R, two very diverse yet powerful tools. A total of 6,500 Tweets were collected to analyze the impact of the DRM policies of Microsoft. The Tweets were segmented into three groups based on date: before Microsoft announced its Xbox One policies (May 18 to May 26), after the policies were announced (May 27 to June 16), and after changes were made to the announced policies (June 16 to July 1). Our results suggest that SAS works better than R when it comes to extensive analysis of textual data. In our following work, customers' reactions to the release of Xbox One will be analyzed using SAS® Sentiment Analysis Studio. We will collect Tweets on Xbox posted before and after the release of Xbox One by Microsoft. We will have two categories: Tweets posted between November 15 and November 21, and those posted between November 22 and November 29. Sentiment analysis will then be performed on these Tweets, and the results will be compared between the two categories.
Aditya Datla, Oklahoma State University
Reshma Palangat, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Most marketers today are trying to use Facebook's network of 1.1 billion plus registered users for social media marketing. Local television stations and newspapers are no exception. This paper investigates what makes a post effective. A Facebook page that is owned by a brand has fans, or people who like the page and follow the stories posted on that page. The posts on a brand page, however, do not appear on all the fans' News Feeds. This is determined by EdgeRank, a Facebook proprietary algorithm that determines what content users see and how it is prioritized on their News Feed. If marketers can understand how EdgeRank works, then they can develop more impactful posts and ultimately produce more effective social marketing using Facebook. The objective of this paper is to find the characteristics of a Facebook post that enhance the efficacy of a news outlet's page among its fans, using Facebook Power Ratio as the target variable. Power Ratio, a surrogate for EdgeRank, was developed by experts at Frank N. Magid Associates, a research-based media consulting firm. Seventeen variables that describe the characteristics of a post were extracted from more than 8,000 posts, which were encoded by 10 media experts at Magid. Numerous models were built and compared to predict Power Ratio. The most useful model is a polynomial regression whose top three important factors are whether a post asks fans to like it, the content category of the post (news, weather, and so on), and the number of fans of the page.
Dinesh Yadav Gaddam, Oklahoma State University
Yogananda Domlur Seetharama, Oklahoma State University
Many neuroscience researchers have explored how various parts of the brain are connected, but no one has performed association mining using brain data. In this study, we used SAS® Enterprise Miner™ 7.1 for association mining of brain data collected by a 14-channel EEG device. We present an application of the association mining technique in this novel context of brain activity and link our results to theories of cognitive neuroscience. The brain waves were collected while a user processed information about Facebook, the most well-known social networking site. The data was cleaned using Independent Component Analysis via an open-source MATLAB package. Next, by applying the LORETA algorithm, activations at every fraction of a second were recorded. The data was codified into transactions to perform association mining. Results showing how various parts of the brain are activated while processing the information are reported. This study provides preliminary insights into how brain wave data can be analyzed by widely available data mining techniques to enhance researchers' understanding of brain activation patterns.
Pankush Kalgotra, Oklahoma State University
Ramesh Sharda, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Do you have an abstract for an idea that you want to submit as a proposal to SAS® conferences, but you are not sure which section is the most appropriate one? In this paper, we discuss a methodology for automatically identifying the most suitable section or sections for your proposal content. We use SAS® Text Miner 12.1 and SAS® Content Categorization Studio 12.1 to develop a rule-based categorization model. This model is used to automatically score your paper abstract to identify the most relevant and appropriate conference sections to submit to for a better chance of acceptance.
Goutam Chakraborty, Oklahoma State University
Murali Pagolu, SAS
In our previous work, we often needed to perform large numbers of repetitive and data-driven post-campaign analyses to evaluate the performance of marketing campaigns in terms of customer response. These routine tasks were usually carried out manually in Microsoft Excel, which was tedious, time-consuming, and error-prone. In order to improve work efficiency and analysis accuracy, we automated the analysis process with SAS® programming and replaced the manual Excel work. Through the use of SAS macro programs and other advanced techniques, we successfully automated the complicated data-driven analyses with high efficiency and accuracy. This paper presents and illustrates the creative analytical ideas and programming techniques used to develop the automated analysis process, which can be extended to a variety of business intelligence and analytics fields.
Justin Jia, Canadian Imperial Bank of Commerce (CIBC)
Amanda Lin, Bell Canada
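To make the idea concrete, the following is a minimal sketch of a data-driven macro loop of the kind described above. It is not the authors' code; the data set names (campaign_list, campaign_data) and variables (campaign_id, response) are hypothetical placeholders, with campaign_id assumed to be a character variable.

%macro analyze_campaigns;
   /* Read the list of campaigns to process into macro variables */
   proc sql noprint;
      select campaign_id into :camp1-:camp999
      from campaign_list;
      %let ncamp = &sqlobs;
   quit;

   /* Repeat the same response summary for every campaign automatically */
   %do i = 1 %to &ncamp;
      proc freq data=campaign_data(where=(campaign_id="&&camp&i"));
         tables response / nocum;
         title "Response summary for campaign &&camp&i";
      run;
   %end;
%mend analyze_campaigns;

%analyze_campaigns

Because the campaign list drives the loop, adding a new campaign to campaign_list is enough to produce its analysis without touching the code.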
The Affordable Care Act (ACA) contains provisions that have stimulated interest in analytics among health care providers, especially those provisions that address quality of outcomes. High Impact Technologies (HIT) has been addressing these issues since before passage of the ACA and has a Health Care Data Model recognized by Gartner and implemented at several health care providers. Recently, HIT acquired SAS® Visual Analytics, and this paper reports our successful efforts to use SAS Visual Analytics for visually exploring Big Data for health care providers. Health care providers can suffer significant financial penalties for readmission rates above a certain threshold and other penalties related to quality of care. We have been able to use SAS Visual Analytics, coupled with our experience gained from implementing the HIT Health Care Data Model at a number of health care providers, to identify clinical measures that are significant predictors of readmission. As a result, we can help health care providers reduce the rate of 30-day readmissions.
Joe Whitehurst, High Impact Technologies
Diane Hatcher, SAS
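As a hedged sketch of the kind of quality measure involved (this is not the HIT Health Care Data Model or the authors' SAS Visual Analytics work), a 30-day readmission flag might be derived from admission and discharge dates as follows; the data set and variable names are hypothetical:

/* Flag admissions occurring within 30 days of the same patient's previous discharge */
proc sort data=admissions;
   by patient_id admit_date;
run;

data readmit_flags;
   set admissions;
   by patient_id;
   prev_discharge = lag(discharge_date);          /* previous row's discharge date */
   if first.patient_id then prev_discharge = .;   /* no prior stay for this patient */
   days_since_discharge = admit_date - prev_discharge;
   readmit_30 = (0 <= days_since_discharge <= 30);
   format prev_discharge date9.;
run;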
Can clustering discharge records improve a predictive model's overall fit statistic? Do the predictors differ across segments? This talk describes the methods and results of data mining pediatric IBD patient records from my Analytics 2013 poster. SAS® Enterprise Miner™ 12.1 was used to segment patients and model important predictors for length of hospital stay using discharge records from the national Kids' Inpatient Database. Profiling revealed that patient segments were differentiated by primary diagnosis, operating room procedure indicator, comorbidities, and factors related to admission and disposition of the patient. Cluster analysis of patient discharges improved the overall average squared error of predictive models and distinguished predictors that were unique to patient segments.
Linda Schumacher, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
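The general cluster-then-model idea can also be sketched outside SAS Enterprise Miner; the following illustrates the approach under assumed data set and variable names and is not the flow used in the talk:

/* Segment discharge records, then fit a separate model within each segment */
proc fastclus data=discharges maxclusters=4 out=seg_out;
   var age num_diagnoses num_procedures;
run;

proc sort data=seg_out;
   by cluster;
run;

/* Fit a length-of-stay regression per segment; compare error with a single pooled model */
proc reg data=seg_out;
   by cluster;
   model length_of_stay = age num_diagnoses num_procedures;
run;
quit;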
Energy companies that operate in a highly regulated environment and are constrained in pricing flexibility must employ a multitude of approaches to maintain high levels of customer satisfaction. Many investor-owned utilities are just starting to embrace a customer-centric business model to improve the customer experience and hold the line on costs while operating in an inflationary business setting. Faced with these challenges, it is natural for utility executives to ask: 'What drives customer satisfaction, and what is the optimum balance between influencing customer perceptions and improving actual process performance in order to be viewed as a top-tier performer by our customers?' J.D. Power, for example, cites power quality and reliability as the top influencer of overall customer satisfaction. But studies have also shown that customer perceptions of reliability do not always match actual reliability experience. This apparent gap between actual and perceived performance raises a conundrum: Should the utility focus its efforts and resources on improving actual reliability performance, or would it be better to concentrate on influencing customer perceptions of reliability? How can this conundrum be unraveled with an analytically driven approach? In this paper, we explore how design of experiments techniques can be employed to help understand the relationship between process performance and customer perception, thereby leading to important insights into the energy customer equation and higher customer satisfaction.
Mark Konya, Ameren Missouri
Kathy Ball, SAS
New, innovative analytical techniques are necessary to extract patterns in big data that have temporal and geo-spatial attributes. Such an approach is required when geo-spatial time series data sets, which have billions of rows and the precision of exact latitude and longitude data, make it extremely difficult to locate patterns of interest. The usual temporal bins of years, months, days, hours, and minutes often do not give the analyst control of the precision necessary to find patterns of interest. Geohashing is a string representation of two-dimensional geometric coordinates. Time hashing is a similar representation that maps time, preserving all temporal aspects of the date and time of the data, into a one-dimensional set of data points. Geohashing and time hashing are both forms of a Z-order curve, which maps multidimensional data into a single dimension while preserving the locality of the data points. This paper explores the use of a multidimensional Z-order curve, combining both geohashing and time hashing, that is known as geo-temporal hashing or space-time boxes, using SAS®. This technique provides a foundation for reducing the data into bins that can yield new methods for pattern discovery and detection in big data.
Richard La Valley, Leidos
Abraham Usher, Human Geo Group
Don Henderson, Henderson Consulting Services
Paul Dorfman, Dorfman Consulting
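The full geohash and time-hash encodings are beyond the scope of this abstract, but the underlying space-time box idea of binning coordinates and time into a single combined key can be sketched as follows. The grid resolution and the data set and variable names are illustrative assumptions, not the authors' implementation:

data spacetime_boxes;
   set gps_events;    /* assumed variables: latitude, longitude, event_dt (datetime) */
   lat_bin  = floor(latitude  * 100);     /* roughly 0.01-degree grid cells */
   lon_bin  = floor(longitude * 100);
   time_bin = catx('T', put(datepart(event_dt), yymmddn8.), put(hour(event_dt), z2.));
   box_id   = catx('_', lat_bin, lon_bin, time_bin);   /* combined space-time key */
run;

/* Count events per space-time box as a starting point for pattern discovery */
proc freq data=spacetime_boxes order=freq;
   tables box_id / noprint out=box_counts;
run;

A true geohash/time-hash key would interleave the coordinate and time bits so that nearby boxes share string prefixes, which is what preserves locality at multiple resolutions.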
Evaluation of the efficacy of an intervention is often complicated because the intervention is not randomly assigned. Usually, interventions in marketing, such as coupons or retention campaigns, are directed at customers because their spending is below some threshold or because the customers themselves make a purchase decision. The presence of nonrandom assignment of the stimulus can lead to over- or underestimating the value of the intervention. This can cause future campaigns to be directed at the wrong customers or cause the impacts of these effects to be over- or understated. This paper gives a brief overview of selection bias, demonstrates how selection in the data can be modeled, and shows how to apply some of the important consistent methods of estimating selection models, including Heckman's two-step procedure, in an empirical example. Sample code is provided in an appendix.
Gunce Walton, SAS
Kenneth Sanford, SAS
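A minimal sketch of the classic Heckman two-step procedure (a probit selection equation, the inverse Mills ratio, and then the outcome regression) is shown below. It is not the appendix code from the paper, and the data set and variable names are hypothetical:

/* Step 1: probit model for receiving the intervention, keeping the linear predictor */
proc logistic data=customers descending;
   model treated = z1 z2 / link=probit;
   output out=step1 xbeta=xb;
run;

/* Inverse Mills ratio evaluated at the estimated index */
data step1;
   set step1;
   imr = pdf('normal', xb) / cdf('normal', xb);
run;

/* Step 2: outcome regression on the treated group with the selection correction term */
proc reg data=step1;
   where treated = 1;
   model spend = x1 x2 imr;
run;
quit;

SAS/ETS PROC QLIM can also estimate this type of selection model directly by maximum likelihood rather than in two steps.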
New York City boasts a wide variety of cuisine owing to its rich tourism and vibrant immigrant population. The quality of food and hygiene maintained at the restaurants serving different cuisines has a direct impact on the people dining in them. The objective of this paper is to build a model that predicts the grade of the restaurants in New York City. It also provides deeper statistical insights into the distribution of restaurants, cuisine categories, grades, criticality of violations, and so on, and concludes with a sequence analysis performed on the complete set of violations recorded for the restaurants at different time periods over the years 2012 and 2013. The data for 2013 is used to test the model. The data set consists of 15 variables that capture restaurant location-specific details and violation details. The target is an ordinal variable with three levels, A, B, and C, in descending order of quality. Various SAS® Enterprise Miner™ models (logistic regression, decision trees, neural networks, and ensemble models) are built and compared using the validation misclassification rate. The stepwise regression model appears to be the best model, with a prediction accuracy of 75.33%. The regression model is trained at step 3. A split on the number of critical violations at 8.5 gives the root node of the decision tree, and the rest of the tree splits are guided by predictor variables such as the number of critical and non-critical violations, the number of critical violations for the year 2011, cuisine group, and borough.
Pruthvi Bhupathiraju Venkata, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
For almost two decades, Western Kentucky University's Office of Institutional Research (WKU-IR) has used SAS® to help shape the future of the institution by providing faculty and administrators with information they can use to make a difference in the lives of their students. This presentation provides specific examples of how WKU-IR has shaped the policies and practices of our institution and discusses how WKU-IR moved from a support unit to a key strategic partner. In addition, the presentation covers the following topics: How the WKU Office of Institutional Research developed over time; Why WKU abandoned reactive reporting for a more accurate, convenient system using SAS® Enterprise Intelligence Suite for Education; How WKU shifted from investigating what happened to predicting outcomes using SAS® Enterprise Miner™ and SAS® Text Miner; How the office keeps the system relevant and utilized by key decision makers; What the office has accomplished and key plans for the future.
Tuesdi Helbig, Western Kentucky University
Gina Huff, Western Kentucky University
The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.
Chuck Kincaid, Experis Business Analytics
Recent studies suggest that unstructured data, such as customer comments or feedback, can enhance the power of existing predictive models. SAS® Text Miner can generate singular value decomposition (SVD) units from text documents, which provide a numeric vector representation of the terms in the documents. These SVDs, when used as additional inputs along with the existing structured input variables, often prove to capture the response better. However, SVD units are essentially black-box variables that are not easy to interpret or explain. This is a significant hindrance when trying to convince decision makers in an organization to incorporate these derived textual components into their models. In this paper, we demonstrate a new and powerful feature in SAS® Text Miner 12.1 that helps explain the SVDs, or the text cluster components. We discuss two important methods that are useful for interpreting them. For this purpose, we used data from a television network company that has transcripts of its call center notes from three prior calls of each customer. We are able to extract the key terms from the call center notes in the form of Boolean rules, which have contributed to the prediction of customer churn. These rules provide an intuitive sense of which set of terms, when occurring in either the presence or absence of another set of terms in the call center notes, might lead to churn. They also provide insights into which customers are at greater risk of churning from the company's services and, more importantly, why.
Murali Pagolu, SAS
Goutam Chakraborty, Oklahoma State University
Power producers are looking for ways not only to improve the efficiency of power plant assets but also to address growing concerns about the environmental impacts of power generation, all without compromising their market competitiveness. To meet this challenge, this study demonstrates the application of data mining techniques for process optimization in a coal-fired power plant in Thailand with 97,920 data records. The main purpose is to determine which factors have the greatest impact on both (1) the heat rate (kJ/kWh) of electrical energy output and (2) the opacity of the flue gas exhaust emissions. As opposed to the traditional regression analysis currently employed at the plant and based on Microsoft Excel, more complex analytical models built with SAS® Enterprise Miner™ help support managerial decisions to improve the overall performance of the existing energy infrastructure while reducing emissions through a change in the energy supply structure.
Thanrawee Phurithititanapong, National Institute of Development Administration
Jongsawas Chongwatpol, National Institute of Development Administration
SAS® High-Performance Analytics is a significant step forward in the area of high-speed, analytic processing in a scalable clustered environment. However, Big Data problems generally come with data from many data sources, at varying levels of maturity. Teradata's innovative Unified Data Architecture (UDA) represents a significant improvement in the way that large companies can think about Enterprise Data Management, combining the Teradata Database, Hortonworks Hadoop, and the Aster Data Discovery platform in a seamless integrated platform. Together, the two platforms provide business users, analysts, and data scientists with ideally suited data management platforms, targeted specifically to their analytic needs, based upon analytic use cases, managed in a single integrated enterprise data management environment. The paper focuses on how several companies today are using Teradata's integrated hardware and software UDA platform to manage a single enterprise analytic environment, fight the ongoing proliferation of analytic data marts, and speed their operational analytic processes.
John Cunningham, Teradata Corporation
This presentation addresses two main topics: The first topic focuses on the industry's norms and best practices for building internal credit ratings (PD, EAD, and LGD). Although there is no capital relief for local US banks using internal credit ratings (the US has not adopted the Internal Ratings-Based approach of Basel II, with the exception of the top 10 banks), there has been increased interest in credit ratings modeling over the last two years in the US banking industry. The main reason is the added value a bank can achieve from these ratings, and that is the focus of the second part of this presentation. It describes our journey (a client story) toward getting there, introducing the SAS® project. Even more importantly, it describes how we use credit ratings in order to achieve effective credit risk management and get real added value out of that investment. The key success factor is to effectively implement ratings within the credit process and throughout decision making. Only then can ratings be used to improve risk-adjusted return on capital, which is the ultimate objective for all of us.
Boaz Galinson, Bank Leumi
By understanding the actual gambling behavior of an individual over the Internet, we develop markers that identify behavioral patterns, which in turn can be used to predict the level of gambling risk a subscriber is prone to. The data set contains 4,056 subscribers. Using SAS® Enterprise Miner™ 12.1, a set of models is run to predict which subscribers are likely to become high-risk Internet gamblers. The data contains 114 variables, such as first active date and first active product used on the website, as well as the characteristics of the games played, such as fixed odds, poker, casino, games, and so on. Other measures of a subscriber's activity, such as the money put at stake and the odds being bet, are also included. These variables provide a comprehensive view of a subscriber's behavior while gambling on the website. The target variable is modeled as a binary variable, 0 indicating a risky gambler and 1 indicating a controlled gambler. The data is a typical example of real-world data with many missing values and hence had to be transformed and imputed before analysis. The model comparison algorithm of SAS Enterprise Miner 12.1 was used to determine the best model. Stepwise regression performs the best among a set of 25 models, which were run using over 100 permutations of each model. The stepwise regression model predicts a high-risk Internet gambler with an accuracy of 69.63%, using variables such as wk4frequency and wk3frequency of bets.
Sai Vijay Kishore Movva, Oklahoma State University
Vandana Reddy, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ensemble models combine two or more models to enable a more robust prediction, classification, or variable selection. This paper describes three types of ensemble models: boosting, bagging, and model averaging. It discusses go-to methods, such as gradient boosting and random forest, and newer methods, such as rotational forest and fuzzy clustering. The examples section presents a quick setup that enables you to take fullest advantage of the ensemble capabilities of SAS® Enterprise Miner™ by using existing nodes, Start Groups and End Groups nodes, and custom coding.
Miguel M. Maldonado, SAS
Jared Dean, SAS
Wendy Czika, SAS
Susan Haller, SAS
A common complaint from users working on identifying fraud and abuse in Medicare is that teams focus on operational applications, static reports, and high-level outliers. But, when faced with the need to constantly evaluate changing Medicare provider and beneficiary or enrollee dynamics, users are clamoring for more dynamic and accurate detection approaches. Providing these organizations with a data discovery and predictive analytics framework that leverages Hadoop and other big data approaches, while providing a clear path for teams to make more fact-based decisions more quickly, is very important in pre- and post-fraud and abuse analysis. Organizations that pursue a reusable, services-based data discovery and analytics framework and architecture enjoy greater success in supporting data management, reporting, and analytics demands. They can quickly turn models into prioritized alerts and avoid improper or fraudulent payments. A successful framework should enable organizations to come up with efficient fraud, waste, and abuse models to address complex schemes; identify fraud, waste, and abuse vulnerabilities; and shorten triage efforts using a variety of data sourced from big data platforms like Hadoop and other relational database management systems. This paper discusses the data management, data discovery, predictive analytics, and social network analysis capabilities that are included in the SAS fraud framework and how a unified approach can significantly reduce the lifecycle of building and deploying fraud models. We hope this paper provides IT leaders with a clear path for resolving issues from the simple to the incredibly complex, through a measured and scalable approach for delivering value for fraud, waste, and abuse models by providing deep insights to support evidence-based investigations.
Vivek Sethunatesan, Northrop Grumman Corp
Breast cancer is the most common cancer among females globally. After being diagnosed and treated for breast cancer, patients fear its recurrence. Breast cancer recurrence (BCR) can be defined as the return of breast cancer after primary treatment, and it can recur within the first three to five years. BCR studies have been conducted mostly in developed countries such as the United States, Japan, and Canada. Thus, the primary aim of this study is to investigate the feasibility of building a medical scorecard to assess the risk of BCR among Malaysian women. The medical scorecard was developed using data from 454 out of 1,149 patients who were diagnosed and underwent treatment at the Department of Surgery, Hospital Kuala Lumpur, from 2006 until 2011. The outcome variable is a binary variable with two values: 1 (recurrence) and 0 (remission). Based on the availability of data, only 13 categorical predictors were identified and used in this study. The predictive performance of the Breast Cancer Recurrence scorecard (BCR scorecard) model was compared to a standard logistic regression (LR) model. Both the BCR scorecard and LR models were developed using SAS® Enterprise Miner™ 7.1. From this exploratory study, although the BCR scorecard model has better predictive ability with a lower misclassification rate (18%) than the logistic regression model (23%), the sensitivity of the BCR scorecard model is still low, possibly due to the small sample size and small number of risk factors. Five important risk factors for predicting recurrence status were identified: histological type, race, stage, tumor size, and vascular invasion.
Nurul Husna Jamian, Universiti Teknologi Mara
Yap Bee Wah, Universiti Teknologi Mara
Nor Aina Emran, Hospital Kuala Lumpur
PROC TABULATE is a powerful tool for creating tabular summary reports. Its advantages over PROC REPORT are that it requires less code, allows for more convenient table construction, and uses syntax that makes it easier to modify a table's structure. However, its inability to compute the sum, difference, product, and ratio of column sums has hindered its use in many circumstances. This paper illustrates and discusses some creative approaches and methods for overcoming these limitations, enabling users to produce needed reports while still enjoying the simplicity and convenience of PROC TABULATE. These methods and skills can have prominent applications in a variety of business intelligence and analytics fields.
Justin Jia, Canadian Imperial Bank of Commerce (CIBC)
Amanda Lin, Bell Canada
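One common style of workaround, shown here only as a hedged example and not necessarily one of the specific methods in the paper, is to pre-compute the column sums and their derived quantities in PROC SQL and then pass the results to PROC TABULATE. The data set and variable names are hypothetical:

/* Derive sums, a difference, and a ratio per group before tabulating */
proc sql;
   create table region_summary as
   select region,
          sum(revenue) as total_revenue,
          sum(cost)    as total_cost,
          calculated total_revenue - calculated total_cost as margin,
          calculated total_revenue / calculated total_cost as revenue_cost_ratio
   from sales
   group by region;
quit;

/* Tabulate the pre-computed values (one row per region) */
proc tabulate data=region_summary;
   class region;
   var total_revenue total_cost margin revenue_cost_ratio;
   table region, (total_revenue total_cost margin revenue_cost_ratio)*sum;
run;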
Paper 1506-2014:
Practical Considerations in the Development of a Suite of Predictive Models for Population Health Management
The use of predictive models in healthcare has steadily increased over the decades. Statistical models now are assumed to be a necessary component in population health management. This session will review practical considerations in the choice of models to develop, criteria for assessing the utility of the models for production, and challenges with incorporating the models into business process flows. Specific examples of models will be provided based upon work by the Health Economics team at Blue Cross Blue Shield of North Carolina.
Daryl Wansink, Blue Cross Blue Shield of North Carolina
Over the years, there has been growing concern about the consumption of tobacco among youth. But no concrete studies have been done to find what exactly leads children to start consuming tobacco. This study is an attempt to identify those potential reasons. Through our analysis, we have also tried to build a model to predict whether a child will smoke next year or not. This study is based on the 2011 National Youth Tobacco Survey data of 18,867 observations. In order to prepare the data for insightful analysis, imputation operations were performed using tree-based imputation methods. From a pool of 197 variables, 48 key variables were selected using variable selection methods, partial least squares, and decision tree models. Logistic regression and decision tree models were built to predict whether a child will smoke in the next year or not. Comparing the models using misclassification rate as the selection criterion, we found that the stepwise logistic regression model outperformed the other models with a validation misclassification rate of 0.028497, 47.19% sensitivity, and 95.80% specificity. Factors such as the company of friends, cigarette brand ads, accessibility to tobacco products, and passive smoking turned out to be the most important predictors in determining a child smoker. From this study, we can outline some important findings, such as that the odds of a child taking up smoking are 2.17 times higher when his or her close friends are also smoking.
Jin Ho Jung, Oklahoma State University
Gaurav Pathak, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
An increase in sea levels is a problem affecting the human race and the marine ecosystem. Many models are being developed to find the factors that are responsible for it. In this research, the memory-based reasoning model looks more effective than most other models because it takes previous solutions and predicts solutions for forthcoming cases. The data was collected from NASA. The data contains 1,072 observations and 10 variables, such as emissions of carbon dioxide, temperature, and other contributing factors like electric power consumption, total number of industries established, and so on. Results of memory-based reasoning models such as the RD tree and scan tree are compared with neural network, decision tree, and logistic regression models. Fit statistics, such as the misclassification rate and average squared error, are used to evaluate model performance. This analysis is used to predict the rise in sea levels in the near future and to take the necessary actions to protect the environment from global warming and natural disasters.
Prasanna K S Sailaja Bhamidi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Sparse data sets are common in applications of text and data mining, social network analysis, and recommendation systems. In SAS® software, sparse data sets are usually stored in the coordinate list (COO) transactional format. Two major drawbacks are associated with this sparse data representation: First, most SAS procedures are designed to handle dense data and cannot consume data that are stored transactionally. In that case, the options for analysis are significantly limited. Second, a sparse data set in transactional format is hard to store and process in distributed systems. Most techniques require that all transactions for a particular object be kept together; this assumption is violated when the transactions of that object are distributed to different nodes of the grid. This paper presents some different ideas about how to package all transactions of an object into a single row. Approaches include storing the sparse matrix densely, doing variable selection, doing variable extraction, and compressing the transactions into a few text variables by using Base64 encoding. These simple but effective techniques enable you to store and process your sparse data in better ways. This paper demonstrates how to use SAS® Text Miner procedures to process sparse data sets and generate output data sets that are easy to store and can be readily processed by traditional SAS modeling procedures. The output of the system can be safely stored and distributed in any grid environment.
Zheng Zhao, SAS
Russell Albright, SAS
James Cox, SAS
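As a minimal illustration of the packaging idea, the DATA step below rolls a COO-style transactional data set up into one row per object by concatenating its transactions into a single text variable. The data set and variable names are hypothetical, and this is not the SAS Text Miner-based approach the paper develops:

/* COO input: one row per (doc_id, term_id, count) transaction */
proc sort data=coo_transactions;
   by doc_id term_id;
run;

data packed;
   set coo_transactions;
   by doc_id;
   length packed_terms $32767;
   retain packed_terms;
   if first.doc_id then packed_terms = '';
   /* append "term_id:count" pairs, space-delimited */
   packed_terms = catx(' ', packed_terms, catx(':', term_id, count));
   if last.doc_id then output;    /* one row per document */
   keep doc_id packed_terms;
run;

Keeping all of a document's transactions in one row means the row can be moved to any grid node without breaking the object apart, at the cost of parsing the packed field when the transactions are needed again.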
Analyzing automobile policies and claims is an ongoing area of interest to the insurance industry. Although there have been many data mining projects in the insurance sector over the past decade, common questions remain: How can insurance firms retain their best customers? Will this damaged car be covered, and will the claim be paid? How large will the loss associated with this policy be? This study applies data mining techniques using SAS® Enterprise Miner™ to enhance insurance policies and claims. The main focus is on assessing how corporate fleet customers' policy characteristics and claim behavior differ from those of non-fleet customers. With more than 100,000 data records, implementing advanced analytics helps create better planning for policy and claim management strategy.
Kittipong Trongsawad, National Institute of Development Administration
Jongsawas Chongwatpol, National Institute of Development Administration
This paper reveals human mobility behavior in the metropolitan area of Rio de Janeiro, Brazil. The basis for this study is mobile phone data provided by one of the largest mobile carriers in Brazil. Mobile phone data comprises a reasonable variety of information, including data about time and location for call activity throughout urban areas. This information can be used to build users' trajectories over time, describing the major characteristics of urban mobility within the city. A variety of distribution analyses is presented in this paper, aiming to clearly describe the most relevant characteristics of overall mobility in the metropolitan area of Rio de Janeiro. In addition, methods from physics for describing trip trends, such as gravity and radiation models, were computed and compared in terms of the granularity of the geographic scales and also in relation to traditional data mining approaches such as linear regression. A brief comparison of their performance in predicting the number of trips between pairs of locations is presented at the end.
Carlos Andre Reis Pinheiro, KU Leuven
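For reference, the gravity model mentioned above is commonly written in its textbook form (not necessarily the exact specification used in this paper) as

T_{ij} = k \, \frac{P_i^{\alpha} \, P_j^{\beta}}{d_{ij}^{\gamma}}

where T_{ij} is the expected number of trips between locations i and j, P_i and P_j are measures of their size (such as population), d_{ij} is the distance between them, and k, \alpha, \beta, and \gamma are fitted parameters.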
The high school dropout problem has been called a national crisis (Heppen & Therriault, 2008). Almost one-third of all high school students leave the public school system before graduating (Swanson, 2004), and the problem is particularly severe among minority students (Greene & Winters, 2005; U.S. Department of Education, 2006). Educators, researchers, and policymakers continue to work to identify effective dropout prevention strategies. One effective approach is to identify high-risk students at an early stage and then provide corresponding interventions to keep them in school. One of the strengths of educational data mining is its ability to reveal hidden patterns and predict future performance by analyzing accessible student data. The predictive algorithms generated by predictive modeling can serve as an early warning system. However, because individual schools and districts have various combinations of race, gender, and socioeconomic status, we cannot use a set of standardized predictors and obtain satisfactory predictive results. Analyzing a limited number of variables and limited historical data does not generate accurate models. Additionally, the predictive model might not consider interactions among predictors. The strength of data mining is the capability to analyze a large amount of data and variables. Multiple analytic strategies (including model comparisons) can be applied to maximize model performance. For future goals, we propose a data mining framework to construct an early warning and trend analysis system with components of data warehousing, data mining, and reporting at the levels of individual students, schools, school districts, and the entire state.
Wendy Dickinson, Ringling College of Art + Design
Morgan Wang, University of Central Florida
Are you wondering what is causing your valuable machine asset to fail? What could those drivers be, and what is the likelihood of failure? Do you want to be proactive rather than reactive? Answers to these questions have arrived with SAS® Predictive Asset Maintenance. The solution provides an analytical framework to reduce the amount of unscheduled downtime and optimize maintenance cycles and costs. An all-new (R&D-based) version of this offering is now available. Key aspects of this paper include a discussion of the key business drivers for and capabilities of SAS Predictive Asset Maintenance, followed by a detailed analysis of the solution, including the data model; explorations; data selections; Path I: analysis workbench (maintenance analysis and stability monitoring); Path II: analysis workbench (JMP®, SAS® Enterprise Guide®, and SAS® Enterprise Miner™); analytical case development using SAS Enterprise Miner, SAS® Model Manager, and SAS® Data Integration Studio; and the SAS Predictive Asset Maintenance portlet for reports. A realistic business example in the oil and gas industry is used.
George Habek, SAS
Paper SAS1523-2014:
SAS® Workshop: Data Mining
This workshop provides hands-on experience using SAS® Enterprise Miner™. Workshop participants will do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.
Bob Lucas, SAS
Mike Speed, SAS
Paper SAS1525-2014:
SAS® Workshop: High-Performance Analytics
This workshop provides hands-on experience using SAS® Enterprise Miner™ high-performance nodes. Workshop participants will do the following: learn the similarities and differences between high-performance nodes and standard nodes; build a project flow using high-performance nodes; and extract and save score code for model deployment.
Bob Lucas, SAS
Jeff Thompson, SAS
Paper SAS1393-2014:
SAS® Workshop: SAS® Office Analytics
This workshop provides hands-on experience using SAS® Office Analytics. Workshop participants will complete the following tasks: use SAS® Enterprise Guide® to access and analyze data; create a stored process that can be shared across an organization; and access and analyze data sources and stored processes using the SAS® Add-In for Microsoft Office.
Eric Rossland, SAS
How does the SAS® server architecture fit within your IT infrastructure? What functional aspects does the architecture support? This session helps attendees understand the logical server topology of the SAS technology stack: resource and process management, in-memory architecture, and in-database processing. The session also discusses process flows from data acquisition through analytical information to visual insight. IT architects, data administrators, and IT managers from all industries should leave with an understanding of how SAS has evolved to better fit into the IT enterprise and to help IT's internal customers make better decisions.
Gary Spakes, SAS
Paper SAS247-2014:
Smart-Meter Analytical Applications
For electricity retail and distribution companies, the introduction of smart-meter technologies has been a key investment, reducing the significant costs associated with meter reading. Electricity companies continue to look for additional ways to generate a dividend from these devices. This presentation looks at selected practical applications of smart-meter data: forecasting using smart-meter data as inputs, customer segmentation, and revenue protection. This presentation aims to show some techniques that can be used to effectively manage and analyze the large amounts of data generated by these devices in order to generate business value.
Andrew Cathie, SAS
One in every four people in the United States dies of heart disease, and stress is an important factor that contributes to a cardiac event. Because the condition of the heart gradually worsens with age, we analyze the factors that lead to a myocardial infarction when patients are subjected to stress. The data used for this project was obtained from a survey conducted through the Department of Biostatistics at Vanderbilt University. The objective of this poster is to predict the chance of survival of a patient after a cardiac event. Using decision trees, neural networks, regression models, bootstrap decision trees, and ensemble models, we predict the target, which is modeled as a binary variable indicating whether a person is likely to survive or die. The top 15 models, each with an accuracy of over 70%, were considered. The model gives important survival characteristics of a patient, including history of diabetes, smoking, hypertension, and angioplasty.
Yogananda Domlur Seetharama, Oklahoma State University
Sai Vijay Kishore Movva, Oklahoma State University
With the smartphone and mobile app market developing so rapidly, expectations about the effectiveness of mobile applications are high. Marketers and app developers need to analyze the huge amount of data available well before an app's release, not only to better market the app, but also to avoid costly mistakes. The purpose of this poster is to build models to predict the success rate of an app to be released in a particular category. Data was collected for 540 Android apps under the Top Free Newly Released Apps category from https://play.google.com/store. The SAS® Enterprise Miner™ Text Mining node and SAS® Sentiment Analysis Studio are used to parse and tokenize the collected customer reviews and to calculate the average customer sentiment score for each app. Linear regression, neural network, and auto-neural network models were built to predict the rank of an app by considering average rating, number of installations, total number of reviews, number of 1-5 star ratings, app size, category, content rating, and average customer sentiment score as independent variables. A linear regression model with the lowest average squared error was selected as the best model, with number of installations and app maturity content as significant model variables. App category, user reviews, and average customer sentiment score are also important variables in deciding the success of an app. The poster summarizes app success trends across various factors and also introduces a new SAS® macro, %getappdata, which we developed for web crawling and text parsing.
Vandana Reddy, Oklahoma State University
Chinmay Dugar, Oklahoma State University
Global businesses must react to daily changes in market conditions over multiple geographies and industries. Consuming reputable daily economic reports assists in understanding these changing conditions, but requires both a significant human time commitment and a subjective assessment of each topic area of interest. To combat these constraints, Dow's Advanced Analytics team has constructed a process to calculate sentence-level topic frequency and sentiment scores from unstructured economic reports. Daily topic sentiment scores are aggregated to weekly and monthly intervals and used as exogenous variables to model external economic time series data. These models serve both to validate the relationship between the sentiment scores and the economic series and to provide near-term forecasts where daily or weekly variables are unavailable. This paper first describes our process of using SAS® Text Miner to import economic reports and discover economic topics and sentiment from the unstructured text. The next section describes sentiment variable selection techniques that use SAS/STAT®, SAS/ETS®, and SAS® Enterprise Miner™ to generate similarity measures to economic indices. Our process then uses ARIMAX modeling in SAS® Forecast Studio to create economic index forecasts with topic sentiments. Finally, we show how the sentiment model components are used as a matrix of economic key performance indicators by topic and geography.
Michael P. Dessauer, The Dow Chemical Company
Justin Kauhl, Tata Consultancy Services
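Outside SAS Forecast Studio, the same kind of ARIMAX relationship can be sketched in PROC ARIMA with an aggregated sentiment score as an input series. The variable names are hypothetical and this is only an illustration of the technique, not one of the authors' models:

/* ARIMAX sketch: monthly economic index with a topic sentiment score as exogenous input */
proc arima data=monthly_series;
   identify var=econ_index crosscorr=(topic_sentiment);
   estimate p=1 q=1 input=(topic_sentiment) method=ml;
   forecast lead=3 out=index_forecast;
run;
quit;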
Nowadays, in the Big Data era, business intelligence departments collect, store, process, calculate, and monitor massive amounts of data. Nevertheless, hundreds of metrics built on structured data are sometimes insufficient to explain why an offered deal sold better or worse than expected. The answer might be found in text data that every company owns, yet is often unaware of its possible uses or neglects its value. This project shows text mining methods, implemented in SAS® Text Miner 12.1, that enable the determination of a deal's success or failure factors based on in-house or Internet-scattered customers' views and opinions. The study is conducted on data gathered from Groupon Sp. z o.o. (the Polish business unit), an e-commerce company, as it is assumed that the market is by and large a customer-driven environment.
Rafal Wojdan, Warsaw School of Economics
Direct marketing is the practice of delivering promotional messages directly to potential customers on an individual basis rather than through a mass medium. In this project, we build a finely tuned response model that helps a financial services company select high-quality, receptive customers for its future campaigns and identify the important factors that influence marketing, in order to effectively manage resources. This study was based on the customer solicitation center's marketing campaign data (45,211 observations and 18 variables) available on UC Irvine's web site, with attributes of present and past campaign information (communication type, contact duration, previous campaign outcome, and so on) and the customer's personal and banking information. As part of data preparation, we performed mean imputation to handle missing values and categorical recoding to reduce the levels of class variables. In this study, we built several predictive models using the SAS® Enterprise Miner™ Decision Tree, Neural Network, Logistic Regression, and SVM nodes to predict whether the customer responds to the loan offer by subscribing. The results showed that the stepwise logistic regression model was the best when chosen based on the misclassification rate criterion. When the top three deciles of customers were selected based on the best model, the cumulative response rate was 14.5%, in contrast to the baseline response rate of 5%. Further analysis showed that customers are more likely to subscribe to the loan offer if they have the following characteristics: never been contacted in the past, no default history, and provided a cell phone as primary contact information.
Arun Mandapaka, Oklahoma State University
Amit Kushwah, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
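The two preparation and modeling steps named above (mean imputation and stepwise logistic regression) can be sketched in Base SAS and SAS/STAT as follows. The SAS Enterprise Miner flow in the paper works through nodes rather than code, and the data set and variable names here are hypothetical:

/* Mean imputation of interval inputs: replace only missing values */
proc stdize data=bank_campaign out=bank_imputed
            reponly method=mean;
   var age balance duration campaign_contacts;
run;

/* Stepwise logistic regression for the binary response target */
proc logistic data=bank_imputed descending;
   class contact_type previous_outcome / param=ref;
   model subscribed = age balance duration campaign_contacts
                      contact_type previous_outcome
                      / selection=stepwise slentry=0.05 slstay=0.05;
run;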
Applying models to analyze sports data has long been done by teams across the globe. The film Moneyball generated much hype about how a sports team can use data and statistics to build a winning team. The objective of this poster is to use the model comparison algorithm of SAS® Enterprise Miner™ to pick the best model for predicting the outcome of a soccer game. It is hence important to determine which factors influence the results of a game. The data set used contains input variables about a team's offensive and defensive abilities, and the outcome of a game is modeled as the target variable. Using SAS Enterprise Miner, multinomial regression, neural network, decision tree, ensemble, and gradient boosting models are built. Over 100 different versions of these models are run. The data contains statistics from the 2012-13 English Premier League season. The competition has 20 teams playing each other in a home-and-away format. The season has a total of 380 games; the first 283 games are used to predict the outcome of the last 97 games. The target variable is treated as both a nominal variable and an ordinal variable with 3 levels: home win, away win, and tie. The gradient boosting model is the winning model; it predicts games with 65% accuracy and identifies factors such as goals scored and ball possession as more important than fouls committed or red cards received.
Vandana Reddy, Oklahoma State University
Sai Vijay Kishore Movva, Oklahoma State University
Identifying claim fraud using predictive analytics represents a unique challenge. 1. Predictive analytics generally requires a target variable that can be analyzed. Fraud is unique in this regard in that a lot of fraud that has occurred historically has not been identified. Therefore, defining the target variable is difficult. 2. There is also a natural assumption that the past will bear some resemblance to the future. In the case of fraud, methods of defrauding insurance companies change quickly and can make the analysis of a historical database less valuable for identifying future fraud. 3. In an underlying database of claims that may have been determined to be fraudulent by an insurance company, there is often an inconsistency between different claim adjusters regarding which claims are referred for investigation. This inconsistency can lead to erroneous model results due to data that is not homogeneous. This paper demonstrates how analytics can be used in several ways to help identify fraud: 1. More consistent referral of suspicious claims. 2. Better identification of new types of suspicious claims. 3. Incorporation of claim adjuster insight into the analytics results. As part of this paper, we demonstrate the application of several approaches to fraud identification: 1. Clustering. 2. Association analysis. 3. PRIDIT (principal component analysis of RIDIT scores).
Roosevelt C. Mosley, Pinnacle Actuarial Resources, Inc.
Nick Kucera, Pinnacle Actuarial Resources, Inc.
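As background on the PRIDIT approach listed above: a RIDIT score for an ordered fraud indicator is commonly defined as the cumulative proportion of observations below a category plus half the proportion within it, and PRIDIT then applies principal component analysis to the RIDIT-scored indicators. The following is only a hedged sketch of that idea for a single hypothetical indicator, not the authors' implementation:

/* Category proportions for one ordinal fraud indicator */
proc freq data=claims noprint;
   tables indicator1 / out=freq1;
run;

/* RIDIT score = proportion below the category + half the proportion within it */
data ridit1;
   set freq1;
   prop = percent / 100;
   cum_below + lag(prop);                 /* running total of proportions below */
   ridit_score = cum_below + 0.5 * prop;
   keep indicator1 ridit_score;
run;

/* Attach the scores and run PCA; in practice there is one RIDIT-scored variable per indicator */
proc sort data=claims; by indicator1; run;
proc sort data=ridit1; by indicator1; run;

data claims_ridit;
   merge claims ridit1;
   by indicator1;
run;

proc princomp data=claims_ridit out=pridit_scores;
   var ridit_score;
run;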
There are 2.35 million road accident cases recorded yearly in the U.S. Among them, 37,000 are considered fatal. Road crashes cost USD 230.6 billion per year, or an average of USD 820 per person. Our efforts are to identify the important factors that lead to vehicle collisions and to predict the injury risk involved in them. Data was collected from the National Automotive Sampling System (NASS), containing 20,247 cases with 19 variables. Input variables describe the factors involved in an accident, such as height, age, weight, gender, vehicle model year, speed limit, energy absorption in collision, and deformation location. The target variable is nominal, showing levels of injury. Missing values in interval variables were imputed using the mean, and class variables were imputed using the count method. Multivariate analysis suggests high correlation between tire footprint and wheelbase (Corr=0.97, P<0.0001) and between original weight of car and curb weight of car (Corr=0.79, P<0.0001). Variables having high kurtosis values were transformed using range standardization. Variables were ranked by variable importance using decision tree analysis. Models such as multiple regression, polynomial regression, neural network, and decision tree were applied to the data set to identify the factors that are most significant in predicting injury risk. A multilayer perceptron neural network came out to be the best model for predicting the injury risk index, with the lowest average squared error (0.086) in the validation data set.
Prateek Khare, Oklahoma State University
Vandana Reddy, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Especially in this current financial climate, many of us are being asked to do more with less. For several years, the Office of Institutional Research and Testing at Baylor University has been using SAS® software to increase the efficiency of the office and of the University as a whole. Reports that were once prepared manually have been automated. Data quality processes have been implemented in order to reduce the number of duplicate mailings. Predictive modeling is used to focus recruiting efforts on those prospective students most likely to respond. A web-based portal has been created to provide self-service report generation for many administrators across campus. Along with this, a number of data processing functions have been centralized, eliminating the need for additional programming skills and software support. This presentation discusses these improvements in more detail and provides examples of the end results.
Faron Kincheloe, Baylor University
Over the last year, the SAS® Enterprise Miner™ development team has made numerous and wide-ranging enhancements and improvements. New utility nodes that save data, integrate better with open-source software, and register models make your routine tasks easier. The area of time series data mining has three new nodes. There are also new models for Bayesian network classifiers, generalized linear models (GLMs), support vector machines (SVMs), and more.
Jared Dean, SAS
Jonathan Wexler, SAS