SAS Enterprise Miner Papers A-Z

A
Paper 3282-2015:
A Case Study: Improve Classification of Rare Events with SAS® Enterprise Miner™
Imbalanced data are frequently seen in fraud detection, direct marketing, disease prediction, and many other areas. Rare events are sometimes of primary interest. Classifying them correctly is the challenge that many predictive modelers face today. In this paper, we use SAS® Enterprise Miner™ on a marketing data set to demonstrate and compare several approaches that are commonly used to handle imbalanced data problems in classification models. The approaches are based on cost-sensitive measures and sampling measures. A rather novel technique called SMOTE (Synthetic Minority Over-sampling TEchnique), which has achieved the best result in our comparison, will be discussed.
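For readers unfamiliar with SMOTE, the heart of the technique is interpolation between a minority-class record and one of its nearest minority-class neighbors. The SAS data step below is a minimal sketch of that interpolation step only, not the authors' implementation; it assumes a hypothetical table work.pairs that already holds each minority record (x1-x3) alongside one of its nearest minority-class neighbors (n1-n3).
   /* Sketch of the SMOTE interpolation step; the neighbor search is assumed done. */
   data work.synthetic;
      set work.pairs;
      array x[3] x1-x3;                      /* original minority record     */
      array n[3] n1-n3;                      /* one of its nearest neighbors */
      array s[3] s1-s3;                      /* new synthetic record         */
      gap = ranuni(20150101);                /* random weight in (0,1)       */
      do j = 1 to 3;
         s[j] = x[j] + gap * (n[j] - x[j]);  /* point on the segment x -> n  */
      end;
      keep s1-s3;
   run;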
Read the paper (PDF). | Download the data file (ZIP).
Ruizhe Wang, GuideWell Connect
Novik Lee, GuideWell Connect
Yun Wei, GuideWell Connect
Paper 2641-2015:
A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis
In big data, many variables are polytomous, with many levels. When the outcome is binary, the common way to handle a polytomous independent variable is a series of design variables, which correspond to the levels specified in the CLASS statement of PROC LOGISTIC. If big data contains many polytomous independent variables with many levels, design variables make the analysis complicated in both computation time and results, and they might add little to the prediction of the outcome. This paper presents a new, simple method for logistic regression with polytomous independent variables in big data analysis. In the proposed method, the first step is an iterative statistical analysis run from a SAS® macro program. Similar to the algorithm used to create spline variables, this analysis searches a polytomous independent variable for aggregated groups of levels that differ in a statistically significant way. The SAS macro program iteratively searches for new level groups with statistically significant differences. It starts from level 1, the level with the smallest outcome mean, and tests it against level 2, the level with the second-smallest outcome mean. If these two groups differ significantly, the procedure moves on to test level 2 against level 3. If level 1 and level 2 do not differ significantly, they are combined into a new level group 1, which is then tested against level 3. The process continues until all levels have been tested. The original level values of the polytomous variable are then replaced by the new level values that carry statistically significant differences. The polytomous variable with the new levels can be described by the means of those levels because of the one-to-one equivalence, in logit, of a piecewise function from the variable's levels to the outcome means. It is easy to show, based on maximum likelihood analysis, that the conditional mean of an outcome y given a polytomous variable x is a very good approximation. Compared with design variables, the new piecewise variable, which carries the information of all levels in a single independent variable, captures the impact of all levels in a much simpler way. We have used this method in predictive models of customer attrition with polytomous variables such as state, business type, and customer claim type. All of these polytomous variables yield significantly better prediction of customer attrition than models built without them or with design variables.
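As a rough illustration of the encoding idea only (the significance-driven merging of adjacent levels is what the paper adds), the sketch below replaces each level of a hypothetical polytomous variable with its observed outcome mean and then uses that single numeric variable in PROC LOGISTIC; the data set and variable names (work.train, state, y) are assumptions.
   /* Sketch: encode a polytomous variable by its outcome means, then model it
      as a single numeric predictor. Names are hypothetical; y is numeric 0/1. */
   proc sql;
      create table work.state_means as
      select state, mean(y) as state_mean
      from work.train
      group by state;
      create table work.train_enc as
      select a.*, b.state_mean
      from work.train as a left join work.state_means as b
      on a.state = b.state;
   quit;
   proc logistic data=work.train_enc;
      model y(event='1') = state_mean;   /* one predictor instead of many design variables */
   run;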
Read the paper (PDF).
Jian Gao, Constant Contact
Jesse Harriot, Constant Contact
Lisa Pimentel, Constant Contact
Paper 3492-2015:
Alien Nation: Text Analysis of UFO Sightings in the US Using SAS® Enterprise Miner™ 13.1
Are we alone in this universe? This is a question that undoubtedly passes through every mind several times during a lifetime. We often hear stories about close encounters, Unidentified Flying Object (UFO) sightings, and other mysterious things, but we lack documented evidence for analysis on this topic. UFOs have been a matter of public interest for a long time. The objective of this paper is to analyze a database that holds a collection of documented reports of UFO sightings and to uncover any fascinating stories related to the data. Using SAS® Enterprise Miner™ 13.1, the powerful capabilities of text analytics and topic mining are leveraged to summarize the associations between reported sightings. We used PROC GEOCODE to convert the addresses of sightings into locations on a map. Then we used the GMAP procedure to produce a heat map representing the frequency of sightings in various locations. The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values). These geographic coordinates can then be used on a map to calculate distances or to perform spatial analysis. A preliminary analysis of the data associated with sightings found that the most popular words associated with UFOs describe their shapes, formations, movements, and colors. The Text Profile node in SAS Enterprise Miner 13.1 was leveraged to build a model and cluster the data into different levels of a segment variable. We also explain how opinions about the UFO sightings change over time using text profiling. Further, this analysis uses the Text Profile node to find interesting terms or topics that were used to describe the UFO sightings. Based on feedback received at the SAS® Analytics Conference, we plan to incorporate a technique to filter duplicate comments and to include weather data for each location.
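The geocode-then-map step described above can be sketched as follows, assuming ZIP-level geocoding and a state-level frequency map; the data set and variable names (work.sightings with ZIP and STATECODE variables) are hypothetical, and the paper's exact options may differ.
   /* Hypothetical sketch: geocode by ZIP (adds X/Y coordinates), then map
      sighting counts by state with PROC GMAP. */
   proc geocode method=zip data=work.sightings out=work.sightings_geo;
   run;
   proc freq data=work.sightings_geo noprint;
      tables statecode / out=work.state_counts;     /* COUNT per state */
   run;
   data work.state_counts;
      set work.state_counts;
      state = stfips(statecode);                    /* FIPS code to match maps.us */
   run;
   proc gmap data=work.state_counts map=maps.us;
      id state;
      choro count / levels=5;                       /* heat map of sighting frequency */
   run;
   quit;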
Read the paper (PDF). | Download the data file (ZIP).
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
Naresh Abburi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Zabiulla Mohammed, Oklahoma State University
Paper 3197-2015:
All Payer Claims Databases (APCDs) in Data Transparency and Quality Improvement
Since Maine established the first All Payer Claims Database (APCD) in 2003, 10 additional states have established APCDs and 30 others are in development or show strong interest in establishing APCDs. APCDs are generally mandated by legislation, though voluntary efforts exist. They are administered through various agencies, including state health departments or other governmental agencies and private not-for-profit organizations. APCDs receive funding from various sources, including legislative appropriations and private foundations. To ensure sustainability, APCDs must also consider the sale of data access and reports as a source of revenue. With the advent of the Affordable Care Act, there has been an increased interest in APCDs as a data source to aid in health care reform. The call for greater transparency in health care pricing and quality, development of Patient-Centered Medical Homes (PCMHs) and Accountable Care Organizations (ACOs), expansion of state Medicaid programs, and establishment of health insurance and health information exchanges have increased the demand for the type of administrative claims data contained in an APCD. Data collection, management, analysis, and reporting issues are examined with examples from implementations of live APCDs. The development of data intake, processing, warehousing, and reporting standards is discussed in light of achieving the triple aim of improving the individual experience of care; improving the health of populations; and reducing the per capita costs of care. APCDs are compared and contrasted with other sources of state-level health care data, including hospital discharge databases, state departments of insurance records, and institutional and consumer surveys. The benefits and limitations of administrative claims data are reviewed. Specific issues addressed with examples include implementing transparent reporting of service prices and provider quality, maintaining master patient and provider identifiers, validating APCD data and comparing it with other state health care data available to researchers and consumers, defining data suppression rules to ensure patient confidentiality and HIPAA-compliant data release and reporting, and serving multiple end users, including policy makers, researchers, and consumers, with appropriately consumable information.
Read the paper (PDF). | Watch the recording.
Paul LaBrec, 3M Health Information Systems
Paper 3293-2015:
Analyzing Direct Marketing Campaign Performance Using Weight of Evidence Coding and Information Value through SAS® Enterprise Miner™ Incremental Response
Data mining and predictive models are extensively used to find the optimal customer targets in order to maximize return on investment. Traditional direct marketing techniques target all customers who are likely to buy, regardless of customer classification. In a real sense, this mechanism cannot identify the customers who would buy even without a marketing contact, thereby resulting in a loss on investment. This paper focuses on the incremental lift modeling approach using weight of evidence coding and information value, followed by incremental response and outcome model diagnostics. This model identifies the additional purchases that would not have taken place without a marketing campaign. Modeling work was conducted using a combined model. The research was carried out on Travel Center data. This data identifies an increase in the average response rate of 2.8% and an increase of 244 in the number of fuel gallons when compared with the results from the traditional campaign, which targeted everyone. This paper discusses in detail the implementation of the Incremental Response node to direct marketing campaigns and the corresponding incremental revenue and profit analysis.
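For reference, weight of evidence (WoE) and information value (IV) for a grouped predictor can be computed in a few lines of SAS; the sketch below is generic (one common sign convention), not the paper's code, and the data set and variable names (work.train, bin, a numeric 0/1 target) are assumptions.
   /* Sketch: WoE and IV for one grouped predictor; target is numeric 0/1. */
   proc sql;
      create table work.woe as
      select bin,
             sum(target=1) / (select sum(target=1) from work.train) as pct_event,
             sum(target=0) / (select sum(target=0) from work.train) as pct_nonevent
      from work.train
      group by bin;
   quit;
   data work.woe;
      set work.woe;
      woe     = log(pct_nonevent / pct_event);       /* one common sign convention */
      iv_term = (pct_nonevent - pct_event) * woe;
   run;
   proc means data=work.woe sum;
      var iv_term;                                   /* information value = sum over bins */
   run;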
Read the paper (PDF). | Download the data file (ZIP).
Sravan Vadigepalli, Best Buy
Paper 3472-2015:
Analyzing Marine Piracy from Structured and Unstructured Data Using SAS® Text Miner
Approximately 80% of world trade at present uses the seaways, with around 110,000 merchant vessels, 1.25 million seafarers, and almost 6 billion tons of goods transported every year. Marine piracy stands as a serious challenge to sea trade. Understanding how pirate attacks occur is crucial to effectively countering marine piracy. Predictive modeling using the combination of textual data with numeric data provides an effective methodology to derive insights from both structured and unstructured data. 2,266 text descriptions about pirate incidents that occurred over the past seven years, from 2008 to the second quarter of 2014, were collected from the International Maritime Bureau (IMB) website. Analysis of the textual data using SAS® Enterprise Miner™ 12.3, with the help of concept links, answered questions on certain aspects of pirate activities, such as the following: 1. What are the arms used by pirates for attacks? 2. How do pirates steal the ships? 3. How do pirates escape after the attacks? 4. What are the reasons for occasional unsuccessful attacks? Topics are extracted from the text descriptions using a Text Topic node, and the varying trends of these topics are analyzed with respect to time. Using the Cluster node, attack descriptions are classified into different categories based on the attack style and pirate behavior described by a set of terms. A target variable called Attack Type is derived from the clusters and is combined with other structured input variables such as Ship Type, Status, Region, Part of Day, and Part of Year. A predictive model is built with Attack Type as the target variable and the other structured data variables as input predictors. The predictive model is used to predict the possible type of attack given the details of a ship and its travel. Thus, the results of this paper could be very helpful for the shipping industry to become more aware of possible attack types for different vessel types when traversing different routes, and to devise counter-strategies to reduce the effects of piracy on crews, vessels, and cargo.
Read the paper (PDF).
Raghavender Reddy Byreddy, Oklahoma State University
Nitish Byri, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Tejeshwar Gurram, Oklahoma State University
Anvesh Reddy Minukuri, Oklahoma State University
Paper 3330-2015:
Analyzing and Visualizing the Sentiment of the Ebola Outbreak via Tweets
The Ebola virus outbreak is producing some of the most significant and fastest trending news throughout the globe today. There is a lot of buzz surrounding the deadly disease and the drastic consequences that it potentially poses to mankind. Social media provides the basic platforms for millions of people to discuss the issue and allows them to openly voice their opinions. There has been a significant increase in the magnitude of responses all over the world since the death of an Ebola patient in a Dallas, Texas hospital. In this paper, we aim to analyze the overall sentiment that is prevailing in the world of social media. For this, we extracted the live streaming data from Twitter at two different times using the Python scripting language. One instance relates to the period before the death of the patient, and the other relates to the period after the death. We used SAS® Text Miner nodes to parse, filter, and analyze the data and to get a feel for the patterns that exist in the tweets. We then used SAS® Sentiment Analysis Studio to further analyze and predict the sentiment of the Ebola outbreak in the United States. In our results, we found that the issue was not taken very seriously until the death of the Ebola patient in Dallas. After the death, we found that prominent personalities across the globe were talking about the disease and then raised funds to fight it. We are continuing to collect tweets. We analyze the locations of the tweets to produce a heat map that corresponds to the intensity of the varying sentiment across locations.
Read the paper (PDF).
Dheeraj Jami, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Shivkanth Lanka, Oklahoma State University
Paper 3426-2015:
Application of Text Mining on Tweets to Collect Rational Information about Type-2 Diabetes
Twitter is a powerful form of social media for sharing information about various issues and can be used to raise awareness and collect pointers about associated risk factors and preventive measures. Type-2 diabetes is a national problem in the US. We analyzed Twitter feeds about Type-2 diabetes in order to suggest a rational use of social media for an evidence-based study of any ailment. To accomplish this task, 900 tweets were collected using Twitter API v1.1 in a Python script. Tweets, follower counts, and user information were extracted via the scripts. The tweets were segregated into different groups on the basis of their annotations related to risk factors, complications, preventions and precautions, and so on. We then used SAS® Text Miner to analyze the data. We found that 70% of the tweets stated the status quo, based on marketing and awareness campaigns. The remaining 30% of tweets contained various key terms and labels associated with Type-2 diabetes. It was observed that influential users tweeted more about precautionary measures, whereas non-influential people gave suggestions about treatments as well as preventions and precautions.
Read the paper (PDF).
Shubhi Choudhary, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Vijay Singh, Oklahoma State University
Paper 3449-2015:
Applied Analytics in Festival Tourism: A Case Study of Intention-to-Revisit Prediction in an Annual Local Food Festival in Thailand
Improving tourists' satisfaction and their intention to revisit a festival is an ongoing area of interest to the tourism industry. Many festival organizers strive very hard to attract and retain attendees by investing heavily in their marketing and promotion strategies for the festival. To meet this challenge, an advanced analytical model based on a data mining approach is proposed to answer the following research question: What are the most important factors that influence tourists' intentions to revisit the festival site? Cluster analysis, neural networks, decision trees, stepwise regression, polynomial regression, and support vector machines are applied in this study. The main goal is to determine what it takes not only to retain loyal attendees, but also to attract new attendees and encourage them to return to the site.
Read the paper (PDF).
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Thanathorn Vajirakachorn, Dept. of Tourism Management, School of Business, University of the Thai Chamber of Commerce
Paper 3354-2015:
Applying Data-Driven Analytics to Relieve Suffering Associated with Natural Disasters
Managing the large-scale displacement of people and communities caused by a natural disaster has historically been reactive rather than proactive. Following a disaster, data is collected to inform and prompt operational responses. In many countries prone to frequent natural disasters, such as the Philippines, large amounts of longitudinal data are collected and available to apply to new disaster scenarios. However, because of the nature of natural disasters, it is difficult to analyze all of the data until long after the emergency has passed. For this reason, little research and analysis have been conducted to derive deeper analytical insight for proactive responses. This paper demonstrates the application of SAS® analytics to this data and establishes predictive alternatives that can improve conventional storm responses. Humanitarian organizations can use this data to understand displacement patterns and trends and to optimize evacuation routing and planning. Identifying the main contributing factors and leading indicators for the displacement of communities in a timely and efficient manner prevents detrimental incidents at disaster evacuation sites. Using quantitative and qualitative methods, responding organizations can make data-driven decisions that innovate and improve approaches to managing disaster response on a global basis. The benefits of creating a data-driven analytical model can help reduce response time, improve the health and safety of displaced individuals, and optimize scarce resources in a more effective manner. The International Organization for Migration (IOM), an intergovernmental organization, is one of the first-response organizations on the ground in most emergencies. IOM is the global co-lead for the Camp Coordination and Camp Management (CCCM) cluster in natural disasters. This paper shows how SAS® Visual Analytics and SAS® Visual Statistics were used for the Philippines in response to Super Typhoon Haiyan in November 2013 to develop increasingly accurate models for better emergency preparedness. Using data collected from IOM's Displacement Tracking Matrix (DTM), the final analysis shows how to better coordinate service delivery to evacuation centers sheltering large numbers of displaced individuals, applying accurate hindsight to develop foresight on how to better respond to emergencies and disasters. Predictive models build on patterns found in historical and transactional data to identify risks and opportunities. The capacity to predict trends and behavior patterns related to displacement and mobility has the potential to enable the IOM to respond in a more timely and targeted manner. By predicting the locations of displacement, the numbers of persons displaced, the number of vulnerable groups, and the sites most at risk of security incidents, humanitarians can respond quickly and more effectively with the appropriate resources (material and human) from the outset. The end analysis uses the SAS® Storm Optimization model combined with human mobility algorithms to predict population movement.
Lorelle Yuen, International Organization for Migration
Kathy Ball, Devon Energy
B
Paper 3082-2015:
Big Data Meets Little Data: Hadoop and Arduino Integration Using SAS®
SAS® has been an early leader in big data technology architecture that more easily integrates unstructured files across multi-tier data system platforms. By using SAS® Data Integration Studio and SAS® Enterprise Business Intelligence software, you can easily automate big data using SAS® system accommodations for Hadoop open-source standards. At the same time, another seminal technology has emerged, which involves real-time multi-sensor data integration using Arduino microprocessors. This break-out session demonstrates the use of SAS® 9.4 coding to define Hadoop clusters and to automate Arduino data acquisition to convert custom unstructured log files into structured tables, which can be analyzed by SAS in near real time. Examples include the use of SAS Data Integration Studio to create and automate stored processes, as well as tips for C language object coding to integrate to SAS data management, with a simple temperature monitoring application for Hadoop to Arduino using SAS.
Keith Allan Jones, PhD, QUALIMATIX.com
C
Paper SAS1913-2015:
Clustering Techniques to Uncover Relative Pricing Opportunities: Relative Pricing Corridors Using SAS® Enterprise Miner and SAS® Visual Analytics
The impact of price on brand sales is not always linear or independent of other brand prices. We demonstrate, using sales information and SAS® Enterprise Miner, how to uncover relative price bands where prices might be increased without losing market share or decreased slightly to gain share.
Read the paper (PDF).
Ryan Carr, SAS
Charles Park, Lenovo
Paper 3511-2015:
Credit Scorecard Generation Using the Credit Scoring Node in SAS® Enterprise Miner™
In today's competitive world, acquiring new customers is crucial for businesses, but what if most of the acquired customers turn out to be defaulters? This decision would backfire on the business and might lead to losses. Extant statistical methods have enabled businesses to identify good-risk customers rather than judging them intuitively. The objective of this paper is to build a credit risk scorecard using the Credit Scoring node inside SAS® Enterprise Miner™ 12.3, which a manager can use to make an instant decision on whether to accept or reject a customer's credit application. The data set used for credit scoring was extracted from the UCI Machine Learning Repository and consists of 15 variables that capture details such as the status of the customer's existing checking account, purpose of the credit, credit amount, employment status, and property. To ensure generalization of the model, the data set was partitioned using the Data Partition node into training and validation groups in a 70:30 ratio. The target is a binary variable that categorizes customers into good-risk and bad-risk groups. After identifying the key variables required to generate the credit scorecard, a particular score was assigned to each of their subgroups. The final model generating the scorecard has a prediction accuracy of about 75%. A cumulative cut-off score of 120 was generated by SAS to mark the demarcation between good-risk and bad-risk customers. Even in the case of future variations in the data, model refinement is easy because the whole process is already defined and does not need to be rebuilt from scratch.
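The 70:30 partition and the accept/reject decision at the cutoff described above can be sketched outside SAS Enterprise Miner as follows; the data set and score variable names are hypothetical, the direction of the rule is illustrative, and only the cutoff of 120 comes from the abstract.
   /* Sketch: 70/30 training/validation split, then a simple accept/reject
      rule at a total scorecard score of 120. Names are hypothetical. */
   proc surveyselect data=work.applicants out=work.split
                     samprate=0.70 outall seed=2015;
   run;
   data work.train work.valid;
      set work.split;
      if selected then output work.train;    /* Selected=1 -> 70% training   */
      else output work.valid;                /* Selected=0 -> 30% validation */
   run;
   data work.decisions;
      set work.valid;
      length decision $6;
      if total_score >= 120 then decision = 'accept';   /* illustrative direction */
      else decision = 'reject';
   run;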
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Kushal Kathed, Oklahoma State University
Shilpi Prasad, Oklahoma State University
D
Paper SAS4780-2015:
Deriving Insight Across the Enterprise from Digital Data
Learn how leading retailers are developing key findings in digital data to be leveraged across marketing, merchandising, and IT.
Rachel Thompson, SAS
Paper 3368-2015:
Determining the Key Success Factors for Hit Songs in the Billboard Music Charts
Analyzing the key success factors for hit songs in the Billboard music charts is an ongoing area of interest to the music industry. Although there have been many studies over the past decades on predicting whether a song has the potential to become a hit, the following research question remains: Can hit songs be predicted? And, if so, what are the characteristics of those hit songs? This study applies data mining techniques using SAS® Enterprise Miner™ to understand why some music is more popular than other music. In particular, certain songs are considered one-hit wonders, which appear in the Billboard music charts only once. Meanwhile, other songs are acknowledged as masterpieces. With 2,139 data records, the results demonstrate the practical validity of our approach.
Read the paper (PDF).
Piboon Banpotsakun, National Institute of Development Administration
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Paper 3021-2015:
Discovering Personality Type through Text Mining in SAS® Enterprise Miner™ 12.3
Data scientists and analytic practitioners have become obsessed with quantifying the unknown. Through text mining third-person posthumous narratives in SAS® Enterprise Miner™ 12.1, we measured tangible aspects of personalities based on the broadly accepted big-five characteristics: extraversion, agreeableness, conscientiousness, neuroticism, and openness. These measurable attributes are linked to common descriptive terms used throughout our data to establish statistical relationships. The data set contains over 1,000 obituaries from newspapers throughout the United States, covering individuals who vary in age, gender, demographics, and socio-economic circumstances. In our study, we leveraged existing literature to build the ontology used in the analysis. This literature suggests that a third person's perspective gives insight into one's personality, solidifying the use of obituaries as a source for analysis. We statistically linked target topics such as career, education, religion, art, and family to the five characteristics. With these taxonomies, we developed multivariate models in order to assign scores that predict an individual's personality type. With a trained model, this study has implications for predicting an individual's personality, allowing for better decisions on human capital deployment. Even outside the traditional application of personality assessment for organizational behavior, the methods used to extract intangible characteristics from text enable us to identify valuable information across multiple industries and disciplines.
Read the paper (PDF).
Mark Schneider, Deloitte & Touche
Andrew Van Der Werff, Deloitte & Touche, LLP
Paper 3347-2015:
Donor Sentiment and Characteristic Analysis Using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
It has always been a million-dollar question: What inhibits a donor from donating? Many successful universities have deep roots in annual giving. We know donor sentiment is a key factor in drawing attention and engaging donors. This paper is a summary of findings about donor behaviors using textual analysis combined with the power of predictive modeling. In addition to identifying the characteristics of donors, the paper focuses on identifying the characteristics of a first-time donor. It distinguishes the features of the first-time donor from the general donor pattern. It leverages the variations in the data to provide deeper insights into behavioral patterns. A data set containing 247,000 records was obtained from the XYZ University Foundation alumni database, Facebook, and Twitter. Solicitation content, such as email subject lines sent to the prospect base, was considered. Time-dependent and time-independent data were categorized to make unbiased predictions about the first-time donor. The predictive models use inputs such as age, educational records, scholarships, events, student memberships, and solicitation methods. Models such as decision trees, Dmine regression, and neural networks were built to predict the prospects. SAS® Sentiment Analysis Studio and SAS® Enterprise Miner™ were used to analyze the sentiment.
Read the paper (PDF).
Ramcharan Kakarla, Comcast
Goutam Chakraborty, Oklahoma State University
Paper SAS1865-2015:
Drilling for Deepwater Data: A Forensic Analysis of the Gulf of Mexico Deepwater Horizon Disaster
During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS® Asset Performance Analytics and SAS® Enterprise Miner™ to build a model for drilling and well-control anomalies, to fingerprint key well-control measures of the transient fluid properties, and to operationalize these analytics on the drilling assets with SAS® Event Stream Processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables rapid differentiation between safe and unsafe modes of operation.
Read the paper (PDF).
Jim Duarte, SAS
Keith Holdaway, SAS
Moray Laing, SAS
E
Paper 3190-2015:
Educating Future Business Leaders in the Era of Big Data
At NC State University, our motto is Think and Do. When it comes to educating students in the Poole College of Management, that means that we want them to not only learn to think critically but also to gain hands-on experience with the tools that will enable them to be successful in their careers. And, in the era of big data, we want to ensure that our students develop skills that will help them to think analytically in order to use data to drive business decisions. One method that lends itself well to thinking and doing is the case study approach. In this paper, we discuss the case study approach for teaching analytical skills and highlight the use of SAS® software for providing practical, hands-on experience with manipulating and analyzing data. The approach is illustrated with examples from specific case studies that have been used for teaching introductory and intermediate courses in business analytics.
Read the paper (PDF).
Tonya Balan, NC State University
G
Paper SAS2602-2015:
Getting Started with Enabling Your End-User Applications to Use SAS® Grid Manager 9.4
A SAS® Grid Manager environment provides your organization with a powerful and flexible way to manage many forms of SAS® computing workloads. For the business and IT user community, the benefits can range from data management jobs effectively utilizing the available processing resources, complex analyses being run in parallel, and reassurance that statutory reports are generated in a highly available environment. This workshop begins the process of familiarizing users with the core concepts of how to grid-enable tasks within SAS® Studio, SAS® Enterprise Guide®, SAS® Data Integration Studio, and SAS® Enterprise Miner™ client applications.
Edoardo Riva, SAS
Paper 3198-2015:
Gross Margin Percent Prediction: Using the Power of SAS® Enterprise Miner™ 12.3 to Predict the Gross Margin Percent for a Steel Manufacturing Company
Predicting the profitability of future sales orders in a price-sensitive, highly competitive make-to-order market can create a competitive advantage for an organization. Order size and specifications vary from order to order and customer to customer, and might or might not be repeated. While it is the intent of the sales groups to take orders for a profit, because of the volatility of steel prices and the competitive nature of the markets, gross margins can range dramatically from one order to the next and in some cases can be negative. Understanding the key factors affecting the gross margin percent and their impact can help the organization reduce the risk of non-profitable orders and at the same time improve its decision-making ability for market planning and forecasting. The objective of this paper is to identify the model, among multiple predictive models built inside SAS® Enterprise Miner™, that most accurately predicts the gross margin percent for future orders. The data used for the project consisted of over 30,000 transactional records and 33 input variables. The sales records were collected from multiple manufacturing plants of the steel manufacturing company. Variables such as order quantity, customer location, sales group, and others were used to build predictive models. The target variable, gross margin percent, is the net profit on sales, considering factors such as labor cost, cost of raw materials, and so on. The Model Comparison node of SAS Enterprise Miner was used to determine the best among different variations of regression models, decision trees, and neural networks, as well as ensemble models. Average squared error was used as the fit statistic to evaluate each model's performance. Based on the preliminary model analysis, the ensemble model outperforms the other models with the lowest average squared error.
Read the paper (PDF).
Kushal Kathed, Oklahoma State University
Patti Jordan
Ayush Priyadarshi, Oklahoma State University
H
Paper SAS1704-2015:
Helpful Hints for Transitioning to SAS® 9.4
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you are getting oriented and will provide short, straightforward answers. We also share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS® Customer Experience Testing shares insiders' practical tips to ensure that you are ready to begin your transition to SAS 9.4. The target audience for this paper is primarily system administrators who will be installing, configuring, or administering the SAS 9.4 environment. (This paper is an updated version of the paper presented at SAS Global Forum 2014 and includes new features and software changes since the original paper was delivered, plus any relevant content that still applies. This paper includes information specific to SAS 9.4 and SAS 9.4 maintenance releases.)
Read the paper (PDF).
Cindy Taylor, SAS
Paper 3431-2015:
How Latent Analyses within Survey Data Can Be Valuable Additions to Any Regression Model
This study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Three different analyses for latent variable discovery are briefly reviewed and explored. The procedures explored in this paper are PROC LCA, PROC LTA, PROC CATMOD, PROC FACTOR, PROC TRAJ, and PROC SURVEYLOGISTIC. The analyses defined through these procedures are latent profile analyses, latent class analyses, and latent transition analyses. The latent variables are included in three separate regression models. The effect of the latent variables on the fit and use of the regression model compared to a similar model using observed data is briefly reviewed. The data used for this study was obtained via the National Longitudinal Study of Adolescent Health, a study distributed and collected by Add Health. Data was analyzed using SAS® 9.3. This paper is intended for any level of SAS® user. This paper is also aimed at an audience with a background in behavioral science or statistics.
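As a simple illustration of the final step (carrying a discovered latent variable into a survey-aware regression), the sketch below feeds an assigned latent class into PROC SURVEYLOGISTIC; the data set, design variables, and class assignments are hypothetical and are not taken from the paper.
   /* Sketch: an assigned latent class used as a predictor in a
      survey-weighted logistic regression. Names are hypothetical. */
   proc surveylogistic data=work.addhealth;
      cluster psu;                         /* primary sampling unit */
      strata  stratum;                     /* sampling stratum      */
      weight  sampweight;                  /* survey weight         */
      class   latent_class / param=ref;
      model   outcome(event='1') = latent_class age;
   run;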
Read the paper (PDF).
Deanna Schreiber-Gregory, National University
Paper 3185-2015:
How to Hunt for Utility Customer Electric Usage Patterns Armed with SAS® Visual Statistics with Hadoop and Hive
Your electricity usage patterns reveal a lot about your family and routines. Information collected from electrical smart meters can be mined to identify patterns of behavior that can in turn be used to help change customer behavior for the purpose of altering system load profiles. Demand Response (DR) programs represent an effective way to cope with rising energy needs and increasing electricity costs. The Federal Energy Regulatory Commission (FERC) defines demand response as changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to lower electricity use at times of high wholesale market prices or when system reliability is jeopardized. In order to effectively motivate customers to voluntarily change their consumption patterns, it is important to identify customers whose load profiles are similar so that targeted incentives can be directed toward these customers. Hence, it is critical to use tools that can accurately cluster similar time series patterns while providing a means to profile these clusters. In order to solve this problem, though, hardware and software capable of storing, extracting, transforming, loading, and analyzing large amounts of data must first be in place. Utilities receive customer data from smart meters, which track and store customer energy usage. The data collected is sent to the energy companies every fifteen minutes or hourly. With millions of meters deployed, this quantity of information creates a data deluge for utilities, because each customer generates about three thousand data points monthly, and more than thirty-six billion reads are collected annually for a million customers. The data scientist is the hunter, and DR candidate patterns are the prey in this cat-and-mouse game of finding customers willing to curtail electrical usage for a program benefit. The data scientist must connect large siloed data sources, external data, and even unstructured data to detect common customer electrical usage patterns, build dependency models, and score them against the customer population. Taking advantage of Hadoop's ability to store and process data on commodity hardware with distributed parallel processing is a game changer. With Hadoop, no data set is too large, and SAS® Visual Statistics leverages machine learning, artificial intelligence, and clustering techniques to build descriptive and predictive models. All data from disparate systems can be used, including structured data, unstructured data, and log files. The data scientist can use Hadoop to ingest all available data at rest and analyze customer usage patterns, system electrical flow data, and external data such as weather. This paper uses Cloudera Hadoop with Apache Hive queries for analysis on platforms such as SAS® Visual Analytics and SAS Visual Statistics. The paper showcases options within Hadoop for querying large data sets with open-source tools and importing these data into SAS® for robust customer analytics: clustering customers by usage profiles, modeling propensity to respond to a demand response event, and analyzing the electrical system for Demand Response events.
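One way to pull Hive-resident smart-meter reads into SAS for the kind of clustering work described above is explicit SQL pass-through; the sketch below is hypothetical (server, credentials, and table/column names are assumptions) and is not the configuration used in the paper.
   /* Hypothetical sketch: pass-through to Hive, aggregating interval reads
      into daily kWh per meter before clustering in SAS. */
   proc sql;
      connect to hadoop (server="hive-node.example.com" port=10000
                         subprotocol=hive2);
      create table work.daily_profile as
      select * from connection to hadoop (
         select meter_id,
                to_date(read_ts) as read_date,
                sum(kwh)         as daily_kwh
         from   meter_reads
         group by meter_id, to_date(read_ts)
      );
      disconnect from hadoop;
   quit;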
Read the paper (PDF).
Kathy Ball, SAS
Paper 3446-2015:
How to Implement Two-Phase Regression Analysis to Predict Profitable Revenue Units
Is it a better business decision to determine the profitability of all business units/kiosks and then decide to prune the nonprofitable ones? Or does model performance improve if we first find the units that meet the break-even point and then try to calculate their profits? In our project, we used a two-stage regression process because of the highly skewed distribution of the variables. First, we performed logistic regression to predict which kiosks would be profitable. Then, we used linear regression to predict the average monthly revenue at each kiosk. We used SAS® Enterprise Guide® and SAS® Enterprise Miner™ for the modeling process. The linear regression model is much more effective at predicting the target variable for profitable kiosks than for unprofitable kiosks. The two-phase regression model seemed to perform better than simply performing a linear regression, particularly when the target variable has too many levels. In real-life situations, the dependent and independent variables can have highly skewed distributions, and two-phase regression can help improve model performance and accuracy. Some results: The logistic regression model has an overall accuracy of 82.9%, sensitivity of 92.6%, and specificity of 61.1%, with comparable figures for the training data set of 81.8%, 90.7%, and 63.8%, respectively. This indicates that the regression model seems to be consistently predicting the profitable kiosks at a reasonably good level. Linear regression model: For the training data set, the mean absolute percentage error (MAPE) of the predicted values (not log-transformed) versus the actual values of the target is 7.2% for the kiosks that earn more than $350, whereas the MAPE for kiosks that earn less than $350 is -102%. For the validation data set, the MAPE is 7.6% for the kiosks that earn more than $350, whereas the MAPE for kiosks that earn less than $350 is -142%. This means that average monthly revenue seems to be better predicted for kiosks earning more than the threshold value of $350--that is, for those kiosks with a flag variable of 1. The model seems to predict the target variable with lower APE for higher values of the target variable for both the training data set and the entire data set. In fact, if the threshold value for the kiosks is moved even to, say, $500, the predictive power of the model in terms of APE will substantially increase. The validation data set (Selection Indicator=0) has fewer data points, and, therefore, the contrast in APEs is higher and more varied.
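The two-phase flow itself can be sketched in a few SAS steps: a logistic model flags kiosks predicted to be profitable, and a linear model is then fit only on that subset; the data set and variable names below are hypothetical, not the authors' code.
   /* Sketch of the two-phase idea; names are hypothetical. */
   proc logistic data=work.kiosks;
      model profitable(event='1') = footfall rent competition;
      output out=work.scored p=p_profit;        /* stage 1: predicted probability */
   run;
   proc reg data=work.scored(where=(p_profit >= 0.5));
      model monthly_revenue = footfall rent competition;   /* stage 2: revenue model */
   run;
   quit;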
Read the paper (PDF).
Shrey Tandon, Sobeys West
Paper 3151-2015:
How to Use Internal and External Data to Realize the Potential for Changing the Game in Handset Campaigns
The telecommunications industry is the fastest-changing business ecosystem in this century. Therefore, handset campaigning to increase loyalty is the top issue for telco companies. However, these handset campaigns carry great fraud and payment risks if the companies do not have the ability to classify and assess customers properly according to their risk propensity. For many years, telco companies managed the risk with business rules such as customer tenure, until the launch of analytics solutions into the market. But few business rules can be applied when selling handsets to new customers. On the other hand, with increasing competitive pressure on telco companies, it is necessary to use external credit data to sell handsets to new customers. Credit bureau data was a good opportunity to measure and understand the behaviors of applicants. But using external data required system integration and real-time decision systems. For those reasons, we needed a solution that enables us to predict risky customers and then integrate risk scores and all other information into one real-time decision engine for optimized handset application vetting. After an assessment period, the SAS® analytics platform and SAS® Real-Time Decision Manager (RTDM) were chosen as the most suitable solution because they provide a flexible, user-friendly interface, high integration, and fast deployment capability. In this project, we built a process that includes three main stages to transform data into knowledge. These stages are data collection, predictive modelling, and deployment and decision optimization. a) Data Collection: We designed a specific, daily updated data mart that connects internal payment behavior, demographics, and customer experience data with external credit bureau data. In this way, we can turn data into meaningful knowledge for a better understanding of customer behavior. b) Predictive Modelling: To use the company's potential, it is critically important to use an analytics approach that is based on state-of-the-art technologies. We built nine models to predict customer propensity to pay. As a result of better classification of customers, we obtained satisfactory results in designing collection scenarios and the decision model for handset application vetting. c) Deployment and Decision Optimization: Knowledge is not enough to reach success in business. It should be turned into optimized decisions and deployed in real time. For this reason, we have been using SAS® predictive analytics tools and SAS Real-Time Decision Manager, primarily to turn data into knowledge and then knowledge into strategy and execution. With this system, we are now able to assess customers properly and to sell handsets even to our brand-new customers as part of the application vetting process. As a result, while decreasing nonpayment risk, we generated extra revenue from brand-new contracted customers. In three months, 13% of all handset sales were concluded via RTDM. Another benefit of the RTDM is a 30% cost saving in external data inquiries. Thanks to the RTDM, Avea has become the first telecom operator in the Turkish telco industry to use bureau data.
Read the paper (PDF).
Hurcan Coskun, Avea
I
Paper 3343-2015:
Improving SAS® Global Forum Papers
Just as research is built on existing research, the references section is an important part of a research paper. The purpose of this study is to find the differences between professionals and academicians with respect to the references section of a paper. Data is collected from SAS® Global Forum 2014 Proceedings. Two research hypotheses are supported by the data. First, the average number of references in papers by academicians is higher than those by professionals. Second, academicians follow standards for citing references more than professionals. Text mining is performed on the references to understand the actual content. This study suggests that authors of SAS Global Forum papers should include more references to increase the quality of the papers.
Read the paper (PDF).
Vijay Singh, Oklahoma State University
Pankush Kalgotra, Oklahoma State University
Paper SAS1965-2015:
Improving the Performance of Data Mining Models with Data Preparation Using SAS® Enterprise Miner™
In data mining modelling, data preparation is the most crucial, most difficult, and longest part of the mining process. A lot of steps are involved. Consider the simple distribution analysis of the variables, the diagnosis and reduction of the influence of variables' multicollinearity, the imputation of missing values, and the construction of categories in variables. In this presentation, we use data mining models in different areas like marketing, insurance, retail and credit risk. We show how to implement data preparation through SAS® Enterprise Miner™, using different approaches. We use simple code routines and complex processes involving statistical insights, cluster variables, transform variables, graphical analysis, decision trees, and more.
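As one small example of the kind of coded preparation step discussed here, median imputation of interval inputs can be done outside the Impute node with PROC STDIZE; the data set and variable names are hypothetical, and this is a generic sketch rather than the presentation's code.
   /* Sketch: replace only the missing values of interval inputs with
      their medians. Names are hypothetical. */
   proc stdize data=work.customers out=work.customers_imp
               method=median reponly;
      var income balance tenure;
   run;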
Read the paper (PDF).
Ricardo Galante, SAS
Paper 3356-2015:
Improving the Performance of Two-Stage Modeling Using the Association Node of SAS® Enterprise Miner™ 12.3
Over the years, very few published studies have discussed ways to improve the performance of two-stage predictive models. This study, based on 10 years (1999-2008) of data from 130 US hospitals and integrated delivery networks, is an attempt to demonstrate how we can leverage the Association node in SAS® Enterprise Miner™ to improve the classification accuracy of a two-stage model. We prepared the data with imputation operations and data cleaning procedures. Variable selection methods and domain knowledge were used to choose 43 key variables for the analysis. The prominent association rules revealed interesting relationships between prescribed medications and patient readmission/no-readmission. The rules with lift values greater than 1.6 were used to create dummy variables for use in the subsequent predictive modeling. Next, we used two-stage sequential modeling, where the first stage predicted whether the diabetic patient was readmitted and the second stage predicted whether the readmission happened within 30 days. The backward logistic regression model outperformed competing models for the first stage. After including the dummy variables from the association analysis, several fit indices improved: the validation average squared error (ASE) improved from 0.238 to 0.228, and cumulative lift improved from 1.40 to 1.56. Likewise, the performance of the second stage improved after including the dummy variables from the association analysis: the misclassification rate improved from 0.243 to 0.240, and the final prediction error improved from 0.18 to 0.17.
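The rule-to-dummy step and the stage-two filter described above reduce to simple data steps; the sketch below uses hypothetical medication and data set names and is not the authors' code.
   /* Sketch: turn a high-lift association rule into a dummy input, then
      restrict stage two to readmitted patients. Names are hypothetical. */
   data work.model_input;
      set work.patients;
      rule_insulin_metformin = (insulin = 'Yes' and metformin = 'Yes');
   run;
   data work.stage2;
      set work.model_input;
      if readmitted = 1;        /* stage 2: among readmitted, was it within 30 days? */
   run;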
Read the paper (PDF).
Girish Shirodkar, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ankita Chaudhari, Oklahoma State University
K
Paper 3023-2015:
Killing Them with Kindness: Policies Not Based on Data Might Do More Harm Than Good
Educational administrators sometimes have to make decisions based on what they believe is in the best interest of their students because they do not have the data they need at the time. Some administrators do not even know that the data exist to help them make their decisions. However, well-intentioned policies that are not based on facts can sometimes do more harm than good for the students and the institution. This presentation discusses the results of the policy analyses conducted by the Office of Institutional Research at Western Kentucky University using Base SAS®, SAS/STAT®, SAS® Enterprise Miner™, and SAS® Visual Analytics. The researchers analyzed Western Kentucky University's math course placement procedure for incoming students and assessed the criteria used for admissions decisions, including those for first-time first-year students, transfer students, and students readmitted to the University after being dismissed for unsatisfactory academic progress--procedures and criteria previously designed with the students' best interests at heart. The presenters discuss the statistical analyses used to evaluate the policies and the use of SAS Visual Analytics to present their results to administrators in a visual manner. In addition, the presenters discuss subsequent changes in the policies, and where possible, the results of the policy changes.
Read the paper (PDF).
Tuesdi Helbig, Western Kentucky University
Matthew Foraker, Western Kentucky University
L
Paper SAS1955-2015:
Latest and Greatest: Best Practices for Migrating to SAS® 9.4
SAS® customers benefit greatly when they are using the functionality, performance, and stability available in the latest version of SAS. However, the task of moving all SAS collateral such as programs, data, catalogs, metadata (stored processes, maps, queries, reports, and so on), and content to SAS® 9.4 can seem daunting. This paper provides an overview of the steps required to move all SAS collateral from systems based on SAS® 9.2 and SAS® 9.3 to the current release of SAS® 9.4.
Read the paper (PDF).
Alec Fernandez, SAS
Paper 3342-2015:
Location-Based Association of Customer Sentiment and Retail Sales
There are various economic factors that affect retail sales. One important factor that is expected to correlate with sales is overall customer sentiment toward a brand. In this paper, we analyze how location-specific customer sentiment varies and correlates with sales at retail stores. In our attempt to find any dependency, we used location-specific Twitter feeds related to a national-brand chain retail store. We opinion-mine their overall sentiment using SAS® Sentiment Analysis Studio. We estimate the correlation between the opinion index and retail sales within the studied geographic areas. Later in the analysis, using ArcGIS Online from Esri, we estimate whether other location-specific variables that could potentially correlate with customer sentiment toward the brand are significant predictors of the brand's retail sales.
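The correlation estimate described above amounts to a call to PROC CORR on a location-level table; the data set and variable names below are hypothetical.
   /* Sketch: correlation between the location-level opinion index and sales. */
   proc corr data=work.by_location pearson spearman;
      var opinion_index retail_sales;
   run;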
Read the paper (PDF).
Asish Satpathy, University of California, Riverside
Goutam Chakraborty, Oklahoma State University
Tanvi Kode, Oklahoma State University
M
Paper 3375-2015:
Maximizing a Churn Campaign's Profitability with Cost-sensitive Predictive Analytics
Predictive analytics has been widely studied in recent years, and it has been applied to solve a wide range of real-world problems. Nevertheless, current state-of-the-art predictive analytics models are not well aligned with managers' requirements in that the models fail to include the real financial costs and benefits during the training and evaluation phases. Churn predictive modeling is one of those examples in which evaluating a model based on a traditional measure such as accuracy or predictive power does not yield the best results when measured by investment per subscriber in a loyalty campaign and the financial impact of failing to detect a real churner versus wrongly predicting a non-churner as a churner. In this paper, we propose a new financially based measure for evaluating the effectiveness of a voluntary churn campaign, taking into account the available portfolio of offers, their individual financial cost, and the probability of acceptance depending on the customer profile. Then, using a real-world churn data set, we compared different cost-insensitive and cost-sensitive predictive analytics models and measured their effectiveness based on their predictive power and cost optimization. The results show that using a cost-sensitive approach yields an increase in profitability of up to 32.5%.
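For intuition only, an example-dependent cost evaluation can be sketched as below; this is a generic formulation with hypothetical names and is not the authors' proposed measure, which additionally models the portfolio of offers and their acceptance probabilities.
   /* Generic sketch: each customer carries its own costs; total campaign
      cost is compared across models. Names are hypothetical. */
   data work.campaign_cost;
      set work.scored;                        /* y = actual churn, p = model score */
      predicted = (p >= 0.5);
      cost = predicted * offer_cost                  /* retention offer for contacted customers */
           + (1 - predicted) * y * customer_value;   /* lost value of missed churners           */
   run;
   proc means data=work.campaign_cost sum;
      var cost;                               /* total cost to compare across models */
   run;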
Alejandro Correa Bahnsen, University of Luxembourg
Darwin Amezquita, DIRECTV
Juan Camilo Arias, Smartics
Paper 2524-2015:
Methodology of Model Creation
The goal of this session is to describe the whole process of model creation, from the business request through model specification, data preparation, iterative model creation, model tuning, implementation, and model servicing. Each phase consists of several steps, for which we describe the main goal of the step, the expected outcome, the tools used, our own SAS® code, useful nodes and settings in SAS® Enterprise Miner™, procedures in SAS® Enterprise Guide®, measurement criteria, and expected duration in man-days. For three steps, we also present deep insights with examples of practical usage, explanations of the code used, settings, and ways of exploring and interpreting the output. During the actual model creation process, we suggest using Microsoft Excel to keep all input metadata along with information about transformations performed in SAS Enterprise Miner. To get faster information about model results, we use an automatic SAS code generator implemented in Excel; we then input this code into SAS Enterprise Guide and create a specific profile of results directly from the output tables of the SAS Enterprise Miner nodes. This paper also presents an example of checking the stability of a binary model over time in SAS Enterprise Guide by measuring the optimal cut-off percentage and lift. These measurements are visualized and automated using our own code. With this methodology, users have direct contact with the transformed data and can analyze and explore any intermediate results. Furthermore, the proposed approach can be used for several types of modeling (for example, binary and nominal predictive models or segmentation models). Generally, we have summarized our best practices for combining specific procedures performed in SAS Enterprise Guide, SAS Enterprise Miner, and Microsoft Excel to create and interpret models faster and more effectively.
Read the paper (PDF).
Peter Kertys, VÚB a.s.
Paper 1381-2015:
Model Risk and Corporate Governance of Models with SAS®
Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss-given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level built on these models and that are used to help in achieving business targets, risk-sensitive pricing, capital planning, optimizing of ROE/RAROC, managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate and their predictive power can drop dramatically. Since the global financial crisis in 2008, we have faced a tsunami of regulation and accelerated frequency of changes in the business environment, which cause models to deteriorate faster than ever before. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results--without human intervention) might result in a huge error that can have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS® tools: model monitoring, SAS® Model Manager, dashboards, and SAS® Visual Analytics.
Read the paper (PDF).
Boaz Galinson, Bank Leumi
Paper 3406-2015:
Modeling to Improve the Customer Unit Target Selection for Inspections of Commercial Losses in the Brazilian Electric Sector: The case of CEMIG
Electricity is an extremely important product for society. In Brazil, the electric sector is regulated by ANEEL (Agência Nacional de Energia Elétrica), and one of the regulated aspects is power loss in the distribution system. In 2013, 13.99% of all injected energy was lost in the Brazilian system. Commercial loss is one of the power loss classifications, which can be countered by inspections of the electrical installation in a search for irregularities in power meters. CEMIG (Companhia Energética de Minas Gerais) currently serves approximately 7.8 million customers, which makes it unfeasible (in financial and logistic terms) to inspect all customer units. Thus, the ability to select potential inspection targets is essential. In this paper, logistic regression models, decision tree models, and the Ensemble model were used to improve the target selection process in CEMIG. The results indicate an improvement in the positive predictive value from 35% to 50%.
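As an illustration of the target selection step, here is a minimal sketch with placeholder data set and variable names, not CEMIG's actual model: fit a logistic regression on historical inspection results, score all customer units, and keep the top-ranked units as inspection targets.

proc logistic data=inspections_hist descending;
   class tariff_class region / param=ref;            /* hypothetical inputs          */
   model irregular = avg_kwh_drop meter_age tariff_class region;
   score data=customer_units out=scored;             /* adds P_1, the event probability */
run;

proc sort data=scored;
   by descending p_1;
run;

data target_list;                                    /* keep the top 5% as targets   */
   set scored nobs=n;
   if _n_ <= ceil(0.05 * n);
run;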
Read the paper (PDF).
Sergio Henrique Ribeiro, Cemig
Iguatinan Monteiro, CEMIG
P
Paper 3326-2015:
Predicting Hospitalization of a Patient Using SAS® Enterprise Miner™
Inpatient treatment is the most common type of treatment ordered for patients who have a serious ailment and need immediate attention. Using a data set about diabetes patients downloaded from the UCI Network Data Repository, we built a model to predict the probability that the patient will be rehospitalized within 30 days of discharge. The data has about 100,000 rows and 51 columns. In our preliminary analysis, a neural network turned out to be the best model, followed closely by the decision tree model and regression model.
Nikhil Kapoor, Oklahoma State University
Ganesh Kumar Gangarajula, Oklahoma State University
Paper 3254-2015:
Predicting Readmission of Diabetic Patients Using the High-Performance Support Vector Machine Algorithm of SAS® Enterprise Miner™ 13.1
Diabetes is a chronic condition affecting people of all ages and is present in around 25.8 million people in the U.S. The objective of this research is to predict the probability of a diabetic patient being readmitted. The results from this research will help hospitals design a follow-up protocol to ensure that patients with a higher readmission probability are doing well, in order to promote a healthy doctor-patient relationship. The data was obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. The data set contains over 100,000 instances and 55 variables, such as insulin and length of stay. The data set was split into training and validation to provide an honest assessment of the models. Various variable selection techniques such as stepwise regression, forward regression, LARS, and LASSO were used. Using LARS, prominent factors were identified in determining the patient readmission rate. Numerous predictive models were built: Decision Tree, Logistic Regression, Gradient Boosting, MBR, SVM, and others. The model comparison algorithm in SAS® Enterprise Miner™ 13.1 showed that the High-Performance Support Vector Machine outperformed the other models, having the lowest misclassification rate of 0.363. The chosen model has a sensitivity of 49.7% and a specificity of 75.1% on the validation data.
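A rough sketch of the two modeling steps, with assumed data set and variable names (the authors worked in SAS Enterprise Miner nodes, not this code): LASSO-based screening of candidate inputs followed by a high-performance support vector machine.

proc glmselect data=train;                 /* variable screening on the 0/1 target   */
   model readmit_flag = time_in_hospital num_medications number_inpatient
                        number_emergency number_diagnoses
         / selection=lasso(choose=cv) cvmethod=random(5);
run;

proc hpsvm data=train;                     /* high-performance SVM on retained inputs */
   input time_in_hospital num_medications number_inpatient / level=interval;
   input admission_type discharge_disposition / level=nominal;
   target readmit_flag;
   kernel linear;
run;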
Read the paper (PDF).
Hephzibah Munnangi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 3501-2015:
Predicting Transformer Lifetime Using Survival Analysis and Modeling Risk Associated with Overloaded Transformers Using SAS® Enterprise Miner™ 12.1
Utility companies in America are always challenged when it comes to knowing when their infrastructure will fail. One of the most critical components of a utility company's infrastructure is the transformer. It is important to assess the remaining lifetime of transformers so that the company can reduce costs, plan expenditures in advance, and largely mitigate the risk of failure. It is equally important to identify high-risk transformers in advance and maintain them accordingly in order to avoid sudden loss of equipment due to overloading. This paper uses SAS® to predict the lifetime of transformers, identify the various factors that contribute to their failure, and classify transformers into High, Medium, and Low risk categories based on load for easier maintenance. The data set from a utility company contains around 18,000 observations and 26 variables from 2006 to 2013, including the failure and installation dates of the transformers. The data set also comprises many transformers that were installed before 2006 (there are 190,000 transformers on which several regression models are built in this paper to identify their risk of failure), but no age-related parameter is available for them. Survival analysis was performed on this left-truncated and right-censored data. The data set has variables such as Age, Average Temperature, Average Load, and Normal and Overloaded Conditions for residential and commercial transformers. Data creation involved merging 12 different tables. Nonparametric models for failure time data were built to explore the lifetime and failure rate of the transformers. By building a Cox regression model, the important factors contributing to the failure of a transformer are also analyzed in this paper. Several risk-based models are then built to categorize transformers into High, Medium, and Low risk categories based on their loads. This categorization can help utility companies better manage the risks associated with transformer failures.
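A minimal sketch of the survival step on left-truncated, right-censored data, using the counting-process syntax of PROC PHREG; the variable names (entry_age, exit_age, failed, and the covariates) are assumptions, not the paper's code.

proc phreg data=transformers;
   class load_class customer_type;        /* hypothetical classification inputs          */
   model (entry_age, exit_age)*failed(0) = avg_load avg_temp load_class customer_type;
   /* entry_age = age when observation began (2006 or installation),                     */
   /* exit_age  = age at failure or censoring; failed = 0 indicates a censored record    */
run;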
Read the paper (PDF).
Balamurugan Mohan, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
S
Paper SAS2520-2015:
SAS® Does Data Science: How to Succeed in a Data Science Competition
First introduced in 2013, the Cloudera Data Science Challenge is a rigorous competition in which candidates must provide a solution to a real-world big data problem that surpasses a benchmark specified by some of the world's elite data scientists. The Cloudera Data Science Challenge 2 (in 2014) involved detecting anomalies in the United States Medicare insurance system. Finding anomalous patients, procedures, providers, and regions in the competition's large, complex, and intertwined data sets required industrial-strength tools for data wrangling and machine learning. This paper shows how I did it with SAS®.
Read the paper (PDF). | Download the data file (ZIP).
Patrick Hall, SAS
Paper SAS4083-2015:
SAS® Workshop: Data Mining
This workshop provides hands-on experience using SAS® Enterprise Miner. Workshop participants will learn to: open a project, create and explore a data source, build and compare models, and produce and examine score code that can be used for deployment.
Read the paper (PDF).
Chip Wells, SAS
Paper SAS1856-2015:
SAS® and SAP Business Warehouse on SAP HANA--What's in the Handshake?
Is your company using or considering using SAP Business Warehouse (BW) powered by SAP HANA? SAS® provides various levels of integration with SAP BW in an SAP HANA environment. This integration enables you to not only access SAP BW components from SAS, but to also push portions of SAS analysis directly into SAP HANA, accelerating predictive modeling and data mining operations. This paper explains the SAS toolset for different integration scenarios, highlights the newest technologies contributing to integration, and walks you through examples of using SAS with SAP BW on SAP HANA. The paper is targeted at SAS and SAP developers and architects interested in building a productive analytical environment with the help of the latest SAS and SAP collaborative advancements.
Read the paper (PDF).
Tatyana Petrova, SAS
T
Paper 2920-2015:
Text Mining Kaiser Permanente Member Complaints with SAS® Enterprise Miner™
This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.
Read the paper (PDF).
Amanda Pasch, Kaiser Permanente
Paper 3361-2015:
The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS® Enterprise Miner™
Random Forest (RF) is a trademarked term for an ensemble approach to decision trees. RF was introduced by Leo Breiman in 2001. Because of our familiarity with decision trees--one of the intuitive, easily interpretable models that divides the feature space with recursive partitioning and uses sets of binary rules to classify the target--we also know some of their limitations, such as over-fitting and high variance. RF uses decision trees, but takes a different approach. Instead of growing one deep tree, it aggregates the output of many shallow trees to make a strong classifier. RF significantly improves the accuracy of classification by growing an ensemble of trees and letting them vote for the most popular class. Unlike a single decision tree, RF is robust against over-fitting and high variance, since it randomly selects a subset of variables at each split node. This paper demonstrates this simple yet powerful classification algorithm by building an income-level prediction system. Data extracted from the 1994 Census Bureau database was used for this study. The data set comprises information about 14 key attributes for 45,222 individuals. Using SAS® Enterprise Miner™ 13.1, models such as random forest, decision tree, probability decision tree, gradient boosting, and logistic regression were built to classify the income level (>50K or <=50K) of the population. The results showed that the random forest model was the best model for this data, based on the misclassification rate criterion. The RF model predicts the income-level group of the individuals with an accuracy of 85.1%, with the predictors capturing specific characteristic patterns. This demonstration using SAS® can lead to useful insights into RF for solving classification problems.
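A random forest with comparable settings can also be fit outside the Enterprise Miner GUI with the HPFOREST procedure. The following is a minimal sketch; the data set name, target, inputs, and tuning values are placeholders rather than the author's configuration.

proc hpforest data=census maxtrees=200 vars_to_try=4 leafsize=6 seed=42;
   target income_level / level=binary;                               /* >50K vs <=50K  */
   input age education_num capital_gain hours_per_week / level=interval;
   input workclass occupation marital_status relationship / level=nominal;
run;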
Read the paper (PDF).
Narmada Deve Panneerselvam, OSU
Paper 3820-2015:
Time to Harvest: Operationalizing SAS® Analytics on the SAP HANA Platform
A maximum harvest in farming analytics is achieved only if analytics can also be operationalized at the level of core business applications. Mapped to the use of SAS® Analytics, the fruits of SAS can be shared with Enterprise Business Applications by SAP. Learn how your SAS environment, including the latest SAS® In-Memory Analytics, can be integrated with SAP applications based on the SAP In-Memory Platform SAP HANA. We'll explore how a SAS® Predictive Modeling environment can be embedded inside SAP HANA and how native SAP HANA data management capabilities such as SAP HANA Views, Smart Data Access, and more can be leveraged by SAS applications and contribute to an end-to-end in-memory data management and analytics platform. Come and see how you can extend the reach of your SAS® Analytics efforts with the SAP HANA integration!
Read the paper (PDF).
Morgen Christoph, SAP SE
U
Paper SAS1910-2015:
Unconventional Data-Driven Methodologies Forecast Performance in Unconventional Oil and Gas Reservoirs
How does historical production data relate a story about subsurface oil and gas reservoirs? Business and domain experts must perform accurate analysis of reservoir behavior using only rate and pressure data as a function of time. This paper introduces innovative data-driven methodologies to forecast oil and gas production in unconventional reservoirs that, owing to the nature of the tightness of the rocks, render the empirical functions less effective and accurate. You learn how implementations of the SAS® MODEL procedure provide functional algorithms that generate data-driven type curves on historical production data. Reservoir engineers can now gain more insight to the future performance of the wells across their assets. SAS enables a more robust forecast of the hydrocarbons in both an ad hoc individual well interaction and in an automated batch mode across the entire portfolio of wells. Examples of the MODEL procedure arising in subsurface production data analysis are discussed, including the Duong data model and the stretched exponential data model. In addressing these examples, techniques for pattern recognition and for implementing TREE, CLUSTER, and DISTANCE procedures in SAS/STAT® are highlighted to explicate the importance of oil and gas well profiling to characterize the reservoir. The MODEL procedure analyzes models in which the relationships among the variables comprise a system of one or more nonlinear equations. Primary uses of the MODEL procedure are estimation, simulation, and forecasting of nonlinear simultaneous equation models, and generating type curves that fit the historical rate production data. You will walk through several advanced analytical methodologies that implement the SEMMA process to enable hypotheses testing as well as directed and undirected data mining techniques. SAS® Visual Analytics Explorer drives the exploratory data analysis to surface trends and relationships, and the data QC workflows ensure a robust input space for the performance forecasting methodologies that are visualized in a web-based thin client for interactive interpretation by reservoir engineers.
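As an illustration of the type-curve fitting described above, here is a minimal PROC MODEL sketch for a stretched exponential decline, q(t) = qi*exp(-(t/tau)**n). The data set well_history, the variables rate and month, and the starting values are assumptions, not the authors' settings.

proc model data=well_history;
   parms qi 1000 tau 24 n 0.5;            /* initial guesses for the decline parameters */
   rate = qi * exp( -(month/tau)**n );    /* stretched exponential rate equation        */
   fit rate;                              /* nonlinear least squares fit                */
run;
quit;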
Read the paper (PDF).
Keith Holdaway, SAS
Louis Fabbi, SAS
Dan Lozie, SAS
Paper 1640-2015:
Understanding Characteristics of Insider Threats by Using Feature Extraction
This paper explores feature extraction from unstructured text variables using Term Frequency-Inverse Document Frequency (TF-IDF) weighting algorithms coded in Base SAS®. Data sets with unstructured text variables can often hold a lot of potential to enable better predictive analysis and document clustering. Each of these unstructured text variables can be used as inputs to build an enriched data set-specific inverted index, and the most significant terms from this index can be used as single word queries to weight the importance of the term to each document from the corpus. This paper also explores the usage of hash objects to build the inverted indices from the unstructured text variables. We find that hash objects provide a considerable increase in algorithm efficiency, and our experiments show that a novel weighting algorithm proposed by Paik (2013) best enables meaningful feature extraction. Our TF-IDF implementations are tested against a publicly available data breach data set to understand patterns specific to insider threats to an organization.
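For orientation, a minimal Base SAS sketch of classic TF-IDF weighting with a hash lookup (not the authors' algorithm, which evaluates several weighting schemes including Paik 2013); it assumes a table work.term_counts with one row per (doc_id, term) and a raw term frequency tf.

proc sql noprint;
   create table doc_freq as                        /* document frequency per term */
   select term, count(distinct doc_id) as df
   from work.term_counts
   group by term;
   select count(distinct doc_id) into :ndocs trimmed from work.term_counts;
quit;

data tfidf;
   if 0 then set doc_freq;                         /* define df in the PDV        */
   if _n_ = 1 then do;                             /* load document frequencies   */
      declare hash h(dataset:'doc_freq');
      h.defineKey('term');
      h.defineData('df');
      h.defineDone();
   end;
   set work.term_counts;
   if h.find() = 0 then tfidf = tf * log(&ndocs / df);   /* tf * idf              */
run;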
Read the paper (PDF). | Watch the recording.
Ila Gokarn, Singapore Management University
Clifton Phua, SAS
Paper 3141-2015:
Unstructured Data Mining to Improve Customer Experience in Interactive Voice Response Systems
Interactive Voice Response (IVR) systems are likely one of the best and worst gifts to the world of communication, depending on who you ask. Businesses love IVR systems because they take out hundreds of millions of dollars of call center costs in automation of routine tasks, while consumers hate IVRs because they want to talk to an agent! It is a delicate balancing act to manage an IVR system that saves money for the business, yet is smart enough to minimize consumer abrasion by knowing who they are, why they are calling, and providing an easy automated solution or a quick route to an agent. There are many aspects to designing such IVR systems, including engineering, application development, omni-channel integration, user interface design, and data analytics. For larger call volume businesses, IVRs generate terabytes of data per year, with hundreds of millions of rows per day that track all system and customer-facing events. The data is stored in various formats and is often unstructured (lengthy character fields that store API return information or text fields containing consumer utterances). The focus of this talk is the development of a data mining framework based on SAS® that is used to parse and analyze IVR data in order to provide insights into usability of the application across various customer segments. Certain use cases are also provided.
Read the paper (PDF).
Dmitriy Khots, West Corp
Paper SAS1716-2015:
Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE's Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a category is very small at lower levels of the hierarchy. Blindly applying a standard categorization model is likely to perform poorly if this fact is not taken into consideration. This paper introduces a novel technique for text categorization, Boolean rule extraction, which enables you to effectively address this situation. In addition, models that are generated by a rule-based technique have good interpretability and can be easily modified by a human expert, enabling better human-machine interaction. The paper demonstrates how to use SAS® Text Miner macros and procedures to obtain effective predictive models at all hierarchy levels in a taxonomy.
Read the paper (PDF).
Zheng Zhao, SAS
Russ Albright, SAS
James Cox, SAS
Ning Jin, SAS
Paper SAS1925-2015:
Using SAS to Deliver Web Content Personalization Using Cloud-Based Clickstream Data and On-premises Customer Data
Real-time web content personalization has come into its teen years, but recently a spate of marketing solutions has enabled marketers to finely personalize web content for visitors based on browsing behavior, geo-location, preferences, and so on. In an age where the attention span of a web visitor is measured in seconds, marketers hope that tailoring the digital experience will pique each visitor's interest just long enough to increase corporate sales. The range of solutions spans the entire spectrum from completely cloud-based installations to completely on-premises installations. Marketers struggle to find the optimal solution that would meet their corporation's marketing objectives, provide them the highest agility and time-to-market, and still keep a low marketing budget. In the last decade or so, marketing strategies that involved personalizing using purely on-premises customer data quickly got replaced by ones that involved personalizing using only web-browsing behavior (a.k.a. clickstream data). This was possible because of a spate of cloud-based solutions that enabled marketers to de-couple themselves from the underlying IT infrastructure and the storage issues of capturing large volumes of data. However, this new trend meant that corporations weren't using much of their treasure trove of on-premises customer data. Of late, however, enterprises have been trying hard to find solutions that give them the best of both--the ease of gathering clickstream data using cloud-based applications and on-premises customer data--to perform analytics that lead to better web content personalization for a visitor. This paper explains a process that attempts to address this rapidly evolving need. The paper assumes that the enterprise already has tools for capturing clickstream data, developing analytical models, and presenting the content. It provides a roadmap to implementing a phased approach where enterprises continue to capture clickstream data, but bring that data in-house to be merged with customer data, enabling their analytics team to build sophisticated predictive models that can be deployed into the real-time web-personalization application. The final phase requires enterprises to keep improving their predictive models on a periodic basis.
Read the paper (PDF).
Mahesh Subramanian, SAS Institute Inc.
Suneel Grover, SAS
Paper 3503-2015:
Using SAS® Enterprise Guide®, SAS® Enterprise Miner™, and SAS® Marketing Automation to Make a Collection Campaign Smarter
Companies are increasingly relying on analytics as the right solution to their problems. In order to use analytics and create value for the business, companies first need to store, transform, and structure the data to make it available and functional. This paper shows a successful business case where the extraction and transformation of the data combined with analytical solutions were developed to automate and optimize the management of the collections cycle for a TELCO company (DIRECTV Colombia). SAS® Data Integration Studio is used to extract, process, and store information from a diverse set of sources. SAS Information Map is used to integrate and structure the created databases. SAS® Enterprise Guide® and SAS® Enterprise Miner™ are used to analyze the data, find patterns, create profiles of clients, and develop churn predictive models. SAS® Customer Intelligence Studio is the platform on which the collection campaigns are created, tested, and executed. SAS® Web Report Studio is used to create a set of operational and management reports.
Read the paper (PDF).
Darwin Amezquita, DIRECTV
Paulo Fuentes, Directv Colombia
Andres Felipe Gonzalez, Directv
Paper 3101-2015:
Using SAS® Enterprise Miner™ to Predict Breast Cancer at an Early Stage
Breast cancer is the leading cause of cancer-related deaths among women worldwide, and its early detection can reduce the mortality rate. Using a data set containing information about breast screening provided by the Royal Liverpool University Hospital, we constructed a model that can provide early indication of a patient's tendency to develop breast cancer. This data set has information about breast screening from patients who were believed to be at risk of developing breast cancer. The most important aspect of this work is that we excluded variables that are in one way or another associated with breast cancer, while keeping as input predictors the variables that are less likely to be associated with breast cancer or whose associations with breast cancer are unknown. The target variable is a binary variable with two values, 1 (indicating a type of cancer is present) and 0 (indicating a type of cancer is not present). SAS® Enterprise Miner™ 12.1 was used to perform data validation and data cleansing, to identify potentially related predictors, and to build models that can be used to predict at an early stage the likelihood of patients developing breast cancer. We compared two models: the first model was built with an interactive node and a cluster node, and the second was built without an interactive node and a cluster node. Classification performance was compared using a receiver operating characteristic (ROC) curve and average squared error. Interestingly, we found significantly improved model performance by using only variables that have a lesser or unknown association with breast cancer. The result shows that the logistic model with an interactive node and a cluster node has better performance, with a lower average squared error (0.059614) than the model without an interactive node and a cluster node. Among other benefits, this model will assist inexperienced oncologists in saving time in disease diagnosis.
Read the paper (PDF).
Gibson Ikoro, Queen Mary University of London
Beatriz de la Iglesia, University of East Anglia, Norwich, UK
Paper 3484-2015:
Using SAS® Enterprise Miner™ to Predict the Number of Rings on an Abalone Shell Using Its Physical Characteristics
Abalone is a common name given to sea snails or mollusks. These creatures are highly iridescent, with shells of strong changeable colors. This characteristic makes the shells attractive to humans as decorative objects and jewelry. The abalone structure is also being researched to build body armor. The value of a shell varies by its age and the colors it displays. Determining the number of rings on an abalone is a tedious and cumbersome task and is usually done by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. In this poster, I aim to predict the number of rings on an abalone by using its physical characteristics. The data, obtained from the UCI Machine Learning Repository, consists of 4,177 observations with 8 attributes. I considered the number of rings to be my target variable. The abalone's age can be reasonably approximated as the number of rings on its shell plus 1.5. Using SAS® Enterprise Miner™, I have built regression models and neural network models to identify the physical measurements responsible for determining the number of rings on the abalone. While I have obtained a coefficient of determination of 54.01%, my aim is to improve and expand the analysis using the power of SAS Enterprise Miner. Initial results indicate that the height, the shucked weight, and the viscera weight of the shell are the three most influential variables in predicting the number of rings on an abalone.
Ganesh Kumar Gangarajula, Oklahoma State University
Yogananda Domlur Seetharam
Paper 3212-2015:
Using SAS® to Combine Regression and Time Series Analysis on U.S. Financial Data to Predict the Economic Downturn
During the financial crisis of 2007-2009, the U.S. labor market lost 8.4 million jobs, causing the unemployment rate to increase from 5% to 9.5%. One indicator of economic recession is negative gross domestic product (GDP) growth for two consecutive quarters. This poster combines quantitative and qualitative techniques to predict the economic downturn by forecasting recession probabilities. Data was collected from the Board of Governors of the Federal Reserve System and the Federal Reserve Bank of St. Louis, containing 29 variables and quarterly observations from 1976-Q1 to 2013-Q3. Eleven variables were selected as inputs based on their effects on recession and to limit multicollinearity: long-term treasury yield (5-year and 10-year), mortgage rate, CPI inflation rate, prime rate, market volatility index, BBB-rated corporate bond yield, house price index, stock market index, commercial real estate price index, and one calculated variable, the yield spread (Treasury yield-curve spread). The target variable was a binary variable depicting the economic recession for each quarter (1=Recession). Data was prepared for modeling by applying imputation and transformation on variables. A two-step analysis was used to forecast the recession probabilities for the short-term period. Predicted recession probabilities were first obtained from the Backward Elimination Logistic Regression model, which was selected on the basis of misclassification (validation misclassification = 0.115). These probabilities were then forecasted using the Exponential Smoothing method, which was selected on the basis of mean absolute error (MAE = 11.04). Results show the recession periods, including the Great Recession of 2008, and the forecast for eight quarters (up to 2015-Q3).
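A compressed sketch of the two-step flow described above, with assumed data set and variable names rather than the authors' code: backward-elimination logistic regression to score recession probabilities, then simple exponential smoothing to forecast them eight quarters ahead.

proc logistic data=macro_qtrly descending;
   model recession = yield_spread mortgage_rate cpi_inflation prime_rate
                     volatility_idx house_price_idx stock_idx
         / selection=backward slstay=0.05;
   output out=scored p=p_recession;           /* predicted recession probability     */
run;

proc esm data=scored out=fcst lead=8;         /* forecast eight quarters ahead       */
   id date interval=qtr;                      /* quarterly date variable (assumed)   */
   forecast p_recession / model=simple;       /* simple exponential smoothing        */
run;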
Read the paper (PDF).
Avinash Kalwani, Oklahoma State University
Nishant Vyas, Oklahoma State University
Paper 3508-2015:
Using Text from Repair Tickets of a Truck Manufacturing Company to Predict Factors that Contribute to Truck Downtime
In this era of big data, the use of text analytics to discover insights is rapidly gaining popularity in businesses. On average, more than 80 percent of the data in enterprises may be unstructured. Text analytics can help discover key insights and extract useful topics and terms from the unstructured data. The objective of this paper is to build a model using textual data that predicts the factors that contribute to downtime of a truck. This research analyzes the data of over 200,000 repair tickets of a leading truck manufacturing company. After the terms were grouped into fifteen key topics using the Text Topic node of SAS® Text Miner, a regression model was built using these topics to predict truck downtime, the target variable. Data was split into training and validation for developing the predictive models. Knowledge of the factors contributing to downtime and their associations helped the organization streamline their repair process and improve customer satisfaction.
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
W
Paper 3600-2015:
When Two Are Better Than One: Fitting Two-Part Models Using SAS
In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling health care costs, some patients have zero costs, and the distribution of positive costs is often extremely right-skewed. When modeling charitable donations, many potential donors give nothing, and the majority of donations are relatively small, with a few very large donors. In the analysis of count data, there are also times when there are more zeros than would be expected using standard methodology, or cases where the zeros might differ substantially from the non-zeros, such as the number of cavities a patient has at a dental appointment or the number of children born to a mother. If the data have such structure and ordinary least squares methods are used, then predictions and estimates might be inaccurate. The two-part model gives us a flexible and useful modeling framework in many such situations. Methods for fitting these models with SAS® software are illustrated.
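A minimal sketch of one common two-part specification for semicontinuous cost data, under assumed variable names (cost, age, sex, chronic_count): a logistic model for whether any cost is incurred, and a gamma regression with a log link for the size of the positive costs. This is one reasonable choice, not necessarily the exact models illustrated in the paper.

data costs2;
   set costs;
   any_cost = (cost > 0);                  /* part 1 indicator                   */
run;

proc logistic data=costs2 descending;      /* part 1: P(cost > 0)                */
   class sex;
   model any_cost = age sex chronic_count;
run;

proc genmod data=costs2;                   /* part 2: positive costs only        */
   where cost > 0;
   class sex;
   model cost = age sex chronic_count / dist=gamma link=log;
run;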
Read the paper (PDF).
Laura Kapitula, Grand Valley State University
Y
Paper 3262-2015:
Yes, SAS® Can Do! Manage External Files with SAS Programming
Managing and organizing external files and directories play an important part in our data analysis and business analytics work. A good file management system can streamline project management and file organization and significantly improve work efficiency. Therefore, under many circumstances, it is necessary to automate and standardize file management processes through SAS® programming. Compared with managing SAS files via PROC DATASETS, managing external files is a much more challenging task that requires advanced programming skills. This paper presents and discusses various methods and approaches to managing external files with SAS programming. The illustrated methods and skills have important applications in a wide variety of analytic fields.
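One of the building blocks for this kind of automation is the set of external-file functions available in the DATA step. A minimal sketch (the directory path and file name are placeholders, not examples from the paper): list the members of a directory with DOPEN/DNUM/DREAD, and delete a single file with FDELETE.

data file_list;
   length fname $256;
   rc  = filename('dir', 'C:\project\archive');   /* assign a fileref to the directory */
   did = dopen('dir');                            /* open the directory                */
   do i = 1 to dnum(did);                         /* loop over its members             */
      fname = dread(did, i);
      output;
   end;
   rc = dclose(did);
   keep fname;
run;

data _null_;                                      /* delete one external file          */
   rc = filename('f', 'C:\project\archive\old_log.txt');
   if fdelete('f') = 0 then put 'NOTE: file deleted.';
run;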
Read the paper (PDF).
Justin Jia, Trans Union
Amanda Lin, CIBC