SAS Global Forum 2015 Proceedings

Twitter is a powerful form of social media for sharing information about various issues and can be used to raise awareness and collect pointers about associated risk factors and preventive measures. Type-2 diabetes is a national problem in the US. We analyzed Twitter feeds about Type-2 diabetes in order to suggest a rational use of social media with respect to an assertive study of any ailment. To accomplish this task, 900 tweets were collected using Twitter API v1.1 in a Python script. Tweets, follower counts, and user information were extracted via the scripts. The tweets were segregated into different groups on the basis of their annotations related to risk factors, complications, preventions and precautions, and so on. We then used SAS^® Text Miner to analyze the data. We found that 70% of the tweets stated the status quo, based on marketing and awareness campaigns. The remaining 30% of tweets contained various key terms and labels associated with Type-2 diabetes. It was observed that influential users tweeted more about precautionary measures whereas non-influential people gave suggestions about treatments as well as preventions and precautions.

Read the paper (PDF).

Managing the large-scale displacement of people and communities caused by a natural disaster has historically been reactive rather than proactive. Following a disaster, data is collected to inform and prompt operational responses. In many countries prone to frequent natural disasters such as the Philippines, large amounts of longitudinal data are collected and available to apply to new disaster scenarios. However, because of the nature of natural disasters, it is difficult to analyze all of the data until long after the emergency has passed. For this reason, little research and analysis have been conducted to derive deeper analytical insight for proactive responses. This paper demonstrates the application of SAS^® analytics to this data and establishes predictive alternatives that can improve conventional storm responses. Humanitarian organizations can use this data to understand displacement patterns and trends and to optimize evacuation routing and planning. Identifying the main contributing factors and leading indicators for the displacement of communities in a timely and efficient manner prevents detrimental incidents at disaster evacuation sites. Using quantitative and qualitative methods, responding organizations can make data-driven decisions that innovate and improve approaches to managing disaster response on a global basis. Creating a data-driven analytical model can help reduce response time, improve the health and safety of displaced individuals, and optimize scarce resources in a more effective manner. The International Organization for Migration (IOM), an intergovernmental organization, is one of the first-response organizations on the ground that responds to most emergencies. IOM is the global co-lead for the Camp Coordination and Camp Management (CCCM) cluster in natural disasters.
This paper shows how to use SAS^® Visual Analytics and SAS^® Visual Statistics for the Philippines in response to Super Typhoon Haiyan in November 2013 to develop increasingly accurate models for better emergency preparedness. Using data collected from IOM's Displacement Tracking Matrix (DTM), the final analysis shows how to better coordinate service delivery to evacuation centers sheltering large numbers of displaced individuals, applying accurate hindsight to develop foresight on how to better respond to emergencies and disasters. Predictive models build on patterns found in historical and transactional data to identify risks and opportunities. The capacity to predict trends and behavior patterns related to displacement and mobility has the potential to enable the IOM to respond in a more timely and targeted manner. By predicting the locations of displacement, numbers of persons displaced, number of vulnerable groups, and sites at most risk of security incidents, humanitarians can respond quickly and more effectively with the appropriate resources (material and human) from the outset. The end analysis uses the SAS^® Storm Optimization model combined with human mobility algorithms to predict population movement.

Kaiser Permanente Northwest is contractually obligated for regulatory submissions to Oregon Health Authority, Health Share of Oregon, and Molina Healthcare in Washington. The submissions consist of Medicaid Encounter data for medical and pharmacy claims. SAS^® programs are used to extract claims data from Kaiser's claims data warehouse, process the data, and produce output files in HIPAA ASC X12 and NCPDP format. Prior to April 2014, programs were written in SAS^® 8.2 running on a VAX server. Several key drivers resulted in the conversion of the existing system to SAS^® Enterprise Guide^® 5.1 running on UNIX. These drivers were: the need to have a scalable system in preparation for the Affordable Care Act (ACA); performance issues with the existing system; incomplete process reporting and notification to business owners; and a highly manual, labor-intensive process of running individual programs. The upgraded system addressed these drivers. The estimated cost reduction was from $1.30 per reported encounter to $0.13 per encounter. The converted system provides for better preparedness for the ACA. One expected result of ACA is significant Medicaid membership growth. The program has already increased in size by 50% in the preceding 12 months. The updated system allows for the expected growth in membership.

Read the paper (PDF).

There is a plethora of uses of the colon (:) in SAS^® programming. The colon is used as a data or variable name wild-card, a macro variable creator, an operator modifier, and so forth. The colon helps you write clear, concise, and compact code. The main objective of this paper is to encourage the effective use of the colon in writing crisp code. This paper presents real-time applications of the colon in day-to-day programming. In addition, this paper presents cases where the colon limits programmers' wishes.

Read the paper (PDF).

Many organizations need to forecast large numbers of time series that are discretely valued. These series, called count series, fall approximately between continuously valued time series, for which there are many forecasting techniques (ARIMA, UCM, ESM, and others), and intermittent time series, for which there are a few forecasting techniques (Croston's method and others). This paper proposes a technique for large-scale automatic count series forecasting and uses SAS^® Forecast Server and SAS/ETS^® software to demonstrate this technique.
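
Croston's method, which the abstract names as one of the few techniques available for intermittent series, can be sketched in a few lines. The following is a minimal illustration in Python rather than SAS, and it is not the count series technique that the paper proposes. The function name `croston`, the smoothing weight `alpha`, and the sample demand history are assumptions made for the example.

```python
def croston(demand, alpha=0.1):
    """Croston's method: smooth non-zero demand sizes and the intervals
    between them separately, then forecast demand per period as
    size / interval. `alpha` is an illustrative smoothing weight."""
    size = None        # smoothed size of non-zero demands
    interval = None    # smoothed number of periods between non-zero demands
    periods_since = 1  # periods elapsed since the last non-zero demand
    for y in demand:
        if y > 0:
            if size is None:
                # Initialise both components from the first non-zero demand.
                size, interval = float(y), float(periods_since)
            else:
                size += alpha * (y - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return 0.0  # no demand observed at all
    return size / interval


# Hypothetical history: 3 units every third period.
history = [0, 0, 3, 0, 0, 3, 0, 0, 3]
forecast = croston(history)  # average demand per period
```

Because the sample history is perfectly regular, the smoothed size and interval both stay at 3 and the per-period forecast is 1 unit.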

Read the paper (PDF).

During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS^® Asset Performance Analytics and SAS^® Enterprise Miner™ to build a model for drilling and well control anomalies, to fingerprint key well control measures of the transient fluid properties, and to operationalize these analytics on the drilling assets with SAS^® event stream processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables the rapid differentiation between safe and unsafe modes of operation.

Read the paper (PDF).

Your electricity usage patterns reveal a lot about your family and routines. Information collected from electrical smart meters can be mined to identify patterns of behavior that can in turn be used to help change customer behavior for the purpose of altering system load profiles. Demand Response (DR) programs represent an effective way to cope with rising energy needs and increasing electricity costs. The Federal Energy Regulatory Commission (FERC) defines demand response as changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to lower electricity use at times of high wholesale market prices or when system reliability is jeopardized. In order to effectively motivate customers to voluntarily change their consumption patterns, it is important to identify customers whose load profiles are similar so that targeted incentives can be directed toward these customers. Hence, it is critical to use tools that can accurately cluster similar time series patterns while providing a means to profile these clusters. In order to solve this problem, though, hardware and software that is capable of storing, extracting, transforming, loading and analyzing large amounts of data must first be in place. Utilities receive customer data from smart meters, which track and store customer energy usage. The data collected is sent to the energy companies every fifteen minutes or hourly. With millions of meters deployed, this quantity of information creates a data deluge for utilities, because each customer generates about three thousand data points monthly, and more than thirty-six billion reads are collected annually for a million customers. The data scientist is the hunter, and DR candidate patterns are the prey in this cat-and-mouse game of finding customers willing to curtail electrical usage for a program benefit.
The data scientist must connect large siloed data sources, external data, and even unstructured data to detect common customer electrical usage patterns, build dependency models, and score them against their customer population. Taking advantage of Hadoop's ability to store and process data on commodity hardware with distributed parallel processing is a game changer. With Hadoop, no data set is too large, and SAS^® Visual Statistics leverages machine learning, artificial intelligence, and clustering techniques to build descriptive and predictive models. All data can be usable from disparate systems, including structured, unstructured, and log files. The data scientist can use Hadoop to ingest all available data at rest, and analyze customer usage patterns, system electrical flow data, and external data such as weather. This paper will use Cloudera Hadoop with Apache Hive queries for analysis on platforms such as SAS^® Visual Analytics and SAS Visual Statistics. The paper will showcase optionality within Hadoop for querying large data sets with open-source tools and importing these data into SAS^® for robust customer analytics, clustering customers by usage profiles, propensity to respond to a demand response event, and an electrical system analysis for Demand Response events.
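
The idea of clustering customers by usage profile can be illustrated with a small k-means sketch. This is a stand-in written in plain Python, not the SAS Visual Statistics workflow the paper describes. The four-period daily profiles, the function names, and the choice of the first k profiles as starting centroids are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, iterations=20):
    """Plain k-means. Centroids start at the first k profiles, a
    deterministic choice made only to keep the sketch reproducible."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical average kWh per customer for four day parts:
# night, morning, afternoon, evening.
profiles = [
    [0.2, 1.0, 0.4, 0.3],  # morning peak
    [0.3, 0.4, 0.5, 1.8],  # evening peak
    [0.1, 1.1, 0.5, 0.2],  # morning peak
    [0.2, 0.3, 0.6, 2.0],  # evening peak
]
labels, centroids = kmeans(profiles, k=2)
```

With this sample, the two morning-peak customers fall into one cluster and the two evening-peak customers into the other. Each centroid then serves as the cluster's typical load shape, which is the kind of profile a targeted incentive could be designed around.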

Read the paper (PDF).

SAS^® Visual Analytics opens up a world of intuitive interactions, providing report creators the ability to develop more efficient ways to deliver information. Business-related hierarchies can be defined dynamically in SAS Visual Analytics to group data more efficiently--no more going back to the developers. Visualizations can interact with each other, with other objects within other sections, and even with custom applications and SAS^® stored processes. This paper provides a blueprint to streamline and consolidate reporting efforts using these interactions available in SAS Visual Analytics. The goal of this methodology is to guide users down information pathways that can progressively subset data into smaller, more understandable chunks of data, while summarizing each layer to provide insight along the way. Ultimately the final destination of the information pathway holds a reasonable subset of data so that a user can take action and facilitate an understood outcome.

Read the paper (PDF).

Automated decision-making systems are now found everywhere, from your bank to your government to your home. For example, when you inquire for a loan through a website, a complex decision process likely runs combinations of statistical models and business rules to make sure you are offered a set of options for tantalizing terms and conditions. To make that happen, analysts diligently strive to encode their complex business logic into these systems. But how do you know if you are making the best possible decisions? How do you know if your decisions conform to your business constraints? For example, you might want to maximize the number of loans that you provide while balancing the risk among different customer categories. Welcome to the world of optimization. SAS^® Business Rules Manager and SAS/OR^® software can be used together to manage and optimize decisions. This presentation demonstrates how to build business rules and then optimize the rule parameters to maximize the effectiveness of those rules. The end result is more confidence that you are delivering an effective decision-making process.

Read the paper (PDF). | Download the data file (ZIP).

A utility's meter data is a valuable asset that can be daunting to leverage. Consider that one household or premise can produce over 35,000 rows of information, consisting of over 8 MB of data per year. Thirty thousand meters collecting fifteen-minute-interval data with forty variables equates to 1.2 billion rows of data. Using SAS^® Visual Analytics, we provide examples of leveraging smart meter data to address business needs around revenue protection, meter operations, and customer analysis. Key analyses include identifying consumption on inactive meters, potential energy theft, and stopped or slowing meters; and support of all customer classes (for example, residential, small commercial, and industrial) and their data with different time intervals and frequencies.

Read the paper (PDF).

Differential item functioning (DIF), as an assessment tool, has been widely used in quantitative psychology, educational measurement, business management, insurance, and health care. The purpose of DIF analysis is to detect response differences of items in questionnaires, rating scales, or tests across different subgroups (for example, gender) and to ensure the fairness and validity of each item for those subgroups. The goal of this paper is to demonstrate several ways to conduct DIF analysis by using different SAS^® procedures (PROC FREQ, PROC LOGISTIC, PROC GENMOD, PROC GLIMMIX, and PROC NLMIXED) and their applications. There are three general methods to examine DIF: generalized Mantel-Haenszel (MH), logistic regression, and item response theory (IRT). The SAS^® System provides flexible procedures for all these approaches. There are two types of DIF: uniform DIF, which remains consistent across ability levels, and non-uniform DIF, which varies across ability levels. Generalized MH is a nonparametric method and is often used to detect uniform DIF while the other two are parametric methods and examine both uniform and non-uniform DIF. In this study, I first describe the underlying theories and mathematical formulations for each method. Then I show the SAS statements, input data format, and SAS output for each method, followed by a detailed demonstration of the differences among the three methods. Specifically, PROC FREQ is used to calculate generalized MH only for dichotomous items. PROC LOGISTIC and PROC GENMOD are used to detect DIF by using logistic regression. PROC NLMIXED and PROC GLIMMIX are used to examine DIF by applying an exploratory item response theory model. Finally, I use SAS/IML^® to call two R packages (that is, difR and lordif) to conduct DIF analysis and then compare the results between SAS procedures and R packages. An example data set, the Verbal Aggression assessment, which includes 316 subjects and 24 items, is used in this study.
Following the general DIF analysis, the male group is used as the reference group, and the female group is used as the focal group. All the analyses are conducted by SAS^® 9.3 and R 2.15.3. The paper closes with the conclusion that the SAS System provides different flexible and efficient ways to conduct DIF analysis. However, it is essential for SAS users to understand the underlying theories and assumptions of different DIF methods and apply them appropriately in their DIF analyses.
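
For a dichotomous item, the Mantel-Haenszel approach reduces to a common odds ratio pooled over ability strata. The sketch below shows that calculation in Python rather than PROC FREQ, and it covers only the 2x2 case, not the generalized MH statistic. The function name and the counts are hypothetical. Each stratum is a tuple (reference correct, reference incorrect, focal correct, focal incorrect).

```python
def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across 2x2 tables.

    Each stratum is (a, b, c, d):
      a = reference group, item correct    b = reference group, incorrect
      c = focal group, item correct        d = focal group, incorrect
    A value near 1 suggests no uniform DIF; values above 1 favour the
    reference group.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty ability stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts for one item in two ability strata
# (for example, low and high total score).
strata = [
    (30, 10, 20, 20),
    (40, 10, 30, 20),
]
odds_ratio = mh_common_odds_ratio(strata)  # 15.5 / 5.5, about 2.82
```

In this made-up example the reference group has higher odds of a correct response in both strata, so the pooled ratio is well above 1, the pattern expected under uniform DIF.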

Read the paper (PDF).

Walt Disney World Resort is home to four theme parks, two water parks, five golf courses, 26 owned-and-operated resorts, and hundreds of merchandise and dining experiences. Every year millions of guests stay at Disney resorts to enjoy the Disney Experience. Assigning physical rooms to resort and hotel reservations is a key component to maximizing operational efficiency and guest satisfaction. Solutions can range from automation to optimization programs. The volume of reservations and the variety and uniqueness of guest preferences across the Walt Disney World Resort campus pose an opportunity to solve a number of reasonably difficult room assignment problems by leveraging operations research techniques. For example, a guest might prefer a room with specific bedding and adjacent to certain facilities or amenities. When large groups, families, and friends travel together, they often want to stay near each other using specific room configurations. Rooms might be assigned to reservations in advance and upon request at check-in. Using mathematical programming techniques, the Disney Decision Science team has partnered with the SAS^® Advanced Analytics R&D team to create a room assignment optimization model prototype and implement it in SAS/OR^®. We describe how this collaborative effort has progressed over the course of several months, discuss some of the approaches that have proven to be productive for modeling and solving this problem, and review selected results.

Inpatient treatment is the most common type of treatment ordered for patients who have a serious ailment and need immediate attention. Using a data set about diabetes patients downloaded from the UCI Network Data Repository, we built a model to predict the probability that the patient will be rehospitalized within 30 days of discharge. The data has about 100,000 rows and 51 columns. In our preliminary analysis, a neural network turned out to be the best model, followed closely by the decision tree model and regression model.

Utility companies in America are always challenged when it comes to knowing when their infrastructure fails. One of the most critical components of a utility company's infrastructure is the transformer. It is important to assess the remaining lifetime of transformers so that the company can reduce costs, plan expenditures in advance, and largely mitigate the risk of failure. It is also equally important to identify the high-risk transformers in advance and to maintain them accordingly in order to avoid sudden loss of equipment due to overloading. This paper uses SAS^® to predict the lifetime of transformers, identify the various factors that contribute to their failure, and model the transformer into High, Medium, and Low risk categories based on load for easy maintenance. The data set from a utility company contains around 18,000 observations and 26 variables from 2006 to 2013, and contains the failure and installation dates of the transformers. The data set comprises many transformers that were installed before 2006 (there are 190,000 transformers on which several regression models are built in this paper to identify their risk of failure), but there is no age-related parameter for them. Survival analysis was performed on this left-truncated and right-censored data. The data set has variables such as Age, Average Temperature, Average Load, and Normal and Overloaded Conditions for residential and commercial transformers. Data creation involved merging 12 different tables. Nonparametric models for failure time data were built so as to explore the lifetime and failure rate of the transformers. By building a Cox's regression model, the important factors contributing to the failure of a transformer are also analyzed in this paper. Several risk-based models are then built to categorize transformers into High, Medium, and Low risk categories based on their loads.
This categorization can help the utility companies to better manage the risks associated with transformer failures.
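
A nonparametric survival estimate on left-truncated, right-censored data can be sketched with the product-limit (Kaplan-Meier) estimator using delayed entry. The example below is an illustration in Python, not the paper's SAS code. The function name and the transformer records are hypothetical. Each record holds the age at which observation began, the age at failure or censoring, and a failure flag.

```python
def km_left_truncated(records):
    """Product-limit survival estimate with delayed entry.

    Each record is (entry_age, exit_age, failed). A unit is at risk at
    age t only if it was already under observation and had not yet
    left: entry_age < t <= exit_age.
    Returns a list of (age, estimated survival probability).
    """
    failure_ages = sorted({exit_age for _, exit_age, failed in records if failed})
    survival = 1.0
    curve = []
    for t in failure_ages:
        at_risk = sum(1 for entry, exit_age, _ in records if entry < t <= exit_age)
        failures = sum(
            1 for _, exit_age, failed in records if failed and exit_age == t
        )
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical transformers: ages in years.
records = [
    (0, 5, True),    # observed from installation, failed at age 5
    (0, 8, False),   # observed from installation, censored at age 8
    (3, 10, True),   # entered the study at age 3, failed at age 10
    (6, 12, True),   # entered at age 6, failed at age 12
    (6, 15, False),  # entered at age 6, still working at age 15
]
curve = km_left_truncated(records)
```

Units that entered observation late are excluded from the risk set at earlier ages, which is what distinguishes this from the ordinary Kaplan-Meier calculation. Here three units are at risk at age 5, three at age 10, and two at age 12, giving survival estimates of 2/3, 4/9, and 2/9.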

Read the paper (PDF).

This paper presents a methodology developed to define and prioritize feeders with the least satisfactory performances for continuity of energy supply, in order to obtain an efficiency ranking that supports a decision-making process regarding investments to be implemented. Data Envelopment Analysis (DEA) was the basis for the development of this methodology, in which the input-oriented model with variable returns to scale was adopted. To perform the analysis of the feeders, data from the utility geographic information system (GIS) and from the interruption control system was exported to SAS^® Enterprise Guide^®, where data manipulation was possible. Different continuity variables and physical-electrical parameters were consolidated for each feeder for the years 2011 to 2013. They were separated according to the geographical regions of the concession area, according to their location (urban or rural), and then grouped by physical similarity. Results showed that 56.8% of the feeders could be considered as efficient, based on the continuity of the service. Furthermore, the results enable identification of the assets with the most critical performance and their benchmarks, and the definition of preliminary goals to reach efficiency.

Read the paper (PDF).

In today's omni-channel world, consumers expect retailers to deliver the product they want, where they want it, when they want it, at a price they accept. A major challenge many retailers face in delighting their customers is successfully predicting consumer demand. Business decisions across the enterprise are affected by these demand estimates. Forecasts used to inform high-level strategic planning, merchandising decisions (planning assortments, buying products, pricing, and allocating and replenishing inventory) and operational execution (labor planning) are similar in many respects. However, each business process requires careful consideration of specific input data, modeling strategies, output requirements, and success metrics. In this session, learn how leading retailers are increasing sales and profitability by operationalizing forecasts that improve decisions across their enterprise.

Read the paper (PDF).

This presentation details the steps involved in using SAS^® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.

Read the paper (PDF).

How does historical production data relate a story about subsurface oil and gas reservoirs? Business and domain experts must perform accurate analysis of reservoir behavior using only rate and pressure data as a function of time. This paper introduces innovative data-driven methodologies to forecast oil and gas production in unconventional reservoirs that, owing to the nature of the tightness of the rocks, render the empirical functions less effective and accurate. You learn how implementations of the SAS^® MODEL procedure provide functional algorithms that generate data-driven type curves on historical production data. Reservoir engineers can now gain more insight into the future performance of the wells across their assets. SAS enables a more robust forecast of the hydrocarbons in both an ad hoc individual well interaction and in an automated batch mode across the entire portfolio of wells. Examples of the MODEL procedure arising in subsurface production data analysis are discussed, including the Duong data model and the stretched exponential data model. In addressing these examples, techniques for pattern recognition and for implementing TREE, CLUSTER, and DISTANCE procedures in SAS/STAT^® are highlighted to explicate the importance of oil and gas well profiling to characterize the reservoir. The MODEL procedure analyzes models in which the relationships among the variables comprise a system of one or more nonlinear equations. Primary uses of the MODEL procedure are estimation, simulation, and forecasting of nonlinear simultaneous equation models, and generating type curves that fit the historical rate production data. You will walk through several advanced analytical methodologies that implement the SEMMA process to enable hypothesis testing as well as directed and undirected data mining techniques.
SAS^® Visual Analytics Explorer drives the exploratory data analysis to surface trends and relationships, and the data QC workflows ensure a robust input space for the performance forecasting methodologies that are visualized in a web-based thin client for interactive interpretation by reservoir engineers.
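
The stretched exponential decline model mentioned above is commonly written as q(t) = q_i * exp(-(t / tau)^n). The sketch below evaluates that form and fits tau and n by a coarse grid search in Python. It is an illustration only and not the MODEL procedure estimation described in the paper. The function names, parameter grids, and synthetic production history are assumptions made for the example.

```python
import math


def stretched_exponential(t, q_i, tau, n):
    """Production rate at time t: q_i * exp(-(t / tau) ** n)."""
    return q_i * math.exp(-((t / tau) ** n))


def fit_grid(times, rates, q_i, tau_grid, n_grid):
    """Choose the (tau, n) pair on a coarse grid that minimises the sum
    of squared errors. A real fit would use nonlinear least squares;
    the grid keeps this sketch dependency-free."""
    best = None
    for tau in tau_grid:
        for n in n_grid:
            sse = sum(
                (q - stretched_exponential(t, q_i, tau, n)) ** 2
                for t, q in zip(times, rates)
            )
            if best is None or sse < best[0]:
                best = (sse, tau, n)
    return best[1], best[2]


# Synthetic monthly history generated from known, made-up parameters.
months = list(range(0, 36))
true_q_i, true_tau, true_n = 1000.0, 20.0, 0.5
history = [stretched_exponential(t, true_q_i, true_tau, true_n) for t in months]

tau_hat, n_hat = fit_grid(
    months, history, true_q_i,
    tau_grid=[10.0, 20.0, 30.0],
    n_grid=[0.3, 0.5, 0.7],
)
```

Because the synthetic history is noise-free and the true parameters lie on the grid, the search recovers tau = 20 and n = 0.5. With field data, the fitted curve would then be extrapolated to forecast future rates for the well.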

Read the paper (PDF).

A Chinese wind energy company designs several hundred wind farms each year. An important step in its design process is micrositing, in which it creates a layout of turbines for a wind farm. The amount of energy that a wind farm generates is affected by geographical factors (such as elevation of the farm), wind speed, and wind direction. The types of turbines and their positions relative to each other also play a critical role in energy production. Currently the company is using an open-source software package to help with its micrositing. As the size of wind farms increases and the pace of their construction speeds up, the open-source software is no longer able to support the design requirements. The company wants to work with a commercial software vendor that can help resolve scalability and performance issues. This paper describes the use of the OPTMODEL and OPTLSO procedures on the SAS^® High-Performance Analytics infrastructure together with the FCMP procedure to model and solve this highly nonlinear optimization problem. Experimental results show that the proposed solution can meet the company's requirements for scalability and performance.