SAS Global Forum 2015 Proceedings

SAS^® has been an early leader in big data technology architecture that more easily integrates unstructured files across multi-tier data system platforms. By using SAS^® Data Integration Studio and SAS^® Enterprise Business Intelligence software, you can easily automate big data using SAS^® system accommodations for Hadoop open-source standards. At the same time, another seminal technology has emerged, which involves real-time multi-sensor data integration using Arduino microprocessors. This break-out session demonstrates the use of SAS^® 9.4 coding to define Hadoop clusters and to automate Arduino data acquisition to convert custom unstructured log files into structured tables, which can be analyzed by SAS in near real time. Examples include the use of SAS Data Integration Studio to create and automate stored processes, as well as tips for C language object coding to integrate to SAS data management, with a simple temperature monitoring application for Hadoop to Arduino using SAS.

Your electricity usage patterns reveal a lot about your family and routines. Information collected from electrical smart meters can be mined to identify patterns of behavior that can in turn be used to help change customer behavior for the purpose of altering system load profiles. Demand Response (DR) programs represent an effective way to cope with rising energy needs and increasing electricity costs. The Federal Energy Regulatory Commission (FERC) defines demand response as changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to lower electricity use at times of high wholesale market prices or when system reliability of jeopardized. In order to effectively motivate customers to voluntarily change their consumptions patterns, it is important to identify customers whose load profiles are similar so that targeted incentives can be directed toward these customers. Hence, it is critical to use tools that can accurately cluster similar time series patterns while providing a means to profile these clusters. In order to solve this problem, though, hardware and software that is capable of storing, extracting, transforming, loading and analyzing large amounts of data must first be in place. Utilities receive customer data from smart meters, which track and store customer energy usage. The data collected is sent to the energy companies every fifteen minutes or hourly. With millions of meters deployed, this quantity of information creates a data deluge for utilities, because each customer generates about three thousand data points monthly, and more than thirty-six billion reads are collected annually for a million customers. The data scientist is the hunter, and DR candidate patterns are the prey in this cat-and-mouse game of finding customers willing to curtail electrical usage for a program benefit. The data scientist must connect large siloed data sources, external data , and even unstructured data to detect common customer electrical usage patterns, build dependency models, and score them against their customer population. Taking advantage of Hadoop's ability to store and process data on commodity hardware with distributed parallel processing is a game changer. With Hadoop, no data set is too large, and SAS^® Visual Statistics leverages machine learning, artificial intelligence, and clustering techniques to build descriptive and predictive models. All data can be usable from disparate systems, including structured, unstructured, and log files. The data scientist can use Hadoop to ingest all available data at rest, and analyze customer usage patterns, system electrical flow data, and external data such as weather. This paper will use Cloudera Hadoop with Apache Hive queries for analysis on platforms such as SAS^® Visual Analytics and SAS Visual Statistics. The paper will showcase optionality within Hadoop for querying large data sets with open-source tools and importing these data into SAS^® for robust customer analytics, clustering customers by usage profiles, propensity to respond to a demand response event, and an electrical system analysis for Demand Response events.

Read the paper (PDF).

Predictions, including regressions and classifications, are the predominant focus of many statistical and machine-learning models. However, in the era of big data, a predictive modeling process contains more than just making the final predictions. For example, a large collection of data often represents a set of small, heterogeneous populations. Identification of these sub groups is therefore an important step in predictive modeling. In addition, big data data sets are often complex, exhibiting high dimensionality. Consequently, variable selection, transformation, and outlier detection are integral steps. This paper provides working examples of these critical stages using SAS^® Visual Statistics, including data segmentation (supervised and unsupervised), variable transformation, outlier detection, and filtering, in addition to building the final predictive model using methodology such as linear regressions, decision trees, and logistic regressions. The illustration data was collected from 2010 to 2014, from vehicle emission testing results.

Read the paper (PDF).

A leading killer in the United States is smoking. Moreover, over 8.6 million Americans live with a serious illness caused by smoking or second-hand smoking. Despite this, over 46.6 million U.S. adults smoke tobacco, cigars, and pipes. The key analytic question in this paper is, How would e-cigarettes affect this public health situation? Can monitoring public opinions of e-cigarettes using SAS^® Text Analytics and SAS^® Visual Analytics help provide insight into the potential dangers of these new products? Are e-cigarettes an example of Big Tobacco up to its old tricks or, in fact, a cessation product? The research in this paper was conducted on thousands of tweets from April to August 2014. It includes API sources beyond Twitter--for example, indicators from the Health Indicators Warehouse (HIW) of the Centers for Disease Control and Prevention (CDC)--that were used to enrich Twitter data in order to implement a surveillance system developed by SAS^® for the CDC. The analysis is especially important to The Office of Smoking and Health (OSH) at the CDC, which is responsible for tobacco control initiatives that help states to promote cessation and prevent initiation in young people. To help the CDC succeed with these initiatives, the surveillance system also: 1) automates the acquisition of data, especially tweets; and 2) applies text analytics to categorize these tweets using a taxonomy that provides the CDC with insights into a variety of relevant subjects. Twitter text data can help the CDC look at the public response to the use of e-cigarettes, and examine general discussions regarding smoking and public health, and potential controversies (involving tobacco exposure to children, increasing government regulations, and so on). SAS^® Content Categorization helps health care analysts review large volumes of unstructured data by categorizing tweets in order to monitor and follow what people are saying and why they are saying it. Ultimatel y, it is a solution intended to help the CDC monitor the public's perception of the dangers of smoking and e-cigarettes, in addition, it can identify areas where OSH can focus its attention in order to fulfill its mission and track the success of CDC health initiatives.

Read the paper (PDF).

The SAS^® LASR™ Analytic Server acts as a back-end, in-memory analytics engine for solutions such as SAS^® Visual Analytics and SAS^® Visual Statistics. It is designed to exist in a massively scalable, distributed environment, often alongside Hadoop. This paper guides you through the impacts of the architecture decisions shared by both software applications and what they specifically mean for SAS^®. We then present positive actions you can take to rebound from unexpected outages and resume efficient operations.

Read the paper (PDF).

So you have big data and need to know how to quickly and efficiently keep your data up-to-date and available in SAS^® Visual Analytics? One of the challenges that customers often face is how to regularly update data tables in the SAS^® LASR™ Analytic Server, the in-memory analytical platform for SAS Visual Analytics. Is appending data always the right answer? What are some of the key things to consider when automating a data update and load process? Based on proven best practices and existing customer implementations, this paper provides you with answers to those questions and more, enabling you to optimize your update and data load processes. This ensures that your organization develops an effective and robust data refresh strategy.

Read the paper (PDF).

Managing and organizing external files and directories play an important part in our data analysis and business analytics work. A good file management system can streamline project management and file organizations and significantly improve work efficiency . Therefore, under many circumstances, it is necessary to automate and standardize the file management processes through SAS^® programming. Compared with managing SAS files via PROC DATASETS, managing external files is a much more challenging task, which requires advanced programming skills. This paper presents and discusses various methods and approaches to managing external files with SAS programming. The illustrated methods and skills can have important applications in a wide variety of analytic work fields.