SAS® Text Analytics Papers A-Z

A
Paper 1288-2014:
Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining
The proliferation of textual data in business is overwhelming. Unstructured textual data is constantly being generated via call center logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, and so on. While the amount of textual data is increasing rapidly, businesses' ability to summarize, understand, and make sense of such data for better decision making remains a challenge. This presentation takes a quick look at how to organize and analyze textual data in order to extract insightful customer intelligence from a large collection of documents and to use that information to improve business operations and performance. Multiple case studies using real data demonstrate applications of text analytics and sentiment mining with SAS® Text Miner and SAS® Sentiment Analysis Studio. While SAS® products are used for demonstration only, the topics and theories covered are generic (not tool specific).
Goutam Chakraborty, Oklahoma State University
Murali Pagolu, SAS
Paper 1775-2014:
Application of Text Mining in Tweets Using SAS® and R, and Analysis of Change in Sentiments toward Xbox One Using SAS® Sentiment Analysis Studio
The power of social media has increased to such an extent that businesses that fail to monitor consumer responses on social networking sites are now clearly at a disadvantage. In this paper, we aim to provide some insights on how Microsoft's Digital Rights Management (DRM) policies and the release of Xbox One affected customers' reactions. We conducted preliminary research to compare the basic text mining capabilities of SAS® and R, two very diverse yet powerful tools. A total of 6,500 Tweets were collected to analyze the impact of Microsoft's DRM policies. The Tweets were segmented into three groups based on date (illustrated in the sketch after this entry): before Microsoft announced its Xbox One policies (May 18 to May 26), after the policies were announced (May 27 to June 16), and after changes were made to the announced policies (June 16 to July 1). Our results suggest that SAS works better than R for extensive analysis of textual data. In follow-up work, customers' reactions to the release of Xbox One will be analyzed using SAS® Sentiment Analysis Studio. We will collect Tweets on Xbox posted before and after Microsoft's release of Xbox One, divided into two categories: Tweets posted between November 15 and November 21 and those posted between November 22 and November 29. Sentiment analysis will then be performed on these Tweets, and the results will be compared between the two categories.
Aditya Datla, Oklahoma State University
Reshma Palangat, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
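The date-based segmentation described above can be conveyed with a minimal, hypothetical Python sketch. The study itself uses SAS® and R, so this is only an illustration; the file name, column names, and the assumed year (2013) are placeholders, not details from the paper.

# Illustrative sketch only: the paper's analysis is done in SAS and R, not in this code.
# Assumes a CSV of tweets with hypothetical columns "created_at" and "sentiment_score".
import pandas as pd

tweets = pd.read_csv("xbox_tweets.csv", parse_dates=["created_at"])

# Segment tweets into the three DRM-policy periods described in the abstract (year assumed).
periods = {
    "before_announcement": ("2013-05-18", "2013-05-26"),
    "after_announcement":  ("2013-05-27", "2013-06-16"),
    "after_policy_change": ("2013-06-16", "2013-07-01"),
}

for name, (start, end) in periods.items():
    segment = tweets.loc[tweets["created_at"].between(start, end)]
    # Compare tweet volume and average sentiment across the three periods.
    print(name, len(segment), round(segment["sentiment_score"].mean(), 3))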
Paper 1545-2014:
Association Mining of Brain Data: An EEG Study
Many neuroscience researchers have explored how various parts of the brain are connected, but no one has performed association mining using brain data. In this study, we used SAS® Enterprise Miner 7.1 for association mining of brain data collected by a 14-channel EEG device. We present an application of the association mining technique in this novel context of brain activity and link our results to theories of cognitive neuroscience. The brain waves were collected while a user processed information about Facebook, the most well-known social networking site. The data was cleaned using Independent Component Analysis via an open-source MATLAB package. Next, by applying the LORETA algorithm, activations at every fraction of a second were recorded. The data was codified into transactions to perform association mining (illustrated in the sketch after this entry). Results showing how various parts of the brain become activated while processing the information are reported. This study provides preliminary insights into how brain wave data can be analyzed with widely available data mining techniques to enhance researchers' understanding of brain activation patterns.
Pankush Kalgotra, Oklahoma State University
Ramesh Sharda, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
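As a rough illustration of codifying activations into transactions, the following hypothetical Python sketch hand-computes support and confidence for simple region-to-region rules. The study itself performs association mining in SAS® Enterprise Miner; the region names, transactions, and thresholds below are invented.

# Illustrative sketch only: the study used SAS Enterprise Miner, not this code.
from itertools import permutations

# Each transaction = the set of brain regions active in one time window (made-up data).
transactions = [
    {"frontal", "parietal"},
    {"frontal", "occipital"},
    {"frontal", "parietal", "temporal"},
    {"parietal", "temporal"},
]

n = len(transactions)
regions = set().union(*transactions)

for a, b in permutations(sorted(regions), 2):
    support_a = sum(a in t for t in transactions) / n
    support_ab = sum(a in t and b in t for t in transactions) / n
    if support_a == 0 or support_ab == 0:
        continue
    confidence = support_ab / support_a
    if support_ab >= 0.5 and confidence >= 0.6:  # arbitrary thresholds for the example
        print(f"{a} -> {b}: support={support_ab:.2f}, confidence={confidence:.2f}")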
Paper 1746-2014:
Automatic Detection of Section Membership for SAS® Conference Paper Abstract Submissions: A Case Study
Do you have an abstract for an idea that you want to submit as a proposal to SAS® conferences, but you are not sure which section is the most appropriate one? In this paper, we discuss a methodology for automatically identifying the most suitable section or sections for your proposal content. We use SAS® Text Miner 12.1 and SAS® Content Categorization Studio 12.1 to develop a rule-based categorization model. This model is used to automatically score your paper abstract and identify the most relevant conference sections to submit to for a better chance of acceptance (a toy illustration of rule-based scoring follows this entry).
Goutam Chakraborty, Oklahoma State University
Murali Pagolu, SAS
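The rule-based scoring idea can be conveyed with a toy Python sketch. The paper's actual model is built in SAS® Text Miner and SAS® Content Categorization Studio; the section names and keyword rules below are invented placeholders.

# Illustrative sketch only: a toy keyword scorer conveying the idea of rule-based
# section scoring, not the SAS Content Categorization Studio model from the paper.
import re

# Hypothetical keyword rules per conference section.
section_rules = {
    "Data Mining and Text Analytics": ["text miner", "sentiment", "cluster", "predictive"],
    "Reporting and Visualization":    ["visual analytics", "dashboard", "report", "graph"],
    "Data Management":                ["etl", "data quality", "hadoop", "integration"],
}

def score_abstract(abstract):
    text = abstract.lower()
    scores = {}
    for section, keywords in section_rules.items():
        scores[section] = sum(len(re.findall(re.escape(kw), text)) for kw in keywords)
    # Rank sections by how many rule keywords the abstract matches.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

abstract = "We use SAS Text Miner to cluster documents and build a predictive model."
for section, score in score_abstract(abstract):
    print(section, score)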
B
Paper SAS347-2014:
Big Data Everywhere! Easily Loading and Managing Your Data in the SAS® LASR™ Analytic Server
SAS® Visual Analytics and the SAS® LASR™ Analytic Server provide many capabilities for analyzing data fast. Depending on your organization, data can be loaded as a self-service operation, or it can be large and shared with many people. As data gets large, loading it effectively and keeping it updated become important. This presentation discusses the range of data scenarios, from self-service spreadsheets to very large databases, from single-subject data to large star schema topologies, and from single-use data to continually updated data that requires high levels of resilience and monitoring. Fast and easy access to big data is important to empower your organization to make better business decisions. Understanding how to build a responsive and reliable data tier on which to make these decisions is within your reach.
Gary Mehler, SAS
Donna Bennett, SAS
C
Paper 1686-2014:
Customer Perception and Reality: Unraveling the Energy Customer Equation
Energy companies that operate in a highly regulated environment and are constrained in pricing flexibility must employ a multitude of approaches to maintain high levels of customer satisfaction. Many investor-owned utilities are just starting to embrace a customer-centric business model to improve the customer experience and hold the line on costs while operating in an inflationary business setting. Faced with these challenges, it is natural for utility executives to ask: 'What drives customer satisfaction, and what is the optimum balance between influencing customer perceptions and improving actual process performance in order to be viewed as a top-tier performer by our customers?' J.D. Power, for example, cites power quality and reliability as the top influencer of overall customer satisfaction. But studies have also shown that customer perceptions of reliability do not always match actual reliability experience. This apparent gap between actual and perceived performance raises a conundrum: Should the utility focus its efforts and resources on improving actual reliability performance, or would it be better to concentrate on influencing customer perceptions of reliability? How can this conundrum be unraveled with an analytically driven approach? In this paper, we explore how design of experiments techniques can be employed to help understand the relationship between process performance and customer perception, thereby leading to important insights into the energy customer equation and higher customer satisfaction.
Mark Konya, Ameren Missouri
Kathy Ball, SAS
E
Paper SAS165-2014:
Extracting Key Concepts from Unstructured Medical Reports Using SAS® Text Analytics and SAS® Visual Analytics
The growing adoption of electronic systems for keeping medical records provides an opportunity for health care practitioners and biomedical researchers to access traditionally unstructured data in a new and exciting way. Pathology reports, progress notes, and many other sections of the patient record that are typically written in a narrative format can now be analyzed by employing natural language processing contextual extraction techniques to identify specific concepts contained within the text. Linking these concepts to a standardized nomenclature (for example, SNOMED CT, ICD-9, ICD-10, and so on) frees analysts to explore and test hypotheses using these observational data. Using SAS® software, we have developed a solution to extract data from the unstructured text found in medical pathology reports, link the extracted terms to biomedical ontologies (loosely illustrated in the sketch after this entry), join the output with more structured patient data, and view the results in reports and graphical visualizations. At its foundation, this solution employs SAS® Enterprise Content Categorization to perform entity extraction using both manually and automatically generated concept definition rules. Concept definition rules are automatically created using technology developed by SAS, and the unstructured reports are scored using the DS2/SAS® Content Categorization API. Results are post-processed and added to tables compatible with SAS® Visual Analytics, enabling users to visualize and explore the data as required. We illustrate the interrelated components of this solution with examples of appropriate use cases and describe manual validation of performance and reliability with metrics such as precision and recall. We also provide examples of reports and visualizations created with SAS Visual Analytics.
Greg Massey, SAS
Radhikha Myneni, SAS
Adrian Mattocks, SAS
Eric Brinsfield, SAS
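The term-to-ontology linking step can be sketched, very loosely, in Python. The solution described above relies on SAS® Enterprise Content Categorization; the dictionary below uses placeholder codes rather than real SNOMED CT or ICD entries.

# Illustrative sketch only: the paper uses SAS Enterprise Content Categorization;
# this shows the general idea of linking extracted terms to an ontology lookup.
# The codes below are placeholders, NOT real SNOMED CT or ICD codes.
ontology_lookup = {
    "adenocarcinoma": ("PLACEHOLDER-0001", "SNOMED-like concept"),
    "margin":         ("PLACEHOLDER-0002", "SNOMED-like concept"),
    "metastasis":     ("PLACEHOLDER-0003", "SNOMED-like concept"),
}

def link_terms(report_text):
    """Return (term, code, source) triples for dictionary terms found in the report."""
    text = report_text.lower()
    return [(term, code, source)
            for term, (code, source) in ontology_lookup.items()
            if term in text]

report = "Invasive adenocarcinoma identified; surgical margin is negative."
for term, code, source in link_terms(report):
    print(f"{term} -> {code} ({source})")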
F
Paper 1448-2014:
From Providing Support to Driving Decisions: Improving the Value of Institutional Research
For almost two decades, Western Kentucky University's Office of Institutional Research (WKU-IR) has used SAS® to help shape the future of the institution by providing faculty and administrators with information they can use to make a difference in the lives of their students. This presentation provides specific examples of how WKU-IR has shaped the policies and practices of our institution and discusses how WKU-IR moved from a support unit to a key strategic partner. In addition, the presentation covers the following topics: how the WKU Office of Institutional Research developed over time; why WKU abandoned reactive reporting for a more accurate, convenient system using SAS® Enterprise Intelligence Suite for Education; how WKU shifted from investigating what happened to predicting outcomes using SAS® Enterprise Miner and SAS® Text Miner; how the office keeps the system relevant and utilized by key decision makers; and what the office has accomplished and its key plans for the future.
Tuesdi Helbig, Western Kentucky University
Gina Huff, Western Kentucky University
H
Paper 1486-2014:
How to Be A Data Scientist Using SAS®
The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.
Chuck Kincaid, Experis Business Analytics
Paper 1729-2014:
How to Interpret SVD Units in Predictive Models?
Recent studies suggest that unstructured data, such as customer comments or feedback, can enhance the power of existing predictive models. SAS® Text Miner can generate singular value decomposition (SVD) units from text documents, which are a vectorial representation of the terms in those documents. These SVD units, when used as additional inputs along with the existing structured input variables, often prove to capture the response better. However, SVD units are essentially black-box variables that are not easy to interpret or explain, which is a big hindrance when trying to win over an organization's decision makers and incorporate these derived textual components into the models. In this paper, we demonstrate a new and powerful feature in SAS® Text Miner 12.1 that helps explain the SVD units, or text cluster components (the underlying SVD representation is sketched after this entry), and we discuss two methods that are useful for interpreting them. For this purpose, we used data from a television network company that has transcripts of its call center notes from three prior calls of each customer. We are able to extract the key terms from the call center notes in the form of Boolean rules that contribute to the prediction of customer churn. These rules provide an intuitive sense of which sets of terms, when occurring in either the presence or absence of other terms in the call center notes, might lead to churn. They also provide insight into which customers are at greater risk of churning from the company's services and, more importantly, why.
Murali Pagolu, SAS
Goutam Chakraborty, Oklahoma State University
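A minimal scikit-learn sketch of the underlying idea, not the SAS® Text Miner implementation: documents are turned into a weighted term matrix and reduced with a truncated SVD, producing the kind of dense numeric components that can join structured inputs in a predictive model. The call center notes below are invented.

# Illustrative sketch only: SAS Text Miner generates SVD units internally; this
# scikit-learn snippet just shows the general term-matrix -> truncated-SVD idea.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

notes = [
    "customer upset about billing error and late fee",
    "customer asked about upgrading sports package",
    "repeated outage complaints, customer threatening to cancel",
]

# Weighted term-by-document representation of the call center notes.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(notes)

# Reduce the sparse term space to a few dense components (the "SVD units").
svd = TruncatedSVD(n_components=2, random_state=0)
svd_units = svd.fit_transform(X)

# These dense columns could then join structured inputs in a churn model.
print(svd_units.round(3))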
I
Paper 1828-2014:
Integrated Big Data: Hadoop + DBMS + Discovery for SAS® High-Performance Analytics
SAS® High-Performance Analytics is a significant step forward in the area of high-speed analytic processing in a scalable clustered environment. However, big data problems generally come with data from many data sources at varying levels of maturity. Teradata's innovative Unified Data Architecture (UDA) represents a significant improvement in the way that large companies can think about enterprise data management, combining the Teradata Database, Hortonworks Hadoop, and the Aster Data Discovery platform in a seamless integrated platform. Together, these platforms provide business users, analysts, and data scientists with data management platforms ideally suited to their analytic needs, targeted to specific analytic use cases and managed in a single integrated enterprise data management environment. The paper will focus on how several companies today are using Teradata's integrated hardware and software UDA platform to manage a single enterprise analytic environment, fight the ongoing proliferation of analytic data marts, and speed their operational analytic processes.
John Cunningham, Teradata Corporation
M
Paper 1769-2014:
Making It Happen: A Novel Way to Combine, Construct, Customize, and Implement Your Big Data with a SAS® Data Discovery, Analytics, and Fraud Framework in Medicare
A common complaint from users working on identifying fraud and abuse in Medicare is that teams focus on operational applications, static reports, and high-level outliers. But when faced with the need to constantly evaluate changing Medicare provider and beneficiary or enrollee dynamics, users are clamoring for more dynamic and accurate detection approaches. Providing these organizations with a data discovery and predictive analytics framework that leverages Hadoop and other big data approaches, while giving teams a clear path to making more fact-based decisions more quickly, is very important in pre- and post-fraud and abuse analysis. Organizations that pursue a reusable, services-based data discovery and analytics framework and architecture enjoy greater success in supporting data management, reporting, and analytics demands. They can quickly turn models into prioritized alerts and avoid improper or fraudulent payments. A successful framework should enable organizations to develop efficient fraud, waste, and abuse models that address complex schemes; identify fraud, waste, and abuse vulnerabilities; and shorten triage efforts using a variety of data sourced from big data platforms such as Hadoop and from relational database management systems. This paper discusses the data management, data discovery, predictive analytics, and social network analysis capabilities that are included in the SAS fraud framework and how a unified approach can significantly reduce the lifecycle of building and deploying fraud models. We hope this paper provides IT leaders with a clear path for resolving issues from the simple to the incredibly complex, through a measured and scalable approach that delivers value from fraud, waste, and abuse models and provides deep insights to support evidence-based investigations.
Vivek Sethunatesan, Northrop Grumman Corp
P
Paper 1506-2014:
Practical Considerations in the Development of a Suite of Predictive Models for Population Health Management
The use of predictive models in healthcare has steadily increased over the decades. Statistical models are now assumed to be a necessary component in population health management. This session will review practical considerations in the choice of models to develop, criteria for assessing the utility of the models for production, and challenges with incorporating the models into business process flows. Specific examples of models will be provided based upon work by the Health Economics team at Blue Cross Blue Shield of North Carolina.
Daryl Wansink, Blue Cross Blue Shield of North Carolina
Paper SAS195-2014:
Processing and Storing Sparse Data in SAS® Using SAS® Text Miner Procedures
Sparse data sets are common in applications of text and data mining, social network analysis, and recommendation systems. In SAS® software, sparse data sets are usually stored in the coordinate list (COO) transactional format. Two major drawbacks are associated with this sparse data representation. First, most SAS procedures are designed to handle dense data and cannot consume data that are stored transactionally; in that case, the options for analysis are significantly limited. Second, a sparse data set in transactional format is hard to store and process in distributed systems: most techniques require that all transactions for a particular object be kept together, and this assumption is violated when the transactions of that object are distributed to different nodes of the grid. This paper presents several ideas for packaging all transactions of an object into a single row. Approaches include storing the sparse matrix densely, performing variable selection, performing variable extraction, and compressing the transactions into a few text variables by using Base64 encoding (illustrated in the sketch after this entry). These simple but effective techniques enable you to store and process your sparse data in better ways. This paper demonstrates how to use SAS® Text Miner procedures to process sparse data sets and generate output data sets that are easy to store and can be readily processed by traditional SAS modeling procedures. The output of the system can be safely stored and distributed in any grid environment.
Zheng Zhao, SAS
Russell Albright, SAS
James Cox, SAS
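A minimal Python sketch of the Base64-packing idea, under the assumption that each object's sparse (index, value) pairs are simply serialized and encoded into one text field. The paper performs this packaging with SAS® Text Miner procedures; the row below is made up.

# Illustrative sketch only: the paper does this packing with SAS Text Miner
# procedures; this snippet just shows the Base64-packing idea itself.
import base64
import struct

def pack_row(pairs):
    """Pack one object's sparse (term_index, value) pairs into a Base64 text field."""
    raw = b"".join(struct.pack("<if", idx, val) for idx, val in pairs)
    return base64.b64encode(raw).decode("ascii")

def unpack_row(text):
    """Recover the (term_index, value) pairs from the Base64 text field."""
    raw = base64.b64decode(text)
    return [struct.unpack_from("<if", raw, off) for off in range(0, len(raw), 8)]

# One document's sparse term vector, kept together in a single row.
sparse_row = [(3, 0.41), (17, 1.25), (240, 0.08)]
encoded = pack_row(sparse_row)
print(encoded)
print(unpack_row(encoded))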
S
Paper SAS1524-2014:
SAS® Workshop: Text Analytics
This workshop provides hands-on experience using SAS® Text Miner. Workshop participants will do the following: read a collection of text documents and convert them for use by SAS Text Miner using the Text Import node; use the simple query language supported by the Text Filter node to extract information from a collection of documents; use the Text Topic node to identify the dominant themes and concepts in a collection of documents; and use the Text Rule Builder node to classify documents that have pre-assigned categories.
Tom Bohannon, SAS
Bob Lucas, SAS
Paper SAS291-2014:
Secret Experts Exposed: Using Text Analytics to Identify and Surface Subject Matter Experts in the Enterprise
All successful organizations seek ways of communicating the identity of subject matter experts to employees. This information exists as common knowledge when an organization is first starting out, but that common knowledge becomes fragmented as the organization grows. SAS® Text Analytics can be used on an organization's internal unstructured data to reunite these knowledge fragments. This paper demonstrates how to extract and surface this valuable information from within an organization. First, the organization's unstructured textual data are analyzed by SAS® Enterprise Content Categorization to develop a topic taxonomy that associates subject matter with subject matter experts in the organization (one simple way to make such associations is sketched after this entry). Then, SAS Text Analytics can be used to build powerful semantic models that enhance an organization's unstructured data. This paper shows how to use those models to process and deliver real-time information to employees, increasing the value of internal company information.
Richard Crowell, SAS
Saratendu Sethi, SAS
Xu Yang, SAS
Chunqi Zuo, SAS
Fruzsina Veress, SAS
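One simple, hypothetical way to associate topics with likely experts is sketched below in Python. The paper builds its taxonomy with SAS® Enterprise Content Categorization; the employee names, documents, and query here are invented.

# Illustrative sketch only: TF-IDF author profiles matched against a topic query,
# standing in for the taxonomy-based approach described in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical: all internal documents written by each employee, concatenated.
author_docs = {
    "A. Analyst":     "time series forecasting arima seasonal demand models",
    "B. Builder":     "hadoop cluster configuration data loading grid tuning",
    "C. Categorizer": "taxonomy development text categorization entity extraction",
}

names = list(author_docs)
tfidf = TfidfVectorizer()
profiles = tfidf.fit_transform([author_docs[n] for n in names])

query = tfidf.transform(["who knows about text categorization and taxonomies"])
scores = cosine_similarity(query, profiles).ravel()

# Rank likely subject matter experts for the query topic.
for name, score in sorted(zip(names, scores), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.2f}")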
Paper SAS286-2014:
Star Wars and the Art of Data Science: An Analytical Approach to Understanding Large Amounts of Unstructured Data
Businesses today are inundated with unstructured data: not just social media, but books, blogs, articles, journals, manuscripts, and even detailed legal documents. Manually managing unstructured data can be time-consuming and frustrating, and might not yield accurate results. Having an analyst read documents often introduces bias, because analysts have their own experiences, and those experiences help shape how the text is interpreted. The fact that people become fatigued can also affect the way that the text is interpreted. Is the analyst as motivated at the end of the day as at the beginning? Data science involves using data management, analytical, and visualization strategies to uncover the story that the data is trying to tell in a more automated fashion. This is important with structured data but becomes even more vital with unstructured data. Introducing automated processes for managing unstructured data can significantly increase the value and meaning gleaned from the data. This paper outlines the data science processes necessary to ingest, transform, analyze, and visualize three Star Wars movie scripts: A New Hope, The Empire Strikes Back, and Return of the Jedi. It focuses on the need to create structure from unstructured data using SAS® Data Management, SAS® Text Miner, and SAS® Content Categorization. The results are featured using SAS® Visual Analytics.
Adam Maness, SAS
Mary Osborne, SAS
Paper SAS349-2014:
Summarizing and Highlighting Differences in Senate Race Data Using SAS® Sentiment Analysis
Contrasting two sets of textual data points out important differences. For example, consider social media data collected on the race between incumbent Kay Hagan and challenger Thom Tillis in the 2014 election for the seat of US Senator from North Carolina. People talk about the candidates in different terms for different topics, and you can extract the words and phrases that are used more in messages about one candidate than about the other. By using SAS® Sentiment Analysis on the extracted information, you can discern not only the most important topics and sentiments for each candidate, but also the most prominent and distinguishing terms that are used in the discussion (a simple way to extract distinguishing terms is sketched after this entry). Find out if Republicans and Democrats speak different languages!
Hilke Reckman, SAS
Michael Wallis, SAS
Richard Crowell, SAS
Linnea Micciulla, SAS
Cheyanne Baird, SAS
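A minimal Python sketch of extracting distinguishing terms by comparing relative term frequencies between two message sets. The paper's analysis uses SAS® Sentiment Analysis; the two mini-corpora and the smoothing constant below are invented.

# Illustrative sketch only: a simple relative-frequency way to find terms used
# more about one candidate than the other, not the SAS Sentiment Analysis method.
from collections import Counter

# Hypothetical mini-corpora of messages about each candidate.
candidate_a = "education budget teachers education jobs"
candidate_b = "taxes budget jobs regulation taxes business"

def rates(text):
    words = text.split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

rates_a, rates_b = rates(candidate_a), rates(candidate_b)
vocab = set(rates_a) | set(rates_b)

# Ratio of (smoothed) usage rates: far from 1.0 means a distinguishing term.
for word in sorted(vocab):
    ratio = (rates_a.get(word, 0) + 0.01) / (rates_b.get(word, 0) + 0.01)
    print(f"{word}: {ratio:.2f}")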
T
Paper 1834-2014:
Text Mining Economic Topic Sentiment for Time Series Modeling
Global businesses must react to daily changes in market conditions across multiple geographies and industries. Consuming reputable daily economic reports assists in understanding these changing conditions, but it requires both a significant human time commitment and a subjective assessment of each topic area of interest. To overcome these constraints, Dow's Advanced Analytics team has constructed a process to calculate sentence-level topic frequency and sentiment scores from unstructured economic reports. Daily topic sentiment scores are aggregated to weekly and monthly intervals and used as exogenous variables to model external economic time series data. These models serve both to validate the relationship between our sentiment scoring process and the economic indices and to provide near-term forecasts where daily or weekly variables are unavailable. This paper will first describe our process of using SAS® Text Miner to import and discover economic topics and sentiment from unstructured economic reports. The next section describes sentiment variable selection techniques that use SAS/STAT®, SAS/ETS®, and SAS® Enterprise Miner to generate similarity measures to economic indices. Our process then uses ARIMAX modeling in SAS® Forecast Studio to create economic index forecasts with topic sentiments (the aggregate-and-forecast idea is sketched after this entry). Finally, we show how the sentiment model components are used as a matrix of economic key performance indicators by topic and geography.
Michael P. Dessauer, The Dow Chemical Company
Justin Kauhl, Tata Consultancy Services
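The aggregate-and-forecast step can be sketched with pandas and statsmodels on synthetic data. The paper's models are built in SAS® Forecast Studio, so this is only an illustration, under assumed dates and made-up series, of using monthly topic sentiment as an exogenous ARIMAX regressor.

# Illustrative sketch only: synthetic data standing in for topic sentiment scores
# and an economic index; the paper's ARIMAX models are built in SAS Forecast Studio.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
days = pd.date_range("2013-01-01", periods=730, freq="D")

# Synthetic daily sentiment scores for one economic topic.
daily_sentiment = pd.Series(rng.normal(0, 1, len(days)).cumsum() * 0.01, index=days)

# Aggregate daily scores to monthly means, as in the process described above.
monthly_sentiment = daily_sentiment.resample("MS").mean()

# Synthetic monthly economic index loosely driven by the sentiment series.
index = 100 + 5 * monthly_sentiment + rng.normal(0, 0.5, len(monthly_sentiment))

# ARIMAX: an ARIMA structure plus the topic sentiment as an exogenous regressor.
model = SARIMAX(index, exog=monthly_sentiment, order=(1, 0, 0))
result = model.fit(disp=False)
print(result.params)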
Paper 1652-2014:
Text Mining Reveals the Secret of Success: Identification of Sales Determinants Hidden in Customers' Opinions
Nowadays, in the big data era, business intelligence departments collect, store, process, and monitor massive amounts of data. Nevertheless, hundreds of metrics built on structured data are sometimes insufficient to explain why an offered deal sold better or worse than expected. The answer might be found in text data that every company owns yet is often unaware of or neglects. This project shows text mining methods, implemented in SAS® Text Miner 12.1, that enable the determination of a deal's success or failure factors based on customers' views and opinions gathered in-house or scattered across the Internet. The study is conducted on data gathered from Groupon Sp. z o.o. (the Polish business unit of the e-commerce company), as the market is assumed to be, by and large, a customer-driven environment.
Rafal Wojdan, Warsaw School of Economics
U
Paper SAS061-2014:
Uncovering Trends in Research Using SAS® Text Analytics with Examples from Nanotechnology and Aerospace Engineering
Understanding previous research in key domain areas can help R&D organizations focus new research in non-duplicative areas and ensure that future endeavors do not repeat the mistakes of the past. However, manual analysis of previous research efforts can prove insufficient to meet these ends. This paper highlights how a combination of SAS® Text Analytics and SAS® Visual Analytics can deliver the capability to understand key topics and patterns in previous research and how they apply to a current research endeavor. We will explore these capabilities in two use cases. The first involves uncovering trends in publicly visible, government-funded research (SBIR) and how these trends apply to future research in nanotechnology. The second involves visualizing past research trends in publicly available NASA publications and how these might impact the development of next-generation spacecraft.
Tom Sabo, SAS