SAS Global Forum 2016 Proceedings

Movie reviews provide crucial information and are an important factor in deciding whether to spend money on seeing a film in the theater. Each review reflects an individual's take on the movie, and there are often contrasting reviews for the same movie. Going through every review can leave a reader confused. Analyzing all the movie reviews and generating a quick summary that describes the performance, direction, and screenplay, among other aspects, helps readers understand the sentiments of movie reviewers and decide whether they would like to watch the movie in the theater. In this paper, we demonstrate how the SAS® Enterprise Miner™ nodes enable us to generate a quick summary of the terms and their relationships with each other, which describes the various aspects of a movie. The Text Cluster and Text Topic nodes are used to generate groups with similar subjects such as genres of movies, acting, and music. We use SAS® Sentiment Analysis Studio to build models that help us classify 10,000 reviews, where each review is a separate document and the reviews are equally split into good and bad. The Smoothed Relative Frequency and Chi-square model is found to be the best model, with an overall precision of 78.37%. As soon as the latest reviews are out, such analysis can be performed to help viewers quickly grasp the sentiments of the reviews and decide if the movie is worth watching.
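To make the chi-square idea in the abstract above concrete, the following is a minimal Python sketch of scoring how strongly a term is associated with good versus bad reviews. It is not the Smoothed Relative Frequency and Chi-square model from SAS Sentiment Analysis Studio; the toy reviews and the function names `tokenize`, `chi_square`, and `term_score` are assumptions made for illustration only.

```python
import re

# Hypothetical toy corpus: (review text, label). Not data from the paper.
reviews = [
    ("great acting and a great script", "good"),
    ("great direction, moving score", "good"),
    ("dull plot and weak acting", "bad"),
    ("dull, predictable, weak screenplay", "bad"),
]


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in a review."""
    return set(re.findall(r"[a-z]+", text.lower()))


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: good reviews containing the term
    n10: bad reviews containing the term
    n01: good reviews without the term
    n00: bad reviews without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    # A zero denominator means the term (or a class) never varies; score it 0.
    return numerator / denominator if denominator else 0.0


def term_score(term, labelled_reviews):
    """Count document frequencies per class and return the chi-square score."""
    n11 = n10 = n01 = n00 = 0
    for text, label in labelled_reviews:
        has_term = term in tokenize(text)
        if has_term and label == "good":
            n11 += 1
        elif has_term:
            n10 += 1
        elif label == "good":
            n01 += 1
        else:
            n00 += 1
    return chi_square(n11, n10, n01, n00)


for term in ("great", "dull", "acting"):
    print(term, term_score(term, reviews))
```

In this toy corpus, terms that appear in only one class score high, while a term such as "acting" that appears equally in both classes scores zero, which is why a chi-square score is a common way to pick sentiment-bearing terms.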

Read the paper (PDF)

The Armed Conflict Location & Event Data Project (ACLED) is a comprehensive public collection of conflict data for developing states. As such, the data is instrumental in informing humanitarian and development work in crisis and conflict-affected regions. The ACLED project currently codes eight types of events manually from a summarized event description, which is a time-consuming process. In addition, these fixed event types are limiting when determining the root causes of conflict across developing states. How can researchers get to a deeper level of analysis? This paper showcases a repeatable combination of exploratory and classification-based text analytics provided by SAS® Contextual Analysis, applied to the publicly available ACLED data for African states. We depict the steps necessary to determine (using semi-automated processes) the next deeper level of themes associated with each of the coded event types. For example, while there is a broad protests and riots event type, how many of those events are associated individually with rallies, demonstrations, marches, strikes, and gatherings? We also explore how to use predictive models to highlight the areas that require aid next. We prepare the data for use in SAS® Visual Analytics and SAS® Visual Statistics using SAS® DS2 code and enable dynamic explorations of the new event sub-types. This ultimately provides the end-user analyst with additional layers of data that can be used to detect trends and deploy limited humanitarian aid resources where they are needed the most.
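The sub-type question in the abstract above (how many protest events are rallies, demonstrations, marches, strikes, or gatherings) can be pictured with a small keyword-rule sketch in Python. This is only a stand-in for the semi-automated rules built in SAS Contextual Analysis; the patterns, the `tag_subtypes` helper, and the event descriptions are hypothetical and are not taken from ACLED.

```python
import re
from collections import Counter

# Hypothetical keyword rules, one regular expression per sub-type.
# Naive on purpose: the "march" rule would also match the month "March".
SUBTYPE_PATTERNS = {
    "rally": r"\brall(?:y|ies)\b",
    "demonstration": r"\bdemonstrat(?:ion|ions|ed|ors)\b",
    "march": r"\bmarch(?:es|ed)?\b",
    "strike": r"\bstrikes?\b",
    "gathering": r"\bgather(?:ing|ings|ed)\b",
}


def tag_subtypes(description):
    """Return every sub-type whose rule matches the event description."""
    return [
        name
        for name, pattern in SUBTYPE_PATTERNS.items()
        if re.search(pattern, description, flags=re.IGNORECASE)
    ]


# Made-up event descriptions, all assumed to be coded as protests and riots.
events = [
    "Students marched to parliament and held a rally.",
    "Teachers went on strike over unpaid wages.",
    "A peaceful gathering turned into a riot.",
]

# Tally how many events fall under each sub-type.
counts = Counter(tag for event in events for tag in tag_subtypes(event))
print(counts)
```

An event can receive more than one sub-type, so the counts may sum to more than the number of events.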

Read the paper (PDF)

Since its inception, SAS® Text Miner has used the singular value decomposition (SVD) to convert a term-document matrix to a representation that is crucial for building successful supervised and unsupervised models. In this presentation, using SAS® code and SAS Text Miner, we compare these models with those that are based on SVD representations of subcomponents of documents. These more granular SVD representations are also used to provide further insights into the collection. Examples featuring visualizations, discovery of term collocations, and near-duplicate subdocument detection are shown.
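As a rough illustration of what an SVD representation of a term-document matrix looks like, the Python sketch below approximates only the largest singular value and its vectors by power iteration. It is not how SAS Text Miner computes its SVD; the tiny matrix, the term labels, and the function name `top_singular_triplet` are assumptions for this example. The columns could equally stand for sentences or paragraphs when working with subcomponents of documents.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its left and right
    singular vectors by power iteration on A^T A.

    Assumes a non-zero matrix whose top singular value is unique.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # uniform unit start vector
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u, then renormalize to get the next estimate of v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical term-document counts: rows are terms, columns are documents.
#              d1 d2 d3
term_doc = [
    [1, 1, 0],  # "movie"
    [1, 1, 0],  # "actor"
    [0, 0, 1],  # "cycling"
]

sigma, term_vector, doc_vector = top_singular_triplet(term_doc)
# One-dimensional coordinates of each document in the reduced space.
doc_coords = [sigma * x for x in doc_vector]
print(sigma, doc_coords)
```

Documents d1 and d2 share the same terms and so receive the same coordinate, while d3 falls near zero on this first dimension. A full SVD would keep several such dimensions.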

Read the paper (PDF)

In the pharmaceutical industry, generating summary tables or listings in PDF format is becoming increasingly popular. You can create various PDF output files with different styles by combining the ODS PDF statement and the REPORT procedure in SAS® software. In order to validate those outputs, you need to import the PDF summary tables or listings into SAS data sets. This paper presents an overview of a utility that reads in an uncompressed PDF file that is generated by SAS, extracts the data, and converts it to SAS data sets.

Read the paper (PDF)

Quality issues can easily go undetected in the field for months, even years. Why? Customers are more likely to complain through their social networks than directly through customer service channels. The Internet is the closest friend for most users, and focusing purely on internal data isn't going to help. Customer complaints are often handled by marketing or customer service and aren't shared with engineering. This lack of integrated vision is due to the silo mentality in organizations and proves costly in the long run. Information systems are designed with business rules that flag out-of-spec situations or that trigger only when a certain threshold of reported cases is reached, and only then is engineering involved. Complex business problems demand sophisticated technology to deliver advanced analytical capabilities. Organizations can react only to what they know about, but unstructured data is an invaluable asset that we must latch on to. So what if you could infuse the voice of the customer directly into engineering and detect emerging issues months earlier than your existing process does? Lenovo did just that. Issues were detected in time to remove problematic parts, eventually leading to massive reductions in cost and time. Lenovo observed a significant decline in call volume to its corporate call center. It became evident that customers' perception of quality relative to the market is a voice we can't afford to ignore. The Lenovo Corporate Analytics unit partnered with SAS® to build an end-to-end process that comprises data acquisition and preparation, analysis using linguistic text analytics models, and quality control metrics for visualization dashboards. Lenovo executives consume these reports and make informed decisions. In this paper, we talk in detail about perceptual quality and why we should care about it. We also tell you where to look for perceptual quality data and describe best practices for starting your own perceptual quality program through voice-of-the-customer analytics.

Read the paper (PDF)

Create a model with SAS Contextual Analysis, deploy the model to your Hadoop cluster, and run the model using map/reduce driven by SAS Code Accelerator, not alongside Hadoop but in it. Hadoop took the world by storm as a cost-efficient software framework for storing data. But Hadoop is more than that: it's an analytics platform. Analytics is more than just crunching numbers; it's also about gaining knowledge from your organization's unstructured data. SAS has the tools to do both. In this paper, we demonstrate how to access your unstructured data, automatically create a text analytics model, deploy that model in Hadoop, and run SAS Code Accelerator to give you insights into the big and unstructured data lying dormant in Hadoop. We show you how the insights gained can help you make better-informed decisions for your organization.

Read the paper (PDF) | Download the data file (ZIP) | Watch the recording

Cycling is one of the fastest growing recreational activities and sports in India. Many non-government organizations (NGOs) support this emission-free mode of transportation, which helps protect the environment from air pollution hazards. Many cyclist groups in metropolitan areas organize ride events and build social networks. Although India was a bit late to join the Internet, social networking sites are becoming popular in every Indian state and are expected to grow after the announcement of the Digital India Project. Many networking groups and cycling blogs share a great deal of information and resources across the globe. However, these blog posts are difficult to categorize according to their relevance, which makes it hard to find the required information quickly on the blogs. This paper provides ways to categorize the content of these blog posts and classify them into meaningful categories for easy identification. The initial data set is created by scraping online cycling blogs (for example, Cyclists.in and velocrushindia.com) using Python. It consists of 1,446 blog posts, with titles and blog content, published since 2008. To generate a training data set, approximately 25% of the blog posts are manually categorized by three independent raters into six categories: Ride stories, Events, Nutrition, Bicycle hacks, Bicycle Buy and Sell, and Others. The common blog-post categories are identified, and a text classification model is built using SAS® Text Miner to classify the blog posts and to generate the text rules that assign them to those categories. To improve the look and feel of the social networking blog, a tool developed in JavaScript automatically classifies the blog posts into the six categories and provides appropriate suggestions to the blog user.
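The abstract above mentions three independent raters but does not say how their labels were combined into a single training label. One common option, shown here purely as an assumption and not as the authors' procedure, is a strict majority vote that flags posts on which the raters disagree. The function name `consensus_label` and the sample ratings are hypothetical.

```python
from collections import Counter

# The six categories named in the abstract.
CATEGORIES = {
    "Ride stories",
    "Events",
    "Nutrition",
    "Bicycle hacks",
    "Bicycle Buy and Sell",
    "Others",
}


def consensus_label(ratings):
    """Return the category chosen by a strict majority of raters,
    or None when no category has a majority (flag for manual review)."""
    unknown = set(ratings) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes > len(ratings) / 2 else None


# Made-up ratings for two blog posts, one rating per rater.
print(consensus_label(["Events", "Events", "Nutrition"]))
print(consensus_label(["Events", "Nutrition", "Others"]))
```

Posts that come back as `None` could be discussed among the raters or left out of the training set, depending on how much labeled data is needed.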