Location information plays a big role in business data. Everything that happens in a business happens somewhere, whether it's sales of products in different regions or crimes committed in a city. Business analysts typically base their analysis on the historical data that they have gathered over years. One of the most important pieces of data that can help answer more questions, and answer them better, is demographic data alongside the business data. An analyst can match sales or crimes against population metrics such as gender, age group, family income, and race, which are part of the demographic data, for better insight. This paper demonstrates how a business analyst can bring demographic and lifestyle data from Esri into SAS® Visual Analytics and join it with business data, which the integration of SAS Visual Analytics with Esri makes possible. We demonstrate different methods of accessing Esri demographic data from SAS Visual Analytics, and we also demonstrate how you can use custom shapefiles and integrate with Esri Portal for ArcGIS.
Murali Nori, SAS
Himesh Patel, SAS
Uber has changed the face of taxi ridership, making it more convenient and comfortable for riders. But there are times when customers are dissatisfied because of a shortage of Uber vehicles, which ultimately leads to Uber surge pricing. Forecasting the number of riders at different locations in a city at different points in time is a very difficult task, and it gets more complicated with changes in weather. In this paper, we attempt to estimate the number of trips per borough per day in New York City. We add an exogenous factor, weather, to this analysis to see how it affects the number of trips. We fetched six months' worth of data (approximately 9.7 million records) on Uber rides in New York City, from January 2015 to June 2015, from GitHub, and we gathered weather data (about 3.5 million records) for New York City for the same period from the National Climatic Data Center. We analyzed the Uber data and weather data together to estimate the change in the number of trips per borough due to changing weather conditions. Using time series analysis, we built a model that produces a one-week-ahead forecast of the number of trips per day for each borough of New York City.
Anusha Mamillapalli, Oklahoma State University
Singdha Gutha, Oklahoma State University
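As an illustration of the forecasting step described above, here is a minimal PROC ARIMA sketch for a single borough. The data set name, variables, and model order are hypothetical placeholders, not the model the authors fit.

/* A minimal sketch: one-week-ahead daily forecast for one borough.   */
/* WORK.TRIPS_MANHATTAN (date, trips) is a hypothetical daily series. */
proc arima data=work.trips_manhattan;
   identify var=trips(1);                 /* difference once for trend */
   estimate p=1 q=1;                      /* illustrative ARIMA(1,1,1) */
   forecast lead=7 interval=day id=date out=fc_week;  /* 7 days ahead  */
run;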
Customer churn is an important area of concern that affects not just the growth of your company but also its profit. Conventional survival analysis can provide a customer's likelihood to churn in the near term, but it does not take into account the lifetime value of the higher-risk churn customers you are trying to retain. Not all customers are equally important to your company. Recency, frequency, and monetary (RFM) analysis can help companies identify the customers that are most important and most likely to respond to a retention offer. In this paper, we use the IML and PHREG procedures to combine RFM analysis and survival analysis in order to determine the optimal number of higher-risk, higher-value customers to retain.
Bo Zhang, IBM
Liwei Wang, Pharmaceutical Product Development Inc
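For readers unfamiliar with the pairing, here is a minimal sketch of the survival-analysis half of this approach; the table WORK.CUSTOMERS and its variables are hypothetical, and the RFM scoring and retention optimization done with PROC IML are omitted.

/* Cox proportional hazards model of time to churn.                  */
/* tenure = months observed, churned = 1 if the customer churned,    */
/* recency/frequency/monetary = RFM inputs (all hypothetical names). */
proc phreg data=work.customers;
   model tenure*churned(0) = recency frequency monetary;
   /* save each customer's estimated survival curve for later scoring */
   baseline out=surv_est covariates=work.customers survival=surv_prob;
run;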
A/B testing is a form of statistical hypothesis testing used in the modern Internet age to compare two business options (A and B) and determine which is more effective. The challenge for startups or new-product businesses leveraging A/B testing is two-fold: a small number of customers and a poor understanding of their responses. This paper shows you how to use the IML and POWER procedures to reassess sample size for adaptive multi-stage business designs based on conditional power arguments, using the data observed at the previous business stage.
Bo Zhang, IBM
Liwei Wang, Pharmaceutical Product Development Inc
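As a simplified illustration of the PROC POWER piece, the sketch below recomputes the per-group sample size from rates observed at the previous stage; the proportions and target power are hypothetical, and the full conditional-power adjustment described in the paper is not shown.

proc power;
   twosamplefreq test=pchi              /* Pearson chi-square test    */
      groupproportions = (0.10 0.13)    /* observed stage-1 A/B rates */
      power            = 0.80           /* desired power              */
      npergroup        = .;             /* solve for n per group      */
run;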
For all business analytics projects, big or small, the results are used to support business or managerial decision-making processes, and many of them eventually lead to business actions. However, executives or decision makers are often confused and feel uninformed about the content when presented with complicated analytics steps, especially when multiple processes or environments are involved. After many years of research and experimentation, a web reporting framework based on SAS® Stored Processes was developed to smooth the communication between data analysts, researchers, and business decision makers. This web reporting framework uses a storytelling style to present essential analytical steps to audiences, with dynamic HTML5 content and drill-down and drill-through functions in text, graph, table, and dashboard formats. No special skills other than SAS® programming are needed to implement a new report. The model-view-controller (MVC) structure in this framework significantly reduces the time needed to develop high-end web reports for audiences not familiar with SAS. Additionally, the report contents can be fed to tablet or smartphone users. A business analytics example is demonstrated during this session. By using this web reporting framework based on SAS Stored Processes, many existing SAS results can be delivered more effectively and persuasively on a SAS® Enterprise BI platform.
Qiang Li, Locfit LLC
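To make the underlying mechanism concrete, here is a minimal sketch of the streaming pattern that such a stored-process framework builds on: the SAS program writes HTML directly to the client through the reserved _WEBOUT fileref. The page content is a placeholder, not the framework's MVC code.

data _null_;
   file _webout;                        /* stream to the browser */
   put '<!DOCTYPE html><html><body>';
   put '<h1>Sales Summary</h1>';
   put '<p>Generated by a SAS Stored Process.</p>';
   put '</body></html>';
run;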
Session 1068-2017:
Establishing an Agile, Self-Service Environment to Empower Agile Analytic Capabilities
Creating an environment that enables and empowers self-service and agile analytic capabilities requires a tremendous amount of collaboration and extensive agreement between IT and the business. Business and IT users struggle to know which version of the data is valid, where they should get the data from, and how to combine and aggregate all the data sources to apply analytics and deliver results in a timely manner. All the while, IT struggles to supply the business with more and more data that is becoming available through many different sources such as the Internet, sensors, and the Internet of Things. In addition, once they start trying to join and aggregate all the different types of data, the manual coding can be very complicated and tedious, can demand excessive resources and processing, and can negatively impact the overhead on the system. If IT enables agile analytics in a data lab, it can alleviate many of these issues, increase productivity, and deliver an effective self-service environment for all users. This self-service environment using SAS® analytics in Teradata has decreased the time required to prepare the data and develop the statistical data model, delivering results in minutes compared to days or even weeks. This session discusses how you can enable agile analytics in a data lab and leverage SAS analytics in Teradata to increase performance, and it describes how hundreds of organizations have adopted this concept to deliver self-service capabilities in a streamlined process.
Bob Matsey, Teradata
David Hare, SAS
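A minimal sketch of the in-database pattern mentioned above, with placeholder connection values: assign a Teradata library, then let an in-database-enabled procedure push its work into the database instead of pulling rows back to the SAS server.

libname td teradata server=tdprod user=analyst password=XXXXX database=datalab;

options sqlgeneration=dbms;             /* request in-database processing */

proc freq data=td.sales_fact;           /* summarization runs in Teradata */
   tables region*product_line;
run;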
Would you like to be more confident in producing graphs and figures? Do you understand the differences between the OVERLAY, GRIDDED, LATTICE, DATAPANEL, and DATALATTICE layouts? Finally, would you like to learn the fundamental Graph Template Language (GTL) methods in a relaxed environment that fosters questions? Great! This topic is for you. In this hands-on workshop, you are guided through the fundamental aspects of GTL, and you can try fun and challenging SAS® graphics exercises to enable you to more easily retain what you have learned.
Kriss Harris
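For readers new to GTL, the following minimal sketch shows the basic workflow the workshop covers: define a template containing an OVERLAY layout with PROC TEMPLATE, then render it with PROC SGRENDER (SASHELP.CLASS is stand-in data).

proc template;
   define statgraph basic_overlay;
      begingraph;
         entrytitle "Weight by Height";
         layout overlay;
            scatterplot x=height y=weight;
            regressionplot x=height y=weight;   /* overlaid fit line */
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=sashelp.class template=basic_overlay;
run;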
Do you need to add annotations to your graphs? Do you need to specify your own colors on the graph? Would you like to add Unicode characters to your graph, or would you like to create templates that can also be used by non-programmers to produce the required figures? Great! Then this topic is for you. In this hands-on workshop, you are guided through the more advanced features of GTL. There are also fun and challenging SAS® graphics exercises to enable you to more easily retain what you have learned.
Kriss Harris
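A minimal sketch of two of the advanced features mentioned: a Unicode character in a title, and DYNAMIC variables that let non-programmers reuse the template by supplying column names at render time. The data and dynamic values are placeholders.

proc template;
   define statgraph adv_overlay;
      dynamic XVAR YVAR;                            /* set at render time */
      begingraph;
         entrytitle "Scatter with " {unicode '03B2'x} " in the title";
         layout overlay;
            scatterplot x=XVAR y=YVAR;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=sashelp.class template=adv_overlay;
   dynamic xvar='height' yvar='weight';
run;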
When analyzing data with SAS®, we often use the SAS DATA step and the SQL procedure to explore and manipulate data. Though both are useful tools in SAS, many SAS users do not fully understand their differences, advantages, and disadvantages, and thus engage in numerous unnecessary, biased debates about them. Therefore, this paper illustrates and discusses these aspects with real work examples, which give SAS users deep insights into using the two tools. Using the right tool for a given circumstance not only provides an easier and more convenient solution, it also saves time and work in programming, thus improving work efficiency. Furthermore, the illustrated methods and advanced programming skills can be used in a wide variety of data analysis and business analytics fields.
Justin Jia, TransUnion
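As a concrete point of comparison, here is the same inner join done both ways, using hypothetical tables WORK.ACCOUNTS and WORK.BALANCES keyed by id: the DATA step match-merge requires sorted inputs, while PROC SQL does not.

/* DATA step match-merge: both inputs must be sorted by the key */
proc sort data=work.accounts; by id; run;
proc sort data=work.balances; by id; run;

data merged_ds;
   merge work.accounts(in=a) work.balances(in=b);
   by id;
   if a and b;                       /* keep matches only (inner join) */
run;

/* PROC SQL join: no pre-sorting needed */
proc sql;
   create table merged_sql as
   select a.*, b.balance
   from work.accounts as a
        inner join work.balances as b
        on a.id = b.id;
quit;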
Every organization, from the most mature to a day-one start-up, needs to grow organically. A deep understanding of internal customer and operational data is the single biggest catalyst to develop and sustain that growth. Advanced analytics and big data directly feed into this, and there are best practices that any organization (across the entire growth curve) can adopt to drive success. Analytics teams can be drivers of growth, but to be truly effective, key best practices need to be implemented. These practices include in-the-weeds details, like the approach to data hygiene, as well as strategic practices, like team structure and model governance. When executed poorly, business leadership and the analytics team are unable to communicate with each other; they talk past each other and do not work together toward a common goal. When executed well, the analytics team is part of the business solution, aligned with the needs of business decision-makers, and drives the organization forward. Through our engagements, we have discovered best practices in three key areas, all of which are critical to analytics team effectiveness: 1) data hygiene, 2) complex statistical modeling, and 3) team collaboration.
Aarti Gupta, Bain & Company
Paul Markowitz, Bain & Company
Prince Niccolo Machiavelli said things on the order of, "The promise given was a necessity of the past: the word broken is a necessity of the present." His utilitarian philosophy can be summed up by the phrase, "The ends justify the means." As a personality trait, Machiavellianism is characterized by the drive to pursue one's own goals at the cost of others. In 1970, Richard Christie and Florence L. Geis created the MACH-IV test to assign a MACH score to an individual, using 20 Likert-scaled questions. The purpose of this study was to build a regression model that can be used to predict the MACH score of an individual using fewer factors. Such a model could be useful in screening processes where personality is considered, such as job screening, offender profiling, or online dating. The research was conducted on a data set from an online personality test similar to the MACH-IV test. It was hypothesized that a statistically significant model exists that can predict an average MACH score for individuals with similar factors. This hypothesis was accepted.
Patrick Schambach, Kennesaw State University
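A minimal sketch of the kind of reduced-factor model the study describes, assuming a hypothetical data set WORK.MACH containing each respondent's MACH score and candidate predictors from the test data; the variables and selection method are illustrative only.

proc glmselect data=work.mach;
   class gender;
   model mach_score = gender age education urban
         / selection=stepwise(select=sl);   /* keep significant factors */
run;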
Optimizing delivery routes and efficiently using delivery drivers are examples of classic problems in operations research, such as the Traveling Salesman Problem. In this paper, Oberweis and Zencos collaborate to describe how to leverage SAS/OR® procedures to solve these problems and optimize delivery routes for a retail delivery service. Oberweis Dairy specializes in home delivery service that brings premium dairy products directly to customers' homes. Because freshness is critical to delivering an excellent customer experience, Oberweis is especially motivated to optimize its delivery logistics. As Oberweis works to develop an expanding footprint and a growing business, Zencos is helping to ensure that delivery routes are optimized and delivery drivers are used efficiently.
Ben Murphy, Zencos
Bruce Bedford, Oberweis Dairy, Inc.
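A minimal sketch of the traveling-salesman piece using PROC OPTNET from SAS/OR, with a hypothetical link data set of pairwise drive times between stops; the paper's actual formulation and constraints may differ.

data stops;                     /* from, to, weight = drive minutes */
   input from $ to $ weight;
   datalines;
depot A 12
depot B 20
depot C 18
A     B  9
A     C 15
B     C  7
;
run;

proc optnet data_links=stops out_nodes=tour;
   tsp;                         /* solve for the optimal tour */
run;

proc print data=tour; run;      /* TSP_ORDER gives each stop's position */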
As a programmer specializing in tracking systems for research projects, I was recently given the task of implementing a newly developed, web-based tracking system for a complex field study. This tracking system uses a SQL database on the back end to hold a large set of related tables. As I learned about the new system, I found that there were deficiencies to overcome to make the system work on the project. Fortunately, I was able to develop a set of utilities in SAS® to bridge the gaps in the system and to integrate it with the other systems used for field survey administration on the project. The utilities helped to do the following: 1) connect schemas and compare cases across subsystems; 2) compare the statuses of cases across multiple tracked processes; 3) generate merge input files to be used to initiate follow-up activities; 4) prepare and launch SQL stored procedures from a running SAS job; and 5) develop complex queries in Microsoft SQL Server Management Studio and make them run in SAS. This paper puts each of these utilities into a larger context by describing the project need that the tracking system did not address. Then, each program is explained and documented, with comments.
Chris Carson, RTI International
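As an example of utility 4 (launching SQL stored procedures from a running SAS job), here is a minimal SQL pass-through sketch; the DSN and procedure name are hypothetical placeholders.

proc sql;
   connect to odbc (dsn="TrackingDB");
   /* run a stored procedure on the SQL Server side */
   execute (exec dbo.usp_refresh_case_status) by odbc;
   disconnect from odbc;
quit;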
Airbnb is the world's largest home-sharing company, with over 800,000 listings in more than 34,000 cities and 190 countries. The pricing of each property, which is done by the Airbnb host, is therefore crucial to the business. Setting prices low during a high-demand period hinders profits, while setting them high during a low-demand period might result in no bookings at all. In this paper, we suggest a price recommendation methodology for Airbnb hosts that helps overcome the problems of overpricing and underpricing. Through this methodology, we try to identify key factors related to Airbnb pricing: factors influential in determining a price for a property; the relation between the price of a property and the frequency of its booking; and similarities among successful and profitable properties. The constraints outlined in the analysis were entered into SAS® optimization procedures to achieve the best possible price. As part of this methodology, we built a scraping tool to get details of New York City host user data along with their metrics. Using this data, we built a pricing model to predict the optimal price of an Airbnb home.
Praneeth Guggilla, Oklahoma State University
Singdha Gutha, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
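To illustrate the optimization step, here is a minimal PROC OPTMODEL sketch that assumes a hypothetical linear demand model (bookings decline as price rises) estimated upstream; the coefficients and price bounds are placeholders, not the paper's model.

proc optmodel;
   num a = 40;                          /* bookings at a price of zero */
   num b = 0.15;                        /* bookings lost per dollar    */
   var price >= 50 <= 250;              /* allowed nightly price range */
   max revenue = price * (a - b*price); /* expected revenue            */
   solve;
   print price;    /* optimal price; revenue appears in the solution summary */
quit;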
Self-driving cars are no longer a futuristic dream. In the recent past, Google has launched a prototype of its self-driving car, and Apple is also developing one of its own. Companies like Tesla have introduced an Autopilot feature in the newer versions of their electric cars, which has created quite a buzz in the car market. This technology is said to enable aging or disabled people to remain mobile, while also increasing overall traffic safety. But many people are still skeptical about the idea of self-driving cars, and that is our area of interest. In this project, we plan to do sentiment analysis on thoughts voiced by people on the Internet about self-driving cars. We obtained the data from http://www.crowdflower.com/data-for-everyone, which contains reviews about self-driving cars. Our data set contains 7,156 observations and 9 variables. We plan to do descriptive analysis of the reviews to identify key topics and then use supervised sentiment analysis. We also plan to track and report how the topics and the sentiments change over time.
Nachiket Kawitkar, Oklahoma State University
Swapneel Deshpande, Oklahoma State University
In 2012, US Customs scanned nearly 4% and physically inspected less than 1% of the 11.5 million cargo containers that entered the United States. Laundering money through trade is one of the three primary methods used by criminals and terrorists; the other two are using financial institutions and physically moving money via cash couriers. The Financial Action Task Force (FATF) roughly defines trade-based money laundering (TBML) as disguising proceeds from criminal activity by moving value through the use of trade transactions in an attempt to legitimize their illicit origins. Compared to the methods that use financial institutions and couriers, this method of money laundering receives far less attention. As countries face budget shortfalls and realize the potential loss of revenue through fraudulent trade, they are becoming more interested in TBML. Like many problems, applying detection methods against relevant data can result in meaningful insights and in the ability to investigate and bring to justice those perpetrating fraud. In this paper, we apply TBML red flag indicators, as defined by John A. Cassara, against shipping and trade data to detect and explore potentially suspicious transactions. (John A. Cassara is an expert in anti-money laundering and counter-terrorism, and the author of the book "Trade-Based Money Laundering.") We use the latest detection tool in SAS® Viya, along with SAS® Visual Investigator.
Daniel Tamburro, SAS
Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second-level learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS® 9.4 and SAS® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and naïve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.
Funda Gunes, SAS
Russ Wolfinger, SAS
Pei-Yi Tan, SAS
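A minimal sketch of just the second-level (stacker) step, assuming a hypothetical table WORK.OOF_PREDS that already holds each observation's binary target plus out-of-fold predictions from three level-one models trained with cross validation upstream; a generalized linear model is one of the stacking learners named above.

proc logistic data=work.oof_preds;
   model target(event='1') = p_nnet p_gbm p_mf;   /* level-one scores */
   output out=stacked_scores predicted=p_stack;   /* ensemble score   */
run;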
This project describes a method for classifying movie genres from synopsis text data using two approaches: term frequency-inverse document frequency (tf-idf) and the C4.5 decision tree. By comparing the performance of the classifiers under different parameter settings, we also interpret the strengths of this method for substantial text analysis and where it can be improved. The results show that both approaches are powerful for identifying movie genres.
Yiyun Zhou, Kennesaw State University
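For reference, a standard form of the tf-idf weight used in the first approach (weighting variants exist, and the project's exact choice is not stated) is

\[ \text{tf-idf}(t,d) = \mathrm{tf}(t,d) \times \log \frac{N}{\mathrm{df}(t)} \]

where tf(t,d) is the number of times term t appears in synopsis d, N is the total number of synopses, and df(t) is the number of synopses containing t.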
Every visualization tells a story. The effectiveness of showing data through visualization becomes clear in these visualizations, which tell stories about differences in US mortality using the National Longitudinal Mortality Study (NLMS) Public-Use Microdata Samples (PUMS) of 1.2 million cases and 122 thousand mortality records. SAS® Visual Analytics is a versatile and flexible tool that easily displays the simple effects of differences in mortality rates between age groups, genders, races, places of birth (native or foreign), education and income levels, and so on. Sophisticated analyses, including logistic regression (with interactions), decision trees, and neural networks, displayed in a clear, concise manner, help describe more interesting relationships among the variables that influence mortality. Some of the most compelling examples: males who live alone have a higher mortality rate than females, and white men have higher rates of suicide than black men.
Catherine Loveless-Schmitt, U.S. Census Bureau
Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between SAS® data sets. In projects that span multiple years (e.g., longitudinal studies), thousands of new variables are usually introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables, differing only in a digit or two that usually signifies the year number of the project. So, every year there is an extensive task of comparing thousands of new variables to older variables for the sake of carrying forward key database elements corresponding to the previously defined variables. These elements can include the length of the variable, data type, format, discrete or continuous flag, and so on. In our SAS program, hash objects are used efficiently to cut down not only time, but also the number of DATA and PROC steps needed to accomplish the task; clean and lean code is much easier to understand. A macro is used to create the data set containing the new and older variables, and for a specific new variable, the FIND method of the hash object is used in a loop to find the match to the most recent older variable. What used to take around a dozen PROC SQL steps is now a single DATA step using hash tables.
Raghav Adimulam, Westat
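A minimal sketch of the lookup pattern described above, using hypothetical data sets: WORK.OLD_VARS holds prior-year variable metadata keyed by the variable stem, and WORK.NEW_VARS holds this year's variables.

data matched;
   /* define the metadata host variables in the PDV */
   if 0 then set work.old_vars(keep=stem varlen vartype varfmt);
   if _n_ = 1 then do;
      declare hash h(dataset:'work.old_vars');
      h.defineKey('stem');
      h.defineData('varlen','vartype','varfmt');
      h.defineDone();
   end;
   set work.new_vars;             /* one row per new variable, with its stem */
   if h.find() = 0 then output;   /* carry metadata forward on a match */
run;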
Increasingly, customers are using social media and other Internet-based applications such as review sites and discussion boards to voice their opinions and express their sentiments about brands. Such spontaneous and unsolicited customer feedback can provide brand managers with valuable insights about competing brands. There is a general consensus that listening and reacting to the voice of the customer is a vital component of brand management. However, the unstructured, qualitative, and textual nature of the data obtained from customers poses significant challenges for data scientists and business analysts. In this paper, we propose a methodology that can help brand managers visualize the competitive structure of a market based on an analysis of customer perceptions and sentiments obtained from blogs, discussion boards, review sites, and other similar sources. The brand map is designed to graphically represent the association of product features with brands, thus helping brand managers assess a brand's true strengths and weaknesses based on the voice of customers. Our multi-stage methodology uses the principles of topic modeling and sentiment analysis in text mining. The results of the text mining are analyzed using correspondence analysis to graphically represent the differentiating attributes of each brand. We empirically demonstrate the utility of our methodology by using data collected from Edmunds.com, a popular review site for car buyers.
Praveen Kumar Kotekal, Oklahoma State University
Amit K Ghosh, Cleveland State University
Goutam Chakraborty, Oklahoma State University
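A minimal sketch of the correspondence-analysis step, assuming a hypothetical brand-by-attribute frequency table produced by the text mining stage; the brand and attribute names are invented.

data brand_attr;
   input brand $ comfort mileage safety styling;
   datalines;
BrandA 120  80  60  90
BrandB  40 150  70  30
BrandC  60  50 140  80
;
run;

ods graphics on;
proc corresp data=brand_attr outc=coords short;  /* row/column coordinates */
   var comfort mileage safety styling;
   id brand;
run;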