Companies that offer subscription-based services (such as telecom and electric utilities) must evaluate the trade-off between month-to-month (MTM) customers, who yield a high margin at the expense of a shorter lifetime, and customers who commit to a longer-term contract in return for a lower price. The objective, of course, is to maximize Customer Lifetime Value (CLV). This trade-off must be evaluated not only at the time of customer acquisition, but throughout the customer's tenure, particularly for fixed-term contract customers whose contracts are due for renewal. In this paper, we present a mathematical model that optimizes CLV against this trade-off between margin and lifetime. The model is presented in the context of a cohort of existing customers, some of whom are MTM customers and others of whom are approaching contract expiration. The model optimizes the number of MTM customers to be swapped to fixed-term contracts, as well as the number of contract renewals that should be pursued, at various term lengths and price points, over a period of time. We estimate customer lifetime using discrete-time survival models with time-varying covariates related to contract expiration and product changes. An optimization model then finds the optimal trade-off between margin and customer lifetime. Although we specifically present the contract expiration case, this model can easily be adapted for customer acquisition scenarios as well.
Atul Thatte, TXU Energy
Goutam Chakraborty, Oklahoma State University
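As a sketch of the kind of objective involved (a standard discrete-time formulation, not necessarily the authors' exact model), the CLV of a customer can be written as

   CLV = \sum_{t=1}^{T} \frac{m_t \, S(t)}{(1+d)^t}

where m_t is the expected margin in month t, S(t) is the probability from the discrete-time survival model that the customer is still active in month t, and d is a monthly discount rate. The margin-versus-lifetime trade-off appears directly: moving an MTM customer to a fixed-term contract lowers m_t but raises S(t).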
The importance of econometrics in the analytics toolkit is increasing every day. Econometric modeling helps uncover structural relationships in observational data. This paper highlights the many recent changes to the SAS/ETS® portfolio that increase your power to explain the past and predict the future. Examples show how you can use Bayesian regression tools for price elasticity modeling, use state space models to gain insight from inconsistent time series, use panel data methods to help control for unobserved confounding effects, and much more.
Mark Little, SAS
Kenneth Sanford, SAS
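A minimal sketch of the panel data approach mentioned above, using PROC PANEL from SAS/ETS; the data set and variable names are hypothetical:

/* One-way fixed-effects panel regression, controlling for unobserved,
   time-invariant confounders at the firm level */
proc panel data=firm_sales;
   id firm_id year;                              /* cross section and time */
   model log_sales = log_price promo / fixone;   /* one-way fixed effects */
run;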
Improving tourists' satisfaction and their intention to revisit a festival is an ongoing area of interest to the tourism industry. Festival organizers strive to attract and retain attendees by investing heavily in marketing and promotion strategies for the festival. To meet this challenge, an advanced analytical model based on a data mining approach is proposed to answer the following research question: What are the most important factors that influence tourists' intentions to revisit the festival site? Cluster analysis, neural networks, decision trees, stepwise regression, polynomial regression, and support vector machines are applied in this study. The main goal is to determine what it takes not only to retain loyal attendees, but also to attract new attendees and encourage them to return to the site.
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Thanathorn Vajirakachorn, Dept. of Tourism Management, School of Business, University of the Thai Chamber of Commerce
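As a sketch of one of the techniques listed above, a stepwise regression of revisit intention on candidate factors might look like the following; the data set, variables, and entry/stay thresholds are hypothetical:

/* Stepwise selection of survey factors predicting revisit intention */
proc reg data=festival_survey;
   model revisit_intention = food_quality program_quality staff
                             facility value_for_money
         / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;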
The usefulness of administrative databases for understanding practice patterns in the real world has become increasingly apparent, and such understanding is essential in the current health-care environment. The Affordable Care Act has helped us to better understand the current use of technology and of different approaches to surgery. This paper describes a method for extracting specific information about surgical procedures from the Healthcare Cost and Utilization Project (HCUP) database, also referred to as the National (Nationwide) Inpatient Sample (NIS). The analyses provide a framework for comparing the different modalities of the surgical procedures of interest. Using an NIS database for a single year, we identify cohorts based on surgical approach by identifying the ICD-9 codes specific to robotic surgery, laparoscopic surgery, and open surgery. After we identify the appropriate procedure codes using an ARRAY statement, a similar array is created for the ICD-9 diagnosis codes. Any minimally invasive procedure (robotic or laparoscopic) that results in a conversion is flagged as a conversion. Comorbidities, identified by the ICD-9 codes that represent each subject's severity, are merged with the NIS inpatient core file. Using a FORMAT statement for all diagnosis variables, we create macros that can be regenerated for each type of complication. These macros are compiled in SAS® and stored in a library containing the four macros that are called to build the tables: they accept different macro variables, compute the frequencies for all cohorts, and create the table structure with each table's title and number. This paper describes a systematic method in SAS/STAT® 9.2 to extract the data from the NIS using the ARRAY statement for the specific ICD-9 codes, to format the extracted data for analysis, to merge the different NIS databases by procedure, and to use automated macros to generate the report.
Ravi Tejeshwar Reddy Gaddameedi, California State University, East Bay
Usha Kreaden, Intuitive Surgical
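A minimal sketch of the ARRAY-based flagging step, assuming the NIS core file's PR1-PR15 procedure variables and DX1-DX25 diagnosis variables; the code lists shown are illustrative, not exhaustive:

data cohorts;
   set nis.core;
   array prcode{15} pr1-pr15;    /* ICD-9 procedure codes, no decimal */
   array dxcode{25} dx1-dx25;    /* ICD-9 diagnosis codes             */
   robotic = 0; laparoscopic = 0; conversion = 0;
   do i = 1 to 15;
      if prcode{i} in: ('174') then robotic = 1;   /* 17.4x robotic-assisted */
      /* ...similar IN lists would flag laparoscopic and open codes... */
   end;
   do i = 1 to 25;
      if dxcode{i} = 'V6441' then conversion = 1;  /* lap converted to open */
   end;
   drop i;
run;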
Over the years, the SAS® Business Intelligence platform has proved its importance in the big data world with a suite of applications that enable us to efficiently process, analyze, and transform huge amounts of business data. Within the data warehouse universe, batch execution sits at the heart of SAS Data Integration technologies. Batches run day in and day out, and the current status of a batch is generally sent to the team or to the client as a static e-mail or report. From experience, we know that these reports don't provide much insight into the real bits and bytes of a batch run. Imagine if the status of a running batch were automatically captured in one central repository and presented in a web browser on your computer or iPad, without asking anybody to send reports, and with all post-batch queries answered automatically with a click. This paper presents a framework designed specifically to automate the reporting aspects of SAS batches. It is all about collecting statistics of the batch, so we call it 'BatchStats.'
Prajwal Shetty, Tesco HSC
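A minimal sketch of the central idea: each job appends one status row to a repository table that a web front end can read. All data set, library, and macro names here are hypothetical:

%macro batch_stats(job_name, status);
   data _stat;
      length job_name $40 status $10;
      job_name = "&job_name";
      status   = "&status";
      run_dt   = datetime();
      format run_dt datetime20.;
   run;
   /* append to the central repository table */
   proc append base=central.batch_stats data=_stat force;
   run;
%mend batch_stats;

/* called at the end of each batch job */
%batch_stats(daily_sales_load, SUCCESS);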
Respondent Driven Sampling (RDS) is both a sampling method and a data analysis technique. As a sampling method, RDS is a chain referral technique with strategic recruitment quotas and specific data gathering requirements. As with other chain referral techniques (for example, snowball sampling), the chains and waves are the starting point for analysis. But building the chains and waves can still be a daunting task because it involves many transpositions and merges. This paper provides an efficient method of using Base SAS® to build the chains and waves.
Wen Song, ICF International
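One possible sketch of wave building (not necessarily the author's method): seeds get wave 0, and each pass assigns wave w+1 to respondents whose recruiter sits in wave w. The data set rds(id, recruiter_id), with recruiter_id missing for seeds, is hypothetical:

data waves;
   set rds;
   if missing(recruiter_id) then wave = 0;   /* seeds */
   else wave = .;
run;

%macro build_waves(maxwave=20);
   %do w = 0 %to &maxwave;
      proc sql;
         create table _wave&w as
            select id from waves where wave = &w;
         update waves
            set wave = &w + 1
          where wave is missing
            and recruiter_id in (select id from _wave&w);
      quit;
   %end;
%mend build_waves;
%build_waves();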
As pollution and population continue to increase, new concepts of eco-friendly commuting are evolving. One of the emerging concepts is the bicycle sharing system: a short-term bike rental service at a moderate price that gives people the flexibility to rent a bike from one location and return it to another. This business is quickly gaining popularity all over the globe. In May 2011, there were only 375 bike rental schemes comprising nearly 236,000 bikes; in just a couple of years, this number jumped to 535 bike sharing programs with approximately 517,000 bikes, and the trend is expected to continue at a similar pace. Most businesses involved in bike rental face the challenge of balancing supply against inconsistent demand. The number of bikes needed on a particular day can vary with several factors, such as season, time, temperature, wind speed, humidity, holidays, and day of the week. In this paper, we solve this problem using SAS® Forecast Studio. By incorporating the effects of all the above factors and analyzing the demand trends of the last two years, we are able to accurately forecast the number of bikes needed on any day in the future. We are also able to perform scenario analysis to observe the effect of particular variables on demand.
Kushal Kathed, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ayush Priyadarshi, Oklahoma State University
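Forecast Studio is point-and-click, but a comparable model with regressors can be fit in code; this sketch uses PROC ARIMA rather than the authors' tool, and the data set and variable names are hypothetical:

/* Demand model with weather and calendar regressors; future values
   of the regressors must be present for the forecast horizon */
proc arima data=bike_rentals;
   identify var=rentals crosscorr=(temperature humidity windspeed
                                   holiday workingday);
   estimate input=(temperature humidity windspeed holiday workingday)
            p=1 q=1;
   forecast lead=30 out=rental_forecast;
run;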
In data mining, data preparation is the most crucial, most difficult, and longest part of the mining process, and it involves many steps: simple distribution analysis of the variables, diagnosis and reduction of multicollinearity among variables, imputation of missing values, and construction of categorical variables. In this presentation, we use data mining models in areas such as marketing, insurance, retail, and credit risk. We show how to implement data preparation in SAS® Enterprise Miner™ using different approaches, from simple code routines to complex processes involving statistical insights, variable clustering, variable transformation, graphical analysis, decision trees, and more.
Ricardo Galante, SAS
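Two of the preparation steps mentioned above, shown as minimal code sketches rather than Enterprise Miner nodes; data set and variable names are hypothetical:

/* Median imputation: replace only missing values */
proc stdize data=raw out=imputed reponly method=median;
   var income balance age;
run;

/* Variable clustering to diagnose and reduce multicollinearity */
proc varclus data=imputed maxeigen=0.7 short;
   var income balance age tenure num_products;
run;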
Customer Long-Term Value (LTV) is a concept that is readily explained at a high level to marketing management of a company, but its analytic development is complex. This complexity involves the need to forecast customer behavior well into the future. This behavior includes the timing, frequency, and profitability of a customer's future purchases of products and services. This paper describes a method for computing LTV. First, a multinomial logistic regression provides probabilities for time-of-first-purchase, time-of-second-purchase, and so on, for each customer. Then the profits for the first purchase, second purchase, and so on, are forecast but only after adjustment for non-purchaser selection bias. Finally, these component models are combined in the LTV formula.
Bruce Lund, Marketing Associates, LLC
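As a sketch of the first step, a multinomial (generalized logit) model for time-of-first-purchase can be fit with PROC LOGISTIC; the data set, outcome levels, and predictors here are hypothetical, not the author's:

/* Generalized logit model: probability of first purchase in each
   period versus no purchase */
proc logistic data=customers;
   class segment / param=ref;
   model time_first_purchase(ref='none') = recency tenure segment
         / link=glogit;
   output out=probs predprobs=individual;   /* per-level probabilities */
run;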
Many scientific and academic journals require that statistical tables be created in a specific format, with one of the most common formats being that of the American Psychological Association (APA). The APA publishes a substantial guide book to writing and formatting papers, including an extensive section on creating tables (Nichol 2010). However, the output generated by SAS® procedures does not match this style. This paper discusses techniques to change the SAS procedure output to match the APA guidelines using SAS ODS (Output Delivery System).
Vince DelGobbo, SAS
Peter Flom, Peter Flom Consulting
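A minimal sketch of the ODS approach, assuming a custom style derived from a shipped style; the specific attribute choices are illustrative of APA-like tables (horizontal rules only, no vertical rules), not a complete APA template:

proc template;
   define style styles.apa;
      parent = styles.journal;
      style table from table /
         frame = hsides      /* rules above and below the table only */
         rules = groups      /* horizontal rules at header and footer */
         cellspacing = 0;
   end;
run;

ods rtf file='apa_table.rtf' style=styles.apa;
proc means data=sashelp.class mean std;
   var height weight;
run;
ods rtf close;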
Sampling for audits and forensics presents special challenges. Each survey or sample item requires examination by a team of professionals, so sample size must be contained. Surveys involve estimating, not hypothesis testing, so power is not a helpful concept. Stratification and modeling are often required to keep sampling distributions from being skewed. A precision of alpha is not required to create a confidence interval of 1-alpha, but how small a sample is supportable? Often, replicated sampling is required to prove the applicability of the design. Given the robust, programming-oriented approach of SAS®, the random selection, stratification, and optimization techniques built into SAS can be used to bring transparency and reliability to the sample design process. While a sample that is used in a published audit or as a measure of financial damages must endure special scrutiny, it can be a rewarding process to design a sample whose performance you truly understand and which will stand up under a challenge.
Turner Bond, HUD-Office of Inspector General
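A minimal sketch of the stratified random selection built into SAS, via PROC SURVEYSELECT; the data set, stratum variable, and per-stratum sizes are hypothetical:

proc sort data=claims;
   by stratum;
run;

/* Simple random sampling within each stratum, with a fixed seed
   so the selection is reproducible under scrutiny */
proc surveyselect data=claims out=audit_sample
                  method=srs sampsize=(30 50 20) seed=20150101;
   strata stratum;
run;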
The measurement of factors influencing consumer purchasing decisions is of interest to all manufacturers of goods, the retailers who sell them, and the consumers who buy them. In the past decade, conjoint analysis has become one of the most commonly used statistical techniques for analyzing the decisions, or trade-offs, consumers make when they purchase products. Although recent years have seen increased use of conjoint analysis and conjoint software, little work has spelled out a systematic procedure for doing a conjoint analysis or using conjoint software. This paper reviews basic conjoint analysis concepts, describes the mathematical and statistical framework on which conjoint analysis is built, and introduces the TRANSREG and PHREG procedures, their syntax, and the output they generate, using simplified real-life data examples. The paper concludes by highlighting some of the substantive issues related to applying conjoint analysis in a business environment and the available autocall macros in SAS/STAT®, SAS/IML®, and SAS/QC® software that can handle more complex conjoint designs and analyses. The paper will benefit basic SAS® users as well as statisticians and research analysts in every industry, especially those in marketing and advertising.
Delali Agbenyegah, Alliance Data Systems
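A minimal metric conjoint sketch with PROC TRANSREG, estimating part-worth utilities from ratings of product profiles; the data set and attribute names are hypothetical:

proc transreg data=profiles utilities short;
   model identity(rating) = class(brand price package / zero=sum);
   output out=utils p ireplace;   /* part-worths and predicted ratings */
run;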
Random Forest (RF) is a trademarked term for an ensemble approach to decision trees, introduced by Leo Breiman in 2001. From our familiarity with decision trees, one of the intuitive, easily interpretable models that divide the feature space with recursive partitioning and use sets of binary rules to classify the target, we also know some of their limitations, such as over-fitting and high variance. RF uses decision trees but takes a different approach: instead of growing one deep tree, it aggregates the output of many shallow trees to make a strong classifier. RF significantly improves classification accuracy by growing an ensemble of trees and letting them vote for the most popular class. Unlike a single decision tree, RF is robust against over-fitting and high variance because it randomly selects a subset of variables at each split node. This paper demonstrates this simple yet powerful classification algorithm by building an income-level prediction system. Data extracted from the 1994 Census Bureau database was used for this study. The data set comprises information about 14 key attributes for 45,222 individuals. Using SAS® Enterprise Miner™ 13.1, models such as random forest, decision tree, probability decision tree, gradient boosting, and logistic regression were built to classify the income level (>50K or <=50K) of the population. The results showed that the random forest model was the best model for this data based on the misclassification rate criterion. The RF model predicts the income-level group of the individuals with an accuracy of 85.1%, with the predictors capturing specific characteristic patterns. This demonstration using SAS® can lead to useful insights into RF for solving classification problems.
Narmada Deve Panneerselvam, Oklahoma State University
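Outside the Enterprise Miner interface, a comparable random forest can be fit in code with the high-performance PROC HPFOREST; the variable lists below are an illustrative subset of the census attributes, not the full model:

proc hpforest data=census maxtrees=100 vars_to_try=4;
   target income / level=binary;
   input age education_num hours_per_week    / level=interval;
   input workclass occupation marital_status / level=nominal;
run;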
SAS® Visual Analytics is a product that enables interactive analysis of data using a visual approach. This paper discusses architecture options for configuring a SAS Visual Analytics installation that serves multiple customer groups in parallel. The overall objective is to create an environment that scales both with the volume of data and with the number of customer groups. The paper explains several concepts for serving multiple customer groups and the pros and cons of each approach.
Jan Bigalke, Allianz Managed Operations and Services SE
Working with multiple data sources in SAS® was not straightforward until PROC FEDSQL was introduced in the SAS® 9.4 release. Federated Query Language, or FEDSQL, is a vendor-independent language that provides a common SQL syntax for communicating with multiple relational databases without worrying about vendor-specific SQL syntax; PROC FEDSQL is the SAS implementation of the FEDSQL language. PROC FEDSQL enables us to write federated queries that join tables from different databases in a single query, without loading the tables into SAS individually and combining them with DATA steps and PROC SQL statements. The objective of this paper is to demonstrate how PROC FEDSQL fetches data from multiple data sources, such as a Microsoft SQL Server database, a MySQL database, and a SAS data set, and runs federated queries across all of them. Other powerful features of PROC FEDSQL, such as transactions and the FEDSQL pass-through facility, are discussed briefly.
Zabiulla Mohammed, Oklahoma State University
Ganesh Kumar Gangarajula, Oklahoma State University
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
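A minimal sketch of a federated join across a SQL Server libref, a MySQL libref, and a Base SAS library; all connection details, paths, and table names are placeholders:

libname mssql odbc dsn=sqlserver_dsn;
libname mydb  mysql server=dbhost database=sales
              user=reader password=XXXX;
libname local 'C:\data';

/* One query joins tables from both databases; the result lands
   in the SAS library */
proc fedsql;
   create table local.combined as
   select a.order_id, a.amount, b.region
   from   mssql.orders a
          inner join mydb.customers b
          on a.cust_id = b.cust_id;
quit;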