A propensity score is the probability that an individual will be assigned to a condition or group, given a set of baseline covariates when the assignment is made. For example, the type of drug treatment given to a patient in a real-world setting might be chosen non-randomly, based on the patient's age, gender, geographic location, and socioeconomic status when the drug is prescribed. Propensity scores are used in many different types of observational studies to reduce selection bias. Subjects assigned to different groups are matched on these propensity score probabilities, rather than on the values of individual covariates. Although the underlying statistical theory behind the use of propensity scores is complex, implementing propensity score matching with SAS® is relatively straightforward. An output data set of each subject's propensity score can be generated with SAS using PROC LOGISTIC, and a generalized SAS macro can perform optimized N:1 propensity score matching of subjects assigned to different groups using the radius method. Matching can be optimized either for the number of matches within the maximum allowable radius or for the closeness of the matches within the radius. This presentation provides the general PROC LOGISTIC syntax to generate propensity scores, gives an overview of different propensity score matching techniques, and discusses how to use the SAS macro for optimized propensity score matching using the radius method.
Kathy Fraeman, Evidera
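A minimal sketch of the scoring step described above; the data set and variable names (patients, treatment, ses_index) are hypothetical:

proc logistic data=patients;
   class gender region / param=ref;
   model treatment(event='1') = age gender region ses_index;
   output out=pscores pred=pscore;   /* pscore = P(treatment | baseline covariates) */
run;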
Disseminating data to potential collaborators can be essential in the development of models, algorithms, and innovative research opportunities. However, it is often time-consuming to get approval to access sensitive data such as health data. An alternative to sharing the real data is to use synthetic data, which has similar properties to the original data but does not disclose sensitive information. The collaborators can use the synthetic data to make preliminary models or to work out bugs in their code while waiting to get approval to access the original data. A data owner can also use the synthetic data to crowdsource solutions from the public through competitions like Kaggle and then test those solutions on the original data. This paper implements, as a SAS® macro, a method that generates fully synthetic data matching the statistical moments of the true data up to a specified moment order. Variables in the synthetic data set are of the same data type as the true data (for example, integer, binary, continuous). The implementation uses the linear programming solver within a column generation algorithm and the mixed integer linear programming solver from the OPTMODEL procedure in SAS/OR® software. The COFOR statement in PROC OPTMODEL automatically parallelizes a portion of the algorithm. This paper demonstrates the method by using the Sashelp.Heart data set to generate fully synthetic data copies.
Brittany Bogle, University of North Carolina at Chapel Hill
Jared Erickson, SAS
The hospital Medicare readmission rate has become a key indicator for measuring the quality of health care in the US. This rate is currently used by major health-care stakeholders including the Centers for Medicare and Medicaid Services (CMS), the Agency for Healthcare Research and Quality (AHRQ), and the National Committee for Quality Assurance (NCQA) (Fan and Sarfarazi, 2014). Although many papers have been written about how to calculate readmissions, this paper provides updated code that includes ICD-10 (International Classification of Diseases, Tenth Revision) codes and offers a novel and comprehensive approach using SAS® DATA step options and PROC SQL. We discuss: 1) de-identifying patient data; 2) calculating sequential admissions; and 3) subsetting criteria required to report CMS 30-day readmissions. In addition, this paper demonstrates: 1) using the Output Delivery System (ODS) to create a labeled and de-identified data set; 2) macro variables to examine data quality; and 3) summary statistics for further reporting and analysis.
Karen Wallace, Centene Corporation
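A minimal sketch of the sequential-admissions logic; the data set and variable names are hypothetical:

proc sort data=admits;
   by patient_id admit_date;
run;

data readmit_flags;
   set admits;
   by patient_id;
   days_since_discharge = admit_date - lag(discharge_date);
   if first.patient_id then days_since_discharge = .;   /* no prior stay for this patient */
   readmit_30 = (0 <= days_since_discharge <= 30);      /* 30-day readmission flag */
run;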
Healthcare is weird. Healthcare data is even more so. The digitization of healthcare data that describes the patient experience is a modern phenomenon, with most healthcare organizations still in their infancy. While the business of healthcare is already a century old, most organizations have focused their efforts on the financial aspects of healthcare and not on stakeholder experience or clinical outcomes. Think of the workflow that you might have experienced such as scheduling an appointment through doctor visits, obtaining lab tests, or obtaining prescriptions for interventions such as surgery or physical therapy. The modern healthcare system creates a digital footprint of administrative, process, quality, epidemiological, financial, clinical, and outcome measures, which range in size, cleanliness, and usefulness. Whether you are new to healthcare data or are looking to advance your knowledge of healthcare data and the techniques used to analyze it, this paper serves as a practical guide to understanding and using healthcare data. We explore common methods for how we structure and access data, discuss common challenges such as aggregating data into episodes of care, describe reverse engineering real world events, and talk about dealing with the myriad of unstructured data found in nursing notes. Finally, we discuss the ethical uses of healthcare data and the limits of informed consent, which are critically important for those of us in analytics.
Greg Nelson, Thotwave Technologies, LLC.
Visualization is a critical part to turn data into knowledge. A customized graph is essential to make data visualization meaningful, powerful, and interpretable. Furthermore, customizing grouped data into a desired layout with specific requirements such as clusters, colors, symbols, and patterns for each group can be challenging. This paper provides a start-from-scratch, step-by-step solution to create a customized graph for grouped data using the Graph Template Language (GTL). From analyzing the data to creating the target graph with the tools and options that are available with GTL, this paper demonstrates GTL is a powerful and flexible tool for creating a customized, complex graph.
Elva Chen, Pharmacyclics
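A start-from-scratch GTL sketch of the pattern the paper builds on, using a hypothetical study data set:

proc template;
   define statgraph grouped_scatter;
      begingraph;
         entrytitle "Response by Dose, Grouped by Treatment";
         layout overlay;
            scatterplot x=dose y=response / group=treatment name="sp"
               markerattrs=(size=9px symbol=circlefilled);
            discretelegend "sp";
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=study template=grouped_scatter;
run;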
SAS/ACCESS® software grants access to data in third-party database management systems (DBMS), but how do you access data in a DBMS that is not supported by SAS/ACCESS products? The introduction of the GROOVY procedure in SAS® 9.3 lets you retrieve this formerly inaccessible data through a JDBC connection. Groovy is an object-oriented, dynamic programming language executed on the Java Virtual Machine (JVM). Using Microsoft Azure HDInsight as an example, this paper demonstrates how to access and read data into a SAS data set using PROC GROOVY and a JDBC connection.
Lilyanne Zhang, SAS
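A hedged sketch of the PROC GROOVY pattern; the driver JAR, JDBC URL, and credentials are placeholders, and the Groovy body is only outlined:

proc groovy;
   add classpath="/path/to/hive-jdbc-driver.jar";   /* placeholder driver JAR */
   submit;
      import java.sql.DriverManager
      def conn = DriverManager.getConnection(
         "jdbc:hive2://example.azurehdinsight.net:443/default", "user", "pw")
      // ... read a ResultSet and write rows to a text file for SAS to import ...
      conn.close()
   endsubmit;
quit;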
SAS® Visual Analytics provides a robust platform for business intelligence through advanced, high-end dashboards. In today's technology era, dashboards not only help in gaining insight into an organization's operations, but also track its key performance indicators. In this paper, I discuss five important and frequently used objects in SAS Visual Analytics. These objects are used to get the most out of dashboards in an effective and efficient way. This paper covers the use of dates (as a format) in the date slider and gauges, cascading filters, custom graphs, linking reports within sections of the same report or with other reports, and associating buttons with graphs for dynamic functionality.
Abhilasha Tiwari, Accenture
As you know, real world data (RWD) provides highly valuable and practical insights. But as valuable as RWD is, it still has limitations. It is encounter-based, and we are largely blind to what happens between encounters in the health-care system. The encounters generally occur in a clinical setting that might not reflect actual patient experience. Many of the encounters are subjective interviews, observations, or self-reports rather than objective data. Information flow can be slow (even real time is not fast enough in health care anymore). And some data that could be transformative cannot be captured currently. Select Internet of Things (IoT) data can fill the gaps in our current RWD for certain key conditions and provide missing components that are key to conducting Analytics of Healthcare Things (AoHT), such as direct, objective measurements; data collected in usual patient settings rather than artificial clinical settings; data collected continuously in a patient's setting; insights that carry greater weight in Regulatory and Payer decision-making; and insights that lead to greater commercial value. Teradata has partnered with an IoT company whose technology generates unique data for conditions impacted by mobility or activity. This data can fill important gaps and provide new insights that can help distinguish your value in your marketplace. Join us to hear details of successful pilots that have been conducted as well as ongoing case studies.
Joy King, Teradata
Customer churn is an important area of concern that affects not just the growth of your company, but also its profitability. Conventional survival analysis can provide a customer's likelihood to churn in the near term, but it does not take into account the lifetime value of the higher-risk churn customers you are trying to retain. Not all customers are equally important to your company. Recency, frequency, and monetary (RFM) analysis can help companies identify customers that are most important and most likely to respond to a retention offer. In this paper, we use the IML and PHREG procedures to combine the RFM analysis and survival analysis in order to determine the optimal number of higher-risk and higher-value customers to retain.
Bo Zhang, IBM
Liwei Wang, Pharmaceutical Product Development Inc
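A minimal sketch of the survival-analysis piece, with hypothetical variables; the RFM summaries enter the Cox model as covariates:

proc phreg data=customers;
   model tenure*churned(0) = recency frequency monetary;
   baseline out=churn_risk covariates=customers survival=s_hat;
run;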
The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable application of text analytics techniques to the publicly available CFPB data. Specifically, we use SAS® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS® Visual Analytics and SAS® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.
Tom Sabo, SAS
The singular spectrum analysis (SSA) method of time series analysis applies nonparametric techniques to decompose time series into principal components. SSA is particularly valuable for long time series, in which patterns (such as trends and cycles) are difficult to visualize and analyze. An important step in SSA is determining the spectral groupings; this step can be automated by analyzing the w-correlations of the spectral components. This paper provides an introduction to singular spectrum analysis and demonstrates how to use SAS/ETS® software to perform it. To illustrate, monthly data on temperatures in the United States over the last century are analyzed to discover significant patterns.
Michael Leonard, SAS
Bruce Elsheimer, SAS
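A hedged sketch of SSA in PROC TIMESERIES for a hypothetical monthly temperature series; the window length and groupings shown are illustrative:

proc timeseries data=ustemps outssa=ssa_parts;
   id date interval=month;
   var temperature;
   ssa / length=120 groups=(1)(2 3);   /* separate trend-like and cyclical components */
run;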
A lot of time and effort goes into creating presentations or dashboards for the purposes of management business reviews. Data for the presentation is produced from a variety of tools, and the output is cut and pasted into Microsoft PowerPoint or Microsoft Excel. Time is spent not only on the data preparation and reporting, but also on the finishing and touching up of these presentations. In previous years, SAS® Global Forum authors have described the automation capabilities of SAS® and Microsoft Office. The default look and feel of SAS output in Microsoft PowerPoint and Microsoft Excel is not always adequate for the more polished requirements of an executive presentation. This paper focuses on how to combine the capabilities of SAS® Enterprise Guide®, SAS® Visual Analytics, and Microsoft PowerPoint into a finished, professional presentation. We will build and automate a beautiful finished end product that can be refreshed by anyone with the click of a mouse.
Dwight Fowler, SAS
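One building block for this kind of automation is the ODS POWERPOINT destination; a minimal sketch with a placeholder path:

ods powerpoint file="/reports/monthly_review.pptx" style=powerpointlight;
title "Monthly Sales Review";
proc sgplot data=sashelp.prdsale;
   vbar country / response=actual stat=sum;
run;
ods powerpoint close;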
Understanding customer behavior profiles is of great value to companies. Customer behavior is influenced by a multitude of elements: some are capricious, presumably resulting from environmental, economic, and other factors, while others are more fundamentally aligned with value and belief systems. In this paper, we use unstructured textual cheque card data to model and estimate latent spending behavioral profiles of banking customers. These models give insight into unobserved spending habits and patterns. SAS® Text Miner is used in an atypical manner to determine the buying segments of customers and the latent buying profile using a clustering approach. Businesses benefit from the behavioral spend model in several ways. The model can be used for market segmentation, where each cluster is seen as a target marketing segment; for leads optimization; or for product offerings, where products are specifically compiled to align with each customer's requirements. It can also be used to predict future spend or to align customer needs with business offerings, supported by signing customers onto loyalty programs. This unique method of determining the spend behavior of customers makes it ideal for companies driving retention and loyalty in their customers.
Amelia Van Schalkwyk, University of Pretoria
It's essential that SAS® users enhance their skills to implement best-practice programming techniques when using Base SAS® software. This presentation illustrates core concepts with examples to ensure that code is readable, clearly written, understandable, structured, portable, and maintainable. Attendees learn how to apply good programming techniques including implementing naming conventions for data sets, variables, programs, and libraries; code appearance and structure using modular design, logic scenarios, controlled loops, subroutines and embedded control flow; code compatibility and portability across applications and operating platforms; developing readable code and program documentation; applying statements, options, and definitions to achieve the greatest advantage in the program environment; and implementing program generality into code to enable its continued operation with little or no modifications.
Kirk Paul Lafler, Software Intelligence Corporation
Data that are gathered in modern data collection processes are often large and contain geographic information that enables you to examine how spatial proximity affects the outcome of interest. For example, in real estate economics, the price of a housing unit is likely to depend on the prices of housing units in the same neighborhood or nearby neighborhoods, either because of their locations or because of some unobserved characteristics that these neighborhoods share. Understanding spatial relationships and being able to represent them in a compact form are vital to extracting value from big data. This paper describes how to glean analytical insights from big data and discover their big value by using spatial econometric methods in SAS/ETS® software.
Guohui Wu, SAS
Jan Chvosta, SAS
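A hedged sketch of a spatial autoregressive fit with the SPATIALREG procedure; the data set, weights-matrix data set, and variables are hypothetical, and the syntax should be checked against the procedure's documentation:

proc spatialreg data=homes Wmat=neighbors;
   model price = sqft age / type=SAR;   /* spatial autoregressive model */
run;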
Whether you are calculating a credit risk, a health risk, or something entirely different, you need instant, on-the-fly risk score calculation across multiple industries. This paper demonstrates how you can produce individualized risk scores through interactive dashboards. Your risk scores are backed by powerful SAS® analytics because they leverage score code that you produce in SAS® Visual Statistics. Advanced topics, including the use of calculated items and parameters in your dashboards, as well as how to develop SAS® Stored Processes capable of accepting parameters that are passed through your SAS® Visual Analytics Dashboard are covered in detail.
Eli Kovick, SAS
This presentation discusses the options for including continuous covariates in regression models. In his book, 'Clinical Prediction Models,' Ewout Steyerberg presents a hierarchy of procedures for continuous predictors, starting with dichotomizing the variable and moving to modeling the variable using restricted cubic splines or using a fractional polynomial model. This presentation discusses all of the choices, with a focus on the last two. Restricted cubic splines express the relationship between the continuous covariate and the outcome using a set of cubic polynomials, which are constrained to meet at pre-specified points, called knots. Between the knots, each curve can take on the shape that best describes the data. A fractional polynomial model is another flexible method for modeling a relationship that is possibly nonlinear. In this model, polynomials with noninteger and negative powers are considered, along with the more conventional square and cubic polynomials, and the small subset of powers that best fits the data is selected. The presentation describes and illustrates these methods at an introductory level intended to be useful to anyone who is familiar with regression analyses.
Ruth Croxford, Institute for Clinical Evaluative Sciences
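A minimal sketch of a restricted (natural) cubic spline via the EFFECT statement, with hypothetical data and five percentile-placed knots:

proc logistic data=study;
   effect spl_age = spline(age / naturalcubic basis=tpf(noint)
                                 knotmethod=percentiles(5));
   model outcome(event='1') = spl_age;
run;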
An interactive voice response (IVR) system is a powerful tool that automates routine inbound call tasks. Companies leverage this system and make substantial savings by cutting down call center costs while customers from do-it-yourselfers to non-tech-savvies take advantage of this technology rather than wait in line to speak to a customer service representative (CSR). The flip side of the coin is that customers often see IVR as a barrier to overcome in order to talk to a real person. So it is important that IVR is managed in such a way that it is mutually beneficial for both a business and its customers. If managing IVR is critical, then measuring Customer Satisfaction (CSAT) scores is paramount as it helps in understanding customers better. The first section of this paper discusses analysis of different use cases on how CSAT scores correlate with customers' journeys inside IVR. West Corporation's leading financial services client offers a survey to their customers, and customers rate questions on a scale of 1 to 10 based on their IVR experience (10 being extremely satisfied). Analysis of survey ratings using SAS® helped Operations understand challenges faced by customers traversing different sections of the IVR. The second section of the paper discusses how the research helped Operations to identify a population specification error that occurred while surveying customers. The error was rectified, and IVR CSAT scores improved by 3%.
Vinoth Kumar Raja, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
Computer and video games are complex these days. Events in video games are in some cases recorded automatically in text files, creating a history or story of game play. There are countable items in these event records that can be used as data for statistics and other types of modeling. This E-Poster shows you how to statistically analyze text files for video game events using SAS®. Two games are analyzed. EVE Online, a massive multi-user online role-playing spaceship game, is one. The other game is Ingress, a cell phone game that combines exercise with a GPS and real-world environments. In both examples, the techniques involve parsing large amounts of text data to examine recurring patterns in text that describe events in the game play.
Peter Timusk, Statistics Canada
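A minimal sketch of the parsing approach, assuming a hypothetical log format and event vocabulary:

data game_events;
   infile "/logs/eve_combat.txt" truncover;
   length line $500 event $40;
   input line $char500.;
   if prxmatch('/\b(hits|misses|warps|docks)\b/i', line) then do;
      event = prxchange('s/.*\b(hits|misses|warps|docks)\b.*/$1/i', 1, line);
      output;                          /* keep only lines describing a known event */
   end;
run;

proc freq data=game_events order=freq;
   tables event;
run;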
This presentation describes an ongoing effort to standardize and simplify SAS® coding across a rapidly growing analytics team in the health care industry. The number of SAS analysts in Kaiser Permanente's Data and Information Management Enhancement (DIME) department has nearly doubled in the past two years, going from approximately 20 to 40 analysts. The level of experience and technical skill varies greatly within the department. Analysts are required to provide quick turn-around on a large volume of analytical requests in this dynamic and high-demand environment. An effort was initiated in 2016 to create a SAS® Enterprise Guide® Template to standardize and simplify SAS coding across the department. The SAS Enterprise Guide® template is designed to be a standard project file containing predefined code shells and examples that can be used as a basis for all new SAS Enterprise Guide® projects. The primary goals of the template are to: 1) Effectively onboard new analysts to department standards; 2) Increase the efficiency of SAS development; 3) Bring consistency to how SAS is used; and 4) Simplify the transitioning of SAS jobs to the department's Production Support team. This presentation focuses on the process in which the template was initiated, drafted, and socialized across a large and diverse team of SAS analysts. It also highlights plans for ongoing maintenance of and improvements to the original template.
Amanda Pasch, Kaiser Permanente
Chris Koppenhafer, Kaiser Permanente
Detection and adjustment of structural breaks are an important step in modeling time series and panel data. In some cases, such as studying the impact of a new policy or an advertising campaign, structural break analysis might even be the main goal of a data analysis project. In other cases, the adjustment of structural breaks is a necessary step to achieve other analysis objectives, such as obtaining accurate forecasts and effective seasonal adjustment. Structural breaks can occur in a variety of ways during the course of a time series. For example, a series can have an abrupt change in its trend, its seasonal pattern, or its response to a regressor. The SSM procedure in SAS/ETS® software provides a comprehensive set of tools for modeling different types of sequential data, including univariate and multivariate time series data and panel data. These tools include options for easy detection and adjustment of a wide variety of structural breaks. This paper shows how you can use the SSM procedure to detect and adjust structural breaks in many different modeling scenarios. Several real-world data sets are used in the examples. The paper also includes a brief review of the structural break detection facilities of other SAS/ETS procedures, such as the ARIMA, AUTOREG, and UCM procedures.
Rajesh Selukar, SAS
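A hedged sketch of break detection in PROC SSM; the trend type and the placement of the CHECKBREAK option are recalled from the documentation and should be verified against it:

proc ssm data=series;
   id date interval=month;
   trend tr(ll) checkbreak;   /* request structural-break checks on the trend */
   irregular wn;
   model y = tr wn;
run;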
In highly competitive markets, the response rates to economically reasonable marketing campaigns are as low as a few percentage points or less. In that case, the direct measure of the delta between the average key performance indicators (KPIs) of the treated and control groups is heavily 'contaminated' by non-responders. This paper focuses on measuring promotional marketing campaigns with two properties: (1) price discounts or other benefits, which change the profitability of the targeted group for at least the promotion periods, and (2) the impact of self-responders. The paper addresses the decomposition of the KPI measurement between responders and non-responders for both groups. Assuming that customers who rejected promotional offers will not change their behavior and that non-responders of both treated and control groups are not biased, the delta of the average KPIs for non-responders should be equal to zero. In practice, this component might deviate significantly from zero. That can be caused by an initial nonzero delta of KPI values despite a random split between groups, or by the existence of outliers, especially for non-balanced campaigns. Addressing the deviation of the delta from zero might require running additional statistical tests that compare not just the means but also the distributions of the KPIs. The decomposition of the measurement between responders and non-responders for both groups can then be used in differential modeling.
Alex Glushkovsky, BMO Financial Group
Matthew Fabian, BMO Financial Group
SAS® has a very efficient and powerful way to get distances between an event and a customer. Using the tables and code located at http://support.sas.com/rnd/datavisualization/mapsonline/html/geocode.html#street, you can append latitude and longitude values to the addresses that you have for your events and customers. Once you have the tables downloaded from SAS, and you have run the code to get them into SAS data sets, this paper helps guide you through the rest using PROC GEOCODE and the GEODIST function. This can help you determine to whom to market an event, and you can see how far a client is from one of your facilities.
Jason O'Day, US Bank
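A minimal sketch of the distance step, with hypothetical variables and example event coordinates:

data distances;
   set customers;             /* contains cust_lat and cust_long */
   event_lat  = 35.7796;      /* example event location */
   event_long = -78.6382;
   miles = geodist(cust_lat, cust_long, event_lat, event_long, 'DM');
   /* 'D' = coordinates in degrees, 'M' = distance in miles */
run;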
Whether you have been programming in SAS® for years, or you are new to it, or you have dabbled with SAS® Enterprise Guide® before, this hands-on workshop sheds some light on the depth, breadth, and power of the SAS Enterprise Guide environment. With all the demands on your time, you need powerful tools that are easy to learn and that deliver end-to-end support for your data exploration, reporting, and analytics needs. Included in this workshop are data exploration tools; formatting code (cleaning up after your coworkers); enhanced programming environment (and how to calm it down); easily creating reports and graphics; producing the output formats you need (XLS, PDF, RTF, HTML); workspace layout; and productivity tips. This workshop uses SAS Enterprise Guide 7.1, but most of the content is applicable to earlier versions.
Marje Fecht
Customer feedback is a critical aspect of businesses in today's world as it is invaluable in determining what customers like and dislike about the business' service. This loop of regularly listening to customers' voice through survey comments and improving services based on it leads to better business and, more importantly, to an enhancement in customer experience. The challenge is to classify and analyze these unstructured text comments to gain insights and to focus on areas of improvement. The purpose of this paper is to illustrate how text mining in SAS® Enterprise Miner 14.1 helped one of our clients, a leading financial services company, convert their customers' problems into opportunities. The customers' feedback pertaining to their experience with an interactive voice response (IVR) system is collected by an enterprise feedback management (EFM) company. The comments are then split into two groups, which helps us differentiate customer opinions. This grouping is based on customers who have given a rating of 0 to 6 and those who have given a rating of 9 to 10 on a Likert scale of 0 to 10 (10 being extremely satisfied) in the survey questionnaire. Text mining is performed on both these groups, and an algorithm creates clusters that are consequently used to segment customers based on opinions they are interested in voicing. Furthermore, sentiment scores are calculated for each one of the segments. The scores classify the polarity of customer feedback and prioritize the problems the client needs to focus on.
Vinoth Kumar Raja, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
Randomized control trials have long been considered the gold standard for establishing causal treatment effects. Can causal effects be reasonably estimated from observational data too? In observational studies, you observe treatment T and outcome Y without controlling confounding variables that might explain the observed associations between T and Y. Estimating the causal effect of treatment T therefore requires adjustments that remove the effects of the confounding variables. The new CAUSALTRT (causal-treat) procedure in SAS/STAT® 14.2 enables you to estimate the causal effect of a treatment decision by modeling either the treatment assignment T or the outcome Y, or both. Specifically, modeling the treatment leads to the inverse probability weighting methods, and modeling the outcome leads to the regression methods. Combined modeling of the treatment and outcome leads to doubly robust methods that can provide unbiased estimates for the treatment effect even if one of the models is misspecified. This paper reviews the statistical methods that are implemented in the CAUSALTRT procedure and includes examples of how you can use this procedure to estimate causal effects from observational data. This paper also illustrates some other important features of the CAUSALTRT procedure, including bootstrap resampling, covariate balance diagnostics, and statistical graphics.
Michael Lamm, SAS
Yiu-Fai Yung, SAS
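A minimal sketch of a doubly robust analysis with PROC CAUSALTRT, using hypothetical variables:

proc causaltrt data=observational method=aipw;
   class treat;
   psmodel treat(ref='0') = age sex comorbidity;   /* treatment assignment model */
   model y = age sex comorbidity;                  /* outcome model */
run;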
The British Airways (BA) revenue management team is responsible for surfacing prices made available in the market with the objective of maximizing revenue from our 40,000,000 passenger journeys. BA is currently working to understand how competitor data can be exploited to help facilitate better decision making. Due to the low level of aggregation, competitor data is too large (and consequently too expensive) to store on conventional relational databases. Therefore, it has been stored on a small Hadoop installation at BA. Thanks to SAS/ACCESS® Interface to Hadoop, we have been able to run our complex algorithms on these large data sets without changing the way we work and whilst exploiting the full capabilities of SAS®.
Kayne Putman, British Airways
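A hedged sketch of the SAS/ACCESS Interface to Hadoop connection; every connection value below is a placeholder:

libname hdp hadoop server="hadoop-head.example.com" port=10000
        user=sasuser password=XXXXXX schema=competitor;

proc means data=hdp.fare_snapshots sum;   /* hypothetical Hive table */
   class route;
   var seats_sold;
run;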
Longitudinal count data arise when a subject's outcomes are measured repeatedly over time. Repeated measures count data have an inherent within subject correlation that is commonly modeled with random effects in the standard Poisson regression. A Poisson regression model with random effects is easily fit in SAS® using existing options in the NLMIXED procedure. This model allows for overdispersion via the nature of the repeated measures; however, departures from equidispersion can also exist due to the underlying count process mechanism. We present an extension of the cross-sectional COM-Poisson (CMP) regression model established by Sellers and Shmueli (2010) (a generalized regression model for count data in light of inherent data dispersion) to incorporate random effects for analysis of longitudinal count data. We detail how to fit the CMP longitudinal model via a user-defined log-likelihood function in PROC NLMIXED. We demonstrate the model flexibility of the CMP longitudinal model via simulated and real data examples.
Darcy Morris, U.S. Census Bureau
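A hedged sketch of the user-defined CMP log likelihood in PROC NLMIXED with a random intercept; the normalizing constant is truncated at 100 terms, and the variable names are hypothetical:

proc nlmixed data=long_counts;
   parms b0=0 b1=0 lognu=0 s2u=1;
   lambda = exp(b0 + b1*x + u);
   nu     = exp(lognu);                /* nu = 1 recovers the Poisson model */
   z = 0;                              /* normalizing constant Z(lambda, nu) */
   do j = 0 to 100;
      z = z + exp(j*log(lambda) - nu*lgamma(j+1));
   end;
   ll = y*log(lambda) - nu*lgamma(y+1) - log(z);
   model y ~ general(ll);
   random u ~ normal(0, s2u) subject=id;
run;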
Session 1422-2017:
Flags Flying: Avoiding Consistently Penalized Defenses in NFL Fantasy Football
In fantasy football, it is a relatively common strategy to rotate which team's defense a player uses based on some combination of favorable/unfavorable player matchups, recent performance, and projection of expected points. However, there is danger in this strategy because defensive scoring volatility is high, and any team has the possibility of turning in a statistically bad performance in a given week. This paper uses data mining techniques to identify which National Football League (NFL) teams give up high numbers of defensive penalties on a week-to-week basis, and to what degree those high-penalty games correlate with poor team defensive fantasy scores. Examining penalty count and penalty yards allowed totals, we can narrow down which teams are consistently hurt by poor technique and find the correlation between games with high penalty totals and their respective fantasy football scores. By doing so, we seek to find which teams should be avoided in fantasy football due to their likelihood of poor performance.
Robert Silverman, Franklin & Marshall College
Are you a marketing analyst who speaks SAS®? Congratulations, you are in high demand! Or are you? Marketing analysts with programming skills are critical today. The ability to extract large volumes of data, massage it into a manageable format, and display it simply are necessary skills in the world of big data. However, programming skills are not nearly enough. In fact, some marketing managers are putting less and less weight on them and are focusing more on the softer skills that they require. This session will help ensure that you are not left out. In this session, Emma Warrillow shares why being a good programmer is only the beginning. She provides practical tips on moving from someone who is good at coding to becoming a true collaborator with marketing, taking your marketing analytics to the next level. In 2016, Emma Warrillow's presentation at SAS® Global Forum was very well received (http://blogs.sas.com/content/sgf/2016/04/21/always-be-yourself-unless-you-can-be-a-unicorn/). In this follow-up, she revisits some of the highlights from 2016 and shares some new ideas. You can be sure of an engaging code-free session!
Emma Warrillow, Data Insight Group Inc. (DiG)
Session 1529-2017:
Getting Started with Bayesian Analytics
This presentation gives a brief introduction to Bayesian analysis within SAS®. Participants will learn the difference between Bayesian and classical statistics and be introduced to PROC MCMC.
Danny Modlin, SAS
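A minimal PROC MCMC sketch of the kind such an introduction typically shows: Bayesian simple linear regression with diffuse priors on hypothetical data:

proc mcmc data=study nbi=2000 nmc=20000 seed=27513 outpost=posterior;
   parms beta0 0 beta1 0 sigma2 1;
   prior beta0 beta1 ~ normal(0, var=1e4);
   prior sigma2 ~ igamma(shape=2.001, scale=1.001);
   model y ~ normal(beta0 + beta1*x, var=sigma2);
run;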
Session 1528-2017:
Getting Started with ARIMA Models
Getting Started with ARIMA Models introduces the basic features of time series variation and the model components used to accommodate them: stationary (ARMA), trend and seasonal (the 'I' in ARIMA), and exogenous (related to input variables). The Identify, Estimate, and Forecast framework for building ARIMA models is illustrated with two demonstrations.
Chip Wells, SAS
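A minimal sketch of the Identify-Estimate-Forecast framework on the classic airline passenger series:

data air;
   set sashelp.air;
   logair = log(air);
run;

proc arima data=air;
   identify var=logair(1,12);            /* difference out trend and seasonality */
   estimate q=(1)(12) noint method=ml;   /* airline model: MA(1) x seasonal MA(1) */
   forecast lead=12 interval=month id=date out=forecasts;
run;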
Because many SAS® users either work for or own companies that house big data, the threat that malicious software poses becomes even more extreme. Malicious software, often abbreviated as malware, includes many different classifications, ways of infection, and methods of attack. This E-Poster highlights the types of malware, detection strategies, and removal methods. It provides guidelines to secure essential assets and prevent future malware breaches.
Ryan Lafler
Session 2000-2017:
Hands-On Workshop: Data Mining using SAS® Enterprise Miner™
This workshop provides hands-on experience with using SAS Enterprise Miner. Workshop participants will learn to do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.
Carlos Andre Reis Pinheiro, SAS
Session 2001-2017:
Hands-On Workshop: SAS® Data Loader for Hadoop
This workshop provides hands-on experience with some basic functionality of SAS Data Loader for Hadoop. You will learn how to copy data to Hadoop, profile data in Hadoop, and cleanse data in Hadoop.
Kari Richardson, SAS
How would you answer this question? Most of us struggle to articulate the value of the tools, techniques, and teams we use when using analytics. How do you help the new director understand the value of SAS® to you, your job, and the company? In this interactive session, you will discover the components that make up total cost of ownership (TCO) as they apply to the analytics lifecycle. What should you consider when you evaluate total cost of ownership and why should you measure it? How can you help your management team understand the value that SAS provides?
Melodie Rush, SAS
Immediately after a new credit card product is launched and in the wallets of cardholders, sentiment begins to build. Positive and negative experiences of current customers posted online generate impressions among prospective cardholders in the form of technological word of mouth. Companies that issue credit cards can use sentiment analysis to understand how their product is being received by consumers, and, by taking suitable measures, can propel the card's market success. With the help of text mining and sentiment analysis using SAS® Enterprise Miner and SAS® Sentiment Analysis Studio, we are trying to answer which aspects of a credit card garnered the most favor, and conversely, which generated negative impressions among consumers. Credit Karma is a free credit and financial management platform for US consumers available on the web and on major mobile platforms. It provides free weekly updated credit scores and credit reports from the national credit bureaus TransUnion and Equifax. The implications of this project are as follows: 1) all companies that issue credit cards can use this technique to determine how their product is faring in the market, and they can make business decisions to improve the flaws, based on public opinion; and 2) sentiment analysis can simulate the word-of-mouth influence of millions of existing users about a credit card.
Anirban Chakraborty, Oklahoma State University
Surya Bhaskar Ayyalasomayajula, Oklahoma State University
Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one based on a methodology from so many candidates is usually a big challenge. This paper presents the methodology based on the CANSLIM theorem and momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in the CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS®. This paper presents the methods on how to apply the logic and graphically present them through SAS. All SAS users, especially those who work directly on capital market business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.
Brian Shen, Merlin Clinical Service LLC
The traditional view is that a utility's long-term forecast must have a standard against which it is judged. Weather normalization is one of the industry-standard practices that utilities use to assess the efficacy of a forecasting solution. While recent advances in probabilistic load forecasting techniques are proving to be a methodology that brings many benefits to a forecast, many utilities still require the benchmarking process to determine the accuracy of their long-term forecasts. Due to climatological volatility and the potentially large annual variances in temperature, humidity, and other relevant weather variables, most utilities create normalized weather profiles through various processes in order to estimate what is traditionally called a weather normalized load profile. However, new research shows that due to the nonlinear response of electric demand to weather variations, a simple normal weather profile in many cases might not equate to a normal load. In this paper, we introduce a probabilistic approach to deriving normalized load profiles and monthly peak and energy through a process we label 'load normalization against the effects of weather'. We compare it with the traditional weather normalization process to quantify the costs and benefits of using such a process. The proposed method has been successfully deployed at utilities for their long-term operation, planning, and risk management purposes.
Kyle Wood, Seminole Electric Cooperative Inc
Jason Wilson, SAS
Bradley Lawson, SAS
Rain Xie
In this technology-driven era, multi-channel communication has become a pivotal part of an effective customer care strategy for companies. Old ways of delivering customer service are no longer adequate. To survive a tough competitive market and retain current customer base, companies are spending heavily to serve customers in the manner in which they wish to be served. West Corporation helps their clients in designing a strategy that would provide their customers with a connected inbound and outbound communication experience. This paper illustrates how the Data Science team at West Corporation has measured the effect of outbound short message service (SMS) notification in reducing inbound interactive voice response (IVR) call volume and improving customer satisfaction for a leading telecom services company. As part of a seamless experience, customers have the option of receiving outbound SMS notifications at several stages while traversing inside IVR. Notifications can involve successful payment and appointment confirmations, outage updates in the area, and an option of receiving text with details to reset Wi-Fi password and activate new devices. This study was performed on two groups of customers: one whose members opted to receive notifications and one whose members did not opt in. Also, analysis was performed using SAS® to understand repeat-caller behaviors within both groups. The group that opted to receive SMS notifications was less likely to call back than the group that did not opt in.
Sumit Sukhwani, West Corporation
Krutharth Peravalli, West Corporation
Dmitriy Khots, West Corporation
Interrupted time series analysis (ITS) is a tool that can be used to help Learning Healthcare Systems evaluate programs in settings where randomization is not feasible. Interrupted time series is a statistical method to assess repeated snapshots over regular intervals of time before and after a system-level intervention or program is implemented. This method can be used by Learning Healthcare Systems to evaluate programs aimed at improving patient outcomes in real-world, clinical settings. In practice, the number of patients and the timing of observations are restricted. This presentation describes a program that helps statisticians identify optimal segments of time within a fixed population size for an interrupted time series analysis. A macro creates simulations based on DO loops to calculate power to detect changes over time due to system-level interventions. Parameters used in the macro are sample size, number of subjects in each time frame in each year, number of intervals in a year, and the probability of the event before and after the intervention. The macro gives the user the ability to specify different assumptions, resulting in design options that yield varying power based on the number of patients in each time interval, given the fixed parameters. The output from the macro can help stakeholders understand the parameters necessary to determine the optimal evaluation design.
Nigel Rozario, UNCC
Andrew McWilliams, CHS
Charity Moore, CHS
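A simplified, hedged sketch of the simulation idea (the actual macro's parameters and test may differ); power is estimated as the rejection rate across replicates:

%macro its_power(n=, k=, p_pre=, p_post=, reps=500);
   data sim;                             /* k intervals of n subjects each */
      call streaminit(2017);
      do rep = 1 to &reps;
         do t = 1 to &k;
            post   = (t > &k/2);         /* second half = post-intervention */
            trials = &n;
            events = rand('binomial', ifn(post, &p_post, &p_pre), &n);
            output;
         end;
      end;
   run;

   ods select none;
   proc logistic data=sim;
      by rep;
      model events/trials = post;
      ods output ParameterEstimates=pe;
   run;
   ods select all;

   data power;                           /* flag replicates that detect the change */
      set pe;
      where variable = 'post';
      reject = (probchisq < 0.05);
   run;

   proc means data=power mean;           /* mean of reject = estimated power */
      var reject;
   run;
%mend its_power;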
Statistical analysis is like detective work, and a data set is like the crime scene. The data set contains unorganized clues and patterns that can, with proper analysis, ultimately lead to meaningful conclusions. Using SAS® tools, a statistical analyst (like any good crime scene investigator) performs a preliminary analysis of the data set through visualization and descriptive statistics. Based on the preliminary analysis, followed by a detailed analysis, both the crime scene investigator (CSI) and the statistical analyst (SA) can use scientific or analytical tools to answer the key questions: What happened? What were the causes and effects? Why did this happen? Will it happen again? Applying the CSI analogy, this paper presents an example case study using a two-step process to investigate a big-data crime scene. Part I shows the general procedures that are used to identify clues and patterns and to obtain preliminary insights from those clues. Part II narrows the focus on the specific statistical analyses that provide answers to different questions.
Theresa Ngo, SAS
In 1993, Erin Brockovich, a legal clerk to Edward L. Masry, began a lengthy manual investigation after discovering a link between elevated clusters of cancer cases in Hinkley, CA, and contaminated water in the same area due to the disposal of chemicals from a utility company. In this session, we combine disparate data sources - cancer cases and chemical spillages - to identify connections between the two data sets using SAS® Visual Investigator. Using the map and network functionalities, we visualize the contaminated areas and their link to cancer clusters. What took Erin Brockovich months and months to investigate, we can do in minutes with SAS Visual Investigator.
Gordon Robinson, SAS
Data is your friend. This presentation discusses the use of data for quality improvement (QI). Measurement over time is integral to quality improvement, and statistical process control charts (also known as Shewhart or SPC charts) are a good way to learn from the way measures change over time, in response to our improvement efforts. The presentation explains what an SPC chart is, how to choose the correct type of chart, how to create and update a chart using SAS®, and how to learn from the chart. The examples come from QI projects in health care, and the material is based on the Institute for Healthcare Improvement's Model for Improvement. However, the material is applicable to other fields, including manufacturing and business. The presentation is intended for people newly considering a QI project, people who want to graph their data and need help with getting started, and anyone interested in interpreting SPC charts created by someone else.
Ruth Croxford, Institute for Clinical Evaluative Sciences
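A minimal sketch, assuming SAS/QC is licensed: an individuals and moving-range chart for a hypothetical monthly QI measure:

proc shewhart data=qi_measures;
   irchart infection_rate * month;   /* individual values plus moving ranges */
run;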
Linear regression, which is widely used, can be improved by the inclusion of a penalty parameter. This helps reduce variance (at the cost of a slight increase in bias) and improves prediction accuracy and model interpretability. The regularization model is implemented on a sample data set, and recommendations for practice are included.
Shashank Hebbar, Kennesaw State University
Lili Zhang, Kennesaw State University
Dhiraj Gharana, Kennesaw State University
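A minimal sketch of penalized regression with PROC GLMSELECT on hypothetical inputs; the penalty is chosen by 10-fold cross validation:

proc glmselect data=train plots=coefficients;
   model y = x1-x20 / selection=lasso(stop=none choose=cv)
                      cvmethod=random(10);
run;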
One of the first maps of the present United States was John White's 1585 map of the Albemarle Sound and Roanoke Island, the site of the Lost Colony and the site of my present home. This presentation looks at advances in mapping through the ages, from the early surveys and hand-painted maps, through lithographic and photochemical processes, to digitization and computerization. Inherent difficulties in including small pieces of coastal land (often removed from map boundary files and data sets to smooth a boundary) are also discussed. The paper concludes with several current maps of Roanoke Island created with SAS®.
Barbara Okerson, Anthem
SAS® BI Dashboard is an important business intelligence and data visualization product used by many customers worldwide, who still rely on it for performance monitoring and decision support. SAS® Visual Analytics is a new-generation product, which empowers customers to explore huge volumes of data very quickly and view visualized results with web browsers and mobile devices. As SAS Visual Analytics is adopted by more and more customers, some SAS BI Dashboard customers might want to migrate existing dashboards to SAS Visual Analytics to take advantage of new technologies. In addition, some customers might hope to deploy the two products in parallel and keep everyone on the same page. Because the two products use different data models and formats, a special conversion tool was developed to convert SAS BI Dashboard dashboards into SAS Visual Analytics dashboards and reports. This paper comprehensively describes the guidelines, methods, and detailed steps to migrate dashboards from SAS BI Dashboard to SAS Visual Analytics. The converted dashboards can then be shown in supported SAS Visual Analytics viewers, including mobile devices and modern browsers.
Roc (Yipeng) Zhang, SAS
Junjie Li, SAS
Wei Lu, SAS
Huazhang Shao, SAS
This paper presents a case study in which social media posts by individuals related to public transport companies in the United Kingdom were collected from social media sites such as Twitter and Facebook and also from forums using SAS® and Python. The posts were then further processed by SAS® Text Miner and SAS® Visual Analytics to retrieve brand names, means of public transport (underground, trains, buses), and any mentioned attributes. Relevant concepts and topics are identified using text mining techniques and visualized using concept maps and word clouds. Later, we aim to identify and categorize sentiments against public transport in the corpus of the posts. Finally, we create an association map/mind-map of the different service dimensions/topics and the brands of public transport, using correspondence analysis.
Tamas Bosznay, Amadeus Software Limited
Microsoft Excel worksheets enable you to explore data that answers the difficult questions that you face daily in your work. When you combine the SAS® Output Delivery System (ODS) with the capabilities of Excel, you have a powerful toolset that you can use to manipulate data in various ways, including highlighting data, using formulas to answer questions, and adding a pivot table or graph. In addition, ODS and Excel give you many methods for enhancing the appearance of your tables and graphs. This paper, written for the beginning analyst to the most advanced programmer, illustrates first how to manipulate styles and presentation elements in your worksheets by controlling text wrapping, highlighting and exploring data, and specifying Excel templates for data. Then, the paper explains how to use the TableEditor tagset and other tools to build and manipulate both basic and complex pivot tables that can help you answer all of the questions about your data. You will also learn techniques for sorting, filtering, and summarizing pivot-table data.
Chevell Parker, SAS
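A hedged sketch of the pivot-table workflow with the downloadable TableEditor tagset; the option names follow the tagset's published papers and should be verified against its documentation:

ods tagsets.tableeditor file="/reports/sales.html"
    options(pivotrow="Region" pivotcol="Year" pivotdata="Sales"
            button_text="Create Pivot Table");
proc print data=sales_summary noobs;   /* hypothetical summary table */
run;
ods tagsets.tableeditor close;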
Recent advances in computing technology, monitoring systems, and data collection mechanisms have prompted renewed interest in multivariate time series analysis. In contrast to univariate time series models, which focus on temporal dependencies of individual variables, multivariate time series models also exploit the interrelationships between different series, thus often yielding improved forecasts. This paper focuses on cointegration and long memory, two phenomena that require careful consideration and are observed in time series data sets from several application areas, such as finance, economics, and computer networks. Cointegration of time series implies a long-run equilibrium between the underlying variables, and long memory is a special type of dependence in which the impact of a series' past values on its future values dies out slowly with the increasing lag. Two examples illustrate how you can use the new features of the VARMAX procedure in SAS/ETS® 14.1 and 14.2 to glean important insights and obtain improved forecasts for multivariate time series. One example examines cointegration by using the Granger causality tests and the vector error correction models, which are the techniques frequently applied in the Federal Reserve Board's Comprehensive Capital Analysis and Review (CCAR), and the other example analyzes the long-memory behavior of US inflation rates.
Xilong Chen, SAS
Stefanos Kechagias, SAS
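A minimal sketch of a cointegration analysis with PROC VARMAX, using hypothetical series:

proc varmax data=rates;
   model y1 y2 / p=2 cointtest=(johansen)    /* Johansen cointegration test */
                 ecm=(rank=1 normalize=y1);  /* rank-1 vector error correction model */
run;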
A new ODS destination for creating Microsoft Excel workbooks is available starting in the third maintenance release for SAS® 9.4. This destination creates native Microsoft Excel XLSX files, supports graphic images, and offers other advantages over the older ExcelXP tagset. In this presentation, you learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS® output. The techniques can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on demand and in real time using SAS server technology is discussed. Using earlier versions of SAS to create multi-sheet workbooks is also discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.
Vince DelGobbo, SAS
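A minimal sketch of the destination: one workbook with one worksheet per BY group, written to a placeholder path:

proc sort data=sashelp.class out=class;
   by sex;
run;

ods excel file="/reports/class.xlsx"
    options(sheet_interval="bygroup" embedded_titles="yes");
proc print data=class noobs;
   by sex;
run;
ods excel close;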
In order to display data visually, our audience preferred charts and graphs generated by Microsoft Excel over those generated by SAS®. However, to make the necessary 30 graphs in Excel took 2 to 3 hours of manual work, even though the chart templates had already been created, and led to mistakes due to human error. SAS graphs took much less time to create, but lacked key functionality that the audience preferred and that was available in Excel graphs. Thanks to SAS, the answer came in Excel 4 Macro Language (X4ML) programming. SAS can actually submit coding to Excel in order to create customized data reporting, to create graphs or to update templates' data series, and even to populate Microsoft Word documents for finalized reports. This paper explores how SAS can be used to create presentation-ready graphs in a proven process that takes less than one minute, compared to the earlier process that took hours. The following code is used and discussed: %macro(macro_var), filename, rc commands, Output Delivery System (ODS), X4ML, and Microsoft Visual Basic for Applications (VBA).
William Zupko II, U.S. Department of Homeland Security/FLETC
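A hedged sketch of the DDE channel that carries X4ML commands to a running copy of Excel on Windows; the paths and macro name are illustrative:

options noxwait noxsync;                 /* let SAS and Excel run side by side */
filename xlcmds dde 'excel|system';

data _null_;
   file xlcmds;
   put '[OPEN("C:\templates\report_template.xlsx")]';
   put '[RUN("RefreshCharts")]';         /* hypothetical workbook macro */
   put '[SAVE.AS("C:\reports\report_final.xlsx")]';
   put '[QUIT()]';
run;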
Many communication channels exist for customers to engage with businesses, yet an interactive voice response (IVR) system remains the most critical of them. The reason is that IVR acts as the front end of consumer interaction and is the most effective method for customers to do business with companies in order to resolve their issues before talking to an agent. If the IVR interface is not designed properly, customers can be stuck in an endless loop of pressing buttons that can lead to consumer annoyance. The bottom line is: An IVR system should be set up to quickly resolve as many routine inbound inquiries as possible and to allow customers to speak to an agent when necessary. In order to accomplish this, the IVR interface has to be optimized so that it is fully effective and provides a great customer experience. This paper demonstrates how SAS® tools helped optimize the IVR system of a book publishing company. The data set used in this study was obtained from a telecom services company and contained IVR logs of more than 300,000 calls with 1.4 million observations. To gain insights into customer behaviors, path analysis was performed on this data using SAS® Enterprise Miner, and obstacles faced by customers were identified. This helped in determining underperforming prompts, and analysis using SAS procedures was conducted on such prompts. Prompt tuning was recommended, and new self-service areas were identified that avoid transfers and can save clients thousands of dollars in investments in call centers.
Padmashri Janarthanam, University of Nebraska Omaha
Vinoth Kumar Raja, West Corporation
People typically invest in more than one stock to help diversify their risk. These stock portfolios are a collection of assets that each have their own inherent risk. If you know the future risk of each of the assets, you can optimize how much of each asset to keep in the portfolio. The real challenge is trying to evaluate the potential future risk of these assets. Different techniques provide different forecasts, which can drastically change the optimal allocation of assets. This talk presents a case study of portfolio optimization in three different scenarios: historical standard deviation estimation, capital asset pricing model (CAPM), and GARCH-based volatility modeling. The structure and results of these three approaches are discussed.
Aric LaBarr, Institute for Advanced Analytics at NC State University
Outliers, such as unusual, anomalous, unexpected, or rare events, have received intensive attention from researchers and practitioners because of their impact on estimated statistics and developed models. Today, some business disciplines focus primarily on outliers, such as credit defaults, operational risks, quality nonconformities, fraud, or even the results of marketing initiatives in highly competitive environments with low response rates of a couple percent or less. This paper discusses the importance of detecting, isolating, and categorizing business outliers to discover their root causes and to monitor them dynamically. Detecting outliers not only as extreme values or in multivariate densities, but also in distributions, patterns, clusters, combinations of items, and sequences of events opens up opportunities for business improvement. SAS® Enterprise Miner can be used to perform such detections. Creating special business segments or running specialized outlier-oriented data mining processes, such as decision trees, allows for the isolation of business-relevant outliers, which are normally masked by traditional statistical techniques. Combined with 'what-if' scenario generation, this process prepares businesses for possible future surges even when no outliers of a specific type are currently present. Furthermore, analyzing some specific outliers may play a role in assessing business stability through corresponding stress tests.
Alex Glushkovsky, BMO Financial Group
Research using electronic health records (EHR) is emerging, but questions remain about its completeness, due in part to the time required for physicians to enter data in all fields. This presentation demonstrates the use of SAS® Enterprise Miner to predict completeness of clinical data, using claims data as the standard 'source of truth' against which to compare it. A method for assessing and predicting the completeness of clinical data is presented using the tools and techniques from SAS Enterprise Miner. Topics covered include: tips for preparing your sample data set for use in SAS Enterprise Miner; tips for preparing your sample data set for modeling, including effective use of the Input Data, Data Partition, Filter, and Replacement nodes; and building predictive models using the StatExplore, Decision Tree, Regression, and Model Comparison nodes.
Catherine Olson, Optum
Thomas Horstman, Optum
SAS® Enterprise Guide® empowers organizations, programmers, business analysts, statisticians, and end users with all the capabilities that SAS has to offer. This hands-on workshop presents the SAS Enterprise Guide graphical user interface (GUI). It covers access to multi-platform enterprise data sources, various data manipulation techniques that do not require you to learn complex coding constructs, built-in wizards for performing reporting and analytical tasks, the delivery of data and results to a variety of mediums and outlets, and support for data management and documentation requirements. Attendees learn how to use the graphical user interface to access SAS® data sets and tab-delimited and Microsoft Excel input files; to subset and summarize data; to join (or merge) two tables together; to flexibly export results to HTML, PDF, and Excel; and to visually manage projects using flow diagrams.
Kirk Paul Lafler, Software Intelligence Corporation
Ryan Lafler
We live in a world of data: small data, big data, and data of every conceivable size in between. In today's world, data finds its way into our lives wherever we are. We talk about data, create data, read data, transmit data, receive data, and save data constantly during any given hour in a day, and we still want and need more. So, we collect even more data at work, in meetings, at home, on our smartphones, in emails, in voice messages, sifting through financial reports, analyzing profits and losses, watching streaming videos, playing computer games, comparing sports teams and favorite players, and countless other ways. Data is growing and being collected at such astounding rates, all in the hope of being able to better understand the world around us. As SAS® professionals, the world of data offers many new and exciting opportunities, but it also presents a frightening realization that data sources might very well contain a host of integrity issues that need to be resolved first. This presentation describes the available methods to remove duplicate observations (or rows) from data sets (or tables) based on row values and keys, using SAS.
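A minimal sketch of the core techniques, with illustrative data set and key names:

    /* Remove rows that are duplicates across every column */
    proc sort data=work.orders out=work.norows noduprecs;
       by _all_;
    run;

    /* Keep the first row per key combination */
    proc sort data=work.orders out=work.nokeys nodupkey;
       by cust_id order_date;
    run;

    /* SQL alternative: SELECT DISTINCT */
    proc sql;
       create table work.distinct_orders as
       select distinct * from work.orders;
    quit;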
Kirk Paul Lafler, Software Intelligence Corporation
There are so many ways for SAS/ACCESS® users to read and write data from and to Microsoft Excel files: SAS® PC Files Server, XLS and XLSX engines, the SAS IMPORT and EXPORT procedures, various Excel file formats (.xls, .xlsx, .xlsb, .xlsm), and more. Many users ask, 'Which is best for me?' This paper explores the requirements and limitations of each engine, along with performance considerations and some of the not-so-obvious things to consider. It also includes a brief analogous discussion on Microsoft Access databases, which share some of the same mechanisms.
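For example, the XLSX engine reads and writes .xlsx files directly on any platform, with no PC Files Server required (the file path and sheet name are illustrative):

    /* LIBNAME access: each worksheet appears as a data set */
    libname xl xlsx '/data/sales.xlsx';
    data work.sales;
       set xl.'Sheet1'n;
    run;
    libname xl clear;

    /* PROC IMPORT with the same engine */
    proc import datafile='/data/sales.xlsx' out=work.sales2
                dbms=xlsx replace;
       sheet='Sheet1';
    run;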
Joe Schluter, SAS
Henry Feldman, SAS
A common issue in data integration is that the documentation and the SAS® data integration job source code often start to diverge and eventually become out of sync. At Capgemini, working for a specific client, we developed a solution to address this challenge. We proposed moving all necessary documentation into the SAS® Data Integration Studio job itself. In this way, all documentation becomes part of the metadata we have created, with the possibility of automatically generating job and release documentation from the metadata. This presentation therefore focuses on the metadata documentation generator. Specifically, this presentation: 1) looks at how to use programming and documentation standards in SAS data integration jobs to enable the generation of documentation from the metadata; and 2) shows how the documentation is generated from the metadata, and the challenges that were encountered creating the code. I draw on our hands-on experience; Capgemini has implemented this for a customer in the Netherlands, and we are rolling it out as an accelerator in other SAS data integration projects worldwide. I share examples of the generated documentation, which contains functional and technical designs, including a list of all source tables, a list of the target tables, all transformations with their own documentation, job dependencies, and more.
Richard Hogenberg, Capgemini
SAS® In-Memory Analytics for Hadoop is an analytical programming environment that enables users to carry out many components of an analytics project in a single environment, rather than switching between different applications. Users can easily prepare raw data for different types of analytics procedures, and exploration techniques enhance the information extracted from the data. They can apply a large variety of statistical and machine learning techniques to the data to compare different analytical approaches. The model comparison capabilities let them quickly find the best model, which they can deploy and score in the Hadoop environment. All of these components of the analytics project are supported in a distributed in-memory environment for lightning-fast processing. This paper highlights tips for working with the interaction between Hadoop data and the SAS® LASR Analytic Server. It contains multiple scenarios with elementary but pragmatic approaches that enable SAS® programmers to work efficiently within the SAS® In-Memory Analytics environment.
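A minimal sketch of loading a table into a SAS LASR Analytic Server, assuming a configured distributed environment (the port, path, and tag are illustrative):

    /* Start a LASR Analytic Server instance */
    proc lasr create port=10010 path="/tmp/lasr" noclass;
       performance nodes=all;
    run;

    /* Load a table through the SASIOLA engine */
    libname lasrlib sasiola port=10010 tag="hps";
    data lasrlib.cars;
       set sashelp.cars;
    run;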
Venkateswarlu Toluchuri, United HealthCare Group
Binary logistic regression models are widely used in CRM (customer relationship management) or credit risk modeling. In these models, it is common to use nominal, ordinal, or discrete (NOD) predictors. NOD predictors typically are binned (reducing the number of their levels) before usage in a logistic model. The primary purpose of binning is to obtain parsimony without greatly reducing the strength of association of the predictor X to the binary target Y. In this paper, two SAS® macros are discussed. The %NOD_BIN macro bins predictors with nominal values (and ordinal and discrete values) by collapsing levels to maximize information value (IV). The %ORDINAL_BIN macro is applied to predictors that are ordered and in which collapsing can occur only for levels that are adjacent in the ordering of X. The %ORDINAL_BIN macro finds all possible binning solutions by complete enumeration. Solutions are ranked by IV, and monotonic solutions are identified.
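The binning macros are the authors' own; as background, here is a minimal sketch of computing WOE and IV for one predictor against a binary target (data set and variable names are assumptions, and every level is assumed to contain at least one event and one non-event):

    /* Event (y=1) and non-event (y=0) counts per level of x */
    proc sql noprint;
       create table work.counts as
       select x, sum(y=1) as g, sum(y=0) as b
       from work.dev
       group by x;
       select sum(g), sum(b) into :tot_g, :tot_b from work.counts;
    quit;

    data work.woe;
       set work.counts;
       woe     = log((g/&tot_g) / (b/&tot_b));
       iv_part = (g/&tot_g - b/&tot_b) * woe;
    run;

    proc means data=work.woe sum;
       var iv_part;   /* the sum is the information value of x */
    run;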
Bruce Lund, Magnify Analytic Solutions
Mediation analysis is a statistical technique for investigating the extent to which a mediating variable transmits the relation of an independent variable to a dependent variable. Because it is useful in many fields, there have been rapid developments in statistical mediation methods. The most cutting-edge statistical mediation analysis focuses on the causal interpretation of mediated effect estimates. Cause-and-effect inferences are particularly challenging in mediation analysis because of the difficulty of randomizing subjects to levels of the mediator (MacKinnon, 2008). The focus of this paper is how incorporating longitudinal measures of the mediating and outcome variables aids in the causal interpretation of mediated effects. This paper provides useful SAS® tools for designing adequately powered studies to detect the mediated effect. Three SAS macros were developed using the powerful but easy-to-use REG, CALIS, and SURVEYSELECT procedures to do the following: (1) implement popular statistical models for estimating the mediated effect in the pretest-posttest control group design; (2) conduct a prospective power analysis for determining the required sample size for detecting the mediated effect; and (3) conduct a retrospective power analysis for studies that have already been conducted, in which the sample size required to detect an observed effect is desired. We demonstrate the use of these three macros with an example.
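As one illustration of the modeling step, a single-mediator path model can be specified in PROC CALIS, with the EFFPART statement partitioning the total effect into direct and indirect components (the data set and the variable names X, M, and Y are assumptions):

    proc calis data=work.study;
       path
          M <- X,     /* path a */
          Y <- M X;   /* paths b and c' */
       effpart Y <- X;   /* total, direct, and indirect (mediated) effects */
    run;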
David MacKinnon, Arizona State University
Web Intelligence is a web-based SAP BusinessObjects application used by the FDA for accessing and querying data files, and ultimately creating reports from multiple databases. The system allows querying of different databases using common business terms, and in the case of the FDA's Center for Food Safety and Applied Nutrition (CFSAN), careful review of dietary supplement information. However, in order to create timely and efficient reports for detection of safety signals leading to adverse events, a more efficient system is needed to obtain and visually display the data. Using SAS® Visual Analytics and SAS® Enterprise Guide® can assist with timely extraction of data from multiple databases commonly used by CFSAN and create a more user-friendly interface for management's review to help make key decisions in the prevention of adverse events for the public.
Manuel Kavekos, ORISE
The SAS® platform with Unicode's UTF-8 encoding is ready to help you tackle the challenges of dealing with data in multiple languages. In today's global economy, software needs are changing. Companies are globalizing and consolidating systems from various parts of the world. Software must be ready to handle data from social media, international web pages, and databases that have characters in many different languages. SAS makes migrating your data to Unicode a snap! This paper helps you move smoothly from your legacy SAS environment to the powerful SAS Unicode environment with UTF-8 support. Along the way, you will uncover secrets to successfully manipulate your characters, so that all of your data remains intact.
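One such technique is the CVP (character variable padding) engine, which expands character variable lengths so single-byte data survives transcoding to UTF-8 (the library paths and multiplier are illustrative):

    /* Read wlatin1 data sets in a UTF-8 session, padding lengths by 2.5x */
    libname legacy cvp '/data/wlatin1_lib' cvpmultiplier=2.5;
    libname utf8 '/data/utf8_lib';

    data utf8.customers;
       set legacy.customers;   /* character values are transcoded on read */
    run;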
Elizabeth Bales, SAS
Wei Zheng, SAS
SAS® Visual Analytics provides a complete platform for analytics visualization and exploration of data. There are several interactive visualizations, such as charts, histograms, heat maps, decision trees, and Sankey diagrams. A Sankey diagram helps in performing path analytics and offers a better understanding of complex data. It is a graphic illustration of flows from one set of values to another as a series of paths, where the width of each flow represents its quantity. It is an efficient way to illustrate which flows represent advantages and which are responsible for disadvantages or losses. Sankey diagrams are named after Matthew Henry Phineas Riall Sankey, who first used one in an 1898 publication on the energy efficiency of a steam engine. This paper begins with information regarding the essential parts of a Sankey diagram: nodes, links, drop-off links, and paths. Later, the paper explains the method for creating a meaningful visualization (with the help of examples) with a Sankey diagram by looking into the data roles and properties, describing ways to manage the path selection, exploring the transaction identifier values for a path selection, and using the spotlight tool to view multiple data tips in SAS Visual Analytics. Finally, the paper provides recommendations and tips for working effectively and efficiently with the Sankey diagram.
Abhilasha Tiwari, Accenture
Imagine if you will a program, a program that loves its data, a program that loves its data to be in the same directory as the program itself. Together, in the same directory. True love. The program loves its data so much, it just refers to it by filename. No need to say what directory the data is in; it is the same directory. Now imagine that program being thrust into the world of the server. The server knows not what directory this program resides in. The server is an uncaring, but powerful, soul. Yet, when the program is executing, and the program refers to the data just by filename, the server bellows nay, no path, no data. A knight in shining armor emerges, in the form of a SAS® macro, who says lo, with the help of the SAS® Enterprise Guide® macro variable minions, I can gift you with the location of the program directory and send that with you to yon mighty server. And there was much rejoicing. Yay. This paper shows you a SAS macro that you can include in your SAS Enterprise Guide pre-code to automatically set your present working directory to the same directory where your program is saved on your UNIX or Linux operating system. This is applicable to submitting to any type of server, including a SAS Grid Server. It gives you the flexibility of moving your code and data to different locations without having to worry about modifying the code. It also helps save time by not specifying complete pathnames in your programs. And can't we all use a little more time?
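A minimal sketch of the idea, relying on the &_SASPROGRAMFILE macro variable that SAS Enterprise Guide populates with the saved program's full path (the cd approach applies to UNIX and Linux servers):

    %macro set_pwd;
       %local prog dir;
       %let prog = %sysfunc(dequote(&_SASPROGRAMFILE));
       /* keep everything up to and including the last slash */
       %let dir = %substr(&prog, 1, %sysfunc(findc(&prog, /, b)));
       x "cd &dir";   /* changes the SAS session's working directory */
    %mend set_pwd;
    %set_pwd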
Michelle Buchecker, ThotWave Technologies, LLC.
Have you ever used a control chart to assess the variation in a process? Did you wonder how you could modify the chart to tell a more complete story about the process? This paper explains how you can use the SHEWHART procedure in SAS/QC® software to make the following enhancements: display multiple sets of control limits that visualize the evolution of the process, visualize stratified variation, explore within-subgroup variation with box-and-whisker plots, and add information that improves the interpretability of the chart. The paper begins by reviewing the basics of control charts and then illustrates the enhancements with examples drawn from real-world quality improvement efforts.
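For instance, the BOXCHART statement overlays box-and-whisker summaries of within-subgroup variation on the control chart (the data set and variables are assumptions):

    /* Mean control chart with schematic box-and-whisker subgroup displays */
    proc shewhart data=work.process;
       boxchart weight * batch /
          boxstyle = schematic
          nohlabel;
    run;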
Bucky Ransdell, SAS
Socioeconomic status (SES) is a major contributor to health disparities in the United States. Research suggests that those with a low SES versus a high SES are more likely to have lower life expectancy; participate in unhealthy behaviors such as smoking and alcohol consumption; experience higher rates of depression, childhood obesity, and ADHD; and experience problems accessing appropriate health care. Interpreting SES can be difficult due to the complexity of data, multiple data sources, and the large number of socioeconomic and demographic measures available. When SES is expanded to include additional social determinants of health (SDOH), such as language and transportation barriers to care; access to employment and affordable housing; adequate nutrition, family support, and social cohesion; health literacy; crime and violence; quality of housing; and other environmental conditions, the ability to measure and interpret the concept becomes even more difficult. This paper presents an approach to measuring SES and SDOH using publicly available data. Various statistical modeling techniques are used to define state-specific composite SES scores at the local-area level (ZIP code and census tract). Once developed, the SES/SDOH models are applied to health care claims data to evaluate the relationship between health services utilization, cost, and social factors. The analysis includes a discussion of the potential impact of social factors on population risk adjustment.
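One common way to form such a composite, sketched here with hypothetical area-level measures, is to take the first principal component of the standardized indicators:

    /* First principal component of standardized measures as an SES score */
    proc princomp data=work.acs_zip out=work.ses_scores std n=1 prefix=SES;
       var median_income pct_poverty pct_unemployed pct_no_hs pct_crowded;
    run;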
Paul LaBrec, 3M Health Information Systems, Inc.
Ryan Butterfield, DrPH, 3M HIS
You've all seen the job posting that looks more like an advertisement for the ever-elusive unicorn. It begins by outlining the required skills, which include a mixture of tools, technologies, and masterful things that you should be able to do. Unfortunately, many such postings begin with restrictions to those with advanced degrees in math, science, statistics, or computer science and experience in your specific industry. Candidates must be able to perform predictive modeling and natural language processing and, for good measure, should apply only if they know artificial intelligence, cognitive computing, and machine learning. The candidate should be proficient in SAS®, R, Python, Hadoop, ETL, real-time, in-cloud, in-memory, in-database and must be a master storyteller. I know of no one who would fit that description and still be able to hold a normal conversation with another human. In our work, we have developed a competency model for analytics, which describes nine performance domains that encompass the knowledge, skills, behaviors, and dispositions that today's analytic professional should possess in support of a learning, analytically driven organization. In this paper, we describe the model and provide specific examples of job families and career paths that can be followed based on the domains that best fit your skills and interests. We also share a self-assessment tool so that participants can see where they stack up!
Greg Nelson, Thotwave Technologies, LLC.
Does your job require you to create reports in Microsoft Excel on a quarterly, monthly, or even weekly basis? Are you creating all or part of these reports by hand, referencing another sheet containing rows and rows and rows of data? If so, stop! There is a better way! The new ODS destination for Excel enables you to create native Excel files directly from SAS®. Now you can include just the data you need, create great-looking tabular output, and do it all in a fraction of the time! This paper shows you how to use the REPORT procedure to create polished tables that contain formulas, colored cells, and other customized formatting. Also presented in the paper are the destination options used to create various workbook structures, such as multiple tables per worksheet. Using these techniques to automate the creation of your Excel reports will save you hours of time and frustration, enabling you to pursue other endeavors.
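A minimal sketch combining a native Excel formula and conditional cell coloring (the data set, variables, and colors are illustrative):

    ods excel file='/tmp/sales.xlsx' options(sheet_name='Summary');

    proc report data=work.sales;
       column region target sales variance;
       define region   / group;
       define target   / analysis sum format=dollar12.;
       define sales    / analysis sum format=dollar12.;
       /* TAGATTR passes a live formula through to Excel */
       define variance / computed format=dollar12.
              style(column)={tagattr='formula:=RC[-1]-RC[-2]'};
       compute variance;
          variance = sales.sum - target.sum;
       endcomp;
       compute sales;
          if sales.sum < target.sum then
             call define(_col_, 'style', 'style={background=cxFFC7CE}');
       endcomp;
    run;

    ods excel close;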
Jane Eslinger, SAS
Many organizations need to analyze large numbers of time series that have time-varying or frequency-varying properties (or both). The time-varying properties can include time-varying trends, and the frequency-varying properties can include time-varying periodic cycles. Time-frequency analysis simultaneously analyzes both time and frequency; it is particularly useful for monitoring time series that contain several signals of differing frequency. These signals are commonplace in data that are associated with the internet of things. This paper introduces techniques for large-scale time-frequency analysis and uses SAS® Forecast Server and SAS/ETS® software to demonstrate these techniques.
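As a simple frequency-domain building block, PROC SPECTRA in SAS/ETS estimates the periodogram and a smoothed spectral density; running it within precomputed time windows gives a crude time-frequency view (the data set, window variable, and weights are assumptions):

    /* Periodogram (P_01) and spectral density (S_01) per time window */
    proc spectra data=work.sensor out=work.spec p s adjmean;
       by window;              /* data sorted by precomputed time segment */
       var reading;
       weights 1 2 3 2 1;      /* simple smoothing of the periodogram */
    run;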
Michael Leonard, SAS
Wei Xiao, SAS
Arin Chaudhuri, SAS
This presentation explores the steps taken by a large public research institution to develop a five-year enrollment forecasting model to support the institution's critical enrollment management process. A key component of the process is providing university stakeholders with a self-service, secure, and flexible tool in Microsoft Excel that enables them to quickly generate different enrollment projections using the most up-to-date information possible. The presentation shows how we integrated both SAS® Enterprise Guide® and the SAS® Add-In for Microsoft Office to support this critical process, which had very specific stakeholder requirements and expectations.
Andre Watts, University of Central Florida
Lisa Sklar, University of Central Florida
Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between the SAS® data sets. In projects that span multiple years (e.g., longitudinal studies), there are usually thousands of new variables introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables. However, they differ only in a digit or two that usually signifies the year number of the project. So, every year, there is this extensive task of comparing thousands of new variables to older variables for the sake of carrying forward key database elements corresponding to the previously defined variables. These elements can include the length of the variable, data type, format, discrete or continuous flag, and so on. In our SAS program, hash objects are efficiently used to cut down not only time, but also the number of DATA and PROC steps used to accomplish the task. Clean and lean code is much easier to understand. A macro is used to create the data set containing new and older variables. For a specific new variable, the FIND method in hash objects is used in a loop to find the match to the most recent older variable. What was taking around a dozen PROC SQL steps is now a single DATA step using hash tables.
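A condensed sketch of the lookup, with a hypothetical stem-plus-year naming convention and illustrative element names:

    data work.matched;
       length varname $32 varlen 8 vartype $1 varfmt $32;
       if _n_ = 1 then do;
          declare hash h(dataset: 'work.old_vars');   /* prior years' codebook */
          h.defineKey('varname');
          h.defineData('varlen', 'vartype', 'varfmt');
          h.defineDone();
       end;
       set work.new_vars;                    /* one row per new variable; has stem */
       found = 0;
       do yr = 8 to 1 by -1 until (found);   /* walk back to the most recent year */
          varname = cats(stem, '_Y', yr);    /* e.g., INCOME_Y8, INCOME_Y7, ... */
          if h.find() = 0 then found = 1;    /* FIND fills varlen, vartype, varfmt */
       end;
    run;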
Raghav Adimulam, Westat
This paper uses a simulation comparison to evaluate quantile approximation methods in terms of their practical usefulness and potential applicability in an operational risk context. A popular method in modeling the aggregate loss distribution in risk and insurance is the Loss Distribution Approach (LDA). Many banks currently use the LDA for estimating regulatory capital for operational risk. The aggregate loss distribution is a compound distribution resulting from a random sum of losses, where the losses are distributed according to some severity distribution and the number of losses distributed according to some frequency distribution. In order to estimate the regulatory capital, an extreme quantile of the aggregate loss distribution has to be estimated. A number of numerical approximation techniques have been proposed to approximate the extreme quantiles of the aggregate loss distribution. We use PROC SEVERITY to fit various severity distributions to simulated samples of individual losses from a preselected severity distribution. The accuracy of the approximations obtained is then evaluated against a Monte Carlo approximation of the extreme quantiles of the compound distribution resulting from the preselected severity distribution. We find that the second-order perturbative approximation, a closed-form approximation, performs very well at the extreme quantiles and over a wide range of distributions and is very easy to implement.
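A minimal sketch of the severity-fitting step (the data set and loss variable are assumptions):

    /* Fit and compare candidate severity distributions to individual losses */
    proc severity data=work.losses crit=aicc outest=work.parms;
       loss lossamount;
       dist gamma logn weibull pareto gpd;
    run;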
Helgard Raubenheimer, Center for BMI, North-West University
Riaan de Jongh, Center for BMI, North-West University
Access to care for Medicaid beneficiaries is a topic of frequent study and debate. Section 1202 of the Affordable Care Act (ACA) requires states to raise Medicaid primary care payment rates to Medicare levels in 2013 and 2014. The federal government paid 100% of the increase. This program was designed to encourage primary care providers to participate in Medicaid, since this has long been a challenge for Medicaid. Whether this fee increase has increased access to primary care providers is still debated. Using SAS®, we evaluated whether Medicaid patients have a higher incidence of non-urgent visits to local emergency departments (ED) than do patients with other payment sources. The National Hospital Ambulatory Medical Care Survey (NHAMCS) data set, obtained from the Centers for Disease Control (CDC), was selected, since it contains data relating to hospital emergency departments. This emergency room data, for years 2003–2011, was analyzed by diagnosis, expected payment method, reason for the visit, region, and year. To evaluate whether the ED visits were considered urgent or non-urgent, we used the NYU Billings algorithm for classifying ED utilization (NYU Wagner 2015). Three models were used for the analyses: Binary Classification, Multi-Classification, and Regression. In addition to finding no regional differences, decision trees and SAS® Visual Analytics revealed that Medicaid patients do not have a higher rate of non-emergent visits when compared to other payment types.
Bradley Casselman, CSA
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
The challenge is to assign outbound calling agents in a telemarketing campaign to geographic districts. The districts have a variable number of leads, and each agent needs to be assigned entire districts with the total number of leads being as close as possible to a specified number for each of the agents (usually, but not always, an equal number). In addition, there are constraints concerning the distribution of assigned districts across time zones, in order to maximize productivity and availability. The SAS/OR® CLP procedure solves the problem by formulating the challenge as a constraint satisfaction problem (CSP). Our use of PROC CLP places the actual leads within a specified percentage of the target number.
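The paper's formulation uses PROC CLP; as a rough alternative sketch of the balancing requirement alone, here is a MILP version in PROC OPTMODEL (the agent count, 10% tolerance, and input data are assumptions, and the time zone constraints are omitted):

    proc optmodel;
       set <str> DISTRICTS;
       set AGENTS = 1..4;                       /* illustrative number of agents */
       num leads {DISTRICTS};
       read data work.districts into DISTRICTS=[district] leads;
       num target = (sum {d in DISTRICTS} leads[d]) / card(AGENTS);

       var Assign {DISTRICTS, AGENTS} binary;

       /* every district goes to exactly one agent */
       con OneAgent {d in DISTRICTS}:
          sum {a in AGENTS} Assign[d,a] = 1;
       /* each agent's total leads must fall within 10% of the target */
       con Balance {a in AGENTS}:
          0.9*target <= sum {d in DISTRICTS} leads[d]*Assign[d,a] <= 1.1*target;

       min Zero = 0;                            /* pure feasibility problem */
       solve with milp;
       create data work.assignments from
          [district agent] = {d in DISTRICTS, a in AGENTS: Assign[d,a].sol > 0.5};
    quit;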
Stephen Sloan, Accenture
Kevin Gillette, Accenture Federal Services
Graphs are mathematical structures capable of representing networks of objects and their relationships. Clustering is an area in graph theory where objects are split into groups based on their connections. Depending on the application domain, object clusters have various meanings (for example, in market basket analysis, clusters are families of products that are frequently purchased together). This paper provides a SAS® macro featuring PROC OPTGRAPH, which enables the transformation of transactional data, or any data with a many-to-many relationship between two entities, into graph data, allowing for the generation and application of the co-occurrence graph and the probability graph.
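The transactional-to-graph transformation at the macro's core can be sketched with a self-join (the table and column names are illustrative); PROC OPTGRAPH can then consume the result as a links data set:

    /* Build a weighted co-occurrence edge list from (tid, item) transactions */
    proc sql;
       create table work.cooc_links as
       select a.item as from_item,
              b.item as to_item,
              count(*) as weight        /* transactions containing both items */
       from work.trans a
            inner join work.trans b
              on a.tid = b.tid and a.item < b.item
       group by a.item, b.item;
    quit;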
Linh Le, Kennesaw State University
Jennifer Priestley, Kennesaw State University
More than ever, customers are demanding consistent and relevant interaction across all channels. Businesses are having to develop omnichannel marketing capabilities to please these customers. Implementing omnichannel marketing is often difficult, especially when using digital channels. Most products designed solely for digital channels lack capabilities to integrate with traditional channels that have on-premises processes and data. SAS® Customer Intelligence 360 is a new offering that enables businesses to leverage both cloud and on-premises channels and data. This is possible due to the solution's hybrid cloud architecture. This paper discusses the SAS Customer Intelligence 360 approach to the hybrid cloud, and covers key capabilities on security, throughput, and integration.
Toshi Tsuboi, SAS
Stephen Cuppett, SAS
Weight of evidence (WOE) coding of a nominal or discrete variable X is widely used when preparing predictors for usage in binary logistic regression models. The concept of WOE is extended to ordinal logistic regression for the case of the cumulative logit model. If the target (dependent) variable has L levels, then L-1 WOE variables are needed to recode X. The appropriate setting for implementing WOE coding is the cumulative logit model with partial proportional odds. As in the binary case, it is important to bin X to achieve parsimony before the WOE coding. SAS® code to perform this binning is discussed. An example is given that shows the implementation of WOE coding and binning for a cumulative logit model with the target variable having three levels.
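A minimal sketch of the final model, assuming a three-level ordinal target y3 and two WOE recodes of X (all names are illustrative):

    /* Cumulative logit with partial proportional odds:
       the WOE variables receive level-specific slopes */
    proc logistic data=work.dev;
       model y3 = woe1 woe2 / unequalslopes=(woe1 woe2);
    run;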
Bruce Lund, Magnify Analytic Solutions
The latest releases of SAS® Data Management software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud data sources, RDBMS, files, unstructured data, and streaming data, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
In the world of gambling, superstition drives behavior, which can be difficult to explain. Conflicting evidence suggests that slot machines, like BCLC's Cleopatra, perform well regardless of where they are placed on a casino floor. Other evidence disputes this, arguing that performance is driven by their strategic placement (for example, in high-traffic areas). We explore and quantify the location sensitivity of slot machines by leveraging SAS® to develop robust models. We test various methodologies and data import techniques (such as casino CAD floor plans) to unlock some of the nebulous concepts of player behavior, product performance, and superstition. By demystifying location sensitivity, key drivers of performance can be identified to aid in optimizing the placement of slot machines.
Stephen Tam, British Columbia Lottery Corporation
Why Credit Risk Needs Advanced Analytics: A Journey from Base SAS® to SAS® High-Performance Risk
We are at a tipping point for credit risk modeling. To meet the technical and regulatory challenges of IFRS 9 and stress testing, and to strengthen model risk management, CBS aims to create an integrated, end-to-end, tools-based solution across the model lifecycle, with strong governance and controls and an improved scenario testing and forecasting capability. SAS has been chosen as the technology partner to enable CBS to meet these aims. A new predictive analytics platform combining well-known tools such as SAS® Enterprise Miner, SAS® Model Manager, and SAS® Data Management alongside SAS® Model Implementation Platform powered by SAS® High-Performance Risk is being deployed. Driven by technology, CBS has also considered the operating model for credit risk, restructuring resources around the new technology with clear lines of accountability, and has incorporated a dedicated data engineering function within the risk modeling team. CBS is creating a culture of collaboration across credit risk that supports the development of technology-led, innovative solutions that not only meet regulatory and model risk management requirements but that set a platform for the effective use of predictive analytics enterprise-wide.
Chris Arthur-McGuire
Growing global trade, with more people and freight moving across international borders, presents border and security agencies with a difficult challenge. While supporting freedom of movement, agencies must minimize risks, preserve national security, guarantee that correct duties are collected, deploy human resources to the right place, and ensure that additional checks do not result in increased delays for passengers or cargo. To meet these objectives, border agencies must make the most efficient use of their data, which is often found across disparate intelligence sources. Bringing this data together with powerful analytics can help them identify suspicious events, highlight areas of risk, process watch lists, and notify relevant agents so that they can investigate, take immediate action to intercept illegal or high-risk activities, and report findings. With SAS® Visual Investigator, organizations can use advanced analytical models and surveillance scenarios to identify and score events, and to deliver them to agents and intelligence analysts for investigation and action. SAS Visual Investigator provides analysts with a holistic view of people, cargo, relationships, social networks, patterns, and anomalies, which they can explore through interactive visualizations before capturing their decision and initiating an action.
Susan Trueman, SAS