Imbalanced data are frequently seen in fraud detection, direct marketing, disease prediction, and many other areas. Rare events are sometimes of primary interest. Classifying them correctly is the challenge that many predictive modelers face today. In this paper, we use SAS® Enterprise Miner™ on a marketing data set to demonstrate and compare several approaches that are commonly used to handle imbalanced data problems in classification models. The approaches are based on cost-sensitive measures and sampling measures. A rather novel technique called SMOTE (Synthetic Minority Over-sampling TEchnique), which has achieved the best result in our comparison, will be discussed.
Ruizhe Wang, GuideWell Connect
Novik Lee, GuideWell Connect
Yun Wei, GuideWell Connect
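As a point of reference for the sampling approaches mentioned above, here is a minimal sketch that under-samples the majority class with PROC SURVEYSELECT (data set and variable names are hypothetical; this is neither the authors' code nor the SMOTE technique itself):

proc surveyselect data=train_raw(where=(response=0)) out=majority_down
                  method=srs samprate=0.10 seed=1234;   /* keep 10% of non-events */
run;

data train_balanced;                                    /* all events plus the down-sampled non-events */
   set majority_down train_raw(where=(response=1));
run;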
When large amounts of data are available, choosing the variables for inclusion in model building can be problematic. In this analysis, a subset of variables was required from a larger set. This subset was to be used in a later cluster analysis with the aim of extracting dimensions of human flourishing. A genetic algorithm (GA), written in SAS®, was used to select the subset of variables from a larger set in terms of their association with the dependent variable life satisfaction. Life satisfaction was selected as a proxy for an as yet undefined quantity, human flourishing. The data were divided into subject areas (health, environment). The GA was applied separately to each subject area to ensure adequate representation from each in the future analysis when defining the human flourishing dimensions.
Lisa Henley, University of Canterbury
This paper introduces a macro that generates a calendar report in two different formats. The first format displays the entire month in one plot, which is called a month-by-month calendar report. The second format displays the entire month in one row and is called an all-in-one calendar report. To use the macro, you just need to prepare a simple data set that has three columns: one column identifies the ID, one column contains the date, and one column specifies the notes for the dates. On the generated calendar reports, you can include notes and add different styles to certain dates. Also, the macro provides the option for you to decide whether those months in your data set that do not contain data should be shown on the reports.
Ting Sa, Cincinnati Children's Hospital Medical Center
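A minimal sketch of the kind of three-column input data set the abstract above describes; the column names and values are hypothetical and may differ from what the macro actually requires:

data calendar_input;
   infile datalines dlm=',';
   input id $ date :mmddyy10. note :$40.;   /* one ID, one date, one note per row */
   format date mmddyy10.;
datalines;
A001,01/05/2015,Follow-up visit
A001,01/19/2015,Lab work
A002,02/14/2015,Phone screening
;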
Companies that offer subscription-based services (such as telecom and electric utilities) must evaluate the tradeoff between month-to-month (MTM) customers, who yield a high margin at the expense of lower lifetime, and customers who commit to a longer-term contract in return for a lower price. The objective, of course, is to maximize the Customer Lifetime Value (CLV). This tradeoff must be evaluated not only at the time of customer acquisition, but throughout the customer's tenure, particularly for fixed-term contract customers whose contract is due for renewal. In this paper, we present a mathematical model that optimizes the CLV against this tradeoff between margin and lifetime. The model is presented in the context of a cohort of existing customers, some of whom are MTM customers and others who are approaching contract expiration. The model optimizes the number of MTM customers to be swapped to fixed-term contracts, as well as the number of contract renewals that should be pursued, at various term lengths and price points, over a period of time. We estimate customer life using discrete-time survival models with time varying covariates related to contract expiration and product changes. Thereafter, an optimization model is used to find the optimal trade-off between margin and customer lifetime. Although we specifically present the contract expiration case, this model can easily be adapted for customer acquisition scenarios as well.
Atul Thatte, TXU Energy
Goutam Chakraborty, Oklahoma State University
The Centers for Medicare & Medicaid Services (CMS) uses the Proportion of Days Covered (PDC) to measure medication adherence, and some PDC-related research is based on Medicare Part D Event (PDE) data. However, under Medicare rules, "beneficiaries who receive care at an Inpatient (IP) [facility] may receive Medicare covered medications directly from the IP, rather than by filling prescriptions through their Part D contracts; thus, their medication fills during an IP stay would not be included in the PDE claims used to calculate the Patient Safety adherence measures" (Medicare 2014 Part C&D Star Rating Technical Notes). Therefore, the previous PDC calculation method underestimated the true PDC value. Starting with the 2013 Star Ratings, the PDC calculation was adjusted for IP stays. That is, when a patient has an inpatient admission during the measurement period, the inpatient stay is censored from the PDC calculation, and if the patient also has measured drug coverage during the stay, the drug supplied during that stay is shifted to after it; this shift can in turn cascade into a chain of further shifts. This paper presents a SAS® macro that uses the SAS hash object to match inpatient stays, censor them, shift drug start and end dates, and calculate the adjusted PDC.
Anping Chang, IHRC Inc.
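A highly simplified sketch of the hash-object lookup at the core of the approach described above, assuming (unrealistically) one inpatient stay per member and hypothetical names; the paper's macro handles the full censoring and cascading shifts:

data pde_adjusted;
   if _n_ = 1 then do;
      declare hash ip(dataset: 'work.ip_stays');      /* member_id, ip_admit_dt, ip_discharge_dt */
      ip.defineKey('member_id');
      ip.defineData('ip_admit_dt', 'ip_discharge_dt');
      ip.defineDone();
      call missing(ip_admit_dt, ip_discharge_dt);
   end;
   set work.pde_claims;                               /* member_id, fill_dt, days_supply */
   adj_start_dt = fill_dt;
   if ip.find() = 0 then do;
      /* if the fill's coverage overlaps the stay, start counting after discharge */
      if fill_dt <= ip_discharge_dt and fill_dt + days_supply - 1 >= ip_admit_dt
         then adj_start_dt = ip_discharge_dt + 1;
   end;
   format fill_dt adj_start_dt ip_admit_dt ip_discharge_dt date9.;
run;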
Since Maine established the first All Payer Claims Database (APCD) in 2003, 10 additional states have established APCDs and 30 others are in development or show strong interest in establishing APCDs. APCDs are generally mandated by legislation, though voluntary efforts exist. They are administered through various agencies, including state health departments or other governmental agencies and private not-for-profit organizations. APCDs receive funding from various sources, including legislative appropriations and private foundations. To ensure sustainability, APCDs must also consider the sale of data access and reports as a source of revenue. With the advent of the Affordable Care Act, there has been an increased interest in APCDs as a data source to aid in health care reform. The call for greater transparency in health care pricing and quality, development of Patient-Centered Medical Homes (PCMHs) and Accountable Care Organizations (ACOs), expansion of state Medicaid programs, and establishment of health insurance and health information exchanges have increased the demand for the type of administrative claims data contained in an APCD. Data collection, management, analysis, and reporting issues are examined with examples from implementations of live APCDs. The development of data intake, processing, warehousing, and reporting standards is discussed in light of achieving the triple aim of improving the individual experience of care; improving the health of populations; and reducing the per capita costs of care. APCDs are compared and contrasted with other sources of state-level health care data, including hospital discharge databases, state departments of insurance records, and institutional and consumer surveys. The benefits and limitations of administrative claims data are reviewed. Specific issues addressed with examples include implementing transparent reporting of service prices and provider quality, maintaining master patient and provider identifiers, validating APCD data and comparing it with other state health care data available to researchers and consumers, defining data suppression rules to ensure patient confidentiality and HIPAA-compliant data release and reporting, and serving multiple end users, including policy makers, researchers, and consumers, with appropriately consumable information.
Paul LaBrec, 3M Health Information Systems
At Scotia - Colpatria Bank, the retail segment is very important. The volume of lending applications makes it necessary to use statistical models and analytic tools to make an initial selection of good customers, whom our credit analysts then study in depth before finally approving or denying a credit application. Constructing target vintages using the Cox model generates past-due alerts sooner, so mitigation measures can be applied one or two months earlier than is currently possible; this can reduce losses by 100 bps in the new vintages. This paper estimates a Cox proportional hazards model and compares the results with a logit model for a specific product of the bank. Additionally, we estimate the target vintage for the product.
Ivan Atehortua Rojas, Scotia - Colpatria Bank
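A minimal sketch of the two models being compared, on a hypothetical loan-level data set (variable names are illustrative, not the bank's):

proc phreg data=loans;
   class product segment / param=ref;
   model months_to_default*censor(1) = score_at_origination product segment;  /* censor=1: never past due */
run;

proc logistic data=loans;
   class product segment / param=ref;
   model default_12m(event='1') = score_at_origination product segment;
run;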
In our management and collection area, there was no methodology that provided the optimal number of collection calls needed to get a customer to make the minimum payment on his or her financial obligation. We wanted to determine that optimal number of calls using the data envelopment analysis (DEA) optimization methodology. Using this methodology, we obtained results that positively impacted the way our customers are contacted: we can maintain a healthy bank-customer relationship, keep management and collection at an operational level, and achieve a more effective and efficient portfolio recovery. DEA has been used successfully in various fields of manufacturing production to solve multi-criteria optimization problems, but it has not been commonly used in the financial sector, especially in collections. The methodology requires specialized software; we used SAS® Enterprise Guide® and its optimization capabilities. In this presentation, we use PROC OPTMODEL to show how to formulate the optimization problem, write the program, and process the available data.
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Ana Nieto, Banco Colpatria of Scotiabank Group
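A toy version of the classic input-oriented CCR (DEA) multiplier model in PROC OPTMODEL, with made-up data for three collection units; it illustrates the formulation style only and is not the authors' model:

proc optmodel;
   set DMUS = 1..3;                        /* decision-making units */
   set INPUTS = 1..2;                      /* e.g., calls made, agent hours */
   set OUTPUTS = 1..1;                     /* e.g., amount recovered */
   num x{DMUS, INPUTS} = [120 40
                           90 35
                          150 50];
   num y{DMUS, OUTPUTS} = [18 15 21];
   num k = 1;                              /* the unit being evaluated */
   var U{OUTPUTS} >= 0;                    /* output weights */
   var V{INPUTS}  >= 0;                    /* input weights  */
   max Efficiency = sum{r in OUTPUTS} U[r]*y[k,r];
   con Normalize: sum{i in INPUTS} V[i]*x[k,i] = 1;
   con Ratio{j in DMUS}:
      sum{r in OUTPUTS} U[r]*y[j,r] <= sum{i in INPUTS} V[i]*x[j,i];
   solve;
   print U;
   print V;
quit;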
Missing data are a common and significant problem that researchers and data analysts encounter in applied research. Because most statistical procedures require complete data, missing data can substantially affect the analysis and the interpretation of results if left untreated. Methods to treat missing data have been developed so that missing values are imputed and analyses can be conducted using standard statistical procedures. Among these methods, multiple imputation (MI) has received considerable attention, and its effectiveness has been explored, for example, in the context of survey and longitudinal research. This paper compares four multiple imputation approaches for treating missing continuous covariate data under the missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) assumptions, in the context of propensity score analysis and observational studies. The four MI approaches are compared in terms of bias in parameter estimates, Type I error rates, and statistical power. In addition, complete case analysis (listwise deletion) is presented as the default analysis that would be conducted if missing data are not treated. Issues are discussed, and conclusions and recommendations are provided.
Patricia Rodriguez de Gil, University of South Florida
Shetay Ashford, University of South Florida
Chunhua Cao, University of South Florida
Eun-Sook Kim, University of South Florida
Rheta Lanehart, University of South Florida
Reginald Lee, University of South Florida
Jessica Montgomery, University of South Florida
Yan Wang, University of South Florida
This session will describe an innovative way to identify groupings of customer offerings using SAS® software. The authors investigated the customer enrollments in nine different programs offered by a large energy utility. These programs included levelized billing plans, electronic payment options, renewable energy, energy efficiency programs, a home protection plan, and a home energy report for managing usage. Of the 640,788 residential customers, 374,441 had been solicited for a program and had adequate data for analysis. Nearly half of these eligible customers (49.8%) enrolled in some type of program. To examine the commonality among programs based on characteristics of customers who enroll, cluster analysis procedures and correlation matrices are often used. However, the value of these procedures was greatly limited by the binary nature of enrollments (enroll or no enroll), as well as the fact that some programs are mutually exclusive (limiting cross-enrollments for correlation measures). To overcome these limitations, PROC LOGISTIC was used to generate predicted scores for each customer for a given program. Then, using the same predictor variables, PROC LOGISTIC was used on each program to generate predictive scores for all customers. This provided a broad range of scores for each program, under the assumption that customers who are likely to join similar programs would have similar predicted scores for these programs. PROC FASTCLUS was used to build k-means cluster models based on these predicted logistic scores. Two distinct clusters were identified from the nine programs. These clusters not only aligned with the hypothesized model, but were generally supported by correlations (using PROC CORR) among program predicted scores as well as program enrollments.
Brian Borchers, PhD, Direct Options
Ashlie Ossege, Direct Options
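A minimal sketch of the score-then-cluster idea described above, for one of the nine programs and with hypothetical variable names (the remaining programs would be scored the same way, for example inside a macro loop, before clustering):

proc logistic data=customers;
   class rate_class / param=ref;
   model enroll_prog1(event='1') = avg_usage tenure_months rate_class;
   output out=scored_prog1 p=p_prog1;      /* predicted probability for program 1 */
run;

/* after merging p_prog1-p_prog9 onto one row per customer */
proc fastclus data=all_scores maxclusters=2 out=clustered;
   var p_prog1-p_prog9;
run;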
This paper provides an overview of analysis of data derived from complex sample designs. General discussion of how and why analysis of complex sample data differs from standard analysis is included. In addition, a variety of applications are presented using PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG, PROC SURVEYLOGISTIC, and PROC SURVEYPHREG, with an emphasis on correct usage and interpretation of results.
Patricia Berglund, University of Michigan
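Two minimal, hypothetical examples of the design-based syntax pattern the paper covers (the NHANES-style strata, cluster, and weight variable names are used purely for illustration):

proc surveymeans data=survey_data mean stderr;
   strata sdmvstra;
   cluster sdmvpsu;
   weight wtmec2yr;
   var bmi;
run;

proc surveylogistic data=survey_data;
   strata sdmvstra;
   cluster sdmvpsu;
   weight wtmec2yr;
   class gender / param=ref;
   model diabetes(event='1') = age gender bmi;
run;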
This presentation provides an overview of the advancement of analytics in National Hockey League (NHL) hockey, including how it applies to areas such as player performance, coaching, and injuries. The speaker discusses his analysis on predicting concussions that was featured in the New York Times, as well as other examples of statistical analysis in hockey.
Peter Tanner, Capital One
Approximately 80% of world trade at present uses the seaways, with around 110,000 merchant vessels and 1.25 million seafarers moving almost 6 billion tons of goods every year. Marine piracy stands as a serious challenge to sea trade, and understanding how pirate attacks occur is crucial to effectively countering them. Predictive modeling that combines textual data with numeric data provides an effective methodology for deriving insights from both structured and unstructured data. 2,266 text descriptions of pirate incidents that occurred over the past seven years, from 2008 to the second quarter of 2014, were collected from the International Maritime Bureau (IMB) website. Analysis of the textual data using SAS® Enterprise Miner™ 12.3, with the help of concept links, answered questions on certain aspects of pirate activities, such as the following: 1. What arms do pirates use in attacks? 2. How do pirates steal the ships? 3. How do pirates escape after the attacks? 4. What are the reasons for occasional unsuccessful attacks? Topics are extracted from the text descriptions using a Text Topic node, and the varying trends of these topics are analyzed with respect to time. Using the Cluster node, attack descriptions are classified into different categories based on attack style and pirate behavior described by a set of terms. A target variable called Attack Type is derived from the clusters and is combined with other structured input variables such as Ship Type, Status, Region, Part of Day, and Part of Year. A predictive model is built with Attack Type as the target variable and the other structured variables as input predictors, and it is used to predict the possible type of attack given the details of a ship and its travel. Thus, the results of this paper could be very helpful for the shipping industry in becoming more aware of the attack types likely for different vessel types traversing different routes, and in devising counter-strategies to reduce the effects of piracy on crews, vessels, and cargo.
Raghavender Reddy Byreddy, Oklahoma State University
Nitish Byri, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Tejeshwar Gurram, Oklahoma State University
Anvesh Reddy Minukuri, Oklahoma State University
This analysis is based on data for all transactions at four parking meters within a small area in central Copenhagen for a period of four years. The observations show the exact minute parking was bought and the amount of time for which parking was bought in each transaction. These series of at most 80,000 transactions are aggregated to the hour, day, week, and month using PROC TIMESERIES. The aggregated series of parking times and the number of transactions are analyzed for seasonality and interdependence by PROC X12, PROC UCM, and PROC VARMAX.
Anders Milhoj, Copenhagen University
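A minimal sketch of the aggregation step described above, assuming a SAS datetime variable and hypothetical names; the same statement with INTERVAL=DAY, WEEK, or MONTH produces the other aggregation levels:

proc timeseries data=parking_transactions out=hourly;
   id purchase_dt interval=hour accumulate=total setmissing=0;
   var minutes_bought;
run;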
The Ebola virus outbreak is producing some of the most significant and fastest trending news throughout the globe today. There is a lot of buzz surrounding the deadly disease and the drastic consequences that it potentially poses to mankind. Social media provides the basic platforms for millions of people to discuss the issue and allows them to openly voice their opinions. There has been a significant increase in the magnitude of responses all over the world since the death of an Ebola patient in a Dallas, Texas hospital. In this paper, we aim to analyze the overall sentiment that is prevailing in the world of social media. For this, we extracted the live streaming data from Twitter at two different times using the Python scripting language. One instance relates to the period before the death of the patient, and the other relates to the period after the death. We used SAS® Text Miner nodes to parse, filter, and analyze the data and to get a feel for the patterns that exist in the tweets. We then used SAS® Sentiment Analysis Studio to further analyze and predict the sentiment of the Ebola outbreak in the United States. In our results, we found that the issue was not taken very seriously until the death of the Ebola patient in Dallas. After the death, we found that prominent personalities across the globe were talking about the disease and then raised funds to fight it. We are continuing to collect tweets. We analyze the locations of the tweets to produce a heat map that corresponds to the intensity of the varying sentiment across locations.
Dheeraj Jami, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Shivkanth Lanka, Oklahoma State University
Improving tourists' satisfaction and their intention to revisit a festival is an ongoing area of interest to the tourism industry. Many festival organizers strive hard to attract and retain attendees by investing heavily in their marketing and promotion strategies for the festival. To meet this challenge, an advanced analytical model based on a data mining approach is proposed to answer the following research question: What are the most important factors that influence tourists' intentions to revisit the festival site? Cluster analysis, neural networks, decision trees, stepwise regression, polynomial regression, and support vector machines are applied in this study. The main goal is to determine what it takes not only to retain loyal attendees, but also to attract new attendees and encourage them to return to the site.
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Thanathorn Vajirakachorn, Dept. of Tourism Management, School of Business, University of the Thai Chamber of Commerce
With governments and commissions increasingly incentivizing electric utilities to get consumers to save energy, there has been a large increase in the number of energy saving programs. Some are structural, incentivizing consumers to make improvements to their home that result in energy savings. Others, called behavioral programs, are designed to get consumers to change their behavior to save energy. Within behavioral programs, Home Energy Reports are a good method to achieve behavioral savings as well as to educate consumers on structural energy savings. This paper examines the different Home Energy Report communication channels (direct mail and e-mail) and the marketing channel effect on energy savings, using SAS® for linear models. For consumer behavioral change, we often hear these questions: 1) Are people who responded to a direct mail solicitation saving at a higher rate than people who responded to an e-mail solicitation? 1a) Hypothesis: Because e-mail is easy to respond to, the customers who enroll through this channel will exert less effort on the behavior changes that require more time and investment toward energy efficiency, and thus will save less. 2) Does the mode of the ongoing dialog (mail versus e-mail) affect the amount of consumer savings? 2a) Hypothesis: E-mail is more likely to be ignored, and thus these recipients will save less. Because savings is most often calculated by comparing the treatment group to a control group (to account for weather and economic impact over time), and by definition you cannot have a dialog with a control group, the answers are not a simple PROC FREQ away. Also, people who responded to mail look very different demographically from people who responded to e-mail. So, is the driver of savings differences the channel, or is it the demographics of the customers who happen to use those channels? This study used clustering (PROC FASTCLUS) to segment the consumers by mail versus e-mail and append cluster assignments to the respective control group. It also used difference-in-differences (DID) analysis as well as billing analysis (PROC GLM) to calculate the savings of these groups.
Angela Wells, Direct Options
Ashlie Ossege, Direct Options
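A minimal sketch of a difference-in-differences setup in PROC GLM, where the savings estimate comes from the treatment-by-period interaction (variable names are hypothetical, and the authors' billing analysis is more involved):

proc glm data=usage;
   class treatment post;                  /* treatment: report recipient; post: post-launch period */
   model kwh = treatment post treatment*post;
   lsmeans treatment*post;
run; quit;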
In observational data analyses, it is often helpful to use patients as their own controls by comparing their outcomes before and after some signal event, such as the initiation of a new therapy. It might be useful to have a control group that does not have the event but that is instead evaluated before and after some arbitrary point in time, such as their birthday. In this context, the change over time is a continuous outcome that can be modeled as a (possibly discontinuous) line, with the same or different slope before and after the event. Mixed models can be used to estimate random slopes and intercepts and compare patients between groups. A specific example published in a peer-reviewed journal is presented.
David Pasta, ICON Clinical Research
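A minimal sketch of the kind of piecewise mixed model described above, with hypothetical names; time_after equals 0 before the signal event and the elapsed time afterward, so its coefficients capture the change in slope:

proc mixed data=outcomes;
   class patient_id group;
   model y = group time time*group time_after time_after*group / solution;
   random intercept time / subject=patient_id type=un;
run;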
Kaiser Permanente Northwest is contractually obligated for regulatory submissions to Oregon Health Authority, Health Share of Oregon, and Molina Healthcare in Washington. The submissions consist of Medicaid Encounter data for medical and pharmacy claims. SAS® programs are used to extract claims data from Kaiser's claims data warehouse, process the data, and produce output files in HIPAA ASC X12 and NCPDP format. Prior to April 2014, programs were written in SAS® 8.2 running on a VAX server. Several key drivers resulted in the conversion of the existing system to SAS® Enterprise Guide® 5.1 running on UNIX. These drivers were: the need to have a scalable system in preparation for the Affordable Care Act (ACA); performance issues with the existing system; incomplete process reporting and notification to business owners; and a highly manual, labor-intensive process of running individual programs. The upgraded system addressed these drivers. The estimated cost reduction was from $1.30 per reported encounter to $0.13 per encounter. The converted system provides for better preparedness for the ACA. One expected result of ACA is significant Medicaid membership growth. The program has already increased in size by 50% in the preceding 12 months. The updated system allows for the expected growth in membership.
Eric Sather, Kaiser Permanente
Creating DataFlux® jobs that can be executed from job scheduling software can be challenging. This presentation guides participants through the creation of a template job that accepts an email distribution list, subject, email verbiage, and attachment file name as macro variables. It also demonstrates how to call this job from a command line.
Jeanne Estridge, Sinclair Community College
This session is an introduction to predictive analytics and causal analytics in the context of improving outcomes. The session covers the following topics: 1) Basic predictive analytics vs. causal analytics; 2) The causal analytics framework; 3) Testing whether the outcomes improve because of an intervention; 4) Targeting the cases that have the best improvement in outcomes because of an intervention; and 5) Tweaking an intervention in a way that improves outcomes further.
Jason Pieratt, Humana
Data analysis begins with cleaning up data, calculating descriptive statistics, and examining variable distributions. Before more rigorous statistical analysis begins, many statisticians perform basic inferential statistical tests, such as chi-square and t tests, to assess unadjusted associations. These tests help guide the direction of the more rigorous analysis. We show how to perform chi-square and t tests, how to interpret the output, and where to look for the association or difference based on the hypothesis being tested. We then propose next steps for further analysis using example data.
Maribeth Johnson, Georgia Regents University
Jennifer Waller, Georgia Regents University
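Two minimal, hypothetical examples of the tests discussed above:

proc freq data=study;
   tables exposure*outcome / chisq expected;   /* unadjusted association */
run;

proc ttest data=study;
   class treatment;                            /* two-level grouping variable */
   var change_score;
run;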
The impact of price on brand sales is not always linear or independent of other brand prices. We demonstrate, using sales information and SAS® Enterprise Miner, how to uncover relative price bands where prices might be increased without losing market share or decreased slightly to gain share.
Ryan Carr, SAS
Charles Park, Lenovo
Many organizations need to forecast large numbers of time series that are discretely valued. These series, called count series, fall approximately between continuously valued time series, for which there are many forecasting techniques (ARIMA, UCM, ESM, and others), and intermittent time series, for which there are a few forecasting techniques (Croston's method and others). This paper proposes a technique for large-scale automatic count series forecasting and uses SAS® Forecast Server and SAS/ETS® software to demonstrate this technique.
Michael Leonard, SAS
Data sharing through healthcare collaboratives and national registries creates opportunities for secondary data analysis projects. These initiatives provide data for quality comparisons as well as endless research opportunities to external researchers across the country. The possibilities are bountiful when you join data from diverse organizations and look for common themes related to illnesses and patient outcomes. With these great opportunities comes great pain for the data analysts and health services researchers tasked with compiling these data sets according to specifications. Patient care data is complex and, particularly at large healthcare systems, might be managed across multiple electronic health record (EHR) systems. Matching data from separate EHR systems while simultaneously ensuring the integrity of the details of each care visit is challenging. This paper demonstrates how data management personnel can use traditional SAS PROCs in new and inventive ways to compile, clean, and complete data sets for submission to healthcare collaboratives and other data sharing initiatives. Traditional data matching methods such as SPEDIS are combined with iterative SQL joins that use the SAS® functions INDEX, COMPRESS, CATX, and SUBSTR to get the most out of complex patient and physician name matches. Recoding, correcting missing items, and formatting data can be achieved efficiently by using traditional tools such as the MAX and FIND functions and PROC FORMAT.
Gabriela Cantu, Baylor Scott &White Health
Christopher Klekar, Baylor Scott and White Health
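A minimal sketch of one fuzzy-match pass of the kind described above, with hypothetical table and column names and an arbitrary SPEDIS cutoff:

proc sql;
   create table possible_matches as
   select a.patient_key, b.mrn,
          spedis(compress(upcase(a.pat_name)), compress(upcase(b.pat_name))) as name_dist
   from ehr_one as a, ehr_two as b
   where substr(upcase(a.pat_name), 1, 1) = substr(upcase(b.pat_name), 1, 1)  /* crude blocking */
     and calculated name_dist <= 15;
quit;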
Graduate students often need to explore data and summarize multiple statistical models into tables for a dissertation. The challenges of data summarization include coding multiple, similar statistical models, and summarizing these models into meaningful tables for review. The default method is to type (or copy and paste) results into tables. This often takes longer than creating and running the analyses. Students might spend hours creating tables, only to have to start over when a change or correction in the underlying data requires the analyses to be updated. This paper gives graduate students the tools to efficiently summarize the results of statistical models in tables. These tools include a macro-based SAS/STAT® analysis and ODS OUTPUT statement to summarize statistics into meaningful tables. Specifically, we summarize PROC GLM and PROC LOGISTIC output. We convert an analysis of hospital-acquired delirium from hundreds of pages of output into three formatted Microsoft Excel files. This paper is appropriate for users familiar with basic macro language.
Elisa Priest, Texas A&M University Health Science Center
Ashley Collinsworth, Baylor Scott & White Health/Tulane University
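A minimal sketch of the ODS OUTPUT pattern the paper builds on, with hypothetical variable names (the export step assumes SAS/ACCESS Interface to PC Files is licensed):

ods output ParameterEstimates=pe_glm;
proc glm data=dissertation;
   class delirium;
   model los = delirium age apache_ii / solution;
run; quit;

ods output ParameterEstimates=pe_logit OddsRatios=or_logit;
proc logistic data=dissertation;
   model delirium(event='1') = age apache_ii;
run;

proc export data=or_logit outfile='odds_ratios.xlsx' dbms=xlsx replace;
run;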
Learn how leading retailers are developing key findings in digital data to be leveraged across marketing, merchandising, and IT.
Rachel Thompson, SAS
Data scientists and analytic practitioners have become obsessed with quantifying the unknown. Through text mining third-person posthumous narratives in SAS® Enterprise Miner™ 12.1, we measured tangible aspects of personalities based on the broadly accepted big-five characteristics: extraversion, agreeableness, conscientiousness, neuroticism, and openness. These measurable attributes are linked to common descriptive terms used throughout our data to establish statistical relationships. The data set contains over 1,000 obituaries from newspapers throughout the United States, with individuals who vary in age, gender, demographic, and socio-economic circumstances. In our study, we leveraged existing literature to build the ontology used in the analysis. This literature suggests that a third person's perspective gives insight into one's personality, solidifying the use of obituaries as a source for analysis. We statistically linked target topics such as career, education, religion, art, and family to the five characteristics. With these taxonomies, we developed multivariate models in order to assign scores to predict an individual's personality type. With a trained model, this study has implications for predicting an individual's personality, allowing for better decisions on human capital deployment. Even outside the traditional application of personality assessment for organizational behavior, the methods used to extract intangible characteristics from text enables us to identify valuable information across multiple industries and disciplines.
Mark Schneider, Deloitte & Touche
Andrew Van Der Werff, Deloitte & Touche, LLP
Do you need to deliver business insight and analytics to support decision-making? Using SAS® Enterprise Guide®, you can access the full power of SAS® for analytics, without needing to learn the details of SAS programming. This presentation focuses on the following uses of SAS Enterprise Guide: exploring and understanding (getting a feel for your data and for its issues and anomalies); visualizing (looking at the relationships, trends, and surprises); consolidating (starting to piece together the story); and presenting (building the insight and analytics into a presentation using SAS Enterprise Guide).
Marje Fecht, Prowerk Consulting
Proper management of master data is a critical component of any enterprise information system. However, effective master data management (MDM) requires that both IT and Business understand the life cycle of master data and the fundamental principles of entity resolution (ER). This presentation provides a high-level overview of current practices in data matching, record linking, and entity information life cycle management that are foundational to building an effective strategy to improve data integration and MDM. Particular areas of focus are: 1) The need for ongoing ER analytics--the systematic and quantitative measurement of ER performance; 2) Investing in clerical review and asserted resolution for continuous improvement; and 3) Addressing the large-scale ER challenge through distributed processing.
John Talburt, Black Oak Analytics, Inc
Medicaid programs are the second largest line item in each state's budget. In 2012, they contributed $421.2 billion, or 15 percent of total national healthcare expenditures. With US health care reform at full speed, state Medicaid programs must establish new initiatives that will reduce the cost of healthcare, while providing coordinated, quality care to the nation's most vulnerable populations. This paper discusses how states can implement innovative reform through the use of data analytics. It explains how to establish a statewide health analytics framework that can create novel analyses of health data and improve the health of communities. With solutions such as SAS® Claims Analytics, SAS® Episode Analytics, and SAS® Fraud Framework, state Medicaid programs can transform the way they make business and clinical decisions. Moreover, new payment structures and delivery models can be successfully supported through the use of healthcare analytics. A statewide health analytics framework can support initiatives such as bundled and episodic payments, utilization studies, accountable care organizations, and all-payer claims databases. Furthermore, integrating health data into a single analytics framework can provide the flexibility to support a unique analysis that each state can customize with multiple solutions and multiple sources of data. Establishing a health analytics framework can significantly improve the efficiency and effectiveness of state health programs and bend the healthcare cost curve.
Krisa Tailor, SAS
Jeremy Racine
SAS® Enterprise Guide® is a great interface for businesses running SAS® in a shared server environment. However, interacting with the shared server outside of SAS can require costly third-party software and knowledge of specific server programming languages. This can create a barrier between the SAS program and the server, which can be frustrating for even the best SAS programmers. This paper reviews the X and SYSTASK commands and creates a template of SAS code to pass commands from SAS to the server. By writing the server log to a text file, we demonstrate how to display critical server information in the code results. Using macros and the prompt functionality of SAS Enterprise Guide, we form stored procedures, allowing SAS users of all skill levels to interact with the server environment. These stored procedures can improve programming efficiency by providing a quick in-program solution to complete common server tasks such as copying folders or changing file permissions. They might also reduce the need for third-party programs to communicate with the server, which could potentially reduce software costs.
Cody Murray, Medica Health Plans
Chad Stegeman, Medica
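A minimal sketch of the X and SYSTASK pattern described above, on a UNIX server with hypothetical paths:

x 'ls -l /sasdata/projects > /sasdata/projects/dir_listing.txt 2>&1';

data _null_;                               /* echo the captured server output into the results */
   infile '/sasdata/projects/dir_listing.txt';
   file print;
   input;
   put _infile_;
run;

systask command "cp -r /sasdata/projects/current /sasdata/projects/archive"
        wait status=copy_rc;
%put NOTE: server copy finished with status &copy_rc;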
Data quality is now more important than ever. According to Gartner (2011), poor data quality is the primary reason why 40% of all business initiatives fail to achieve their targeted benefits. In response, Deloitte Belgium has created an agile data quality framework using SAS® Data Management to rapidly identify and resolve the root causes of data quality issues and jump-start business initiatives, especially data migration initiatives. The approach uses both standard SAS Data Management functionality (such as standardization and parsing) and advanced features (such as macros, dynamic profiling in deployment mode, and extraction of the profiling results), allowing the framework to be agile and flexible and to maximize the reusability of specific components built in SAS Data Management.
Yves Wouters, Deloitte
Valérie Witters, Deloitte
Quality measurement is increasingly important in the health-care sphere for both performance optimization and reimbursement. Treatment of chronic conditions is a key area of quality measurement. However, medication compendiums change frequently, and health-care providers often free-text medications into a patient's record. Manually reviewing a complete medications database is time consuming. In order to build a robust medications list, we matched a pharmacist-generated list of categorized medications to a raw medications database that contained names, name-dose combinations, and misspellings. Our matching relies on the COMPGED function, which returns a generalized edit distance between two strings. We combined a truncation function and the UPCASE function to optimize the COMPGED comparisons, and manipulating the COMPGED scoring cutoff enabled us to narrow the database list to medications that were relevant to our categories. This process transformed a tedious task for PROC COMPARE or an Excel macro into a quick and efficient method of matching. The task of sorting through relevant matches was still conducted manually, but the time required to do so was significantly decreased by the fuzzy matching in our application of COMPGED.
Arti Virkud, NYC Department of Health
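A minimal sketch of the truncate, uppercase, and score approach described above, with hypothetical table names and an arbitrary cutoff:

proc sql;
   create table candidate_matches as
   select r.raw_med_name, c.clean_med_name, c.med_category,
          compged(upcase(substr(r.raw_med_name, 1, 10)),
                  upcase(substr(c.clean_med_name, 1, 10))) as ged_score
   from raw_meds as r, categorized_meds as c
   where calculated ged_score <= 200
   order by r.raw_med_name, ged_score;
quit;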
This presentation provides a brief introduction to logistic regression analysis in SAS. Learn differences between Linear Regression and Logistic Regression, including ordinary least squares versus maximum likelihood estimation. Learn to: understand LOGISTIC procedure syntax, use continuous and categorical predictors, and interpret output from ODS Graphics.
Danny Modlin, SAS
For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now they are gaining favor in business statistics for giving better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn how predictions based on a mixed model differ from predictions in ordinary regression, and see examples of mixed models with business data.
Catherine Truxillo, SAS
Text data constitutes more than half of the unstructured data held in organizations. Buried within the narrative of customer inquiries, the pages of research reports, and the notes in servicing transactions are the details that describe concerns, ideas, and opportunities. The manual effort historically needed to develop a training corpus is no longer required, making it simpler to gain insight buried in unstructured text. With the ease of machine learning refined with the specificity of linguistic rules, SAS® Contextual Analysis helps analysts identify and evaluate the meaning of the electronic written word. From a single point-and-click GUI, the process of developing text models is guided and visually intuitive. This presentation walks through the text model development process with SAS Contextual Analysis. The results are in SAS format, ready for text-based insights to be used in any other SAS application.
George Fernandez, SAS
SAS/ETS provides many tools to improve the productivity of the analyst who works with time series data. This tutorial will take an analyst through the process of turning transaction-level data into a time series. The session will then cover some basic forecasting techniques that use past fluctuations to predict future events. We will then extend this modeling technique to include explanatory factors in the prediction equation.
Kenneth Sanford, SAS
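A minimal sketch of the transaction-to-forecast flow the tutorial walks through, with hypothetical names; explanatory factors could then be added with, for example, an ARIMAX-style model in PROC ARIMA:

proc timeseries data=transactions out=monthly;
   id sale_date interval=month accumulate=total;
   var amount;
run;

proc esm data=monthly outfor=forecasts lead=12;
   id sale_date interval=month;
   forecast amount / model=addwinters;     /* additive Winters exponential smoothing */
run;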
Learning the Graph Template Language (GTL) might seem like a daunting task. However, creating customized graphics with SAS® is quite easy using many of the tools offered with Base SAS® software. The point-and-click interface of ODS Graphics Designer provides us with a tool that can be used to generate highly polished graphics and to store the GTL-based code that creates them. This opens the door for users who would like to create canned graphics that can be used on various data sources, variables, and variable types. In this hands-on training, we explore the use of ODS Graphics Designer to create sophisticated graphics and to save the template code. We then discuss modifications using basic SAS macros in order to create stored graphics code that is flexible enough to accommodate a wide variety of situations.
Rebecca Ottesen, City of Hope and Cal Poly SLO
Leanne Goldstein, City of Hope
Predicting the profitability of future sales orders in a price-sensitive, highly competitive make-to-order market can create a competitive advantage for an organization. Order size and specifications vary from order to order and customer to customer, and might or might not be repeated. While it is the intent of the sales groups to take orders for a profit, because of the volatility of steel prices and the competitive nature of the markets, gross margins can range dramatically from one order to the next and in some cases can be negative. Understanding the key factors affecting the gross margin percent and their impact can help the organization reduce the risk of unprofitable orders and at the same time improve its decision-making ability in market planning and forecasting. The objective of this paper is to identify, from multiple predictive models built in SAS® Enterprise Miner™, the model that most accurately predicts the gross margin percent for future orders. The data used for the project consisted of over 30,000 transactional records and 33 input variables. The sales records were collected from multiple manufacturing plants of the steel manufacturing company. Variables such as order quantity, customer location, sales group, and others were used to build the predictive models. The target variable, gross margin percent, is the net profit on the sale, considering all factors such as labor cost, cost of raw materials, and so on. The Model Comparison node of SAS Enterprise Miner was used to determine the best among different variations of regression models, decision trees, and neural networks, as well as ensemble models. Average squared error was used as the fit statistic to evaluate each model's performance. Based on the preliminary model analysis, the ensemble model outperforms the other models with the lowest average squared error.
Kushal Kathed, Oklahoma State University
Patti Jordan
Ayush Priyadarshi, Oklahoma State University
Graduate students encounter many challenges when conducting health services research using real world data obtained from electronic health records (EHRs). These challenges include cleaning and sorting data, summarizing and identifying present-on-admission diagnosis codes, identifying appropriate metrics for risk-adjustment, and determining the effectiveness and cost effectiveness of treatments. In addition, outcome variables commonly used in health service research are not normally distributed. This necessitates the use of nonparametric methods in statistical analyses. This paper provides graduate students with the basic tools for the conduct of health services research with EHR data. We will examine SAS® tools and step-by-step approaches used in an analysis of the effectiveness and cost-effectiveness of the ABCDE (Awakening and Breathing Coordination, Delirium monitoring/management, and Early exercise/mobility) bundle in improving outcomes for intensive care unit (ICU) patients. These tools include the following: (1) ARRAYS; (2) lookup tables; (3) LAG functions; (4) PROC TABULATE; (5) recycled predictions; and (6) bootstrapping. We will discuss challenges and lessons learned in working with data obtained from the EHR. This content is appropriate for beginning SAS users.
Ashley Collinsworth, Baylor Scott & White Health/Tulane University
Elisa Priest, Texas A&M University Health Science Center
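Two of the listed techniques in miniature, on hypothetical ICU data (the diagnosis codes and variable names are made up):

data icu_flags;
   set icu_stays;                            /* dx1-dx10 and poa1-poa10 flags */
   array dx{10} $ dx1-dx10;
   array poa{10} $ poa1-poa10;
   delirium_poa = 0;
   do i = 1 to 10;
      if dx{i} in ('F05', '2930') and poa{i} = 'Y' then delirium_poa = 1;
   end;
   drop i;
run;

data score_change;
   set icu_daily;
   by patient_id icu_day;
   prev_score = lag(sofa_score);
   if first.patient_id then prev_score = .;  /* do not carry values across patients */
   score_change = sofa_score - prev_score;
run;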
With the constant need to inform researchers about neighborhood health data, the Santa Clara County Health Department created socio-demographic and health profiles for 109 neighborhoods in the county. Data was pulled from many public and county data sets, compiled, analyzed, and automated using SAS®. With over 60 indicators and 109 profiles, an efficient set of macros was used to automate the calculation of percentages, rates, and mean statistics for all of the indicators. Macros were also used to automate the assignment of individual census tracts to predefined neighborhoods and so avoid data entry errors. Simple SQL procedures were used to calculate and format percentages within the macros, and output was produced with the Output Delivery System (ODS). This output was exported to Microsoft Excel, which was used to create a sortable database for end users to compare cities and/or neighborhoods. Finally, the automated SAS output was used to map the demographic data using geographic information system (GIS) software at three geographies: city, neighborhood, and census tract. This presentation describes the use of simple macros and SAS procedures to reduce the resources and time spent on checking data for quality assurance purposes. It also highlights the simple use of ODS to export data to an Excel file, which was used to mail merge the data into 109 unique profiles. The presentation is aimed at intermediate SAS users at local and state health departments who might be interested in finding an efficient way to run and present health statistics given limited staff and resources.
Roshni Shah, Santa Clara County
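A stripped-down version of the macro idea described above, with hypothetical data set, indicator, and geography names:

%macro indicator_pct(var=, label=);
   proc sql;
      create table _tmp as
      select neighborhood,
             "&label" as indicator length=60,
             sum(&var = 1) as numerator,
             count(*) as denominator,
             calculated numerator / calculated denominator * 100 as pct format=6.1
      from tract_level
      group by neighborhood;
   quit;
   proc append base=all_indicators data=_tmp force;
   run;
%mend indicator_pct;

%indicator_pct(var=below_poverty, label=Percent below federal poverty level);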
Is it a better business decision to determine the profitability of all business units/kiosks and then decide to prune the nonprofitable ones? Or does model performance improve if we first find the units that meet the break-even point and then try to calculate their profits? In our project, we used a two-stage regression process because of the highly skewed distribution of the variables. First, we performed logistic regression to predict which kiosks would be profitable. Then, we used linear regression to predict the average monthly revenue at each kiosk. We used SAS® Enterprise Guide® and SAS® Enterprise Miner™ for the modeling process. The linear regression model is much more effective at predicting the target variable for profitable kiosks than for unprofitable ones, and the two-phase regression model seemed to perform better than a single linear regression, particularly when the target variable has too many levels. In real-life situations, the dependent and independent variables can have highly skewed distributions, and two-phase regression can help improve model performance and accuracy. Some results: The logistic regression model has an overall accuracy of 82.9%, sensitivity of 92.6%, and specificity of 61.1%, with comparable figures for the training data set of 81.8%, 90.7%, and 63.8%, respectively, indicating that it predicts the profitable kiosks consistently and reasonably well. For the linear regression model, the mean absolute percentage error (MAPE) of the predicted values (not log-transformed) against the actual values of the target is 7.2% in the training data set for kiosks that earn more than $350, versus -102% for kiosks that earn less than $350; for the validation data set, the corresponding MAPEs are 7.6% and -142%, respectively. In other words, average monthly revenue is predicted much better for kiosks earning above the $350 threshold--that is, for kiosks with a flag variable of 1. The model predicts the target variable with lower APE for higher values of the target in both the training data set and the entire data set. In fact, if the threshold value for the kiosks is moved to, say, $500, the predictive power of the model in terms of APE increases substantially. The validation data set (Selection Indicator=0) has fewer data points, and, therefore, the contrast in APEs is higher and more varied.
Shrey Tandon, Sobeys West
Since the financial crisis of 2008, banks and bank holding companies in the United States have faced increased regulation. One of the recent changes to these regulations is known as the Comprehensive Capital Analysis and Review (CCAR). At the core of these new regulations, specifically under the Dodd-Frank Wall Street Reform and Consumer Protection Act and the stress tests it mandates, are a series of what-if or scenario analyses requirements that involve a number of scenarios provided by the Federal Reserve. This paper proposes frequentist and Bayesian time series methods that solve this stress testing problem using a highly practical top-down approach. The paper focuses on the value of using univariate time series methods, as well as the methodology behind these models.
Kenneth Sanford, SAS
Christian Macaro, SAS
Generalized linear models are highly useful statistical tools in a broad array of business applications and scientific fields. How can you select a good model when numerous models that have different regression effects are possible? The HPGENSELECT procedure, which was introduced in SAS/STAT® 12.3, provides forward, backward, and stepwise model selection for generalized linear models. In SAS/STAT 14.1, the HPGENSELECT procedure also provides the LASSO method for model selection. You can specify common distributions in the family of generalized linear models, such as the Poisson, binomial, and multinomial distributions. You can also specify the Tweedie distribution, which is important in ratemaking by the insurance industry and in scientific applications. You can run the HPGENSELECT procedure in single-machine mode on the server where SAS/STAT is installed. With a separate license for SAS® High-Performance Statistics, you can also run the procedure in distributed mode on a cluster of machines that distribute the data and the computations. This paper shows you how to use the HPGENSELECT procedure both for model selection and for fitting a single model. The paper also explains the differences between the HPGENSELECT procedure and the GENMOD procedure.
Gordon Johnston, SAS
Bob Rodriguez, SAS
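A minimal, hypothetical example of the procedure in selection mode; as the abstract notes, SELECTION METHOD=LASSO is also available in SAS/STAT 14.1:

proc hpgenselect data=policies;
   class region;
   model pure_premium = region vehicle_age driver_age annual_miles
         / distribution=tweedie link=log;
   selection method=stepwise details=all;
run;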
Examples include: how to join when your data is perfect; how to join when your data does not match and you have to manipulate it in order to join; how to create a subquery; and how to use a subquery.
Anita Measey, Bank of Montreal, Risk Capital & Stress Testing
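A minimal sketch of the join and subquery patterns listed above, with made-up tables and columns:

proc sql;
   /* join on a cleaned-up key when the raw keys do not match exactly */
   create table joined as
   select a.account_id, a.exposure, b.risk_rating
   from accounts as a
        inner join ratings as b
        on compress(upcase(a.account_id)) = compress(upcase(b.acct_no));

   /* subquery: keep accounts above the portfolio average exposure */
   create table high_exposure as
   select account_id, exposure
   from joined
   where exposure > (select avg(exposure) from joined);
quit;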
There are various economic factors that affect retail sales. One important factor that is expected to correlate with sales is overall customer sentiment toward a brand. In this paper, we analyze how location-specific customer sentiment varies and correlates with sales at retail stores. In our attempt to find any dependency, we used location-specific Twitter feeds related to a national-brand chain retail store and opinion-mined their overall sentiment using SAS® Sentiment Analysis Studio. We estimate the correlation between the opinion index and retail sales within the studied geographic areas. Later in the analysis, using ArcGIS Online from Esri, we assess whether other location-specific variables that could potentially correlate with customer sentiment toward the brand are significant predictors of the brand's retail sales.
Asish Satpathy, University of California, Riverside
Goutam Chakraborty, Oklahoma State University
Tanvi Kode, Oklahoma State University
Automated decision-making systems are now found everywhere, from your bank to your government to your home. For example, when you inquire for a loan through a website, a complex decision process likely runs combinations of statistical models and business rules to make sure you are offered a set of options for tantalizing terms and conditions. To make that happen, analysts diligently strive to encode their complex business logic into these systems. But how do you know if you are making the best possible decisions? How do you know if your decisions conform to your business constraints? For example, you might want to maximize the number of loans that you provide while balancing the risk among different customer categories. Welcome to the world of optimization. SAS® Business Rules Manager and SAS/OR® software can be used together to manage and optimize decisions. This presentation demonstrates how to build business rules and then optimize the rule parameters to maximize the effectiveness of those rules. The end result is more confidence that you are delivering an effective decision-making process.
David Duling, SAS
File management is a tedious process that can be automated by using SAS® to create and execute a Windows command script. The macro in this paper copies files from one location to another, identifies obsolete files by the version number, and then moves them to an archive folder. Assuming that some basic conditions are met, this macro is intended to be easy to use and robust. Windows users who run routine programs for projects with rework might want to consider this solution.
Jason Wachsmuth, Public Policy Center at The University of Iowa
With the move to value-based benefit and reimbursement models, it is essential to quantify the relative cost, quality, and outcome of a service. Accurately measuring the cost and quality of doctors, practices, and health systems is critical when you are developing a tiered network, a shared savings program, or a pay-for-performance incentive. Limitations in claims payment systems require developing methodological and statistical techniques to improve the validity and reliability of providers' scores on cost and quality of care. This talk discusses several key concepts in the development of a measurement system for provider performance, including measure selection, risk adjustment methods, and peer group benchmark development.
Daryl Wansink, Qualmetrix, Inc.
The goal of this session is to describe the whole process of model creation, from the business request through model specification, data preparation, iterative model creation, model tuning, implementation, and model servicing. Each phase consists of several steps, and for each step we describe the main goal, the expected outcome, the tools used, our own SAS code, useful nodes and settings in SAS® Enterprise Miner™, procedures in SAS® Enterprise Guide®, measurement criteria, and expected duration in man-days. For three steps, we also present deeper insights with examples of practical usage, explanations of the code used, settings, and ways of exploring and interpreting the output. During the actual model creation process, we suggest using Microsoft Excel to keep all input metadata along with information about transformations performed in SAS Enterprise Miner. To get information about model results faster, we combine an automatic SAS® code generator implemented in Excel with SAS Enterprise Guide, creating a specific profile of results directly from the output tables of the SAS Enterprise Miner nodes. This paper also presents an example of checking the stability of a binary model over time in SAS Enterprise Guide by measuring the optimal cut-off percentage and lift; these measurements are visualized and automated using our own code. With this methodology, users have direct contact with the transformed data and can analyze and explore any intermediate results. Furthermore, the proposed approach can be used for several types of modeling (for example, binary and nominal predictive models or segmentation models). In general, we summarize our best practices for combining specific procedures performed in SAS Enterprise Guide, SAS Enterprise Miner, and Microsoft Excel to create and interpret models faster and more effectively.
Peter Kertys, VÚB a.s.
This presentation emphasizes use of SAS® 9.4 to perform multiple imputation of missing data using the PROC MI Fully Conditional Specification (FCS) method with subsequent analysis using PROC SURVEYLOGISTIC and PROC MIANALYZE. The data set used is based on a complex sample design. Therefore, the examples correctly incorporate the complex sample features and weights. The demonstration is then repeated in Stata, IVEware, and R for a comparison of major software applications that are capable of multiple imputation using FCS or equivalent methods and subsequent analysis of imputed data sets based on complex sample design data.
Patricia Berglund, University of Michigan
There are times when the objective is to provide a summary table and graph for several quality improvement measures on a single page to allow leadership to monitor the performance of measures over time. The challenges were to decide which SAS® procedures to use, how to integrate multiple SAS procedures to generate a set of plots and summary tables within one page, and how to determine whether to use box plots or series plots of means or medians. We considered the SGPLOT and SGPANEL procedures, and Graph Template Language (GTL). As a result, given the nature of the request, the decision led us to use GTL and the SGRENDER procedure in the %BXPLOT2 macro. For each measure, we used the BOXPLOTPARM statement to display a series of box plots and the BLOCKPLOT statement for a summary table. Then we used the LAYOUT OVERLAY statement to combine the box plots and summary tables on one page. The results display a summary table (BLOCKPLOT) above each box plot series for each measure on a single page. Within each box plot series, there is an overlay of a system-level benchmark value and a series line connecting the median values of each box plot. The BLOCKPLOT contains descriptive statistics per time period illustrated in the associated box plot. The discussion points focus on techniques for nesting the lattice overlay with box plots and BLOCKPLOTs in GTL and some reasons for choosing box plots versus series plots of medians or means.
Greg Stanek, Fannie Mae
Walt Disney World Resort is home to four theme parks, two water parks, five golf courses, 26 owned-and-operated resorts, and hundreds of merchandise and dining experiences. Every year millions of guests stay at Disney resorts to enjoy the Disney Experience. Assigning physical rooms to resort and hotel reservations is a key component to maximizing operational efficiency and guest satisfaction. Solutions can range from automation to optimization programs. The volume of reservations and the variety and uniqueness of guest preferences across the Walt Disney World Resort campus pose an opportunity to solve a number of reasonably difficult room assignment problems by leveraging operations research techniques. For example, a guest might prefer a room with specific bedding and adjacent to certain facilities or amenities. When large groups, families, and friends travel together, they often want to stay near each other using specific room configurations. Rooms might be assigned to reservations in advance and upon request at check-in. Using mathematical programming techniques, the Disney Decision Science team has partnered with the SAS® Advanced Analytics R&D team to create a room assignment optimization model prototype and implement it in SAS/OR®. We describe how this collaborative effort has progressed over the course of several months, discuss some of the approaches that have proven to be productive for modeling and solving this problem, and review selected results.
Haining Yu, Walt Disney Parks & Resorts
Hai Chu, Walt Disney Parks & Resorts
Tianke Feng, Walt Disney Parks & Resorts
Matthew Galati, SAS
Ed Hughes, SAS
Ludwig Kuznia, Walt Disney Parks & Resorts
Rob Pratt, SAS
Faced with diminishing forecast returns from the forecast engine within its existing replenishment application, Tractor Supply Company (TSC) engaged SAS® Institute to deliver a fully integrated forecasting solution that promised a significant improvement in chain-wide forecast accuracy. This paper explores the end-to-end forecast implementation, including the problems faced, the solutions delivered, and the results realized.
Chris Houck, SAS
SAS/QC® provides procedures, such as PROC SHEWHART, to produce control charts with centerlines and control limits. When quality improvement initiatives shift a process out of control in the desired direction, centerlines and control limits need to be recalculated. While this is not a complicated task, producing many charts with multiple centerline shifts can quickly become difficult. This paper illustrates the use of a macro to efficiently compute centerlines and control limits when one or more recalculations are needed for multiple charts.
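The macro itself is not reproduced in the abstract, but the core PROC SHEWHART building block it automates looks roughly like the sketch below: estimate limits from a baseline (pre-improvement) period, store them with OUTLIMITS=, and then apply them to the full series so the post-improvement shift is visible (data set and variable names are hypothetical).

proc shewhart data=baseline;
   xchart value*subgrp / outlimits=ctrl_limits nochart;   /* limits from the baseline period only */
run;

proc shewhart data=measures limits=ctrl_limits;
   xchart value*subgrp;                                   /* chart all data against the stored limits */
run;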
Jesse Pratt, Cincinnati Children's Hospital Medical Center
If your data do not meet the assumptions for a standard parametric test, you might want to consider using a permutation test. By randomly shuffling the data and recalculating a test statistic, a permutation test can calculate the probability of getting a value equal to or more extreme than an observed test statistic. With the power of matrices, vectors, functions, and user-defined modules, the SAS/IML® language is an excellent option. This paper covers two examples of permutation tests: one for paired data and another for repeated measures analysis of variance. For those new to SAS/IML® software, this paper offers a basic introduction and examples of how effective it can be.
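A minimal SAS/IML sketch of the paired-data case (hypothetical data set pairs with variables before and after): randomly flip the sign of each paired difference and compare the observed mean difference to the resulting permutation distribution.

proc iml;
   use pairs;
   read all var {before after};
   close pairs;
   d   = after - before;                 /* paired differences        */
   obs = mean(d);                        /* observed test statistic   */
   nperm = 10000;
   call randseed(27182);
   stat  = j(nperm, 1, .);
   signs = j(nrow(d), 1, .);
   do i = 1 to nperm;
      call randgen(signs, "Bernoulli", 0.5);
      stat[i] = mean((2*signs - 1) # d); /* statistic under random sign flips */
   end;
   pval = mean(abs(stat) >= abs(obs));   /* two-sided permutation p-value */
   print obs pval;
quit;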
John Vickery, North Carolina State University
Okay, you've read all the books, manuals, and papers and can produce graphics with SAS/GRAPH® and Output Delivery System (ODS) Graphics with the best of them. But how do you handle the Final Mile problem--getting your images generated in SAS® sized just right and positioned just so in Microsoft Excel? This paper presents a method of doing so that employs SAS Integration Technologies and Excel Visual Basic for Applications (VBA) to produce SAS graphics and automatically embed them in Excel worksheets. This technique might be of interest to all skill levels. It uses Base SAS®, SAS/GRAPH, ODS Graphics, the SAS macro facility, SAS® Integration Technologies, Microsoft Excel, and VBA.
Ted Conway, Self
Investment portfolios and investable indexes determine their holdings according to stated mandate and methodology. Part of that process involves compliance with certain allocation constraints. These constraints are developed internally by portfolio managers and index providers, imposed externally by regulations, or both. An example of the latter is the U.S. Internal Revenue Code (25/50) concentration constraint, which relates to a regulated investment company (RIC). These codes state that at the end of each quarter of a RIC's tax year, the following constraints should be met: 1) No more than 25 percent of the value of the RIC's assets might be invested in a single issuer. 2) The sum of the weights of all issuers representing more than 5 percent of the total assets should not exceed 50 percent of the fund's total assets. While these constraints result in a non-continuous model, compliance with concentration constraints can be formalized by reformulating the model as a series of continuous non-linear optimization problems solved using PROC OPTMODEL. The model and solution are presented in this paper. The approach discussed has been used in constructing investable equity indexes.
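The authors solve the problem as a series of continuous nonlinear programs; purely for illustration, the sketch below shows one straightforward linearized PROC OPTMODEL formulation of the 25/50 rule, with binary flags for issuers above 5 percent, an invented 20-issuer benchmark, and a simple tracking objective. All names and data are hypothetical.

proc optmodel;
   num n = 20;
   set ISSUERS = 1..n;
   num bench {i in ISSUERS} = (n + 1 - i) / (n*(n+1)/2);  /* invented benchmark weights */
   var w {ISSUERS} >= 0 <= 0.25;       /* rule 1: 25% single-issuer cap         */
   var z {ISSUERS} binary;             /* 1 flags an issuer that exceeds 5%     */
   var excess {ISSUERS} >= 0 <= 0.20;  /* weight held above the 5% mark         */
   var dev {ISSUERS} >= 0;             /* absolute deviation from the benchmark */

   min TrackDev = sum {i in ISSUERS} dev[i];
   con DevPos {i in ISSUERS}: dev[i] >= w[i] - bench[i];
   con DevNeg {i in ISSUERS}: dev[i] >= bench[i] - w[i];
   con Budget: sum {i in ISSUERS} w[i] = 1;
   con Flag {i in ISSUERS}: w[i] <= 0.05 + 0.20*z[i];      /* w > 5% forces z = 1 */
   con ExcessDef {i in ISSUERS}: excess[i] >= w[i] - 0.05;
   con ExcessCap {i in ISSUERS}: excess[i] <= 0.20*z[i];
   con Concentration:                                      /* rule 2: 50% cap     */
      sum {i in ISSUERS} (0.05*z[i] + excess[i]) <= 0.50;

   solve with milp;
   print w z;
quit;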
Taras Zlupko, CRSP, University of Chicago
Robert Spatz
SAS® Simulation Studio, a component of SAS/OR® software for Microsoft Windows environments, provides powerful and versatile capabilities for building, executing, and analyzing discrete-event simulation models in a graphical environment. Its object-oriented, drag-and-drop modeling makes building and working with simulation models accessible to novice users, and its broad range of model configuration options and advanced capabilities makes SAS Simulation Studio suitable also for sophisticated, detailed simulation modeling and analysis. Although the number of modeling blocks in SAS Simulation Studio is small enough to be manageable, the number of ways in which they can be combined and connected is almost limitless. This paper explores some of the modeling methods and constructs that have proven most useful in practical modeling with SAS Simulation Studio. SAS has worked with customers who have applied SAS Simulation Studio to measure, predict, and improve system performance in many different industries, including banking, public utilities, pharmaceuticals, manufacturing, prisons, hospitals, and insurance. This paper looks at some discrete-event simulation modeling needs that arise in specific settings and some that have broader applicability, and it considers the ways in which SAS Simulation Studio modeling can meet those needs.
Ed Hughes, SAS
Emily Lada, SAS
Predictions, including regressions and classifications, are the predominant focus of many statistical and machine-learning models. However, in the era of big data, a predictive modeling process contains more than just making the final predictions. For example, a large collection of data often represents a set of small, heterogeneous populations. Identification of these subgroups is therefore an important step in predictive modeling. In addition, big data sets are often complex, exhibiting high dimensionality. Consequently, variable selection, transformation, and outlier detection are integral steps. This paper provides working examples of these critical stages using SAS® Visual Statistics, including data segmentation (supervised and unsupervised), variable transformation, outlier detection, and filtering, in addition to building the final predictive model using methodology such as linear regressions, decision trees, and logistic regressions. The illustration data were collected from vehicle emission testing results from 2010 to 2014.
Xiangxiang Meng, SAS
Jennifer Ames, SAS
Wayne Thompson, SAS
This paper presents a methodology developed to define and prioritize feeders with the least satisfactory performances for continuity of energy supply, in order to obtain an efficiency ranking that supports a decision-making process regarding investments to be implemented. Data Envelopment Analysis (DEA) was the basis for the development of this methodology, in which the input-oriented model with variable returns to scale was adopted. To perform the analysis of the feeders, data from the utility geographic information system (GIS) and from the interruption control system was exported to SAS® Enterprise Guide®, where data manipulation was possible. Different continuity variables and physical-electrical parameters were consolidated for each feeder for the years 2011 to 2013. They were separated according to the geographical regions of the concession area, according to their location (urban or rural), and then grouped by physical similarity. Results showed that 56.8% of the feeders could be considered as efficient, based on the continuity of the service. Furthermore, the results enable identification of the assets with the most critical performance and their benchmarks, and the definition of preliminary goals to reach efficiency.
Victor Henrique de Oliveira, Cemig
Iguatinan Monteiro, CEMIG
Dynamic pricing is a real-time strategy where corporations attempt to alter prices based on varying market demand. The hospitality industry has been doing this for quite a while, altering prices significantly during the summer months or weekends when demand for rooms is at a premium. In recent years, the sports industry has started to catch on to this trend, especially within Major League Baseball (MLB). The purpose of this paper is to explore the methodology of applying this type of pricing to the hockey ticketing arena.
Christopher Jones, Deloitte Consulting
Sabah Sadiq, Deloitte Consulting
Jing Zhao, Deloitte Consulting LLP
One of the hallmarks of a good or great SAS® program is that it requires only a minimum of upkeep. Especially for code that produces reports on a regular basis, it is preferable to minimize user and programmer input and instead have the input data drive the outputs of a program. Data-driven SAS programs are more efficient and reliable, require less hardcoding, and result in less downtime and fewer user complaints. This paper reviews three ways of building a SAS program to create regular Microsoft Excel reports; one method using hardcoded variables, another using SAS keyword macros, and the last using metadata to drive the reports.
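A minimal sketch of the metadata-driven idea (all names and paths hypothetical): a small control table lists what to export and where, and CALL EXECUTE generates one PROC EXPORT step per row, so adding a report means adding a row rather than editing code.

data report_meta;
   length dsname $41 outfile $200 sheet $31;
   input dsname $ outfile $ sheet $;
   datalines;
sashelp.class C:\reports\class.xlsx Students
sashelp.cars  C:\reports\cars.xlsx  Vehicles
;

data _null_;
   set report_meta;
   length code $500;
   code = catt('proc export data=', dsname,
               ' outfile="', outfile, '" dbms=xlsx replace;',
               ' sheet="', sheet, '"; run;');
   call execute(code);     /* one export step generated per metadata row */
run;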
Andrew Clapson, MD Financial Management
Retrospective case-control studies are frequently used to evaluate health care programs when it is not feasible to randomly assign members to a respective cohort. Without randomization, observational studies are more susceptible to selection bias where the characteristics of the enrolled population differ from those of the entire population. When the participant sample is different from the comparison group, the measured outcomes are likely to be biased. Given this issue, this paper discusses how propensity score matching and random effects techniques can be used to reduce the impact selection bias has on observational study outcomes. All results shown are drawn from an ROI analysis using a participant (cases) versus non-participant (controls) observational study design for a fitness reimbursement program aiming to reduce health care expenditures of participating members.
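One possible implementation of the propensity score step, shown with hypothetical variable names; this is not necessarily the authors' exact approach and requires a recent SAS/STAT release that includes PROC PSMATCH.

proc psmatch data=study region=allobs;
   class participant gender;
   psmodel participant(treated='1') = age gender baseline_cost chronic_count;
   match method=greedy(k=1) distance=lps caliper=0.25;   /* 1:1 greedy match on logit of the PS */
   output out(obs=match)=matched matchid=_matchid;
run;

The matched data set can then feed a random-effects outcome model (for example, a mixed model with a random effect for matched pair) to estimate the program's effect on expenditures.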
Jess Navratil-Strawn, Optum
Risk managers and traders know that some knowledge loses its value quickly. Unfortunately, due to the computationally intensive nature of risk, most risk managers use stale data. Knowing your positions and risk intraday can provide immense value. Imagine knowing the portfolio risk impact of a trade before you execute. This paper shows you a path to doing real-time risk analysis leveraging capabilities from SAS® Event Stream Processing Engine and SAS® High-Performance Risk. Event stream processing (ESP) offers the ability to process large amounts of data with high throughput and low latency, including streaming real-time trade data from front-office systems into a centralized risk engine. SAS High-Performance Risk enables robust, complex portfolio valuations and risk calculations quickly and accurately. In this paper, we present techniques and demonstrate concepts that enable you to more efficiently use these capabilities together. We also show techniques for analyzing SAS High-Performance data with SAS® Visual Analytics.
Albert Hopping, SAS
Arvind Kulkarni, SAS
Ling Xiang, SAS
As a consequence of the financial crisis, banks are required to stress test their balance sheet and earnings based on prescribed macroeconomic scenarios. In the US, this exercise is known as the Comprehensive Capital Analysis and Review (CCAR) or Dodd-Frank Act Stress Testing (DFAST). In order to assess capital adequacy under these stress scenarios, banks need a unified view of their projected balance sheet, incomes, and losses. In addition, the bar for these regulatory stress tests is very high regarding governance and overall infrastructure. Regulators and auditors want to ensure that the granularity and quality of data, model methodology, and assumptions reflect the complexity of the banks. This calls for close internal collaboration and information sharing across business lines, risk management, and finance. Currently, this process is managed in an ad hoc, manual fashion. Results are aggregated from various lines of business using spreadsheets and Microsoft SharePoint. Although the spreadsheet option provides flexibility, it brings ambiguity into the process and makes the process error prone and inefficient. This paper introduces a new SAS® stress testing solution that can help banks define, orchestrate and streamline the stress-testing process for easier traceability, auditability, and reproducibility. The integrated platform provides greater control, efficiency, and transparency to the CCAR process. This will enable banks to focus on more value-added analysis such as scenario exploration, sensitivity analysis, capital planning and management, and model dependencies. Lastly, the solution was designed to leverage existing in-house platforms that banks might already have in place.
Wei Chen, SAS
Shannon Clark
Erik Leaver, SAS
John Pechacek
Retailers, like nearly every other consumer business, are under more pressure and competition than ever before. Today's consumer is more connected, informed, and empowered, and the pace of innovation is rapidly changing the way consumers shop. Retailers are expected to sift through and implement digital technology, make sense of their Big Data with analytics, change processes, and cut costs, all at the same time. Today's session, "Retail 2015: The Landscape, Trends, and Technology," covers the major issues retailers are facing today, as well as the business and technology trends that will shape their future.
Lori Schafer
Join us for lunch as we discuss the benefits of being part of the elite group that is SAS Certified Professionals. The SAS Global Certification program has awarded more than 79,000 credentials to SAS users across the globe. Come listen to Terry Barham, Global Certification Manager, give an overview of the SAS Certification program, explain the benefits of becoming SAS certified and discuss exam preparation tips. This session will also include a Q&A section where you can get answers to your SAS Certification questions.
The latest release of SAS/STAT® software brings you powerful techniques that will make a difference in your work, whether your data are massive, missing, or somewhere in the middle. New imputation software for survey data adds to an expansive array of methods in SAS/STAT for handling missing data, as does the production version of the GEE procedure, which provides the weighted generalized estimating equation approach for longitudinal studies with dropouts. An improved quadrature method in the GLIMMIX procedure gives you accelerated performance for certain classes of models. The HPSPLIT procedure provides a rich set of methods for statistical modeling with classification and regression trees, including cross validation and graphical displays. The HPGENSELECT procedure adds support for spline effects and lasso model selection for generalized linear models. And new software implements generalized additive models by using an approach that handles large data easily. Other updates include key functionality for Bayesian analysis and pharmaceutical applications.
Maura Stokes, SAS
Bob Rodriguez, SAS
In today's competitive job market, both recent graduates and experienced professionals are looking for ways to set themselves apart from the crowd. SAS® certification is one way to do that. SAS Institute Inc. offers a range of exams to validate your knowledge level. In writing this paper, we have drawn upon our personal experiences, remarks shared by new and longtime SAS users, and conversations with experts at SAS. We discuss what certification is and why you might want to pursue it. Then we share practical tips you can use to prepare for an exam and do your best on exam day.
Andra Northup, Advanced Analytic Designs, Inc.
Susan Slaughter, Avocet Solutions
The Query Builder in SAS® Enterprise Guide® is an excellent point-and-click tool that generates PROC SQL code and creates queries in SAS®. This hands-on workshop will give an overview of query options, sorting, simple and complex filtering, and joining tables. It is a great workshop for programmers and non-programmers alike.
Jennifer First-Kluge, Systems Seminar Consultants
A good system should embody the following characteristics: planned, maintainable, flexible, simple, accurate, restartable, reliable, reusable, automated, documented, efficient, modular, and validated. This is true of any system, but how to implement this in SAS® Enterprise Guide® is a unique endeavor. We provide a brief overview of these characteristics and then dive deeper into how a SAS Enterprise Guide user should approach developing both ad hoc and production systems.
Steven First, Systems Seminar Consultants
Jennifer First-Kluge, Systems Seminar Consultants
SAS® Model Manager provides an easy way to deploy analytical models into various relational databases or into Hadoop using either scoring functions or the SAS® Embedded Process publish methods. This paper gives a brief introduction of both the SAS Model Manager publishing functionality and the SAS® Scoring Accelerator. It describes the major differences between using scoring functions and the SAS Embedded Process publish methods to publish a model. The paper also explains how to perform in-database processing of a published model by using SAS applications as well as SQL code outside of SAS. In addition to Hadoop, SAS also supports these databases: Teradata, Oracle, Netezza, DB2, and SAP HANA. Examples are provided for publishing a model to a Teradata database and to Hadoop. After reading this paper, you should feel comfortable using a published model in your business environment.
Jifa Wei, SAS
Kristen Aponte, SAS
As a risk management unit, our team has invested countless hours in developing the processes and infrastructure necessary to produce large, multipurpose analytical base tables. These tables serve as the source for our key reporting and loss provisioning activities, the output of which (reports and GL bookings) is disseminated throughout the corporation. Invariably, questions arise and further insight is desired. Traditionally, any inquiries were returned to the original analyst for further investigation. But what if there were a way for the less technical user base to gain insights independently? Now there is, with SAS® Visual Analytics. SAS Visual Analytics is often thought of as a big data tool, and while it is certainly capable in this space, its usefulness in leveraging the value in your existing data sets should not be overlooked. By using these tried-and-true analytical base tables, you are guaranteed to achieve one version of the truth, since traditional reports match perfectly to the data being explored. SAS Visual Analytics enables your organization to share these proven data assets with an entirely new population of data consumers--people with less traditional data skills but with questions that need to be answered. Finally, all this is achieved without any additional data preparation effort and testing. This paper explores our experience with SAS Visual Analytics and the benefits realized.
Shaun Kaufmann, Farm Credit Canada
This workshop provides hands-on experience using SAS® Enterprise Miner. Workshop participants will learn to: open a project, create and explore a data source, build and compare models, and produce and examine score code that can be used for deployment.
Chip Wells, SAS
This workshop provides hands-on experience using SAS® Forecast Server. Workshop participants will learn to: create a project with a hierarchy, generate multiple forecasts automatically, evaluate forecast accuracy, and build a custom model.
Catherine Truxillo, SAS
George Fernandez, SAS
Terry Woodfield, SAS
This workshop provides hands-on experience with SAS® Data Loader for Hadoop. Workshop participants will configure SAS Data Loader for Hadoop and use various directives inside SAS Data Loader for Hadoop to interact with data in the Hadoop cluster.
Kari Richardson, SAS
This workshop provides hands-on experience with SAS® Visual Analytics. Workshop participants will explore data with SAS® Visual Analytics Explorer and design reports with SAS® Visual Analytics Designer.
Nicole Ball, SAS
This workshop provides hands-on experience with SAS® Visual Statistics. Workshop participants will learn to: move between the Visual Analytics Explorer interface and Visual Statistics, fit automatic statistical models, create exploratory statistical analysis, compare models using a variety of metrics, and create score code.
Catherine Truxillo, SAS
Xiangxiang Meng, SAS
Mike Jenista, SAS
This workshop provides hands-on experience performing statistical analysis with SAS University Edition and SAS Studio. Workshop participants will learn to: install and set up the software, perform basic statistical analyses using tasks, connect folders to SAS Studio for data access and results storage, invoke code snippets to import CSV data into SAS, and create a code snippet.
Danny Modlin, SAS
This workshop provides hands-on experience using SAS® Text Miner. Workshop participants will learn to: read a collection of text documents and convert them for use by SAS Text Miner using the Text Import node, use the simple query language supported by the Text Filter node to extract information from a collection of documents, use the Text Topic node to identify the dominant themes and concepts in a collection of documents, and use the Text Rule Builder node to classify documents having pre-assigned categories.
Terry Woodfield, SAS
Several U.S. Federal agencies conduct national surveys to monitor health status of residents. Many of these agencies release their survey data to the public. Investigators might be able to address their research objectives by conducting secondary statistical analyses with these available data sources. This paper describes the steps in using the SAS SURVEY procedures to analyze publicly released data from surveys that use probability sampling to make statistical inference to a carefully defined population of elements (the target population).
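A minimal sketch of design-based estimation with a public-use file; the stratum, cluster (PSU), and weight variable names below are placeholders in the style of NHANES and must be replaced with the design variables documented for the survey being analyzed.

proc surveymeans data=pubfile mean stderr clm;
   strata  sdmvstra;
   cluster sdmvpsu;
   weight  wtmec2yr;
   var     bmi sbp;
run;

proc surveyfreq data=pubfile;
   strata  sdmvstra;
   cluster sdmvpsu;
   weight  wtmec2yr;
   tables  smoker*diabetes / row chisq;
run;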
Donna Brogan, Emory University, Atlanta, GA
Product affinity segmentation is a powerful technique for marketers and sales professionals to gain a good understanding of customers' needs, preferences, and purchase behavior. Performing product affinity segmentation is quite challenging in practice because product-level data usually has high skewness, high kurtosis, and a large percentage of zero values. The Doughnut Clustering method has been shown to be effective using real data, and was presented at SAS® Global Forum 2013 in the paper titled "Product Affinity Segmentation Using the Doughnut Clustering Approach." However, the Doughnut Clustering method is not a panacea for addressing the product affinity segmentation problem. There is a clear need for a comprehensive evaluation of this method in order to develop generic guidelines for practitioners about when to apply it. In this paper, we meet this need by evaluating the Doughnut Clustering method on simulated data with different levels of skewness, kurtosis, and percentage of zero values. We developed a five-step approach based on Fleishman's power method to generate synthetic data with prescribed parameters. Subsequently, we designed and conducted a set of experiments to run the Doughnut Clustering method as well as the traditional K-means method as a benchmark on simulated data. We draw conclusions on the performance of the Doughnut Clustering method by comparing the clustering validity metric (the ratio of between-cluster variance to within-cluster variance), as well as the relative proportion of cluster sizes, against those of K-means.
Darius Baer, SAS
Goutam Chakraborty, Oklahoma State University
In today's omni-channel world, consumers expect retailers to deliver the product they want, where they want it, when they want it, at a price they accept. A major challenge many retailers face in delighting their customers is successfully predicting consumer demand. Business decisions across the enterprise are affected by these demand estimates. Forecasts used to inform high-level strategic planning, merchandising decisions (planning assortments, buying products, pricing, and allocating and replenishing inventory) and operational execution (labor planning) are similar in many respects. However, each business process requires careful consideration of specific input data, modeling strategies, output requirements, and success metrics. In this session, learn how leading retailers are increasing sales and profitability by operationalizing forecasts that improve decisions across their enterprise.
Alex Chien, SAS
Elizabeth Cubbage, SAS
Wanda Shive, SAS
Surveys are designed to elicit information about population characteristics. A survey design typically combines stratification and multistage sampling of intact clusters, sub-clusters, and individual units with specified probabilities of selection. A survey sample can produce valid and reliable estimates of population parameters at a fraction of the cost of carrying out a census of the entire population, with clear logistical efficiencies. For analyses of survey data, SAS® software provides a suite of procedures, from SURVEYMEANS and SURVEYFREQ for generating descriptive statistics and conducting inference on means and proportions, to regression-based analysis through SURVEYREG and SURVEYLOGISTIC. For longitudinal surveys and follow-up studies, SURVEYPHREG is designed to incorporate aspects of the survey design for analysis of time-to-event outcomes based on the Cox proportional hazards model, allowing for time-varying explanatory variables. We review the salient features of the SURVEYPHREG procedure with application to survey data from the National Health and Nutrition Examination Survey (NHANES III) Linked Mortality File.
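A sketch of the design-based Cox model described above; the design and analysis variable names are assumptions in the style of the NHANES III Linked Mortality File rather than names taken from the paper.

proc surveyphreg data=nh3mort;
   strata  sdpstra6;     /* design strata             */
   cluster sdppsu6;      /* design PSUs               */
   weight  wtpfex6;      /* examination sample weight */
   class   sex race;
   model   permth_exm*mortstat(0) = age sex race smoker bmi;
run;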
Joseph Gardiner, Michigan State University
This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.
Amanda Pasch, Kaiser Permanente
PROC MIXED is one of the most popular SAS procedures to perform longitudinal analysis or multilevel models in epidemiology. Model selection is one of the fundamental questions in model building. One of the most popular and widely used strategies is model selection based on information criteria, such as Akaike Information Criterion (AIC) and Sawa Bayesian Information Criterion (BIC). This strategy considers both fit and complexity, and enables multiple models to be compared simultaneously. However, there is no existing SAS procedure to perform model selection automatically based on information criteria for PROC MIXED, given a set of covariates. This paper provides information about using the SAS %ic_mixed macro to select a final model with the smallest value of AIC and BIC. Specifically, the %ic_mixed macro will do the following: 1) produce a complete list of all possible model specifications given a set of covariates, 2) use do loop to read in one model specification every time and save it in a macro variable, 3) execute PROC MIXED and use the Output Delivery System (ODS) to output AICs and BICs, 4) append all outputs and use the DATA step to create a sorted list of information criteria with model specifications, and 5) run PROC REPORT to produce the final summary table. Based on the sorted list of information criteria, researchers can easily identify the best model. This paper includes the macro programming language, as well as examples of the macro calls and outputs.
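The %ic_mixed macro itself is not reproduced here, but its core mechanism can be sketched as follows: fit one candidate PROC MIXED model, capture its fit statistics through ODS, tag the specification, and append the results so that AIC and BIC can be ranked across candidates (data set and effect names are hypothetical).

ods output FitStatistics=fit1;
proc mixed data=long method=ml;
   class id;
   model y = time age sex / solution;
   random intercept time / subject=id type=un;
run;

data fit1;
   set fit1;
   length model $60;
   model = "y = time age sex";        /* tag which specification these statistics belong to */
run;

proc append base=all_fits data=fit1 force;
run;

proc sort data=all_fits(where=(descr like 'AIC%')) out=ranked;   /* smallest AIC first */
   by value;
run;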
Qinlei Huang, St Jude Children's Research Hospital
For the past two academic school years, our SAS® Programming 1 class had a classroom discussion about the Charlotte Bobcats. We wondered aloud: if the Bobcats changed their team name, would the dwindling fan base return? As a class, we created a survey that consisted of 10 questions asking people whether they liked the name Bobcats, whether they attended basketball games, and whether they bought merchandise. Within a one-hour class period, our class surveyed 981 out of 1,733 students at Phillip O. Berry Academy of Technology. After collecting the data, we performed advanced analytics using Base SAS® and concluded that 75% of students and faculty at Phillip O. Berry would prefer any other name except the Bobcats. In other results, 80% of the student body liked basketball, and the most preferred name was the Hornets, followed by the Royals, Flight, Dragons, and finally the Bobcats. The following school year, we conducted another survey to discover whether people's opinions had changed since the previous survey and whether people were happy with the Bobcats changing their name. During this time period, the Bobcats had recently reported that they were granted the opportunity to change the team name to the Hornets. Once more, we collected and analyzed the data and concluded that 77% of people surveyed were thrilled with the name change. In addition, around 50% of respondents were interested in purchasing merchandise. Through the work of this project, SAS® Analytics was applied in the classroom to a real-world scenario. The ability to see how SAS® could be applied to a question of interest and create change inspired the students in our class. This project is significant in showing the economic impact that sports can have on a city. This project in particular focused on the nostalgia that people of the city of Charlotte felt for the name Hornets. The project opened the door for more analysis and questions, and it continues to spark interest: the more people feel a connection to the team and the more the team flourishes, the more Charlotte benefits.
Lauren Cook, Charlotte Mecklenburg School System
In this session, I discuss an overall approach to governing Big Data. I begin with an introduction to Big Data governance and the governance framework. Then I address the disciplines of Big Data governance: data ownership, metadata, privacy, data quality, and master and reference data management. Finally, I discuss the reference architecture of Big Data, and how SAS® tools can address Big Data governance.
Sunil Soares, Information Asset
This unique culture has access to lots of data, unstructured and structured; is innovative, experimental, groundbreaking, and doesn't follow convention; and has access to powerful new infrastructure technologies and scalable, industry-standard computing power like never before. The convergence of data, innovative spirit, and the means to process it is what makes this a truly unique culture. In response to that, SAS® proposes The New Analytics Experience. Attend this session to hear more about the New Analytics Experience and the latest Intel technologies that make it possible.
Mark Pallone, Intel
The bookBot Identity: January 2013. With no memory of it from the past, students and faculty at NC State awake to find the Hunt Library just opened, and inside it, the mysterious and powerful bookBot. A true physical search engine, the bookBot, without thinking, relentlessly pursues, captures, and delivers to the patron any requested book (those things with paper pages--remember?) from the Hunt Library. The bookBot Supremacy: Some books were moved from the central campus library to the new Hunt Library. Did this decrease overall campus circulation or did the Hunt Library and its bookBot reign supreme in increasing circulation? The bookBot Ultimatum: To find out if the opening of the Hunt Library decreased or increased overall circulation. To address the bookBot Ultimatum, the Circulation Statistics Investigation (CSI) team uses the power of SAS® analytics to model library circulation before and after the opening of the Hunt Library. The bookBot Legacy: Join us for the adventure-filled story. Filled with excitement and mystery, this talk is bound to draw a much bigger crowd than had it been more honestly titled Intervention Analysis for Library Data. Tools used are PROC ARIMA, PROC REG, and PROC SGPLOT.
David Dickey, NC State University
John Vickery, North Carolina State University
This presentation provides an in-depth analysis, with example SAS® code, of the health care use and expenditures associated with depression among individuals with heart disease using the 2012 Medical Expenditure Panel Survey (MEPS) data. A cross-sectional study design was used to identify differences in health care use and expenditures between depressed (n = 601) and nondepressed (n = 1,720) individuals among patients with heart disease in the United States. Multivariate regression analyses using the SAS survey analysis procedures were conducted to estimate the incremental health services and direct medical costs (inpatient, outpatient, emergency room, prescription drugs, and other) attributable to depression. The prevalence of depression among individuals with heart disease in 2012 was estimated at 27.1% (6.48 million persons) and their total direct medical costs were estimated at approximately $110 billion in 2012 U.S. dollars. Younger adults (< 60 years), women, unmarried, poor, and sicker individuals with heart disease were more likely to have depression. Patients with heart disease and depression had more hospital discharges (relative ratio (RR) = 1.06, 95% confidence interval (CI) [1.02 to 1.09]), office-based visits (RR = 1.27, 95% CI [1.15 to 1.41]), emergency room visits (RR = 1.08, 95% CI [1.02 to 1.14]), and prescribed medicines (RR = 1.89, 95% CI [1.70, 2.11]) than their counterparts without depression. Finally, among individuals with heart disease, overall health care expenditures for individuals with depression was 69% higher than that for individuals without depression (RR = 1.69, 95% CI [1.44, 1.99]). The conclusion is that depression in individuals with heart disease is associated with increased health care use and expenditures, even after adjusting for differences in age, gender, race/ethnicity, marital status, poverty level, and medical comorbidity.
Seungyoung Hwang, Johns Hopkins University Bloomberg School of Public Health
Currently, there are several methods for reading JSON formatted files into SAS® that depend on the version of SAS and which products are licensed. These methods include user-defined macros, visual analytics, PROC GROOVY, and more. The user-defined macro %GrabTweet, in particular, provides a simple way to directly read JSON-formatted tweets into SAS® 9.3. The main limitation of %GrabTweet is that it requires the user to repeatedly run the macro in order to download large amounts of data over time. Manually downloading tweets while conforming to the Twitter rate limits might cause missing observations and is time-consuming overall. Imagine having to sit by your computer the entire day to continuously grab data every 15 minutes, just to download a complete data set of tweets for a popular event. Fortunately, the %GrabTweet macro can be modified to automate the retrieval of Twitter data based on the rate that the tweets are coming in. This paper describes the application of the %GrabTweet macro combined with batch processing to download tweets without manual intervention. Users can specify the phrase parameters they want, run the batch processing macro, leave their computer to automatically download tweets overnight, and return to a complete data set of recent Twitter activity. The batch processing implements an automated retrieval of tweets through an algorithm that assesses the rate of tweets for the specified topic in order to make downloading large amounts of data simpler and effortless for the user.
Isabel Litton, California Polytechnic State University, SLO
Rebecca Ottesen, City of Hope and Cal Poly SLO
If you are looking for ways to make your graphs more communication-effective, this tutorial can help. It covers both the new ODS Graphics SG (Statistical Graphics) procedures and the traditional SAS/GRAPH® software G procedures. The focus is on management reporting and presentation graphs, but the principles are relevant for statistical graphs as well. Important features unique to SAS® 9.4 are included, but most of the designs and construction methods apply to earlier versions as well. The principles of good graphic design are actually independent of your choice of software.
LeRoy Bessler, Bessler Consulting and Research
This paper explores feature extraction from unstructured text variables using Term Frequency-Inverse Document Frequency (TF-IDF) weighting algorithms coded in Base SAS®. Data sets with unstructured text variables can often hold a lot of potential to enable better predictive analysis and document clustering. Each of these unstructured text variables can be used as inputs to build an enriched data set-specific inverted index, and the most significant terms from this index can be used as single word queries to weight the importance of the term to each document from the corpus. This paper also explores the usage of hash objects to build the inverted indices from the unstructured text variables. We find that hash objects provide a considerable increase in algorithm efficiency, and our experiments show that a novel weighting algorithm proposed by Paik (2013) best enables meaningful feature extraction. Our TF-IDF implementations are tested against a publicly available data breach data set to understand patterns specific to insider threats to an organization.
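A simplified Base SAS sketch of the hash-based approach (not the authors' implementation): the first step builds per-document term frequencies with a hash object, and the second computes a classic TF-IDF weight. It assumes a hypothetical data set docs with one row per document and variables docid and text.

data term_counts(keep=docid term tf);
   length term $40;
   if _n_ = 1 then do;
      declare hash h();                 /* per-document term counts */
      h.definekey('term');
      h.definedata('term', 'tf');
      h.definedone();
      declare hiter it('h');
   end;
   set docs;
   h.clear();                           /* start fresh for each document */
   do i = 1 to countw(text, ' ,.;:!?()');
      term = lowcase(scan(text, i, ' ,.;:!?()'));
      if h.find() = 0 then tf = tf + 1; /* existing term: increment its count */
      else tf = 1;                      /* new term                           */
      h.replace();
   end;
   rc = it.first();                     /* write one row per (docid, term)    */
   do while (rc = 0);
      output;
      rc = it.next();
   end;
run;

proc sql;                               /* tf * log(N / document frequency) */
   create table tfidf as
   select t.docid, t.term, t.tf * log(n.ndocs / d.df) as tf_idf
   from term_counts as t,
        (select term, count(distinct docid) as df
           from term_counts group by term) as d,
        (select count(distinct docid) as ndocs from term_counts) as n
   where t.term = d.term;
quit;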
Ila Gokarn, Singapore Management University
Clifton Phua, SAS
The NH Citizens Health Initiative and the University of New Hampshire Institute for Health Policy and Practice, in collaboration with Accountable Care Project (ACP) participants, have developed a set of analytic reports to provide systems undergoing transformation a capacity to compare performance on the measures of quality, utilization, and cost across systems and regions. The purpose of these reports is to provide data and analysis on which our ACP learning collaborative can share knowledge and develop action plans that can be adopted by health-care innovators in New Hampshire. This breakout session showcases the claims-based reports, powered by SAS® Visual Analytics and driven by the New Hampshire Comprehensive Health Care Information System (CHIS), which includes commercial, Medicaid, and Medicare populations. With the power of SAS Visual Analytics, hundreds of pages of PDF files were distilled down to a manageable, dynamic, web-based portal that allows users to target information most appealing to them. This streamlined approach reduces barriers to obtaining information, offers that information in a digestible medium, and creates a better user experience. For more information about the ACP or to access the public reports, visit http://nhaccountablecare.org/.
Danna Hourani, SAS
Interactive Voice Response (IVR) systems are likely one of the best and worst gifts to the world of communication, depending on who you ask. Businesses love IVR systems because they take out hundreds of millions of dollars of call center costs in automation of routine tasks, while consumers hate IVRs because they want to talk to an agent! It is a delicate balancing act to manage an IVR system that saves money for the business, yet is smart enough to minimize consumer abrasion by knowing who they are, why they are calling, and providing an easy automated solution or a quick route to an agent. There are many aspects to designing such IVR systems, including engineering, application development, omni-channel integration, user interface design, and data analytics. For larger call volume businesses, IVRs generate terabytes of data per year, with hundreds of millions of rows per day that track all system and customer- facing events. The data is stored in various formats and is often unstructured (lengthy character fields that store API return information or text fields containing consumer utterances). The focus of this talk is the development of a data mining framework based on SAS® that is used to parse and analyze IVR data in order to provide insights into usability of the application across various customer segments. Certain use cases are also provided.
Dmitriy Khots, West Corp
Becoming one of the best memorizers in the world doesn't happen overnight. With hard work, dedication, a bit of obsession, and the assistance of some clever analytics metrics, Nelson Dellis climbed to the top of the memory rankings in under a year to become the now three-time USA Memory Champion. In this talk, he explains what it takes to become the best at memory, what is involved in such grueling memory competitions, and how analytics helped him get there.
Nelson Dellis, Climb for Memory
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE's Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a category is very small at lower levels of the hierarchy. Blindly applying a standard categorization model is likely to perform poorly if this fact is not taken into consideration. This paper introduces a novel technique for text categorization, Boolean rule extraction, which enables you to effectively address this situation. In addition, models that are generated by a rule-based technique have good interpretability and can be easily modified by a human expert, enabling better human-machine interaction. The paper demonstrates how to use SAS® Text Miner macros and procedures to obtain effective predictive models at all hierarchy levels in a taxonomy.
Zheng Zhao, SAS
Russ Albright, SAS
James Cox, SAS
Ning Jin, SAS
SAS® PROC FASTCLUS generates five clusters for the group of repeat clients of Ontario's Remedial Measures program. Heat map tables are shown for selected variables such as demographics, scales, factor, and drug use to visualize the difference between clusters.
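A sketch of the clustering step with invented data set and variable names; standardizing first keeps differences in scale from dominating the Euclidean distances that PROC FASTCLUS uses.

proc stdize data=repeat_clients method=std out=clients_std;
   var age bac_level prior_offences drug_scale alcohol_scale;
run;

proc fastclus data=clients_std maxclusters=5 maxiter=100 out=clustered;
   var age bac_level prior_offences drug_scale alcohol_scale;
run;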
Rosely Flam-Zalcman, CAMH
Robert Mann, CAMH
Rita Thomas, CAMH
Real-time web content personalization has come into its teen years, but recently a spate of marketing solutions has enabled marketers to finely personalize web content for visitors based on browsing behavior, geo-location, preferences, and so on. In an age where the attention span of a web visitor is measured in seconds, marketers hope that tailoring the digital experience will pique each visitor's interest just long enough to increase corporate sales. The range of solutions spans the entire spectrum, from completely cloud-based installations to completely on-premises installations. Marketers struggle to find the optimal solution that meets their corporation's marketing objectives, provides them the greatest agility and the shortest time-to-market, and still keeps the marketing budget low. In the last decade or so, marketing strategies that involved personalizing using purely on-premises customer data were quickly replaced by ones that involved personalizing using only web-browsing behavior (a.k.a. clickstream data). This was possible because of a spate of cloud-based solutions that enabled marketers to de-couple themselves from the underlying IT infrastructure and the storage issues of capturing large volumes of data. However, this new trend meant that corporations weren't using much of their treasure trove of on-premises customer data. Of late, however, enterprises have been trying hard to find solutions that give them the best of both--the ease of gathering clickstream data using cloud-based applications, and the richness of on-premises customer data--to perform analytics that leads to better web content personalization for a visitor. This paper explains a process that attempts to address this rapidly evolving need. The paper assumes that the enterprise already has tools for capturing clickstream data, developing analytical models, and presenting the content. It provides a roadmap to implementing a phased approach in which enterprises continue to capture clickstream data but bring that data in-house to be merged with customer data, enabling their analytics team to build sophisticated predictive models that can be deployed into the real-time web-personalization application. The final phase requires enterprises to continuously improve their predictive models on a periodic basis.
Mahesh Subramanian, SAS Institute Inc.
Suneel Grover, SAS
A Chinese wind energy company designs several hundred wind farms each year. An important step in its design process is micrositing, in which it creates a layout of turbines for a wind farm. The amount of energy that a wind farm generates is affected by geographical factors (such as elevation of the farm), wind speed, and wind direction. The types of turbines and their positions relative to each other also play a critical role in energy production. Currently the company is using an open-source software package to help with its micrositing. As the size of wind farms increases and the pace of their construction speeds up, the open-source software is no longer able to support the design requirements. The company wants to work with a commercial software vendor that can help resolve scalability and performance issues. This paper describes the use of the OPTMODEL and OPTLSO procedures on the SAS® High-Performance Analytics infrastructure together with the FCMP procedure to model and solve this highly nonlinear optimization problem. Experimental results show that the proposed solution can meet the company's requirements for scalability and performance.
Sherry (Wei) Xu, SAS
Steven Gardner, SAS
Joshua Griffin, SAS
Baris Kacar, SAS
Jinxin Yi, SAS
Abalone is a common name given to sea snails or mollusks. These creatures are highly iridescent, with shells of strong changeable colors. This characteristic makes the shells attractive to humans as decorative objects and jewelry. The abalone structure is being researched to build body armor. The value of a shell varies by its age and the colors it displays. Determining the number of rings on an abalone is a tedious and cumbersome task and is usually done by cutting the shell through the cone, staining it, and counting the rings through a microscope. In this poster, I aim to predict the number of rings on any abalone by using its physical characteristics. The data were obtained from the UCI Machine Learning Repository and consist of 4,177 observations with 8 attributes. I considered the number of rings to be my target variable. The abalone's age can be reasonably approximated as being 1.5 times the number of rings on its shell. Using SAS® Enterprise Miner™, I built regression models and neural network models to identify the physical measurements most responsible for the number of rings on the abalone. While I have obtained a coefficient of determination of 54.01%, my aim is to improve and expand the analysis using the power of SAS Enterprise Miner. The current initial results indicate that the height, the shucked weight, and the viscera weight of the shell are the three most influential variables in predicting the number of rings on an abalone.
Ganesh Kumar Gangarajula, Oklahoma State University
Yogananda Domlur Seetharam
Many epidemiological studies use medical claims to identify and describe a population. But finding out who was diagnosed, and who received treatment, isn't always simple. Each claim can have dozens of medical codes, with different types of codes for procedures, drugs, and diagnoses. Even a basic definition of treatment could require a search for any one of 100 different codes. A SAS® macro may come to mind, but generalizing the macro to work with different codes and types allows it to be reused in a variety of different scenarios. We look at a number of examples, starting with a single code type and variable. Then we consider multiple code variables, multiple code types, and multiple flag variables. We show how these macros can be combined and customized for different data with minimal rework. Macro flexibility and reusability are also discussed, along with ways to keep our list of medical codes separate from our program. Finally, we discuss time-dependent medical codes, codes requiring database lookup, and macro performance.
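A stripped-down version of the idea (hypothetical names throughout): the list of code variables and the list of target codes are both macro parameters, so the same macro can flag a diagnosis definition in one study and a procedure definition in another.

%macro flag_codes(data=, out=, codevars=, codelist=, flagvar=);
   data &out;
      set &data;
      &flagvar = 0;
      %let nvars = %sysfunc(countw(&codevars));
      %do i = 1 %to &nvars;
         %let var = %scan(&codevars, &i);
         if &var in (&codelist) then &flagvar = 1;   /* any matching code sets the flag */
      %end;
   run;
%mend flag_codes;

%flag_codes(data=claims, out=claims_flagged,
            codevars=diag1 diag2 diag3 proc1,
            codelist=%str('4019','2724','99213'),
            flagvar=any_target_code);

Keeping the code list outside the program, as the paper suggests, is a natural extension: read it from a separate data set or file and build the &codelist value with PROC SQL's SELECT ... INTO.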
Andy Karnopp, Fred Hutchinson Cancer Research Center
In today's society, where seemingly unlimited information is just a mouse click away, many turn to social media, forums, and medical websites to research and understand how mothers feel about the birthing process. Mining the data in these resources helps provide an understanding of what mothers value and how they feel. This paper shows the use of SAS® Text Analytics to gather, explore, and analyze reports from mothers to determine their sentiment about labor and delivery topics. Results of this analysis could aid in the design and development of a labor and delivery survey and be used to understand what characteristics of the birthing process yield the highest levels of importance. These resources can then be used by labor and delivery professionals to engage with mothers regarding their labor and delivery preferences.
Michael Wallis, SAS
An essential part of health services research is describing the use and sequencing of a variety of health services. One of the most frequently examined health services is hospitalization. A common problem in describing hospitalizations is that a patient might have multiple hospitalizations to treat the same health problem. Specifically, a hospitalized patient might be (1) sent to and returned from another facility in a single day for testing, (2) transferred from one hospital to another, and/or (3) discharged home and re-admitted within 24 hours. In all cases, these hospitalizations are treating the same underlying health problem and should be considered as a single episode. If examined without regard for the episode, a patient would be identified as having 4 hospitalizations (the initial hospitalization, the testing hospitalization, the transfer hospitalization, and the readmission hospitalization). In reality, they had one hospitalization episode spanning multiple facilities. IMPORTANCE: Failing to account for multiple hospitalizations in the same episode has implications for many disciplines, including health services research, health services planning, and quality improvement for patient safety. HEALTH SERVICES RESEARCH: Hospitalizations will be counted multiple times, leading to an overestimate of the number of hospitalizations a person had. For example, a person can be identified as having 4 hospitalizations when in reality they had one episode of hospitalization. This will result in a person appearing to be a higher user of health care than is true. RESOURCE PLANNING FOR HEALTH SERVICES: The average time and resources needed to treat a specific health problem may be underestimated. To illustrate, if a patient spends 10 days each in 3 different hospitals in the same episode, the total number of days needed to treat the health problem is 30 days, but each hospital will believe it is only 10, and planned resourcing may be inadequate. QUALITY IMPROVEMENT FOR PATIENT SAFETY: Hospital-acquired infections are a serious concern and a major cause of extended hospital stays, morbidity, and death. As a result, many hospitals have quality improvement programs that monitor the occurrence of infections in order to identify ways to reduce them. If episodes of hospitalization are not considered, an infection acquired in one hospital that does not manifest until the patient is transferred to a different hospital will incorrectly be attributed to the receiving hospital. PROPOSAL: We have developed SAS® code to identify episodes of hospitalization, the sequence of hospitalizations within each episode, and the overall duration of the episode. The output clearly displays the data in an intuitive and easy-to-understand format. APPLICATION: The method we will describe and the associated SAS code will be useful not only to health services researchers, but also to anyone who works with temporal data that includes nested, overlapping, and subsequent events.
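A simplified sketch of the episode-building logic (not the authors' code): within each patient, a stay that begins on or before the day after the previous discharge is treated as a continuation of the same episode. Data set and variable names are hypothetical, and stays must be sorted by patient and admission date.

proc sort data=hosp_stays;
   by patient_id admit_date disch_date;
run;

data episodes;
   set hosp_stays;
   by patient_id;
   retain episode_id episode_end;
   format episode_end date9.;
   if first.patient_id then do;
      episode_id = 1;
      episode_end = disch_date;
   end;
   else if admit_date > episode_end + 1 then do;      /* gap of more than 1 day: new episode   */
      episode_id + 1;
      episode_end = disch_date;
   end;
   else episode_end = max(episode_end, disch_date);   /* transfer/readmission: extend episode  */
run;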
Meriç Osman, Health Quality Council
Jacqueline Quail, Saskatchewan Health Quality Council
Nianping Hu, Saskatchewan Health Quality Council
Nedeene Hudema, Saskatchewan Health Quality Council
Developing a quality product or service, while at the same time improving cost management and maximizing profit, is a challenging goal for any company. Finding the optimal balance between efficiency and profitability is not easy. The same can be said in regard to the development of a predictive statistical model. On the one hand, the model should predict as accurately as possible. On the other hand, having too many predictors can end up costing a company money. One of the purposes of this project is to explore the cost of simplicity. When is it worth having a simpler model, and what are some of the costs of using a more complex one? The answer to that question leads us to another one: how can a predictive statistical model be maximized in order to increase a company's profitability? Using data from the consumer credit risk domain provided by CompuCredit (now Atlanticus), we used logistic regression to build binary classification models to predict the likelihood of default. This project compares two of these models. Although the original data set had several hundred predictor variables and more than a million observations, I chose to use rather simple models. My goal was to develop a model with as few predictors as possible, while not going lower than a concordance level of 80%. Two models were evaluated and compared based on efficiency, simplicity, and profitability. Using the selected model, cluster analysis was then performed in order to maximize the estimated profitability. Finally, the analysis was taken one step further through a supervised segmentation process, in order to target the most profitable segment of the best cluster.
Sherrie Rodriguez, Kennesaw State University
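As a hedged illustration of the modeling step described above (not the author's actual code), a parsimonious default model could be fit with PROC LOGISTIC, whose Association table reports the percent concordant used for the 80% threshold; the data set and predictor names here are hypothetical:

proc logistic data=credit_train;
   /* event='1' models the probability of default */
   model default(event='1') = util_ratio num_delinq months_on_book
         / selection=stepwise slentry=0.05 slstay=0.01;
   /* score a holdout sample with the selected model */
   score data=credit_valid out=scored_valid;
run;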
In this era of big data, the use of text analytics to discover insights is rapidly gaining popularity in businesses. On average, more than 80 percent of the data in enterprises may be unstructured. Text analytics can help discover key insights and extract useful topics and terms from the unstructured data. The objective of this paper is to build a model using textual data that predicts the factors that contribute to downtime of a truck. This research analyzes the data of over 200,000 repair tickets of a leading truck manufacturing company. After the terms were grouped into fifteen key topics using the Text Topic node of SAS® Text Miner, a regression model was built using these topics to predict truck downtime, the target variable. Data was split into training and validation for developing the predictive models. Knowledge of the factors contributing to downtime and their associations helped the organization to streamline their repair process and improve customer satisfaction.
Ayush Priyadarshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
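Although the topic extraction itself was done interactively in SAS® Text Miner, the downstream regression on topic scores could look roughly like the following sketch, assuming the fifteen topic scores were exported as numeric variables topic1-topic15 (hypothetical names and data sets):

proc glmselect data=repair_train valdata=repair_valid;
   /* select topics by stepwise search; choose the model that predicts
      best on the validation data */
   model downtime_hours = topic1-topic15 / selection=stepwise(choose=validate);
run;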
Many SAS® procedures can be used to analyze longitudinal data. This study employed a multisite randomized controlled trial design to demonstrate the effectiveness of two SAS procedures, GLIMMIX and GENMOD, in analyzing longitudinal data from five Department of Veterans Affairs Medical Centers (VAMCs). Older male veterans (n = 1222) seen in VAMC primary care clinics were randomly assigned to two behavioral health models, integrated (n = 605) and enhanced referral (n = 617). Data were collected at baseline and at 3-, 6-, and 12-month follow-up. A mixed-effects repeated measures model was used to examine the dependent variable, problem drinking, which was defined as both a count and a dichotomous outcome from baseline to 12-month follow-up. Sociodemographics and depressive symptoms were included as covariates. First, bivariate analyses used general linear model and chi-square tests to examine covariates by group and group by problem-drinking outcomes. All significant covariates were included in the GLIMMIX and GENMOD models. Then, multivariate analysis included mixed models with Generalized Estimating Equations (GEEs). The effect of group, time, and the interaction of group by time were examined after controlling for covariates. Multivariate results were inconsistent between GLIMMIX and GENMOD using the lognormal, Gaussian, Weibull, and Gamma distributions. SAS is a powerful statistical program for analyzing data from longitudinal studies.
Abbas Tavakoli, University of South Carolina/College of Nursing
Marlene Al-Barwani, University of South Carolina
Sue Levkoff, University of South Carolina
Selina McKinney, University of South Carolina
Nikki Wooten, University of South Carolina
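For readers unfamiliar with the two procedures, the following hedged sketch shows the general form of the models being compared; the variable names, distributions, and covariates are illustrative, not the authors' exact specifications:

/* Mixed-effects (subject-specific) model for the dichotomous outcome */
proc glimmix data=vamc method=laplace;
   class subject_id group time;
   model problem_drinker(event='1') = group time group*time age depress_score
         / dist=binary link=logit solution;
   random intercept / subject=subject_id;
run;

/* GEE (population-averaged) model for the count outcome */
proc genmod data=vamc;
   class subject_id group time;
   model drink_count = group time group*time age depress_score
         / dist=poisson link=log;
   repeated subject=subject_id / type=exch;
run;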
Mathematical optimization is a powerful paradigm for modeling and solving business problems that involve interrelated decisions about resource allocation, pricing, routing, scheduling, and similar issues. The OPTMODEL procedure in SAS/OR® software provides unified access to a wide range of optimization solvers and supports both standard and customized optimization algorithms. This paper illustrates PROC OPTMODEL's power and versatility in building and solving optimization models and describes the significant improvements that result from PROC OPTMODEL's many new features. Highlights include the recently added support for the network solver, the constraint programming solver, and the COFOR statement, which allows parallel execution of independent solver calls. Best practices for solving complex problems that require access to more than one solver are also demonstrated.
Rob Pratt, SAS
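To give a flavor of the PROC OPTMODEL syntax discussed in the paper, here is a deliberately small linear program (the product data are invented for illustration):

proc optmodel;
   set<str> PRODUCTS = {'A','B'};
   num profit {PRODUCTS} = [5 4];    /* profit per unit, in set order     */
   num hours  {PRODUCTS} = [2 3];    /* machine hours required per unit   */
   var Make {PRODUCTS} >= 0;
   max TotalProfit = sum {p in PRODUCTS} profit[p] * Make[p];
   con Capacity: sum {p in PRODUCTS} hours[p] * Make[p] <= 100;
   solve with lp;                    /* one of the many available solvers */
   print Make;
quit;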
A bank that wants to use the Internal Ratings Based (IRB) methods to calculate minimum Basel capital requirements has to calculate default probabilities (PDs) for all its obligors. Supervisors are mainly concerned about the credit risk being underestimated. For high-quality exposures or groups with an insufficient number of obligors, calculations based on historical data may not be sufficiently reliable due to infrequent or no observed defaults. In an effort to solve the problem of default data scarcity, modeling assumptions are made, and to control the possibility of model risk, a high level of conservatism is applied. Banks, on the other hand, are more concerned about PDs that are too pessimistic, since this has an impact on their pricing and economic capital. In small samples, or where there are few or no observed defaults, the data provide very little information about the parameters of interest. Incorporating prior information or expert judgment through Bayesian parameter estimation can be a very useful approach in such a situation. Using PROC MCMC, we show that a Bayesian approach can serve as a valuable tool for validation and monitoring of PD models for low default portfolios (LDPs). We cover cases ranging from single-period, zero correlation, and zero observed defaults to multi-period, non-zero correlation, and few observed defaults.
Machiel Kruger, North-West University
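A minimal sketch of the single-period, zero-default case using PROC MCMC, with an invented portfolio size and a weakly informative Beta prior (the actual priors and correlation structure considered in the paper are more elaborate):

data ldp;
   defaults = 0;     /* observed defaults in the grade   */
   n        = 250;   /* number of obligors in the grade  */
run;

proc mcmc data=ldp nmc=50000 nbi=5000 seed=2015 outpost=pd_post;
   parms pd 0.01;
   prior pd ~ beta(1, 50);             /* prior mass concentrated near zero */
   model defaults ~ binomial(n, pd);   /* binomial default count            */
run;

An upper quantile of the posterior draws in pd_post (for example, the 95th percentile) then yields a conservative PD estimate even though no defaults were observed.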
When you are analyzing your data and building your models, you often find out that the data cannot be used in the intended way. Systematic patterns, incomplete data, and inconsistencies from a business point of view are often the reasons. You wish you could get a complete picture of the quality status of your data much earlier in the analytic lifecycle. SAS® analytics tools like SAS® Visual Analytics help you profile and visualize the quality status of your data in an easy and powerful way. In this session, you learn advanced methods for analytic data quality profiling. You will see case studies based on real-life data, in which we look at time series data from a bird's-eye view and interactively profile GPS trackpoint data from a sail race.
Gerhard Svolba, SAS
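The session itself is built around SAS® Visual Analytics, but a programmatic analogue of this kind of profiling, basic completeness and cardinality checks on a hypothetical trackpoint table, might look like this:

ods output nlevels=work.cardinality;    /* distinct-value counts per variable */
proc freq data=work.trackpoints nlevels;
   tables _all_ / noprint;
run;

proc means data=work.trackpoints n nmiss min max;   /* missingness and ranges */
run;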
We demonstrate a method of using SAS® 9.4 to supplement the interpretation of dimensions of a Multidimensional Scaling (MDS) model, a process that could be difficult without SAS®. In our paper, we examine why people choose to drive to work (over other means of travel), a question that transportation researchers need to answer in order to encourage drivers to switch to more environmentally friendly travel modes. We applied the MDS approach to a travel survey data set because MDS has the advantage of extracting drivers' motivations in multiple dimensions. To overcome the challenges of dimension interpretation with MDS, we used the logistic regression function of SAS 9.4 to identify the variables that are strongly associated with each dimension, thus greatly aiding our interpretation procedure. Our findings are important to transportation researchers, practitioners, and MDS users.
Jun Neoh, University of Southampton
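One hedged way to reproduce the flavor of this approach (not necessarily the authors' exact procedure), with invented data set and variable names:

/* Fit a 2-dimensional MDS solution to a matrix of dissimilarities */
proc mds data=travel_dissim level=ordinal dimension=2 out=mds_out;
run;

/* Assume mds_scores holds each respondent's Dim1 and Dim2 scores together
   with survey variables (data preparation not shown). */
data mds_scores2;
   set mds_scores;
   high_dim1 = (dim1 > 0);   /* MDS configurations are centered, so 0 splits the dimension */
run;

proc logistic data=mds_scores2;
   class car_access / param=ref;
   model high_dim1(event='1') = car_access commute_km flexible_hours;
run;

Predictors with large, significant effects then suggest what the dimension represents.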
Everyone likes getting a raise, and using ARRAYs in SAS® can help you do just that! Using ARRAYs simplifies processing, allowing for reading and analyzing of repetitive data with minimal coding. Improving the efficiency of your coding and, in turn, your SAS productivity is easier than you think! ARRAYs simplify coding by identifying a group of related variables that can then be referred to later in a DATA step. In this quick tip, you learn how to define an ARRAY using an ARRAY statement that establishes the ARRAY name, length, and elements. You also learn the criteria and limitations for the ARRAY name, the requirements for array elements to be processed as a group, and how to reference an ARRAY and specific array elements within a DATA step. This quick tip reveals useful functions and operators, such as the DIM function and the OF operator within existing SAS functions, that make using ARRAYs an efficient and productive way to process data. This paper takes you through an example of how to do the same task with and without using ARRAYs in order to illustrate the ease and benefits of using them. Your coding will be more efficient and your data analysis will be more productive, meaning you will deserve ARRAYs!
Kate Burnett-Isaacs, Statistics Canada
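A quick hedged example of the pattern the paper teaches, using hypothetical salary variables sal1-sal12:

data raises;
   set payroll;                    /* assumes sal1-sal12 exist               */
   array sal{12} sal1-sal12;       /* ARRAY name, length, and elements       */
   do i = 1 to dim(sal);           /* DIM returns the number of elements     */
      sal{i} = sal{i} * 1.10;      /* give every month a 10% raise           */
   end;
   annual = sum(of sal{*});        /* the OF operator inside a SAS function  */
   drop i;
run;

Without the ARRAY, the same raise would require twelve separate assignment statements.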
We all know there are multiple ways to use SAS® language components to generate the same values in data sets and output (for example, using the DATA step versus PROC SQL, IF-THEN/ELSE logic versus FORMAT table conversions, PROC MEANS versus PROC SQL summarizations, and so on). However, do you compare those different ways to determine which are the most efficient in terms of computer resources used? Do you ever consider the time a programmer takes to develop or maintain code? In addition to determining efficient syntax, do you validate your resulting data sets? Do you ever check data values that must remain the same after being processed by multiple steps and verify that they really don't change? We share some simple coding techniques that have proven to save computer and human resources. We also explain some data validation and comparison techniques that ensure data integrity. In our distributed computing environment, we show a quick way to transfer data from a SAS server to a local client by using PROC DOWNLOAD and then PROC EXPORT on the client to convert the SAS data set to a Microsoft Excel file.
Jason Beene, Wells Fargo
Mary Katz, Wells Fargo Bank
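A hedged sketch of the server-to-client transfer mentioned at the end of the abstract, assuming a SAS/CONNECT session named rmt has already been configured (paths and names are illustrative):

signon rmt;

rsubmit;
   /* copy the server data set down to the local WORK library */
   proc download data=work.results out=work.results;
   run;
endrsubmit;

/* convert the local copy to an Excel workbook */
proc export data=work.results
   outfile="C:\temp\results.xlsx"
   dbms=xlsx replace;
run;

signoff rmt;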