Imbalanced data are frequently seen in fraud detection, direct marketing, disease prediction, and many other areas. Rare events are sometimes of primary interest. Classifying them correctly is the challenge that many predictive modelers face today. In this paper, we use SAS® Enterprise Miner™ on a marketing data set to demonstrate and compare several approaches that are commonly used to handle imbalanced data problems in classification models. The approaches are based on cost-sensitive measures and sampling measures. A rather novel technique called SMOTE (Synthetic Minority Over-sampling Technique), which has achieved the best result in our comparison, will be discussed.
Ruizhe Wang, GuideWell Connect
Novik Lee, GuideWell Connect
Yun Wei, GuideWell Connect
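The core SMOTE idea the paper compares can be sketched outside SAS® Enterprise Miner™. The following minimal Python sketch (the function, the neighbor count, and the toy `minority` points are illustrative assumptions, not the paper's implementation) generates synthetic minority-class points by interpolating between each sampled point and one of its nearest minority-class neighbors:

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (after Chawla et al., 2002).

    Each synthetic point lies at a random position on the line segment
    between a minority sample and one of its k nearest minority-class
    neighbors, so new points stay inside the minority region.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors by squared Euclidean distance
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Toy two-feature minority class; oversample it by 12 synthetic points
minority = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2), (1.5, 2.5), (1.1, 2.1), (1.3, 1.9)]
new_points = smote(minority, n_new=12)
```

Because every synthetic point is a convex combination of two minority points, all new points fall inside the bounding box of the original minority class.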
This session is intended to assist analysts in generating the best derived variables, such as monthly amount paid, daily number of received customer service calls, weekly hours worked on a project, or annual total sales for a specific product, by using simple mathematical transformations (square root, log, loglog, exp, and rcp). During a statistical data modeling process, analysts are often confronted with the task of computing derived variables from the existing variables. The advantage of this methodology is that the new variables might be more significant than the original ones. This paper provides a new way to compute all the possible variables using a set of math transformations. The code includes many SAS® features that are very useful tools for SAS programmers to incorporate in their future code, such as %SYSFUNC, SQL, %INCLUDE, CALL SYMPUT, %MACRO, SORT, CONTENTS, MERGE, MACRO _NULL_, %DO...%TO loops, and many more.
Nancy Hu, Discover
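As a rough illustration of the idea (in Python rather than the paper's SAS macro code; the guard rules for invalid arguments and the variable names are my own assumptions), each numeric variable can be expanded into a set of candidate transformed variables:

```python
import math

# Candidate transformations from the paper's list; log/sqrt/loglog/rcp guard
# against invalid inputs by returning None, in which case that derived
# variable is skipped for the whole column.
TRANSFORMS = {
    "sqrt":   lambda x: math.sqrt(x) if x >= 0 else None,
    "log":    lambda x: math.log(x) if x > 0 else None,
    "loglog": lambda x: math.log(math.log(x)) if x > 1 else None,
    "exp":    lambda x: math.exp(x),
    "rcp":    lambda x: 1.0 / x if x != 0 else None,
}

def derive_variables(data):
    """Add every valid transform of every numeric variable as a new column."""
    out = {}
    for name, values in data.items():
        out[name] = values
        for tname, f in TRANSFORMS.items():
            new = [f(v) for v in values]
            if all(v is not None for v in new):
                out[f"{tname}_{name}"] = new
    return out

data = {"monthly_amount_paid": [120.0, 250.0, 80.0]}
derived = derive_variables(data)
```

The candidate columns (e.g. `log_monthly_amount_paid`) would then be screened for significance against the original variable in the modeling step.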
Companies that offer subscription-based services (such as telecom and electric utilities) must evaluate the tradeoff between month-to-month (MTM) customers, who yield a high margin at the expense of lower lifetime, and customers who commit to a longer-term contract in return for a lower price. The objective, of course, is to maximize the Customer Lifetime Value (CLV). This tradeoff must be evaluated not only at the time of customer acquisition, but throughout the customer's tenure, particularly for fixed-term contract customers whose contract is due for renewal. In this paper, we present a mathematical model that optimizes the CLV against this tradeoff between margin and lifetime. The model is presented in the context of a cohort of existing customers, some of whom are MTM customers and others who are approaching contract expiration. The model optimizes the number of MTM customers to be swapped to fixed-term contracts, as well as the number of contract renewals that should be pursued, at various term lengths and price points, over a period of time. We estimate customer life using discrete-time survival models with time-varying covariates related to contract expiration and product changes. Thereafter, an optimization model is used to find the optimal trade-off between margin and customer lifetime. Although we specifically present the contract expiration case, this model can easily be adapted for customer acquisition scenarios as well.
Atul Thatte, TXU Energy
Goutam Chakraborty, Oklahoma State University
Non-Gaussian outcomes are often modeled using members of the so-called exponential family. Well-known members are the Bernoulli model for binary data, leading to logistic regression, and the Poisson model for count data, leading to Poisson regression. Two of the main reasons for extending this family are (1) the occurrence of overdispersion, meaning that the variability in the data is not adequately described by the models, which often exhibit a prescribed mean-variance link, and (2) the accommodation of hierarchical structure in the data, stemming from clustering in the data, which, in turn, might result from repeatedly measuring the outcome, from grouping of members of the same family, and so on. The first issue is dealt with through a variety of overdispersion models, such as the beta-binomial model for grouped binary data and the negative-binomial model for counts. Clustering is often accommodated through the inclusion of random subject-specific effects. Though not always, one conventionally assumes such random effects to be normally distributed. While both of these phenomena might occur simultaneously, models combining them are uncommon. This paper proposes a broad class of generalized linear models accommodating overdispersion and clustering through two separate sets of random effects. We place particular emphasis on so-called conjugate random effects at the level of the mean for the first aspect and normal random effects embedded within the linear predictor for the second aspect, even though our family is more general. The binary, count, and time-to-event cases are given particular emphasis. Apart from model formulation, we present an overview of estimation methods, and then settle for maximum likelihood estimation with analytic-numerical integration. Implications for the derivation of marginal correlation functions are discussed.
The methodology is applied to data from a study of epileptic seizures, a clinical trial for a toenail infection named onychomycosis, and survival data in children with asthma.
Geert Molenberghs, Universiteit Hasselt & KU Leuven
In big data, many variables are polytomous, with many levels. The common method for handling a polytomous independent variable when the outcome is binary is a series of design variables, which correspond to the CLASS statement for the polytomous independent variable in PROC LOGISTIC. If big data contain many polytomous independent variables with many levels, using design variables makes the analysis very complicated in both computation time and results, and might provide little help in predicting the outcome. This paper presents a new, simple method for logistic regression with polytomous independent variables in big data analysis. In the proposed method, the first step is an iterative statistical analysis run from a SAS® macro program. Similar to the algorithm for creating spline variables, this analysis searches for proper aggregation groups, with statistically significant differences, among all levels of a polytomous independent variable. The SAS macro program iteratively searches for new level groups with statistically significant differences. The process starts from level 1, the level with the smallest outcome mean. We then conduct a statistical test comparing the level 1 group with the level 2 group, the level with the second smallest outcome mean. If these two groups differ significantly, we go on to test the level 2 group against the level 3 group. If level 1 and level 2 do not differ significantly, we combine them into a new level group 1 and test this new group against level 3. The process continues until all the levels have been tested. We then replace the original level values of the polytomous variable with the new level values, which differ significantly from one another.
In this situation, the polytomous variable with new levels can be described by the means of all new levels, because of the one-to-one equivalence relationship of a piecewise function in logit from the polytomous variable's levels to the outcome means. It is easy to prove that the conditional mean of an outcome y given a polytomous variable x is a very good approximation based on maximum likelihood analysis. Compared with design variables, the new piecewise variable, based on the information of all levels, can as a single independent variable capture the impact of all levels in a much simpler way. We have used this method in predictive models of customer attrition on polytomous variables such as state, business type, and customer claim type. All of these polytomous variables significantly improve the prediction of customer attrition compared with omitting them or representing them with design variables in the model development.
Jian Gao, Constant Contact
Jesse Harriot, Constant Contact
Lisa Pimentel, Constant Contact
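The level-aggregation loop described above can be sketched in Python (the paper's macro runs in SAS; the two-proportion z-test and the toy state-level counts here are my own stand-ins for its significance test):

```python
import math

def merge_levels(level_stats, z_crit=1.96):
    """Collapse adjacent levels of a polytomous variable whose binary
    outcome rates do not differ significantly.

    level_stats: list of (level, events, trials); levels are first sorted
    by outcome rate, then walked from smallest to largest, merging a level
    into the current group unless a two-proportion z-test finds a
    significant difference, in which case a new group starts.
    Returns a dict mapping each original level to its merged-group rate.
    """
    stats = sorted(level_stats, key=lambda s: s[1] / s[2])
    groups = [[stats[0]]]
    for lvl in stats[1:]:
        g_events = sum(e for _, e, _ in groups[-1])
        g_trials = sum(n for _, _, n in groups[-1])
        p1, p2 = g_events / g_trials, lvl[1] / lvl[2]
        p = (g_events + lvl[1]) / (g_trials + lvl[2])
        se = math.sqrt(p * (1 - p) * (1 / g_trials + 1 / lvl[2]))
        if se == 0 or abs(p1 - p2) / se < z_crit:
            groups[-1].append(lvl)   # not significant: merge into group
        else:
            groups.append([lvl])     # significant: start a new group
    mapping = {}
    for g in groups:
        rate = sum(e for _, e, _ in g) / sum(n for _, _, n in g)
        for name, _, _ in g:
            mapping[name] = rate
    return mapping

# Hypothetical state-level attrition counts: (level, events, customers)
mapping = merge_levels([("FL", 50, 1000), ("GA", 55, 1000), ("NY", 200, 1000)])
```

In this toy data, FL (5.0%) and GA (5.5%) are merged into one group while NY (20%) remains separate, so the recoded variable has two levels instead of three.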
The Centers for Medicare & Medicaid Services (CMS) uses the Proportion of Days Covered (PDC) to measure medication adherence, and there is some PDC-related research based on Medicare Part D Event (PDE) data. However, "under Medicare rules, beneficiaries who receive care at an Inpatient (IP) [facility] may receive Medicare-covered medications directly from the IP, rather than by filling prescriptions through their Part D contracts; thus, their medication fills during an IP stay would not be included in the PDE claims used to calculate the Patient Safety adherence measures" (Medicare 2014 Part C&D star rating technical notes). Therefore, the previous PDC calculation method underestimated the true PDC value. Starting with the 2013 Star rating, the PDC calculation was adjusted for IP stays. That is, when a patient has an inpatient admission during the measurement period, the inpatient stay is censored for the PDC calculation. If the patient also has measured drug coverage during the inpatient stay, the drug supplied during the stay is shifted to after the stay; this shifting can in turn cause a chain of shifting. This paper presents a SAS® macro that uses the SAS hash object to match inpatient stays, censor them, shift the drug start and end dates, and calculate the adjusted PDC.
Anping Chang, IHRC Inc.
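The censor-and-shift logic can be illustrated with a simple day-by-day simulation in Python (a conceptual sketch, not the paper's hash-object macro; the integer day offsets and the single-fill example are hypothetical):

```python
def adjusted_pdc(fills, ip_days, period_start, period_end):
    """Compute PDC with inpatient (IP) days censored and on-hand supply
    shifted past IP stays.

    fills: list of (fill_day, days_supply); days are integer offsets.
    ip_days: set of days spent in an inpatient facility.
    IP days are removed from the denominator, and no supply is consumed
    on them, which pushes coverage to later days (the "chain of shifting"
    described in the abstract).
    """
    supply_by_day = {}
    for day, n in fills:
        supply_by_day[day] = supply_by_day.get(day, 0) + n
    on_hand, covered, measured = 0, 0, 0
    for day in range(period_start, period_end + 1):
        on_hand += supply_by_day.get(day, 0)
        if day in ip_days:
            continue            # censored: not in denominator, no consumption
        measured += 1
        if on_hand > 0:
            covered += 1        # a pill was available for this day
            on_hand -= 1
    return covered / measured

# A 30-day fill on day 0, a 10-day IP stay on days 10-19, a 60-day period:
pdc = adjusted_pdc([(0, 30)], set(range(10, 20)), 0, 59)
```

Here the adjusted PDC is 30 covered days over 50 measured days (0.6), whereas the unadjusted calculation would give 30/60 (0.5), illustrating the underestimation the adjustment corrects.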
Medical tests are used for various purposes, including diagnosis, prognosis, risk assessment, and screening. Statistical methodology is often used to evaluate such tests; the measures most frequently used for binary data are sensitivity, specificity, and positive and negative predictive values. An important goal in diagnostic medicine research is to estimate and compare the accuracies of such tests. In this paper I give a gentle introduction to measures of diagnostic test accuracy and introduce a SAS® macro to calculate the generalized score statistic and weighted generalized score statistic for comparison of predictive values, using formulas generalized and proposed by Andrzej S. Kosinski.
Lovedeep Gondara, University of Illinois Springfield
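For readers new to these measures, all four reduce to simple ratios on the 2x2 table of test result versus true disease status; a small Python sketch (the cell counts are made up for illustration):

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Standard accuracy measures for a binary test against a gold
    standard, from the 2x2 table of true/false positives/negatives."""
    return {
        "sensitivity": tp / (tp + fn),   # P(test+ | disease present)
        "specificity": tn / (tn + fp),   # P(test- | disease absent)
        "ppv":         tp / (tp + fp),   # P(disease present | test+)
        "npv":         tn / (tn + fn),   # P(disease absent | test-)
    }

# Hypothetical screening results: 100 diseased, 900 healthy subjects
m = diagnostic_measures(tp=90, fp=30, fn=10, tn=870)
```

Note that sensitivity and specificity characterize the test itself, while the predictive values also depend on disease prevalence in the tested population, which is why comparing predictive values across tests needs the paired statistics the macro computes.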
The paper introduces users to how they can use a set of SAS® macros, %LIFETEST and %LIFETESTEXPORT, to generate survival analysis reports for data with or without competing risks. The macros provide a wrapper of PROC LIFETEST and an enhanced version of the SAS autocall macro %CIF to give users an easy-to-use interface to report both survival estimates and cumulative incidence estimates in a unified way. The macros also provide a number of parameters to enable users to flexibly adjust how the final reports should look without the need to manually input or format the final reports.
Zhen-Huan Hu, Medical College of Wisconsin
Based on work by Thall et al. (2012), we implement a method for randomizing patients in a Phase II trial. We accumulate evidence that identifies which dose(s) of a cancer treatment provide the most desirable profile, per a matrix of efficacy and toxicity combinations rated by expert oncologists (0-100). Experts also define the region of Good utility scores and criteria of dose inclusion based on toxicity and efficacy performance. Each patient is rated for efficacy and toxicity at a specified time point. Simulation work is done mainly using PROC MCMC in which priors and likelihood function for joint outcomes of efficacy and toxicity are defined to generate posteriors. Resulting joint probabilities for doses that meet the inclusion criteria are used to calculate the mean utility and probability of having Good utility scores. Adaptive randomization probabilities are proportional to the probabilities of having Good utility scores. A final decision of the optimal dose will be made at the end of the Phase II trial.
Qianyi Huang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
Fitting mixed models to complicated data, such as data that include multiple sources of variation, can be a daunting task. SAS/STAT® software offers several procedures and approaches for fitting mixed models. This paper provides guidance on how to overcome obstacles that commonly occur when you fit mixed models using the MIXED and GLIMMIX procedures. Examples are used to showcase procedure options and programming techniques that can help you overcome difficult data and modeling situations.
Kathleen Kiernan, SAS
Are we alone in this universe? This is a question that undoubtedly passes through every mind several times during a lifetime. We often hear stories about close encounters, Unidentified Flying Object (UFO) sightings, and other mysterious things, but we lack documented evidence for analysis on this topic. UFOs have long been a matter of public interest. The objective of this paper is to analyze a database that holds a collection of documented reports of UFO sightings, to uncover any fascinating stories in the data. Using SAS® Enterprise Miner™ 13.1, the powerful capabilities of text analytics and topic mining are leveraged to summarize the associations between reported sightings. We used PROC GEOCODE to convert the addresses of sightings to locations on the map. The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values), which can then be used on a map to calculate distances or to perform spatial analysis. We then used PROC GMAP to produce a heat map representing the frequency of sightings in various locations. On preliminary analysis of the data associated with sightings, we found that the most popular words associated with UFOs describe their shapes, formations, movements, and colors. The Text Profile node in SAS Enterprise Miner 13.1 was leveraged to build a model and cluster the data into different levels of the segment variable, and we explain how opinions about the UFO sightings change over time. Further, this analysis uses the Text Profile node to find interesting terms or topics that were used to describe the UFO sightings. Based on feedback received at the SAS® Analytics Conference, we plan to incorporate a technique to filter duplicate comments and to include the weather at each location.
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
Naresh Abburi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Zabiulla Mohammed, Oklahoma State University
Since Maine established the first All Payer Claims Database (APCD) in 2003, 10 additional states have established APCDs and 30 others are in development or show strong interest in establishing APCDs. APCDs are generally mandated by legislation, though voluntary efforts exist. They are administered through various agencies, including state health departments or other governmental agencies and private not-for-profit organizations. APCDs receive funding from various sources, including legislative appropriations and private foundations. To ensure sustainability, APCDs must also consider the sale of data access and reports as a source of revenue. With the advent of the Affordable Care Act, there has been an increased interest in APCDs as a data source to aid in health care reform. The call for greater transparency in health care pricing and quality, development of Patient-Centered Medical Homes (PCMHs) and Accountable Care Organizations (ACOs), expansion of state Medicaid programs, and establishment of health insurance and health information exchanges have increased the demand for the type of administrative claims data contained in an APCD. Data collection, management, analysis, and reporting issues are examined with examples from implementations of live APCDs. The development of data intake, processing, warehousing, and reporting standards is discussed in light of achieving the triple aim of improving the individual experience of care, improving the health of populations, and reducing the per capita costs of care. APCDs are compared and contrasted with other sources of state-level health care data, including hospital discharge databases, state departments of insurance records, and institutional and consumer surveys. The benefits and limitations of administrative claims data are reviewed.
Specific issues addressed with examples include implementing transparent reporting of service prices and provider quality, maintaining master patient and provider identifiers, validating APCD data and comparing them with other state health care data available to researchers and consumers, defining data suppression rules to ensure patient confidentiality and HIPAA-compliant data release and reporting, and serving multiple end users, including policy makers, researchers, and consumers, with appropriately consumable information.
Paul LaBrec, 3M Health Information Systems
Ordinary least squares regression is one of the most widely used statistical methods. However, it is a parametric model and relies on assumptions that are often not met. Alternative methods of regression for continuous dependent variables relax these assumptions in various ways. This paper explores procedures such as QUANTREG, ADAPTIVEREG, and TRANSREG for these kinds of data.
Peter Flom, Peter Flom Consulting
At Scotia-Colpatria Bank, the retail segment is very important. The volume of lending applications makes it necessary to use statistical models and analytic tools to make an initial selection of good customers, whom our credit analysts then study in depth to finally approve or deny a credit application. The construction of target vintages using the Cox model will generate past-due alerts sooner, so mitigation measures can be applied one or two months earlier than at present. This can reduce losses by 100 bps in the new vintages. This paper estimates a Cox proportional hazards model and compares the results with a logit model for a specific product of the bank. Additionally, we estimate the target vintage for the product.
Ivan Atehortua Rojas, Scotia-Colpatria Bank
In our management and collection area, there was no methodology that provided the optimal number of collection calls needed to get the customer to make the minimum payment on his or her financial obligation. We wanted to determine the optimal number of calls using the data envelopment analysis (DEA) optimization methodology. Using this methodology, we obtained results that positively impacted the way our customers were contacted. We can maintain a healthy bank-customer relationship, keep management and collection at an operational level, and obtain a more effective and efficient portfolio recovery. The DEA optimization methodology has been successfully used in various fields of manufacturing production and has solved multi-criteria optimization problems, but it has not been commonly used in the financial sector, especially in the collection area. This methodology requires specialized software, such as SAS® Enterprise Guide® and its robust optimization. In this presentation, we present PROC OPTMODEL and show how to formulate the optimization problem, create the program, and process the available data.
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Ana Nieto, Banco Colpatria of Scotiabank Group
Missing data are a common and significant problem that researchers and data analysts encounter in applied research. Because most statistical procedures require complete data, missing data can substantially affect the analysis and the interpretation of results if left untreated. Methods to treat missing data have been developed so that missing values are imputed and analyses can be conducted using standard statistical procedures. Among these missing data methods, multiple imputation has received considerable attention and its effectiveness has been explored (for example, in the context of survey and longitudinal research). This paper compares four multiple imputation approaches for treating missing continuous covariate data under MCAR, MAR, and NMAR assumptions, in the context of propensity score analysis and observational studies. The comparison of the four MI approaches in terms of bias in parameter estimates, Type I error rates, and statistical power is presented. In addition, complete case analysis (listwise deletion) is presented as the default analysis that would be conducted if missing data are not treated. Issues are discussed, and conclusions and recommendations are provided.
Patricia Rodriguez de Gil, University of South Florida
Shetay Ashford, University of South Florida
Chunhua Cao, University of South Florida
Eun-Sook Kim, University of South Florida
Rheta Lanehart, University of South Florida
Reginald Lee, University of South Florida
Jessica Montgomery, University of South Florida
Yan Wang, University of South Florida
This session will describe an innovative way to identify groupings of customer offerings using SAS® software. The authors investigated the customer enrollments in nine different programs offered by a large energy utility. These programs included levelized billing plans, electronic payment options, renewable energy, energy efficiency programs, a home protection plan, and a home energy report for managing usage. Of the 640,788 residential customers, 374,441 had been solicited for a program and had adequate data for analysis. Nearly half of these eligible customers (49.8%) enrolled in some type of program. To examine the commonality among programs based on characteristics of customers who enroll, cluster analysis procedures and correlation matrices are often used. However, the value of these procedures was greatly limited by the binary nature of enrollments (enroll or no enroll), as well as the fact that some programs are mutually exclusive (limiting cross-enrollments for correlation measures). To overcome these limitations, PROC LOGISTIC was used, with the same predictor variables for every program, to generate a predicted enrollment score for each customer for each program. This provided a broad range of scores for each program, under the assumption that customers who are likely to join similar programs would have similar predicted scores for these programs. PROC FASTCLUS was used to build k-means cluster models based on these predicted logistic scores. Two distinct clusters were identified from the nine programs. These clusters not only aligned with the hypothesized model, but were generally supported by correlations (using PROC CORR) among program predicted scores as well as program enrollments.
Brian Borchers, PhD, Direct Options
Ashlie Ossege, Direct Options
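The scoring-then-clustering approach can be mimicked in miniature in Python (the paper uses PROC LOGISTIC and PROC FASTCLUS; the plain k-means routine, program names, and score vectors below are illustrative stand-ins): each program is represented by its vector of predicted enrollment scores across customers, and programs with similar score profiles land in the same cluster.

```python
import random

def kmeans(points, k, iters=50, rng=None):
    """Plain k-means over points (tuples); returns a cluster label per point."""
    rng = rng or random.Random(7)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean)
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Hypothetical predicted enrollment scores for 4 programs over 5 customers;
# programs whose score profiles move together should share a cluster.
program_scores = {
    "levelized_billing": (0.8, 0.7, 0.2, 0.1, 0.9),
    "e_payment":         (0.7, 0.8, 0.3, 0.2, 0.8),
    "renewables":        (0.1, 0.2, 0.9, 0.8, 0.2),
    "efficiency":        (0.2, 0.1, 0.8, 0.9, 0.1),
}
labels = kmeans(list(program_scores.values()), k=2)
```

With these toy scores, the billing/payment programs form one cluster and the renewables/efficiency programs another, mirroring the two-cluster result the paper reports for its nine programs.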
Testing for unit roots and determining whether a data set is nonstationary is important for the economist who does empirical work. SAS® enables the user to detect unit roots using an array of tests: the Dickey-Fuller, Augmented Dickey-Fuller, Phillips-Perron, and the Kwiatkowski-Phillips-Schmidt-Shin test. This paper presents a brief overview of unit roots and shows how to test for a unit root using the example of U.S. national health expenditure data.
Don McCarthy, Kaiser Permanente
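The intuition behind these tests can be sketched in a few lines of Python (a pure-Python Dickey-Fuller-style regression with no constant and no lags, not the SAS® implementation in PROC ARIMA or the %DFTEST macro; the simulated series are illustrative):

```python
import random

def dickey_fuller_stat(y):
    """t-statistic for rho in the regression dy_t = rho * y_{t-1} + e_t.

    Under a unit root, rho = 0; a strongly negative t-statistic (judged
    against Dickey-Fuller critical values, about -1.95 at the 5% level
    for the no-constant case, not normal critical values) is evidence
    against a unit root.
    """
    x = y[:-1]
    dy = [y[t + 1] - y[t] for t in range(len(y) - 1)]
    sxx = sum(v * v for v in x)
    rho = sum(a * b for a, b in zip(x, dy)) / sxx
    resid = [d - rho * v for d, v in zip(dy, x)]
    s2 = sum(r * r for r in resid) / (len(dy) - 1)
    return rho / (s2 / sxx) ** 0.5

rng = random.Random(1)
shocks = [rng.gauss(0, 1) for _ in range(500)]
walk, ar = [0.0], [0.0]
for e in shocks:
    walk.append(walk[-1] + e)    # random walk: unit root present
    ar.append(0.5 * ar[-1] + e)  # stationary AR(1): no unit root

stat_walk = dickey_fuller_stat(walk)  # should not be strongly negative
stat_ar = dickey_fuller_stat(ar)      # strongly negative: reject unit root
```

The stationary series produces a deeply negative statistic while the random walk does not, which is exactly the contrast the Dickey-Fuller family of tests formalizes.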
A complex survey data set is one characterized by any combination of the following four features: stratification, clustering, unequal weights, or finite population correction factors. In this paper, we provide context for why these features might appear in data sets produced from surveys, highlight some of the formulaic modifications they introduce, and outline the syntax needed to properly account for them. Specifically, we explain why you should use the SURVEY family of SAS/STAT® procedures, such as PROC SURVEYMEANS or PROC SURVEYREG, to analyze data of this type. Although many of the syntax examples are drawn from a fictitious expenditure survey, we also discuss the origins of complex survey features in three real-world survey efforts sponsored by statistical agencies of the United States government, namely the National Ambulatory Medical Care Survey, the National Survey of Family Growth, and the Commercial Buildings Energy Consumption Survey.
Taylor Lewis, University of Maryland
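To make the formulaic modifications concrete, here is a minimal Python sketch of a design-based mean with a Taylor-linearized standard error driven by between-PSU variation within strata (the approach PROC SURVEYMEANS takes by default; the stratified, clustered, weighted expenditure records are invented for illustration):

```python
from collections import defaultdict

def survey_mean(records):
    """Design-based mean and standard error for complex survey data.

    records: list of (stratum, psu, weight, y).  The weighted mean is a
    ratio estimator; its SE comes from the between-PSU variation of the
    linearized values w * (y - mean) within each stratum, which a simple
    SRS formula would ignore.
    """
    W = sum(w for _, _, w, _ in records)
    ybar = sum(w * y for _, _, w, y in records) / W
    # PSU-level totals of the linearized values
    psu_tot = defaultdict(float)
    for h, psu, w, y in records:
        psu_tot[(h, psu)] += w * (y - ybar)
    by_stratum = defaultdict(list)
    for (h, _), z in psu_tot.items():
        by_stratum[h].append(z)
    var = 0.0
    for zs in by_stratum.values():
        n_h = len(zs)
        zbar = sum(zs) / n_h
        var += n_h / (n_h - 1) * sum((z - zbar) ** 2 for z in zs)
    return ybar, (var ** 0.5) / W

# Hypothetical records: (stratum, psu, weight, expenditure)
recs = [(1, 1, 100, 20.0), (1, 1, 100, 24.0),
        (1, 2, 100, 30.0), (1, 2, 100, 34.0),
        (2, 1, 50, 50.0), (2, 1, 50, 54.0),
        (2, 2, 50, 40.0), (2, 2, 50, 44.0)]
mean, se = survey_mean(recs)
```

Note that the weights enter both the point estimate and the variance, and that observations within the same PSU contribute as a single unit to the variance, which is why ignoring the design typically understates standard errors.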
The importance of econometrics in the analytics toolkit is increasing every day. Econometric modeling helps uncover structural relationships in observational data. This paper highlights the many recent changes to the SAS/ETS® portfolio that increase your power to explain the past and predict the future. Examples show how you can use Bayesian regression tools for price elasticity modeling, use state space models to gain insight from inconsistent time series, use panel data methods to help control for unobserved confounding effects, and much more.
Mark Little, SAS
Kenneth Sanford, SAS
This paper provides an overview of analysis of data derived from complex sample designs. General discussion of how and why analysis of complex sample data differs from standard analysis is included. In addition, a variety of applications are presented using PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG, PROC SURVEYLOGISTIC, and PROC SURVEYPHREG, with an emphasis on correct usage and interpretation of results.
Patricia Berglund, University of Michigan
This presentation provides an overview of the advancement of analytics in National Hockey League (NHL) hockey, including how it applies to areas such as player performance, coaching, and injuries. The speaker discusses his analysis on predicting concussions that was featured in the New York Times, as well as other examples of statistical analysis in hockey.
Peter Tanner, Capital One
At Multibanca Colpatria of Scotiabank, we offer a broad range of financial services and products in Colombia. In collection management, we currently handle more than 400,000 customers each month. In the call center, agents record the answers from each contact with the customer, and this information is saved in databases. However, this information has not been explored to learn more about our customers and our own operation. The objective of this paper is to develop a classification model using the words in each customer's answers during calls about receiving payment. Using a combination of text mining and clustering methodologies, we identify the possible conversations that can occur in each stage of delinquency. This knowledge makes it possible to develop specialized scripts for collection management.
Oscar Ayala, Colpatria
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Data mining and predictive models are extensively used to find the optimal customer targets in order to maximize the return on investment. Traditional direct marketing techniques target all the customers who are likely to buy, regardless of customer classification. In a real sense, this mechanism cannot identify the customers who would buy even without a marketing contact, thereby resulting in a loss on investment. This paper focuses on the incremental lift modeling approach, using Weight of Evidence coding and Information Value, followed by the Incremental Response model and outcome model diagnostics. This model identifies the additional purchases that would not have taken place without a marketing campaign. Modeling work was conducted using a combined model. The research was carried out on Travel Center data. The analysis identifies an increase in average response rate of 2.8% and an additional 244 fuel gallons compared with the results from the traditional campaign, which targeted everyone. This paper discusses in detail the implementation of the Incremental Response node to direct marketing campaigns, along with its incremental revenue and profit analysis.
Sravan Vadigepalli, Best Buy
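The Weight of Evidence and Information Value calculations mentioned above follow a standard recipe; a small Python sketch (the spend bins and responder counts are hypothetical, not the Travel Center data):

```python
import math

def woe_iv(bins):
    """Weight of Evidence per bin and total Information Value.

    bins: dict of bin name -> (responders, non_responders).
    WOE = ln(share of responders / share of non-responders) in the bin;
    IV sums the WOE-weighted share differences and is a common screen
    for a variable's predictive strength before modeling.
    """
    tot_r = sum(r for r, _ in bins.values())
    tot_n = sum(n for _, n in bins.values())
    woe, iv = {}, 0.0
    for name, (r, n) in bins.items():
        pr, pn = r / tot_r, n / tot_n
        woe[name] = math.log(pr / pn)
        iv += (pr - pn) * woe[name]
    return woe, iv

# Hypothetical spend bins for a candidate predictor
woe, iv = woe_iv({"low": (20, 480), "mid": (50, 450), "high": (80, 370)})
```

Bins with positive WOE are enriched in responders and bins with negative WOE in non-responders; replacing raw bin values with their WOE yields a monotone numeric input for the response model.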
Approximately 80% of world trade at present uses the seaways, with around 110,000 merchant vessels and 1.25 million seafarers transporting almost 6 billion tons of goods every year. Marine piracy stands as a serious challenge to sea trade, and understanding how pirate attacks occur is crucial to countering marine piracy effectively. Predictive modeling that combines textual data with numeric data provides an effective methodology for deriving insights from both structured and unstructured data. 2,266 text descriptions of pirate incidents that occurred over the past seven years, from 2008 to the second quarter of 2014, were collected from the International Maritime Bureau (IMB) website. Analysis of the textual data using SAS® Enterprise Miner™ 12.3, with the help of concept links, answered questions on certain aspects of pirate activities, such as the following: 1. What arms do pirates use in attacks? 2. How do pirates steal ships? 3. How do pirates escape after the attacks? 4. What are the reasons for occasional unsuccessful attacks? Topics are extracted from the text descriptions using a Text Topic node, and the varying trends of these topics are analyzed with respect to time. Using the Cluster node, attack descriptions are classified into different categories based on attack style and pirate behavior described by a set of terms. A target variable called Attack Type is derived from the clusters and is combined with other structured input variables such as Ship Type, Status, Region, Part of Day, and Part of Year. A predictive model is built with Attack Type as the target variable and the other structured variables as input predictors, and is used to predict the possible type of attack given the details of a ship and its travel.
Thus, the results of this paper could be very helpful for the shipping industry to become more aware of possible attack types for different vessel types when traversing different routes, and to devise counter-strategies to reduce the effects of piracy on crews, vessels, and cargo.
Raghavender Reddy Byreddy, Oklahoma State University
Nitish Byri, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Tejeshwar Gurram, Oklahoma State University
Anvesh Reddy Minukuri, Oklahoma State University
Data comes from a rich variety of sources in a rich variety of types, shapes, sizes, and properties. The analysis can be challenged by data that is too tall or too wide; too full of miscodings, outliers, or holes; or that contains funny data types. Wide data, in particular, has many challenges, requiring the analysis to adapt with different methods. Making covariance matrices with 2.5 billion elements is just not practical. JMP® 12 will address these challenges.
John Sall, SAS
This analysis is based on data for all transactions at four parking meters within a small area in central Copenhagen for a period of four years. The observations show the exact minute parking was bought and the amount of time for which parking was bought in each transaction. These series of at most 80,000 transactions are aggregated to the hour, day, week, and month using PROC TIMESERIES. The aggregated series of parking times and the number of transactions are analyzed for seasonality and interdependence by PROC X12, PROC UCM, and PROC VARMAX.
Anders Milhoj, Copenhagen University
In many spatial analysis applications (including crime analysis, epidemiology, ecology, and forestry), spatial point process modeling can help you study the interaction between different events and help you model the process intensity (the rate of event occurrence per unit area). For example, crime analysts might want to estimate where crimes are likely to occur in a city and whether they are associated with locations of public features such as bars and bus stops. Forestry researchers might want to estimate where trees grow best and test for association with covariates such as elevation and gradient. This paper describes the SPP procedure, new in SAS/STAT® 13.2, for exploring and modeling spatial point pattern data. It describes methods that PROC SPP implements for exploratory analysis of spatial point patterns and for log-linear intensity modeling that uses covariates. It also shows you how to use specialized functions for studying interactions between points and how to use specialized analytical graphics to diagnose log-linear models of spatial intensity. Crime analysis, forestry, and ecology examples demonstrate key features of PROC SPP.
Pradeep Mohan, SAS
Randy Tobias, SAS
The Ebola virus outbreak is producing some of the most significant and fastest trending news throughout the globe today. There is a lot of buzz surrounding the deadly disease and the drastic consequences that it potentially poses to mankind. Social media provides the basic platforms for millions of people to discuss the issue and allows them to openly voice their opinions. There has been a significant increase in the magnitude of responses all over the world since the death of an Ebola patient in a Dallas, Texas hospital. In this paper, we aim to analyze the overall sentiment that is prevailing in the world of social media. For this, we extracted the live streaming data from Twitter at two different times using the Python scripting language. One instance relates to the period before the death of the patient, and the other relates to the period after the death. We used SAS® Text Miner nodes to parse, filter, and analyze the data and to get a feel for the patterns that exist in the tweets. We then used SAS® Sentiment Analysis Studio to further analyze and predict the sentiment of the Ebola outbreak in the United States. In our results, we found that the issue was not taken very seriously until the death of the Ebola patient in Dallas. After the death, we found that prominent personalities across the globe were talking about the disease and then raised funds to fight it. We are continuing to collect tweets. We analyze the locations of the tweets to produce a heat map that corresponds to the intensity of the varying sentiment across locations.
Dheeraj Jami, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Shivkanth Lanka, Oklahoma State University
Twitter is a powerful form of social media for sharing information about various issues and can be used to raise awareness and collect pointers about associated risk factors and preventive measures. Type-2 diabetes is a national problem in the US. We analyzed Twitter feeds about Type-2 diabetes in order to suggest how social media can be used rationally in the focused study of an ailment. To accomplish this task, 900 tweets were collected using Twitter API v1.1 in a Python script. Tweets, follower counts, and user information were extracted via the scripts. The tweets were segregated into different groups on the basis of their annotations related to risk factors, complications, preventions and precautions, and so on. We then used SAS® Text Miner to analyze the data. We found that 70% of the tweets stated the status quo, based on marketing and awareness campaigns. The remaining 30% of tweets contained various key terms and labels associated with Type-2 diabetes. It was observed that influential users tweeted more about precautionary measures, whereas non-influential people gave suggestions about treatments as well as preventions and precautions.
Shubhi Choudhary, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Vijay Singh, Oklahoma State University
Improving tourists' satisfaction and intention to revisit a festival is an ongoing area of interest to the tourism industry. Festival organizers strive very hard to attract and retain attendees by investing heavily in their marketing and promotion strategies. To meet this challenge, an advanced analytical model based on a data mining approach is proposed to answer the following research question: What are the most important factors that influence tourists' intentions to revisit the festival site? Cluster analysis, neural networks, decision trees, stepwise regression, polynomial regression, and support vector machines are applied in this study. The main goal is to determine what it takes not only to retain loyal attendees, but also to attract and encourage new attendees to return to the site.
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Thanathorn Vajirakachorn, Dept. of Tourism Management, School of Business, University of the Thai Chamber of Commerce
Managing the large-scale displacement of people and communities caused by a natural disaster has historically been reactive rather than proactive. Following a disaster, data is collected to inform and prompt operational responses. In many countries prone to frequent natural disasters, such as the Philippines, large amounts of longitudinal data are collected and available to apply to new disaster scenarios. However, because of the nature of natural disasters, it is difficult to analyze all of the data until long after the emergency has passed. For this reason, little research and analysis have been conducted to derive deeper analytical insight for proactive responses. This paper demonstrates the application of SAS® analytics to this data and establishes predictive alternatives that can improve conventional storm responses. Humanitarian organizations can use this data to understand displacement patterns and trends and to optimize evacuation routing and planning. Identifying the main contributing factors and leading indicators for the displacement of communities in a timely and efficient manner prevents detrimental incidents at disaster evacuation sites. Using quantitative and qualitative methods, responding organizations can make data-driven decisions that innovate and improve approaches to managing disaster response on a global basis. Creating a data-driven analytical model can help reduce response time, improve the health and safety of displaced individuals, and optimize scarce resources more effectively. The International Organization for Migration (IOM), an intergovernmental organization, is one of the first-response organizations on the ground in most emergencies. IOM is the global co-lead for the Camp Coordination and Camp Management (CCCM) cluster in natural disasters. This paper shows how SAS® Visual Analytics and SAS® Visual Statistics were used in the Philippines in response to Super Typhoon Haiyan in November 2013 to develop increasingly accurate models for better emergency preparedness. Using data collected from IOM's Displacement Tracking Matrix (DTM), the final analysis shows how to better coordinate service delivery to evacuation centers sheltering large numbers of displaced individuals, applying accurate hindsight to develop foresight on how to better respond to emergencies and disasters. Predictive models build on patterns found in historical and transactional data to identify risks and opportunities. The capacity to predict trends and behavior patterns related to displacement and mobility has the potential to enable IOM to respond in a more timely and targeted manner. By predicting the locations of displacement, the numbers of persons displaced, the number of vulnerable groups, and the sites at most risk of security incidents, humanitarians can respond quickly and more effectively with the appropriate resources (material and human) from the outset. The end analysis uses the SAS® Storm Optimization model combined with human mobility algorithms to predict population movement.
Lorelle Yuen, International Organization for Migration
Kathy Ball, Devon Energy
With governments and commissions increasingly incentivizing electric utilities to get consumers to save energy, there has been a large increase in the number of energy-saving programs. Some are structural, incentivizing consumers to make improvements to their home that result in energy savings. Others, called behavioral programs, are designed to get consumers to change their behavior to save energy. Within behavioral programs, Home Energy Reports are a good method to achieve behavioral savings as well as to educate consumers on structural energy savings. This paper examines the different Home Energy Report communication channels (direct mail and e-mail) and the marketing channel effect on energy savings, using SAS® for linear models. For consumer behavioral change, we often hear the questions: 1) Are the people who responded via direct mail solicitation saving at a higher rate than people who responded via an e-mail solicitation? 1a) Hypothesis: Because e-mail is easy to respond to, the type of customers who enroll through this channel will exert less effort on the behavior changes that require more time and investment toward energy efficiency, and thus will save less. 2) Does the mode of the ongoing dialog (mail versus e-mail) affect the amount of consumer savings? 2a) Hypothesis: E-mail is more likely to be ignored, and thus these recipients will save less. Because savings is most often calculated by comparing the treatment group to a control group (to account for weather and economic impact over time), and by definition you cannot have a dialog with a control group, the answers are not a simple PROC FREQ away. Also, people who responded to mail look very different demographically from people who responded to e-mail. So, is the driver of savings differences the channel, or is it the demographics of the customers who happen to use those channels? This study used clustering (PROC FASTCLUS) to segment the consumers by mail versus e-mail and append cluster assignments to the respective control group. This study also used Difference-in-Differences (DID) as well as billing analysis (PROC GLM) to calculate the savings of these groups.
Angela Wells, Direct Options
Ashlie Ossege, Direct Options
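The Difference-in-Differences calculation the study describes reduces to comparing the pre/post change in the treatment group against the pre/post change in its control group; a minimal sketch with invented kWh figures (the actual study estimates this within PROC GLM):

```python
def did_savings(treat_pre, treat_post, ctrl_pre, ctrl_post):
    # Change in treatment minus change in control.
    # A negative result means usage fell more in the treatment group,
    # i.e., savings attributable to the program.
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# invented average monthly kWh: both groups drop (mild weather),
# but the treatment group drops more
effect = did_savings(1000, 920, 1000, 980)  # -60 kWh attributable to treatment
```

Subtracting the control group's change is what nets out weather and economic effects that hit both groups alike.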
Census data, such as education and income, has been extensively used for various purposes. The data is usually collected in percentages of census unit levels, based on the population sample. Such presentation of the data makes it hard to interpret and compare. A more convenient way of presenting the data is to use the geocoded percentage to produce counts for a pseudo-population. We developed a very flexible SAS® macro to automatically generate the descriptive summary tables for the census data as well as to conduct statistical tests to compare the different levels of the variable by groups. The SAS macro is not only useful for census data but can be used to generate summary tables for any data with percentages in multiple categories.
Janet Lee, Kaiser Permanente Southern California
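The macro's core idea of turning unit-level percentages into counts for a pseudo-population amounts to the following; the figures here are hypothetical, not from any census file:

```python
def pseudo_counts(percentages, population):
    # Convert category percentages for one census unit into
    # whole-person counts for a pseudo-population of the given size.
    return {cat: round(p / 100 * population) for cat, p in percentages.items()}

# hypothetical education breakdown for one tract of 4,000 people
tract = pseudo_counts({"no_hs": 12.5, "hs": 47.5, "college": 40.0}, 4000)
```

Counts in this form can then be tabulated and tested with the usual frequency-table machinery, which is awkward to do directly on percentages.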
In observational data analyses, it is often helpful to use patients as their own controls by comparing their outcomes before and after some signal event, such as the initiation of a new therapy. It might be useful to have a control group that does not have the event but that is instead evaluated before and after some arbitrary point in time, such as their birthday. In this context, the change over time is a continuous outcome that can be modeled as a (possibly discontinuous) line, with the same or different slope before and after the event. Mixed models can be used to estimate random slopes and intercepts and compare patients between groups. A specific example published in a peer-reviewed journal is presented.
David Pasta, ICON Clinical Research
Kaiser Permanente Northwest is contractually obligated for regulatory submissions to Oregon Health Authority, Health Share of Oregon, and Molina Healthcare in Washington. The submissions consist of Medicaid Encounter data for medical and pharmacy claims. SAS® programs are used to extract claims data from Kaiser's claims data warehouse, process the data, and produce output files in HIPAA ASC X12 and NCPDP format. Prior to April 2014, programs were written in SAS® 8.2 running on a VAX server. Several key drivers resulted in the conversion of the existing system to SAS® Enterprise Guide® 5.1 running on UNIX. These drivers were: the need to have a scalable system in preparation for the Affordable Care Act (ACA); performance issues with the existing system; incomplete process reporting and notification to business owners; and a highly manual, labor-intensive process of running individual programs. The upgraded system addressed these drivers. The estimated cost reduction was from $1.30 per reported encounter to $0.13 per encounter. The converted system provides for better preparedness for the ACA. One expected result of ACA is significant Medicaid membership growth. The program has already increased in size by 50% in the preceding 12 months. The updated system allows for the expected growth in membership.
Eric Sather, Kaiser Permanente
Your data analysis projects can leverage the new HYPERGROUP action to mine relationships using Graph Theory. Discover which data entities are related and, conversely, which sets of values are disjoint. In cases where the sets of values are not completely disjoint, HYPERGROUP can identify data that is strongly connected, as well as neighboring data that is weakly connected or a greater distance away. Each record is assigned a hypergroup number and, within hypergroups, a color, a community, or both. The GROUPBY facility, WHERE clauses, or both can act on the hypergroup number, color, or community to conduct analytics using data that is close, related, or more relevant. The algorithms used to determine hypergroups are based on Graph Theory. We show how the results of HYPERGROUP allow equivalent graphs to be displayed and useful information to be seen, and how they aid you in controlling what data is required to perform your analytics. Crucial data structure can be unearthed and seen.
Yue Qi, SAS
Trevor Kearney, SAS
To bring order to the wild world of big data, EMC and its partners have joined forces to meet customer challenges and deliver a modern analytic architecture. This unified approach encompasses big data management, analytics discovery, and deployment via end-to-end solutions that solve your big data problems. They are also designed to free up more time for innovation, deliver faster deployments, and help you find new insights from secure and properly managed data. The EMC Business Data Lake is a fully engineered, enterprise-grade data lake built on a foundation of core data technologies. It provides pre-configured building blocks that enable self-service, end-to-end integration, management, and provisioning of the entire big data environment. Major benefits include the ability to make more timely and informed business decisions and realize the vision of analytics in weeks instead of months. SAS enhances the Federation Business Data Lake by providing superior breadth and depth of analytics to tackle any big data analytics problem an organization might have, whether it's fraud detection, risk management, customer intelligence, predictive asset maintenance, or something else. SAS and EMC work together to deliver a robust and comprehensive big data solution that reduces risk, automates provisioning and configuration, and is purpose-built for big data analytics workloads.
Casey James, EMC
Learn how a new product from SAS enables you to easily build and compare multiple candidate models for all your business segments.
Steve Sparano, SAS
Many certification programs classify candidates into performance levels. For example, the SAS® Certified Base Programmer breaks down candidates into two performance levels: Pass and Fail. It is important to note that because all test scores contain measurement error, the performance level categorizations based on those test scores are also subject to measurement error. An important part of psychometric analysis is to estimate the decision consistency of the classifications. This study helps fill a gap in estimating decision consistency statistics for a single administration of a test using SAS.
Fan Yang, The University of Iowa
Yi Song, University of Illinois at Chicago
The success of an experimental study almost always hinges on how you design it. Does it provide estimates for everything you're interested in? Does it take all the experimental constraints into account? Does it make efficient use of limited resources? The OPTEX procedure in SAS/QC® software enables you to focus on specifying your interests and constraints, and it takes responsibility for handling them efficiently. With PROC OPTEX, you skip the step of rifling through tables of standard designs to try to find the one that's right for you. You concentrate on the science and the analytics and let SAS® do the computing. This paper reviews the features of PROC OPTEX and shows them in action using examples from field trials and food science experimentation. PROC OPTEX is a useful tool for all these situations, doing the designing and freeing the scientist to think about the food and the biology.
Cliff Pereira, Dept of Statistics, Oregon State University
Randy Tobias, SAS
This session is an introduction to predictive analytics and causal analytics in the context of improving outcomes. The session covers the following topics: 1) Basic predictive analytics vs. causal analytics; 2) The causal analytics framework; 3) Testing whether the outcomes improve because of an intervention; 4) Targeting the cases that have the best improvement in outcomes because of an intervention; and 5) Tweaking an intervention in a way that improves outcomes further.
Jason Pieratt, Humana
Data analysis begins with cleaning up data, calculating descriptive statistics, and examining variable distributions. Before more rigorous statistical analysis begins, many statisticians perform basic inferential statistical tests, such as chi-square and t tests, to assess unadjusted associations. These tests help guide the direction of the more rigorous analysis. We present how to perform chi-square and t tests, explain how to interpret the output, and show where to look for the association or difference based on the hypothesis being tested. We propose the next steps for further analysis using example data.
Maribeth Johnson, Georgia Regents University
Jennifer Waller, Georgia Regents University
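The arithmetic behind the two tests the paper performs (in SAS, via PROC FREQ and PROC TTEST) can be computed by hand; this Python sketch uses invented data purely for illustration:

```python
import math

def chi_square(table):
    # Pearson chi-square statistic for a two-way contingency table:
    # sum over cells of (observed - expected)^2 / expected.
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return sum((obs - rows[i] * cols[j] / total) ** 2
               / (rows[i] * cols[j] / total)
               for i, r in enumerate(table) for j, obs in enumerate(r))

def t_statistic(a, b):
    # Two-sample t statistic with pooled variance (equal-variance form).
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
```

The statistic is then compared to the appropriate chi-square or t reference distribution to obtain the p-value that the SAS procedures report directly.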
In Bayesian statistics, Markov chain Monte Carlo (MCMC) algorithms are an essential tool for sampling from probability distributions. PROC MCMC provides these algorithms out of the box. However, it is often desirable to code an algorithm from scratch. This is especially true in academia, where students are expected to be able to understand and code an MCMC algorithm themselves. The ability of SAS® to accomplish this is relatively unknown yet quite straightforward. We use SAS/IML® to demonstrate methods for coding an MCMC algorithm, with examples of a Gibbs sampler and a Metropolis-Hastings random walk.
Chelsea Lofland, University of California Santa Cruz
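The random-walk Metropolis-Hastings idea the paper codes in SAS/IML looks like this in outline; this is a language-neutral Python sketch targeting a standard normal, seeded for reproducibility:

```python
import math
import random

def metropolis_normal(n, step=1.0, seed=42):
    # Random-walk Metropolis-Hastings targeting the standard normal density.
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n):
        prop = x + rng.uniform(-step, step)  # symmetric proposal
        # Accept with probability min(1, pi(prop)/pi(x));
        # for the standard normal the ratio is exp((x^2 - prop^2)/2).
        if rng.random() < math.exp((x * x - prop * prop) / 2):
            x = prop
        chain.append(x)  # on rejection, the current value is repeated
    return chain

draws = metropolis_normal(20000)[2000:]  # discard burn-in
```

A Gibbs sampler follows the same loop structure but replaces the accept/reject step with exact draws from each full conditional distribution in turn.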
Competing risks arise in time-to-event data when the event of interest cannot be observed because a competing event occurs first. For example, if the event of interest is a specific cause of death, then death from any other cause is a competing event; if the focus is on relapse, then death before relapse constitutes a competing event. It is well established that, in the presence of competing risks, the standard product-limit methods yield biased results because their basic assumption is violated. The effect of competing events on parameter estimation depends on their distribution and frequency. Fine and Gray's sub-distribution hazard model can be used in the presence of competing events and is available in PROC PHREG with the release of SAS® 9.4.
Lovedeep Gondara, University of Illinois Springfield
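The quantity that competing-risks methods target, the cumulative incidence function, can be estimated nonparametrically in a few lines; this Python sketch on invented data illustrates the idea (the Fine and Gray model in PROC PHREG goes further by modeling covariate effects on this quantity):

```python
def cuminc(data, cause):
    # Nonparametric cumulative incidence for one cause.
    # data: (time, status) pairs; status 0 = censored, 1/2 = competing causes.
    data = sorted(data)
    n = len(data)
    surv, cif, i = 1.0, 0.0, 0
    while i < n:
        t = data[i][0]
        at_risk = n - i
        d_cause = sum(1 for tt, s in data if tt == t and s == cause)
        d_any = sum(1 for tt, s in data if tt == t and s != 0)
        cif += surv * d_cause / at_risk           # mass assigned to this cause
        surv *= 1 - d_any / at_risk               # overall event-free survival
        i += sum(1 for tt, s in data if tt == t)  # advance past this time
    return cif

# invented observations: three cause-1 events, one cause-2 event, one censored
obs = [(1, 1), (2, 2), (3, 1), (4, 0), (5, 1)]
```

Treating the competing cause as censoring and taking 1 minus the Kaplan-Meier estimate would overstate the incidence, which is the bias the abstract refers to.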
Survey research can provide a straightforward and effective means of collecting input on a range of topics. Survey researchers often like to group similar survey items into construct domains in order to make generalizations about a particular area of interest. Confirmatory Factor Analysis is used to test whether this pre-existing theoretical model underlies a particular set of responses to survey questions. Based on Structural Equation Modeling (SEM), Confirmatory Factor Analysis provides the survey researcher with a means to evaluate how well the actual survey response data fits within the a priori model specified by subject matter experts. PROC CALIS now provides survey researchers the ability to perform Confirmatory Factor Analysis using SAS®. This paper provides a survey researcher with the steps needed to complete Confirmatory Factor Analysis using SAS. We discuss and demonstrate the options available to survey researchers in the handling of missing and not applicable survey responses using an ARRAY statement within a DATA step and imputation of item non-response. A simple demonstration of PROC CALIS is then provided with interpretation of key portions of the SAS output. Using recommendations provided by SAS from the PROC CALIS output, the analysis is then modified to provide a better fit of survey items into survey domains.
Lindsey Brown Philpot, Baylor Scott & White Health
Sunni Barnes, Baylor Scott&White Health
Crystal Carel, BaylorScott&White Health Care System
Many organizations need to forecast large numbers of time series that are discretely valued. These series, called count series, fall approximately between continuously valued time series, for which there are many forecasting techniques (ARIMA, UCM, ESM, and others), and intermittent time series, for which there are a few forecasting techniques (Croston's method and others). This paper proposes a technique for large-scale automatic count series forecasting and uses SAS® Forecast Server and SAS/ETS® software to demonstrate this technique.
Michael Leonard, SAS
In today's competitive world, acquiring new customers is crucial for businesses, but what if most of the acquired customers turn out to be defaulters? This decision would backfire on the business and might lead to losses. Statistical methods enable businesses to identify good-risk customers rather than judging them intuitively. The objective of this paper is to build a credit risk scorecard using the Credit Risk Node inside SAS® Enterprise Miner™ 12.3, which can be used by a manager to make an instant decision on whether to accept or reject a customer's credit application. The data set used for credit scoring was extracted from the UCI Machine Learning repository and consisted of 15 variables that capture details such as status of the customer's existing checking account, purpose of the credit, credit amount, employment status, and property. To ensure generalization of the model, the data set was partitioned using the Data Partition node into two groups in a 70:30 ratio for training and validation, respectively. The target is a binary variable that categorizes customers into good-risk and bad-risk groups. After identifying the key variables required to generate the credit scorecard, a particular score was assigned to each of their subgroups. The final model generating the scorecard has a prediction accuracy of about 75%. A cumulative cut-off score of 120 was generated by SAS to make the demarcation between good-risk and bad-risk customers. Even if the data varies in the future, model refinement is easy because the whole process is already defined and does not need to be rebuilt from scratch.
Ayush Priyadarshi, Oklahoma State University
Kushal Kathed, Oklahoma State University
Shilpi Prasad, Oklahoma State University
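The mapping from a model's predicted odds to scorecard points typically follows points-to-double-the-odds scaling; this Python sketch uses illustrative parameters, not the ones the Credit Risk node derived for the paper's data:

```python
import math

def score(p_bad, base_score=120, base_odds=1.0, pdo=20):
    # Standard scorecard scaling: every `pdo` points doubles the good:bad odds.
    # base_score is anchored at base_odds (here, even 1:1 odds score 120).
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds = (1 - p_bad) / p_bad  # good:bad odds implied by predicted P(bad)
    return offset + factor * math.log(odds)
```

A fixed cut-off on this scale (such as the 120 mentioned in the abstract) then corresponds to a fixed odds threshold, which is what makes accept/reject decisions instant.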
Statistical analyses that use data from clinical or epidemiological studies include continuous variables such as patient's age, blood pressure, and various biomarkers. Over the years, there has been an increase in studies that focus on assessing associations between biomarkers and disease of interest. Many of the biomarkers are measured as continuous variables. Investigators seek to identify a possible cutpoint to classify patients as high risk versus low risk based on the value of the biomarker. Several data-oriented techniques, such as the median and upper quartile, and outcome-oriented techniques based on score, Wald, and likelihood ratio tests are commonly used in the literature. Contal and O'Quigley (1999) presented a technique that uses the log-rank test statistic to estimate the cutpoint. Their method was computationally intensive and hence was overlooked due to the unavailability of built-in options in standard statistical software. In 2003, we provided the %FINDCUT macro that used Contal and O'Quigley's approach to identify a cutpoint when the outcome of interest is measured as time to event. Over the past decade, demand for this macro has continued to grow, which has led us to update %FINDCUT to incorporate new tools and procedures from SAS® such as array processing, Graph Template Language, and the REPORT procedure. New and updated features include: results presented in a much cleaner report format, user-specified cutpoints, macro parameter error checking, temporary data set cleanup, preservation of current option settings, and increased processing speed. We present the utility and added options of the revised %FINDCUT macro using a real-life data set. In addition, we critically compare this method to some of the existing methods and discuss the use and misuse of categorizing a continuous covariate.
Jay Mandrekar, Mayo Clinic
Jeffrey Meyers, Mayo Clinic
Texas is one of about 30 states that have recently passed laws requiring voters to produce valid IDs in an effort to prevent illegal voting. This new regulation, however, might negatively affect voting opportunities for students, low-income people, and minorities. To determine the actual effects of the regulation in Dallas County, voters were surveyed as they exited the polling offices during the November midterm election about difficulties that they might have encountered in the voting process. The database of the voting history of each registered voter in the county was examined, and the data set was converted into an analyzable structure prior to stratification. All of the polling offices were stratified by the residents' degree of involvement in the past three general elections, namely, the proportion of people who have used early voting and who have voted at least once. A two-phase sampling design was adopted for stratification. On election day, pollsters were sent to selected polling offices and interviewed 20 voters during a selected time period. Once the data were collected, the number of people who had difficulties was estimated.
Yusun Xia, Southern Methodist University
A common problem when developing classification models is the imbalance of classes in the classification variable. This imbalance means that one class is represented by a large number of cases while the other class is represented by very few. When this happens, the predictive power of the developed model can be biased, because classification methods tend to favor the majority class and are designed to minimize the error on the total data set regardless of the proportions or balance of the classes. Several techniques are used to balance the distribution of the classification variable: one method is to reduce the size of the majority class (under-sampling), another is to increase the number of cases in the minority class (over-sampling), and a third is to combine these two methods. There is also a more sophisticated technique called SMOTE (Synthetic Minority Over-sampling Technique), which intelligently generates new synthetic records of the minority class using a closest-neighbors approach. In this paper, we present the development in SAS® of a combination of SMOTE and under-sampling techniques as applied to a churn model. We then compare the predictive power of the model using this proposed balancing technique against models developed with other data sampling techniques.
Lina Maria Guzman Cartagena, DIRECTV
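The SMOTE step the paper implements in SAS can be outlined as interpolation between a minority case and one of its nearest minority neighbors; this is a seeded Python sketch on toy points, not the paper's implementation:

```python
import math
import random

def smote(minority, n_new, k=3, seed=7):
    # Generate n_new synthetic minority points: pick a minority case, pick one
    # of its k nearest minority neighbors, and take a random point on the
    # segment between them.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p != x),
                           key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# toy minority class in two dimensions
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new = smote(pts, 10)
```

Because each synthetic point is a convex combination of two real minority cases, the new cases stay inside the region the minority class already occupies rather than simply duplicating existing records.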
Graduate students often need to explore data and summarize multiple statistical models into tables for a dissertation. The challenges of data summarization include coding multiple, similar statistical models, and summarizing these models into meaningful tables for review. The default method is to type (or copy and paste) results into tables. This often takes longer than creating and running the analyses. Students might spend hours creating tables, only to have to start over when a change or correction in the underlying data requires the analyses to be updated. This paper gives graduate students the tools to efficiently summarize the results of statistical models in tables. These tools include a macro-based SAS/STAT® analysis and ODS OUTPUT statement to summarize statistics into meaningful tables. Specifically, we summarize PROC GLM and PROC LOGISTIC output. We convert an analysis of hospital-acquired delirium from hundreds of pages of output into three formatted Microsoft Excel files. This paper is appropriate for users familiar with basic macro language.
Elisa Priest, Texas A&M University Health Science Center
Ashley Collinsworth, Baylor Scott & White Health/Tulane University
Deep Learning is one of the most exciting research areas in machine learning today. While Deep Learning algorithms are typically very sophisticated, you may be surprised how much you can understand about the field with just a basic knowledge of neural networks. Come learn the fundamentals of this exciting new area and see some of SAS' newest technologies for neural networks.
Patrick Hall, SAS
There is a widely forecast skills gap developing between the numbers of Big Data Analytics (BDA) graduates and the predicted jobs market. Many universities are developing innovative programs to increase the numbers of BDA graduates and postgraduates. The University of Derby has recently developed two new programs that aim to be unique and offer the applicants highly attractive and career-enhancing programs of study. One program is an undergraduate Joint Honours program that pairs analytics with a range of alternative subject areas; the other is a Master's program that has specific emphasis on governance and ethics. A critical aspect of both programs is the synthesis of a Personal Development Planning Framework that enables the students to evaluate their current status, identifies the steps needed to develop toward their career goals, and that provides a means of recording their achievements with evidence that can then be used in job applications. In the UK, we have two sources of skills frameworks that can be synthesized to provide a self-assessment matrix for the students to use as their Personal Development Planning (PDP) toolkit. These are the Skills Framework for the Information Age (SFIA-Plus) framework developed by the SFIA Foundation, and the Student Employability Profiles developed by the Higher Education Academy. A new set of National Occupational Skills (NOS) frameworks (Data Science, Data Management, and Data Analysis) have recently been released by the organization e-Skills UK for consultation. SAS® UK has had significant input to this new set of NOSs. This paper demonstrates how curricula have been developed to meet the Big Data Analytics skills shortfall by using these frameworks and how these frameworks can be used to guide students in their reflective development of their career plans.
Richard Self, University of Derby
Analyzing the key success factors for hit songs in the Billboard music charts is an ongoing area of interest to the music industry. Although there have been many studies over the past decades on predicting whether a song has the potential to become a hit song, the following research question remains: Can hit songs be predicted? And, if the answer is yes, what are the characteristics of those hit songs? This study applies data mining techniques using SAS® Enterprise Miner™ to understand why some music is more popular than other music. In particular, certain songs are considered one-hit wonders, which appear in the Billboard music charts only once, while other songs are acknowledged as masterpieces. With 2,139 data records, the results demonstrate the practical validity of our approach.
Piboon Banpotsakun, National Institute of Development Administration
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Data scientists and analytic practitioners have become obsessed with quantifying the unknown. Through text mining third-person posthumous narratives in SAS® Enterprise Miner™ 12.1, we measured tangible aspects of personalities based on the broadly accepted big-five characteristics: extraversion, agreeableness, conscientiousness, neuroticism, and openness. These measurable attributes are linked to common descriptive terms used throughout our data to establish statistical relationships. The data set contains over 1,000 obituaries from newspapers throughout the United States, with individuals who vary in age, gender, demographic, and socio-economic circumstances. In our study, we leveraged existing literature to build the ontology used in the analysis. This literature suggests that a third person's perspective gives insight into one's personality, solidifying the use of obituaries as a source for analysis. We statistically linked target topics such as career, education, religion, art, and family to the five characteristics. With these taxonomies, we developed multivariate models in order to assign scores to predict an individual's personality type. With a trained model, this study has implications for predicting an individual's personality, allowing for better decisions on human capital deployment. Even outside the traditional application of personality assessment for organizational behavior, the methods used to extract intangible characteristics from text enable us to identify valuable information across multiple industries and disciplines.
Mark Schneider, Deloitte & Touche
Andrew Van Der Werff, Deloitte & Touche, LLP
Most manuscripts in medical journals contain summary tables that combine simple summaries and between-group comparisons. These tables typically combine estimates for categorical and continuous variables. The statistician generally summarizes the data using the FREQ procedure for categorical variables and compares percentages between groups using a chi-square or a Fisher's exact test. For continuous variables, the MEANS procedure is used to summarize data as either means and standard deviations or medians and quartiles. Then these statistics are generally compared between groups by using the GLM procedure or NPAR1WAY procedure, depending on whether one is interested in a parametric test or a non-parametric test. The outputs from these different procedures are then combined and presented in a concise format ready for publication. Currently there is no straightforward way in SAS® to build these tables in a presentable format that can then be customized to individual tastes. In this paper, we focus on presenting summary statistics and results from comparing categorical variables between two or more independent groups. The macro takes the data set, the number of treatment groups, and the type of test (either chi-square or Fisher's exact) as input and presents the results in a publication-ready table. This macro automates summarizing data to a certain extent and minimizes risky typographical errors when copying results or typing them into a table.
Jeff Gossett, University of Arkansas for Medical Sciences
Mallikarjuna Rettiganti, UAMS
It has always been a million-dollar question: What inhibits a donor from donating? Many successful universities have deep roots in annual giving. We know donor sentiment is a key factor in drawing attention to engage donors. This paper is a summary of findings about donor behaviors using textual analysis combined with the power of predictive modeling. In addition to identifying the characteristics of donors, the paper focuses on identifying the characteristics of a first-time donor. It distinguishes the features of the first-time donor from the general donor pattern. It leverages the variations in data to provide deeper insights into behavioral patterns. A data set containing 247,000 records was obtained from the XYZ University Foundation alumni database, Facebook, and Twitter. Solicitation content such as email subject lines sent to the prospect base was considered. Time-dependent data and time-independent data were categorized to make unbiased predictions about the first-time donor. The predictive models use input such as age, educational records, scholarships, events, student memberships, and solicitation methods. Models such as decision trees, Dmine regression, and neural networks were built to predict the prospects. SAS® Sentiment Analysis Studio and SAS® Enterprise Miner™ were used to analyze the sentiment.
Ramcharan Kakarla, Comcast
Goutam Chakraborty, Oklahoma State University
The purpose of this paper is to introduce a SAS® macro named %DOUBLEGLM that enables users to model the mean and dispersion jointly using the double generalized linear models described in Nelder (1991) and Lee (1998). The R functions FITJOINT and DGLM (R Development Core Team, 2011) were used to verify the suitability of the %DOUBLEGLM macro estimates. The results showed that the macro estimates were very close to those produced by the R functions.
Paulo Silva, Universidade de Brasilia
Alan Silva, Universidade de Brasilia
During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS® Asset Performance Analytics and SAS® Enterprise Miner™ to build a model for drilling and well control anomalies, fingerprint key well control measures of the transient fluid properties, and operationalize these analytics on the drilling assets with SAS® Event Stream Processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables the rapid differentiation between safe and unsafe modes of operation.
Jim Duarte, SAS
Keith Holdaway, SAS
Moray Laing, SAS
At NC State University, our motto is Think and Do. When it comes to educating students in the Poole College of Management, that means that we want them to not only learn to think critically but also to gain hands-on experience with the tools that will enable them to be successful in their careers. And, in the era of big data, we want to ensure that our students develop skills that will help them to think analytically in order to use data to drive business decisions. One method that lends itself well to thinking and doing is the case study approach. In this paper, we discuss the case study approach for teaching analytical skills and highlight the use of SAS® software for providing practical, hands-on experience with manipulating and analyzing data. The approach is illustrated with examples from specific case studies that have been used for teaching introductory and intermediate courses in business analytics.
Tonya Balan, NC State University
Respondent Driven Sampling (RDS) is both a sampling method and a data analysis technique. As a sampling method, RDS is a chain referral technique with strategic recruitment quotas and specific data gathering requirements. Like other chain referral techniques (for example, snowball sampling), the chains and waves are the starting point for analysis. But building the chains and waves can still be a daunting task because it involves many transpositions and merges. This paper provides an efficient method of using Base SAS® to build up chains and waves.
Wen Song, ICF International
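The chain-and-wave construction described above amounts to a breadth-first traversal of the referral links: each seed starts a chain at wave 0, and each recruit sits one wave below its recruiter. The paper implements this in Base SAS with transposes and merges; the following Python sketch (all names illustrative) shows only the underlying logic.

```python
from collections import deque

def build_waves(seeds, recruits):
    """Assign each respondent a (chain, wave) pair from referral links.
    `recruits` maps a recruiter id to the list of ids they recruited.
    Illustrative sketch of the chain/wave logic, not the paper's SAS code."""
    info = {}
    for seed in seeds:
        info[seed] = (seed, 0)              # seeds start their own chain at wave 0
        queue = deque([seed])
        while queue:
            person = queue.popleft()
            chain, wave = info[person]
            for r in recruits.get(person, []):
                if r not in info:           # keep the first referral encountered
                    info[r] = (chain, wave + 1)
                    queue.append(r)
    return info
```

For example, with one seed "S1" who recruits "A" and "B", and "A" who recruits "C", the function places "C" in chain "S1" at wave 2.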
The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS® software has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is designed for general Bayesian modeling. This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then uses examples from the GENMOD and PHREG procedures to describe the built-in Bayesian capabilities, which became available for all platforms in SAS/STAT® 9.3. Discussion includes how to specify prior distributions, evaluate convergence diagnostics, and interpret the posterior summary statistics.
Maura Stokes, SAS
My SAS® Global Forum 2013 paper 'Variable Reduction in SAS® by Using Weight of Evidence (WOE) and Information Value (IV)' has become the most sought-after online article on variable reduction in SAS since its publication. But the methodology provided by the paper is limited to reduction of numeric variables for logistic regression only. Built on a similar process, the current paper adds several major enhancements: 1) The use of WOE and IV has been expanded to the analytics and modeling for continuous dependent variables. After the standardization of a continuous outcome, all records can be divided into two groups: positive performance (outcome y above sample average) and negative performance (outcome y below sample average). This treatment is rigorously consistent with the concept of entropy in Information Theory: the juxtaposition of two opposite forces in one equation, where a stronger contrast between the two suggests a higher intensity, that is, more information delivered by the variable in question. As the standardization keeps the outcome variable continuous and quantified, the revised formulas for WOE and IV can be used in the analytics and modeling for continuous outcomes such as sales volume, claim amount, and so on. 2) Categorical and ordinal variables can be assessed together with numeric ones. 3) Users of big data usually need to evaluate hundreds or thousands of variables, but it is not uncommon that over 90% of variables contain little useful information. We have added a SAS macro that trims these variables efficiently in a broad-brushed manner without a thorough examination. Afterward, we examine the retained variables more carefully on their behavior with respect to the target outcome. 4) We add chi-square analysis for categorical/ordinal variables and Gini coefficients for numeric variables in order to provide additional suggestions for segmentation and regression. With the above enhancements added, a SAS macro program is provided at the end of the paper as a complete suite for variable reduction/selection that efficiently evaluates all variables together. The paper provides a detailed explanation for how to use the SAS macro and how to read the SAS outputs that provide useful insights for subsequent linear regression, logistic regression, or scorecard development.
Alec Zhixiao Lin, PayPal Credit
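As a point of reference for the formulas discussed above, the classic binary-outcome WOE and IV computation can be sketched as follows. This is an illustrative Python sketch, not the paper's SAS macro; the paper's extension replaces the good/bad split with above-average/below-average performance groups of a standardized continuous outcome.

```python
import math

def woe_iv(bins):
    """Weight of Evidence and Information Value for a binned predictor.
    `bins` maps a bin label to (n_good, n_bad) counts.  The structure
    and names are illustrative only."""
    total_good = sum(g for g, b in bins.values())
    total_bad = sum(b for g, b in bins.values())
    iv = 0.0
    woe = {}
    for label, (good, bad) in bins.items():
        pct_good = good / total_good        # distribution of "goods" in this bin
        pct_bad = bad / total_bad           # distribution of "bads" in this bin
        w = math.log(pct_good / pct_bad)    # WOE for the bin
        woe[label] = w
        iv += (pct_good - pct_bad) * w      # bin's contribution to IV
    return woe, iv
```

A strongly separating predictor, e.g. `{"low": (80, 20), "high": (20, 80)}`, yields a positive WOE for the good-heavy bin, a negative WOE for the bad-heavy bin, and a large IV.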
Proving difference is the point of most statistical testing. In contrast, the point of equivalence and noninferiority tests is to prove that results are substantially the same, or at least not appreciably worse. An equivalence test can show that a new treatment, one that is less expensive or causes fewer side effects, can replace a standard treatment. A noninferiority test can show that a faster manufacturing process creates no more product defects or industrial waste than the standard process. This paper reviews familiar and new methods for planning and analyzing equivalence and noninferiority studies in the POWER, TTEST, and FREQ procedures in SAS/STAT® software. Techniques that are discussed range from Schuirmann's classic method of two one-sided tests (TOST) for demonstrating similar normal or lognormal means in bioequivalence studies, to Farrington and Manning's noninferiority score test for showing that an incidence rate (such as a rate of mortality, side effects, or product defects) is no worse. Real-world examples from clinical trials, drug development, and industrial process design are included.
John Castelloe, SAS
Donna Watts, SAS
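Schuirmann's TOST, mentioned above, amounts to testing both one-sided hypotheses against the equivalence margins and taking the larger of the two p-values. A minimal sketch using a normal approximation follows; PROC TTEST uses exact t-based tests, and the function name and margins here are illustrative assumptions.

```python
import math

def tost_equivalence(diff, se, lower, upper):
    """Two one-sided tests (TOST) for equivalence, normal approximation.
    diff: observed mean difference; se: its standard error;
    (lower, upper): equivalence margins.  Returns the TOST p-value,
    i.e., the larger of the two one-sided p-values.  Illustrative
    sketch, not the PROC TTEST implementation."""
    z_lower = (diff - lower) / se   # test of H0: true diff <= lower
    z_upper = (diff - upper) / se   # test of H0: true diff >= upper
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    p_lower = 1 - phi(z_lower)      # P(Z > z_lower)
    p_upper = phi(z_upper)          # P(Z < z_upper)
    return max(p_lower, p_upper)
```

With a difference of 0, standard error 1, and margins of ±3, both one-sided tests reject and equivalence is demonstrated; a difference of 2.5 against the same margins fails the upper-margin test.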
Design of experiments (DOE) is an essential component of laboratory, greenhouse, and field research in the natural sciences. It has also been an integral part of scientific inquiry in diverse social science fields such as education, psychology, marketing, pricing, and social work. The principles and practices of DOE are among the oldest and the most advanced tools within the realm of statistics. DOE classification schemes, however, are diverse and, at times, confusing. In this presentation, we provide a simple conceptual classification framework in which experimental methods are grouped into classical and statistical approaches. The classical approach is further divided into pre-, quasi-, and true-experiments. The statistical approach is divided into one, two, and more than two factor experiments. Within these broad categories, we review several contemporary and widely used designs and their applications. The optimal use of Base SAS® and SAS/STAT® to analyze, summarize, and report these diverse designs is demonstrated. The prospects and challenges of such diverse and critically important analytics tools on business insight extraction in marketing and pricing research are discussed.
Max Friedauer
Jason Greenfield, Cardinal Health
Yuhan Jia, Cardinal Health
Joseph Thurman, Cardinal Health
Long before analysts began mining large data sets, mathematicians sought truths hidden within the set of natural numbers. This exploration was formalized into the mathematical subfield known as number theory. Though this discipline has proven fruitful for many applied fields, number theory delights in numerical truth for its own sake. The austere and abstract beauty of number theory prompted nineteenth century mathematician Carl Friedrich Gauss to dub it 'The Queen of Mathematics.' This session and the related paper demonstrate that the analytical power of the SAS® engine is well-suited for exploring concepts in number theory. In Base SAS®, students and educators will find a powerful arsenal for exploring, illustrating, and visualizing the following: prime numbers, perfect numbers, the Euclidean algorithm, the prime number theorem, Euler's totient function, Chebyshev's theorem, the Chinese remainder theorem, the Gauss circle problem, and more! The paper delivers custom SAS procedures and code segments that generate data sets relevant to the exploration of topics commonly found in elementary number theory texts. The efficiency of these algorithms is discussed and an emphasis is placed on structuring data sets to maximize flexibility and ease in exploring new conjectures and illustrating known theorems. Last, the power of SAS plotting is put to use in unexpected ways to visualize and convey number-theoretic facts. This session and the related paper are geared toward the academic user or SAS user who welcomes and revels in mathematical diversions.
Matthew Duchnowski, Educational Testing Service
As pollution and population continue to increase, new concepts of eco-friendly commuting evolve. One of the emerging concepts is the bicycle sharing system. It is a bike rental service on a short-term basis at a moderate price. It provides people the flexibility to rent a bike from one location and return it to another location. This business is quickly gaining popularity all over the globe. In May 2011, there were only 375 bike rental schemes consisting of nearly 236,000 bikes. However, this number jumped to 535 bike sharing programs with approximately 517,000 bikes in just a couple of years. It is expected that this trend will continue to grow at a similar pace in the future. Most of the businesses involved in this system of bike rental are faced with the challenge of balancing supply and inconsistent demand. The number of bikes needed on a particular day can vary based on several factors such as season, time, temperature, wind speed, humidity, holiday, and day of the week. In this paper, we have tried to solve this problem using SAS® Forecast Studio. Incorporating the effects of all the above factors and analyzing the demand trends of the last two years, we have been able to precisely forecast the number of bikes needed on any day in the future. Also, we are able to do scenario analysis to observe the effect of particular variables on the demand.
Kushal Kathed, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ayush Priyadarshi, Oklahoma State University
A powerful tool for visually analyzing regression analysis is the forest plot. Model estimates, ratios, and rates with confidence limits are graphically stacked vertically in order to show how they overlap with each other and to show values of significance. The ability to see whether two values are significantly different from each other or whether a covariate has a significant meaning on its own is made much simpler in a forest plot rather than sifting through numbers in a report table. The amount of data preparation needed in order to build a high-quality forest plot in SAS® can be tremendous because the programmer needs to run analyses, extract the estimates to be plotted, structure the estimates in a format conducive to generating a forest plot, and then run the correct plotting procedure or create a graph template using the Graph Template Language (GTL). While some SAS procedures can produce forest plots using Output Delivery System (ODS) Graphics automatically, the plots are not generally publication-ready and are difficult to customize even if the programmer is familiar with GTL. The macro %FORESTPLOT is designed to perform all of the steps of building a high-quality forest plot in order to save time for both experienced and inexperienced programmers, and is currently set up to perform regression analyses common to clinical oncology research (Cox proportional hazards regression and logistic regression), as well as calculate Kaplan-Meier event-free rates. To improve flexibility, the user can specify a pre-built data set to transform into a forest plot if the automated analysis options of the macro do not fit the user's needs.
Jeffrey Meyers, Mayo Clinic
Qian Shi, Mayo Clinic
During grad school, students learn SAS® in class or on their own for a research project. Time is limited, so faculty have to focus on what they know are the fundamental skills that students need to successfully complete their coursework. However, real-world research projects are often multifaceted and require a variety of SAS skills. When students transition from grad school to a paying job, they might find that in order to be successful, they need more than the basic SAS skills that they learned in class. This paper highlights 10 insights that I've had over the past year during my transition from grad school to a paying SAS research job. I hope this paper will help other students make a successful transition. Top 10 insights: 1. You still get graded, but there is no syllabus. 2. There isn't time for perfection. 3. Learn to use your resources. 4. There is more than one solution to every problem. 5. Asking for help is not a weakness. 6. Working with a team is required. 7. There is more than one type of SAS®. 8. The skills you learned in school are just the basics. 9. Data is complicated and often frustrating. 10. You will continue to learn both personally and professionally.
Lauren Hall, Baylor Scott & White Health
Elisa Priest, Texas A&M University Health Science Center
In many studies, a continuous response variable is repeatedly measured over time on one or more subjects. The subjects might be grouped into different categories, such as cases and controls. The study of resulting observation profiles as functions of time is called functional data analysis. This paper shows how you can use the SSM procedure in SAS/ETS® software to model these functional data by using structural state space models (SSMs). A structural SSM decomposes a subject profile into latent components such as the group mean curve, the subject-specific deviation curve, and the covariate effects. The SSM procedure enables you to fit a rich class of structural SSMs, which permit latent components that have a wide variety of patterns. For example, the latent components can be different types of smoothing splines, including polynomial smoothing splines of any order and all L-splines up to order 2. The SSM procedure efficiently computes the restricted maximum likelihood (REML) estimates of the model parameters and the best linear unbiased predictors (BLUPs) of the latent components (and their derivatives). The paper presents several real-life examples that show how you can fit, diagnose, and select structural SSMs; test hypotheses about the latent components in the model; and interpolate and extrapolate these latent components.
Rajesh Selukar, SAS
Quality measurement is increasingly important in the health-care sphere for both performance optimization and reimbursement. Treatment of chronic conditions is a key area of quality measurement. However, medication compendiums change frequently, and health-care providers often free-text medications into a patient's record. Manually reviewing a complete medications database is time consuming. In order to build a robust medications list, we matched a pharmacist-generated list of categorized medications to a raw medications database that contained names, name-dose combinations, and misspellings. The matching procedure we used is based on the COMPGED function, which computes a generalized edit distance between two strings. We combined a truncation function and an upcase function to optimize the COMPGED output. Using these combinations and manipulating the COMPGED scoring metric enabled us to narrow the database list to medications that were relevant to our categories. This process transformed a tedious task for PROC COMPARE or an Excel macro into a quick and efficient method of matching. The task of sorting through relevant matches was still conducted manually, but the time required to do so was significantly decreased by the fuzzy match in our application of COMPGED.
Arti Virkud, NYC Department of Health
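The fuzzy-match idea above can be sketched outside SAS with a plain Levenshtein distance standing in for the richer generalized edit distance that COMPGED computes. The cutoff and truncation length below are illustrative assumptions, not the authors' settings.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (COMPGED uses a weighted cost scheme;
    this simpler metric stands in for the idea)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(name, reference, cutoff=2, trunc=10):
    """Mimic the upcase + truncate preprocessing described in the
    abstract before the fuzzy comparison.  cutoff/trunc are illustrative."""
    key = name.upper()[:trunc]
    hits = [(ref, edit_distance(key, ref.upper()[:trunc])) for ref in reference]
    return sorted((h for h in hits if h[1] <= cutoff), key=lambda t: t[1])
```

A misspelled entry such as "metfornin" then matches the reference item "Metformin" at distance 1 while unrelated drug names fall outside the cutoff.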
Reporting effect sizes in addition to statistical significance is strongly encouraged, and effect sizes should be considered when evaluating the results of a study. The choice of an effect size for ANOVA models can be confusing because indices might differ depending on the research design as well as the magnitude of the effect. Olejnik and Algina (2003) proposed the generalized eta-squared and omega-squared effect sizes, which are comparable across a wide variety of research designs. This paper provides a SAS® macro for computing the generalized omega-squared effect size associated with analysis of variance models by using data from PROC GLM ODS tables. The paper provides the macro programming language, as well as results from an executed example of the macro.
Anh Kellermann, University of South Florida
Yi-hsin Chen, USF
Jeffrey Kromrey, University of South Florida
Thanh Pham, USF
Patrice Rasmussen, USF
Patricia Rodriguez de Gil, University of South Florida
Jeanine Romano, USF
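For orientation, the ordinary (non-generalized) omega-squared for a one-way ANOVA can be computed directly from sums of squares, as sketched below. The macro described above computes the more general Olejnik-Algina version from PROC GLM ODS tables; this simpler one-way sketch only illustrates the quantity.

```python
def omega_squared(groups):
    """One-way ANOVA omega-squared effect size from raw group data:
    (SS_between - (k-1)*MS_within) / (SS_total + MS_within).
    Illustrative sketch, not the paper's generalized macro."""
    all_vals = [x for g in groups for x in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_total = sum((x - grand) ** 2 for x in all_vals)
    ms_within = (ss_total - ss_between) / (n - k)
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
```

Two well-separated groups such as [1, 1, 2] and [5, 5, 6] yield an omega-squared near 0.92, indicating that almost all of the variance is attributable to the group effect.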
This presentation provides a brief introduction to logistic regression analysis in SAS®. Learn the differences between linear regression and logistic regression, including ordinary least squares versus maximum likelihood estimation. Learn to understand LOGISTIC procedure syntax, use continuous and categorical predictors, and interpret output from ODS Graphics.
Danny Modlin, SAS
For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now they are gaining favor in business statistics for giving better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn how predictions based on a mixed model differ from predictions in ordinary regression, and see examples of mixed models with business data.
Catherine Truxillo, SAS
Text data constitutes more than half of the unstructured data held in organizations. Buried within the narrative of customer inquiries, the pages of research reports, and the notes in servicing transactions are the details that describe concerns, ideas, and opportunities. The manual effort historically needed to develop a training corpus is no longer required, making it simpler to gain insight buried in unstructured text. By combining the ease of machine learning with the specificity of linguistic rules, SAS® Contextual Analysis helps analysts identify and evaluate the meaning of the electronic written word. From a single point-and-click GUI, the process of developing text models is guided and visually intuitive. This presentation walks through the text model development process with SAS® Contextual Analysis. The results are in SAS format, ready for text-based insights to be used in any other SAS application.
George Fernandez, SAS
SAS/ETS® provides many tools to improve the productivity of the analyst who works with time series data. This tutorial will take an analyst through the process of turning transaction-level data into a time series. The session will then cover some basic forecasting techniques that use past fluctuations to predict future events. We will then extend this modeling technique to include explanatory factors in the prediction equation.
Kenneth Sanford, SAS
Graduate students encounter many challenges when conducting health services research using real world data obtained from electronic health records (EHRs). These challenges include cleaning and sorting data, summarizing and identifying present-on-admission diagnosis codes, identifying appropriate metrics for risk-adjustment, and determining the effectiveness and cost effectiveness of treatments. In addition, outcome variables commonly used in health service research are not normally distributed. This necessitates the use of nonparametric methods in statistical analyses. This paper provides graduate students with the basic tools for the conduct of health services research with EHR data. We will examine SAS® tools and step-by-step approaches used in an analysis of the effectiveness and cost-effectiveness of the ABCDE (Awakening and Breathing Coordination, Delirium monitoring/management, and Early exercise/mobility) bundle in improving outcomes for intensive care unit (ICU) patients. These tools include the following: (1) ARRAYS; (2) lookup tables; (3) LAG functions; (4) PROC TABULATE; (5) recycled predictions; and (6) bootstrapping. We will discuss challenges and lessons learned in working with data obtained from the EHR. This content is appropriate for beginning SAS users.
Ashley Collinsworth, Baylor Scott & White Health/Tulane University
Elisa Priest, Texas A&M University Health Science Center
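Of the techniques listed above, bootstrapping (item 6) is the easiest to sketch outside SAS. A minimal percentile-bootstrap confidence interval in Python follows; this is illustrative only, as the paper's analysis is done in SAS, and the resample count and seed are arbitrary choices.

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for any statistic.
    `stat` is a function of a sample; resamples are drawn with
    replacement.  Illustrative sketch only."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]          # 2.5th percentile
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]  # 97.5th percentile
    return lo, hi
```

For the mean of the integers 1 through 100 (true mean 50.5), the resulting 95% interval comfortably brackets the true value.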
Learning a new programming language is not an easy task, especially for someone who does not have any programming experience. Learning the SAS® programming language can be even more challenging. One of the reasons is that the SAS System consists of a variety of languages, such as the DATA step language, SAS macro language, Structured Query Language for the SQL procedure, and so on. Furthermore, each of these languages has its own unique characteristics and simply learning the syntax is not sufficient to grasp the language essence. Thus, it is not unusual to hear about someone who has learned SAS for several years and has never become a SAS programming expert. By using the DATA step language as an example, I would like to share some of my experiences on effectively teaching the SAS language.
Arthur Li, City of Hope National Medical Center
This study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Three different analyses for latent variable discovery are briefly reviewed and explored. The procedures explored in this paper are PROC LCA, PROC LTA, PROC CATMOD, PROC FACTOR, PROC TRAJ, and PROC SURVEYLOGISTIC. The analyses defined through these procedures are latent profile analyses, latent class analyses, and latent transition analyses. The latent variables are included in three separate regression models. The effect of the latent variables on the fit and use of the regression model compared to a similar model using observed data is briefly reviewed. The data used for this study were obtained from the National Longitudinal Study of Adolescent Health (Add Health). Data were analyzed using SAS® 9.3. This paper is intended for any level of SAS® user. This paper is also aimed at an audience with a background in behavioral science or statistics.
Deanna Schreiber-Gregory, National University
With the constant need to inform researchers about neighborhood health data, the Santa Clara County Health Department created socio-demographic and health profiles for 109 neighborhoods in the county. Data was pulled from many public and county data sets, compiled, analyzed, and automated using SAS®. With over 60 indicators and 109 profiles, an efficient set of macros was used to automate the calculation of percentages, rates, and mean statistics for all of the indicators. Macros were also used to automate individual census tracts into pre-decided neighborhoods to avoid data entry errors. Simple SQL procedures were used to calculate and format percentages within the macros, and output was pushed out using Output Delivery System (ODS) Graphics. This output was exported to Microsoft Excel, which was used to create a sortable database for end users to compare cities and/or neighborhoods. Finally, the automated SAS output was used to map the demographic data using geographic information system (GIS) software at three geographies: city, neighborhood, and census tract. This presentation describes the use of simple macros and SAS procedures to reduce resources and time spent on checking data for quality assurance purposes. It also highlights the simple use of ODS Graphics to export data to an Excel file, which was used to mail merge the data into 109 unique profiles. The presentation is aimed at intermediate SAS users at local and state health departments who might be interested in finding an efficient way to run and present health statistics given limited staff and resources.
Roshni Shah, Santa Clara County
Is it a better business decision to determine the profitability of all business units/kiosks and then prune the nonprofitable ones? Or does model performance improve if we first find the units that meet the break-even point and then calculate their profits? In our project, we used a two-stage regression process due to the highly skewed distribution of the variables. First, we performed logistic regression to predict which kiosks would be profitable. Then, we used linear regression to predict the average monthly revenue at each kiosk. We used SAS® Enterprise Guide® and SAS® Enterprise Miner™ for the modeling process. The effectiveness of the linear regression model is much greater for predicting the target variable at profitable kiosks than at unprofitable kiosks. The two-phase regression model seemed to perform better than a single linear regression, particularly when the target variable has too many levels. In real-life situations, the dependent and independent variables can have highly skewed distributions, and two-phase regression can help improve model performance and accuracy. Some results: The logistic regression model has an overall accuracy of 82.9%, sensitivity of 92.6%, and specificity of 61.1%, with comparable figures for the training data set of 81.8%, 90.7%, and 63.8%, respectively. This indicates that the regression model seems to be consistently predicting the profitable kiosks at a reasonably good level. Linear regression model: For the training data set, the mean absolute percentage error (MAPE) is 7.2% for the kiosks that earn more than $350, whereas the MAPE for kiosks that earn less than $350 is -102% for the predicted values (not log-transformed) versus the actual values of the target. For the validation data set, the MAPE is 7.6% for the kiosks that earn more than $350, whereas the MAPE for kiosks that earn less than $350 is -142%. This means that average monthly revenue seems to be better predicted for kiosks earning more than the threshold value of $350--that is, for those kiosks with a flag variable of 1. The model seems to predict the target variable with lower APE for higher values of the target variable, for both the training data set and the entire data set. In fact, if the threshold value for the kiosks were moved to, say, $500, the predictive power of the model in terms of APE would substantially increase. The validation data set (Selection Indicator=0) has fewer data points, and, therefore, the contrast in APEs is higher and more varied.
Shrey Tandon, Sobeys West
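MAPE, the evaluation statistic quoted above, is simple to compute directly. A minimal Python sketch with invented kiosk revenue figures (not the authors' SAS code or data):

```python
def mape(actuals, predictions):
    # mean absolute percentage error, expressed as a percentage;
    # assumes no actual value is zero
    errors = [abs(a - p) / abs(a) for a, p in zip(actuals, predictions)]
    return 100.0 * sum(errors) / len(errors)

# hypothetical monthly revenue for three kiosks
actual = [400.0, 500.0, 1000.0]
pred = [380.0, 525.0, 1000.0]
print(round(mape(actual, pred), 2))  # 3.33
```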
In longitudinal data, it is important to account for the correlation due to repeated measures and time-dependent covariates. The generalized method of moments (GMM) can be used to estimate the coefficients in longitudinal data, although there are currently few procedures in SAS® that produce GMM estimates for correlated data. In a recent paper, Lalonde, Wilson, and Yin provided a GMM model for estimating the coefficients in this type of data. PROC IML was used to generate the equations that must be solved to determine which estimating equations to use. In addition, this study extended the classification of moment conditions to include a type IV covariate. Two data sets were evaluated using this method: re-hospitalization rates from a Medicare database, and body mass index and future morbidity rates among Filipino children. Both examples contain binary responses, repeated measures, and time-dependent covariates. However, while this technique is useful, it is tedious and can be complicated when determining the matrices necessary to obtain the estimating equations. We provide a concise and user-friendly macro to fit GMM logistic regression models with extended classifications.
Katherine Cai, Arizona State University
SAS® University Edition is a great addition to the world of freely available analytic software, and this 'how-to' presentation shows you how to implement a discrete event simulation using Base SAS® to model future US Veterans population distributions. Features include generating a slideshow using ODS output to PowerPoint.
Michael Grierson
Just as research is built on existing research, the references section is an important part of a research paper. The purpose of this study is to find the differences between professionals and academicians with respect to the references section of a paper. Data is collected from SAS® Global Forum 2014 Proceedings. Two research hypotheses are supported by the data. First, the average number of references in papers by academicians is higher than those by professionals. Second, academicians follow standards for citing references more than professionals. Text mining is performed on the references to understand the actual content. This study suggests that authors of SAS Global Forum papers should include more references to increase the quality of the papers.
Vijay Singh, Oklahoma State University
Pankush Kalgotra, Oklahoma State University
In data mining, data preparation is the most crucial, most difficult, and longest part of the modeling process, and many steps are involved. Consider the simple distribution analysis of the variables, the diagnosis and reduction of multicollinearity, the imputation of missing values, and the construction of categories in variables. In this presentation, we use data mining models in different areas such as marketing, insurance, retail, and credit risk. We show how to implement data preparation in SAS® Enterprise Miner™ using different approaches, from simple code routines to complex processes involving statistical insights, cluster variables, transform variables, graphical analysis, decision trees, and more.
Ricardo Galante, SAS
Over the years, very few published studies have discussed ways to improve the performance of two-stage predictive models. This study, based on 10 years (1999-2008) of data from 130 US hospitals and integrated delivery networks, demonstrates how to leverage the Association node in SAS® Enterprise Miner™ to improve the classification accuracy of a two-stage model. We prepared the data with imputation operations and data cleaning procedures. Variable selection methods and domain knowledge were used to choose 43 key variables for the analysis. The prominent association rules revealed interesting relationships between prescribed medications and patient readmission/no-readmission. The rules with lift values greater than 1.6 were used to create dummy variables for use in the subsequent predictive modeling. Next, we used two-stage sequential modeling, where the first stage predicted whether the diabetic patient was readmitted and the second stage predicted whether the readmission happened within 30 days. The backward logistic regression model outperformed competing models for the first stage. After including the dummy variables from the association analysis, many fit indices improved, such as the validation ASE (to 0.228 from 0.238) and cumulative lift (to 1.56 from 1.40). Likewise, the performance of the second stage improved after including the dummy variables: the misclassification rate improved to 0.240 from 0.243 and the final prediction error to 0.17 from 0.18.
Girish Shirodkar, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ankita Chaudhari, Oklahoma State University
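Lift, the statistic used above to screen association rules, compares a rule's joint support against what independence would predict. A small Python illustration with hypothetical counts (not output from the Association node):

```python
def rule_lift(n, n_a, n_b, n_ab):
    # lift of the rule A -> B from transaction counts:
    # lift = support(A and B) / (support(A) * support(B));
    # lift > 1 means A and B co-occur more often than chance
    support_a = n_a / n
    support_b = n_b / n
    support_ab = n_ab / n
    return support_ab / (support_a * support_b)

# hypothetical: 1000 patients, 200 on drug A, 300 readmitted, 96 both
print(round(rule_lift(1000, 200, 300, 96), 2))  # 1.6
```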
Missing data is an unfortunate reality of statistics, but there are various ways to estimate and deal with it. This paper explores the pros and cons of traditional imputation methods versus maximum likelihood estimation, as well as single versus multiple imputation. These differences are displayed by comparing parameter estimates from a known data set while simulating random missing data of varying severity. In addition, this paper demonstrates PROC MI and PROC MIANALYZE and shows how to use these procedures in a longitudinal data set.
Christopher Yim, Cal Poly San Luis Obispo
Since the financial crisis of 2008, banks and bank holding companies in the United States have faced increased regulation. One of the recent changes to these regulations is known as the Comprehensive Capital Analysis and Review (CCAR). At the core of these new regulations, specifically under the Dodd-Frank Wall Street Reform and Consumer Protection Act and the stress tests it mandates, are a series of what-if or scenario analyses requirements that involve a number of scenarios provided by the Federal Reserve. This paper proposes frequentist and Bayesian time series methods that solve this stress testing problem using a highly practical top-down approach. The paper focuses on the value of using univariate time series methods, as well as the methodology behind these models.
Kenneth Sanford, SAS
Christian Macaro, SAS
This paper explains how to build a linear regression model using the variable transformation method. Testing the assumptions, which is required for linear modeling and testing the fit of a linear model, is included. This paper is intended for analysts who have limited exposure to building linear models. This paper uses the REG, GLM, CORR, UNIVARIATE, and GPLOT procedures.
Nancy Hu, Discover
Generalized linear models are highly useful statistical tools in a broad array of business applications and scientific fields. How can you select a good model when numerous models that have different regression effects are possible? The HPGENSELECT procedure, which was introduced in SAS/STAT® 12.3, provides forward, backward, and stepwise model selection for generalized linear models. In SAS/STAT 14.1, the HPGENSELECT procedure also provides the LASSO method for model selection. You can specify common distributions in the family of generalized linear models, such as the Poisson, binomial, and multinomial distributions. You can also specify the Tweedie distribution, which is important in ratemaking by the insurance industry and in scientific applications. You can run the HPGENSELECT procedure in single-machine mode on the server where SAS/STAT is installed. With a separate license for SAS® High-Performance Statistics, you can also run the procedure in distributed mode on a cluster of machines that distribute the data and the computations. This paper shows you how to use the HPGENSELECT procedure both for model selection and for fitting a single model. The paper also explains the differences between the HPGENSELECT procedure and the GENMOD procedure.
Gordon Johnston, SAS
Bob Rodriguez, SAS
This paper introduces Jeffreys interval for one-sample proportion using SAS® software. It compares the credible interval from a Bayesian approach with the confidence interval from a frequentist approach. Different ways to calculate the Jeffreys interval are presented using PROC FREQ, the QUANTILE function, a SAS program of the random walk Metropolis sampler, and PROC MCMC.
Wu Gong, The Children's Hospital of Philadelphia
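For context, the Jeffreys interval is just a pair of Beta quantiles: for x successes in n trials, the 100(1-alpha)% credible interval runs from the alpha/2 to the 1-alpha/2 quantile of Beta(x + 0.5, n - x + 0.5). A stdlib-only Python sketch, with naive numerical integration and bisection standing in for PROC FREQ or the QUANTILE function:

```python
import math

def beta_pdf(t, a, b, ln_beta):
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - ln_beta)

def beta_cdf(x, a, b, steps=2000):
    # trapezoidal integration of the Beta(a, b) density over [0, x]
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    h = x / steps
    total = 0.5 * (beta_pdf(0.0, a, b, ln_beta) + beta_pdf(x, a, b, ln_beta))
    for i in range(1, steps):
        total += beta_pdf(i * h, a, b, ln_beta)
    return total * h

def beta_quantile(p, a, b):
    # bisection on the CDF
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def jeffreys_interval(successes, n, level=0.95):
    # Jeffreys credible interval: quantiles of Beta(x + 0.5, n - x + 0.5)
    a, b = successes + 0.5, n - successes + 0.5
    alpha = 1.0 - level
    return beta_quantile(alpha / 2, a, b), beta_quantile(1 - alpha / 2, a, b)
```

In practice the PROC FREQ or QUANTILE routes in the paper are both faster and more accurate; this sketch only makes the definition concrete.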
The research areas of pharmaceuticals and oncology clinical trials greatly depend on time-to-event endpoints such as overall survival and progression-free survival. One of the best graphical displays of these analyses is the Kaplan-Meier curve, which can be simple to generate with the LIFETEST procedure but difficult to customize. Journal articles generally prefer that statistics such as median time-to-event, number of patients, and time-point event-free rate estimates be displayed within the graphic itself, and this was previously difficult to do without an external program such as Microsoft Excel. The macro %NEWSURV takes advantage of the Graph Template Language (GTL) that was added with the SG graphics engine to create this level of customizability without the need for back-end manipulation. Taking this one step further, the macro was improved to be able to generate a lattice of multiple unique Kaplan-Meier curves for side-by-side comparisons or for condensing figures for publications. This paper describes the functionality of the macro and describes how the key elements of the macro work.
Jeffrey Meyers, Mayo Clinic
The cyclical coordinate descent method is a simple algorithm that has been used for fitting generalized linear models with lasso penalties by Friedman et al. (2007). The coordinate descent algorithm can be implemented in Base SAS® to perform efficient variable selection and shrinkage for GLMs with the L1 penalty (the lasso).
Robert Feyerharm, Beacon Health Options
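The coordinate-wise update is a soft-thresholding step. The toy Python implementation below illustrates the algorithm on plain lists for the objective 1/2 * RSS + lam * sum(|beta_j|); it is a sketch of the technique, not the Base SAS implementation the paper describes:

```python
def soft_threshold(z, g):
    # S(z, g) = sign(z) * max(|z| - g, 0)
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    # cyclical coordinate descent for min 1/2 * ||y - X b||^2 + lam * ||b||_1
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's contribution removed
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta

# toy fit: y = 2x exactly, so with lam = 0 the coefficient recovers 2.0
beta = lasso_cd([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0], lam=0.0)
```

Increasing `lam` shrinks the coefficient toward zero and eventually sets it exactly to zero, which is what makes the lasso useful for variable selection.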
There are many pedagogic theories and practices that academics research and follow as they strive to ensure excellence in their students' achievements. In order to validate the impact of different approaches, there is a need to apply analytical techniques to evaluate the changing levels of achievement that occur as a result of changes in applied pedagogy. The analytics used should be easily accessible to all academics, with minimal overhead in terms of the collection of new data. This paper is based on a case study of the changing pedagogical approaches of the author over the past five years, using grade profiles from a wide range of modules taught by the author in both the School of Computing and Maths and the Business School at the University of Derby. Base SAS® and SAS® Studio were used to evaluate and demonstrate the impact of the change from a pedagogical position of "Academic as Domain Expert" to one of "Academic as Learning-to-Learn Expert". This change resulted in greater levels of research that supported learning, along with better writing skills. The application of learning analytics in this case study demonstrates a very significant improvement, of between 15% and 20%, in the grade profiles of all students. More surprisingly, it demonstrates that the change also eliminates a significant grade deficit in the black and minority ethnic student population, which is typically about 15% in a large number of UK universities.
Richard Self, University of Derby
There are various economic factors that affect retail sales. One important factor expected to correlate with sales is overall customer sentiment toward a brand. In this paper, we analyze how location-specific customer sentiment varies and correlates with sales at retail stores. To look for any dependency, we used location-specific Twitter feeds related to a national-brand chain retail store and opinion-mined their overall sentiment using SAS® Sentiment Analysis Studio. We estimate the correlation between the opinion index and retail sales within the studied geographic areas. Later in the analysis, using ArcGIS Online from Esri, we estimate whether other location-specific variables that could potentially correlate with customer sentiment toward the brand are significant predictors of the brand's retail sales.
Asish Satpathy, University of California, Riverside
Goutam Chakraborty, Oklahoma State University
Tanvi Kode, Oklahoma State University
The need to measure slight changes in healthcare costs and utilization patterns over time is vital in predictive modeling, forecasting, and other advanced analytics. At BlueCross BlueShield of Tennessee, a method for developing member-level regression slopes creates a better way of identifying these changes across various time spans. The goal is to create multiple metrics at the member level that indicate when an individual is seeking more or less medical or pharmacy services. Significant increases or decreases in utilization and cost are used to predict the likelihood of acquiring certain conditions, seeking services at particular facilities, and self-engaging in health and wellness. Data setup and compilation consist of calculating a member's eligibility with the health plan and then aggregating cost and utilization of particular services (for example, primary care visits, Rx costs, ER visits, and so on). A member must have at least six months of eligibility for a valid regression slope to be calculated. Linear regression is used to build single-factor models for 6-, 12-, 18-, and 24-month time spans when the appropriate amount of data is available for the member. Models are built at the member-metric-time-period level, resulting in the possibility of over 75 regression coefficients per member per monthly run. The computing power needed to execute such a vast number of calculations requires in-database processing of various macro processes. SAS® Enterprise Guide® is used to structure the data, and SAS® Forecast Studio is used to forecast trends at a member level. Algorithms are run on the first of each month, and the data is stored so that each metric and corresponding slope is appended monthly. Because of how the data is set up for the member regression algorithm, slopes are interpreted as follows: a positive value for -1*slope indicates an increase in utilization/cost; a negative value for -1*slope indicates a decrease in utilization/cost. The actual slope value indicates the intensity of the change in cost and utilization. The insight provided by this member-level regression methodology replaces subjective methods that used arbitrary thresholds of change to measure differences in cost and utilization.
Leigh McCormack, BCBST
Prudhvidhar Perati, BlueCross BlueShield of TN
With the move to value-based benefit and reimbursement models, it is essential to quantify the relative cost, quality, and outcome of a service. Accurately measuring the cost and quality of doctors, practices, and health systems is critical when you are developing a tiered network, a shared savings program, or a pay-for-performance incentive. Limitations in claims payment systems require developing methodological and statistical techniques to improve the validity and reliability of providers' scores on cost and quality of care. This talk discusses several key concepts in the development of a measurement system for provider performance, including measure selection, risk adjustment methods, and peer group benchmark development.
Daryl Wansink, Qualmetrix, Inc.
The goal of this session is to describe the whole process of model creation, from the business request through model specification, data preparation, iterative model creation, model tuning, implementation, and model servicing. Each phase consists of several steps, and for each step we describe its main goal, the expected outcome, the tools used, our own SAS code, useful nodes and settings in SAS® Enterprise Miner™, procedures in SAS® Enterprise Guide®, measurement criteria, and expected duration in man-days. For three steps, we also present deep insights with examples of practical usage, explanations of the code used, settings, and ways of exploring and interpreting the output. During the actual model creation process, we suggest using Microsoft Excel to keep all input metadata along with information about transformations performed in SAS Enterprise Miner. To get information about model results faster, we combine an automatic SAS® code generator implemented in Excel, input this code to SAS Enterprise Guide, and create a specific profile of results directly from the output tables of the SAS Enterprise Miner nodes. This paper also presents an example of checking the stability of a binary model over time, performed in SAS Enterprise Guide by measuring the optimal cut-off percentage and lift; these measurements are visualized and automated using our own code. With this methodology, users have direct contact with the transformed data along with the possibility to analyze and explore any intermediate results. Furthermore, the proposed approach can be used for several types of modeling (for example, binary and nominal predictive models or segmentation models). Generally, we have summarized our best practices for combining specific procedures performed in SAS Enterprise Guide, SAS Enterprise Miner, and Microsoft Excel to create and interpret models faster and more effectively.
Peter Kertys, VÚB a.s.
Effect modification occurs when the association between a predictor of interest and the outcome is differential across levels of a third variable--the modifier. Effect modification is statistically tested as the interaction effect between the predictor and the modifier. In repeated measures studies (with more than two time points), higher-order (three-way) interactions must be considered to test effect modification by adding time to the interaction terms. Custom fitting and constructing these repeated measures models are difficult and time consuming, especially with respect to estimating post-fitting contrasts. With the advancement of the LSMESTIMATE statement in SAS®, a simplified approach can be used to custom test for higher-order interactions with post-fitting contrasts within a mixed model framework. This paper provides a simulated example with tips and techniques for using an application of the nonpositional syntax of the LSMESTIMATE statement to test effect modification in repeated measures studies. This approach, which is applicable to exploring modifiers in randomized controlled trials (RCTs), goes beyond the treatment effect on outcome to a more functional understanding of the factors that can enhance, reduce, or change this relationship. Using this technique, we can easily identify differential changes for specific subgroups of individuals or patients that subsequently impact treatment decision making. We provide examples of conventional approaches to higher-order interaction and post-fitting tests using the ESTIMATE statement and compare and contrast this to the nonpositional syntax of the LSMESTIMATE statement. The merits and limitations of this approach are discussed.
Pronabesh DasMahapatra, PatientsLikeMe Inc.
Ryan Black, NOVA Southeastern University
Electricity is an extremely important product for society. In Brazil, the electric sector is regulated by ANEEL (Agência Nacional de Energia Elétrica), and one of the regulated aspects is power loss in the distribution system. In 2013, 13.99% of all injected energy was lost in the Brazilian system. Commercial loss is one of the power loss classifications, and it can be countered by inspections of the electrical installation in a search for irregularities in power meters. CEMIG (Companhia Energética de Minas Gerais) currently serves approximately 7.8 million customers, which makes it unfeasible (in financial and logistic terms) to inspect all customer units. Thus, the ability to select potential inspection targets is essential. In this paper, logistic regression models, decision tree models, and the Ensemble model were used to improve the target selection process at CEMIG. The results indicate an improvement in the positive predictive value from 35% to 50%.
Sergio Henrique Ribeiro, Cemig
Iguatinan Monteiro, CEMIG
Operational risk losses are heavy tailed and likely to be asymmetric and extremely dependent among business lines and event types. We propose a new methodology to assess, in a multivariate way, the asymmetry and extreme dependence between severity distributions, and to calculate the capital for operational risk. This methodology simultaneously uses several parametric distributions and an alternative mixture distribution (the lognormal for the body of losses and the generalized Pareto distribution for the tail) via extreme value theory using SAS®; the multivariate skew t-copula, applied for the first time to operational losses; and Bayesian inference theory to estimate new n-dimensional skew t-copula models via Markov chain Monte Carlo (MCMC) simulation. This paper analyzes a new operational loss data set, SAS® Operational Risk Global Data (SAS OpRisk Global Data), to model operational risk at international financial institutions. All of the severity models are constructed in SAS® 9.2 with PROC SEVERITY and PROC NLMIXED, and the paper describes this implementation.
Betty Johanna Garzon Rozo, The University of Edinburgh
SAS® Forecast Server provides easy and automatic large-scale forecasting, which enables organizations to commit fewer resources to the process, reduce human-touch interaction, and minimize the biases that contaminate forecasts. SAS Forecast Server Client represents the modernization of the graphical user interface for SAS Forecast Server. This session describes and demonstrates the new client, including new features such as demand classification, and its overall functionality.
Udo Sglavo, SAS
Multilevel models (MLMs) are frequently used in social and health sciences where data are typically hierarchical in nature. However, the commonly used hierarchical linear models (HLMs) are appropriate only when the outcome of interest is normally distributed. When you are dealing with outcomes that are not normally distributed (binary, categorical, ordinal), a transformation and an appropriate error distribution for the response variable needs to be incorporated into the model. Therefore, hierarchical generalized linear models (HGLMs) need to be used. This paper provides an introduction to specifying HGLMs using PROC GLIMMIX, following the structure of the primer for HLMs previously presented by Bell, Ene, Smiley, and Schoeneberger (2013). A brief introduction into the field of multilevel modeling and HGLMs with both dichotomous and polytomous outcomes is followed by a discussion of the model-building process and appropriate ways to assess the fit of these models. Next, the paper provides a discussion of PROC GLIMMIX statements and options as well as concrete examples of how PROC GLIMMIX can be used to estimate (a) two-level organizational models with a dichotomous outcome and (b) two-level organizational models with a polytomous outcome. These examples use data from High School and Beyond (HS&B), a nationally representative longitudinal study of American youth. For each example, narrative explanations accompany annotated examples of the GLIMMIX code and corresponding output.
Mihaela Ene, University of South Carolina
Bethany Bell, University of South Carolina
Genine Blue, University of South Carolina
Elizabeth Leighton, University of South Carolina
Customer Long-Term Value (LTV) is a concept that is readily explained at a high level to marketing management of a company, but its analytic development is complex. This complexity involves the need to forecast customer behavior well into the future. This behavior includes the timing, frequency, and profitability of a customer's future purchases of products and services. This paper describes a method for computing LTV. First, a multinomial logistic regression provides probabilities for time-of-first-purchase, time-of-second-purchase, and so on, for each customer. Then the profits for the first purchase, second purchase, and so on, are forecast but only after adjustment for non-purchaser selection bias. Finally, these component models are combined in the LTV formula.
Bruce Lund, Marketing Associates, LLC
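The final combination step can be viewed as a discounted expectation over purchase timing. A hedged Python sketch with hypothetical numbers (the paper's actual formula also incorporates the selection-bias-adjusted profit forecasts):

```python
def ltv(purchase_probs, profits, discount_rate):
    # purchase_probs[t-1]: probability the purchase occurs in period t
    # profits[t-1]: forecast profit if the purchase occurs in period t
    # each term is discounted back to the present at discount_rate
    return sum(p * m / (1.0 + discount_rate) ** t
               for t, (p, m) in enumerate(zip(purchase_probs, profits), start=1))

# hypothetical customer: likely to buy in period 1 or 2, 10% discounting
print(round(ltv([0.6, 0.3], [100.0, 120.0], 0.10), 2))  # 84.3
```

In the paper, the period-by-period probabilities come from the multinomial logistic model and the profits from the bias-adjusted forecasts; the sketch only shows how the components combine.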
This presentation emphasizes use of SAS® 9.4 to perform multiple imputation of missing data using the PROC MI Fully Conditional Specification (FCS) method with subsequent analysis using PROC SURVEYLOGISTIC and PROC MIANALYZE. The data set used is based on a complex sample design. Therefore, the examples correctly incorporate the complex sample features and weights. The demonstration is then repeated in Stata, IVEware, and R for a comparison of major software applications that are capable of multiple imputation using FCS or equivalent methods and subsequent analysis of imputed data sets based on complex sample design data.
Patricia Berglund, University of Michigan
Retailers proactively seek a data-driven approach to providing customized product recommendations that increase sales and customer loyalty. Product affinity models are recognized as one of the vital tools for this purpose. The algorithm assigns a customer to a product affinity group when the likelihood of purchasing is the highest and meets a minimum absolute requirement. However, in practice, valuable customers, up to 30% of the total universe, who buy across multiple product categories with two or more balanced product affinity likelihoods, remain undefined and cannot be effectively targeted with product recommendations. This paper presents multiple product affinity models, developed using the SAS® macro language, to address this problem. We demonstrate how the innovative assignment algorithm successfully assigns these undefined customers to appropriate multiple product affinity groups using nationwide retailer transaction data. In addition, the results show that potential customers establish loyalty through migration from a single product affinity group to multiple groups. This comprehensive and insightful business solution is shared in this paper, along with a clustering algorithm and a nonparametric tree model for model building. The SAS macro code for customer assignment is provided in an appendix.
Hsin-Yi Wang, Alliance Data Systems
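One way to handle customers with balanced likelihoods is to allow assignment to every group whose likelihood is both above a floor and close to the best group's. The Python rule below is a hypothetical sketch; the function name, thresholds, and group labels are invented, not taken from the paper's macro:

```python
def assign_affinity_groups(likelihoods, min_likelihood=0.2, balance_ratio=0.8):
    # hypothetical rule: assign the customer to every product affinity
    # group whose likelihood clears the floor and is within balance_ratio
    # of the best group's likelihood (multi-group assignment)
    best = max(likelihoods.values())
    if best < min_likelihood:
        return []
    return sorted(g for g, p in likelihoods.items()
                  if p >= min_likelihood and p >= balance_ratio * best)

# a customer with two balanced likelihoods lands in both groups
groups = assign_affinity_groups({"apparel": 0.5, "home": 0.45, "toys": 0.05})
```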
Differential item functioning (DIF), as an assessment tool, has been widely used in quantitative psychology, educational measurement, business management, insurance, and health care. The purpose of DIF analysis is to detect response differences on items in questionnaires, rating scales, or tests across different subgroups (for example, gender) and to ensure the fairness and validity of each item for those subgroups. The goal of this paper is to demonstrate several ways to conduct DIF analysis using different SAS® procedures (PROC FREQ, PROC LOGISTIC, PROC GENMOD, PROC GLIMMIX, and PROC NLMIXED) and their applications. There are three general methods to examine DIF: generalized Mantel-Haenszel (MH), logistic regression, and item response theory (IRT). The SAS® System provides flexible procedures for all these approaches. There are two types of DIF: uniform DIF, which remains consistent across ability levels, and non-uniform DIF, which varies across ability levels. Generalized MH is a nonparametric method and is often used to detect uniform DIF, while the other two are parametric methods that examine both uniform and non-uniform DIF. In this study, I first describe the underlying theories and mathematical formulations for each method. Then I show the SAS statements, input data format, and SAS output for each method, followed by a detailed demonstration of the differences among the three methods. Specifically, PROC FREQ is used to calculate generalized MH only for dichotomous items. PROC LOGISTIC and PROC GENMOD are used to detect DIF by using logistic regression. PROC NLMIXED and PROC GLIMMIX are used to examine DIF by applying an exploratory item response theory model. Finally, I use SAS/IML® to call two R packages (difR and lordif) to conduct DIF analysis and then compare the results between the SAS procedures and the R packages. An example data set, the Verbal Aggression assessment, which includes 316 subjects and 24 items, is used in this study. Following the general DIF analysis, the male group is used as the reference group, and the female group is used as the focal group. All the analyses are conducted with SAS® 9.3 and R 2.15.3. The paper closes with the conclusion that the SAS System provides flexible and efficient ways to conduct DIF analysis. However, it is essential for SAS users to understand the underlying theories and assumptions of the different DIF methods and to apply them appropriately in their DIF analyses.
Yan Zhang, Educational Testing Service
There are times when the objective is to provide a summary table and graph for several quality improvement measures on a single page to allow leadership to monitor the performance of measures over time. The challenges were to decide which SAS® procedures to use, how to integrate multiple SAS procedures to generate a set of plots and summary tables within one page, and how to determine whether to use box plots or series plots of means or medians. We considered the SGPLOT and SGPANEL procedures, and Graph Template Language (GTL). As a result, given the nature of the request, the decision led us to use GTL and the SGRENDER procedure in the %BXPLOT2 macro. For each measure, we used the BOXPLOTPARM statement to display a series of box plots and the BLOCKPLOT statement for a summary table. Then we used the LAYOUT OVERLAY statement to combine the box plots and summary tables on one page. The results display a summary table (BLOCKPLOT) above each box plot series for each measure on a single page. Within each box plot series, there is an overlay of a system-level benchmark value and a series line connecting the median values of each box plot. The BLOCKPLOT contains descriptive statistics per time period illustrated in the associated box plot. The discussion points focus on techniques for nesting the lattice overlay with box plots and BLOCKPLOTs in GTL and some reasons for choosing box plots versus series plots of medians or means.
Greg Stanek, Fannie Mae
Walt Disney World Resort is home to four theme parks, two water parks, five golf courses, 26 owned-and-operated resorts, and hundreds of merchandise and dining experiences. Every year millions of guests stay at Disney resorts to enjoy the Disney Experience. Assigning physical rooms to resort and hotel reservations is a key component to maximizing operational efficiency and guest satisfaction. Solutions can range from automation to optimization programs. The volume of reservations and the variety and uniqueness of guest preferences across the Walt Disney World Resort campus pose an opportunity to solve a number of reasonably difficult room assignment problems by leveraging operations research techniques. For example, a guest might prefer a room with specific bedding and adjacent to certain facilities or amenities. When large groups, families, and friends travel together, they often want to stay near each other using specific room configurations. Rooms might be assigned to reservations in advance and upon request at check-in. Using mathematical programming techniques, the Disney Decision Science team has partnered with the SAS® Advanced Analytics R&D team to create a room assignment optimization model prototype and implement it in SAS/OR®. We describe how this collaborative effort has progressed over the course of several months, discuss some of the approaches that have proven to be productive for modeling and solving this problem, and review selected results.
Haining Yu, Walt Disney Parks & Resorts
Hai Chu, Walt Disney Parks & Resorts
Tianke Feng, Walt Disney Parks & Resorts
Matthew Galati, SAS
Ed Hughes, SAS
Ludwig Kuznia, Walt Disney Parks & Resorts
Rob Pratt, SAS
Faced with diminishing forecast returns from the forecast engine within its existing replenishment application, Tractor Supply Company (TSC) engaged SAS® Institute to deliver a fully integrated forecasting solution that promised a significant improvement in chain-wide forecast accuracy. This paper explores the end-to-end forecast implementation, including the problems faced, solutions delivered, and results realized.
Chris Houck, SAS
SAS/QC® provides procedures, such as PROC SHEWHART, to produce control charts with centerlines and control limits. When quality improvement initiatives deliberately shift a process out of control in the desired direction, centerlines and control limits need to be recalculated. While this is not a complicated process, producing many charts with multiple centerline shifts can quickly become difficult. This paper illustrates the use of a macro to efficiently compute centerlines and control limits when one or more recalculations are needed for multiple charts.
Jesse Pratt, Cincinnati Children's Hospital Medical Center
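The recalculation that such a macro automates can be sketched outside SAS. This minimal Python version, for an individuals chart with a moving-range sigma estimate and a hypothetical known shift point, is an illustration of the idea, not the paper's macro:

```python
def control_limits(values, shift_points):
    """Compute centerline and 3-sigma limits for each segment of an
    individuals chart, splitting at known process-shift indices."""
    limits = []
    bounds = [0] + sorted(shift_points) + [len(values)]
    for start, end in zip(bounds, bounds[1:]):
        seg = values[start:end]
        n = len(seg)
        mean = sum(seg) / n
        # Moving-range estimate of sigma, as an individuals chart uses
        mrbar = sum(abs(a - b) for a, b in zip(seg, seg[1:])) / (n - 1)
        sigma = mrbar / 1.128  # d2 constant for subgroups of size 2
        limits.append({"center": mean,
                       "lcl": mean - 3 * sigma,
                       "ucl": mean + 3 * sigma})
    return limits

# Example: a process whose mean improves after observation 10
data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
        6, 7, 5, 6, 7, 6, 5, 7, 6, 6]
lims = control_limits(data, [10])
```

Each chart then plots its own segment against that segment's centerline and limits, which is exactly the bookkeeping that becomes tedious across many charts and multiple shifts.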
If your data do not meet the assumptions for a standard parametric test, you might want to consider using a permutation test. By randomly shuffling the data and recalculating a test statistic, a permutation test can calculate the probability of getting a value equal to or more extreme than an observed test statistic. With the power of matrices, vectors, functions, and user-defined modules, the SAS/IML® language is an excellent option. This paper covers two examples of permutation tests: one for paired data and another for repeated measures analysis of variance. For those new to SAS/IML® software, this paper offers a basic introduction and examples of how effective it can be.
John Vickery, North Carolina State University
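The paired-data permutation test described above can be sketched outside SAS/IML. This minimal Python version, with made-up measurements, randomly flips the sign of each within-pair difference; it illustrates the logic, not the paper's IML modules:

```python
import random

def paired_permutation_test(x, y, n_perm=10000, seed=1):
    """Two-sided permutation test for paired data: under the null, each
    within-pair difference is equally likely to be positive or negative,
    so randomly flip signs and compare mean differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        perm = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(perm) / len(perm)) >= observed:
            hits += 1
    return hits / n_perm  # proportion as extreme as observed

before = [12.1, 11.4, 13.0, 12.8, 11.9, 12.5, 13.3, 12.0]
after  = [11.2, 10.9, 12.1, 12.0, 11.1, 11.8, 12.4, 11.3]
p = paired_permutation_test(before, after)
```

With only 8 pairs there are 256 possible sign patterns, so the Monte Carlo p-value approximates an exact test; the matrix-oriented IML version generates all flips at once rather than looping.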
Evaluation of the impact of critical or high-risk events or periods in longitudinal studies of growth might provide clues to the long-term effects of life events and efficacies of preventive and therapeutic interventions. Conventional linear longitudinal models typically involve a single growth profile to represent linear changes in an outcome variable across time, which sometimes does not fit the empirical data. Piecewise linear mixed-effects models allow different linear functions of time corresponding to the pre- and post-critical time point trends. This presentation shows: 1) how to perform piecewise linear mixed-effects modeling using SAS step by step, in the context of a clinical trial with two-arm interventions and a predictive covariate of interest; 2) how to obtain the slopes and corresponding p-values for intervention and control groups during pre- and post-critical periods, conditional on different values of the predictive covariate; and 3) how to make meaningful comparisons and present the results in a scientific manuscript. A SAS macro that generates summary tables to assist interpretation of the results is also provided.
Qinlei Huang, St Jude Children's Research Hospital
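The pre-/post-knot slope coding that underlies a piecewise linear trend can be shown in a few lines. This Python fragment, with a hypothetical critical point at month 6, sketches only the time-variable coding; the mixed-model fitting itself is done in SAS in the paper:

```python
def piecewise_time(t, knot):
    """Split time t into pre- and post-knot components so that a model
    with terms b1*t_pre + b2*t_post fits separate slopes before and
    after the critical time point, joined continuously at the knot."""
    t_pre = min(t, knot)
    t_post = max(0.0, t - knot)
    return t_pre, t_post

# With knot = 6 (a hypothetical critical visit), the fitted mean is
# intercept + b1*t_pre + b2*t_post: slope b1 before month 6, b2 after.
times = [0, 2, 4, 6, 8, 10, 12]
coded = [piecewise_time(t, 6) for t in times]
```

The same two derived variables, created in a DATA step, are what enter the MODEL statement of PROC MIXED in the SAS implementation.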
Investment portfolios and investable indexes determine their holdings according to stated mandate and methodology. Part of that process involves compliance with certain allocation constraints. These constraints are developed internally by portfolio managers and index providers, imposed externally by regulations, or both. An example of the latter is the U.S. Internal Revenue Code (25/50) concentration constraint, which relates to regulated investment companies (RICs). The code states that at the end of each quarter of a RIC's tax year, the following constraints must be met: 1) no more than 25 percent of the value of the RIC's assets may be invested in a single issuer; 2) the sum of the weights of all issuers each representing more than 5 percent of the total assets should not exceed 50 percent of the fund's total assets. While these constraints result in a non-continuous model, compliance with concentration constraints can be formalized by reformulating the model as a series of continuous non-linear optimization problems solved using PROC OPTMODEL. The model and solution are presented in this paper. The approach discussed has been used in constructing investable equity indexes.
Taras Zlupko, CRSP, University of Chicago
Robert Spatz
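The 25/50 rules lend themselves to a direct feasibility check, separate from the PROC OPTMODEL reformulation the paper presents. This Python sketch, with illustrative issuer weights and a simple numeric tolerance (both assumptions, not from the paper), verifies compliance for a given allocation:

```python
def ric_25_50_compliant(weights, tol=1e-9):
    """Check the RIC 25/50 concentration rules on issuer weights
    (fractions summing to 1): no single issuer above 25 percent, and
    issuers above 5 percent summing to no more than 50 percent."""
    if max(weights) > 0.25 + tol:
        return False
    big = sum(w for w in weights if w > 0.05 + tol)
    return big <= 0.50 + tol

# A compliant allocation: the three issuers above 5% sum to 45%
ok = ric_25_50_compliant([0.20, 0.15, 0.10, 0.05] + [0.01] * 50)
# Non-compliant: issuers above 5% sum to 58%
bad = ric_25_50_compliant([0.24, 0.24, 0.10] + [0.006] * 70)
```

The optimization problem is harder than this check because the set of issuers above the 5 percent threshold changes with the weights, which is what makes the model non-continuous and motivates the series of continuous subproblems.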
SAS® Simulation Studio, a component of SAS/OR® software for Microsoft Windows environments, provides powerful and versatile capabilities for building, executing, and analyzing discrete-event simulation models in a graphical environment. Its object-oriented, drag-and-drop modeling makes building and working with simulation models accessible to novice users, and its broad range of model configuration options and advanced capabilities makes SAS Simulation Studio suitable also for sophisticated, detailed simulation modeling and analysis. Although the number of modeling blocks in SAS Simulation Studio is small enough to be manageable, the number of ways in which they can be combined and connected is almost limitless. This paper explores some of the modeling methods and constructs that have proven most useful in practical modeling with SAS Simulation Studio. SAS has worked with customers who have applied SAS Simulation Studio to measure, predict, and improve system performance in many different industries, including banking, public utilities, pharmaceuticals, manufacturing, prisons, hospitals, and insurance. This paper looks at some discrete-event simulation modeling needs that arise in specific settings and some that have broader applicability, and it considers the ways in which SAS Simulation Studio modeling can meet those needs.
Ed Hughes, SAS
Emily Lada, SAS
Researchers, patients, clinicians, and other health-care industry participants are forging new models for data-sharing in hopes that the quantity, diversity, and analytic potential of health-related data for research and practice will yield new opportunities for innovation in basic and translational science. Whether we are talking about medical records (for example, EHR, lab, notes), administrative data (claims and billing), social (on-line activity), behavioral (fitness trackers, purchasing patterns), contextual (geographic, environmental), or demographic data (genomics, proteomics), it is clear that as health-care data proliferates, threats to security grow. Beginning with a review of the major health-care data breaches in our recent history, we highlight some of the lessons that can be gleaned from these incidents. In this paper, we talk about the practical implications of data sharing and how to ensure that only the right people have the right access to the right level of data. To that end, we explore not only the definitions of concepts like data privacy, but we discuss, in detail, methods that can be used to protect data--whether inside our organization or beyond its walls. In this discussion, we cover the fundamental differences between encrypted data, 'de-identified', 'anonymous', and 'coded' data, and methods to implement each. We summarize the landscape of maturity models that can be used to benchmark your organization's data privacy and protection of sensitive data.
Greg Nelson, ThotWave
Diabetes is a chronic condition affecting people of all ages; around 25.8 million people in the U.S. have the disease. The objective of this research is to predict the probability of a diabetic patient being readmitted. The results will help hospitals design a follow-up protocol to ensure that patients with a higher readmission probability are doing well, in order to promote a healthy doctor-patient relationship. The data was obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. The data set contains over 100,000 instances and 55 variables, such as insulin use and length of stay. The data set was split into training and validation sets to provide an honest assessment of the models. Various variable selection techniques such as stepwise regression, forward regression, LARS, and LASSO were used. Using LARS, prominent factors were identified in determining the patient readmission rate. Numerous predictive models were built: Decision Tree, Logistic Regression, Gradient Boosting, MBR, SVM, and others. The model comparison algorithm in SAS® Enterprise Miner™ 13.1 found that the High-Performance Support Vector Machine outperformed the other models, with the lowest misclassification rate of 0.363. The chosen model has a sensitivity of 49.7% and a specificity of 75.1% on the validation data.
Hephzibah Munnangi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
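The fit statistics reported above follow directly from a confusion matrix. As a refresher, this Python sketch (toy labels, not the study's data) computes misclassification rate, sensitivity, and specificity for a binary outcome:

```python
def classification_metrics(actual, predicted):
    """Misclassification rate, sensitivity, and specificity for a
    binary outcome (1 = readmitted, 0 = not readmitted)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    n = len(actual)
    return {"misclassification": (fp + fn) / n,   # wrong calls / all
            "sensitivity": tp / (tp + fn),        # true positive rate
            "specificity": tn / (tn + fp)}        # true negative rate

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
m = classification_metrics(actual, predicted)
```

A low misclassification rate with modest sensitivity, as in the chosen model, is typical when readmissions are the rarer class; the tradeoff between the two rates is what the model comparison weighs.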
Utility companies in America are always challenged when it comes to knowing when their infrastructure fails. One of the most critical components of a utility company's infrastructure is the transformer. It is important to assess the remaining lifetime of transformers so that the company can reduce costs, plan expenditures in advance, and largely mitigate the risk of failure. It is equally important to identify high-risk transformers in advance and to maintain them accordingly in order to avoid sudden loss of equipment due to overloading. This paper uses SAS® to predict the lifetime of transformers, identify the various factors that contribute to their failure, and classify transformers into High, Medium, and Low risk categories based on load for easy maintenance. The data set from a utility company contains around 18,000 observations and 26 variables from 2006 to 2013, including the failure and installation dates of the transformers. The data set also includes many transformers installed before 2006 (in all, there are 190,000 transformers on which several regression models are built in this paper to identify their risk of failure), but no age-related parameter is available for them. Survival analysis was therefore performed on this left-truncated and right-censored data. The data set has variables such as Age, Average Temperature, Average Load, and Normal and Overloaded Conditions for residential and commercial transformers. Data creation involved merging 12 different tables. Nonparametric models for failure time data were built to explore the lifetime and failure rate of the transformers. By building a Cox regression model, the important factors contributing to the failure of a transformer are also analyzed. Several risk-based models are then built to categorize transformers into High, Medium, and Low risk categories based on their loads. This categorization can help utility companies better manage the risks associated with transformer failures.
Balamurugan Mohan, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Predictions, including regressions and classifications, are the predominant focus of many statistical and machine-learning models. However, in the era of big data, a predictive modeling process involves more than just making the final predictions. For example, a large collection of data often represents a set of small, heterogeneous populations, and identification of these subgroups is therefore an important step in predictive modeling. In addition, big data sets are often complex and high-dimensional, so variable selection, transformation, and outlier detection are integral steps. This paper provides working examples of these critical stages using SAS® Visual Statistics, including data segmentation (supervised and unsupervised), variable transformation, outlier detection, and filtering, in addition to building the final predictive model using methods such as linear regression, decision trees, and logistic regression. The illustration uses vehicle emission testing results collected from 2010 to 2014.
Xiangxiang Meng, SAS
Jennifer Ames, SAS
Wayne Thompson, SAS
A common complaint of employers is that educational institutions do not prepare students for the types of messy data and multi-faceted requirements that occur on the job. No organization has data that resembles the perfectly scrubbed data sets in the back of a statistics textbook. The objective of the Annual Report Project is to quickly bring new SAS® users to a level of competence where they can use real data to meet real business requirements. Many organizations need annual reports for stockholders, funding agencies, or donors. Or, they need annual reports at the department or division level for an internal audience. Being tapped as part of the team creating an annual report used to mean weeks of tedium, poring over columns of numbers in 8-point font in (shudder) Excel spreadsheets, but no more. With a few SAS procedures and functions, reporting can be easy and, dare I say, fun. All analyses are done using SAS® Studio (formerly SAS® Web Editor) of SAS OnDemand for Academics. This paper uses an example with actual data for a report prepared to comply with federal grant funding requirements as proof that, yes, it really is that simple.
AnnMaria De Mars, AnnMaria De Mars
This paper presents a methodology developed to define and prioritize feeders with the least satisfactory performances for continuity of energy supply, in order to obtain an efficiency ranking that supports a decision-making process regarding investments to be implemented. Data Envelopment Analysis (DEA) was the basis for the development of this methodology, in which the input-oriented model with variable returns to scale was adopted. To perform the analysis of the feeders, data from the utility geographic information system (GIS) and from the interruption control system was exported to SAS® Enterprise Guide®, where data manipulation was possible. Different continuity variables and physical-electrical parameters were consolidated for each feeder for the years 2011 to 2013. They were separated according to the geographical regions of the concession area, according to their location (urban or rural), and then grouped by physical similarity. Results showed that 56.8% of the feeders could be considered as efficient, based on the continuity of the service. Furthermore, the results enable identification of the assets with the most critical performance and their benchmarks, and the definition of preliminary goals to reach efficiency.
Victor Henrique de Oliveira, Cemig
Iguatinan Monteiro, CEMIG
The era of big data and health care reform is an exciting and challenging time for anyone whose work involves data security, analytics, data visualization, or health services research. This presentation examines important aspects of current approaches to quality improvement in health care based on data transparency and patient choice. We look at specific initiatives related to the Affordable Care Act (for example, the qualified entity program of section 10332 that allows the Centers for Medicare and Medicaid Services (CMS) to provide Medicare claims data to organizations for multi-payer quality measurement and reporting, the open payments program, and state-level all-payer claims databases to inform improvement and public reporting) within the context of a core issue in the era of big data: security and privacy versus transparency and openness. In addition, we examine an assumption that underlies many of these initiatives: data transparency leads to improved choices by health care consumers and increased accountability of providers. For example, recent studies of one component of data transparency, price transparency, show that, although health plans generally offer consumers an easy-to-use cost calculator tool, only about 2 percent of plan members use it. Similarly, even patients with high-deductible plans (presumably those with an increased incentive to do comparative shopping) seek prices for only about 10 percent of their services. Anyone who has worked in analytics, reporting, or data visualization recognizes the importance of understanding the intended audience, and that methodological transparency is as important as the public reporting of the output of the calculation of cost or quality metrics. 
Although widespread use of publicly reported health care data might not be a realistic goal, data transparency does offer a number of potential benefits: data-driven policy making, informed management of cost and use of services, as well as public health benefits through, for example, the recognition of patterns of disease prevalence and immunization use. Looking at this from a system perspective, we can distinguish five main activities: data collection, data storage, data processing, data analysis, and data reporting. Each of these activities has important components (such as database design for data storage and de-identification and aggregation for data reporting) as well as overarching requirements such as data security and quality assurance that are applicable to all activities. A recent Health Affairs article by CMS leaders noted that the big-data revolution could not have come at a better time, but it also recognizes that challenges remain. Although CMS is the largest single payer for health care in the U.S., the challenges it faces are shared by all organizations that collect, store, analyze, or report health care data. In turn, these challenges are opportunities for database developers, systems analysts, programmers, statisticians, data analysts, and those who provide the tools for public reporting to work together to design comprehensive solutions that inform evidence-based improvement efforts.
Paul Gorrell, IMPAQ International
Dynamic pricing is a real-time strategy where corporations attempt to alter prices based on varying market demand. The hospitality industry has been doing this for quite a while, altering prices significantly during the summer months or weekends when demand for rooms is at a premium. In recent years, the sports industry has started to catch on to this trend, especially within Major League Baseball (MLB). The purpose of this paper is to explore the methodology of applying this type of pricing to the hockey ticketing arena.
Christopher Jones, Deloitte Consulting
Sabah Sadiq, Deloitte Consulting
Jing Zhao, Deloitte Consulting LLP
Retrospective case-control studies are frequently used to evaluate health care programs when it is not feasible to randomly assign members to a respective cohort. Without randomization, observational studies are more susceptible to selection bias where the characteristics of the enrolled population differ from those of the entire population. When the participant sample is different from the comparison group, the measured outcomes are likely to be biased. Given this issue, this paper discusses how propensity score matching and random effects techniques can be used to reduce the impact selection bias has on observational study outcomes. All results shown are drawn from an ROI analysis using a participant (cases) versus non-participant (controls) observational study design for a fitness reimbursement program aiming to reduce health care expenditures of participating members.
Jess Navratil-Strawn, Optum
To stay competitive in the marketplace, health-care programs must be capable of reporting the true savings to clients. This is a tall order, because most health-care programs are set up to be available to the client's entire population and thus cannot be conducted as a randomized control trial. In order to evaluate the performance of the program for the client, we use an observational study design that has inherent selection bias due to its inability to randomly assign participants. To reduce the impact of bias, we apply propensity score matching to the analysis. This technique is beneficial to health-care program evaluations because it helps reduce selection bias in the observational analysis and in turn provides a clearer view of the client's savings. This paper explores how to develop a propensity score, evaluate the use of inverse propensity weighting versus propensity matching, and determine the overall impact of the propensity score matching method on the observational study population. All results shown are drawn from a savings analysis using a participant (cases) versus non-participant (controls) observational study design for a health-care decision support program aiming to reduce emergency room visits.
Amber Schmitz, Optum
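The matching step of propensity score matching can be illustrated independently of how the scores are estimated. This Python sketch performs greedy 1:1 nearest-neighbor matching without replacement on made-up propensity scores; the caliper value is an assumption for illustration, not a figure from the paper:

```python
def greedy_match(case_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on precomputed propensity
    scores, without replacement: each case takes the closest unused
    control, provided it lies within the caliper."""
    available = dict(enumerate(control_scores))
    pairs = []
    # Process cases in score order so early matches don't exhaust
    # controls needed by nearby cases
    for i, ps in sorted(enumerate(case_scores), key=lambda kv: kv[1]):
        if not available:
            break
        j = min(available, key=lambda k: abs(available[k] - ps))
        if abs(available[j] - ps) <= caliper:
            pairs.append((i, j))
            del available[j]   # without replacement
    return pairs

cases    = [0.30, 0.55, 0.80]   # participants' propensity scores
controls = [0.32, 0.50, 0.52, 0.95]
pairs = greedy_match(cases, controls)
```

The alternative the paper evaluates, inverse propensity weighting, keeps all controls but reweights them by 1/(1-score); matching instead discards controls without a close counterpart, as the unmatched third case shows here.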
Replication techniques such as the jackknife and the bootstrap have become increasingly popular in recent years, particularly within the field of complex survey data analysis. The premise of these techniques is to treat the data set as if it were the population and repeatedly sample from it in some systematic fashion. From each sample, or replicate, the estimate of interest is computed, and the variability of the estimate from the full data set is approximated by a simple function of the variability among the replicate-specific estimates. An appealing feature is that there is generally only one variance formula per method, regardless of the underlying quantity being estimated. The entire process can be efficiently implemented after appending a series of replicate weights to the analysis data set. As will be shown, the SURVEY family of SAS/STAT® procedures can be exploited to facilitate both the task of appending the replicate weights and approximating variances.
Taylor Lewis, University of Maryland
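The replicate-weight idea can be sketched for the simplest estimator, a mean. This Python fragment (toy data, a plain bootstrap rather than a survey-specific variant) builds replicate weights and applies the single variance formula described above; it is a conceptual illustration, not the SURVEY procedures' implementation:

```python
import random

def bootstrap_replicate_weights(n, n_reps, seed=7):
    """Build n_reps sets of bootstrap replicate weights: each replicate
    resamples n observations with replacement, and an observation's
    weight is the number of times it was drawn."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_reps):
        w = [0] * n
        for _ in range(n):
            w[rng.randrange(n)] += 1
        reps.append(w)
    return reps

def replicate_variance(values, reps):
    """One variance formula regardless of the estimator: the variability
    of the replicate-specific estimates around the full-sample estimate."""
    full = sum(values) / len(values)
    ests = [sum(w * v for w, v in zip(rep, values)) / sum(rep)
            for rep in reps]
    return sum((e - full) ** 2 for e in ests) / len(ests)

y = [3.1, 4.5, 2.2, 5.0, 3.8, 4.1, 2.9, 3.6]
var_hat = replicate_variance(y, bootstrap_replicate_weights(len(y), 200))
```

Swapping the mean for any other estimator changes only the inner expression for the replicate estimate, which is exactly the appeal: the weights are appended once and reused for every analysis.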
Reporting Best Practices
Trina Gladwell, Bealls Inc
Retailers, among nearly every other consumer business, are under more pressure and competition than ever before. Today's consumer is more connected, informed, and empowered, and the pace of innovation is rapidly changing the way consumers shop. Retailers are expected to sift through and implement digital technology, make sense of their big data with analytics, change processes, and cut costs all at the same time. Today's session, "Retail 2015: The Landscape, Trends, and Technology," covers the major issues retailers are facing today, as well as the business and technology trends that will shape their future.
Lori Schafer
Pay-for-performance programs are putting increasing pressure on providers to better manage patient utilization through care coordination, with the philosophy that good preventive services and routine care can prevent the need for some high-resource services. Evaluation of provider performance frequently includes measures such as acute care events (ER and inpatient), imaging, and specialist services, yet rarely are these indicators adjusted for the underlying risk of providers' patient panels. In part, this is because standard patient risk scores are designed to predict costs, not the probability of specific service utilization. As such, Blue Cross Blue Shield of North Carolina has developed a methodology to model our members' risk of these events, in an effort to ensure that providers are evaluated fairly and to deter adverse selection practices among our providers. Our risk modeling takes into consideration members' underlying health conditions and limited demographic factors during the previous 12-month period, and employs two-part regression models using SAS® software. These risk-adjusted measures will subsequently be the basis of performance evaluation of primary care providers for our Accountable Care Organizations and medical home initiatives.
Stephanie Poley, Blue Cross Blue Shield of North Carolina
The latest release of SAS/STAT® software brings you powerful techniques that will make a difference in your work, whether your data are massive, missing, or somewhere in the middle. New imputation software for survey data adds to an expansive array of methods in SAS/STAT for handling missing data, as does the production version of the GEE procedure, which provides the weighted generalized estimating equation approach for longitudinal studies with dropouts. An improved quadrature method in the GLIMMIX procedure gives you accelerated performance for certain classes of models. The HPSPLIT procedure provides a rich set of methods for statistical modeling with classification and regression trees, including cross validation and graphical displays. The HPGENSELECT procedure adds support for spline effects and lasso model selection for generalized linear models. And new software implements generalized additive models by using an approach that handles large data easily. Other updates include key functionality for Bayesian analysis and pharmaceutical applications.
Maura Stokes, SAS
Bob Rodriguez, SAS
Individual investors face a daunting challenge. They must select a portfolio of securities comprised of a manageable number of individual stocks, bonds and/or mutual funds. An investor might initiate her portfolio selection process by choosing the number of unique securities to hold in her portfolio. This is both a practical matter and a matter of risk management. It is practical because there are tens of thousands of actively traded securities from which to choose and it is impractical for an individual investor to own every available security. It is also a risk management measure because investible securities bring with them the potential of financial loss -- to the point of becoming valueless in some cases. Increasing the number of securities in a portfolio decreases the probability that an investor will suffer drastically from corporate bankruptcy, for instance. However, holding too many securities in a portfolio can restrict performance. After deciding the number of securities to hold, the investor must determine which securities she will include in her portfolio and what proportion of available cash she will allocate to each security. Once her portfolio is constructed, the investor must manage the portfolio over time. This generally entails periodically reassessing the proportion of each security to maintain as time advances, but may also involve the elimination of some securities and the initiation of positions in new securities. This paper introduces an analytically driven method for portfolio security selection based on minimizing the mean correlation of returns across the portfolio. It also introduces a method for determining the proportion of each security that should be maintained within the portfolio. The methods for portfolio selection and security weighting described herein work in conjunction to maximize expected portfolio return, while minimizing the probability of loss over time. 
This involves a re-visioning of Harry Markowitz's Nobel Prize-winning concept known as the Efficient Frontier. Resultant portfolios are assessed via Monte Carlo simulation, and results are compared to the Standard & Poor's 500 Index and Warren Buffett's Berkshire Hathaway, which has a well-established history of beating the Standard & Poor's 500 Index over a long period. To those familiar with Dr. Markowitz's Modern Portfolio Theory, this paper may appear simply as a repackaging of old ideas. It is not.
Bruce Bedford, Oberweis Dairy
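A greedy version of selecting securities to minimize the mean correlation of returns can be sketched as follows. This Python fragment uses synthetic return series and assumes NumPy is available; it illustrates the general idea, not the authors' actual selection or weighting method:

```python
import numpy as np

def select_min_correlation(returns, k, seed_index=0):
    """Greedy sketch: starting from one security, repeatedly add the
    candidate whose average correlation of returns with the current
    selection is lowest, until k securities are held."""
    corr = np.corrcoef(returns.T)
    chosen = [seed_index]
    while len(chosen) < k:
        candidates = [j for j in range(corr.shape[0]) if j not in chosen]
        nxt = min(candidates, key=lambda j: corr[j, chosen].mean())
        chosen.append(nxt)
    return sorted(chosen)

rng = np.random.default_rng(0)
base = rng.normal(size=(250, 1))          # a common market factor
noise = rng.normal(size=(250, 5))
# Five synthetic return series; the first three track the common factor
rets = np.hstack([base + 0.2 * noise[:, :3], noise[:, 3:]])
picks = select_min_correlation(rets, 3)
```

On this synthetic data the greedy rule keeps only one of the three factor-driven series and adds the two independent ones, which is the diversification behavior the selection criterion is after.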
In today's competitive job market, both recent graduates and experienced professionals are looking for ways to set themselves apart from the crowd. SAS® certification is one way to do that. SAS Institute Inc. offers a range of exams to validate your knowledge level. In writing this paper, we have drawn upon our personal experiences, remarks shared by new and longtime SAS users, and conversations with experts at SAS. We discuss what certification is and why you might want to pursue it. Then we share practical tips you can use to prepare for an exam and do your best on exam day.
Andra Northup, Advanced Analytic Designs, Inc.
Susan Slaughter, Avocet Solutions
First introduced in 2013, the Cloudera Data Science Challenge is a rigorous competition in which candidates must provide a solution to a real-world big data problem that surpasses a benchmark specified by some of the world's elite data scientists. The Cloudera Data Science Challenge 2 (in 2014) involved detecting anomalies in the United States Medicare insurance system. Finding anomalous patients, procedures, providers, and regions in the competition's large, complex, and intertwined data sets required industrial-strength tools for data wrangling and machine learning. This paper shows how I did it with SAS®.
Patrick Hall, SAS
SAS® Model Manager provides an easy way to deploy analytical models into various relational databases or into Hadoop using either scoring functions or the SAS® Embedded Process publish methods. This paper gives a brief introduction of both the SAS Model Manager publishing functionality and the SAS® Scoring Accelerator. It describes the major differences between using scoring functions and the SAS Embedded Process publish methods to publish a model. The paper also explains how to perform in-database processing of a published model by using SAS applications as well as SQL code outside of SAS. In addition to Hadoop, SAS also supports these databases: Teradata, Oracle, Netezza, DB2, and SAP HANA. Examples are provided for publishing a model to a Teradata database and to Hadoop. After reading this paper, you should feel comfortable using a published model in your business environment.
Jifa Wei, SAS
Kristen Aponte, SAS
Are you a SAS® software user hoping to convince your organization to move to the latest product release? Has your management team asked how your organization can hire new SAS users familiar with the latest and greatest procedures and techniques? SAS® Studio and SAS® University Edition might provide the answers for you. SAS University Edition was created for teaching and learning. It's a new downloadable package of selected SAS products (Base SAS®, SAS/STAT®, SAS/IML®, SAS/ACCESS® Interface to PC Files, and SAS Studio) that runs on Windows, Linux, and Mac. With the exploding demand for analytical talent, SAS launched this package to grow the next generation of SAS users. Part of the way SAS is helping grow that next generation of users is through the interface to SAS University Edition: SAS Studio. SAS Studio is a developmental web application for SAS that you access through your web browser and--since the first maintenance release of SAS 9.4--is included in Base SAS at no additional charge. The connection between SAS University Edition and commercial SAS means that it's easier than ever to use SAS for teaching, research, and learning, from high schools to community colleges to universities and beyond. This paper describes the product, as well as the intent behind it and other programs that support it, and then talks about some successes in adopting SAS University Edition to grow the next generation of users.
Polly Mitchell-Guthrie, SAS
Amy Peters, SAS
Data visualization can be like a GPS directing us to where in the sea of data we should spend our analytical efforts. In today's big data world, many businesses are still challenged to quickly and accurately distill insights and solutions from ever-expanding information streams. Wells Fargo CEO John Stumpf challenges us with the following: We all work for the customer. Our customers say to us, 'Know me, understand me, appreciate me and reward me.' Everything we need to know about a customer must be available easily, accurately, and securely, as fast as the best Internet search engine. For the Wells Fargo Credit Risk department, we have been focused on delivering more timely, accurate, reliable, and actionable information and analytics to help answer questions posed by internal and external stakeholders. Our group has to measure, analyze, and provide proactive recommendations to support and direct credit policy and strategic business changes, and we were challenged by a high volume of information coming from disparate data sources. This session focuses on how we evaluated potential solutions and created a new go-forward vision using a world-class visual analytics platform with strong data governance to replace manually intensive processes. As a result of this work, our group is on its way to proactively anticipating problems and delivering more dynamic reports.
Ryan Marcum, Wells Fargo Home Mortgage
This workshop provides hands-on experience using SAS® Enterprise Miner. Workshop participants will learn to: open a project, create and explore a data source, build and compare models, and produce and examine score code that can be used for deployment.
Chip Wells, SAS
This workshop provides hands-on experience using SAS® Forecast Server. Workshop participants will learn to: create a project with a hierarchy, generate multiple forecasts automatically, evaluate forecast accuracy, and build a custom model.
Catherine Truxillo, SAS
George Fernandez, SAS
Terry Woodfield, SAS
This workshop provides hands-on experience with SAS® Visual Statistics. Workshop participants will learn to: move between the Visual Analytics Explorer interface and Visual Statistics, fit automatic statistical models, create exploratory statistical analysis, compare models using a variety of metrics, and create score code.
Catherine Truxillo, SAS
Xiangxiang Meng, SAS
Mike Jenista, SAS
This workshop provides hands-on experience performing statistical analysis with SAS University Edition and SAS Studio. Workshop participants will learn to: install and set up the software, perform basic statistical analyses using tasks, connect folders to SAS Studio for data access and results storage, invoke code snippets to import CSV data into SAS, and create a code snippet.
Danny Modlin, SAS
This workshop provides hands-on experience using SAS® Text Miner. Workshop participants will learn to: read a collection of text documents and convert them for use by SAS Text Miner using the Text Import node, use the simple query language supported by the Text Filter node to extract information from a collection of documents, use the Text Topic node to identify the dominant themes and concepts in a collection of documents, and use the Text Rule Builder node to classify documents having pre-assigned categories.
Terry Woodfield, SAS
This paper presents an application of the SURVEYSELECT procedure. The objective is to draw a systematic random sample from financial data for review. Topics covered in this paper include a brief review of systematic sampling, variable definitions, serpentine sorting, and an interpretation of the output.
Roger L Goodwin, US Government Printing Office
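The systematic sampling that PROC SURVEYSELECT performs with METHOD=SYS can be illustrated outside SAS as well. The sketch below is a hedged Python illustration of the core idea only (a random start in the first interval, then every k-th item thereafter), not of the procedure itself; the function name `systematic_sample` and the toy population of 100 financial line items are invented for the example.

```python
import random

def systematic_sample(items, n):
    """Systematic random sampling: a random start in the first interval,
    then every k-th item thereafter (k = population size / sample size)."""
    k = len(items) / n                    # sampling interval (may be fractional)
    start = random.random() * k           # random start strictly inside [0, k)
    return [items[int(start + i * k)] for i in range(n)]

random.seed(42)
population = list(range(1, 101))          # e.g., 100 financial line items for review
sample = systematic_sample(population, 10)
print(sample)                             # 10 items, roughly evenly spaced
```

Because every selected index is a fixed interval apart, the sample inherits any ordering of the frame, which is why the paper's serpentine sort matters before selection.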
This paper discusses the selection and transformation of continuous predictor variables for the fitting of binary logistic models. The paper has two parts: (1) A procedure and associated SAS® macro are presented that can screen hundreds of predictor variables and 10 transformations of these variables to determine their predictive power for a logistic regression. The SAS macro makes two passes of the training data set to prepare the transformations and a third pass through PROC TTEST. (2) The FSP (function selection procedure) and a SAS implementation of FSP are discussed. The FSP tests all transformations from among a class of FSP transformations and finds the one with maximum likelihood when fitting the binary target. The FSP was popularized by Patrick Royston and Willi Sauerbrei in a 2008 book.
Bruce Lund, Marketing Associates, LLC
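The screening idea in part (1) above, scoring many transformations of a predictor by how well they separate the two target classes (the role PROC TTEST plays in the macro), can be sketched in Python. Everything here is illustrative rather than the authors' macro: the transformation list, the helpers `t_statistic` and `screen`, and the simulated data are all invented for the example.

```python
import math, random

# Candidate transformations to screen (with domain guards).
TRANSFORMS = {
    "identity": lambda x: x,
    "log":      lambda x: math.log(x) if x > 0 else None,
    "sqrt":     lambda x: math.sqrt(x) if x >= 0 else None,
    "square":   lambda x: x * x,
    "inverse":  lambda x: 1.0 / x if x != 0 else None,
}

def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (absolute value)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return abs(ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

def screen(x, y):
    """Rank transformations of x by |t| between the two target classes."""
    scores = {}
    for name, f in TRANSFORMS.items():
        tx = [f(v) for v in x]
        if any(v is None for v in tx):
            continue                      # transformation invalid for this data
        scores[name] = t_statistic([t for t, c in zip(tx, y) if c == 1],
                                   [t for t, c in zip(tx, y) if c == 0])
    return sorted(scores.items(), key=lambda kv: -kv[1])

random.seed(1)
# Simulated predictor whose log is linearly related to the binary target.
x = [random.lognormvariate(0, 1) for _ in range(500)]
y = [1 if math.log(v) + random.gauss(0, 1) > 0 else 0 for v in x]
ranking = screen(x, y)
print([name for name, _ in ranking])
```

The FSP in part (2) goes further: rather than a t-test screen, it compares fitted log-likelihoods across a fixed family of fractional-polynomial transformations.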
Understanding organizational trends in spending can help overseeing government agencies make appropriate modifications in spending to best serve the organization and the citizenry. However, given millions of line items for organizations annually, including free-form text, it is unrealistic for these overseeing agencies to succeed by using only a manual approach to this textual data. Using a publicly available data set, this paper explores how business users can apply text analytics using SAS® Contextual Analysis to assess trends in spending for particular agencies, apply subject matter expertise to refine these trends into a taxonomy, and ultimately, categorize the spending for organizations in a flexible, user-friendly manner. SAS® Visual Analytics enables dynamic exploration, including modeling results from SAS® Visual Statistics, in order to assess areas of potentially extraneous spending, providing actionable information to the decision makers.
Tom Sabo, SAS
Smoking is a leading killer in the United States. Moreover, over 8.6 million Americans live with a serious illness caused by smoking or second-hand smoke. Despite this, over 46.6 million U.S. adults smoke cigarettes, cigars, or pipes. The key analytic question in this paper is, How would e-cigarettes affect this public health situation? Can monitoring public opinions of e-cigarettes using SAS® Text Analytics and SAS® Visual Analytics help provide insight into the potential dangers of these new products? Are e-cigarettes an example of Big Tobacco up to its old tricks or, in fact, a cessation product? The research in this paper was conducted on thousands of tweets from April to August 2014. It includes API sources beyond Twitter--for example, indicators from the Health Indicators Warehouse (HIW) of the Centers for Disease Control and Prevention (CDC)--that were used to enrich Twitter data in order to implement a surveillance system developed by SAS® for the CDC. The analysis is especially important to The Office of Smoking and Health (OSH) at the CDC, which is responsible for tobacco control initiatives that help states to promote cessation and prevent initiation in young people. To help the CDC succeed with these initiatives, the surveillance system also: 1) automates the acquisition of data, especially tweets; and 2) applies text analytics to categorize these tweets using a taxonomy that provides the CDC with insights into a variety of relevant subjects. Twitter text data can help the CDC look at the public response to the use of e-cigarettes, and examine general discussions regarding smoking and public health, and potential controversies (involving tobacco exposure to children, increasing government regulations, and so on). SAS® Content Categorization helps health care analysts review large volumes of unstructured data by categorizing tweets in order to monitor and follow what people are saying and why they are saying it. Ultimately, it is a solution intended to help the CDC monitor the public's perception of the dangers of smoking and e-cigarettes; in addition, it can identify areas where OSH can focus its attention in order to fulfill its mission and track the success of CDC health initiatives.
Manuel Figallo, SAS
Emily McRae, SAS
Several U.S. Federal agencies conduct national surveys to monitor health status of residents. Many of these agencies release their survey data to the public. Investigators might be able to address their research objectives by conducting secondary statistical analyses with these available data sources. This paper describes the steps in using the SAS SURVEY procedures to analyze publicly released data from surveys that use probability sampling to make statistical inference to a carefully defined population of elements (the target population).
Donna Brogan, Emory University, Atlanta, GA
Product affinity segmentation is a powerful technique for marketers and sales professionals to gain a good understanding of customers' needs, preferences, and purchase behavior. Performing product affinity segmentation is quite challenging in practice because product-level data usually has high skewness, high kurtosis, and a large percentage of zero values. The Doughnut Clustering method has been shown to be effective using real data, and was presented at SAS® Global Forum 2013 in the paper titled 'Product Affinity Segmentation Using the Doughnut Clustering Approach.' However, the Doughnut Clustering method is not a panacea for addressing the product affinity segmentation problem. There is a clear need for a comprehensive evaluation of this method in order to be able to develop generic guidelines for practitioners about when to apply it. In this paper, we address this need by evaluating the Doughnut Clustering method on simulated data with different levels of skewness, kurtosis, and percentage of zero values. We developed a five-step approach based on Fleishman's power method to generate synthetic data with prescribed parameters. Subsequently, we designed and conducted a set of experiments to run the Doughnut Clustering method, as well as the traditional K-means method as a benchmark, on simulated data. We draw conclusions on the performance of the Doughnut Clustering method by comparing the clustering validity metric (the ratio of between-cluster variance to within-cluster variance) as well as the relative proportions of cluster sizes against those of K-means.
Darius Baer, SAS
Goutam Chakraborty, Oklahoma State University
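Fleishman's power method, the basis of the five-step data generator described above, produces non-normal data by applying a cubic polynomial to standard normal draws: Y = a + bZ + cZ² + dZ³ with a = -c, which gives Y mean zero. The Python sketch below illustrates only this transform; the coefficients used here are placeholders chosen for illustration, whereas the paper's approach solves Fleishman's moment equations for coefficients matching prescribed skewness and kurtosis.

```python
import random

def fleishman(z, b, c, d):
    """Fleishman power transform: Y = a + bZ + cZ^2 + dZ^3 with a = -c,
    so that Y has mean zero when Z is standard normal."""
    a = -c
    return a + b * z + c * z * z + d * z ** 3

def sample_skewness(xs):
    """Third standardized sample moment."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

random.seed(7)
z = [random.gauss(0, 1) for _ in range(20000)]
# Illustrative coefficients (b, c, d); a real study solves Fleishman's
# moment equations for the target skewness and kurtosis.
y = [fleishman(v, b=0.9, c=0.3, d=0.0) for v in z]
print(round(sample_skewness(y), 2))   # positive skew induced by the quadratic term
```

With these coefficients the quadratic term alone induces a theoretical skewness of about 1.7, which the sample estimate should approach.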
In today's omni-channel world, consumers expect retailers to deliver the product they want, where they want it, when they want it, at a price they accept. A major challenge many retailers face in delighting their customers is successfully predicting consumer demand. Business decisions across the enterprise are affected by these demand estimates. Forecasts used to inform high-level strategic planning, merchandising decisions (planning assortments, buying products, pricing, and allocating and replenishing inventory) and operational execution (labor planning) are similar in many respects. However, each business process requires careful consideration of specific input data, modeling strategies, output requirements, and success metrics. In this session, learn how leading retailers are increasing sales and profitability by operationalizing forecasts that improve decisions across their enterprise.
Alex Chien, SAS
Elizabeth Cubbage, SAS
Wanda Shive, SAS
Most 'Design of Experiments' textbooks cover Type I, Type II, and Type III sums of squares, but many researchers and statisticians fall into the habit of using one type mindlessly. This breakout session reviews the basics and illustrates the importance of the choice of type, as well as of the variable definitions, in PROC GLM and PROC REG.
Sheila Barron, University of Iowa
Michelle Mengeling, Comprehensive Access & Delivery Research & Evaluation-CADRE, Iowa City VA Health Care System
Sampling for audits and forensics presents special challenges: each survey/sample item requires examination by a team of professionals, so sample size must be contained. Surveys involve estimation--not hypothesis testing--so power is not a helpful concept. Stratification and modeling are often required to keep sampling distributions from being skewed. A precision of alpha is not required to create a confidence interval of 1-alpha, but how small a sample is supportable? Many times replicated sampling is required to prove the applicability of the design. Given the robust, programming-oriented approach of SAS®, the random selection, stratification, and optimization techniques built into SAS can be used to bring transparency and reliability to the sample design process. While a sample that is used in a published audit or as a measure of financial damages must endure special scrutiny, it can be a rewarding process to design a sample whose performance you truly understand and which will stand up under a challenge.
Turner Bond, HUD-Office of Inspector General
Surveys are designed to elicit information about population characteristics. A survey design typically combines stratification and multistage sampling of intact clusters, sub-clusters, and individual units with specified probabilities of selection. A survey sample can produce valid and reliable estimates of population parameters at a fraction of the cost of carrying out a census of the entire population, with clear logistical efficiencies. For analyses of survey data, SAS® software provides a suite of procedures from SURVEYMEANS and SURVEYFREQ for generating descriptive statistics and conducting inference on means and proportions to regression-based analysis through SURVEYREG and SURVEYLOGISTIC. For longitudinal surveys and follow-up studies, SURVEYPHREG is designed to incorporate aspects of the survey design for analysis of time-to-event outcomes based on the Cox proportional hazards model, allowing for time-varying explanatory variables. We review the salient features of the SURVEYPHREG procedure with application to survey data from the National Health and Nutrition Examination Survey (NHANES III) Linked Mortality File.
Joseph Gardiner, Michigan State University
Data simulation is a fundamental tool for statistical programmers. SAS® software provides many techniques for simulating data from a variety of statistical models. However, not all techniques are equally efficient. An efficient simulation can run in seconds, whereas an inefficient simulation might require days to run. This paper presents 10 techniques that enable you to write efficient simulations in SAS. Examples include how to simulate data from a complex distribution and how to use simulated data to approximate the sampling distribution of a statistic.
Rick Wicklin, SAS
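One of the central ideas above, approximating the sampling distribution of a statistic by simulating many samples and computing the statistic on each, can be sketched in any language. The Python fragment below illustrates the concept only, not the paper's SAS code; the helper name `sampling_distribution` and the exponential example are invented for the sketch.

```python
import random, statistics

def sampling_distribution(stat, simulate_sample, n, reps=5000):
    """Approximate the sampling distribution of a statistic:
    simulate `reps` samples of size n and apply the statistic to each."""
    return [stat(simulate_sample(n)) for _ in range(reps)]

random.seed(123)
draw = lambda n: [random.expovariate(1.0) for _ in range(n)]   # Exp(1): mean 1, sd 1
means = sampling_distribution(statistics.fmean, draw, n=25)

# The simulated standard error of the mean should be close to
# sigma / sqrt(n) = 1 / 5 = 0.2.
print(round(statistics.stdev(means), 2))
```

The efficiency point of the paper is that inner work like this should be vectorized or pushed into a single pass over the data rather than wrapped in macro loops; the structure above at least keeps all replications inside one program.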
This session describes our journey from data acquisition to text analytics on clinical, textual data.
Mark Pitts, Highmark Health
This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.
Amanda Pasch, Kaiser Permanente
PROC MIXED is one of the most popular SAS procedures for performing longitudinal analysis or fitting multilevel models in epidemiology. Model selection is one of the fundamental questions in model building. One of the most popular and widely used strategies is model selection based on information criteria, such as the Akaike Information Criterion (AIC) and the Sawa Bayesian Information Criterion (BIC). This strategy considers both fit and complexity, and enables multiple models to be compared simultaneously. However, there is no existing SAS procedure to perform model selection automatically based on information criteria for PROC MIXED, given a set of covariates. This paper shows how to use the SAS %ic_mixed macro to select a final model with the smallest value of AIC and BIC. Specifically, the %ic_mixed macro will do the following: 1) produce a complete list of all possible model specifications given a set of covariates, 2) use a DO loop to read in one model specification at a time and save it in a macro variable, 3) execute PROC MIXED and use the Output Delivery System (ODS) to output AICs and BICs, 4) append all outputs and use the DATA step to create a sorted list of information criteria with model specifications, and 5) run PROC REPORT to produce the final summary table. Based on the sorted list of information criteria, researchers can easily identify the best model. This paper includes the macro programming language, as well as examples of the macro calls and outputs.
Qinlei Huang, St Jude Children's Research Hospital
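The macro's core loop, enumerating candidate models and ranking them by an information criterion, can be sketched in Python for a simple Gaussian regression. This illustrates only the AIC-selection idea: the helpers `ols_rss` and `aic` and the simulated covariates are invented for the sketch, and the real macro works with PROC MIXED covariance structures rather than OLS.

```python
import itertools, math, random

def ols_rss(X, y):
    """Residual sum of squares from OLS via the normal equations
    (naive Gaussian elimination with partial pivoting)."""
    p, n = len(X[0]), len(X)
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]; v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for cc in range(c, p):
                A[r][cc] -= f * A[c][cc]
            v[r] -= f * v[c]
    beta = [0.0] * p
    for c in reversed(range(p)):
        beta[c] = (v[c] - sum(A[c][cc] * beta[cc] for cc in range(c + 1, p))) / A[c][c]
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))

def aic(rss, n, n_params):
    """Gaussian AIC up to a constant: n*ln(RSS/n) + 2*(number of parameters)."""
    return n * math.log(rss / n) + 2 * n_params

random.seed(5)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]     # pure noise covariate
y = [2 + 1.5 * a - 1.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

covs = {"x1": x1, "x2": x2, "x3": x3}
best = None
for r in range(len(covs) + 1):
    for subset in itertools.combinations(sorted(covs), r):
        X = [[1.0] + [covs[c][i] for c in subset] for i in range(n)]
        score = aic(ols_rss(X, y), n, len(subset) + 2)   # slopes + intercept + variance
        if best is None or score < best[0]:
            best = (score, subset)
print(best[1])   # the strong predictors x1 and x2 are always selected;
                 # the noise variable x3 may or may not enter
```

The macro does the analogous enumeration with a DO loop over model specifications, capturing each fit's AIC and BIC through ODS instead of computing them directly.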
For the past two academic school years, our SAS® Programming 1 class has held classroom discussions about the Charlotte Bobcats. We wondered aloud: if the Bobcats changed their team name, would the dwindling fan base return? As a class, we created a survey of 10 questions asking people whether they liked the name Bobcats, whether they attended basketball games, and whether they bought merchandise. Within a one-hour class period, our class surveyed 981 of 1,733 students at Phillip O. Berry Academy of Technology. After collecting the data, we performed advanced analytics using Base SAS® and concluded that 75% of students and faculty at Phillip O. Berry would prefer any name except the Bobcats. In other results, 80% of the student body liked basketball, and the most preferred name was the Hornets, followed by the Royals, Flight, Dragons, and finally the Bobcats. The following school year, we conducted another survey to discover whether people's opinions had changed since the previous survey and whether people were happy with the Bobcats changing their name. During this time period, the Bobcats had recently been granted the opportunity to change the team name to the Hornets. Once more, we collected and analyzed the data and concluded that 77% of respondents were thrilled with the name change. In addition, around 50% of respondents were interested in purchasing merchandise. Through this project, SAS® Analytics was applied in the classroom to a real-world scenario. The ability to see how SAS® could be applied to a question of interest and create change inspired the students in our class. This project is significant in showing the economic impact that sports can have on a city; in particular, it focused on the nostalgia that the people of Charlotte felt for the name Hornets. The project opened the door for more analysis and questions and continues to spark interest, because when people have a connection to the team, the more the team flourishes, the more Charlotte benefits.
Lauren Cook, Charlotte Mecklenburg School System
Random Forest (RF) is a trademarked term for an ensemble approach to decision trees, introduced by Leo Breiman in 2001. From our familiarity with decision trees--intuitive, easily interpretable models that divide the feature space with recursive partitioning and use sets of binary rules to classify the target--we also know some of their limitations, such as over-fitting and high variance. RF uses decision trees but takes a different approach: instead of growing one deep tree, it aggregates the output of many shallow trees to make a strong classifier. RF significantly improves classification accuracy by growing an ensemble of trees and letting them vote for the most popular class. Unlike a single decision tree, RF is robust against over-fitting and high variance because it randomly selects a subset of variables at each split node. This paper demonstrates this simple yet powerful classification algorithm by building an income-level prediction system. Data extracted from the 1994 Census Bureau database was used for this study. The data set comprises information about 14 key attributes for 45,222 individuals. Using SAS® Enterprise Miner™ 13.1, models such as random forest, decision tree, probability decision tree, gradient boosting, and logistic regression were built to classify the income level (>50K or <=50K) of the population. The results showed that the random forest model was the best model for this data, based on the misclassification rate criterion. The RF model predicts the income-level group of the individuals with an accuracy of 85.1%, with the predictors capturing specific characteristic patterns. This demonstration using SAS® can lead to useful insights into RF for solving classification problems.
Narmada Deve Panneerselvam, OSU
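The RF recipe described above (bootstrap each tree's training data, restrict each split to a random subset of variables, and let the trees vote) can be sketched compactly. The Python below uses one-split "stump" trees on an invented toy data set to keep the example short; all names are invented for the sketch, and a real analysis would grow much deeper trees.

```python
import random
from collections import Counter

def majority(rows):
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def gini(rows):
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())

def fit_stump(data, n_feats):
    """One shallow tree: the best Gini split over a random subset of features."""
    feats = random.sample(range(len(data[0]) - 1), n_feats)
    best = None
    for f in feats:
        for t in sorted({r[f] for r in data}):
            left = [r for r in data if r[f] < t]
            right = [r for r in data if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    _, f, t, lo, hi = best
    return lambda row: lo if row[f] < t else hi

def fit_forest(data, n_trees=25, n_feats=2):
    trees = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]   # bootstrap sample per tree
        trees.append(fit_stump(boot, n_feats))
    return lambda row: Counter(tr(row) for tr in trees).most_common(1)[0][0]

random.seed(3)
# Toy data: the label depends only on the first feature; the other two are noise.
data = [[random.random(), random.random(), random.random(), None] for _ in range(200)]
for r in data:
    r[-1] = 1 if r[0] > 0.5 else 0
predict = fit_forest(data)
acc = sum(predict(r) == r[-1] for r in data) / len(data)
print(round(acc, 2))
```

Even though some individual stumps split on pure noise, the vote across bootstrapped trees recovers the signal, which is the variance-reduction argument the abstract makes.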
This unique culture has access to lots of data, unstructured and structured; is innovative, experimental, and groundbreaking, and doesn't follow convention; and has access to powerful new infrastructure technologies and scalable, industry-standard computing power like never before. The convergence of data, innovative spirit, and the means to process it is what makes this culture truly unique. In response, SAS® proposes The New Analytics Experience. Attend this session to hear more about the New Analytics Experience and the latest Intel technologies that make it possible.
Mark Pallone, Intel
Researchers often use longitudinal data analysis to study the development of behaviors or traits. For example, they might study how an elderly person's cognitive functioning changes over time or how a therapeutic intervention affects a certain behavior over a period of time. This paper introduces the structural equation modeling (SEM) approach to analyzing longitudinal data. It describes various types of latent curve models and demonstrates how you can use the CALIS procedure in SAS/STAT® software to fit these models. Specifically, the paper covers basic latent curve models, such as unconditional and conditional models, as well as more complex models that involve multivariate responses and latent factors. All illustrations use real data that were collected in a study that looked at maternal stress and the relationship between mothers and their preterm infants. This paper emphasizes the practical aspects of longitudinal data analysis. In addition to illustrating the program code, it shows how you can interpret the estimation results and revise the model appropriately. The final section of the paper discusses the advantages and disadvantages of the SEM approach to longitudinal data analysis.
Xinming An, SAS
Yiu-Fai Yung, SAS
The unsustainable trend in healthcare costs has led to efforts to shift some healthcare services to less expensive sites of care. In North Carolina, the expansion of urgent care centers introduces the possibility that non-emergent and non-life threatening conditions can be treated at a less intensive care setting. BCBSNC conducted a longitudinal study of density of urgent care centers, primary care providers, and emergency departments, and the differences in how members access care near those locations. This talk focuses on several analytic techniques that were considered for the analysis. The model needed to account for the complex relationship between the changes in the population (including health conditions and health insurance benefits) and the changes in the types of services and supply of services offered by healthcare providers proximal to them. Results for the chosen methodology are discussed.
Laurel Trantham, Blue Cross and Blue Shield North Carolina
Universities often have student data that is difficult to represent, including information about students' home locations. Such data is often presented in tables, where patterns are easily overlooked. This study aimed to represent recruiting and retention data at a county level using SAS® mapping software for a public, land-grant university. Three years of student data from the student records database were used to visually represent enrollment, retention, and other predictors of student success. SAS® Enterprise Guide® was used along with the GMAP procedure to make user-friendly maps. Displaying data using maps at a county level revealed patterns in enrollment, retention, and other factors of interest that might have otherwise been overlooked, which might be beneficial for recruiting purposes.
Allison Lempola, South Dakota State University
Thomas Brandenburger, South Dakota State University
The bookBot Identity: January 2013. With no memory of it from the past, students and faculty at NC State awake to find the Hunt Library just opened, and inside it, the mysterious and powerful bookBot. A true physical search engine, the bookBot, without thinking, relentlessly pursues, captures, and delivers to the patron any requested book (those things with paper pages--remember?) from the Hunt Library. The bookBot Supremacy: Some books were moved from the central campus library to the new Hunt Library. Did this decrease overall campus circulation or did the Hunt Library and its bookBot reign supreme in increasing circulation? The bookBot Ultimatum: To find out if the opening of the Hunt Library decreased or increased overall circulation. To address the bookBot Ultimatum, the Circulation Statistics Investigation (CSI) team uses the power of SAS® analytics to model library circulation before and after the opening of the Hunt Library. The bookBot Legacy: Join us for the adventure-filled story. Filled with excitement and mystery, this talk is bound to draw a much bigger crowd than had it been more honestly titled 'Intervention Analysis for Library Data.' Tools used are PROC ARIMA, PROC REG, and PROC SGPLOT.
David Dickey, NC State University
John Vickery, North Carolina State University
This presentation provides an in-depth analysis, with example SAS® code, of the health care use and expenditures associated with depression among individuals with heart disease using the 2012 Medical Expenditure Panel Survey (MEPS) data. A cross-sectional study design was used to identify differences in health care use and expenditures between depressed (n = 601) and nondepressed (n = 1,720) individuals among patients with heart disease in the United States. Multivariate regression analyses using the SAS survey analysis procedures were conducted to estimate the incremental health services and direct medical costs (inpatient, outpatient, emergency room, prescription drugs, and other) attributable to depression. The prevalence of depression among individuals with heart disease in 2012 was estimated at 27.1% (6.48 million persons) and their total direct medical costs were estimated at approximately $110 billion in 2012 U.S. dollars. Younger adults (< 60 years), women, unmarried, poor, and sicker individuals with heart disease were more likely to have depression. Patients with heart disease and depression had more hospital discharges (relative ratio (RR) = 1.06, 95% confidence interval (CI) [1.02 to 1.09]), office-based visits (RR = 1.27, 95% CI [1.15 to 1.41]), emergency room visits (RR = 1.08, 95% CI [1.02 to 1.14]), and prescribed medicines (RR = 1.89, 95% CI [1.70, 2.11]) than their counterparts without depression. Finally, among individuals with heart disease, overall health care expenditures for individuals with depression was 69% higher than that for individuals without depression (RR = 1.69, 95% CI [1.44, 1.99]). The conclusion is that depression in individuals with heart disease is associated with increased health care use and expenditures, even after adjusting for differences in age, gender, race/ethnicity, marital status, poverty level, and medical comorbidity.
Seungyoung Hwang, Johns Hopkins University Bloomberg School of Public Health
Drawing on the results from machine learning, exploratory statistics, and a variety of related methodologies, data analytics is becoming one of the hottest areas in a variety of global industries. The utility and application of these analyses have been extremely impressive and have led to successes ranging from business value generation to hospital infection control applications. This presentation examines the philosophical foundations (epistemology) associated with scientific discovery and considers whether the currently used analytics techniques rest on a rational philosophy of science. Examples are provided to assist in making the concepts more concrete to the business and scientific user.
Mike Hardin, The University of Alabama
Currently, there are several methods for reading JSON formatted files into SAS® that depend on the version of SAS and which products are licensed. These methods include user-defined macros, visual analytics, PROC GROOVY, and more. The user-defined macro %GrabTweet, in particular, provides a simple way to directly read JSON-formatted tweets into SAS® 9.3. The main limitation of %GrabTweet is that it requires the user to repeatedly run the macro in order to download large amounts of data over time. Manually downloading tweets while conforming to the Twitter rate limits might cause missing observations and is time-consuming overall. Imagine having to sit by your computer the entire day to continuously grab data every 15 minutes, just to download a complete data set of tweets for a popular event. Fortunately, the %GrabTweet macro can be modified to automate the retrieval of Twitter data based on the rate that the tweets are coming in. This paper describes the application of the %GrabTweet macro combined with batch processing to download tweets without manual intervention. Users can specify the phrase parameters they want, run the batch processing macro, leave their computer to automatically download tweets overnight, and return to a complete data set of recent Twitter activity. The batch processing implements an automated retrieval of tweets through an algorithm that assesses the rate of tweets for the specified topic in order to make downloading large amounts of data simpler and effortless for the user.
Isabel Litton, California Polytechnic State University, SLO
Rebecca Ottesen, City of Hope and Cal Poly SLO
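The batch-processing idea described above, adjusting the wait between API calls to the rate at which matching tweets arrive, can be sketched as a simple polling loop. In the Python below, `fetch_tweets` is a stub standing in for the real Twitter API call, and `adaptive_interval` is an invented heuristic; the sketch shows only the scheduling logic, not the %GrabTweet macro itself.

```python
import random

def adaptive_interval(n_new, lo=15, hi=900, target=100):
    """Seconds to wait before the next call: poll faster when tweets arrive
    quickly, and back off toward the rate-limit window when the topic is quiet."""
    if n_new == 0:
        return hi
    return max(lo, min(hi, lo * target // n_new))

def fetch_tweets(phrase):
    """Stub standing in for the real Twitter API call made by %GrabTweet."""
    return [f"tweet about {phrase}" for _ in range(random.randint(0, 150))]

def collect(phrase, max_calls=5):
    archive, waits = [], []
    for _ in range(max_calls):
        batch = fetch_tweets(phrase)
        archive.extend(batch)
        waits.append(adaptive_interval(len(batch)))
        # a real run would pause here: time.sleep(waits[-1])
    return archive, waits

random.seed(0)
tweets, waits = collect("#SASGF")
print(waits)
```

Clamping the interval between a floor and the 15-minute rate window mirrors the constraint the paper works around: poll often enough to avoid gaps, but never faster than Twitter's limits allow.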
This paper explores feature extraction from unstructured text variables using Term Frequency-Inverse Document Frequency (TF-IDF) weighting algorithms coded in Base SAS®. Data sets with unstructured text variables can often hold a lot of potential to enable better predictive analysis and document clustering. Each of these unstructured text variables can be used as inputs to build an enriched data set-specific inverted index, and the most significant terms from this index can be used as single word queries to weight the importance of the term to each document from the corpus. This paper also explores the usage of hash objects to build the inverted indices from the unstructured text variables. We find that hash objects provide a considerable increase in algorithm efficiency, and our experiments show that a novel weighting algorithm proposed by Paik (2013) best enables meaningful feature extraction. Our TF-IDF implementations are tested against a publicly available data breach data set to understand patterns specific to insider threats to an organization.
Ila Gokarn, Singapore Management University
Clifton Phua, SAS
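The inverted index and weighting scheme described above map naturally onto a hash table: term → {document → frequency}. The Python sketch below applies the classic TF-IDF weight tf · log(N/df) to a tiny invented corpus; the paper's SAS implementation uses DATA step hash objects, and its best-performing weighting is Paik's (2013) scheme rather than this basic formula.

```python
import math
from collections import defaultdict

docs = {
    "d1": "insider copied client records to a personal drive",
    "d2": "phishing email led to stolen client credentials",
    "d3": "insider shared credentials with an outside party",
}

# Inverted index: term -> {doc_id: term frequency}. This plays the role
# the paper assigns to the hash object.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def tfidf(term, doc_id):
    """Classic TF-IDF weight: tf * log(N / df)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log(len(docs) / len(postings))

print(round(tfidf("phishing", "d2"), 3))   # rare term: tf = 1, idf = log(3/1)
print(tfidf("client", "d3"))               # term absent from d3 -> 0.0
```

Terms appearing in every document get idf = log(1) = 0, so the highest-weighted terms per document are exactly the candidates for the single-word queries used in feature extraction.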
The NH Citizens Health Initiative and the University of New Hampshire Institute for Health Policy and Practice, in collaboration with Accountable Care Project (ACP) participants, have developed a set of analytic reports to provide systems undergoing transformation a capacity to compare performance on the measures of quality, utilization, and cost across systems and regions. The purpose of these reports is to provide data and analysis on which our ACP learning collaborative can share knowledge and develop action plans that can be adopted by health-care innovators in New Hampshire. This breakout session showcases the claims-based reports, powered by SAS® Visual Analytics and driven by the New Hampshire Comprehensive Health Care Information System (CHIS), which includes commercial, Medicaid, and Medicare populations. With the power of SAS Visual Analytics, hundreds of pages of PDF files were distilled down to a manageable, dynamic, web-based portal that allows users to target information most appealing to them. This streamlined approach reduces barriers to obtaining information, offers that information in a digestible medium, and creates a better user experience. For more information about the ACP or to access the public reports, visit http://nhaccountablecare.org/.
Danna Hourani, SAS
Interactive Voice Response (IVR) systems are likely one of the best and worst gifts to the world of communication, depending on who you ask. Businesses love IVR systems because they take out hundreds of millions of dollars of call center costs in automation of routine tasks, while consumers hate IVRs because they want to talk to an agent! It is a delicate balancing act to manage an IVR system that saves money for the business, yet is smart enough to minimize consumer abrasion by knowing who they are, why they are calling, and providing an easy automated solution or a quick route to an agent. There are many aspects to designing such IVR systems, including engineering, application development, omni-channel integration, user interface design, and data analytics. For larger call volume businesses, IVRs generate terabytes of data per year, with hundreds of millions of rows per day that track all system and customer-facing events. The data is stored in various formats and is often unstructured (lengthy character fields that store API return information or text fields containing consumer utterances). The focus of this talk is the development of a data mining framework based on SAS® that is used to parse and analyze IVR data in order to provide insights into usability of the application across various customer segments. Certain use cases are also provided.
Dmitriy Khots, West Corp
In 2012, the Obama campaign used advanced analytics to target voters, especially in social media channels. Millions of voters were scored on models each night to predict their voting patterns. These models were used as the driver for all campaign decisions, including TV ads, budgeting, canvassing, and digital strategies. This presentation covers how the Obama campaign strategies worked, what's in store for analytics in future elections, and how these strategies can be applied in the business world.
Peter Tanner, Capital One
Becoming one of the best memorizers in the world doesn't happen overnight. With hard work, dedication, a bit of obsession, and the assistance of some clever analytics metrics, Nelson Dellis was able to climb to the top of the memory rankings in under a year to become the now three-time USA Memory Champion. In this talk, he explains what it takes to become the best at memory, what is involved in such grueling memory competitions, and how analytics helped him get there.
Nelson Dellis, Climb for Memory
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE's Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a category is very small at lower levels of the hierarchy. Blindly applying a standard categorization model is likely to perform poorly if this fact is not taken into consideration. This paper introduces a novel technique for text categorization, Boolean rule extraction, which enables you to effectively address this situation. In addition, models that are generated by a rule-based technique have good interpretability and can be easily modified by a human expert, enabling better human-machine interaction. The paper demonstrates how to use SAS® Text Miner macros and procedures to obtain effective predictive models at all hierarchy levels in a taxonomy.
Zheng Zhao, SAS
Russ Albright, SAS
James Cox, SAS
Ning Jin, SAS
SAS® PROC FASTCLUS generates five clusters for the group of repeat clients of Ontario's Remedial Measures program. Heat map tables are shown for selected variables, such as demographics, scales, factors, and drug use, to visualize the differences between clusters.
Rosely Flam-Zalcman, CAMH
Robert Mann, CAMH
Rita Thomas, CAMH
The Behavioral Risk Factor Surveillance System (BRFSS) collects data on health practices and risk behaviors via telephone survey. This study focuses on the question, "On average, how many hours of sleep do you get in a 24-hour period?" Recall bias is a potential concern in interviews and questionnaires, such as BRFSS. The 2013 BRFSS data is used to illustrate the proper methods for implementing PROC SURVEYREG and PROC SURVEYLOGISTIC, using the complex weighting scheme that BRFSS provides.
Lucy D'Agostino McGowan, Vanderbilt University
Alice Toll, Vanderbilt University
A Chinese wind energy company designs several hundred wind farms each year. An important step in its design process is micrositing, in which it creates a layout of turbines for a wind farm. The amount of energy that a wind farm generates is affected by geographical factors (such as elevation of the farm), wind speed, and wind direction. The types of turbines and their positions relative to each other also play a critical role in energy production. Currently the company is using an open-source software package to help with its micrositing. As the size of wind farms increases and the pace of their construction speeds up, the open-source software is no longer able to support the design requirements. The company wants to work with a commercial software vendor that can help resolve scalability and performance issues. This paper describes the use of the OPTMODEL and OPTLSO procedures on the SAS® High-Performance Analytics infrastructure together with the FCMP procedure to model and solve this highly nonlinear optimization problem. Experimental results show that the proposed solution can meet the company's requirements for scalability and performance.
Sherry (Wei) Xu, SAS
Steven Gardner, SAS
Joshua Griffin, SAS
Baris Kacar, SAS
Jinxin Yi, SAS
Hawkins (1980) defines an outlier as "an observation that deviates so much from other observations as to arouse the suspicion that it was generated by a different mechanism." To identify data outliers, a classic multivariate outlier detection approach implements the robust Mahalanobis distance method, splitting the distribution of distance values into two subsets (within-the-norm and out-of-the-norm): the threshold value is usually set to the 97.5% quantile of the chi-square distribution with p (number of variables) degrees of freedom, and items whose distance values fall beyond it are labeled out-of-the-norm. This threshold value is an arbitrary number, however, and it might flag as out-of-the-norm a number of items that are actually extreme values of the baseline distribution rather than outliers. Therefore, it is desirable to identify an additional threshold, a cutoff point that divides the set of out-of-norm points into two subsets: extreme values and outliers. One way to do this, particularly for larger databases, is to increase the threshold value to another arbitrary number, but this approach requires taking the size of the data set into consideration, since size affects the threshold separating outliers from extreme values. A 2003 article by Gervini (Journal of Multivariate Statistics) proposes an adaptive threshold that increases with the number of items n if the data is clean but remains bounded if there are outliers in the data. In 2005, Filzmoser, Garrett, and Reimann (Computers & Geosciences) built on Gervini's contribution to derive, by simulation, a relationship between the number of items n, the number of variables p, and a critical ancillary variable for the determination of outlier thresholds. This paper implements the Gervini adaptive threshold estimator by using PROC ROBUSTREG and the SAS® chi-square functions CINV and PROBCHI, available in the SAS/STAT® environment. It also provides data simulations to illustrate the reliability and the flexibility of the method in distinguishing true outliers from extreme values.
Paulo Macedo, Integrity Management Services, LLC
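The first stage of the approach above can be sketched outside SAS. The snippet below (Python, not the paper's PROC ROBUSTREG code) computes squared Mahalanobis distances and flags items beyond the 97.5% chi-square quantile; for brevity it uses classical mean and covariance estimates, whereas the paper substitutes robust (MCD-type) estimates, and it stops short of the Gervini adaptive second threshold.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_flags(X, quantile=0.975):
    """Flag rows whose squared Mahalanobis distance exceeds the chi-square
    quantile with p degrees of freedom (p = number of columns). Classical
    mean/covariance are used here for brevity; a robust analysis would
    substitute MCD-type estimates."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mu
    # Row-wise quadratic form diff' * inv(cov) * diff
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    threshold = chi2.ppf(quantile, df=X.shape[1])
    return d2, d2 > threshold
```

An adaptive second cutoff in the spirit of Gervini would then examine how the empirical tail of `d2` compares to the chi-square tail before declaring the flagged items true outliers rather than extreme values.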
Companies are increasingly relying on analytics as the right solution to their problems. In order to use analytics and create value for the business, companies first need to store, transform, and structure the data to make it available and functional. This paper shows a successful business case where the extraction and transformation of the data combined with analytical solutions were developed to automate and optimize the management of the collections cycle for a TELCO company (DIRECTV Colombia). SAS® Data Integration Studio is used to extract, process, and store information from a diverse set of sources. SAS Information Map is used to integrate and structure the created databases. SAS® Enterprise Guide® and SAS® Enterprise Miner™ are used to analyze the data, find patterns, create profiles of clients, and develop churn predictive models. SAS® Customer Intelligence Studio is the platform on which the collection campaigns are created, tested, and executed. SAS® Web Report Studio is used to create a set of operational and management reports.
Darwin Amezquita, DIRECTV
Paulo Fuentes, Directv Colombia
Andres Felipe Gonzalez, Directv
Breast cancer is the leading cause of cancer-related deaths among women worldwide, and its early detection can reduce mortality rate. Using a data set containing information about breast screening provided by the Royal Liverpool University Hospital, we constructed a model that can provide early indication of a patient's tendency to develop breast cancer. This data set has information about breast screening from patients who were believed to be at risk of developing breast cancer. The most important aspect of this work is that we excluded variables that are in one way or another associated with breast cancer, while keeping the variables that are less likely to be associated with breast cancer or whose associations with breast cancer are unknown as input predictors. The target variable is a binary variable with two values, 1 (indicating a type of cancer is present) and 0 (indicating a type of cancer is not present). SAS® Enterprise Miner™ 12.1 was used to perform data validation and data cleansing, to identify potentially related predictors, and to build models that can be used to predict at an early stage the likelihood of patients developing breast cancer. We compared two models: the first model was built with an interactive node and a cluster node, and the second was built without an interactive node and a cluster node. Classification performance was compared using a receiver operating characteristic (ROC) curve and average squared error. Interestingly, we found significantly improved model performance by using only variables that have a lesser or unknown association with breast cancer. The result shows that the logistic model with an interactive node and a cluster node has better performance with a lower average squared error (0.059614) than the model without an interactive node and a cluster node. Among other benefits, this model will assist inexperienced oncologists in saving time in disease diagnosis.
Gibson Ikoro, Queen Mary University of London
Beatriz de la Iglesia, University of East Anglia, Norwich, UK
Abalone is a common name given to sea snails or mollusks. These creatures are highly iridescent, with shells of strong changeable colors. This characteristic makes the shells attractive to humans as decorative objects and jewelry. The abalone structure is being researched to build body armor. The value of a shell varies by its age and the colors it displays. Determining the number of rings on an abalone is a tedious and cumbersome task and is usually done by cutting the shell through the cone, staining it, and counting the number of rings on it through a microscope. In this poster, I aim to predict the number of rings on an abalone by using its physical characteristics. This data was obtained from the UCI Machine Learning Repository, and consists of 4,177 observations with 8 attributes. I considered the number of rings to be my target variable. The abalone's age can be reasonably approximated as being 1.5 times the number of rings on its shell. Using SAS® Enterprise Miner™, I have built regression models and neural network models to determine the physical measurements responsible for determining the number of rings on the abalone. While I have obtained a coefficient of determination of 54.01%, my aim is to improve and expand the analysis using the power of SAS Enterprise Miner. The current initial results indicate that the height, the shucked weight, and the viscera weight of the shell are the three most influential variables in predicting the number of rings on an abalone.
Ganesh Kumar Gangarajula, Oklahoma State University
Yogananda Domlur Seetharam
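The regression step described above can be illustrated in a few lines. The sketch below is Python rather than the poster's SAS Enterprise Miner flow, and it uses synthetic stand-in data (the coefficients and ranges are invented for illustration, not taken from the UCI abalone set); only the 1.5 × rings age rule comes from the abstract.

```python
import numpy as np

# Hypothetical stand-in for the abalone measurements named in the abstract:
# height, shucked weight, and viscera weight. Coefficients are illustrative.
rng = np.random.default_rng(42)
n = 500
height = rng.uniform(0.05, 0.25, n)
shucked = rng.uniform(0.1, 0.8, n)
viscera = rng.uniform(0.05, 0.4, n)
rings = 2.0 + 30 * height + 5 * shucked + 4 * viscera + rng.normal(0, 1, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), height, shucked, viscera])
beta, *_ = np.linalg.lstsq(X, rings, rcond=None)
pred = X @ beta
r2 = 1 - ((rings - pred) ** 2).sum() / ((rings - rings.mean()) ** 2).sum()
age = 1.5 * pred  # the abstract's rule of thumb: age is about 1.5 x ring count
```

On the real data set, the same fit would be examined for which coefficients carry the most explanatory weight, mirroring the poster's finding about height and the two weight measurements.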
Many epidemiological studies use medical claims to identify and describe a population. But finding out who was diagnosed, and who received treatment, isn't always simple. Each claim can have dozens of medical codes, with different types of codes for procedures, drugs, and diagnoses. Even a basic definition of treatment could require a search for any one of 100 different codes. A SAS® macro may come to mind, but generalizing the macro to work with different codes and types allows it to be reused in a variety of different scenarios. We look at a number of examples, starting with a single code type and variable. Then we consider multiple code variables, multiple code types, and multiple flag variables. We show how these macros can be combined and customized for different data with minimal rework. Macro flexibility and reusability are also discussed, along with ways to keep our list of medical codes separate from our program. Finally, we discuss time-dependent medical codes, codes requiring database lookup, and macro performance.
Andy Karnopp, Fred Hutchinson Cancer Research Center
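The reusable-macro idea above translates naturally to other languages. The following is a hedged Python analogue (using pandas) of a generalized code-flagging routine, not the authors' SAS macro; the column names, code values, and flag name are hypothetical.

```python
import pandas as pd

def flag_any_code(df, code_cols, code_set, flag_name):
    """Add a 0/1 flag that is 1 when ANY of the listed code columns contains
    a code from code_set. Analogous in spirit to a generalized SAS macro
    that loops over code variables; names here are illustrative only."""
    df = df.copy()
    df[flag_name] = df[code_cols].isin(set(code_set)).any(axis=1).astype(int)
    return df

# Toy claims table with two diagnosis-code columns (hypothetical codes)
claims = pd.DataFrame({
    "dx1": ["250.00", "401.9", "V58.67"],
    "dx2": ["428.0", "250.02", None],
})
diabetes_codes = {"250.00", "250.02"}
flagged = flag_any_code(claims, ["dx1", "dx2"], diabetes_codes, "diabetes")
```

Keeping `diabetes_codes` in a separate file or table, as the paper suggests for its macro, lets the same function serve many definitions of treatment or diagnosis without code changes.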
Throughout the latter part of the twentieth century, the United States of America has experienced an incredible boom in the rate of incarceration of its citizens. This increase arguably began in the 1970s when the Nixon administration oversaw the beginning of the war on drugs in America. The U.S. now has one of the highest rates of incarceration among industrialized nations. However, the citizens who have been incarcerated on drug charges have disproportionately been African American or other racial minorities, even though many studies have concluded that drug use is fairly equal among racial groups. In order to remedy this situation, it is essential to first understand why so many more people have been arrested and incarcerated. In this research, I explore a potential explanation for the epidemic of mass incarceration. I intend to answer the question, "Does gubernatorial rhetoric have an effect on the rate of incarceration in a state?" More specifically, I am interested in examining the language that the governor of a state uses at the annual State of the State address in order to see if there is any correlation between rhetoric and the subsequent rate of incarceration in that state. In order to understand any possible correlation, I use SAS® Text Miner and SAS® Contextual Analysis to examine the attitude towards crime in each speech. The political phenomenon that I am trying to understand is how state government employees are affected by the tone that the chief executive of a state uses towards crime, and whether the actions of these state employees subsequently lead to higher rates of incarceration. The governor is the top government official in charge of employees of a state, so when this official addresses the state, the employees may take the governor's message as an order for how they do their jobs. While many political factors can affect legislation and its enforcement, a governor has the ability to set the tone of a state when it comes to policy issues such as crime.
Catherine Lachapelle, UNC Chapel Hill
In today's society, where seemingly unlimited information is just a mouse click away, many turn to social media, forums, and medical websites to research and understand how mothers feel about the birthing process. Mining the data in these resources helps provide an understanding of what mothers value and how they feel. This paper shows the use of SAS® Text Analytics to gather, explore, and analyze reports from mothers to determine their sentiment about labor and delivery topics. Results of this analysis could aid in the design and development of a labor and delivery survey and be used to understand what characteristics of the birthing process yield the highest levels of importance. These resources can then be used by labor and delivery professionals to engage with mothers regarding their labor and delivery preferences.
Michael Wallis, SAS
During the financial crisis of 2007-2009, the U.S. labor market lost 8.4 million jobs, causing the unemployment rate to increase from 5% to 9.5%. One of the indicators for economic recession is negative gross domestic product (GDP) for two consecutive quarters. This poster combines quantitative and qualitative techniques to predict the economic downturn by forecasting recession probabilities. Data was collected from the Board of Governors of the Federal Reserve System and the Federal Reserve Bank of St. Louis, containing 29 variables and quarterly observations from 1976-Q1 to 2013-Q3. Eleven variables were selected as inputs based on their effects on recession and limiting the multicollinearity: long-term treasury yield (5-year and 10-year), mortgage rate, CPI inflation rate, prime rate, market volatility index, BBB-rated corporate bond yield, house price index, stock market index, commercial real estate price index, and one calculated variable, yield spread (the Treasury yield-curve spread). The target variable was a binary variable depicting the economic recession for each quarter (1 = Recession). Data was prepared for modeling by applying imputation and transformation on variables. Two-step analysis was used to forecast the recession probabilities for the short-term period. Predicted recession probabilities were first obtained from the Backward Elimination Logistic Regression model that was selected on the basis of misclassification (validation misclassification = 0.115). These probabilities were then forecasted using the Exponential Smoothing method that was selected on the basis of mean absolute error (MAE = 11.04). Results show the recession periods including the great recession of 2008 and the forecast for eight quarters (up to 2015-Q3).
Avinash Kalwani, Oklahoma State University
Nishant Vyas, Oklahoma State University
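The second step of the two-step analysis above, smoothing the logistic model's predicted probabilities, can be sketched compactly. The snippet below is a bare-bones Python illustration of simple exponential smoothing, not the poster's SAS forecasting setup; the probability series is hypothetical.

```python
def exp_smooth(series, alpha, horizon=1):
    """Simple exponential smoothing: level = alpha*y + (1 - alpha)*level.
    The flat h-step-ahead forecast is the final smoothed level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

# Hypothetical quarterly recession probabilities from a logistic model
probs = [0.05, 0.08, 0.20, 0.55, 0.62]
forecast = exp_smooth(probs, alpha=0.4, horizon=8)
```

Richer smoothing variants (trend or damped-trend) would replace the flat forecast with one that extrapolates recent movement, which matters when probabilities are rising into a recession.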
An essential part of health services research is describing the use and sequencing of a variety of health services. One of the most frequently examined health services is hospitalization. A common problem in describing hospitalizations is that a patient might have multiple hospitalizations to treat the same health problem. Specifically, a hospitalized patient might be (1) sent to and returned from another facility in a single day for testing, (2) transferred from one hospital to another, and/or (3) discharged home and re-admitted within 24 hours. In all cases, these hospitalizations are treating the same underlying health problem and should be considered as a single episode. If examined without regard for the episode, a patient would be identified as having 4 hospitalizations (the initial hospitalization, the testing hospitalization, the transfer hospitalization, and the readmission hospitalization). In reality, they had one hospitalization episode spanning multiple facilities. IMPORTANCE: Failing to account for multiple hospitalizations in the same episode has implications for many disciplines including health services research, health services planning, and quality improvement for patient safety. HEALTH SERVICES RESEARCH: Hospitalizations will be counted multiple times, leading to an overestimate of the number of hospitalizations a person had. For example, a person can be identified as having 4 hospitalizations when in reality they had one episode of hospitalization. This will result in a person appearing to be a higher user of health care than is true. RESOURCE PLANNING FOR HEALTH SERVICES. The average time and resources needed to treat a specific health problem may be underestimated. To illustrate, if a patient spends 10 days each in 3 different hospitals in the same episode, the total number of days needed to treat the health problem is 30 days, but each hospital will believe it is only 10, and planned resourcing may be inadequate. 
QUALITY IMPROVEMENT FOR PATIENT SAFETY. Hospital-acquired infections are a serious concern and a major cause of extended hospital stays, morbidity, and death. As a result, many hospitals have quality improvement programs that monitor the occurrence of infections in order to identify ways to reduce them. If episodes of hospitalizations are not considered, an infection acquired in a hospital that does not manifest until a patient is transferred to a different hospital will incorrectly be attributed to the receiving hospital. PROPOSAL: We have developed SAS® code to identify episodes of hospitalizations, the sequence of hospitalizations within each episode, and the overall duration of the episode. The output clearly displays the data in an intuitive and easy-to-understand format. APPLICATION: The method we will describe and the associated SAS code will be useful to not only health services researchers, but also anyone who works with temporal data that includes nested, overlapping, and subsequent events.
Meriç Osman, Health Quality Council
Jacqueline Quail, Saskatchewan Health Quality Council
Nianping Hu, Saskatchewan Health Quality Council
Nedeene Hudema, Saskatchewan Health Quality Council
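The episode-building logic described above (same-day transfers and readmissions within 24 hours belong to one episode) can be sketched directly. This is a Python illustration of the idea, not the authors' SAS code; the record layout and dates are hypothetical, and real data would also need per-patient grouping and tie handling.

```python
from datetime import date, timedelta

def assign_episodes(stays, gap=timedelta(days=1)):
    """Group one patient's hospital stays into episodes: a stay joins the
    current episode when its admission falls within `gap` of the previous
    discharge, capturing same-day transfers and <=24-hour readmissions.
    `stays` is a list of (admit_date, discharge_date), sorted by admission."""
    episodes, current = [], []
    for admit, discharge in stays:
        if current and admit - current[-1][1] <= gap:
            current.append((admit, discharge))
        else:
            if current:
                episodes.append(current)
            current = [(admit, discharge)]
    if current:
        episodes.append(current)
    return episodes

stays = [
    (date(2015, 1, 1), date(2015, 1, 5)),   # initial hospitalization
    (date(2015, 1, 5), date(2015, 1, 5)),   # same-day testing transfer
    (date(2015, 1, 6), date(2015, 1, 10)),  # readmission within 24 hours
    (date(2015, 3, 1), date(2015, 3, 4)),   # unrelated later episode
]
episodes = assign_episodes(stays)
```

With this grouping, the first three stays count as one episode spanning multiple facilities, exactly the correction the abstract argues for: episode duration and hospitalization counts are then computed per episode, not per claim.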
Developing a quality product or service, while at the same time improving cost management and maximizing profit, are challenging goals for any company. Finding the optimal balance between efficiency and profitability is not easy. The same can be said in regards to the development of a predictive statistical model. On the one hand, the model should predict as accurately as possible. On the other hand, having too many predictors can end up costing the company money. One of the purposes of this project is to explore the cost of simplicity. When is it worth having a simpler model, and what are some of the costs of using a more complex one? The answer to that question leads us to another one: How can a predictive statistical model be maximized in order to increase a company's profitability? Using data from the consumer credit risk domain provided by CompuCredit (now Atlanticus), we used logistic regression to build binary classification models to predict the likelihood of default. This project compares two of these models. Although the original data set had several hundred predictor variables and more than a million observations, I chose to use rather simple models. My goal was to develop a model with as few predictors as possible, while not going lower than a concordant level of 80%. Two models were evaluated and compared based on efficiency, simplicity, and profitability. Using the selected model, cluster analysis was then performed in order to maximize the estimated profitability. Finally, the analysis was taken one step further through a supervised segmentation process, in order to target the most profitable segment of the best cluster.
Sherrie Rodriguez, Kennesaw State University
In this era of big data, the use of text analytics to discover insights is rapidly gaining popularity in businesses. On average, more than 80 percent of the data in enterprises may be unstructured. Text analytics can help discover key insights and extract useful topics and terms from the unstructured data. The objective of this paper is to build a model using textual data that predicts the factors that contribute to downtime of a truck. This research analyzes the data of over 200,000 repair tickets of a leading truck manufacturing company. After the terms were grouped into fifteen key topics using the text topic node of SAS® Text Miner, a regression model was built using these topics to predict truck downtime, the target variable. Data was split into training and validation sets for developing the predictive models. Knowledge of the factors contributing to downtime and their associations helped the organization streamline their repair process and improve customer satisfaction.
Ayush Priyadarshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
The concept of least squares means, or population marginal means, seems to confuse a lot of people. We explore least squares means as implemented by the LSMEANS statement in SAS®, beginning with the basics. Particular emphasis is paid to the effect of alternative parameterizations (for example, whether binary variables are in the CLASS statement) and the effect of the OBSMARGINS option. We use examples to show how to mimic LSMEANS using ESTIMATE statements and the advantages of the relatively new LSMESTIMATE statement. The basics of estimability are discussed, including how to get around the dreaded non-estimable messages. Emphasis is put on using the STORE statement and PROC PLM to test hypotheses without having to redo all the model calculations. This material is appropriate for all levels of SAS experience, but some familiarity with linear models is assumed.
David Pasta, ICON Clinical Research
Many SAS® procedures can be used to analyze longitudinal data. This study employed a multisite randomized controlled trial design to demonstrate the effectiveness of two SAS procedures, GLIMMIX and GENMOD, to analyze longitudinal data from five Department of Veterans Affairs Medical Centers (VAMCs). Older male veterans (n = 1222) seen in VAMC primary care clinics were randomly assigned to two behavioral health models, integrated (n = 605) and enhanced referral (n = 617). Data was collected at baseline, and at 3-, 6-, and 12-month follow-up. A mixed-effects repeated measures model was used to examine the dependent variable, problem drinking, which was defined as count and dichotomous from baseline to 12-month follow-up. Sociodemographics and depressive symptoms were included as covariates. First, bivariate analyses included general linear model and chi-square tests to examine covariates by group and group by problem drinking outcomes. All significant covariates were included in the GLIMMIX and GENMOD models. Then, multivariate analysis included mixed models with generalized estimating equations (GEEs). The effect of group, time, and the interaction effect of group by time were examined after controlling for covariates. Multivariate results were inconsistent for GLIMMIX and GENMOD using lognormal, Gaussian, Weibull, and gamma distributions. SAS is a powerful statistical program for analyzing data from longitudinal studies.
Abbas Tavakoli, University of South Carolina/College of Nursing
Marlene Al-Barwani, University of South Carolina
Sue Levkoff, University of South Carolina
Selina McKinney, University of South Carolina
Nikki Wooten, University of South Carolina
Mathematical optimization is a powerful paradigm for modeling and solving business problems that involve interrelated decisions about resource allocation, pricing, routing, scheduling, and similar issues. The OPTMODEL procedure in SAS/OR® software provides unified access to a wide range of optimization solvers and supports both standard and customized optimization algorithms. This paper illustrates PROC OPTMODEL's power and versatility in building and solving optimization models and describes the significant improvements that result from PROC OPTMODEL's many new features. Highlights include the recently added support for the network solver, the constraint programming solver, and the COFOR statement, which allows parallel execution of independent solver calls. Best practices for solving complex problems that require access to more than one solver are also demonstrated.
Rob Pratt, SAS
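For readers without SAS/OR, the modeling pattern PROC OPTMODEL expresses (decision variables, an objective, linear constraints, a solver call) can be sketched with a toy linear program. The example below uses Python's SciPy rather than OPTMODEL, and the problem itself is invented for illustration.

```python
from scipy.optimize import linprog

# Toy resource-allocation LP: maximize 3x + 5y subject to
#   x + 2y <= 14,  3x - y >= 0,  x - y <= 2,  x, y >= 0.
# linprog minimizes, so the objective is negated, and the >= constraint
# is rewritten as -3x + y <= 0.
res = linprog(
    c=[-3, -5],
    A_ub=[[1, 2], [-3, 1], [1, -1]],
    b_ub=[14, 0, 2],
    bounds=[(0, None), (0, None)],
)
x, y = res.x
```

In OPTMODEL the same model would be stated declaratively (VAR, MAX, CON statements) and handed to the LP solver; features such as the COFOR statement then let independent solver calls like this one run in parallel.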
Competing risks arise in studies in which individuals are subject to a number of potential failure events and the occurrence of one event might impede the occurrence of other events. For example, after a bone marrow transplant, a patient might experience a relapse or might die while in remission. You can use one of the standard methods of survival analysis, such as the log-rank test or Cox regression, to analyze competing-risks data, whereas other methods, such as the product-limit estimator, might yield biased results. An increasingly common practice of assessing the probability of a failure in competing-risks analysis is to estimate the cumulative incidence function, which is the probability subdistribution function of failure from a specific cause. This paper discusses two commonly used regression approaches for evaluating the relationship of the covariates to the cause-specific failure in competing-risks data. One approach models the cause-specific hazard, and the other models the cumulative incidence. The paper shows how to use the PHREG procedure in SAS/STAT® software to fit these models.
Ying So, SAS
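The cumulative incidence function at the heart of the paper can be estimated nonparametrically, which helps fix ideas before turning to the regression models. The sketch below is a minimal Python illustration of the estimator (assuming distinct event times), not SAS's PHREG implementation.

```python
def cumulative_incidence(times, causes, cause):
    """Nonparametric cumulative incidence function for one cause in
    competing-risks data. `times` are event/censoring times (assumed
    distinct), `causes` are 0 (censored), 1, 2, ...; returns (time, CIF)
    pairs at the event times of the requested cause. At each event time,
    the CIF increment is overall survival just before the event times the
    cause-specific hazard d/n."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    surv = 1.0   # overall (all-cause) Kaplan-Meier survival
    cif, out = 0.0, []
    for i in order:
        if causes[i] == cause:
            cif += surv * (1.0 / n_at_risk)
            out.append((times[i], cif))
        if causes[i] != 0:          # any failure reduces overall survival
            surv *= 1.0 - 1.0 / n_at_risk
        n_at_risk -= 1
    return out
```

Unlike one minus the cause-specific product-limit estimate, this quantity properly accounts for the competing cause, which is exactly the bias the abstract warns about.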
Polytomous items have been widely used in educational and psychological settings. As a result, the demand for statistical programs that estimate the parameters of polytomous items has been increasing. For this purpose, Samejima (1969) proposed the graded response model (GRM), in which category characteristic curves are characterized by the difference of the two adjacent boundary characteristic curves. In this paper, we show how the SAS-PIRT macro (a SAS® macro written in SAS/IML®) was developed based on the GRM and how it performs in recovering the parameters of polytomous items using simulated data.
Sung-Hyuck Lee, ACT, Inc.
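The defining construction of the GRM, category probabilities as differences of adjacent boundary characteristic curves, is compact enough to state in code. This Python sketch illustrates the model's probability calculation only (the SAS-PIRT macro's estimation machinery is not reproduced); the parameter values are illustrative.

```python
import math

def grm_category_probs(theta, a, b):
    """Samejima's graded response model: boundary curves
    P*_k(theta) = 1 / (1 + exp(-a * (theta - b_k))) for thresholds
    b_1 < ... < b_{K-1}; the probability of responding in category k is
    the difference of adjacent boundary curves, with P*_0 = 1 and P*_K = 0."""
    bounds = [1.0] + [1 / (1 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(b) + 1)]

# Illustrative item: discrimination a = 1.2, three thresholds (four categories)
probs = grm_category_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.5])
```

Parameter recovery, as in the paper's simulations, then amounts to maximizing the likelihood built from these category probabilities over simulated response patterns.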
A bank that wants to use the Internal Ratings Based (IRB) methods to calculate minimum Basel capital requirements has to calculate default probabilities (PDs) for all its obligors. Supervisors are mainly concerned about the credit risk being underestimated. For high-quality exposures or groups with an insufficient number of obligors, calculations based on historical data may not be sufficiently reliable due to infrequent or no observed defaults. In an effort to solve the problem of default data scarcity, modeling assumptions are made, and to control the possibility of model risk, a high level of conservatism is applied. Banks, on the other hand, are more concerned about PDs that are too pessimistic, since this has an impact on their pricing and economic capital. In small samples or where we have little or no defaults, the data provides very little information about the parameters of interest. The incorporation of prior information or expert judgment and using Bayesian parameter estimation can potentially be a very useful approach in a situation like this. Using PROC MCMC, we show that a Bayesian approach can serve as a valuable tool for validation and monitoring of PD models for low default portfolios (LDPs). We cover cases ranging from single-period, zero correlation, and zero observed defaults to multi-period, non-zero correlation, and few observed defaults.
Machiel Kruger, North-West University
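The zero-default, single-period, zero-correlation case mentioned above has a particularly transparent Bayesian form. The sketch below is a conjugate Beta-Binomial illustration in Python, assuming independent defaults; the Beta(1, 24) prior (mean about 4%) is a hypothetical stand-in for expert judgment, and the paper's PROC MCMC models are considerably richer (correlation, multiple periods).

```python
from scipy.stats import beta

def pd_posterior(defaults, obligors, prior_a=1.0, prior_b=24.0):
    """Conjugate Beta-Binomial update for a default probability.
    Under independence, observing `defaults` among `obligors` updates a
    Beta(prior_a, prior_b) prior to Beta(prior_a + d, prior_b + n - d).
    Returns the posterior mean and a conservative upper quantile."""
    post = beta(prior_a + defaults, prior_b + obligors - defaults)
    return post.mean(), post.ppf(0.999)

# Low-default portfolio: 200 obligors, zero observed defaults
pd_hat, pd_999 = pd_posterior(defaults=0, obligors=200)
```

Even with no observed defaults, the posterior assigns a nonzero PD, and the upper quantile gives the kind of conservative bound supervisors look for; MCMC takes over once correlation and multi-period structure are added.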
How do you serve 25 million video ads a day to Internet users in 25 countries, while ensuring that you target the right ads to the right people on the right websites at the right time? With a lot of help from math, that's how! Come hear how Videology, an Internet advertising company, combines mathematical programming, predictive modeling, and big data techniques to meet the expectations of advertisers and online publishers, while respecting the privacy of online users and combatting fraudulent Internet traffic.
Kaushik Sinha, Videology
In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling health care costs, some patients have zero costs, and the distribution of positive costs is often extremely right-skewed. When modeling charitable donations, many potential donors give nothing, and the majority of donations are relatively small, with a few very large donors. In the analysis of count data, there are also times when there are more zeros than would be expected using standard methodology, or cases where the zeros might differ substantially from the nonzeros, such as the number of cavities a patient has at a dentist appointment or the number of children born to a mother. If the data has such structure and ordinary least squares methods are used, then predictions and estimation might be inaccurate. The two-part model gives us a flexible and useful modeling framework in many situations. Methods for fitting the models with SAS® software are illustrated.
Laura Kapitula, Grand Valley State University
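A two-part model of the kind described above is commonly fit as a logistic model for the zero/nonzero indicator followed by a generalized linear model on the positive values. The data set COSTS and its variables (ANY_COST, COST, AGE, SEX, COMORBID) are hypothetical; this is only a sketch of one common specification, not necessarily the one used in the paper:

```sas
/* Part 1: probability of incurring any cost (ANY_COST = 1 if COST > 0) */
proc logistic data=costs;
   model any_cost(event='1') = age sex comorbid;
run;

/* Part 2: gamma regression with log link, fit to the positive costs only */
proc genmod data=costs;
   where cost > 0;
   model cost = age sex comorbid / dist=gamma link=log;
run;
```

The predicted mean cost for a patient is then the product of the Part 1 probability and the Part 2 conditional mean.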
Many freshmen leave their first college and go on to attend another institution. Some of these students are even successful in earning degrees elsewhere. As more focus is placed on college graduation rates, this paper shows how the power of SAS® can pull in data from many disparate sources, including the National Student Clearinghouse, to answer questions on the minds of many institutional researchers, such as "What would my graduation rate be if these students graduated at my institution instead of at another one?", "What types of schools do students leave to attend?", and "Are there certain characteristics of students who leave, and are they concentrated in certain programs?" The data-handling capabilities of SAS are perfect for this type of analysis, and this presentation walks you through the process.
Stephanie Thompson, Datamum
We demonstrate a method of using SAS® 9.4 to supplement the interpretation of the dimensions of a multidimensional scaling (MDS) model, a process that could be difficult without SAS. In our paper, we examine the question "Why do people choose to drive to work (over other means of travel)?", which transportation researchers need to answer in order to encourage drivers to switch to more environmentally friendly travel modes. We applied the MDS approach to a travel survey data set because MDS has the advantage of extracting drivers' motivations in multiple dimensions. To overcome the challenges of dimension interpretation with MDS, we used the logistic regression function of SAS 9.4 to identify the variables that are strongly associated with each dimension, thus greatly aiding our interpretation procedure. Our findings are important to transportation researchers, practitioners, and MDS users.
Jun Neoh, University of Southampton
Many retail and consumer packaged goods (CPG) companies are now keeping track of what their customers purchased in the past, often through some form of loyalty program. This record keeping is one example of how modern corporations are building data sets that have a panel structure, a data structure that is also pervasive in insurance and finance organizations. Panel data (sometimes called longitudinal data) can be thought of as the joining of cross-sectional and time series data. Panel data enable analysts to control for factors that cannot be considered by simple cross-sectional regression models that ignore the time dimension. These factors, which are unobserved by the modeler, might bias regression coefficients if they are ignored. This paper compares several methods of working with panel data in the PANEL procedure and discusses how you might benefit from using multiple observations for each customer. Sample code is available.
Bobby Gutierrez, SAS
Kenneth Sanford, SAS
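As one illustration of the panel methods the paper compares, a one-way fixed-effects model can be requested in the PANEL procedure. The data set PURCHASES and its variables are hypothetical, so this is only a sketch of the general approach:

```sas
/* Customer-by-month panel: fixed effects absorb unobserved customer traits */
proc panel data=purchases;
   id customer_id month;                /* cross-section and time identifiers      */
   model spend = price promo / fixone;  /* one-way (cross-sectional) fixed effects */
run;
```

Here the customer-level fixed effects control for time-invariant unobserved factors that would bias the coefficients of a simple pooled cross-sectional regression.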