Analyst Papers A-Z

A
Session 11620-2016:
A SAS® Program to Analyze High Dimensional Sparse Counts Data Using a Lumping and Splitting Approach
It is common for hundreds or even thousands of clinical endpoints to be collected from individual subjects, but events for the majority of clinical endpoints are rare. The challenge of analyzing high-dimensional sparse data lies in balancing analytical considerations for statistical inference and clinical interpretation with precise, meaningful outcomes of interest at intra-categorical and inter-categorical levels. Lumping, or grouping similar rare events into a composite category, has the statistical advantages of increasing testing power, reducing multiplicity size, and avoiding competing-risk problems. However, too much or inappropriate lumping jeopardizes the clinical meaning of interest and external validity, whereas splitting, or keeping each individual event at its basic clinically meaningful category, can overcome the drawbacks of lumping. Splitting, in turn, might create analytical issues of increased type II error, multiplicity size, competing risks, and a large proportion of endpoints with rare events. Lumping and splitting may seem to be diametrically opposite approaches, but in fact they are complementary, and both are essential for high-dimensional data analysis. This paper describes the steps required for a lumping and splitting analysis and presents SAS® code that can be used to implement each step.
View the e-poster or slides (PDF)
Shirley Lu, VAMHCS, Cooperative Studies Program
Session 9542-2016:
An Efficient Algorithm in SAS® to Reconcile Two Different Serious Adverse Event (SAE) Data Sources
In the regulatory world of patient safety and pharmacovigilance, whether during clinical trials or post-market surveillance, SAEs that affect participants must be collected and, if certain criteria are met, reported to the FDA and other regulatory authorities. SAEs are often entered into multiple databases by various users, resulting in possible data discrepancies and quality loss. Efforts have been made to reconcile the SAE data between databases, but there is no industry standard regarding the methodology or tool employed for this task. Some organizations still reconcile the data manually, with visual inspections and vocal verification. Not only is this laborious and error-prone, but it also becomes prohibitive when the data reach hundreds of records. We devised an efficient algorithm using SAS® to compare two data sources automatically. Our algorithm identifies matched, discrepant, and unpaired SAE records. Additionally, it employs a user-supplied list of synonyms to find non-identical but relevant matches. First, the two data sources are combined and sorted by key fields such as Subject ID, Onset Date, Stop Date, and Event Term. Record counts and Levenshtein edit distances are calculated within certain groups to assist with sorting and matching. This combined record list is then fed into a DATA step to decide whether a record is paired or unpaired. For an unpaired record, a stub record with all fields set to '?' is generated as a matching placeholder. Each record is written to one of two data sets. Later, the data sets are tagged and pulled into comparison logic using hash objects, which enables field-by-field comparison and displays discrepancies in a clean format for each field. Identical fields or columns are cleared or removed for clarity. The result is a streamlined and user-friendly process that allows for fast and easy SAE reconciliation.
Read the paper (PDF)
Zhou (Tom) Hui, AstraZeneca
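The paper's full algorithm is more elaborate, but a minimal sketch of two of its building blocks, loading one SAE source into a hash object and scoring candidate matches with a Levenshtein edit distance via the COMPLEV function, might look like the following (the data set and variable names are hypothetical, not taken from the paper):

  data paired unpaired;
     length edc_term $200;
     if _n_ = 1 then do;
        declare hash edc(dataset: "work.edc_sae");   /* one SAE source held in memory */
        edc.defineKey("subjid");
        edc.defineData("edc_term");
        edc.defineDone();
        call missing(edc_term);
     end;
     set work.argus_sae;                             /* the other SAE source */
     if edc.find() = 0 then do;                      /* same subject found in the hash */
        edit_dist = complev(event_term, edc_term);   /* Levenshtein edit distance */
        if edit_dist <= 3 then output paired;        /* tolerance of 3 is an assumption */
        else output unpaired;
     end;
     else output unpaired;
  run;

A full implementation would also handle multiple events per subject (for example, with a multidata hash), apply the synonym list, and generate the '?' stub records described in the abstract.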
Session 11001-2016:
Analysis of IMDb Reviews for Movies and Television Series Using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
Movie reviews provide crucial information and act as an important factor when deciding whether to spend money on seeing a film in the theater. Each review reflects an individual's take on the movie, and there are often contrasting reviews for the same movie. Going through each review may create confusion in the mind of a reader. Analyzing all the movie reviews and generating a quick summary that describes the performance, direction, and screenplay, among other aspects, is helpful to readers in understanding the sentiments of movie reviewers and in deciding whether they would like to watch the movie in the theaters. In this paper, we demonstrate how SAS® Enterprise Miner™ nodes enable us to generate a quick summary of the terms and their relationship with each other, which describes the various aspects of a movie. The Text Cluster and Text Topic nodes are used to generate groups with similar subjects such as genres of movies, acting, and music. We use SAS® Sentiment Analysis Studio to build models that help us classify 10,000 reviews, where each review is a separate document and the reviews are equally split into good or bad. The Smoothed Relative Frequency and Chi-square model is found to be the best model, with an overall precision of 78.37%. As soon as the latest reviews are out, such analysis can be performed to help viewers quickly grasp the sentiments of the reviews and decide if the movie is worth watching.
Read the paper (PDF)
Ameya Jadhavar, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Prithvi Raj Sirolikar, Oklahoma State University
Session 11769-2016:
Analytics and Data Management in a Box: Dramatically Increase Performance
Organizations are collecting more structured and unstructured data than ever before, and it is presenting great opportunities and challenges to analyze all of that complex data. In this volatile and competitive economy, there has never been a bigger need for proactive and agile strategies to overcome these challenges by applying the analytics directly to the data rather than shuffling data around. Teradata and SAS® have joined forces to revolutionize your business by providing enterprise analytics in a harmonious data management platform to deliver critical strategic insights by applying advanced analytics inside the database or data warehouse where the vast volume of data is fed and resides. This paper highlights how Teradata and SAS strategically integrate in-memory, in-database, and the Teradata relational database management system (RDBMS) in a box, giving IT departments a simplified footprint and reporting infrastructure, as well as lower total cost of ownership (TCO), while offering SAS users increased analytic performance by eliminating the shuffling of data over external network connections.
Read the paper (PDF)
Greg Otto, Teradata Corporation
Tho Nguyen, Teradata
Paul Segal, Teradata
Session 5720-2016:
Arrays and DO Loops: Applications to Health-Care Diagnosis Fields
Arrays and DO loops have been used for years by SAS® programmers who work with diagnosis fields in health-care data, and the opportunity to use these techniques has only grown since the launch of the Affordable Care Act (ACA) in the United States. Users new to SAS or to the health-care field may find an overview of existing (as well as new) applications helpful. Risk-adjustment software, including the publicly available Health and Human Services (HHS) risk software that uses SAS and was released as part of the ACA implementation, is one example of code that is significantly improved by the use of arrays. Similar projects might include evaluations of diagnostic persistence, comparisons of diagnostic frequency or intensity between providers, and checks for unusual clusters of diagnosed conditions. This session reviews examples suitable for intermediate SAS users, including the application of two-dimensional arrays to diagnosis fields.
Read the paper (PDF)
Ryan Ferland, Blue Cross Blue Shield of Arizona
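For readers new to the technique, a minimal sketch of the basic pattern (hypothetical variable names and a single example condition, not code from the paper) looks like this:

  data flagged;
     set claims;                                 /* one row per claim, 25 diagnosis fields */
     array dx{25} $ dx1-dx25;
     diabetes = 0;
     do i = 1 to dim(dx);
        if dx{i} in: ('250') then diabetes = 1;  /* any ICD-9 250.xx (diabetes) code */
     end;
     drop i;
  run;

The paper extends the same idea, including the use of two-dimensional arrays over diagnosis fields.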
Session 11761-2016:
Arrest Prediction and Analysis based on a Data Mining Approach
In light of past indiscretions by police departments leading to riots in major U.S. cities, it is important to assess the factors leading to the arrest of a citizen. The police department should understand the arrest situation before it can make any decision and defuse any possibility of a riot or similar incident. Many such incidents in the past are a result of the police department failing to make the right decisions in emergencies. The primary objective of this study is to understand the key factors that impact the arrest of people in New York City and the various crimes that have taken place in the region in the last two years. The study explores the different regions of New York City where the crimes have taken place and also the timing of the incidents leading to them. The data set consists of 111 variables and 273,430 observations from the years 2013 to 2014, with a binary target variable, arrest made. The percentages of Yes and No values of arrest made are 6% and 94%, respectively. This study analyzes the factors behind arrests, which include suspicion, timing of frisk activities, identification of threats, basis of search, whether the arrest required physical intervention, location of the arrest, and whether the frisk officer needs any support. Decision tree, regression, and neural network models were built, and the decision tree turned out to be the best model, with a validation misclassification rate of 6.47% and a model sensitivity of 73.17%. The results from the decision tree model showed that suspicion count, search basis count, summons leading to violence, ethnicity, and use of force are some of the important factors influencing the arrest of a person.
Read the paper (PDF)
Karan Rudra, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Maitreya Kadiyala, Oklahoma State University
Session 3660-2016:
Automating Load for Multiple Tables in SAS® Visual Analytics
The SAS® Visual Analytics autoload feature loads any files placed in the autoload location into the public SAS® LASR™ Analytic Server. But what if we want to autoload tables into a private SAS LASR Analytic Server, load database tables into SAS Visual Analytics, and also reload them every day so that the tables reflect any changes in the database? One option is to create a query in SAS® Visual Data Builder and schedule the table for load into the SAS LASR Analytic Server, but in this case a query has to be created for each table that has to be loaded. It would be much more efficient if all tables could be loaded simultaneously; that would make the job a lot easier. This paper provides a custom solution for autoloading the tables in one step, so that tables in SAS Visual Analytics don't have to be reloaded manually and multiple queries don't have to be rerun whenever there is a change in the source data. This solution automatically unloads and reloads all the tables listed in a text file every day so that the data available in SAS Visual Analytics is always up to date. This paper focuses on autoloading database tables into a private SAS LASR Analytic Server in SAS Visual Analytics hosted in a UNIX environment and uses the SASIOLA engine to perform the load. The process involves four steps: initial load, unload, reload, and scheduling of the unload and reload. This paper provides the scripts associated with each step. The initial load needs to be performed only once. The unload and reload have to be performed periodically to refresh the data source, so these steps have to be scheduled. Once the tables are loaded, a description is also added to the tables detailing the name of the table, the time of the load, and the user ID performing the load, similar to the description added when loading tables manually into the SAS LASR Analytic Server using SAS Visual Analytics Administrator.
Read the paper (PDF) | Watch the recording
Swetha Vuppalanchi, SAS Institute Inc.
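As a rough illustration of the reload step described above (not the paper's actual scripts), a SASIOLA libref can drop an in-memory table and then load a fresh copy from the source database; the host, port, tag, credentials, and table names below are hypothetical:

  libname lasrlib sasiola host="lasr-host.example.com" port=10011 tag="hps";
  libname srcdb oracle user=dbuser password=dbpass path=proddb;   /* placeholder credentials */

  proc datasets lib=lasrlib nolist;
     delete customer;                    /* unload the existing in-memory table */
  quit;

  data lasrlib.customer(label="Loaded &sysdate9 &systime by &sysuserid");
     set srcdb.customer;                 /* reload a fresh copy from the database */
  run;

Wrapping steps like these in a script that reads the table list from a text file, and scheduling that script in the UNIX environment, covers the unload, reload, and scheduling steps described above.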
B
Session 4680-2016:
Becoming a Unicorn, a Leprechaun, or a Good Marketing Analyst: Fairy Tales Really Can Come True
When you were a kid, did you dress up as a unicorn for Halloween? Did you ever imagine you were a fairy living at the bottom of your garden? Maybe your dreams can come true! Modern-day marketers need a variety of complementary (and sometimes conflicting) skill sets. Forbes and others have started talking about unicorns, a rare breed of marketer--marketing technologists who understand both marketing and marketing technology. A good marketing analyst is part of this new breed; a unicorn isn't a horse with a horn but truly a new breed. It is no longer enough to be good at coding in SAS®--or a whiz with Excel--and to know a few marketing buzzwords. In this session, we explore the skills an analyst needs to turn himself or herself into one of these mythical, highly sought-after creatures.
Read the paper (PDF) | Watch the recording
Emma Warrillow, Data Insight Group Inc. (DiG)
C
Session 10721-2016:
Can You Stop the Guesswork in Your Marketing Budget Allocation? Marketing Mixed Modeling Using SAS® Can Help!
Even though marketing is inevitable in every business, every year the marketing budget is limited and prudent fund allocations are required to optimize marketing investment. In many businesses, the marketing fund is allocated based on the marketing manager's experience, departmental budget allocation rules, and sometimes 'gut feelings' of business leaders. Those traditional ways of budget allocation yield suboptimal results and in many cases lead to money being wasted on irrelevant marketing efforts. Marketing mixed models can be used to understand the effects of marketing activities and identify the key marketing efforts that drive the most sales among a group of competing marketing activities. The results can be used in marketing budget allocation to take out the guesswork that typically goes into the budget allocation. In this paper, we illustrate practical methods for developing and implementing marketing mixed modeling using SAS® procedures. Real-life challenges of marketing mixed model development and execution are discussed, and several recommendations are provided to overcome some of those challenges.
Read the paper (PDF)
Delali Agbenyegah, Alliance Data Systems
Session 1940-2016:
Case Study: SAS® Visual Analytics Dashboard for Pollution Analysis
This paper is a case study that explores the analytical and reporting capabilities of SAS® Visual Analytics to perform data exploration, determine patterns and trends, and create data visualizations to generate extensive dashboard reports, using the open pollution data available from the United States Environmental Protection Agency (EPA). Data collection agencies report their data to the EPA via the Air Quality System (AQS). The EPA makes available several types of aggregate (summary) data sets, such as daily and annual pollutant summaries, in CSV format for public use. We intend to demonstrate SAS Visual Analytics capabilities by using pollution data to create visualizations that compare Air Quality Index (AQI) values for multiple pollutants by location and time period, generate a time series plot by location and time period, compare 8-hour ozone 'exceedances' from this year with previous years, and perform other such analysis. The easy-to-use SAS Visual Analytics web-based interface is leveraged to explore patterns in the pollutant data to obtain insightful information. SAS® Visual Data Builder is used to summarize data, join data sets, and enhance the predictive power within the data. SAS® Visual Analytics Explorer is used to explore data, to create calculated data items and aggregated measures, and to define geography items. Visualizations such as charts, bar graphs, geo maps, tree maps, correlation matrices, and other graphs are created to visualize information about the pollutants contaminating the environment; hierarchies are derived from date and time items and across geographies to allow rolling up the data. Various reports are designed for pollution analysis. Viewing on a mobile device such as an iPad is also explored. In conclusion, this paper attempts to demonstrate the use of SAS Visual Analytics to determine the impact of pollution on the environment over time using various visualizations and graphs.
Read the paper (PDF)
Viraj Kumbhakarna, MUFG Union Bank
Session 11883-2016:
Cell Suppression In SAS® Visual Analytics: A Primer
In healthcare and other fields, the importance of cell suppression as a means to avoid unintended disclosure or identification of Protected Health Information (PHI) or any other sensitive data has grown as we move toward dynamic query systems and reports. Organizations such as the Centers for Medicare & Medicaid Services (CMS), the National Center for Health Statistics (NCHS), and the Privacy Technical Assistance Center (PTAC) have outlined best practices to help researchers, analysts, report writers, and others avoid unintended disclosure for privacy reasons and to maintain statistical validity. Cell suppression is a crucial consideration during report design and can be a substantial hurdle in the dissemination of information. Often, the goal is to display as much data as possible without enabling the identification of individuals while maintaining statistical validity. When designing reports using SAS® Visual Analytics, suppression can be handled in multiple ways. One way is to suppress the data before loading it into the SAS® LASR™ Analytic Server. This has the drawback that a user cannot take full advantage of the dynamic filtering and aggregation available with SAS Visual Analytics. Another method is to create formulas that govern how SAS Visual Analytics displays cells within a table (crosstab) or bars within a chart. The logic can be complex and can meet a variety of needs. This presentation walks through examples of the latter methodology, namely, the creation of suppression formulas and how to apply them to report objects.
Read the paper (PDF)
Marc Flore, University of New Hampshire
Session 12523-2016:
Classifying Fatal Automobile Accidents in the US, 2010-2013: Using SAS® Enterprise Miner™ to Understand and Reduce Fatalities
We set out to model two of the leading causes of fatal automobile accidents in America: drunk driving and speeding. We created a decision tree for each of these causes that uses situational conditions, such as time of day and type of road, to classify historical accidents. Using this model, a law enforcement or town official can decide whether a speed trap or a DUI checkpoint would be most effective in a particular situation. This proof of concept can easily be applied to specific jurisdictions for customized solutions.
Read the paper (PDF)
Catherine LaChapelle, NC State University
Daniel Brannock, North Carolina State University
Aric LaBarr, Institute for Advanced Analytics
Craig Shaver, North Carolina State University
Session 4002-2016:
Color, Rank, Count, Name: Controlling It All in PROC REPORT
Managing and coordinating various aspects of a report can be challenging. This is especially true when the structure and composition of the report are data driven. For complex reports, the inclusion of attributes such as color, labeling, and the ordering of items complicates the coding process. Fortunately we have some powerful reporting tools in SAS® that allow the process to be automated to a great extent. In the example presented in this paper, we are tasked with generating an Excel spreadsheet that ranks types of injuries within age groups. A given injury type is to be assigned a constant color regardless of its rank, and the labeling is to include not only the injury label but the actual count as well. Of course the user needs to be able to control such things as the age groups, color selection and order, and number of desired ranks.
Read the paper (PDF) | Watch the recording
Art Carpenter, California Occidental Consultants
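As a small illustration of the kind of control discussed here, a COMPUTE block with CALL DEFINE can assign a background color that stays tied to the injury type rather than to its rank (the categories and colors below are made up, not from the paper):

  proc report data=injury_counts;
     column agegroup injury count;
     define agegroup / group;
     define injury   / group;
     define count    / analysis sum 'Count';
     compute injury;
        if injury = 'Fall' then
           call define(_row_, 'style', 'style=[background=cx4F81BD]');
        else if injury = 'Burn' then
           call define(_row_, 'style', 'style=[background=cxC0504D]');
     endcomp;
  run;

In the data-driven version the paper describes, the color assignments, labels, and number of ranks would themselves be generated from the data, for example with macro loops or CALL EXECUTE, rather than hard-coded as above.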
Session 7882-2016:
Comparative Regional Analysis of Bacterial Pneumonia Readmission Patients in the Medicare Category
One of the major diseases that records a high number of readmissions is bacterial pneumonia in Medicare patients. This study aims at comparing and contrasting the Northeast and South regions of the United States based on the factors causing 30-day readmissions. The study also identifies some of the ICD-9 medical procedures associated with those readmissions. Using SAS® Enterprise Guide® 7.1 and SAS® Enterprise Miner™ 14.1, this study analyzes patient and hospital demographics of the Northeast and South regions of the United States using the Cerner Health Facts Database. Further, the study suggests some preventive measures to reduce readmissions. The 30-day readmissions are computed based on admission and discharge dates from 2003 until 2013. Using clustering, various hospitals, along with discharge disposition levels (where a patient is sent after discharge), are grouped. In both regions, patients who are discharged to home have shown significantly lower chances of readmission. The odds ratio = 0.562 for the Northeast region and for the South region; all other disposition levels have significantly higher odds (> 1) compared to discharged to home. For the South region, females have around a 53 percent higher chance of readmission compared to males (odds ratio = 1.535). Some of the hospital groups also have higher readmission counts. The association analysis using the Market Basket node finds catheterization procedures to be highly significant (Lift = 1.25 for the Northeast and 3.84 for the South; Confidence = 52.94% for the Northeast and 85.71% for the South) in bacterial pneumonia readmissions. Research has found that during these procedures, patients are highly susceptible to acquiring Methicillin-resistant Staphylococcus aureus (MRSA) bacteria, which causes Methicillin-susceptible pneumonia. Providing timely follow-up for patients who undergo these procedures might limit readmissions. These patients might also be discharged to home under medical supervision, as such patients have shown significantly lower chances of readmission.
Read the paper (PDF)
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Dr. William Paiva, Center for Health Systems Innovation, OSU, Tulsa
Aditya Sharma, Oklahoma State University
Session 11778-2016:
Comparative Study of PROC EXPORT and Output Delivery System
Suppose that you have a very large data set with specific values in one of its columns, and you want to split the entire data set into different comma-separated values (CSV) files based on the values present in that column. You might write code using IF-THEN/ELSE conditions in SAS®, along with some OUTPUT statements, to split the data set. But converting each of the separated data sets into a CSV file through the conventional manual process is frustrating. This paper presents a comparative study of two ways to automate the whole tedious process with SAS code: using the SAS macro facility together with the EXPORT procedure, and using the Output Delivery System (ODS) together with the TABULATE procedure.
Read the paper (PDF) | Watch the recording
Saurabh Nandy, Oklahoma State University
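A minimal sketch of the macro-plus-PROC EXPORT half of the comparison might look like the following (the data set, column, and output directory are placeholders):

  %macro split_to_csv(data=, byvar=, outdir=);
     proc sql noprint;
        select distinct &byvar into :val1-:val999 from &data;
     quit;
     %let nvals = &sqlobs;
     %do i = 1 %to &nvals;
        proc export data=&data(where=(&byvar = "&&val&i"))
                    outfile="&outdir/&&val&i..csv"
                    dbms=csv replace;
        run;
     %end;
  %mend split_to_csv;

  %split_to_csv(data=sashelp.cars, byvar=make, outdir=/tmp)

The ODS alternative replaces the repeated PROC EXPORT calls with a PROC TABULATE step wrapped in a CSV-producing ODS destination (such as CSVALL), which is one of the trade-offs the paper compares.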
Session 8360-2016:
Creating a Better SAS® Visual Analytics User Experience while Working under HIPAA Data Restrictions
The HIPAA Privacy Rule can restrict geographic and demographic data used in health-care analytics. After reviewing the HIPAA requirements for de-identification of health-care data used in research, this poster guides the beginning SAS® Visual Analytics user through different options to create a better user experience. This poster presents a variety of data visualizations that the analyst will encounter when describing a health-care population. We explore the different options SAS Visual Analytics offers and provide tips on data preparation prior to using SAS® Visual Analytics Designer. Among the topics we cover are SAS Visual Analytics Designer object options (including geo bubble map, geo region map, crosstab, and treemap), tips for preparing your data for use in SAS Visual Analytics, tips on filtering data after it has been loaded into SAS Visual Analytics, and more.
View the e-poster or slides (PDF)
Jessica Wegner, Optum
Margaret Burgess, Optum
Catherine Olson, Optum
Session 10180-2016:
Creating and Sharing SAS® ODS Graphics with a Code Playground Based on Microsoft Office
You've heard that SAS® Output Delivery System (ODS) Graphics provides a powerful and detailed syntax for creating custom graphs, but for whatever reason you still haven't added them to your bag of SAS® tricks. Let's change that! We will also present a code playground based on Microsoft Office that will enable you to quickly try out a variety of prepared SAS ODS Graphics examples, tweak the code, and see the results--all directly from Microsoft Excel. More experienced users will also find the code playground (which is similar in spirit to Google Code Playground or JSFiddle) useful for compiling SAS ODS Graphics code snippets for themselves and for sharing with colleagues, as well as for creating dashboards hosted by Microsoft Excel or Microsoft PowerPoint that contain precisely sized and placed SAS graphics.
Read the paper (PDF) | Watch the recording
Ted Conway, Discover Financial Services
Ezequiel Torres, MPA Healthcare Solutions, Inc.
Session 11845-2016:
Credit Card Interest and Transactional Income Prediction with an Account-Level Transition Panel Data Model Using SAS/STAT® and SAS/ETS® Software
Credit card profitability prediction is a complex problem because of the variety of card holders' behavior patterns and the different sources of interest and transactional income. Each consumer account can move to a number of states such as inactive, transactor, revolver, delinquent, or defaulted. This paper i) describes an approach to credit card account-level profitability estimation based on the multistate and multistage conditional probabilities models and different types of income estimation, and ii) compares methods for the most efficient and accurate estimation. We use application, behavioral, card state dynamics, and macroeconomic characteristics, and their combinations as predictors. We use different types of logistic regression such as multinomial logistic regression, ordered logistic regression, and multistage conditional binary logistic regression with the LOGISTIC procedure for states transition probability estimation. The state transition probabilities are used as weights for interest rate and non-interest income models (which one is applied depends on the account state). Thus, the scoring model is split according to the customer behavior segment and the source of generated income. The total income consists of interest and non-interest income. Interest income is estimated with the credit limit utilization rate models. We test and compare five proportion models with the NLMIXED, LOGISTIC, and REG procedures in SAS/STAT® software. Non-interest income depends on the probability of being in a particular state, the two-stage model of conditional probability to make a point-of-sales transaction (POS) or cash withdrawal (ATM), and the amount of income generated by this transaction. We use the LOGISTIC procedure for conditional probability prediction and the GLIMMIX and PANEL procedures for direct amount estimation with pooled and random-effect panel data. The validation results confirm that traditional techniques can be effectively applied to complex tasks with many parameters and multilevel business logic. The model is used in credit limit management, risk prediction, and client behavior analytics.
Read the paper (PDF)
Denys Osipenko, the University of Edinburgh Business School
Session SAS6685-2016:
Credit Risk Modeling in a New Era
The recent advances in regulatory stress testing, including stress testing regulated by Comprehensive Capital Analysis and Review (CCAR) in the US, the Prudential Regulation Authority (PRA) in the UK, and the European Banking Authority in the EU, as well as the new international accounting requirement known as IFRS 9 (International Financial Reporting Standard), all pose new challenges to credit risk modeling. The increasing sophistication of the models that are supposed to cover all the material risks in the underlying assets in various economic scenarios makes models harder to implement. Banks are spending a lot of resources on the model implementation but are still facing issues due to long time to deployment and disconnection between the model development and implementation teams. Models are also required at a more granular level, in many cases, down to the trade and account levels. Efficient model execution becomes valuable for banks to get timely response to the analysis requests. At the same time, models are subject to more stringent internal and external scrutiny. This paper introduces a suite of risk modeling solutions from credit risk modeling leader SAS® to help banks overcome these new challenges and be competent to meet the regulatory requirements.
Read the paper (PDF)
Wei Chen, SAS
Martim Rocha, SAS
Jimmy Skoglund, SAS
Session SAS5245-2016:
Custom Risk Metrics with SAS® High-Performance Risk
There are standard risk metrics financial institutions use to assess the risk of a portfolio. These include well known measures like value at risk and expected shortfall and related measures like contribution value at risk. While there are industry-standard approaches for calculating these measures, it is often the case that financial institutions have their own methodologies. Further, financial institutions write their own measures, in addition to the common risk measures. SAS® High-Performance Risk comes equipped with over 20 risk measures that use standard methodology, but the product also allows customers to define their own risk measures. These user-defined statistics are treated the same way as the built-in measures, but the logic is specified by the customer. This paper leads the user through the creation of custom risk metrics using the HPRISK procedure.
Read the paper (PDF)
Katherine Taylor, SAS
Steven Miles, SAS
Session 4940-2016:
Customer Acquisition: Targeting for Long-term Retention
Customer retention is a primary concern for businesses that rely on a subscription revenue model. It is common for marketers of subscription-based offerings to develop predictive models that are aimed at identifying subscribers who have the highest risk of attrition. With these likely unsubscribers identified, marketers then attempt to forestall customer termination by using a variety of retention enhancement tactics, which might include free offers, customer training, satisfaction surveys, or other measures. Although customer retention is always a worthy pursuit, it is often expensive to retain subscribers. In many cases, associated retention programs simply prove unprofitable over time because the overall cost of such programs frequently exceeds the lifetime value of the cohort of unsubscribed customers. Generally, it is more profitable to focus resources on identifying and marketing to a targeted prospective customer. When the target marketing strategy focuses on identifying prospects who are most likely to subscribe over the long term, the need for special retention marketing efforts decreases sharply. This paper describes results of an analytically driven targeting approach that is aimed at inviting new customers to a milk and grocery home-delivery service, with the promise of attracting only those prospects who are expected to exhibit high long-term retention rates.
Read the paper (PDF)
D
Session 7480-2016:
"Don't You SAS® Me!": Fostering SAS® Adoption in an Excel-based Organization
In any organization where people work with data, it is extremely unlikely that there will be only one way of doing things. Things like divisional silos and differences in education, training, and subject matter often result in a diversity of tools and levels of expertise. One situation that frequently arises is that of 'code versus click.' Let's call it the difference between code-based, 'power user' data tools and simpler, purely graphic point-and-click tools such as Microsoft Excel. Even though the work itself might be quite similar, differences in analysis tools often mean differences in vocabulary and experience, and it can be difficult to convert users of one tool to another. This discussion will highlight the potential challenges of SAS® adoption in an Excel-based workplace and propose strategies to gain new SAS advocates in your organization.
Read the paper (PDF)
Andrew Clapson, MD Financial Management
Session SAS3061-2016:
Dynamic Forecast Aggregation
Many organizations need to report forecasts of large numbers of time series at various levels of aggregation. Numerous model-based forecasts that are statistically generated at the lowest level of aggregation need to be combined to form an aggregate forecast that is not required to follow a fixed hierarchy. The forecasts need to be dynamically aggregated according to any subset of the time series, such as from a query. This paper proposes a technique for large-scale automatic forecast aggregation and uses SAS® Forecast Server and SAS/ETS® software to demonstrate this technique.
Read the paper (PDF)
Michael Leonard, SAS
E
Session 9180-2016:
Efficiently Create Rates over Different Time Periods (PROC MEANS and PROC EXPAND)
This session illustrates how to quickly create rates over a specified period of time, using the MEANS and EXPAND procedures. For example, do you want to know how to use the power of SAS® to create a year-to-date, rolling 12-month, or monthly rate? At Kaiser Permanente, we use this technique to develop Emergency Department (ED) use rates, ED admit rates, patient day rates, readmission rates, and more. A powerful function of PROC MEANS, given a database table with several dimensions and one or more facts, is to perform a mathematical calculation on fact columns across several different combinations of dimensions. For example, if a membership database table exists with the dimensions member ID, year-month, line of business, medical office building, and age grouping, PROC MEANS can easily determine and output the count of members by every possible dimension combination into a SAS data set. Likewise, if a hospital visit database table exists with the same dimensions and facts, PROC MEANS can output the number of hospital visits by the dimension combinations into a second SAS data set. With the power of PROC EXPAND, each of the data sets above, once sorted properly, can have columns added, which calculate total members and total hospital visits by a time dimension of the analyst's choice. Common time dimensions used for Kaiser Permanente's utilization rates are monthly, rolling 12-months, and year-to-date. The resulting membership and hospital visit data sets can be joined with a MERGE statement, and simple division produces a rate for the given dimensions.
Read the paper (PDF) | Watch the recording
Thomas Gant, Kaiser Permanente
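A minimal sketch of the two-step pattern described above, monthly counts with PROC MEANS followed by a rolling 12-month sum with PROC EXPAND, might look like this (data set and variable names are hypothetical):

  proc means data=membership nway noprint;
     class lob yearmonth;
     var member;                                  /* 0/1 flag per member-month row */
     output out=members_by_month(drop=_type_ _freq_) sum=member_count;
  run;

  proc sort data=members_by_month;
     by lob yearmonth;
  run;

  proc expand data=members_by_month out=members_rolling method=none;
     by lob;
     convert member_count = member_count_12mo / transformout=(movsum 12);
  run;

Merging the analogous hospital-visit summary by line of business and month and dividing visits by members then yields the rolling 12-month rate described in the abstract.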
Session 9760-2016:
Evaluation of PROC IRT Procedure for Item Response Modeling
The experimental item response theory procedure (PROC IRT), included in the recently released SAS/STAT® 13.1 and 13.2, enables item response modeling and trait estimation in SAS®. PROC IRT enables you to perform item parameter calibration and latent trait estimation for a wide spectrum of educational and psychological research applications. This paper evaluates the performance of PROC IRT in item parameter recovery under various testing conditions. The pros and cons of PROC IRT versus BILOG-MG 3.0 are presented. For practitioners of IRT models, the development of IRT-related analysis in SAS is inspiring and offers a great choice to the growing population of IRT users. A shift to SAS can be beneficial based on several of its features: flexibility in data management, power in data analysis, convenient output delivery, and increasing richness in graphical presentation. It is critical to ensure the quality of item parameter calibration and trait estimation before you can continue with other components, such as test scoring, test form construction, IRT equating, and so on.
View the e-poster or slides (PDF)
Yi-Fang Wu, ACT, Inc.
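For orientation, the basic calibration-and-scoring call evaluated in the paper can be sketched as follows (the data set and item names are placeholders; by default PROC IRT fits a graded response model, which reduces to a two-parameter logistic model for binary items):

  proc irt data=item_responses out=theta_scores;
     var q1-q30;          /* item responses to calibrate and score */
  run;

The MODEL statement's RESFUNC= option selects other response functions, such as Rasch or three-parameter models.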
Session 9600-2016:
Evaluation of a Customer's Life Cycle Time-to-Offer Cross-sales in a Bank, Based on the Behavior Score Using Logit Models & DTMC in SAS®
Scotiabank's Colombian division, Colpatria, is the national leader in credit cards, with more than 1,600,000 active cards, equivalent to a portfolio of approximately 700 million dollars. The behavior score is used to offer credit cards through a cross-sell process, which happens only if customers have completed six months on books after acquiring their first product with the bank. This is the minimum period of time required by the behavioral Artificial Neural Network (ANN) model. The six-months-on-books internal policy assumes that the client matures adequately over this period, but this has never been proven. This research aims to evaluate that hypothesis and to calculate the appropriate time to offer cross-sales to new customers using Logistic Regression (Logit), while also segmenting these sales targets by their level of seniority using Discrete-Time Markov Chains (DTMC).
Read the paper (PDF)
Oscar Javier Cortés Arrigui, Scotiabank - Colpatria
Miguel Angel Diaz Rodriguez, Scotiabank - Colpatria
Session SAS6380-2016:
Extending the Armed Conflict Location & Event Data Project with SAS® Contextual Analysis
The Armed Conflict Location & Event Data Project (ACLED) is a comprehensive public collection of conflict data for developing states. As such, the data is instrumental in informing humanitarian and development work in crisis and conflict-affected regions. The ACLED project currently manually codes for eight types of events from a summarized event description, which is a time-consuming process. In addition, when determining the root causes of the conflict across developing states, these fixed event types are limiting. How can researchers get to a deeper level of analysis? This paper showcases a repeatable combination of exploratory and classification-based text analytics provided by SAS® Contextual Analysis, applied to the publicly available ACLED for African states. We depict the steps necessary to determine (using semi-automated processes) the next deeper level of themes associated with each of the coded event types; for example, while there is a broad protests and riots event type, how many of those are associated individually with rallies, demonstrations, marches, strikes, and gatherings? We also explore how to use predictive models to highlight the areas that require aid next. We prepare the data for use in SAS® Visual Analytics and SAS® Visual Statistics using SAS® DS2 code and enable dynamic explorations of the new event sub-types. This ultimately provides the end-user analyst with additional layers of data that can be used to detect trends and deploy humanitarian aid where limited resources are needed the most.
Read the paper (PDF)
Tom Sabo, SAS
F
Session 7860-2016:
Finding and Evaluating Multiple Candidate Models for Logistic Regression
Logistic regression models are commonly used in direct marketing and consumer finance applications. In this context, the paper discusses two topics related to the fitting and evaluation of logistic regression models. Topic 1 is a comparison of two methods for finding multiple candidate models. The first method is the familiar best-subsets approach. Best subsets is then compared to a proposed new method that combines the models produced by backward and forward selection plus the predictors considered by backward and forward selection. This second method uses the HPLOGISTIC procedure with selection of models by the Schwarz Bayesian criterion (SBC). Topic 2 is a discussion of model evaluation statistics that measure predictive accuracy and goodness of fit in support of the choice of a final model. Base SAS® and SAS/STAT® software are used.
Read the paper (PDF) | Download the data file (ZIP)
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
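A minimal sketch of the HPLOGISTIC call for the second method, with model selection governed by SBC, might look like this (the response and predictors are placeholders):

  proc hplogistic data=modeling_sample;
     model respond(event='1') = x1-x20;
     selection method=backward(select=sbc choose=sbc stop=sbc);
  run;

The candidate models visited along the selection path can then be compared with the evaluation statistics for predictive accuracy and goodness of fit discussed in Topic 2.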
Session SAS4720-2016:
Fitting Multilevel Hierarchical Mixed Models Using PROC NLMIXED
Hierarchical nonlinear mixed models are complex models that occur naturally in many fields. The NLMIXED procedure's ability to fit linear or nonlinear models with standard or general distributions enables you to fit a wide range of such models. SAS/STAT® 13.2 enhanced PROC NLMIXED to support multiple RANDOM statements, enabling you to fit nested multilevel mixed models. This paper uses an example to illustrate the new functionality.
Read the paper (PDF)
Raghavendra Kurada, SAS
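For readers who have not seen the multilevel syntax, a minimal sketch of a random-intercept logistic model for students nested within classrooms nested within schools (all names hypothetical) is:

  proc nlmixed data=scores;
     parms b0=0 b1=0 s2school=1 s2class=1;
     eta = b0 + b1*x + u_school + u_class;
     p = logistic(eta);                          /* inverse logit */
     model y ~ binary(p);
     random u_school ~ normal(0, s2school) subject=school;
     random u_class  ~ normal(0, s2class)  subject=class(school);   /* nested level */
  run;

The second RANDOM statement, with its nested SUBJECT= specification, is the multilevel capability added in SAS/STAT 13.2 that the paper illustrates.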
Session 3740-2016:
Flirting With Disaster: What We Can Learn From Analytical Failures
Analytics is now pervasive in our society, informing millions of decisions every year, and even making decisions without human intervention in automated systems. The implicit assumption is that all completed analytical projects are accurate and successful, but the reality is that some are not. Bad decisions that result from faulty analytics can be costly, embarrassing and even dangerous. By understanding the root causes of analytical failures, we can minimize the chances of repeating the mistakes of the past.
Read the paper (PDF)
Session 7881-2016:
Forecasting the Frequency of Terrorist Attacks in the United States and Predicting the Factors that Determine Success of Attacks
The importance of understanding terrorism in the United States assumed heightened prominence in the wake of the coordinated attacks of September 11, 2001. Yet, surprisingly little is known about the frequency of attacks that might happen in the future and the factors that lead to a successful terrorist attack. This research is aimed at forecasting the frequency of attacks per annum in the United States and determining the factors that contribute to an attack's success. Using the data acquired from the Global Terrorism Database (GTD), we tested our hypothesis by examining the frequency of attacks per annum in the United States from 1972 to 2014 using SAS® Enterprise Miner™ and SAS® Forecast Studio. The data set has 2,683 observations. Our forecasting model predicts that there could be as many as 16 attacks every year for the next 4 years. From our study of factors that contribute to the success of a terrorist attack, we discovered that attack type, weapons used in the attack, place of attack, and type of target play pivotal roles in determining the success of a terrorist attack. Results reveal that the government might be successful in averting assassination attempts but might fail to prevent armed assaults and facilities or infrastructure attacks. Therefore, additional security might be required for important facilities in order to prevent further attacks from being successful. Results further reveal that it is possible to reduce the forecasted number of attacks by raising defense spending and by putting an end to the raging war in the Middle East. We are currently working on gathering data to do in-depth analysis at the monthly and quarterly level.
View the e-poster or slides (PDF)
SriHarsha Ramineni, Oklahoma State University
Koteswara Rao Sriram, Oklahoma State University
Session SAS6421-2016:
From Sensors to Solutions: An Example of Analytics in Motion
Oil and gas wells encounter many types of adverse events, including unexpected shut-ins, scaling, rod pump failures, breakthrough, and fluid influx, to name just a few common ones. This paper presents a real-time event detection system that produces visual reports and alerts petroleum engineers when an event is detected in a well. This system uses advanced time series analysis on both high- and low-frequency surface and downhole measurements. It gives the petroleum engineer the ability to switch from passive surveillance of events, which then require remediation, to an active surveillance solution that enables the engineer to optimize well intervention strategies. Events can occur simultaneously or rapidly, or can be masked over long periods of time. We tested the proposed method on both simulated and publicly available data to demonstrate how a multitude of well events can be detected by using multiple models deployed into the data stream. In our demonstration, we show how our real-time system can detect a mud motor pressure failure resulting from a latent fluid overpressure event. Giving the driller advance warning of the motor state and its potential to fail can save operators millions of dollars a year by reducing the nonproductive time caused by these failures. Reducing nonproductive time is critical to reducing the operating costs of constructing a well and to improving cash flow by shortening the time to first oil.
Read the paper (PDF)
Milad Falahi, SAS
Gilbert Hernandez, SAS Institute
G
Session 11362-2016:
Generating Color Scales in SAS®: 256 Shades of RGB
Color is an important aspect of data visualization and provides an analyst with another tool for identifying data trends. But is the default option the best for every case? Default color scales may be familiar to us; however, they can have inherent flaws that skew our perception of the data. The impact of data visualizations can be considerably improved with just a little thought toward choosing the correct colors. Selecting an appropriate color map can be difficult, and you may decide that it may be easier to generate your own custom color scale. After a brief introduction to the red, green, and blue (RGB) color space, we discuss the strengths and weaknesses of some widely used color scales, as well as what to keep in mind when designing your own. A simple technique is presented, detailing how you can use SAS® to generate a number of color scales and apply them to your data. Using just a few graphics procedures, you can transform almost any complex data into an easily digestible set of visuals. The techniques used in this discussion were developed and tested using SAS® Enterprise Guide® 5.1.
Read the paper (PDF) | View the e-poster or slides (PDF)
Jeff Grant, Bank of Montreal
Mahmoud Mamlouk, BMO Harris Bank
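As a taste of the technique, a short DATA step can generate a 256-step gradient of CXrrggbb color codes that SAS graphics procedures understand (the blue-to-red ramp here is just an example):

  data ramp;
     length color $8;
     do i = 0 to 255;
        red   = i;
        green = 0;
        blue  = 255 - i;
        color = cats('CX', put(red, hex2.), put(green, hex2.), put(blue, hex2.));   /* e.g., CX0000FF ... CXFF0000 */
        output;
     end;
  run;

The resulting codes can then be fed into COLORS= style lists or attribute maps in place of a default color scale.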
H
Session 3620-2016:
Health Care's One Percenters: Hot-Spotting to Identify Areas of Need and Opportunity
Since Atul Gawande popularized the term in describing the work of Dr. Jeffrey Brenner in a New Yorker article, hot-spotting has been used in health care to describe the process of identifying super-utilizers of health care services, then defining intervention programs to coordinate and improve their care. According to Brenner's data from Camden, New Jersey, 1% of patients generate 30% of payments to hospitals, while 5% of patients generate 50% of payments. Analyzing administrative health care claims data, which contains information about diagnoses, treatments, costs, charges, and patient sociodemographic data, can be a useful way to identify super-users, as well as those who may be receiving inappropriate care. Both groups can be targeted for care management interventions. In this paper, techniques for patient outlier identification and prioritization are discussed using examples from private commercial and public health insurance claims data. The paper also describes techniques used with health care claims data to identify high-risk, high-cost patients and to generate analyses that can be used to prioritize patients for various interventions to improve their health.
Read the paper (PDF)
Paul LaBrec, 3M Health Information Systems
Session SAS3080-2016:
Holiday Demand Forecasting in the Utility Industry
Electric load forecasting is a complex problem that is linked with social activity considerations and variations in weather and climate. Furthermore, electric load is one of only a few goods that require the demand and supply to be balanced at almost real time (that is, there is almost no inventory). As a result, the utility industry could be much more sensitive to forecast error than many other industries. This electric load forecasting problem is even more challenging for holidays, which have limited historical data. Because of the limited holiday data, the forecast error for holidays is higher, on average, than it is for regular days. Identifying and using days in the history that are similar to holidays to help model the demand during holidays is not new for holiday demand forecasting in many industries. However, the electric demand in the utility industry is strongly affected by the interaction of weather conditions and social activities, making the problem even more dynamic. This paper describes an investigation into the various technologies that are used to identify days that are similar to holidays and using those days for holiday demand forecasting in the utility industry.
Read the paper (PDF)
Rain Xie, SAS
Alex Chien, SAS Institute
Session SAS4440-2016:
How Do My Neighbors Affect Me? SAS/ETS® Methods for Spatial Econometric Modeling
Contemporary data-collection processes usually involve recording information about the geographic location of each observation. This geospatial information provides modelers with opportunities to examine how the interaction of observations affects the outcome of interest. For example, it is likely that car sales from one auto dealership might depend on sales from a nearby dealership either because the two dealerships compete for the same customers or because of some form of unobserved heterogeneity common to both dealerships. Knowledge of the size and magnitude of the positive or negative spillover effect is important for creating pricing or promotional policies. This paper describes how geospatial methods are implemented in SAS/ETS® and illustrates some ways you can incorporate spatial data into your modeling toolkit.
Read the paper (PDF)
Guohui Wu, SAS
Jan Chvosta, SAS
I
Session SAS4040-2016:
Improving Health Care Quality with the RAREEVENTS Procedure
Statistical quality improvement is based on understanding process variation, which falls into two categories: variation that is natural and inherent to a process, and unusual variation due to specific causes that can be addressed. If you can distinguish between natural and unusual variation, you can take action to fix a broken process and avoid disrupting a stable process. A control chart is a tool that enables you to distinguish between the two types of variation. In many health care activities, carefully designed processes are in place to reduce variation and limit adverse events. The types of traditional control charts that are designed to monitor defect counts are not applicable to monitoring these rare events, because these charts tend to be overly sensitive, signaling unusual variation each time an event occurs. In contrast, specialized rare events charts are well suited to monitoring low-probability events. These charts have gained acceptance in health care quality improvement applications because of their ease of use and their suitability for processes that have low defect rates. The RAREEVENTS procedure, which is new in SAS/QC® 14.1, produces rare events charts. This paper presents an overview of PROC RAREEVENTS and illustrates how you can use rare events charts to improve health care quality.
Read the paper (PDF)
Bucky Ransdell, SAS
Session 10961-2016:
Improving Performance of Memory-Based Reasoning Model by Using Weight of Evidence Coded Categorical Variables
Memory-based reasoning (MBR) is an empirical classification method that works by comparing cases in hand with similar examples from the past and then applying that information to the new case. MBR modeling is based on the assumptions that the input variables are numeric, orthogonal to each other, and standardized. The latter two assumptions are taken care of by principal components transformation of raw variables and using the components instead of the raw variables as inputs to MBR. To satisfy the first assumption, the categorical variables are often dummy coded. This raises issues such as increasing dimensionality and overfitting in the training data by introducing discontinuity in the response surface relating inputs and target variables. The Weight of Evidence (WOE) method overcomes this challenge. This method measures the relative response of the target for each group level of a categorical variable. Then the levels are replaced by the pattern of response of the target variable within that category. The SAS® Enterprise Miner™ Interactive Grouping Node is used to achieve this. With this method, the categorical variables are converted into numeric variables. This paper demonstrates the improvement in performance of an MBR model when categorical variables are WOE coded. A credit screening data set obtained from SAS® Education that comprises 25 attributes for 3000 applicants is used for this study. Three different types of MBR models were built using the SAS Enterprise Miner MBR node to check the improvement in performance. The results show the MBR model with WOE coded categorical variables is the best, based on misclassification rate. Using this data, when WOE coding is adopted, the model misclassification rate decreases from 0.382 to 0.344 while the sensitivity of the model increases from 0.552 to 0.572.
Read the paper (PDF)
Vinoth Kumar Raja, West Corporation
Goutam Chakraborty, Oklahoma State University
Vignesh Dhanabal, Oklahoma State University
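Outside of SAS Enterprise Miner, the same WOE recoding can be sketched in a few PROC SQL steps (the data set, input, and target names below are hypothetical, and every level is assumed to contain both events and non-events so the log is defined):

  proc sql noprint;
     select sum(target=0), sum(target=1) into :tot_good, :tot_bad
     from train;

     create table woe_map as                         /* WOE = log(distr. of goods / distr. of bads) */
     select occupation,
            log( (sum(target=0)/&tot_good) / (sum(target=1)/&tot_bad) ) as woe
     from train
     group by occupation;

     create table train_woe as                       /* replace the level with its WOE value */
     select t.*, m.woe as occupation_woe
     from train t left join woe_map m
       on t.occupation = m.occupation;
  quit;

Within SAS Enterprise Miner itself, the Interactive Grouping node mentioned above produces this style of WOE variable (and can also group sparse levels) without hand-written SQL.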
K
Session 10680-2016:
Key Features in ODS Graphics for Efficient Clinical Graphing
High-quality, effective graphs not only enhance understanding of the data but also help regulators in the review and approval process. In recent SAS® releases, SAS has made significant progress toward more efficient graphing in the ODS Statistical Graphics (SG) procedures and the Graph Template Language (GTL). A variety of graphs can be quickly produced using convenient built-in options in the SG procedures. With graphical examples and a comparison between SG procedures and traditional SAS/GRAPH® procedures in reporting clinical trial data, this paper highlights several key features of ODS Graphics for efficiently producing sophisticated statistical graphs with more flexible and dynamic control of graphical presentation, including: 1) better control of axes with different scales and intervals; 2) flexible ways to control graph appearance; 3) plot overlays in single-cell or multi-cell graphs; 4) enhanced annotation; 5) classification panels of multiple plots with individualized labeling.
Read the paper (PDF) | Watch the recording
Yuxin (Ellen) Jiang, Biogen
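As one small example of features 1 and 3 in the list above, a single PROC SGPLOT step can overlay a series and its error bars while controlling each axis independently (the data set and variable names are hypothetical):

  proc sgplot data=lab_summary;
     series  x=visitnum y=mean_alt / group=trt markers;
     scatter x=visitnum y=mean_alt / group=trt
             yerrorlower=lcl yerrorupper=ucl;
     xaxis type=discrete label="Visit";
     yaxis type=log logbase=10 label="ALT (U/L)";
  run;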
L
Session 5500-2016:
Latent Class Analysis Using the LCA Procedure
This paper presents the use of latent class analysis (LCA) to identify a set of mutually exclusive latent classes of individuals based on their responses to a set of categorical, observed variables. The LCA procedure, a user-defined SAS® procedure for conducting LCA and LCA with covariates, is demonstrated using follow-up data on substance use from the Monitoring the Future panel data, a nationally representative sample of high school seniors who are followed at selected time points during adulthood. The demonstration includes guidance on data management prior to analysis, PROC LCA syntax requirements and options, and interpretation of output.
Read the paper (PDF)
Patricia Berglund, University of Michigan
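Assuming the downloadable PROC LCA from the Penn State Methodology Center and its documented NCLASS, ITEMS, CATEGORIES, COVARIATES, and SEED statements, a call of the kind demonstrated in the paper might be sketched as follows (the item names and class count are hypothetical):

  proc lca data=mtf_followup;
     nclass 4;                          /* number of latent classes */
     items alc_use cig_use mj_use;      /* categorical indicators */
     categories 2 2 2;                  /* response categories per item */
     covariates grade_point;            /* LCA with a covariate */
     seed 271828;
  run;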
Session 11221-2016:
Lead and Lags: Static and Dynamic Queues in the SAS® DATA Step
From stock price histories to hospital stay records, analysis of time series data often requires the use of lagged (and occasionally lead) values of one or more analysis variables. For the SAS® user, the central operational task is typically getting lagged (lead) values for each time point in the data set. While SAS has long provided a LAG function, it has no analogous lead function--an especially significant problem in the case of large data series. This paper reviews the LAG function, in particular the powerful but non-intuitive implications of its queue-oriented basis. The paper demonstrates efficient ways to generate leads with the same flexibility as the LAG function, but without the common and expensive recourse to data re-sorting. It also shows how to dynamically generate leads and lags through use of the hash object.
Read the paper (PDF)
Mark Keintz, Wharton Research Data Services
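One widely used way to get a lead without re-sorting, shown here alongside the LAG function for comparison, is a self-merge with the FIRSTOBS= option; this is a generic sketch with hypothetical names, not necessarily the paper's preferred technique:

  data with_lead_lag;
     merge prices
           prices(firstobs=2 keep=price rename=(price=lead_price));   /* next observation's value */
     lag_price = lag(price);        /* queue-based lag of the previous observation */
  run;

The last observation correctly receives a missing lead value; the paper also shows how the hash object can generate leads and lags dynamically.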
M
Session 5580-2016:
Macro Variables in SAS® Enterprise Guide®
For SAS® Enterprise Guide® users, sometimes macro variables and their values need to be brought over to the local workspace from the server, especially when multiple data sets or outputs need to be written to separate files in a local drive. Manually retyping the macro variables and their values in the local workspace after they have been created on the server workspace would be time-consuming and error-prone, especially when we have quite a number of macro variables and values to bring over. Instead, this task can be achieved in an efficient manner by using dictionary tables and the CALL SYMPUT routine, as illustrated in more detail below. The same approach can also be used to bring macro variables and their values from the local to the server workspace.
Read the paper (PDF) | Download the data file (ZIP) | Watch the recording
Khoi To, Office of Planning and Decision Support, Virginia Commonwealth University
Session 10460-2016:
Missing Values: They Are NOT Nothing
When analyzing data with SAS®, we often encounter missing or null values. Missing values can arise from data availability or collection problems, or from other issues with the data. They reflect the imperfect nature of real data. Under most circumstances, we need to clean, filter, separate, impute, or investigate the missing values, processes that can be time-consuming and tedious. For these reasons, missing values are usually unwelcome and are avoided in data analysis. There are two sides to every coin, however. If we think outside the box, we can turn the seemingly negative features of missing values to positive uses. Sometimes we can deliberately create and use missing values to achieve particular goals in data manipulation and analysis. These approaches can make data analyses more convenient and improve the efficiency of SAS programming. This kind of creative and critical thinking is the most valuable quality for data analysts. This paper uses real-world examples to demonstrate creative uses of missing values in data analysis and SAS programming, and discusses the advantages and disadvantages of these methods. The illustrated methods and programming techniques can be applied in a wide variety of data analysis and business analytics fields.
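As one concrete illustration of the idea (a generic example, not necessarily one used in the paper), special missing values such as .R and .D let you record why a value is missing while it still behaves as missing in every calculation:

data survey;
   missing R D;                 /* read R and D as special missings .R and .D */
   input id income;
   datalines;
1 52000
2 R
3 D
;

proc format;
   value incmiss  .R = 'Refused'  .D = "Don't know"  . = 'Missing (other)';
run;

proc freq data=survey;
   tables income / missing;     /* counts each kind of missing separately */
   format income incmiss.;
run;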
Read the paper (PDF)
Justin Jia, Trans Union Canada
Shan Shan Lin, CIBC
Session SAS6431-2016:
Modernizing Data Management with Event Streams
Specialized access requirements for tapping into event streams vary depending on the source of the events. Open-source approaches such as Spark, Kafka, and Storm can connect event streams to big data lakes such as Hadoop and other common data management repositories. But a different approach is needed to ensure that latency and throughput are not adversely affected when processing streaming data with different open-source frameworks--that is, the approach needs to scale. This talk distinguishes the advantages of adapters and connectors and shows how SAS® can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.
Read the paper (PDF)
Evan Guarnaccia, SAS
Fiona McNeill, SAS
Steve Sparano, SAS
Session 11381-2016:
Multiple-Group Multiple Imputation: An Empirical Comparison of Parameter Recovery in the Presence of Missing Data using SAS®
The impact of missing data is well documented in the literature: data loss (Kim & Curry, 1977), the consequent loss of power (Raaijmakers, 1999), and biased estimates (Roth, Switzer, & Switzer, 1999). If left untreated, missing data is a threat to the validity of inferences made from research results. However, the application of a missing data treatment does not by itself prevent researchers from reaching spurious conclusions. In addition to considering the research situation at hand and the type of missing data present, another important factor that researchers should consider when selecting and implementing a missing data treatment is the structure of the data. In the context of educational research, multiple-group structured data is not uncommon. Assessment data gathered from distinct subgroups of students defined by relevant variables (e.g., socio-economic status and demographics) might not be independent. Thus, when comparing the test performance of subgroups of students in the presence of missing data, it is important to preserve any underlying group effect by applying separate multiple imputation within the groups before any data analysis. Using attitudinal (Likert-type) data from the Civics Education Study (1999), this simulation study evaluates the performance of multiple-group multiple imputation and total-group multiple imputation in terms of item parameter invariance within structural equation modeling using the SAS® procedure CALIS and item response theory using the SAS procedure MCMC.
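The distinction at the heart of the comparison can be sketched with PROC MI (the data set, grouping variable, and items here are hypothetical): multiple-group imputation runs a separate imputation model within each group, whereas total-group imputation pools all examinees into one model.

proc sort data=civics_items;
   by group;
run;

/* Multiple-group multiple imputation: preserves any underlying group effect. */
proc mi data=civics_items nimpute=5 seed=20164 out=mi_bygroup;
   by group;
   var item1-item10;
run;

/* Total-group multiple imputation: one model for all examinees combined. */
proc mi data=civics_items nimpute=5 seed=20164 out=mi_total;
   var item1-item10;
run;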
Read the paper (PDF)
Patricia Rodriguez de Gil, University of South Florida
Jeff Kromrey, University of South Florida
N
Session 9140-2016:
No FREQ'n Way
In the consumer credit industry, privacy is key, and the scrutiny increases every day. When returning files to a client, we must ensure that they are depersonalized so that the client cannot match back to any personally identifiable information (PII). This means that we must locate any values of a variable that occur on a limited number of records and null them out (that is, replace them with missing values). When you are working with large files that have more than one million observations and thousands of variables, locating such low-frequency values is a difficult task. While the FREQ procedure and DATA step merging can accomplish the task, using first./last. BY-variable processing to locate the suspect values and hash objects to merge the data set back together might offer increased efficiency.
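A stripped-down sketch of the first./last. counting step for one variable (the variable and cutoff are hypothetical; the paper scales this up across thousands of variables and uses a hash object to apply the nulling):

proc sort data=credit out=credit_sorted;
   by acct_type;
run;

/* Keep the values of ACCT_TYPE that appear on fewer than 5 records. */
data suspect_values(keep=acct_type);
   set credit_sorted;
   by acct_type;
   if first.acct_type then n = 0;
   n + 1;                                  /* running count within the value */
   if last.acct_type and n < 5 then output;
run;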
Read the paper (PDF)
Renee Canfield, Experian
O
Session 6920-2016:
Optimal Advertisement Traffic Allocation
An ad publisher like eBay has multiple ad sources from which to serve ads. Specifically, for eBay's text ads, there are two ad sources: eBay Commerce Network (ECN) and Google. The problem and solution we have formulated consider two ad sources; however, the solution can be extended to any number of ad sources. A study was done on the performance of the ECN and Google ad sources with respect to the revenue per mille (RPM, or revenue per thousand impressions) of the ad queries they serve. It was found that ECN performs better for some ad queries and Google performs better for others. Thus, our problem is to optimally allocate ad traffic between the two ad sources to maximize RPM.
Read the paper (PDF)
Huma Zaidi, eBay
Chitta Ranjan, GTU
P
Session SAS6560-2016:
Pedal-to-the-Metal Analytics with SAS® Studio, SAS® Visual Analytics, and SAS® Visual Statistics
Pedal-to-the-metal analytics is the notion that analytics can be quickly and easily achieved using web technologies. SAS® web technologies seamlessly integrate with each other through a web browser and with data via web APIs, enabling organizations to leapfrog traditional, manual analytic and data processes. Because of this integration (and the operational efficiencies obtained as a result), pedal-to-the-metal analytics dramatically accelerates the analytics lifecycle, which consists of these steps: 1) Data Preparation; 2) Exploration; 3) Modeling; 4) Scoring; and 5) Evaluating results. In this paper, data preparation is accomplished with SAS® Studio custom tasks (reusable drag-and-drop visual components or interfaces for underlying SAS code). This paper shows users how to create and implement these for public health surveillance. With data preparation complete, explorations of the data can be performed using SAS® Visual Analytics. Explorations provide insights for creating, testing, and comparing models in SAS® Visual Statistics to predict or estimate risk. The model score code produced by SAS Visual Statistics can then be deployed from within SAS Visual Analytics for scoring. Furthermore, SAS Visual Analytics provides the necessary dashboard and reporting capabilities to evaluate modeling results. In conclusion, the approach presented in this paper provides both new and long-time SAS users with easy-to-follow guidance and a repeatable workflow to maximize the return on their SAS investments while gaining actionable insights on their data. So, fasten your seat belts and get ready for the ride!
Read the paper (PDF)
Manuel Figallo, SAS
Session 2600-2016:
Personal Lending: Customer Credit and Pricing Optimization
Financial institutions are working hard to optimize credit and pricing strategies at adjudication and for ongoing account and customer management. For cards and other personal lending products, intense competitive pressure and relentless revenue challenges create a huge requirement for sophisticated credit and pricing optimization tools. Numerous credit and pricing optimization applications are available on the market to satisfy these needs. We present a relatively new approach that relies heavily on the effect modeling (uplift or net lift) technique for continuous target metrics--revenue, cost, losses, and profit. Effect modeling is already well known as a way to optimize the impact of marketing campaigns. We discuss five essential steps on the credit and pricing optimization path: (1) setting up critical credit and pricing champion/challenger tests, (2) measuring the performance of specific test campaigns, (3) effect modeling, (4) selecting the best effect model, and (5) moving from the effect model to the optimal solution. These steps require specific applications that are not readily available in SAS®, so the necessary tools have been developed in SAS/STAT® software. We go through numerous examples to illustrate credit and pricing optimization solutions.
Read the paper (PDF)
Yuri Medvedev, Bank of Montreal
Session 1820-2016:
Portfolio Optimization with Discontinuous Constraint
Optimization models require continuous constraints to converge. However, some real-life problems are better described by models that incorporate discontinuous constraints. A common type of discontinuous constraint arises when a regulation-mandated diversification requirement is implemented in an investment portfolio model. Generally stated, the requirement postulates that the aggregate of investments whose individual weights exceed a certain threshold should not exceed some predefined total within the portfolio. This form of the diversification requirement can be defined by the rules of any specific portfolio construction methodology and is commonly imposed by regulators. The paper discusses the impact of this type of discontinuous portfolio diversification constraint on the solution process of the portfolio optimization model, and develops a convergent approach. The approach consists of a sequence of convergent nonlinear optimization problems and is presented in the framework of the OPTMODEL procedure modeling environment. It has been used in constructing investable equity indexes.
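For orientation, the continuous core of such a model in PROC OPTMODEL looks like the sketch below (the asset set, expected returns, and the simple 10% per-name cap are placeholders); the paper's contribution is the convergent sequence of problems that handles the discontinuous aggregate-threshold constraint, which this sketch deliberately omits.

proc optmodel;
   set <str> ASSETS;
   num exp_ret {ASSETS};
   read data expected_returns into ASSETS=[asset] exp_ret;

   var w {ASSETS} >= 0 <= 0.10;                  /* simple continuous cap */
   con budget: sum {a in ASSETS} w[a] = 1;       /* fully invested        */
   max port_return = sum {a in ASSETS} exp_ret[a] * w[a];

   solve;
   print w;
quit;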
Read the paper (PDF) | Download the data file (ZIP) | View the e-poster or slides (PDF)
Taras Zlupko, University of Chicago
Robert Spatz, University of Chicago
Session 11846-2016:
Predicting Human Activity Sensor Data Using an Auto Neural Model with Stepwise Logistic Regression Inputs
Due to advances in medical care and rising living standards, life expectancy in the US has increased to an average of 79 years. This has resulted in an aging population and increased demand for technologies that help elderly people live independently and safely. Ambient-assisted living (AAL) technologies can help meet this challenge. Much research has been done on human activity recognition (HAR) in the last decade. This work can be used in the development of assistive technologies, and HAR is expected to be a core technology for future e-health systems. In this research, I discuss the need to predict human activity accurately by building various models in SAS® Enterprise Miner™ 14.1 on a free public data set that contains 165,633 observations and 19 attributes. The variables used in this research represent the readings of accelerometers mounted on the waist, left thigh, right arm, and right ankle of four individuals performing five different activities, recorded over a period of eight hours. The target variable identifies the human activity: sitting, sitting down, standing, standing up, or walking. Upon comparing different models, the winner is an auto neural model whose inputs are taken from stepwise logistic regression, which is used for variable selection. The model has an accuracy of 98.73% and sensitivity of 98.42%.
Read the paper (PDF) | View the e-poster or slides (PDF)
Venkata Vennelakanti, Oklahoma State University
Session 11681-2016:
Predicting Occurrence Rate of Systemic Lupus Erythematosus and Rheumatoid Arthritis in Pregnant Women
Years ago, doctors advised women with autoimmune diseases such as systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) not to become pregnant out of concern for the mother's health. Now, it is known that a healthy pregnancy is possible for women with lupus, but at the expense of a higher pregnancy complication rate. The main objective of this research is to identify key factors contributing to these diseases and to predict the occurrence rate of SLE and RA in pregnant women. Based on the approach used in this study, the prediction of adverse pregnancy outcomes for women with SLE, RA, and other diseases such as diabetes mellitus (DM) and antiphospholipid antibody syndrome (APS) can also be carried out. These results will help pregnant women have a healthier pregnancy by receiving proper medication at an earlier stage. The data set was obtained from the Cerner Health Facts data warehouse. The raw data set contains 883,473 records and 85 variables such as diagnosis code, age, race, procedure code, admission date, discharge date, total charges, and so on. Analyses were carried out with two different data sets--one for SLE patients and the other for RA patients. The final data sets had 398,742 and 397,898 records, respectively, for modeling SLE and RA patients. To provide an honest assessment of the models, the data was split into training and validation sets using the Data Partition node. Variable selection techniques such as LASSO, LARS, stepwise regression, and forward regression were used. Using a decision tree, prominent factors that determine the SLE and RA occurrence rates were identified separately. Of all the predictive models run in SAS® Enterprise Miner™ 12.3, the Model Comparison node identified the decision tree (Gini) as the best model, with the lowest misclassification rate: 0.308 for predicting SLE patients and 0.288 for predicting RA patients.
Read the paper (PDF)
Ravikrishna Vijaya Kumar, Oklahoma State University
Shrie Raam Sathyanarayanan, Oklahoma State University - Center for Health Sciences
Session 11779-2016:
Predicting Response Time for the First Reply after a Question Is Posted in the SAS® Community Forum
Many inquisitive minds are filled with excitement and anticipation of a response every time they post a question on a forum. This paper explores the factors that affect the time to the first response for questions posted in the SAS® Community forum. These factors include contributors' availability, the nature of the topic, and the number of contributors knowledgeable about that particular topic. The results from this project help SAS® users receive an estimated response time, and the SAS Community forum can use this information to answer several business questions, such as the following: What time of the year is likely to have an overflow of questions? Do specific topics receive delayed responses? On which days of the week is the community most active? To answer such questions, we built a web crawler using Python and Selenium to fetch data from the SAS Community forum, one of the largest analytics groups. We scraped more than 13,443 queries and solutions posted from January 2014 to the present. We also captured several query-related attributes, such as the number of replies, likes, views, and bookmarks, and the number of people conversing on the query. Using different tools, we analyzed this data set after clustering the queries into 22 subtopics and found interesting patterns that can help the SAS Community forum in several ways, as presented in this paper.
View the e-poster or slides (PDF)
Praveen Kumar Kotekal, Oklahoma State University
Session 9820-2016:
Predicting the Current Market Value of a Housing Unit across the Four Census Regions of the United States Using SAS® Enterprise Miner™ 12.3
In early 2006, the United States experienced a housing bubble that affected over half of the American states. It was one of the leading causes of the 2007-2008 financial recession. Primarily, the overvaluation of housing units resulted in foreclosures and prolonged unemployment during and after the recession period. The main objective of this study is to predict the current market value of a housing unit with respect to fair market rent, census region, metropolitan statistical area, area median income, household income, poverty income, number of units in the building, number of bedrooms in the unit, utility costs, other costs of the unit, and so on, in order to determine which factors affect the market value of the housing unit. For the purpose of this study, data was collected from the Housing Affordability Data System of the US Department of Housing and Urban Development. The data set contains 20 variables and 36,675 observations. To select the best possible input variables, several variable selection techniques were tested, including LARS (least angle regression), LASSO (least absolute shrinkage and selection operator), adaptive LASSO, variable selection, variable clustering, stepwise regression, principal component analysis (PCA) with only numeric variables, and PCA with all variables. After selecting the input variables, numerous modeling techniques were applied to predict the current market value of a housing unit. An in-depth analysis of the findings revealed that the current market value of a housing unit is significantly affected by the fair market value, insurance and other costs, structure type, household income, and more. Furthermore, a higher household income and a higher median income of an area are associated with a higher market value of a housing unit.
Read the paper (PDF) | View the e-poster or slides (PDF)
Mostakim Tanjil, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Session 11671-2016:
Predicting the Influence of Demographics on Domestic Violence Using SAS® Enterprise Guide® 6.1 and SAS® Enterprise Miner™ 12.3
The Oklahoma State Department of Health (OSDH) conducts home visiting programs with families that need parental support. Domestic violence is one of the many screenings performed on these visits. The home visiting personnel are trained to do initial screenings; however, they do not have the extensive information required to treat or serve the participants in this arena. Understanding how demographics such as age, level of education, and household income among others, are related to domestic violence might help home visiting personnel better serve their clients by modifying their questions based on these demographics. The objective of this study is to better understand the demographic characteristics of those in the home visiting programs who are identified with domestic violence. We also developed predictive models such as logistic regression and decision trees based on understanding the influence of demographics on domestic violence. The study population consists of all the women who participated in the Children First Program of the OSDH from 2012 to 2014. The data set contains 1,750 observations collected during screening by the home visiting personnel over the two-year period. In addition, they must have completed the Demographic form as well as the Relationship Assessment form at the time of intake. Univariate and multivariate analysis has been performed to discover the influence that age, education, and household income have on domestic violence. From the initial analysis, we can see that women who are younger than 25 years old, who haven't completed high school, and who are somewhat dependent on their husbands or partners for money are most vulnerable. We have even segmented the clients based on the likelihood of domestic violence.
View the e-poster or slides (PDF)
Soumil Mukherjee, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Miriam McGaugh, Oklahoma State Department of Health
Session 3420-2016:
Predictive Analytics in Credit Scoring: Challenges and Practical Resolutions
This session discusses challenges and considerations typically faced in credit risk scoring, as well as options and practical ways to address them. No two problems are ever the same, even if the general approach is clear. Every model has its own unique characteristics that call for creative ways to address them. Successful credit scoring projects are always based on a combination of advanced analytical techniques and data, along with a deep understanding of the business and how the model will be applied. Different aspects of the process are discussed, including feature selection, reject inferencing, sample selection and validation, and model design questions and considerations.
Read the paper (PDF)
Regina Malina, Equifax
Session 4120-2016:
Prescriptive Analytics - Providing the Instruction to Do What's Right
Automation of everyday activities holds the promise of consistency, accuracy, and relevancy. When applied to business operations, the additional benefits of governance, adaptability, and risk avoidance are realized. Prescriptive analytics empowers both systems and front-line workers to take the desired company action each and every time. And with data streaming from transactional systems, from the Internet of Things (IoT), and any other source, doing the right thing with exceptional processing speed produces the responsiveness that customers depend on. This talk describes how SAS® and Teradata are enabling prescriptive analytics in current business environments and in the emerging IoT.
Read the paper (PDF) | View the e-poster or slides (PDF)
Tho Nguyen, Teradata
Fiona McNeill, SAS
Session 7861-2016:
Probability Density for Repeated Events
In customer relationship management (CRM) or consumer finance, it is important to predict the time of repeated events. These repeated events might be a purchase, a service visit, or a late payment. Specifically, the goal is to find the probability density for the time to the first event, the probability density for the time to the second event, and so on. Two approaches are presented and contrasted. One approach uses discrete-time hazard modeling (DTHM). The second, a distinctly different approach, uses a multinomial logistic model (MLM). Both DTHM and MLM satisfy an important consistency criterion, which is to provide probability densities whose average across all customers equals the empirical population average. DTHM requires fewer models if the number of repeated events is small; MLM requires fewer models if the time horizon is small. SAS® code for a simulation is provided to illustrate the two approaches.
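To make the DTHM side concrete, here is a generic sketch (the person-period data set and covariates are hypothetical): each customer contributes one record per time period up to the event or censoring, and a logistic model with period effects estimates the discrete-time hazard.

/* PERSON_PERIOD: one row per customer per period; EVENT = 1 in the     */
/* period the repeated event occurs, 0 otherwise.                       */
proc logistic data=person_period descending;
   class period / param=ref;
   model event = period x1 x2;        /* baseline hazard by period plus  */
                                      /* customer covariates             */
   output out=hazards p=p_hat;        /* estimated hazard for each row   */
run;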
Read the paper (PDF)
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
Session 12526-2016:
Proposing a Recommendation System for DonorsChoose.org
DonorsChoose.org is a nonprofit organization that allows individuals to donate directly to public school classroom projects. Public school teachers post a request for funding a project along with a short essay describing it. Donors all around the world can look at these projects when they log in to DonorsChoose.org and donate to projects of their choice. The idea is to provide a personalized recommendation webpage for every donor that shows the projects they are most likely to prefer and donate to. Implementing a recommender system for the DonorsChoose.org website will improve the user experience and help more projects meet their funding goals. It will also help us understand donors' preferences and deliver what they want or value. One type of recommendation system can be designed by predicting which projects are less likely to meet their funding goals, segmenting and profiling the donors, and using that information to recommend the right projects when donors log in to DonorsChoose.org.
Read the paper (PDF)
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Sandeep Chittoor, Student
Vignesh Dhanabal, Oklahoma State University
Sharat Dwibhasi, Oklahoma State University
S
Session 10862-2016:
Setting Up Winning Conditions for a Migration of SAS® from a PC Environment to a Server Environment
Whether for reasons of security or to centralize the SAS® installation on a server, the need to phase out SAS in the PC environment has never been more common. On the surface, this type of migration seems simple and smooth. However, the effort involved in migrating SAS from a PC environment to a SAS server environment (SAS® Enterprise Guide®) is easy to underestimate. This paper presents a way to set up the winning conditions to achieve this goal without falling into the common traps. Based on a successful conversion at a previous employer, I have identified a high-level roadmap with specific objectives to guide people through this important task.
Read the paper (PDF) | Watch the recording
Mathieu Gaouette, Videotron
Session 2060-2016:
Simultaneous Forecasts of Multiple Interrelated Time Series with Markov Chain Model
In forecasting, there are often situations where several time series are interrelated: components of one time series can transition into and from other time series. A Markov chain forecast model may readily capture such intricacies through the estimation of a transition probability matrix, which enables a forecaster to forecast all the interrelated time series simultaneously. A Markov chain forecast model is flexible in accommodating various forecast assumptions and structures. Implementation of a Markov chain forecast model is straightforward using SAS/IML® software. This paper demonstrates a real-world application in forecasting a community supervision caseload in Washington State. A Markov model was used to forecast five interrelated time series in the midst of turbulent caseload changes. This paper discusses the considerations and techniques in building a Markov chain forecast model at each step. Sample code using SAS/IML is provided. Anyone interested in adding another tool to their forecasting technique toolbox will find that the Markov approach is useful and has some unique advantages in certain settings.
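The computational core is compact: a state (caseload) vector is repeatedly multiplied by the estimated transition probability matrix. A toy SAS/IML sketch with made-up numbers:

proc iml;
   /* rows = current category, columns = next period's category (rows sum to 1) */
   P = {0.90 0.07 0.03,
        0.08 0.85 0.07,
        0.02 0.08 0.90};
   x = {1000 500 200};            /* current caseload in each category */
   do t = 1 to 12;                /* forecast 12 periods ahead         */
      x = x * P;
      print t x;
   end;
quit;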
Read the paper (PDF)
Gongwei Chen, Caseload Forecast Council
Session SAS6361-2016:
Stored Processes and SAS® Visual Analytics: Giving Users the Power to Load
A stored process is a SAS® program that can be executed as required by different applications. Stored processes have been making SAS users' lives easier for decades. In SAS® Visual Analytics, stored processes can be used to enhance the user experience, create custom functionality and output, and expand the usefulness of reports. This paper discusses a technique for loading data on demand into memory for SAS Visual Analytics, where it can be accessed by reports as needed, using stored processes. Loading tables into memory so that they can be used to create explorations or reports is a common task in SAS Visual Analytics. This task is usually done by an administrator, enabling the SAS Visual Analytics user to have a seamless transition from data to report. At times, however, users need tables to be partially loaded or modified in memory on demand. The step-by-step instructions in this paper make it easy for any SAS Visual Analytics report builder to include these stored processes in their work. By using this technique, SAS Visual Analytics users have the freedom to access their data without having to work through an administrator for every little task, helping everyone be more effective and productive.
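At its core, such a stored process usually amounts to a DATA step that writes to a SAS LASR Analytic Server library. The rough sketch below assumes a SASIOLA libname; the host, port, tag, and table names are placeholders for your environment, and the paper covers the remaining steps of registering the table and calling the stored process from a report.

/* Connect to the LASR server (host, port, and tag are placeholders). */
libname valasr sasiola host="lasr.example.com" port=10010 tag="vapublic";

/* Load only the slice the report needs, on demand. */
data valasr.claims_recent;
   set warehouse.claims(where=(svc_date >= intnx('month', today(), -3, 'b')));
run;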
Read the paper (PDF)
Renato Luppi, SAS
Varsha Chawla, SAS Institute Inc.
Session SAS6367-2016:
Streaming Decisions: How SAS® Puts Streaming Data to Work
Sensors, devices, social conversation streams, web activity, and all things in the Internet of Things (IoT) are transmitting data at unprecedented volumes and rates. SAS® Event Stream Processing ingests thousands and even hundreds of millions of data events per second, assessing both their content and their value. The benefit to organizations comes from doing something with those results and from eliminating the latencies associated with storing data before analysis happens. This paper bridges that gap. It describes how to use streaming data as a portable, lightweight micro-analytics service for consumption by other applications and systems.
Read the paper (PDF) | Download the data file (ZIP)
Fiona McNeill, SAS
David Duling, SAS
Steve Sparano, SAS
Session 2100-2016:
Super Boost Data Transpose Puzzle
This paper compares different solutions to a data transpose puzzle presented to the SAS® User Group at the United States Census Bureau. The presented solutions range from a SAS 101 multi-step solution to an advanced solution using techniques that are not widely known, which yields run-time savings of 85 percent!
Read the paper (PDF) | Download the data file (ZIP)
Ahmed Al-Attar, AnA Data Warehousing Consulting, LLC
Session 11682-2016:
Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS® Enterprise Miner™ 13.1
Cancer is the second-leading cause of death in the United States. About 10% to 15% of all lung cancers are non-small cell lung cancer, yet these cancers account for about 26.8% of cancer deaths. An efficient cancer treatment plan is therefore of prime importance for increasing a patient's chances of survival. Cancer treatments are generally given to a patient in multiple sittings, and doctors often make decisions based on the improvement observed over time. Calculating a patient's survival chances over time and determining the factors that influence survival time would help doctors make more informed decisions about the further course of treatment and help patients take a proactive approach in making treatment choices. The objective of this paper is to analyze the survival time of patients suffering from non-small cell lung cancer, identify the time interval crucial for their survival, and assess the influence of factors such as age, gender, and treatment type. The performance of the models built is analyzed to understand the significance of the cubic splines used in them. The data set, from the Cerner database, contains 548 records and 12 variables covering 2009 to 2013. Patient records lost to follow-up are censored. The survival analysis is performed using parametric and nonparametric methods in SAS® Enterprise Miner™ 13.1. The analysis revealed that the survival probability of a patient is high within two weeks of hospitalization and that the probability of survival drops by 60% between weeks 5 and 9 after admission. Age, gender, and treatment type play prominent roles in influencing survival time and risk. The probability of survival for female patients decreases by 70% and 80% during weeks 6 and 7, respectively, and for male patients the probability of survival decreases by 70% and 50% during weeks 8 and 13, respectively.
Read the paper (PDF)
Raja Rajeswari Veggalam, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Akansha Gupta, Oklahoma State University
T
Session 7840-2016:
Text Analytics and Assessing Patient Satisfaction
A survey is often the best way to obtain information and feedback to help in program improvement. At a bare minimum, survey research draws on economics, statistics, and psychology to develop a theory of how surveys measure and predict important aspects of the human condition. Drawing on healthcare-related research papers written by experts in the industry, I hypothesize a list of factors that are important and necessary for patients during hospital visits. Through text mining with SAS® Enterprise Miner™ 12.1, I measure tangible aspects of hospital quality and patient care using online survey responses and comments found on the website of the National Health Service in England. I use these patient comments to determine the majority opinion and whether it correlates with expert research on hospital quality. The implications of this research are vital for understanding our health-care system and the factors that are important in satisfying patient needs. Analyzing survey responses can help us understand and mitigate disparities in health-care services provided to population subgroups in the United States. Starting with online survey responses can also help us understand the methodology and motivation needed to analyze the surveys developed for agencies such as the Centers for Medicare and Medicaid Services (CMS).
Read the paper (PDF) | View the e-poster or slides (PDF)
Divya Sridhar, Deloitte
Session 3980-2016:
Text Analytics and Brand Topic Maps
In this session, we examine the use of text analytics as a tool for strategic analysis of an organization, a leader, a brand, or a process. The software solutions SAS® Enterprise Miner™ and Base SAS® are used to extract topics from text and create visualizations that identify the relative placement of the topics next to various business entities. We review a number of case studies that identify and visualize brand-topic relationships in the context of branding, leadership, and service quality.
Read the paper (PDF) | Watch the recording
Nicholas Evangelopoulos, University of North Texas
Session 8080-2016:
The HPSUMMARY Procedure: An Old Friend's Younger (and Brawnier) Cousin
The HPSUMMARY procedure provides data summarization tools to compute basic descriptive statistics for variables in a SAS® data set. It is a high-performance version of the SUMMARY procedure in Base SAS®. Though PROC SUMMARY is popular with data analysts, PROC HPSUMMARY is still a new kid on the block. The purpose of this paper is to provide an introduction to PROC HPSUMMARY by comparing it with its well-known counterpart, PROC SUMMARY. The comparison focuses on differences in syntax, options, and performance in terms of execution time and memory usage. Sample code, outputs, and SAS log snippets are provided to illustrate the discussion. Simulated large data sets are used to observe the performance of the two procedures. Using SAS® 9.4 installed on a single-user machine with four cores available, preliminary experiments examining the performance of the two procedures show that HPSUMMARY is more memory-efficient than SUMMARY when the data set is large (for example, SUMMARY failed due to insufficient memory, whereas HPSUMMARY finished successfully). However, there was no evidence of an execution-time advantage of HPSUMMARY over SUMMARY on this single-user machine.
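Because the syntax deliberately mirrors PROC SUMMARY, switching is largely a matter of changing the procedure name. A hypothetical sketch (data set and variables are placeholders; the optional PERFORMANCE statement reports execution details):

/* Traditional procedure */
proc summary data=big;
   class region;
   var x1-x10;
   output out=stats_summary mean= / autoname;
run;

/* High-performance counterpart: same statements, plus an optional */
/* PERFORMANCE statement.                                          */
proc hpsummary data=big;
   performance details;
   class region;
   var x1-x10;
   output out=stats_hpsummary mean= / autoname;
run;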
Read the paper (PDF) | View the e-poster or slides (PDF)
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
Session 11482-2016:
The Scheduler: Give Your Punch Data a New Dimension
The Scheduler is an innovative tool that maps linear data (that is, time stamps) to an intuitive three-dimensional representation. This transformation enables the user to identify relationships, conflicts, gaps, and so on that were not readily apparent in the data's native form. The tool has applications in operations research and litigation-related analyses. This paper discusses why the Scheduler was created, the data available to analyze the issue, and how the code can be used in other types of applications. Specifically, the paper discusses how supervisors can use the Scheduler to maintain the presence of three or more employees at all times and to ensure that all federally and state-mandated work breaks are taken. Additional examples include helping construction foremen create schedules that visualize the presence of contractors in a logical sequence, eliminate overlap, and ensure cushions of time between contractors, as well as matching items to non-uniform information.
Read the paper (PDF)
Ariel Kumpinsky, The Claro Group, LLC
Session SAS5380-2016:
Transform Data Using Expression Builder in SAS® Visual Analytics
Data transformations serve many functions in data analysis, including improving normality of distribution and equalizing variance to meet assumptions and improve effect sizes. Traditionally, the first step in an analysis is to preprocess and transform the data to derive different representations for further exploration. But in this era of big data, it is not always feasible to have transformed data available beforehand. Analysts need to conduct exploratory data analysis and then transform data on the fly according to their needs. SAS® Visual Analytics has an expression builder component, integrated into SAS® Visual Data Builder, SAS® Visual Analytics Explorer, and SAS® Visual Analytics Designer, that helps you transform data on the fly. The expression builder enables you to create expressions that you can use to aggregate columns, perform multiple operations on data, and perform conditional processing. It supports numeric, comparison, Boolean, text, and date-time operators, as well as functions such as Log, Ln, Mod, Exp, Power, Root, and so on. This paper demonstrates how you can use the expression builder that is integrated into the data builder, the explorer, and the designer to create different types of expressions and transform data for analysis and reporting purposes.
Read the paper (PDF)
Atul Kachare, SAS
U
Session 9940-2016:
Use Capture-Recapture Model in SAS® to Estimate Prevalence of Disease
Administrative health databases, including hospital and physician records, are frequently used to estimate the prevalence of chronic diseases. Disease-surveillance information is used by policy makers and researchers to compare the health of populations and develop projections about disease burden. However, not all cases are captured by administrative health databases, which can result in biased estimates. Capture-recapture (CR) models, originally developed to estimate the sizes of animal populations, have been adapted by epidemiologists to estimate the total sizes of disease populations for conditions such as cancer, diabetes, and arthritis. Estimates of the number of cases are produced by assessing the degree of overlap among incomplete lists of disease cases captured in different sources. Two- and three-source CR models are most commonly used, often with covariates. Two important assumptions that underlie conventional CR models--independence of capture across data sources and homogeneity of capture probabilities--are unlikely to hold in epidemiological studies, and failure to satisfy these assumptions biases the model results. Log-linear, multinomial logistic regression, and conditional logistic regression models, if used properly, can incorporate dependency among sources and can use covariates to model the effect of heterogeneity in capture probabilities. However, none of these models is optimal, and researchers might be unfamiliar with how to use them in practice. This paper demonstrates how to use SAS® to implement the log-linear, multinomial logistic regression, and conditional logistic regression CR models. Methods to address the assumptions of independence between sources and homogeneity of capture probabilities for a three-source CR model are provided. The paper uses a real numeric data set about Parkinson's disease involving physician claims, hospital abstracts, and prescription drug records from one Canadian province. Advantages and disadvantages of each model are discussed.
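As a generic illustration of the first of these (not the paper's exact code), a three-source log-linear CR model can be fit with PROC GENMOD on the seven observed capture-history cells, with pairwise interactions modeling between-source dependence:

/* CELLS: one row per observed capture history; S1, S2, S3 are 1/0   */
/* indicators for the three sources and COUNT is the number of cases */
/* with that history.  The fitted model is used to project the       */
/* unobserved (0,0,0) cell and hence the total population size.      */
proc genmod data=cells;
   class s1 s2 s3;
   model count = s1 s2 s3 s1*s2 s1*s3 s2*s3 / dist=poisson link=log;
run;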
View the e-poster or slides (PDF)
Lisa Lix, University of Manitoba
Session 5581-2016:
Using PROC TABULATE and LAG(n) Function for Rates of Change
For SAS® users, PROC TABULATE and PROC REPORT (with its compute blocks) are probably among the most common procedures for calculating and displaying data. It is, however, quite difficult to calculate and display changes from one column to another using data from other rows with just these two procedures. Compute blocks in PROC REPORT can calculate additional columns, but it is challenging to pick up values from other rows as inputs. This presentation shows how PROC TABULATE can work with the LAG(n) function to calculate rates of change from one period to another. This offers the flexibility of feeding data retrieved from other rows into the calculations. PROC REPORT is then used to produce the desired output. The same approach can be used in a variety of scenarios to produce customized reports.
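A condensed sketch of the idea (hypothetical enrollment data; the paper's own examples are more elaborate): compute the change with LAG in a short DATA step, and let PROC TABULATE lay the results out by period.

proc sort data=enroll;
   by dept term;
run;

data enroll_chg;
   set enroll;
   by dept;
   prev = lag(headcount);            /* call LAG on every iteration       */
   if first.dept then prev = .;      /* no prior term within a department */
   pct_change = (headcount - prev) / prev;
run;

proc tabulate data=enroll_chg;
   class dept term;
   var headcount pct_change;
   table dept,
         term*(headcount*sum='Headcount'
               pct_change*sum='Change vs. prior term'*f=percent8.1);
run;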
Read the paper (PDF) | Download the data file (ZIP) | Watch the recording
Khoi To, Office of Planning and Decision Support, Virginia Commonwealth University
Session 9520-2016:
Using Parametric and Nonparametric Tests to Assess the Decision of the NBA's 2014-2015 MVP Award
Stephen Curry, James Harden, and LeBron James are considered to be three of the most gifted professional basketball players in the National Basketball Association (NBA). Each year the Kia Most Valuable Player (MVP) award is given to the best player in the league. Stephen Curry currently holds this title, followed by James Harden and LeBron James, the first two runners-up. The decision was made by a panel of judges composed of 129 sportswriters and broadcasters, along with fans who were able to cast their votes through NBA.com. Did the judges make the correct decision? Is there statistical evidence that Stephen Curry is indeed deserving of this prestigious title over James Harden and LeBron James? Is there a significant difference between the two runners-up? These are some of the questions addressed in this project. Using data collected from NBA.com for the 2014-2015 season, a variety of parametric and nonparametric k-sample methods were applied to 20 quantitative variables. In an effort to determine which of the three players is the most deserving of the MVP title, post-hoc comparisons were also conducted on the variables that were shown to be significant. The time-dependent variables were standardized, because there was a significant difference in the number of minutes each athlete played. These variables were then tested and compared with those that had not been standardized. This led to significantly different outcomes, indicating that the results of the tests could be misleading if the time variable is not taken into consideration. Using the standardized variables, the results of the analyses indicate that there is a statistically significant difference in the overall performances of the three athletes, with Stephen Curry outplaying the other two players. However, the difference between James Harden and LeBron James is not so clear.
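To illustrate the kind of k-sample testing involved (the game-level data set and variable names are hypothetical), a nonparametric Kruskal-Wallis test and its parametric counterpart with post-hoc comparisons might look like this:

/* Nonparametric comparison of the three players; WILCOXON requests */
/* the Kruskal-Wallis test for k samples.                           */
proc npar1way data=games wilcoxon;
   class player;
   var pts_per_min;
run;

/* Parametric counterpart with Tukey post-hoc pairwise comparisons. */
proc glm data=games;
   class player;
   model pts_per_min = player;
   means player / tukey;
run;
quit;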
Read the paper (PDF) | View the e-poster or slides (PDF)
Sherrie Rodriguez, Kennesaw State University
Session SAS5480-2016:
Using SAS® Simulation Studio to Test and Validate SAS/OR® Optimization Models
In many discrete-event simulation projects, the chief goal is to investigate the performance of a system. You can use output data to better understand the operation of the real or planned system and to conduct various what-if analyses. But you can also use simulation for validation--specifically, to validate a solution found by an optimization model. In optimization modeling, you almost always need to make some simplifying assumptions about the details of the system you are modeling. These assumptions are especially important when the system includes random variation--for example, in the arrivals of individuals, their distinguishing characteristics, or the time needed to complete certain tasks. A common approach holds each random element at some nominal value (such as the mean of its observed values) and proceeds with the optimization. You can do better. In order to test an optimization model and its underlying assumptions, you can build a simulation model of the system that uses the optimal solution as an input and simulates the system's detailed behavior. The simulation model helps determine how well the optimal solution holds up when randomness and perhaps other logical complexities (which the optimization model might have ignored, summarized, or modeled only approximately) are accounted for. Simulation might confirm the optimization results or highlight areas of concern in the optimization model. This paper describes cases in which you can use simulation and optimization together in this manner and discusses the positive implications of this complementary analytic approach. For the reader, prior experience with optimization and simulation is helpful but not required.
Read the paper (PDF) | Download the data file (ZIP)
Ed Hughes, SAS
Emily Lada, SAS Institute Inc.
Leo Lopes, SAS Institute Inc.
Imre Polik, SAS
Session 11775-2016:
Using SAS® Text Miner for Automatic Categorization of Blog Posts on a Social Networking Site Dedicated to Cyclists
Cycling is one of the fastest growing recreational activities and sports in India. Many non-government organizations (NGOs) support this emission-free mode of transportation that helps protect the environment from air pollution hazards. Many cyclist groups in metropolitan areas organize ride events and build social networks. Although India was a bit late in joining the Internet, social networking sites are becoming popular in every Indian state and are expected to grow further after the announcement of the Digital India Project. Many networking groups and cycling blogs share large amounts of information and resources across the globe. However, these blog posts are difficult to categorize according to their relevance, making it hard to access the required information quickly. This paper provides ways to categorize the content of these blog posts and classify them into meaningful categories for easy identification. The initial data set is created by scraping online cycling blog posts (for example, Cyclists.in, velocrushindia.com, and so on) using Python. It consists of 1,446 blog posts, with titles and content, dating back to 2008. Approximately 25% of the blog posts are manually categorized into six categories (Ride stories, Events, Nutrition, Bicycle hacks, Bicycle Buy and Sell, and Others) by three independent raters to generate a training data set. A text classification model is then built in SAS® Text Miner to identify the category of each blog post and to generate text rules for classifying posts into those categories. To improve the look and feel of the social networking blog, a tool developed in JavaScript automatically classifies the blog posts into the six categories and provides appropriate suggestions to the blog user.
Read the paper (PDF)
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Session 1781-2016:
Using SAS® for a Data Visualization Study of the Distribution of Same-Sex Couples by State as Marriage Laws Change
Since 2004, marriage laws have changed in 38 states, including the District of Columbia, to permit same-sex couples to marry. However, since 1996, 13 states have banned same-sex marriage, and some of these bans are under legal challenge. An animated U.S. map, made using the GMAP procedure in SAS® 9.4, shows the year-by-year changes in the acceptance, rejection, or undecided legal status of this issue. In this study, we also present other SAS data visualization techniques to show how concentrations (percentages) of married same-sex couples and their relationships have evolved over time, thereby instigating changes in marriage laws. Using a SAS/GRAPH® display, we attempt to show how both the total number of same-sex-couple households and the percent that are married have changed on a state-by-state basis since 2005. These changes are also compared to the year in which laws permitting same-sex marriage were enacted, and are followed over time since that year. Trends can be examined on an annual basis from 2005 to 2013 using the American Community Survey one-year estimates. The SAS procedures and SAS code used for the data visualization and data manipulation techniques are provided.
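The core of such a display is a choropleth of states colored by legal status, drawn once per year (a rough sketch with a hypothetical response data set; the animation itself requires additional graphics options that the paper describes):

/* LAW_STATUS: one row per state per year; STATUS is coded 1=permitted, */
/* 2=banned, 3=undecided (a user-defined format can supply the labels). */
proc gmap data=law_status map=maps.us all;
   by year;
   id state;
   choro status / discrete coutline=black;
run;
quit;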
Read the paper (PDF) | Download the data file (ZIP) | View the e-poster or slides (PDF)
Bula Ghose, HHS/ASPE
Susan Queen, National Center for Health Statistics
Joan Turek, Health and Human Services
Session 8260-2016:
Using SAS® to Integrate the LACE Readmissions Risk Score into the Electronic Health Record
The LACE readmission risk score is a methodology used by Kaiser Permanente Northwest (KPNW) to target and customize readmission prevention strategies for patients admitted to the hospital. This presentation shares how KPNW used SAS® in combination with Epic's Datalink to integrate the LACE score into its electronic health record (EHR) for use in real time. The LACE score is an objective measure composed of four components: (L) length of stay; (A) acuity of admission; (C) pre-existing comorbidities; and (E) emergency department (ED) visits in the prior six months. SAS was used to perform complex calculations and combine data from multiple sources (which was not possible in the EHR alone), and then to calculate a score that was integrated back into the EHR. The technical approach includes a trigger macro to kick off the process once the database ETL completes, several explicit and implicit PROC SQL statements, a volatile temp table for filtering, and a series of SORT, MEANS, TRANSPOSE, and EXPORT procedures. We walk through the technical approach taken to generate the LACE score and integrate it into Epic, and we describe the challenges we faced, how we overcame them, and the benefits we have gained throughout the process.
View the e-poster or slides (PDF)
Delilah Moore, Kaiser Permanente
V
Session 10760-2016:
Visualizing Eye-Tracking Data with SAS®: Creating Heat Maps on Images
Few data visualizations are as striking or as useful as heat maps, and fortunately there are many applications for them. A heat map displaying eye-tracking data is a particularly potent example: the intensity of a viewer's gaze is quantified and superimposed on the image being viewed, and the resulting display is often stunning and informative. This paper shows how to use Base SAS® to prepare, smooth, and transform eye-tracking data and ultimately render it on top of the corresponding image. By customizing a graphical template and using specific Graph Template Language (GTL) options, a heat map can be drawn precisely, so that the user maintains pixel-level control. The talk and the related paper focus primarily on eye-tracking data, but the techniques provided are easily adapted to other fields such as sports, weather, and marketing.
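A small sketch of the preparation step (the bin size, variable names, and duration weighting are illustrative; the paper adds smoothing and the GTL rendering on top of the image): aggregate the raw gaze samples into a grid whose cell totals become the heat-map intensities.

/* Bin raw gaze samples (pixel coordinates X, Y) into 10x10-pixel cells. */
data gaze_binned;
   set gaze_raw;
   xbin = floor(x / 10) * 10 + 5;    /* cell midpoints */
   ybin = floor(y / 10) * 10 + 5;
run;

/* Total fixation duration per cell becomes the heat-map intensity. */
proc means data=gaze_binned noprint nway;
   class xbin ybin;
   var duration;
   output out=gaze_grid(drop=_type_ _freq_) sum=intensity;
run;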
Read the paper (PDF) | View the e-poster or slides (PDF)
Matthew Duchnowski, Educational Testing Service