SAS Global Forum 2014 Proceedings

Complex data manipulations can be resource intensive, both in terms of development time and processing duration. However, in recent years SAS has introduced a number of new technologies that, when used together, can produce a dramatic increase in performance while simultaneously simplifying program development and maintenance. This paper presents a development paradigm that utilizes the problem decomposition capabilities of DS2, the flexibility of SQL, and the performance benefits of in-memory storage using hash objects.

Stepwise regression includes regression models in which the predictive variables are selected by an automated algorithm. The stepwise method involves two approaches: backward elimination and forward selection. Currently, SAS^® has three procedures capable of performing stepwise regression: REG, LOGISTIC, and GLMSELECT. PROC REG handles the linear regression model, but does not support a CLASS statement. PROC LOGISTIC handles binary responses and allows for logit, probit, and complementary log-log link functions. It also supports a CLASS statement. The GLMSELECT procedure performs selections in the framework of general linear models. It allows for a variety of model selection methods, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). PROC GLMSELECT also supports a CLASS statement. We present a stepwise algorithm for generalized linear mixed models for both marginal and conditional models. We illustrate the algorithm using data from a longitudinal epidemiology study that aimed to investigate parents' beliefs, behaviors, and feeding practices that associate positively or negatively with indices of sleep quality.

The ExcelXP tagset offers several options for controlling column widths, including Width_Points, Width_Fudge, and Absolute_Column_Width. Although Absolute_Column_Width might seem unpredictable at first, it is possible to fix the first two options so that the Absolute_Column_Width is the exact column width in pixels. This poster presents these settings and suggests how to create and manage the integer string of column widths.

Accelerated testing is an effective tool for predicting when systems fail, where the system can be as simple as an engine gasket or as complex as a magnetic resonance imaging (MRI) scanner. In particular, you might conduct an experiment to determine how factors such as temperature and voltage impose enough stress on the system to cause failure. Because system components usually meet nominal quality standards, it can take a long time to obtain failure data under normal-use conditions. An effective strategy is to accelerate the experiment by testing under abnormally stressful conditions, such as higher temperatures. Following that approach, you obtain data more quickly, and you can then analyze the data by using the RELIABILITY procedure in SAS/QC^® software. The analysis is a three-step process: you establish a probability model, explore the relationship between stress and failure, and then extrapolate to normal-use conditions. Graphs are a key component of all three stages: you choose a model by comparing residual plots from candidate models, use graphs to examine the stress-failure relationship, and then use an appropriately scaled graph to extrapolate along a straight line. This paper guides you through the process, and it highlights features added to the RELIABILITY procedure in SAS/QC 13.1.

Cluster (group) randomization in trials is increasingly used over patient-level randomization. There are many reasons for this, including more pragmatic trials associated with comparative effectiveness research. Examples of clusters that could be randomized for study are clinics or hospitals, counties within a state, and other geographical areas such as communities. In many of these trials, the number of clusters is relatively small. This can be a problem if there are important covariates at the cluster level that are not balanced across the intervention and control groups. For example, if we randomize eight counties, a simple randomization could put all counties with high socioeconomic status in one group or the other, leaving us without good comparison data. There are strategies to prevent an unlucky cluster randomization. These include matching, stratification, minimization and covariate-constrained randomization. Each method is discussed, and a county-level Health Economics example of covariate-constrained randomization is shown for intermediate SAS^® users working with SAS^® Foundation for Release 9.2 and SAS/STAT^® on a Windows operating system.
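The idea behind covariate-constrained randomization can be sketched in a few lines. The following is a minimal Python illustration, not the SAS code from the paper: it enumerates every equal split of the clusters, keeps only the splits whose group means differ by no more than a chosen tolerance, and then draws one acceptable split at random. The county names, covariate values, tolerance, and function name are all hypothetical.

```python
import itertools
import random


def constrained_randomization(covariate, max_diff, seed=None):
    """Pick one acceptably balanced two-arm split of clusters at random.

    covariate: dict mapping cluster name -> cluster-level covariate value
    max_diff:  largest acceptable absolute difference between arm means
    Returns ((arm_a, arm_b), number_of_acceptable_allocations).
    """
    clusters = sorted(covariate)
    half = len(clusters) // 2
    acceptable = []
    # Enumerate every way to choose the clusters for arm A.
    for arm_a in itertools.combinations(clusters, half):
        arm_b = [c for c in clusters if c not in arm_a]
        mean_a = sum(covariate[c] for c in arm_a) / len(arm_a)
        mean_b = sum(covariate[c] for c in arm_b) / len(arm_b)
        # Keep only allocations that satisfy the balance constraint.
        if abs(mean_a - mean_b) <= max_diff:
            acceptable.append((list(arm_a), arm_b))
    if not acceptable:
        raise ValueError("no allocation satisfies the balance constraint")
    # The random draw is restricted to the acceptable allocations.
    return random.Random(seed).choice(acceptable), len(acceptable)


# Hypothetical socioeconomic index for eight counties.
income = {"A": 30, "B": 32, "C": 45, "D": 47, "E": 60, "F": 62, "G": 75, "H": 77}
(arm_a, arm_b), n_ok = constrained_randomization(income, max_diff=5, seed=1)
print(arm_a, arm_b, n_ok)
```

With eight clusters there are only 70 candidate splits, so full enumeration is cheap; for larger trials one would typically sample candidate allocations instead of listing them all.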

Finding groups with similar attributes is at the core of knowledge discovery. To this end, Cluster Analysis automatically locates groups of similar observations. Despite successful applications, many practitioners are uncomfortable with the degree of automation in Cluster Analysis, which causes intuitive knowledge to be ignored. This is especially true in text mining applications, since individual words have meaning beyond the data set. Discovering groups with similar text is extremely insightful. However, blind applications of clustering algorithms ignore intuition and hence are unable to group similar text categories. The challenge is to integrate the power of clustering algorithms with the knowledge of experts. We demonstrate how SAS/STAT^® 9.2 procedures and the SAS^® Macro Language are used to combine the opinions of domain experts with multiple clustering models in an ensemble to arrive at a consensus. The method has been successfully applied to a large data set with structured attributes and unstructured opinions. The result is the ability to discover observations with similar attributes and opinions by capturing the wisdom of the crowds, whether man or model.

This paper expands upon "A Multilevel Model Primer Using SAS^® PROC MIXED," in which we presented an overview of estimating two- and three-level linear models via PROC MIXED. However, in our earlier paper, we, for the most part, relied on simple options available in PROC MIXED. In this paper, we present a more advanced look at common PROC MIXED options used in the analysis of social and behavioral science data, as well as introduce users to two different SAS macros previously developed for use with PROC MIXED: one to examine model fit (MIXED_FIT) and the other to examine distributional assumptions (MIXED_DX). Specific statistical options presented in the current paper include (a) PROC MIXED statement options for estimating statistical significance of variance estimates (COVTEST, including problems with using this option) and estimation methods (METHOD =), (b) MODEL statement option for degrees of freedom estimation (DDFM =), and (c) RANDOM statement option for specifying the variance/covariance structure to be used (TYPE =). Given the importance of examining model fit, we also present methods for estimating changes in model fit through an illustration of the SAS macro MIXED_FIT. Likewise, the SAS macro MIXED_DX is introduced to remind users to examine distributional assumptions associated with two-level linear models. To maintain continuity with the 2013 introductory PROC MIXED paper, thus providing users with a set of comprehensive guides for estimating multilevel models using PROC MIXED, we use the same real-world data sources that we used in our earlier primer paper.

A central component of discussions of healthcare reform in the U.S. is the estimation of healthcare cost and use at the national or state level, as well as for subpopulations of individuals with certain demographic properties or medical conditions. For example, a striking but persistent observation is that just 1% of the U.S. population accounts for more than 20% of total healthcare costs, and 5% accounts for almost 50% of total costs. In addition to descriptions of specific data sources underlying this type of observation, we demonstrate how to use SAS^® to generate these estimates and to extend the analysis in various ways; that is, to investigate costs for specific subpopulations. The goal is to provide SAS programmers and healthcare analysts with sufficient data-source background and analytic resources to independently conduct analyses on a wide variety of topics in healthcare research. For selected examples, such as the estimates above, we concretely show how to download the data from federal web sites, replicate published estimates, and extend the analysis. An added plus is that most of the data sources we describe are available as free downloads.

Sampling is widely used in different fields for quality control, population monitoring, and modeling. However, the purposes of sampling might be justified by the business scenario, such as legal or compliance needs. This paper uses one probability sampling method, stratified sampling, combined with quality control review business cost to determine an optimized procedure of sampling that satisfies both statistical selection criteria and business needs. The first step is to determine the total number of strata by grouping the strata with a small number of sample units, using box-and-whisker plot outliers as a whole. Then, the cost to review the sample in each stratum is justified by a corresponding business counterpart, which is the human working hour. Lastly, using the determined number of strata and sample review cost, optimal allocation of the predetermined total sample size is applied to allocate the sample into different strata.
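As an illustration of the final step, the textbook cost-weighted optimal allocation assigns stratum h a share of the total sample proportional to N_h * S_h / sqrt(c_h), where N_h is the stratum size, S_h the within-stratum standard deviation, and c_h the per-unit review cost. The Python sketch below assumes that this standard formula is the one intended; the abstract does not spell out its exact allocation rule, and the strata, values, and function name are hypothetical.

```python
import math


def optimal_allocation(total_n, strata):
    """Cost-weighted optimal allocation of a fixed total sample size.

    strata: dict mapping stratum name -> (N_h, S_h, c_h), where
            N_h = number of units in the stratum,
            S_h = within-stratum standard deviation,
            c_h = cost (for example, review hours) per sampled unit.
    Returns a dict of unrounded sample sizes that sum to total_n.
    """
    # Each stratum's weight is N_h * S_h / sqrt(c_h).
    weights = {h: n * s / math.sqrt(c) for h, (n, s, c) in strata.items()}
    total_weight = sum(weights.values())
    return {h: total_n * w / total_weight for h, w in weights.items()}


# Hypothetical strata: (population size, standard deviation, review hours).
strata = {
    "low": (1000, 2.0, 1.0),
    "mid": (500, 4.0, 4.0),
    "high": (100, 10.0, 4.0),
}
allocation = optimal_allocation(350, strata)
print(allocation)
```

Larger, more variable, and cheaper-to-review strata receive more of the sample. The results are left unrounded here; in practice they would be rounded and capped at each stratum's population size.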

The power of social media has increased to such an extent that businesses that fail to monitor consumer responses on social networking sites are now clearly at a disadvantage. In this paper, we aim to provide some insights on the impact of the Digital Rights Management (DRM) policies of Microsoft and the release of Xbox One on their customers' reactions. We have conducted preliminary research to compare the basic text mining capabilities of SAS^® and R, two very diverse yet powerful tools. A total of 6,500 Tweets were collected to analyze the impact of the DRM policies of Microsoft. The Tweets were segmented into three groups based on date: before Microsoft announced its Xbox One policies (May 18 to May 26), after the policies were announced (May 27 to June 16), and after changes were made to the announced policies (June 16 to July 1). Our results suggest that SAS works better than R when it comes to extensive analysis of textual data. In our following work, customers' reactions to the release of Xbox One will be analyzed using SAS^® Sentiment Analysis Studio. We will collect Tweets on Xbox posted before and after the release of Xbox One by Microsoft. We will have two categories, Tweets posted between November 15 and November 21 and those posted between November 22 and November 29. Sentiment analysis will then be performed on these Tweets, and the results will be compared between the two categories.

Most marketers today are trying to use Facebook's network of 1.1 billion plus registered users for social media marketing. Local television stations and newspapers are no exception. This paper investigates what makes a post effective. A Facebook page that is owned by a brand has fans, or people who like the page and follow the stories posted on that page. The posts on a brand page, however, do not appear on all the fans' News Feeds. This is determined by EdgeRank, a Facebook proprietary algorithm that determines what content users see and how it's prioritized on their News Feed. If marketers can understand how EdgeRank works, then they can develop more impactful posts and ultimately produce more effective social marketing using Facebook. The objective of this paper is to find the characteristics of a Facebook post that enhance the efficacy of a news outlet's page among their fans using Facebook Power Ratio as the target variable. Power Ratio, a surrogate to EdgeRank, was developed by experts at Frank N. Magid Associates, a research-based media consulting firm. Seventeen variables that describe the characteristics of a post were extracted from more than 8,000 posts, which were encoded by 10 media experts at Magid. Numerous models were built and compared to predict Power Ratio. The most useful model is a polynomial regression with the top three important factors as whether a post asks fans to like the post, content category of a post (including news, weather, etc.), and number of fans of the page.

Do you have an abstract for an idea that you want to submit as a proposal to SAS^® conferences, but you are not sure which section is the most appropriate one? In this paper, we discuss a methodology for automatically identifying the most suitable section or sections for your proposal content. We use SAS^® Text Miner 12.1 and SAS^® Content Categorization Studio 12.1 to develop a rule-based categorization model. This model is used to automatically score your paper abstract to identify the most relevant and appropriate conference sections to submit to for a better chance of acceptance.

This paper kicks off a project to write a comprehensive book of best practices for documenting SAS^® projects. The presenter's existing documentation styles are explained. The presenter wants to discuss and gather current best practices used by the SAS user community. The presenter shows documentation styles at three different levels of scope. The first is a style used for project documentation, the second a style for program documentation, and the third a style for variable documentation. This third style enables researchers to repeat the modeling in SAS research, in an alternative language, or conceptually.

There is an ever-increasing number of study designs and analyses of clinical trials using Bayesian frameworks to interpret treatment effects. Many research scientists prefer to understand the power and probability of taking a new drug forward across the whole range of possible true treatment effects, rather than focusing on one particular value to power the study. Examples are used in this paper to show how to compute Bayesian probabilities using the SAS/STAT^® MIXED procedure and UNIVARIATE procedure. Particular emphasis is given to the application on efficacy analysis, including the comparison of new drugs to placebos and to standard drugs on the market.

Because the macro language is primarily a code generator, it makes sense that the code that it creates must be generated before it can be executed. This implies that execution of the macro language comes first. Simple as this is in concept, timing issues and conflicts are often not so simple to recognize in application. As we use the macro language to take on more complex tasks, it becomes even more critical that we have an understanding of these issues.

Macro variables and their values are stored in symbol tables, which in turn are held in memory. Not only are there a number of ways to create macro variables, but they can be created in a wide variety of situations. How they are created and under what circumstances affects the variable's scope, that is, how and where the macro variable is stored and retrieved. There are a number of misconceptions about macro variable scope and about how the macro variables are assigned to symbol tables. These misconceptions can cause problems that the new, and sometimes even the experienced, macro programmer does not anticipate. Understanding the basic rules for macro variable assignment can help the macro programmer solve some of these problems that are otherwise quite mystifying.

The scheduling of surgical operations in a hospital is a complex problem, with each surgical specialty having to satisfy their demand while competing for resources with other hospital departments. This project extends the construction of a weekly timetable, the Master Surgery Schedule, which assigns surgical specialties to operating theater sessions by taking into account the post-surgery resource requirements, primarily post-operative beds on hospital wards. Using real data from the largest teaching hospital in Wales, UK, this paper describes how SAS^® has been used to analyze large data sets to investigate the relationship between the operating theater schedule and the demand for beds on wards in the hospital. By understanding this relationship, a more well-informed and robust operating theater schedule can be produced that delivers economic benefit to the hospital and a better experience for the patients by reducing the number of cancelled operations caused by the unavailability of beds on hospital wards.

Discover how SAS^® leverages field marketing programs to support AllAnalytics.com, a sponsored third-party community. This paper explores the use of SAS software, including SAS^® Enterprise Guide^®, SAS^® Customer Experience Analytics, and SAS^® Marketing Automation to enable marketers to have better insight, better targeting, and better response from SAS programs.

Inference of variance components in linear mixed effect models (LMEs) is not always straightforward. I introduce and describe a flexible SAS^® macro (%COVTEST) that uses the likelihood ratio test (LRT) to test covariance parameters in LMEs by means of the parametric bootstrap. Users must supply the null and alternative models (as macro strings), and a data set name. The macro calculates the observed LRT statistic and then simulates data under the null model to obtain an empirical p-value. The macro also creates graphs of the distribution of the simulated LRT statistics. The program takes advantage of processing accomplished by PROC MIXED and some SAS/IML^® functions. I demonstrate the syntax and mechanics of the macro using three examples.
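The last step of a parametric bootstrap test can be shown without any model fitting. The Python sketch below is not the %COVTEST macro: it assumes the log-likelihoods of the null and alternative fits are already available, computes the LRT statistic, and compares the observed value with statistics simulated under the null model. The add-one correction in the p-value is a common convention assumed here and may differ from what the macro does; all numbers are made up.

```python
def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio test statistic: 2 * (logL_alt - logL_null)."""
    return 2.0 * (loglik_alt - loglik_null)


def empirical_p_value(observed, simulated):
    """Share of simulated null statistics at least as large as the observed one.

    Uses the (count + 1) / (B + 1) convention, which is an assumption here,
    so that the estimated p-value is never exactly zero.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


# Hypothetical log-likelihoods from the two fitted models.
observed = lrt_statistic(loglik_null=-105.0, loglik_alt=-102.0)

# Hypothetical LRT statistics from data sets simulated under the null model.
simulated = [0.0, 0.5, 1.2, 2.0, 3.1, 4.4, 5.0, 6.5, 7.2]

p_value = empirical_p_value(observed, simulated)
print(observed, p_value)
```

A real run would use hundreds or thousands of simulated data sets, each refit under both models, rather than the nine values listed here.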

The use of Cohen's kappa has enjoyed a growing popularity in the social sciences as a way of evaluating rater agreement on a categorical scale. The kappa statistic can be calculated as Cohen first proposed it in his 1960 paper or by using any one of a variety of weighting schemes. The most popular among these are the linear weighted kappa and the quadratic weighted kappa. Currently, SAS^® users can produce the kappa statistic of their choice through PROC FREQ and the use of relevant AGREE options. Complications arise, however, when the data set does not contain a completely square cross-tabulation of data. That is, this method requires that both raters have at least one data point for every available category. There have been many solutions offered for this predicament. Most suggested solutions include the insertion of dummy records into the data and then assigning a weight of zero to those records through an additional class variable. The result is a multi-step macro, extraneous variable assignments, and potential data integrity issues. The author offers a much more elegant solution by producing a segment of code which uses brute force to calculate Cohen's kappa as well as all popular variants. The code uses nested PROC SQL statements to provide a single conceptual step which generates kappa statistics of all types, even those that the user wishes to define for themselves.
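The brute-force idea can be illustrated directly from the definition, kappa = 1 - (weighted observed disagreement) / (weighted expected disagreement). The Python sketch below is not the author's PROC SQL code. It takes the full category list as an input, so a category that one rater never uses simply contributes a zero row or column and no dummy records are needed. The function name, weight labels, and data are hypothetical.

```python
def cohen_kappa(ratings_a, ratings_b, categories, weight="none"):
    """Cohen's kappa computed directly from its definition.

    categories: ordered list of ALL possible categories (at least two),
                including any that a rater never used.
    weight:     "none" (Cohen 1960), "linear", or "quadratic".
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed proportions for every cell of the full k x k table.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance.
        if weight == "linear":
            return abs(i - j) / (k - 1)
        if weight == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    d_obs = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    d_exp = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 - d_obs / d_exp


# Hypothetical ratings: 20 yes/yes, 5 yes/no, 10 no/yes, 15 no/no.
rater_a = ["yes"] * 25 + ["no"] * 25
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15

# "maybe" is an available category that neither rater used.
kappa = cohen_kappa(rater_a, rater_b, ["yes", "no", "maybe"])
print(round(kappa, 4))
```

A user-defined weighting scheme would only require replacing the `disagreement` function. The sketch does not guard against the degenerate case in which expected disagreement is zero, for example when both raters use a single category throughout.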

A case-control study, in its most basic form, compares a case series to a matched control series; such studies are commonly implemented in the field of public health. While matching is intended to eliminate confounding, the main potential benefit of matching in case-control studies is a gain in efficiency. There are many known methods for selecting a potential match or matches (in the case of 1:n studies) per case, the most prominent being the distance-based approach and matching on propensity scores. In this paper, we go through both, compare their results, and present a macro capable of performing both.

Can clustering discharge records improve a predictive model's overall fit statistic? Do the predictors differ across segments? This talk describes the methods and results of data mining pediatric IBD patient records from my Analytics 2013 poster. SAS^® Enterprise Miner^™ 12.1 was used to segment patients and model important predictors for the length of hospital stay using discharge records from the national Kids' Inpatient Database. Profiling revealed that patient segments were differentiated by primary diagnosis, operating room procedure indicator, comorbidities, and factors related to admission and disposition of patient. Cluster analysis of patient discharges improved the overall average square error of predictive models and distinguished predictors that were unique to patient segments.

The graphical display of the individual data is important in understanding the raw data and the relationship between the variables in the data set. You can explore your data to ensure statistical assumptions hold by detecting and excluding outliers if they exist. Since you can visualize what actually happens to individual subjects, you can make your conclusions more convincing in statistical analysis and interpretation of the results. SAS^® provides many tools for creating graphs of individual data. In some cases, multiple tools need to be combined to make a specific type of graph that you need. Examples are used in this paper to show how to create graphs of individual data using the SAS^® ODS Graphics procedures (SG procedures).

Missing data commonly occurs in medical, psychiatric, and social research. The SAS^® MI and MIANALYZE procedures are often used to generate multiple imputations and then provide valid statistical inferences based on them. However, MIANALYZE cannot be used to combine type-III analyses obtained from multiply imputed data sets. In this manuscript, we write a macro to combine the type-III analyses generated from the SAS MIXED procedure based on multiple imputations. The proposed method can be extended to other procedures reporting type-III analyses, such as GENMOD and GLM.

Graphic software users are confronted with what I call Options Over-Choice, and with defaults that are designed to easily give you a result, but not necessarily the best result. This presentation and paper focus on guidelines for communication-effective data visualization. It demonstrates their practical implementation, using graphic examples likely to be adaptable to your own work. Code is provided for the examples. Audience members will receive the latest update of my tip sheet compendium of graphic design principles. The examples use SAS^® tools (traditional SAS/GRAPH^® or the newer ODS graphics procedures that are available with Base SAS^®), but the design principles are really software independent. Come learn how to use data visualization to inform and influence, to reveal and persuade, using tips and techniques developed and refined over 34 years of working to get the best out of SAS^® graphic software tools.

Non-cognitive assessments, which measure constructs such as time management, goal-setting, and personality, are becoming more prevalent today in research within the domains of academic performance and workforce readiness. Many instruments that are used for this purpose contain a large number of items that can each be assigned to specific facets of the larger construct. The factor structure of each instrument emerges from a mixture of psychological theory and empirical research, often by doing exploratory factor analysis (EFA) using the SAS^® procedure PROC FACTOR. Once an initial model is established, it is important to perform confirmatory factor analysis (CFA) to confirm that the hypothesized model provides a good fit to the data. If outcome data such as grades are collected, structural equation modeling (SEM) should also be employed to investigate how well the assessment predicts these measures. This paper demonstrates how the SAS procedure PROC CALIS is useful for performing confirmatory factor analysis and structural equation modeling. Examples of these methods are demonstrated and proper interpretation of the fit statistics and resulting output is illustrated.

The INTCK function is used to obtain the number of time intervals between two dates. The INTCK function comes with arguments and argument modifiers that enable us to perform a variety of date-related manipulations. This paper deals with a simple real-world usage of the INTCK function: calculating the frequency of each day of the week between the start and end day of a trip. The INTCK function with its arguments can directly calculate the number of occurrences of each day of the week, as illustrated in this paper. The same usage of the INTCK function using PROC SQL is also presented in this paper. All the code executed and presented in this paper involves Base SAS^® Release 9.3 only.
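To make the target quantity concrete, the short Python sketch below counts how often each day of the week falls within a trip, with both endpoints included. It reproduces the result being computed, not the INTCK technique itself, and the trip dates and function name are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_counts(start, end):
    """Count occurrences of each day of the week from start to end, inclusive."""
    counts = Counter({name: 0 for name in DAY_NAMES})
    n_days = (end - start).days + 1
    for offset in range(n_days):
        current = start + timedelta(days=offset)
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        counts[DAY_NAMES[current.weekday()]] += 1
    return counts


# Hypothetical trip covering all of March 2014.
trip = weekday_counts(date(2014, 3, 1), date(2014, 3, 31))
print(dict(trip))
```

March 2014 began on a Saturday and has 31 days, so Saturday, Sunday, and Monday each occur five times and the other days four times.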

LaTeX is a free document creation package that is often used to create journal articles. It provides the capability to create very specific formatting and to write a wide variety of formulas. Using ODS, SAS^® can write documents to a LaTeX file, which can then be compiled through LaTeX into PDF files. This paper briefly reviews the basic syntax and options to produce these files. Then, we look at how to create a new tagset to make changes to the standard ODS LaTeX templates to create the non-gridded table appearance that is typically seen in journal articles. We also explore how to write special characters and equations not otherwise available through ODS LaTeX.

The first table in many research journal articles is a statistical comparison of demographic traits across study groups. It might not be exciting, but it's necessary. And although SAS^® calculates these numbers with ease, it is a time-consuming chore to transfer these results into a journal-ready table. Introducing the time-saving deluxe %MAKETABLE SAS macro: it does the heavy work for you. It creates a Microsoft Word table of up to four comparative groups reporting t-tests, chi-square, ANOVA, or median test results, including a p-value. You specify only a one-line macro call for each line in the table, and the macro takes it from there. The result is a tidily formatted journal-ready Word table that you can easily include in a manuscript, report, or Microsoft PowerPoint presentation. For statisticians and researchers needing to summarize group comparisons in a table, this macro saves time and relieves you from the drudgery of trying to make your output neat and pretty. And after all, isn't that what we want computing to do for us?

Business Intelligence platforms provide a bridge between expert data analysts and decision-makers and other end-users. But what do you do when you can identify no system that meets both your needs and your budget? If you are the Consolidated Data Analysis Center in the HHS Office of Inspector General, you use SAS^® Enterprise BI Server and the SAS^® Stored Process Web Application to build your own. This presentation covers the inception, design, and implementation of the PAYment by Geographic Area (PAYGAR) system, which uses only SAS^® Enterprise BI tools, namely the SAS Stored Process Web Application, PROC GMAP, and HTML/JAVA embedded in a DATA step, to create an interactive platform for presenting and exploring data that has a geographic component. In particular, the presentation reviews how we created a system of chained stored processes to enable a user to select the data to be presented, navigate through different geographic levels, and display companion reports related to the current data and geographic selections. It also covers the creation of the HTML front-end that sits over and manages the system. Throughout, the presentation emphasizes the scalability of PAYGAR, which the SAS Stored Process Web Application facilitates.

Merging or joining data sets is an integral part of the data consolidation process. Within SAS^®, there are numerous methods and techniques that can be used to combine two or more data sets. We commonly think that within the DATA step the MERGE statement is the only way to join these data sets, while in fact, the MERGE is only one of numerous techniques available to us to perform this process. Each of these techniques has advantages, and some have disadvantages. The informed programmer needs to have a grasp of each of these techniques if the correct technique is to be applied. This paper covers basic merging concepts and options within the DATA step, as well as a number of techniques that go beyond the traditional MERGE statement. These include fuzzy merges, double SET statements, and the use of key indexing. The discussion will include the relative efficiencies of these techniques, especially when working with large data sets.

Cross-visit checks are a vital part of data cleaning for longitudinal studies. The nature of longitudinal studies encourages repeatedly collecting the same information. Sometimes, these variables are expected to remain static, go away, increase, or decrease over time. This presentation reviews the naive and the better approaches at handling one-variable and two-variable consistency checks. For a single-variable check, the better approach features the new ALLCOMB function, introduced in SAS^® 9.2. For a two-variable check, the better approach uses the .first pseudo-class to flag inconsistencies. This presentation will provide you the tools to enhance your longitudinal data cleaning process.

A revolution is taking place in the U.S. at both the national and state level in the area of health care transparency. Large amounts of data on the health of communities, the quality of health care providers, and the cost of health care are being collected and made available by both levels of government to a variety of stakeholders. The surfacing of this data and the consumption of it by health care decision makers unfolds a new opportunity to view, explore, and analyze health care data in novel ways. Furthermore, this data provides the health care system an opportunity to advance the achievement of the Triple Aim. Data transparency will bring a sea change to the world of health care by necessitating new ways of communicating information to end users such as payers, providers, researchers, and consumers of health care. This paper examines the information needs of public health care payers such as Medicare and Medicaid, and discusses the convergence of health care and data visualization in creating consumable health insights that will aid in achieving cost containment, quality improvement, and increased accessibility for populations served. Moreover, using claims data and SAS^® Visual Analytics, it examines how data visualization can help identify the most critical insights necessary to managing population health. If health care payers can analyze large amounts of claims data effectively, they can improve service and care delivery to their recipients.

The evolution of the mobile landscape has created a shift in the workforce that now favors mobile devices over traditional desktops. Considering that today's workforce is not always in the office or at their desks, new opportunities have been created to deliver report content through innovative mobile experiences. SAS^® Mobile BI for both iOS and Android tablets complements the SAS^® Visual Analytics offering by providing anytime, anywhere access to reports containing information that consumers need. This paper presents best practices and tips on how to optimize reports for mobile users, taking into consideration the constraints of limited screen real estate and connectivity, and answers a few frequently asked questions. Discover how SAS Mobile BI captures the power of mobile reporting to prepare for the vast growth that is predicted in the future.

New, innovative analytical techniques are necessary to extract patterns in big data that have temporal and geo-spatial attributes. An approach to this problem is required when geo-spatial time series data sets, which have billions of rows and the precision of exact latitude and longitude data, make it extremely difficult to locate patterns of interest. The usual temporal bins of years, months, days, hours, and minutes often do not allow the analyst to have control of the precision necessary to find patterns of interest. Geohashing is a string representation of two-dimensional geometric coordinates. Time hashing is a similar representation, which maps time to preserve all temporal aspects of the date and time of the data into a one-dimensional set of data points. Geohashing and time hashing are both forms of a Z-order curve, which maps multidimensional data into single dimensions and preserves the locality of the data points. This paper explores the use of a multidimensional Z-order curve, combining both geohashing and time hashing, that is known as geo-temporal hashing or space-time boxes using SAS^®. This technique provides a foundation for reducing the data into bins that can yield new methods for pattern discovery and detection in big data.
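The Z-order idea is easiest to see in the standard public geohash encoding, which interleaves longitude and latitude bits and writes each group of five bits as one base-32 character. The Python sketch below implements only that standard spatial encoding; it is not the paper's SAS code, and the time-hashing half of the space-time box is not shown because the abstract does not specify its layout. The function name is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string.

    Bits alternate between longitude and latitude (longitude first); each
    bit records which half of the current interval contains the coordinate.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            # Five interleaved bits make one base-32 character.
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


coarse = geohash_encode(42.605, -5.603, precision=5)
fine = geohash_encode(42.605, -5.603, precision=8)
print(coarse, fine)
```

Truncating a geohash yields the code of the larger enclosing cell, so the string length acts as the bin-size control: here the 5-character code `ezs42` is a prefix of the 8-character code for the same point.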

The health-care industry in the United States is going through a paradigm shift moving away from its focus on treating diseases and toward promoting health, wellness, and preventive public health programs, so that both the individuals and the government can maintain a healthy bottom line. The high-level business problem is to reduce the expected medical costs and number of medical services required by the people of New Hampshire by implementing successful disease prevention programs. The objective is to identify which among the six prevention programs will successfully improve the health of the residents of New Hampshire over nine future years (2012–2020). The business scenario of the case is to identify the preventive programs that are most effective in reducing the costs in New Hampshire and to invest the money in those programs so that the overall health-care overhead costs can be reduced or controlled. The effectiveness of implementing the preventive programs was evaluated using SAS^® Enterprise Guide^® 5.1 and SAS^® Enterprise Miner^™ 12. Time series analysis, in particular, forecasting, is used to project the future health-care services and costs for the years from 2012 to 2020. Our analysis showed that all the preventive programs should be implemented concurrently. The minimum anticipated savings in cost is approximately $572,111 or 3.3% of the expected baseline cost of $17,297,931. Therefore, our recommendation is to use this cost reduction figure, $572,111, as the initial funding investment toward initiating the six prevention programs concurrently, so that tangible results can be noticed by 2020.
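
The stated savings share can be checked directly from the two dollar figures quoted in the abstract. The short Python sketch below only reproduces that arithmetic; the variable names are chosen for this example.

```python
savings = 572_111      # minimum anticipated savings, in dollars
baseline = 17_297_931  # expected baseline cost, in dollars

share = savings / baseline
print(f"Savings as a share of baseline cost: {share:.1%}")
```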

Vistaprint saw the opportunity in the printing market to get more out of high-volume printing by grouping similar orders in large groups. They heavily rely on technology to handle design, printing, and order handling and use the Internet as a medium. With their successful expansion across the world, the issue they were facing was a lot of one-time buyers and a lot of registered users who didn't finish the check-out. The need to implement a retention strategy was the next logical step, for which they chose SAS^® Campaign Management. In this session, Vistaprint explains how they use campaign management for retention and how the project was addressed. They will also touch on how the concept of high performance could open up new possibilities for them.

Both recent banking and insurance risk regulations require effective aggregation of risks. To determine the total enterprise risk for a financial institution, all risks must be aggregated and analyzed. Typically, there are two approaches: bottom-up and top-down risk aggregation. In either approach, financial institutions face challenges due to various levels of risks with differences in metrics, data source, and availability. First, it is especially complex to aggregate risk. A common view of the dependence between all individual risks can be hard to achieve. Second, the underlying data sources can be updated at different times and can have different horizons. This in turn requires an incremental update of the overall risk view. Third, the risk needs to be analyzed across on-demand hierarchies. This paper presents SAS^® solutions to these challenges. To address the first challenge, we consider a mixed approach to specify copula dependence between individual risks and allow step-by-step specification with a minimal amount of information. Next, the solution leverages an event-driven architecture to update results on a continuous basis. Finally, the platform provides a self-service reporting and visualization environment for designing and deploying reports across any hierarchy and granularity on the fly. These capabilities enable institutions to create an accurate, timely, comprehensive, and adaptive risk-aggregation and reporting system.

The implicit loop refers to the DATA step repetitively reading data and creating observations, one at a time. The explicit loop, which uses the iterative DO, DO WHILE, or DO UNTIL statements, is used to repetitively execute certain SAS^® statements within each iteration of the DATA step execution. Explicit loops are often used to simulate data and to perform a certain computation repetitively. However, when an explicit loop is used along with array processing, the applications are extended widely, which includes transposing data, performing computations across variables, and so on. To be able to write a successful program that uses loops and arrays, one needs to know the contents in the program data vector (PDV) during the DATA step execution, which is the fundamental concept of DATA step programming. This workshop covers the basic concepts of the PDV, which is often ignored by novice programmers, and then illustrates how to use loops and arrays to transform lengthy code into more efficient programs.

Data quality depends on review by operational stewards of the content. Volumes of complex data disappear as e-mail attachments. Is there a critical data shift that might be missed? Embedding a summary image drives expert data review from 15% to 87%. Downstream error rate is significantly reduced. Increased accuracy to variable physician compensation measures results.

The availability of specialized programming and analysis resources in academic medical centers is often limited, creating a significant challenge for clinical research. The current work describes how Base SAS^® and SAS^® Enterprise Guide^® are being used to empower research staff so that they are less reliant on these scarce resources.

For data analysts, one of the most important steps after manipulating and analyzing the data set is to create a report for it. Nowadays, many statistics tables and reports are generated as HTML files that can be easily accessed through the Internet. However, the SAS^® Output Delivery System (ODS) HTML output has many limitations on interacting with users. In this paper, we introduce a method to enhance the traditional ODS HTML output by using jQuery (a JavaScript library). A macro was developed to implement this idea. Compared to the standard HTML output, this macro can add sort, pagination, search, and even dynamic drilldown function to the ODS HTML output file.

This paper is based on the belief that debugging your programs is not only necessary, but also a good way to gain insight into how SAS^® works. Once you understand why you got an error, a warning, or a note, you'll be better able to avoid problems in the future. In other words, people who are good debuggers are good programmers. This paper covers common problems including missing semicolons and character-to-numeric conversions, and the tricky problem of a DATA step that runs without suspicious messages but, nonetheless, produces the wrong results. For each problem, the message is deciphered, possible causes are listed, and how to fix the problem is explained.

In evaluation instruments and tests, individual items are often collected using an ordinal measurement or Likert-type scale. Typically, measures such as Cronbach's alpha are estimated using the standard Pearson correlation. Gadderman and Zumbo (2012) illustrate how using the standard Pearson correlations may yield biased estimates of reliability when the data are ordinal and present methodology for using the polychoric correlation in reliability estimates as an alternative. This session shows how to implement the methods of Gadderman and Zumbo using SAS^® software. An example will be presented that incorporates these methods in the estimation of the reliability of an active learning post-occupancy evaluation instrument developed by Steelcase Education Solutions researchers.
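
To show where the choice of correlation matrix enters a reliability estimate, the Python sketch below computes the standardized form of Cronbach's alpha from an item correlation matrix, alpha = k * r / (1 + (k - 1) * r), where k is the number of items and r is the mean off-diagonal correlation. Passing a polychoric matrix instead of a Pearson matrix changes only the input. Whether the session uses exactly this form is an assumption, and estimating the polychoric correlations themselves is outside the scope of the sketch.

```python
def standardized_alpha(corr):
    """Standardized Cronbach's alpha from a k x k item correlation matrix.

    `corr` may hold Pearson or polychoric correlations; the formula is the same.
    """
    k = len(corr)
    if k < 2:
        raise ValueError("at least two items are required")
    # Average the off-diagonal entries (the inter-item correlations).
    off_diag = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    mean_r = sum(off_diag) / len(off_diag)
    return k * mean_r / (1 + (k - 1) * mean_r)
```

For three items that all correlate at 0.5, the function returns 0.75.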

Many organizations need to forecast large numbers of time series that are organized in a hierarchical fashion. Good forecasting practices recommend that several hierarchies be used and that each hierarchy contain a homogeneous set of time series with similar statistical properties. Modeling and forecasting homogeneous time series hierarchies provide better out-of-sample forecast performance. Because an organization might have many time series hierarchies, it is often desirable to model and forecast these hierarchical time series in parallel for computational efficiency. Additionally, it is often desirable to aggregate forecasts from several nonhomogeneous time series hierarchies for report generation. This paper demonstrates these techniques for forecasting time series hierarchies in parallel and for aggregating the forecasts by using SAS^® Forecast Server and SAS^® Grid Manager.

SAS^® can easily perform calculations and export the result to Microsoft Excel in a report. However, sometimes you need Excel to have a formula or a function in a cell and not just a number. Whether it's for a boss who wants to see a SUM formula in the total cell or to have automatically updating reports that can be sent to people who don't use SAS to be completed, exporting formulas to Excel can be very powerful. This paper illustrates how, by using PROC REPORT and PROC PRINT along with the ExcelXP tagset, you can easily export formulas and functions into Excel directly from SAS. The method outlined in this paper requires Base SAS^® 9.1 or higher and Excel 2002 or later and requires a basic understanding of the ExcelXP tagset.

The growing adoption of electronic systems for keeping medical records provides an opportunity for health care practitioners and biomedical researchers to access traditionally unstructured data in a new and exciting way. Pathology reports, progress notes, and many other sections of the patient record that are typically written in a narrative format can now be analyzed by employing natural language processing contextual extraction techniques to identify specific concepts contained within the text. Linking these concepts to a standardized nomenclature (for example, SNOMED CT, ICD-9, ICD-10, and so on) frees analysts to explore and test hypotheses using these observational data. Using SAS^® software, we have developed a solution in order to extract data from the unstructured text found in medical pathology reports, link the extracted terms to biomedical ontologies, join the output with more structured patient data, and view the results in reports and graphical visualizations. At its foundation, this solution employs SAS^® Enterprise Content Categorization to perform entity extraction using both manually and automatically generated concept definition rules. Concept definition rules are automatically created using technology developed by SAS, and the unstructured reports are scored using the DS2/SAS^® Content Categorization API. Results are post-processed and added to tables compatible with SAS^® Visual Analytics, thus enabling users to visualize and explore data as required. We illustrate the interrelated components of this solution with examples of appropriate use cases and describe manual validation of performance and reliability with metrics such as precision and recall. We also provide examples of reports and visualizations created with SAS Visual Analytics.
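
The validation metrics named above have simple definitions: precision is the fraction of extracted concepts that are correct, and recall is the fraction of the correct concepts that were extracted. The Python sketch below computes both for one document; the function name and the idea of comparing extracted terms against a manually annotated reference set are assumptions made for illustration, not a description of the authors' validation code.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted terms against a reference set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```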

When you want to know the details about a small subset of a much larger data set, it can take a long time to select the records you need. This paper shows you how to create a user-defined SAS^® format to pull only the observations that you want out of a big data source. Even when selecting a million records out of data sets that can have more than 100 million records, this method is much quicker than either a PROC SQL join or a SAS merge.
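
The general idea of a format-based lookup is to turn the list of wanted keys into an in-memory mapping and then test each record of the large source against it in a single pass, with no sort or join. The Python sketch below is an analogy only, not the paper's SAS code: the dictionary stands in for the user-defined format, and the field name `id` and the labels "Y" and "N" are assumptions chosen for this example.

```python
def build_lookup(wanted_keys):
    """Stand-in for the user-defined format: every wanted key maps to 'Y'."""
    return {key: "Y" for key in wanted_keys}


def select_records(records, lookup, key_field="id"):
    """Keep only records whose key maps to 'Y'; unlisted keys default to 'N'."""
    return [rec for rec in records if lookup.get(rec[key_field], "N") == "Y"]
```

Each record costs one constant-time lookup, so the work grows with the size of the large source rather than with the cost of matching two sorted or indexed tables.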

For almost two decades, Western Kentucky University's Office of Institutional Research (WKU-IR) has used SAS^® to help shape the future of the institution by providing faculty and administrators with information they can use to make a difference in the lives of their students. This presentation provides specific examples of how WKU-IR has shaped the policies and practices of our institution and discusses how WKU-IR moved from a support unit to a key strategic partner. In addition, the presentation covers the following topics: How the WKU Office of Institutional Research developed over time; Why WKU abandoned reactive reporting for a more accurate, convenient system using SAS^® Enterprise Intelligence Suite for Education; How WKU shifted from investigating what happened to predicting outcomes using SAS^® Enterprise Miner^™ and SAS^® Text Miner; How the office keeps the system relevant and utilized by key decision makers; What the office has accomplished and key plans for the future.

Beginning with SAS^® 9.2, ODS Graphics introduces a whole new way of generating graphs using SAS^®. With just a few lines of code, you can create a wide variety of high-quality graphs. This paper covers the three basic ODS Graphics procedures: SGPLOT, SGPANEL, and SGSCATTER. SGPLOT produces single-celled graphs. SGPANEL produces multi-celled graphs that share common axes. SGSCATTER produces multi-celled graphs that might use different axes. This paper shows how to use each of these procedures in order to produce different types of graphs, how to send your graphs to different ODS destinations, how to access individual graphs, and how to specify properties of graphs, such as format, name, height, and width.

This paper illustrates some SAS^® graphs that can be useful for variable selection in predictive modeling. Analysts are often confronted with hundreds of candidate variables available for use in predictive models, and this paper illustrates some simple SAS graphs that are easy to create and that are useful for visually evaluating candidate variables for inclusion or exclusion in predictive models. The graphs illustrated in this paper are bar charts with confidence intervals using the GCHART procedure and comparative histograms using the UNIVARIATE procedure. The graphs can be used for most combinations of categorical or continuous target variables with categorical or continuous input variables. This paper assumes the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression.

Missing data is an ever-present issue, and analysts should exercise proper care when dealing with it. Depending on the data and the analytical approach, this problem can be addressed by simply removing records with missing data. However, in most cases, this is not the best approach. In fact, this can potentially result in inaccurate or biased analyses. The SAS^® programming language offers many DATA step processes and functions for handling missing values. However, some analysts might not like or be comfortable with programming. Fortunately, SAS^® Enterprise Guide^® can provide those analysts with a number of simple built-in tasks for discovering missing data and diagnosing their distribution across fields. In addition, various techniques are available in SAS Enterprise Guide for imputing missing values, varying from simple built-in tasks to more advanced tasks that might require some customized SAS code. The focus of this presentation is to demonstrate how SAS Enterprise Guide features such as Query Builder, Filter and Sort Wizard, Describe Data, Standardize Data, and Create Time Series address missing data issues through the point-and-click interface. As an example of code integration, we demonstrate the use of a code node for more advanced handling of missing data. Specifically, this demonstration highlights the power and programming simplicity of PROC EXPAND (SAS/ETS^® software) in imputing missing values for time series data.

Have you ever needed additional data that was only accessible via a web service in XML or JSON? In some situations, the web service is set up to only accept parameter values that return data for a single observation. To get the data for multiple values, we need to iteratively pass the parameter values to the web service in order to build the necessary dataset. This paper shows how to combine the SAS^® hash object with the FILEVAR= option to iteratively pass a parameter value to a web service and input the resulting JSON or XML formatted data.

The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS^® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.

Recent studies suggest that unstructured data, such as customer comments or feedback, can enhance the power of existing predictive models. SAS^® Text Miner can generate singular value decomposition (SVD) units from text documents, which are a vectorial representation of terms in documents. These SVDs, when used as additional inputs along with the existing structured input variables, often prove to capture the response better. However, SVD units are black-box variables that are not easy to interpret or explain. This is a big hindrance to winning over the decision makers in organizations to incorporate these derived textual data components in the models. In this paper, we demonstrate a new and powerful feature in SAS^® Text Miner 12.1 that helps in explaining the SVDs or the text cluster components. We discuss two important methods that are useful for interpreting them. For this purpose, we used data from a television network company that has transcripts of its call center notes from three prior calls of each customer. We are able to extract the key terms from the call center notes in the form of Boolean rules, which have contributed to the prediction of customer churn. These rules provide an intuitive sense of which set of terms, when occurring in either the presence or absence of another set of terms in the call center notes, might lead to a churn. They also provide insights into which customers are at a bigger risk of churning from the company's services and, more importantly, why.

No matter how long you've been programming in SAS^®, using and manipulating dates still seems to require effort. Learn all about SAS dates, the different ways they can be presented, and how to make them useful. This paper includes excellent examples for dealing with raw input dates, functions to manage dates, and outputting SAS dates into other formats. Included is all the date information you will need: date and time functions, informats, formats, and arithmetic operations.
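
Much of the effort comes from the fact that a SAS date value is simply a number: the count of days since January 1, 1960, with formats controlling how it is displayed. The Python sketch below mirrors that representation to show why date arithmetic reduces to integer arithmetic; the function names are invented for this example and are not SAS functions.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def sas_to_date(sas_value):
    """Convert a SAS date value (days since 1960-01-01) to a calendar date."""
    return SAS_EPOCH + timedelta(days=sas_value)


def date_to_sas(calendar_date):
    """Convert a calendar date to its SAS date value."""
    return (calendar_date - SAS_EPOCH).days
```

Dates before 1960 map to negative values, and the number of days between two dates is the difference of their SAS date values.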

This paper illustrates a permutation method for implementing multiple comparisons on Pearson's Chi-square test for an R×C contingency table, using the SAS^® FREQ procedure and a newly developed SAS macro called CHISQ_MC. This method is analogous to the Tukey-type multiple comparison method for one-way analysis of variance.

Administrators at Western Kentucky University rely on the Institutional Research department to perform detailed statistical analyses to deepen the understanding of issues associated with enrollment management, student and faculty performance, and overall program operations. This paper presents several instances of analyses performed for the university to help it identify and recruit suitable candidates, uncover root causes in grade and enrollment trends, evaluate faculty effectiveness, and assess the impact of student characteristics, programs, or student activities on retention and graduation rates. The paper briefly discusses the data infrastructure created and used by Institutional Research. For each analysis performed, it reviews the SAS^® program and key components of the SAS code involved. The studies presented include the use of SAS^® Enterprise Miner^™ to create a retention model incorporating dozens of student background variables. It shows an examination of grade trends in the same courses taught by different faculty and subsequent student behavior and success, providing insights into the nuances and subtleties of evaluating faculty performance. Another analysis uncovers the possible influence of fraternities and sororities in freshmen algebra courses. Two investigations explore the impact of programs on student retention and graduation rates. Each example and its findings illustrate how Institutional Research can support the administration of university operations. The target audience is any SAS professional interested in learning more about Institutional Research in higher education and how SAS software is used by an Institutional Research department to serve its organization.

This presentation addresses two main topics. The first topic focuses on the industry's norms and the best practices for building internal credit ratings (PD, EAD, and LGD). Although there is not any capital relief to local US banks using internal credit ratings (the US has not adopted the Internal Rating Based approach of Basel II, with the exception of the top 10 banks), there is an increased responsiveness in credit ratings modeling for the last two years in the US banking industry. The main reason is the added value a bank can achieve from these ratings, and that is the focus of the second part of this presentation. It describes our journey (a client story) for getting there, introducing the SAS^® project. Even more importantly, it describes how we use credit ratings in order to achieve effective credit risk management and get real added value out of that investment. The key success factor for achieving it is to effectively implement ratings within the credit process and throughout decision making. Only then can ratings be used to improve risk-adjusted return on capital, which is the high-end objective of all of us.

By understanding the actual gambling behavior of individuals over the Internet, we develop markers that identify behavioral patterns, which in turn can be used to predict a subscriber's level of gambling risk. The data set contains 4,056 subscribers. Using SAS^® Enterprise Miner^™ 12.1, a set of models is run to predict which subscribers are likely to become high-risk Internet gamblers. The data contains 114 variables, such as first active date and first active product used on the website, as well as the characteristics of the game, such as fixed odds, poker, casino, games, etc. Other measures of a subscriber's data, such as money put at stake and what odds are being bet, are also included. These variables provide a comprehensive view of a subscriber's behavior while gambling over the website. The target variable is modeled as a binary variable, 0 indicating a risky gambler and 1 indicating a controlled gambler. The data is a typical example of real-world data with many missing values and hence had to be transformed and imputed before being considered for analysis. The model comparison algorithm of SAS Enterprise Miner 12.1 was used to determine the best model. The Stepwise Regression model performs the best among a set of 25 models, which were run using over 100 permutations of each model. The Stepwise Regression model predicts a high-risk Internet gambler at an accuracy of 69.63% with variables such as wk4frequency and wk3frequency of bets.

Traditional SAS^® programs typically consist of a series of SAS DATA steps, which refine input data sets until the final data set or report is reached. SAS DATA steps do not run in-database. However, SAS^® Enterprise Guide^® users can replicate this kind of iterative programming and have the resulting process flow run in-database by linking a series of SAS Enterprise Guide Query Builder tasks that output SAS views pointing at data that resides in a Teradata database, right up to the last Query Builder task, which generates the final data set or report. This session both explains and demonstrates this functionality.

The soaring number of publicly available data sets across disciplines has allowed for increased access to real-life data for use in both research and educational settings. These data often leverage cost-effective complex sampling designs including stratification and clustering, which allow for increased efficiency in survey data collection and analyses. Weighting becomes a necessary component in these survey data in order to properly calculate variance estimates and arrive at sound inferences through statistical analysis. Generally speaking, these weights are included with the variables provided in the public use data, though an explanation for how and when to use these weights is often lacking. This paper presents an analysis using the California Health Interview Survey to compare weighted and non-weighted results using SAS^® PROC LOGISTIC and PROC SURVEYLOGISTIC.

A common complaint from users working on identifying fraud and abuse in Medicare is that teams focus on operational applications, static reports, and high-level outliers. But, when faced with the need to constantly evaluate changing Medicare provider and beneficiary or enrollee dynamics, users are clamoring for more dynamic and accurate detection approaches. Providing these organizations with a data discovery and predictive analytics framework that leverages Hadoop and other big data approaches, while providing a clear path for teams to make more fact-based decisions more quickly is very important in pre- and post-fraud and abuse analysis. Organizations that do pursue a framework and a reusable services-based data discovery and analytics framework and architecture approach enjoy greater success in supporting data management, reporting, and analytics demands. They can quickly turn models into prioritized alerts and avoid improper or fraudulent payments. A successful framework should enable organizations to come up with efficient fraud, waste, and abuse models to address complex schemes; identify fraud, waste, and abuse vulnerabilities; and shorten triage efforts using a variety of data sourced from big data platforms like Hadoop and other relational database management systems. This paper talks about the data management, data discovery, predictive analytics, and social network analysis capabilities that are included in the SAS fraud framework and how a unified approach can significantly reduce the lifecycle of building and deploying fraud models. We hope this paper will provide IT leaders with a clear path for resolving issues from the simple to the incredibly complex, through a measured and scalable approach for delivering value for fraud, waste, and abuse models by providing deep insights to support evidence-based investigations.

Email is an important marketing channel for digital marketers. We can stay connected with our subscribers and attract them with relevant content as long as they are still subscribed to our email communication. In this session, we discuss why it is important to manage opt-out risk, how we predicted opt-out risk, and how we proactively manage opt-out using the models we developed.

This paper describes a technique for calibrating street address match logic to maximize the match rate without introducing excessive erroneous matching.

One of the most common questions about logistic regression is "How do I know if my model fits the data?" There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-squared) and goodness of fit tests (like the Pearson chi-square). This presentation looks first at R-squared measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.
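
Both of the measures attributed to McFadden and Tjur can be computed from the observed binary outcomes and the fitted probabilities alone. McFadden's measure is 1 minus the ratio of the fitted model's log-likelihood to that of an intercept-only model; Tjur's measure is the mean fitted probability among events minus the mean among non-events. The Python sketch below implements these standard definitions; it assumes that outcomes are coded 0/1, that both classes are present, and that every fitted probability lies strictly between 0 and 1. The function names are chosen for this example.

```python
import math


def mcfadden_r2(y, p):
    """McFadden's R-squared: 1 - lnL(model) / lnL(intercept-only model)."""
    ll_model = sum(math.log(pi if yi == 1 else 1 - pi) for yi, pi in zip(y, p))
    p_null = sum(y) / len(y)  # the intercept-only model predicts the event rate
    ll_null = sum(math.log(p_null if yi == 1 else 1 - p_null) for yi in y)
    return 1 - ll_model / ll_null


def tjur_r2(y, p):
    """Tjur's R-squared: mean fitted probability for events minus non-events."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)
```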

Predicting loss given default (LGD) is playing an increasingly crucial role in quantitative credit risk modeling. In this paper, we propose applying mixed effects models, alongside other widely used LGD models, to predict corporate bond LGD. The empirical results show that mixed effects models are able to explain the unobservable heterogeneity and to make better predictions compared with linear regression and fractional response regression. All the statistical models are fitted in SAS/STAT^®, SAS^® 9.2, specifically using PROC REG and PROC NLMIXED, and the model evaluation metrics are calculated in PROC IML. This paper gives a detailed description of how to use PROC NLMIXED to build and estimate generalized linear models and mixed effects models.

While survey researchers make great attempts to standardize their questionnaires including the usage of ratings scales in order to collect unbiased data, respondents are still prone to introducing their own interpretation and bias to their responses. This bias can potentially affect the understanding of commonly investigated drivers of customer satisfaction and limit the quality of the recommendations made to management. One such problem is scale use heterogeneity, in which respondents do not employ a panoramic view of the entire scale range as provided, but instead focus on parts of the scale in giving their responses. Studies have found that bias arising from this phenomenon was especially prevalent in multinational research, e.g., respondents of some cultures being inclined to use only the neutral points of the scale. Moreover, personal variability in response tendencies further complicates the issue for researchers. This paper describes an implementation that uses a Bayesian hierarchical model to capture the distribution of heterogeneity while incorporating the information present in the data. More specifically, SAS^® PROC MCMC is used to carry out a comprehensive modeling strategy of ratings data that account for individual level scale usage. Key takeaways include an assessment of differences between key driver analyses that ignore this phenomenon versus the one that results from our implementation. Managerial implications are also emphasized in light of the prevalent use of more simplistic approaches.

Over the past decade, sports analytics has seen an explosion in research and model development to calculate wins, reaching cult popularity with the release of the film 'Moneyball.' The purpose of this paper is to explore the methodology of solving a real-life Moneyball problem in basketball. An optimal basketball lineup will be selected in an attempt to maximize the total points per game while maximizing court coverage. We will briefly review some of the literature that has explored this type of problem, traditionally called the maximum coverage problem (MCP) in operations research. An exploratory data analysis will be performed, including visualizations and clustering in order to prep the modeling dataset for optimization. Finally, SAS^® will be used to formulate an MCP problem, and additional constraints will be added to run different business scenarios.

Two examples of Vector Autoregressive Moving Average modeling with exogenous variables are given in this presentation. Data is from the real world. One example is about a two-dimensional time series for wages and prices in Denmark that spans more than a hundred years. The other is about the market for agricultural products, especially eggs! These examples give a general overview of the many possibilities offered by PROC VARMAX, such as handling of seasonality, causality testing and Bayesian modeling, and so on.

This presentation features implementation leads from SAS^® Professional Services and Health Canada's Non-Insured Health Benefits (NIHB) program, on a joint implementation of SAS^® Fraud Framework for Health Care. The presentation walks through the fast-paced implementation of NIHB's Pharmacy Surveillance System, which guards Canadian taxpayers from undue costs and protects the safety of NIHB clients. This presentation is a blend of project management and technical material, and presents both the client (NIHB) and consultant (SAS) perspectives throughout the story. The presentation converges onto several core principles needed to successfully deliver analytical solutions.

SAS/OR^® software for operations research includes mathematical optimization, discrete-event simulation, and project and resource scheduling capabilities. This paper surveys a number of its new features that better equip you to address decision-making challenges such as planning, resource management, and asset allocation. Optimization performance improvements help you solve larger, more detailed problems more quickly. Improvements encompass linear, mixed integer linear, and nonlinear optimization, and include multithreading of the mixed integer linear solver and major improvements in the performance and functionality of the decomposition algorithm for linear and mixed integer linear optimization. The OPTMODEL procedure for optimization modeling adds direct access to the same set of efficient network optimization algorithms available via the OPTNET procedure in SAS/OR, enabling you to embed network optimization as a component of larger solution processes. Other new features enable you to execute multiple optimizations in parallel and use the FCMP procedure to define functions. The OPTLSO procedure for global and local search optimization adds the ability to work with multiple objective functions and produce a set of Pareto-optimal solutions. This approach enables you to manage the trade-offs that arise between competing objectives and adds to the range of optimization problems that you can solve using PROC OPTLSO. Another new feature is support for the READ_ARRAY function in PROC FCMP, with which you can much more easily input array-structured data to be used in function definitions. Finally, SAS^® Simulation Studio for discrete-event simulation enhances its graphical interface to better support customization and increase ease of use.

In business environments, a common obstacle to effective data-informed decision making occurs when key stakeholders are reluctant to embrace statistically derived predicted values or forecasts. If concerns regarding model inputs, underlying assumptions, and limitations are not addressed, decision makers might choose to trust their gut and reject the insight offered by a statistical model. This presentation explores methods for converting potential critics into partners by proactively involving them in the modeling process and by incorporating simple inputs derived from expert judgment, focus groups, market research, or other directional qualitative sources. Techniques include biasing historical data, what-if scenario testing, and Monte Carlo simulations.

It is often the case that parameters in a predictive model should be restricted to an interval that is either reasonable or necessary given the model's application. A simple and classic example of such a restriction is the regression model that requires all parameters to be positive. In the case of multiple least squares (MLS) regression, the resulting model is therefore strictly additive and, in certain applications, not only appropriate but also intuitive. This special case of an MLS model is commonly referred to as a nonnegative least squares regression. While Base SAS^® contains a multitude of ways to perform a multiple least squares regression (PROC REG and PROC GLM, to name two), there exists no native SAS^® procedure to conduct a nonnegative least squares regression. The author offers a concise way to conduct the nonnegative least squares analysis by using PROC NLIN (the nonlinear regression procedure). PROC NLIN allows the user to place restrictions on parameter estimates. By fashioning a linear model in the framework of a nonlinear procedure, the end result can be achieved. As an additional corollary, the author will show how to calculate the _RSQUARE_ statistic for the resulting model, which has been left out of the PROC NLIN output for the reason that it is invalid in most cases (though not ours).
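
To make the constrained problem concrete, here is a minimal sketch in Python rather than SAS. It does not reproduce the PROC NLIN approach described above; it swaps in cyclic coordinate descent, a simple alternative for minimizing the residual sum of squares subject to nonnegative coefficients. The function name and the toy data are illustrative assumptions.

```python
def nnls_coordinate_descent(X, y, n_iter=500):
    """Minimize ||y - X b||^2 subject to b >= 0 by cyclic coordinate descent.

    X is a list of rows (lists of floats); y is a list of floats.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    residual = list(y)  # equals y - X b, and b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot change the fit
            grad = sum(X[i][j] * residual[i] for i in range(n))
            # Unconstrained one-dimensional optimum, clipped at zero.
            new_bj = max(0.0, b[j] + grad / col_sq[j])
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                b[j] = new_bj
    return b


# Toy data (hypothetical): ordinary least squares would give b = [1, -1].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.0]
coefficients = nnls_coordinate_descent(X, y)
```

On this toy data the unconstrained fit has a negative second coefficient, so the constrained fit pins it at zero and re-estimates the first.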

When viewing and working with SAS^® data sets, especially wide ones, it's often instinctive to rearrange the variables (columns) into some intuitive order. The RETAIN statement is one of the most commonly cited methods used for ordering variables. Though RETAIN can perform this task, its use as an ordering clause can cause a host of easily missed problems due to its intended function of retaining values across DATA step iterations. This risk is especially great for the more novice SAS programmer. Instead, two equally effective and less risky ways to order data set variables are recommended, namely, the FORMAT and SQL SELECT statements.

Part of being a good analyst and statistician is being able to understand the output of a statistical test in SAS^®. P-values are ubiquitous in statistical output as well as medical literature and can be the deciding factor in whether a paper gets published. This shows a somewhat dictatorial side of them. But do we really know what they mean? In a democratic process, people vote for another person to represent them, their values, and their opinions. In this sense, the sample of research subjects, their characteristics, and their experience are combined and represented to a certain degree by the p-value. This paper discusses misconceptions about and misinterpretations of the p-value, as well as how things can go awry in calculating a p-value. Alternatives to p-values are described, with advantages and disadvantages of each. Finally, some thoughts about p-value interpretation are given. To disarm the dictator, we need to understand what the democratic p-value can tell us about what it represents... and what it doesn't. This presentation is aimed at beginning to intermediate SAS statisticians and analysts working with SAS/STAT^®.

Graphs in oncology studies are essential for getting more insight about the clinical data. This presentation demonstrates how ODS Graphics can be effectively and easily used to create graphs used in oncology studies. We discuss some examples and illustrate how to create plots like drug concentration versus time plots, waterfall charts, comparative survival plots, and other graphs using Graph Template Language and ODS Graphics procedures. These can be easily incorporated into a clinical report.

The SQL procedure contains many powerful and elegant language features for intermediate and advanced SQL users. This presentation discusses topics that will help SAS^® users unlock the many powerful features, options, and other gems found in the SQL universe. Topics include CASE logic; a sampling of summary (statistical) functions; dictionary tables; PROC SQL and the SAS macro language interface; joins and join algorithms; PROC SQL statement options _METHOD, MAGIC=101, MAGIC=102, and MAGIC=103; and key performance (optimization) issues.

The use of predictive models in healthcare has steadily increased over the decades. Statistical models now are assumed to be a necessary component in population health management. This session will review practical considerations in the choice of models to develop, criteria for assessing the utility of the models for production, and challenges with incorporating the models into business process flows. Specific examples of models will be provided based upon work by the Health Economics team at Blue Cross Blue Shield of North Carolina.

Over the years, there has been a growing concern about consumption of tobacco among youth. But no concrete studies have been done to find what exactly leads children to start consuming tobacco. This study is an attempt to identify the potential reasons. Through our analysis, we have also tried to build a model to predict whether a child would smoke next year or not. This study is based on the 2011 National Youth Tobacco Survey data of 18,867 observations. In order to prepare data for insightful analysis, imputation operations were performed on the data using tree-based imputation methods. From a pool of 197 variables, 48 key variables were selected using variable selection methods, partial least squares, and decision tree models. Logistic Regression and Decision Tree models were built to predict whether a child would smoke in the next year or not. Comparing the models using Misclassification rate as the selection criterion, we found that the Stepwise Logistic Regression Model outperformed other models with a Validation Misclassification of 0.028497, 47.19% Sensitivity and 95.80% Specificity. Factors such as company of friends, cigarette brand ads, accessibility to tobacco products, and passive smoking turned out to be the most important predictors in determining a child smoker. After this study, we could outline some important findings, such as that the odds of a child taking up smoking are 2.17 times higher when his close friends are also smoking.
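
The three model-comparison measures named above all come from the same confusion matrix. The following Python sketch shows one way to compute them; the function name and the five toy observations are hypothetical and unrelated to the survey data.

```python
def classification_metrics(actual, predicted):
    """Return misclassification rate, sensitivity, and specificity.

    actual and predicted are equal-length sequences of 0/1 labels, where
    1 marks the event of interest. Assumes both classes occur in actual.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "misclassification": (fp + fn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true events detected
        "specificity": tn / (tn + fp),  # share of non-events detected
    }


# Toy labels (hypothetical), not taken from the study.
metrics = classification_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

When events are rare, a model can post a low misclassification rate while still missing a large share of the true events, which is why sensitivity and specificity are reported alongside it.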

The steady expansion of electronic health records (EHR) over the past decade has increased the use of observational healthcare data for analysis. One of the challenges with EHR data is to combine information from different domains (diagnosis, procedures, drugs, adverse events, labs, quality of life scores, and so on) onto a single timeline to get a longitudinal view of the patient. This enables the physician or researcher to visualize a patient's health profile, thereby revealing anomalies, trends, and responses graphically, thus empowering them to treat more effectively. This paper attempts to provide a composite view of a patient by using SAS^® Graph Template Language to create a profile graph using the following data elements: key event dates, drugs, adverse events, and Quality of Life (QoL) scores. For visualization, the GTL graph uses X and X2 axes for dates, vertical reference lines to represent key dates (for example, when the disease is first diagnosed), a horizontal bar plot for the duration of drugs taken and adverse events reported, and a series plot at the bottom to show the QoL score.

Sparse data sets are common in applications of text and data mining, social network analysis, and recommendation systems. In SAS^® software, sparse data sets are usually stored in the coordinate list (COO) transactional format. Two major drawbacks are associated with this sparse data representation: First, most SAS procedures are designed to handle dense data and cannot consume data that are stored transactionally. In that case, the options for analysis are significantly limited. Second, a sparse data set in transactional format is hard to store and process in distributed systems. Most techniques require that all transactions for a particular object be kept together; this assumption is violated when the transactions of that object are distributed to different nodes of the grid. This paper presents some different ideas about how to package all transactions of an object into a single row. Approaches include storing the sparse matrix densely, doing variable selection, doing variable extraction, and compressing the transactions into a few text variables by using Base64 encoding. These simple but effective techniques enable you to store and process your sparse data in better ways. This paper demonstrates how to use SAS^® Text Miner procedures to process sparse data sets and generate output data sets that are easy to store and can be readily processed by traditional SAS modeling procedures. The output of the system can be safely stored and distributed in any grid environment.
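
The idea of packaging all transactions of one object into a single row can be sketched outside SAS. The Python example below groups coordinate-list triples by object and encodes each object's (column, value) pairs as one Base64 text value. The binary layout (a 4-byte integer followed by an 8-byte float), the function names, and the sample triples are assumptions made for illustration, not the encoding used in the paper.

```python
import base64
import struct
from collections import defaultdict

PAIR_FORMAT = "<id"  # little-endian: 4-byte int column id, 8-byte float value
PAIR_SIZE = struct.calcsize(PAIR_FORMAT)


def pack_rows(coo):
    """Collapse (row_id, col_id, value) triples into one Base64 string per row."""
    by_row = defaultdict(list)
    for row_id, col_id, value in coo:
        by_row[row_id].append((col_id, value))
    packed = {}
    for row_id, pairs in by_row.items():
        raw = b"".join(struct.pack(PAIR_FORMAT, c, v) for c, v in sorted(pairs))
        packed[row_id] = base64.b64encode(raw).decode("ascii")
    return packed


def unpack_row(text):
    """Recover the sorted (col_id, value) pairs from one packed row."""
    raw = base64.b64decode(text)
    return [struct.unpack_from(PAIR_FORMAT, raw, offset)
            for offset in range(0, len(raw), PAIR_SIZE)]


# Hypothetical transactions: (document id, term id, weight).
coo = [("doc1", 3, 2.0), ("doc2", 1, 1.0), ("doc1", 7, 0.5)]
packed = pack_rows(coo)
```

Because each object ends up as a single self-contained text value, the rows can be distributed across nodes without separating an object's transactions.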

Many SAS^® procedures use classification variables when they are processing the data. These variables control how the procedure forms groupings, summarizations, and analysis elements. For statistics procedures, they are often used in the formation of the statistical model that is being analyzed. Classification variables can be explicitly specified with a CLASS statement, or they can be specified implicitly from their usage in the procedure. Because classification variables have such a heavy influence on the outcome of so many procedures, it is essential that the analyst have a good understanding of how classification variables are applied. Certainly there are a number of options (system and procedural) that affect how classification variables behave. While you may be aware of some of these options, a great many are new, and some of these new options and techniques are especially powerful. You really need to be open to learning how to program with CLASS.

Healthcare expenditure growth continues to be a prominent healthcare policy issue, and the uncertain impact of the Affordable Care Act (ACA) has put increased pressure on payers to find ways to exercise control over costs. Fueled by provider performance analytics, BCBSNC has developed innovative strategies that recognize, reward, and assist providers delivering high-quality and efficient care. A leading strategy has been the introduction of a new tiered network product called Blue Select, which was launched in 2013 and will be featured in the State Health Exchange. Blue Select is a PPO with differential member cost-sharing for tiered providers. Tier status of providers is determined by comparing providers to their peers on the efficiency and quality of the care they deliver. Providers who meet or exceed the standard for quality, measured using Healthcare Effectiveness Data and Information Set (HEDIS) adherence rates and potentially avoidable complication rates for certain procedures, are then evaluated on their case-mix adjusted costs for total episodic costs. Each practice's performance is compared, through indirect standardization, to expected performance, given the patients and conditions treated within a practice. A ratio of observed to expected performance is calculated for both cost and quality to use in determining the tier status of providers for Blue Select. While the primary goal of provider tiering is cost containment through member steerage, the initiative has also resulted in new and strengthened collaborative relationships between BCBSNC and providers. The strategy offers the opportunity to bend the cost curve and provide meaningful change in the quality of healthcare delivery.

In healthcare, we often express our analytics results as being adjusted. For example, you might have read a study in which the authors reported the data as age-adjusted or risk-adjusted. The concept of adjustment is widely used in program evaluation, comparing quality indicators across providers and systems, forecasting incidence rates, and in cost-effectiveness research. In order to make reasonable comparisons across time, place, or population, we need to account for small sample sizes and case-mix variation; in other words, we need to level the playing field and account for differences in health status and for uniqueness in a given population. If you are new to healthcare, what it really means to adjust the data in order to make comparisons might not be obvious. In this paper, we explore the methods by which we control for potentially confounding variables in our data. We do so through a series of examples from the healthcare literature in both primary care and health insurance. In this survey of methods, we discuss the concepts of rates and how they can be adjusted for demographic strata (such as age, gender, and race), as well as health risk factors such as case mix.
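
One common form of adjustment is direct standardization, in which stratum-specific rates are reweighted to a shared standard population. The Python sketch below illustrates the arithmetic; the function names, strata, and counts are hypothetical, and the paper may cover other methods as well.

```python
def crude_rate(events, population):
    """Total events divided by total population, ignoring strata."""
    return sum(events.values()) / sum(population.values())


def directly_standardized_rate(events, population, standard_population):
    """Weight each stratum-specific rate by the standard population's size."""
    total_standard = sum(standard_population.values())
    weighted = sum(
        (events[stratum] / population[stratum]) * standard_population[stratum]
        for stratum in standard_population
    )
    return weighted / total_standard


# Hypothetical counts for one provider and a younger standard population.
events = {"young": 10, "old": 50}
population = {"young": 1000, "old": 1000}
standard = {"young": 3000, "old": 1000}

crude = crude_rate(events, population)
adjusted = directly_standardized_rate(events, population, standard)
```

Here the provider's population is older than the standard, so the adjusted rate falls below the crude rate once each age stratum is given the standard's weight.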

This paper reveals the human mobility behavior in the metropolitan area of Rio de Janeiro, Brazil. The base for this study is the mobile phone data provided by one of the largest mobile carriers in Brazil. Mobile phone data comprises a reasonable variety of information, including data about time and location for call activity throughout urban areas. This information might be used to build users' trajectories over time, describing the major characteristics of the urban mobility within the city. A variety of distribution analyses is presented in this paper, aiming to clearly describe the most relevant characteristics of the overall mobility in the metropolitan area of Rio de Janeiro. In addition, methods from physics that describe trends in trips, such as gravity and radiation models, were computed and compared in terms of the granularity of the geographic scales and also in relation to traditional data mining approaches such as linear regression. A brief comparison in terms of performance in predicting the number of trips between pairs of locations is presented at the end.
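
For readers unfamiliar with the two physics-inspired models, the Python sketch below gives their textbook forms. The gravity model scales trips with the product of the two locations' populations and penalizes distance; the radiation model uses the population lying between origin and destination instead of distance. The parameter values, function names, and inputs are illustrative assumptions, not the calibrated models from the paper.

```python
def gravity_trips(mass_origin, mass_dest, distance, k=1.0, beta=2.0):
    """Gravity model: trips ~ k * m_i * m_j / d_ij ** beta.

    k and beta are free parameters that would normally be fitted to data.
    """
    return k * mass_origin * mass_dest / distance ** beta


def radiation_trips(trips_from_origin, mass_origin, mass_dest, mass_between):
    """Radiation model: T_i * m * n / ((m + s) * (m + n + s)).

    mass_between (s) is the population within the origin-destination radius,
    excluding the origin and destination themselves.
    """
    m, n, s = mass_origin, mass_dest, mass_between
    return trips_from_origin * (m * n) / ((m + s) * (m + n + s))


# Hypothetical inputs.
gravity_estimate = gravity_trips(100, 200, 10)
radiation_estimate = radiation_trips(100, 10, 10, 0)
```

Unlike the gravity model, the radiation form has no tunable exponent, which is one reason the two are often compared at different geographic scales.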

The project focuses on using analytics to reveal unwarranted use of access to medical records, that is, employees in health organizations who access information about neighbours, friends, celebrities, etc., without a sound reason to do so. The method is based on the natural assumption that the vast majority of lookups are legitimate; lookups that differ from a statistically defined normal behavior will be subject to manual investigation. The work was carried out in collaboration between SAS Institute Norway and the largest Norwegian hospital, Oslo University Hospital (OUS), and was aimed at establishing whether the method is suitable for unveiling unwarranted lookups in medical records. A number of so-called scenarios are used to indicate adverse behaviour, each responsible for looking at one particular aspect of journal access data. For instance, one scenario determines the timeliness of a lookup relative to the patient's admission history; another judges whether the medical competency of the employee is relevant to the situation of the patient at the time of the lookup. We have so far designed and developed a library of around 20 scenarios that together are used in weighted combination to render a final judgment of the appropriateness of the lookup. The approach has proven highly successful, and these ideas are currently being developed further, with the aim of establishing a joint Norwegian solution to the problem of unwarranted access. Furthermore, we believe that the approach and the framework may be utilised in many other industries where sensitive data is being processed, such as financial, police, tax and social services. In this paper, the method is outlined, as well as results of its application on data from OUS.

The high school dropout problem has been called a national crisis (Heppen & Therriault, 2008). Almost one-third of all high school students leave the public school system before graduating (Swanson, 2004), and the problem is particularly severe among minority students (Greene & Winters, 2005; U.S. Department of Education, 2006). Educators, researchers, and policymakers continue to work to identify effective dropout prevention strategies. One effective approach is to identify high-risk students at an early stage, and then provide corresponding interventions to keep them in school. One of the strengths of Educational Data Mining is to reveal hidden patterns and predict future performance by analyzing accessible student data. These predictive algorithms generated by predictive modeling can serve as an early warning system. However, because individual schools and districts have various combinations of race, gender, and socioeconomic status, we cannot use a set of standardized predictors and obtain satisfactory predictive results. Analyzing a limited number of variables and limited historical data does not generate accurate models. Additionally, the predictive model might not consider interactions among predictors. The strength of data mining is the capability to analyze a large amount of data and variables. Multiple analytic strategies (including model comparisons) can be applied to maximize model performance. For future goals, we propose a debuted data mining framework to construct an early warning and trend analysis system with components of data warehousing, data mining, and reporting at the levels of individual students, schools, school districts, and the entire state.

Hospital readmission rates have become a key indicator for measuring the quality of health care. Currently, use of these rates has been adopted by major healthcare stakeholders, including the Centers for Medicare & Medicaid Services (CMS), the Agency for Healthcare Research and Quality (AHRQ), and the National Committee for Quality Assurance (NCQA). In the calculation of the readmission rate, it is often a challenging task to identify eligible hospital readmissions from the convoluted administrative claims data. By taking advantage of the flexibility and power of SAS^® programming tools, this paper proposes three different solutions using both DATA step and PROC SQL to help identify 30-day hospital readmissions more efficiently and accurately. Solution 1 (DATA STEP vertically) employs the LAG function to calculate the gap between the current admission date and the immediately preceding discharge date. This vertical thinking process is straightforward and does not require additional data management. Solution 2 (DATA STEP horizontally) uses PROC TRANSPOSE, ARRAYs, and DO loops to transform claims data from long to wide, and examines each patient's hospitalization experiences in just one line. A similar horizontal thinking process has been discussed in previous SAS papers for calculating medication utilization. Solution 3 (PROC SQL) takes advantage of a special table join (self-join) by creating a Cartesian product further subsetted by a joining condition and WHERE statements. All three solutions achieve the same results by correctly identifying 30-day hospital readmissions, and they can be handily applied to tackle similar programming challenges in research projects.
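
The vertical logic of Solution 1 can be sketched in Python as follows: sort the stays by patient and admission date, carry forward the previous discharge date for the same patient, and flag the stay when the gap is within the window. This is an illustration of the idea, not the paper's SAS code; the function name, the sample stays, and the rule that a gap of 0 to 30 days counts as a readmission are assumptions.

```python
from datetime import date


def flag_readmissions(stays, window_days=30):
    """Flag stays that begin within window_days of the prior discharge.

    stays is a list of (patient_id, admit_date, discharge_date) tuples.
    Returns (patient_id, admit_date, gap_in_days, is_readmission) tuples
    in patient and admission-date order.
    """
    results = []
    last_discharge = {}  # patient_id -> discharge date of the previous stay
    for patient_id, admit, discharge in sorted(stays):
        previous = last_discharge.get(patient_id)
        gap = (admit - previous).days if previous is not None else None
        is_readmission = gap is not None and 0 <= gap <= window_days
        results.append((patient_id, admit, gap, is_readmission))
        last_discharge[patient_id] = discharge
    return results


# Hypothetical claims, deliberately out of order.
stays = [
    ("A", date(2014, 1, 1), date(2014, 1, 5)),
    ("A", date(2014, 4, 1), date(2014, 4, 3)),
    ("A", date(2014, 1, 20), date(2014, 1, 25)),
    ("B", date(2014, 2, 1), date(2014, 2, 2)),
]
flags = flag_readmissions(stays)
```

Keeping the previous discharge date per patient plays the role of resetting the lagged value at each new patient, a step that is easy to overlook when lagging across group boundaries.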

This workshop provides hands-on experience using SAS^® Text Miner. Workshop participants will do the following: (1) read a collection of text documents and convert them for use by SAS Text Miner using the Text Import node; (2) use the simple query language supported by the Text Filter node to extract information from a collection of documents; (3) use the Text Topic node to identify the dominant themes and concepts in a collection of documents; and (4) use the Text Rule Builder node to classify documents that have pre-assigned categories.

Statistical mediation analysis is common in business, social sciences, epidemiology, and related fields because it explains how and why two variables are related. For example, mediation analysis is used to investigate how product presentation affects liking the product, which then affects the purchase of the product. Mediation analysis evaluates the mechanism by which a health intervention changes norms that then change health behavior. Mediation analysis methods are an active area of research. Some recent research in statistical mediation analysis focuses on extracting accurate information from small samples by using Bayesian methods. The Bayesian framework offers an intuitive solution to mediation analysis with small samples; namely, incorporating prior information into the analysis when there is existing knowledge about the expected magnitude of mediation effects. Using diffuse prior distributions with no prior knowledge allows researchers to reason in terms of probability rather than in terms of (or in addition to) statistical power. Using SAS^® PROC MCMC, researchers can choose one of two simple and effective methods to incorporate their prior knowledge into the statistical analysis, and can obtain the posterior probabilities for quantities of interest such as the mediated effect. This project presents four examples of using PROC MCMC to analyze a single mediator model with real data using: (1) diffuse prior information for each regression coefficient in the model, (2) informative prior distributions for each regression coefficient, (3) diffuse prior distribution for the covariance matrix of variables in the model, and (4) informative prior distribution for the covariance matrix.

SAS^® has a number of procedures for smoothing scatter plots. In this tutorial, we review the nonparametric technique called LOESS, which estimates local regression surfaces. We review the LOESS procedure and then compare it to a parametric regression methodology that employs restricted cubic splines to fit nonlinear patterns in the data. Not only do these two methods fit scatterplot data, but they can also be used to fit multivariate relationships.

Linear regression has been a widely used approach in social and medical sciences to model the association between a continuous outcome and the explanatory variables. Assessing the model assumptions, such as linearity, normality, and equal variance, is a critical step for choosing the best regression model. If any of the assumptions are violated, one can apply different strategies to improve the regression model, such as performing transformation of the variables or using a spline model. SAS^® has been commonly used to assess and validate the postulated model and SAS^® 9.3 provides many new features that increase the efficiency and flexibility in developing and analyzing the regression model, such as ODS Statistical Graphics. This paper aims to demonstrate necessary steps to find the best linear regression model in SAS 9.3 in different scenarios where variable transformation and the implementation of a spline model are both applicable. A simulated data set is used to demonstrate the model developing steps. Moreover, the critical parameters to consider when evaluating the model performance are also discussed to achieve accuracy and efficiency.

The manager of the Purchasing Department is considering contracting with your team for a new SAS^® Enterprise BI application. He's already met with SAS^® and seen the sales pitch, and he is very interested. But the manager is a tightwad and not sure about spending the money. Also, he wants his team to be the primary developers for this new application. Before investing his money on training, programming, and support, he would like a proof-of-concept. This paper will walk you through the seven steps to create a SAS Enterprise BI POC project: (1) Develop a kick-off meeting including a full demo of the SAS Enterprise BI tools. (2) Set up your UNIX file systems and security. (3) Set up your SAS metadata ACTs, users, groups, folders, and libraries. (4) Make sure the necessary SAS client tools are installed on the developers' machines. (5) Hold a SAS Enterprise BI workshop to introduce them to the basics, including SAS^® Enterprise Guide^®, SAS^® Stored Processes, SAS^® Information Maps, SAS^® Web Report Studio, SAS^® Information Delivery Portal, and SAS^® Add-In for Microsoft Office, along with supporting documentation. (6) Work with them to develop a simple project, one that highlights the benefits of SAS Enterprise BI and shows several methods for achieving the desired results. (7) Last but not least, follow up! Remember, your goal is not to launch a full-blown application. Instead, we'll strive toward helping them see the potential in your organization for applying this methodology.

Big data is all the rage these days, with the proliferation of data-accumulating electronic gadgets and instrumentation. At the heart of big data analytics is the MapReduce programming model. As a framework for distributed computing, MapReduce uses a divide-and-conquer approach to allow large-scale parallel processing of massive data. As the name suggests, the model consists of a Map function, which first splits data into key-value pairs, and a Reduce function, which then carries out the final processing of the mapper outputs. It is not hard to see how these functions can be simulated with the SAS^® hash objects technique, and in reality, implemented in the new SAS^® DS2 language. This paper demonstrates how hash object programming can handle data in a MapReduce fashion and shows some potential applications in physics, chemistry, biology, and finance.
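
The map-then-reduce pattern built on an in-memory keyed table can be sketched in a few lines of Python, with a dictionary standing in for the SAS hash object. This is an analogy for the programming model, not the paper's DS2 or hash object code; the function names and the two sample lines are assumptions.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def reduce_phase(pairs, reducer):
    """Group values by key in an in-memory table, then reduce each group."""
    groups = defaultdict(list)  # plays the role of the hash object
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


# Hypothetical input lines.
lines = ["to be or not to be", "to see or not to see"]
counts = reduce_phase(map_phase(lines, word_mapper), sum)
```

Swapping in a different mapper and reducer (for example, emitting a particle type with its energy and reducing with `max`) reuses the same two phases unchanged.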

For electricity retail and distribution companies, the introduction of smart-meter technologies has been a key investment, reducing the significant costs associated with meter reading. Electricity companies continue to look for other ways to generate a dividend from them. This presentation looks at selected practical applications of smart-meter data: forecasting using smart-meter data as inputs, customer segmentation, and revenue protection. This presentation aims to show some techniques that can be used to effectively manage and analyze the large amounts of data generated by these devices in order to generate business value.

Businesses today are inundated with unstructured data: not just social media but books, blogs, articles, journals, manuscripts, and even detailed legal documents. Manually managing unstructured data can be time consuming and frustrating, and might not yield accurate results. Having an analyst read documents often introduces bias because analysts have their own experiences, and those experiences help shape how the text is interpreted. The fact that people become fatigued can also impact the way that the text is interpreted. Is the analyst as motivated at the end of the day as they are at the beginning? Data science involves using data management, analytical, and visualization strategies to uncover the story that the data is trying to tell in a more automated fashion. This is important with structured data but becomes even more vital with unstructured data. Introducing automated processes for managing unstructured data can significantly increase the value and meaning gleaned from the data. This paper outlines the data science processes necessary to ingest, transform, analyze, and visualize three Star Wars movie scripts: A New Hope, The Empire Strikes Back, and Return of the Jedi. It focuses on the need to create structure from unstructured data using SAS^® Data Management, SAS^® Text Miner, and SAS^® Content Categorization. The results are featured using SAS^® Visual Analytics.

Systematic reviews have become increasingly important in healthcare, particularly when there is a need to compare new treatment options and to justify clinical effectiveness versus cost. This paper describes a method in SAS/STAT^® 9.2 for computing weighted averages and weighted standard deviations of clinical variables across treatment options while correctly using these summary measures to make accurate statistical inference. The analyses of data from systematic reviews typically involve computations of weighted averages and comparisons across treatment groups. However, the application of the TTEST procedure does not currently take into account weighted standard deviations when computing p-values. The use of a default non-weighted standard deviation can lead to incorrect statistical inference. This paper introduces a method for computing correct p-values using weighted averages and weighted standard deviations. Given a data set containing variables for three treatment options, we want to make pairwise comparisons of three independent treatments. This is done by creating two temporary data sets using PROC MEANS, which yields the weighted means and weighted standard deviations. We then perform a t-test on each temporary data set. The resultant data sets containing all comparisons of each treatment option are merged and then transposed to obtain the necessary statistics. The resulting output provides pairwise comparisons of each treatment option and uses the weighted standard deviations to yield the correct p-values in a desired format. This method allows the use of correct weighted standard deviations using PROC MEANS and PROC TTEST in summarizing data from a systematic review while providing correct p-values.
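
The two summary measures at the center of this method can be written out in a few lines of Python. The weighted mean is unambiguous; for the weighted standard deviation, this sketch assumes one common convention (weighted sum of squared deviations divided by the number of observations minus one). Other conventions exist, such as dividing by the sum of the weights, and the abstract does not state which one the paper uses. The function names and sample values are hypothetical.

```python
import math


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights):
    """Weighted standard deviation with an n - 1 divisor (an assumed convention)."""
    mean = weighted_mean(values, weights)
    weighted_ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return math.sqrt(weighted_ss / (len(values) - 1))


# Hypothetical study-level means, weighted by hypothetical sample sizes.
values = [1.0, 2.0, 3.0]
weights = [1.0, 1.0, 2.0]
mean = weighted_mean(values, weights)
sd = weighted_sd(values, weights)
```

With equal weights of 1, both functions reduce to the ordinary sample mean and sample standard deviation, which is a quick sanity check on any implementation.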

Contrasting two sets of textual data points out important differences. For example, consider social media data that have been collected on the race between incumbent Kay Hagan and challenger Thom Tillis in the 2014 election for the seat of US Senator from North Carolina. People talk about the candidates in different terms for different topics, and you can extract the words and phrases that are used more in messages about one candidate than about the other. By using SAS^® Sentiment Analysis on the extracted information, you can discern not only the most important topics and sentiments for each candidate, but also the most prominent and distinguishing terms that are used in the discussion. Find out if Republicans and Democrats speak different languages!

With the smartphone and mobile apps market developing so rapidly, expectations about the effectiveness of mobile applications are high. Marketers and app developers need to analyze the large amount of data available well before the app release, not only to better market the app, but also to avoid costly mistakes. The purpose of this poster is to build models to predict the success rate of an app to be released in a particular category. Data has been collected for 540 Android apps under the Top free newly released apps category from https://play.google.com/store. The SAS^® Enterprise Miner^™ Text Mining node and SAS^® Sentiment Analysis Studio are used to parse and tokenize the collected customer reviews and also to calculate the average customer sentiment score for each app. Linear regression, neural, and auto-neural network models have been built to predict the rank of an app by considering average rating, number of installations, total number of reviews, number of 1-5 star ratings, app size, category, content rating, and average customer sentiment score as independent variables. A linear regression model with the least Average Squared Error is selected as the best model, and number of installations and app maturity content are considered significant model variables. App category, user reviews, and average customer sentiment score are also considered important variables in deciding the success of an app. The poster summarizes the app success trends across various factors and also introduces a new SAS^® macro, %getappdata, which we have developed for web crawling and text parsing.

Nowadays, in the Big Data era, Business Intelligence Departments collect, store, process, calculate, and monitor massive amounts of data. Nevertheless, sometimes hundreds of metrics built on the structured data are insufficient to explain why the offered deal sold better or worse than expected. The answer might be found in text data that every company owns and yet either is not aware of its possible usage or neglects its value. This project shows text mining methods, implemented in SAS^® Text Miner 12.1, that enable the determination of a deal's success or failure factors based on in-house or Internet-scattered customers' views and opinions. The study is conducted on data gathered from Groupon Sp. z o.o. (Polish business unit), an e-commerce company, as it is assumed that the market is by and large a customer-driven environment.

'Can I have that in Excel?' This is a request that makes many of us shudder. Now your boss has discovered Microsoft Excel pivot tables. Unfortunately, he has not discovered how to make them. So you get to extract the data, massage the data, put the data into Excel, and then spend hours rebuilding pivot tables every time the corporate data is refreshed. In this workshop, you learn to be the armchair quarterback and build pivot tables without leaving the comfort of your SAS^® environment. In this workshop, you learn the basics of Excel pivot tables and, through a series of exercises, you learn how to augment basic pivot tables first in Excel, and then using SAS. No prior knowledge of Excel pivot tables is required.

The FORMAT procedure in SAS^® is a very powerful and productive tool, yet many beginning programmers rarely make use of it. The FORMAT procedure provides a convenient way to do a table lookup in SAS. User-generated FORMATS can be used to assign descriptive labels to data values, create new variables, and find unexpected values. PROC FORMAT can also be used to generate data extracts and to merge data sets. This paper provides an introductory look at PROC FORMAT for the beginning user and provides sample code that illustrates the power of PROC FORMAT in a number of applications. Additional examples and applications of PROC FORMAT can be found in the SAS^® Press book titled 'The Power of PROC FORMAT.'

Raking (iterative proportional fitting) is a procedure that takes sampling weights from complex sample surveys and adjusts them so that they add to known control totals. This process reduces variance and adjusts for undercoverage. But raking in multiple dimensions can lead to extreme weights, which increase variance. Trimming is another sample weighting procedure that reduces extreme weights to cutoffs, thereby improving variance properties while potentially introducing bias. The RAKE-TRIM macro combines raking and trimming in an iterative algorithm to achieve these two goals simultaneously. The raking reduces the bias potential from trimming, and the trimming reduces the variance inflation from raking. When convergence occurs, the final weights aggregate to the control totals, as well as respect the trimming limits. SAS^® macros are well suited for this kind of envelope program: the larger macro consists of the integration of component macros that were developed for other applications. A parameter specification sheet enables users to provide all of the parameters needed to define the algorithm for their particular situation, and, if necessary, to alter the parameters to facilitate convergence. Diagnostics are included when convergence fails. Microsoft Excel tables are imported to provide the cell structure and are exported to provide statistics for the algorithm's results. This RAKE-TRIM macro was first developed in 2010 for the 2009 National Household Transportation Survey and has been used in other studies as well. The paper describes the algorithm and discusses our experiences with it.
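
The idea of alternating raking and trimming until both conditions hold can be sketched in Python as follows. This is a minimal two-dimensional illustration, not the RAKE-TRIM macro itself. The function name, the fixed lower and upper trimming limits, and the convergence rule are all assumptions made for the example.

```python
def rake_and_trim(weights, rows, cols, row_totals, col_totals,
                  lower, upper, max_iter=100, tol=1e-8):
    """Alternate raking to two sets of control totals with trimming.

    weights: starting weight for each sampled unit
    rows, cols: category label of each unit in the two raking dimensions
    row_totals, col_totals: dict mapping category -> control total
    lower, upper: trimming limits applied to every weight (assumed fixed)
    Returns (final_weights, converged).
    """
    dimensions = ((rows, row_totals), (cols, col_totals))

    def margin_sums(w, cats):
        sums = {}
        for wi, c in zip(w, cats):
            sums[c] = sums.get(c, 0.0) + wi
        return sums

    w = [float(wi) for wi in weights]
    for _ in range(max_iter):
        # Raking: scale weights so each margin matches its control total.
        # Every category is assumed to have a positive weight sum.
        for cats, totals in dimensions:
            sums = margin_sums(w, cats)
            w = [wi * totals[c] / sums[c] for wi, c in zip(w, cats)]
        # Trimming: pull extreme weights back to the limits.
        w = [min(max(wi, lower), upper) for wi in w]
        # Converged when the trimmed weights still reproduce all totals.
        worst = 0.0
        for cats, totals in dimensions:
            sums = margin_sums(w, cats)
            for c, target in totals.items():
                worst = max(worst, abs(sums.get(c, 0.0) - target))
        if worst < tol:
            return w, True
    return w, False
```

If the trimming limits are too tight for the control totals, the loop ends with `converged` set to `False`. That corresponds to the situation in which the abstract says parameters might need to be altered.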

Direct marketing is the practice of delivering promotional messages directly to potential customers on an individual basis rather than through a mass medium. In this project, we build a finely tuned response model that helps a financial services company to select high-quality receptive customers for its future campaigns and to identify the important factors that influence marketing to effectively manage its resources. This study was based on the customer solicitation center's marketing campaign data (45,211 observations and 18 variables) available on UC Irvine's web site with attributes of present and past campaign information (communication type, contact duration, previous campaign outcome, and so on) and the customer's personal and banking information. As part of data preparation, we performed mean imputation to handle missing values and categorical recoding for reducing levels of class variables. In this study, we built several predictive models using the SAS^® Enterprise Miner^™ models Decision Tree, Neural Network, Logistic Regression, and SVM to predict whether the customer responds to the loan offer by subscribing. The results showed that the Stepwise Logistic Regression model was the best when chosen based on the misclassification rate criterion. When the customers in the top 3 deciles were selected based on the best model, the cumulative response rate was 14.5% in contrast to the baseline response rate of 5%. Further analysis showed that the customers are more likely to subscribe to the loan offer if they have the following characteristics: never been contacted in the past, no default history, and provided cell phone as primary contact information.

Applying models to analyze sports data has always been done by teams across the globe. The film Moneyball has generated much hype about how a sports team can use data and statistics to build a winning team. The objective of this poster is to use the model comparison algorithm of SAS^® Enterprise Miner^™ to pick the best model that can predict the outcome of a soccer game. It is hence important to determine which factors influence the results of a game. The data set used contains input variables about a team's offensive and defensive abilities, and the outcome of a game is modeled as a target variable. Using SAS Enterprise Miner, multinomial regression, neural networks, decision trees, ensemble models, and gradient boosting models are built. Over 100 different versions of these models are run. The data contains statistics from the 2012-13 English Premier League season. The competition has 20 teams playing each other in a home and away format. The season has a total of 380 games; the first 283 games are used to predict the outcome of the last 97 games. The target variable is treated as both a nominal variable and an ordinal variable with 3 levels for home win, away win, and tie. The gradient boosting model is the winning model, which seems to predict games with 65% accuracy and identifies factors such as goals scored and ball possession as more important compared to fouls committed or red cards received.

Identifying claim fraud using predictive analytics represents a unique challenge. (1) Predictive analytics generally requires that you have a target variable that can be analyzed. Fraud is unique in this regard in that there is a lot of fraud that has occurred historically that has not been identified. Therefore, the definition of the target variable is difficult. (2) There is also a natural assumption that the past will bear some resemblance to the future. In the case of fraud, methods of defrauding insurance companies change quickly and can make the analysis of a historical database less valuable for identifying future fraud. (3) In an underlying database of claims that may have been determined to be fraudulent by an insurance company, there is often an inconsistency between different claim adjusters regarding which claims are referred for investigation. This inconsistency can lead to erroneous model results due to data that is not homogeneous. This paper will demonstrate how analytics can be used in several ways to help identify fraud: (1) more consistent referral of suspicious claims, (2) better identification of new types of suspicious claims, and (3) incorporating claim adjuster insight into the analytics results. As part of this paper, we will demonstrate the application of several approaches to fraud identification: (1) clustering, (2) association analysis, and (3) PRIDIT (Principal Component Analysis of RIDIT scores).

This new SAS^® tool is a two-dimensional color chart for visualizing changes in a population or in a system over time. Data for one point in time appear as a thin horizontal band of color. Bands for successive periods are stacked up to make a two-dimensional plot, with the vertical direction showing changes over time. As a system evolves over time, different kinds of events have different characteristic patterns. Creation of Time Contour plots is explained step-by-step. Examples are given in astrostatistics, biostatistics, econometrics, and demographics.

Changes in health insurance and other industries often have a spatial component. Maps can be used to convey this type of information to the user more quickly than tabular reports and other non-graphical formats. SAS^® provides programmers and analysts with the tools not only to create professional and colorful maps, but also to display spatial data on these maps in a meaningful manner that aids in the understanding of the changes that have transpired. This paper illustrates the creation of a number of different maps for displaying change over time with examples from the health insurance arena.

For a longtime Base SAS^® programmer, whether to use a different application for programming is a constant question when powerful applications such as SAS^® Enterprise Guide^® are available. This paper provides some important tips for a programmer, such as the best way to use the code window and how to take advantage of system-generated code in SAS Enterprise Guide 5.1. This paper also explains the differences between some of the functions and procedures in Base SAS and SAS Enterprise Guide. It highlights features in SAS Enterprise Guide such as process flow, data access management, and report automation, including formatting using XML tag sets.

One of the most striking features separating SAS^® from other statistical languages is that SAS has native SQL (Structured Query Language) capacity. In addition to the merging or the querying that a SAS user commonly applies in daily practice, SQL significantly enhances the power of SAS in descriptive statistics and data management. In this paper, we show reproducible examples to introduce 10 useful tips for the SQL procedure in the BASE module.

The independent means t-test is commonly used for testing the equality of two population means. However, this test is very sensitive to violations of the population normality and homogeneity of variance assumptions. In such situations, Yuen's (1974) trimmed t-test is recommended as a robust alternative. The purpose of this paper is to provide a SAS^® macro that allows easy computation of Yuen's symmetric trimmed t-test. The macro output includes a table with trimmed means for each of two groups, Winsorized variance estimates, degrees of freedom, and obtained value of t (with two-tailed p-value). In addition, the results of a simulation study are presented and provide empirical comparisons of the Type I error rates and statistical power of the independent samples t-test, Satterthwaite's approximate t-test, and the trimmed t-test when the assumptions of normality and homogeneity of variance are violated.

As a retailer, your bottom line is determined by supply and demand. Are you supplying what your customer is demanding? Or do they have to go look somewhere else? Accurate allocation and size optimization mean your customer will find what they want more often. And that means more sales, higher profits, and fewer losses for your organization. In this session, Linda Canada will share how DSW went from static allocation models without size capability to precision allocation using intelligent, dynamic models that incorporate item plans and size optimization.

Understanding previous research in key domain areas can help R&D organizations focus new research in non-duplicative areas and ensure that future endeavors do not repeat the mistakes of the past. However, manual analysis of previous research efforts can prove insufficient to meet these ends. This paper highlights how a combination of SAS^® Text Analytics and SAS^® Visual Analytics can deliver the capability to understand key topics and patterns in previous research and how it applies to a current research endeavor. We will explore these capabilities in two use cases. The first will be in uncovering trends in publicly visible government funded research (SBIR) and how these trends apply to future research in nanotechnology. The second will be visualizing past research trends in publicly available NASA publications, and how these might impact the development of next-generation spacecraft.

The DOW-loop is not official terminology that one can find in SAS^® documentation, but it has been well known and widely used among experienced SAS programmers. The DOW-loop was developed over a decade ago by a few SAS gurus, including Don Henderson, Paul Dorfman, and Ian Whitlock. A common construction of the DOW-loop consists of a DO-UNTIL loop with a SET and a BY statement within the loop. This construction isolates actions that are performed before and after the loop from the action within the loop, which results in eliminating the need for retaining or resetting the newly created variables to missing in the DATA step. In this talk, in addition to explaining the DOW-loop construction, we review how to apply the DOW-loop to various applications.

Epidemic modeling is an increasingly important tool in the study of infectious diseases. As technology advances and more and more parameters and data are incorporated into models, it is easy for programs to get bogged down and become unacceptably slow. The use of arrays for importing real data and collecting generated model results in SAS^® can help to streamline the process so results can be obtained and analyzed more efficiently. This paper describes a stochastic mathematical model for transmission of influenza among residents and healthcare workers in long-term care facilities (LTCFs) in New Mexico. The purpose of the model was to determine to what extent herd immunity among LTCF residents could be induced by varying the vaccine coverage among LTCF healthcare workers. Using arrays in SAS made it possible to efficiently incorporate real surveillance data into the model while also simplifying analyses of the results, which ultimately held important implications for LTCF policy and practice.

Regression is a helpful statistical tool for showing relationships between two or more variables. However, many users can find the barrage of numbers at best unhelpful, and at worst undecipherable. Using the shipments and inventories historical data from the U.S. Census Bureau's office of Manufacturers' Shipments, Inventories, and Orders (M3), we can create a graphical representation of two time series with PROC GPLOT and map out reported and expected results. By combining this output with results from PROC REG, we are able to highlight problem areas that might need a second look. The resulting graph shows which dates have abnormal relationships between our two variables and presents the data in an easy-to-use format that even users unfamiliar with SAS^® can interpret. This graph is ideal for analysts finding problematic areas such as outliers and trend-breakers or for managers to quickly discern complications and the effect they have on overall results.

Existing health literacy assessment tools developed for research purposes have constraints that limit their utility for clinical practice. The measurement of health literacy in clinical practice can be impractical due to the time requirements of existing assessment tools. Single Item Literacy Screener (SILS) items, which are self-administered brief screening questions, have been developed to address this constraint. We developed a model to predict limited health literacy that consists of two SILS items and demographic information (for example, age, race, and education status) using a sample of patients in a St. Louis emergency department. In this paper, we validate this prediction model in a separate sample of patients visiting a primary care clinic in St. Louis. Using the prediction model developed in the previous study, we use SAS/STAT^® software to validate this model based on three goodness of fit criteria: rescaled R-squared, AIC, and BIC. We compare models using two different measures of health literacy, Newest Vital Sign (NVS) and Rapid Estimate of Adult Literacy in Medicine Revised (REALM-R). We evaluate the prediction model by examining the concordance, area under the ROC curve, sensitivity, specificity, kappa, and gamma statistics. Preliminary results show 69% concordance when comparing the model results to the REALM-R and 66% concordance when comparing to the NVS. Our conclusion is that validating a prediction model for inadequate health literacy would provide a feasible way to assess health literacy in fast-paced clinical settings. This would allow us to reach patients with limited health literacy with educational interventions and better meet their information needs.

Each year, 2.35 million road accident cases are recorded in the U.S. Among them, 37,000 are considered fatal. Road crashes cost USD 230.6 billion per year, or an average of USD 820 per person. Our efforts are to identify the important factors that lead to vehicle collisions and to predict the injury risk involved in them. Data was collected from the National Automotive Sampling System (NASS), containing 20,247 cases with 19 variables. Input variables describe the factors involved in an accident, such as Height, Age, Weight, Gender, Vehicle model year, Speed limit, Energy absorption in Collision & Deformation location, etc. The target variable is nominal, showing levels of injury. Missing values were imputed using the mean for interval variables and the count method for class variables. Multivariate analysis suggests high correlation between tire footprint and wheelbase (Corr=0.97, P<0.0001) and between original weight of car and curb weight of car (Corr=0.79, P<0.0001). Variables having high kurtosis values were transformed using range standardization. Variables were ranked by variable importance from a decision tree analysis. Models such as multiple regression, polynomial regression, neural network, and decision tree were applied to the dataset to identify the factors that are most significant in predicting the injury risk. The multilayer perceptron neural network came out to be the best model to predict the injury risk index, with the least Average Squared Error (0.086) in the validation dataset.

This paper discusses the techniques I used at the Census Bureau to overcome the issue of dealing with large amounts of data while modernizing some of their public-facing web applications by using service oriented architecture (SOA) to deploy Flex web applications powered by SAS^®. The paper covers techniques that resulted in reducing 142,293 XML lines (3.6 MB) down to 15,813 XML lines (1.8 MB), a 50% size reduction on the server side (HTTP Response), and 196,167 observations down to 283 observations, a reduction of 99.8% in summarized data on the client side (XML Lookup file).

The Affordable Care Act that is being implemented now is expected to fundamentally reshape the health care industry. All current participants--providers, subscribers, and payers--will operate differently under a new set of key performance indicators (KPIs). This paper uses public data and SAS^® software to establish a baseline for the health care industry today so that structural changes can be measured in the future to establish the impact of the new laws.

Health plans use wide-ranging interventions based on criteria set by nationally recognized organizations (for example, NCQA and CMS) to change health-related behavior in large populations. Evaluation of these interventions has become more important with the increased need to report patient-centered quality of care outcomes. Findings from evaluations can detect successful intervention elements and identify at-risk patients for further targeted interventions. This paper describes how SAS^® was applied to evaluate the effectiveness of a patient-directed intervention designed to increase medication adherence and a health plan's CMS Part D Star Ratings. Topics covered include querying data warehouse tables, merging pharmacy and eligibility claims, manipulating data to create outcome variables, and running statistical tests to measure pre-post intervention differences.

Comprehensive cancer centers have been mandated to engage communities in their work; thus, measurement of community engagement is a priority area. Siteman Cancer Center's Program for the Elimination of Cancer Disparities (PECaD) projects seek to align with 11 Engagement Principles (EP) previously developed in the literature. Participants in a PECaD pilot project were administered a survey with questions on community engagement in order to evaluate how well the project aligns with the EPs. Internal consistency is examined using PROC CORR with the ALPHA option to calculate Cronbach's alpha for questions that relate to the same EP. This allows items that have a lack of internal consistency to be identified and to be edited or removed from the assessment. EP-specific scores are developed on quantity and quality scales. Lack of internal consistency was found for six of the 16 EP's examined items (alpha<.70). After editing the items, all EP question groups had strong internal consistency (alpha>.85). There was a significant positive correlation between quantity and quality scores (r=.918, P<.001). Average EP-specific scores ranged from 6.87 to 8.06; this suggests researchers adhered to the 11 EPs between sometime and most of the time on the quantity scale and between good and very good on the quality scale. Examining internal consistency is necessary to develop measures that accurately determine how well PECaD projects align with EPs. Using SAS^® to determine internal consistency is an integral step in the development of community engagement scores.
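
For readers unfamiliar with the statistic, the raw (unstandardized) form of Cronbach's alpha can be written as a short Python sketch. This illustrates the standard formula only, not the study's code or the full output of PROC CORR. The function name and the data layout (one list of respondent scores per item) are assumptions.

```python
from statistics import variance


def cronbach_alpha(items):
    """Raw Cronbach's alpha.

    items: list of k lists, one per survey item, each holding the scores
    of the same respondents in the same order.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent total
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))
```

Items that move together across respondents push alpha toward 1, and the abstract treats groups with alpha below .70 as lacking internal consistency.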

The Patient-Centered Outcomes Research Institute (PCORI) was created as part of the Affordable Care Act. PCORI is authorized by Congress to conduct research to provide information about the best available evidence to help patients and their health care providers make more informed decisions. Community Care Behavioral Health Organization in Pittsburgh, Pennsylvania was awarded a PCORI research grant to investigate health care system improvements for adults with serious mental illness. The grant, titled Optimizing Behavioral Health Homes by Focusing on Outcomes that Matter Most for Adults with Serious Mental Illness, began in January of 2013 and is ongoing. Information Technology staff at Community Care have leveraged SAS^® solutions in providing real-time data extraction and reports to support the development and implementation of this research project. SAS tools have been used to merge data from multiple platforms and database sources, including web data sources. SAS has also enabled the formatting and traffic lighting of multiple Microsoft Excel data sets and files, in addition to the creation of many operational reports and data files needed for study implementation, administration, and maintenance. The challenges faced and the SAS solutions employed are the subject of this paper.

When providing lengthy cost and utilization data to medical providers, it is ideal to sort the report by descending cost (or utilization) so that the important categories are at the top. This task can be easily solved using PROC SORT. However, when you need other variables (such as unit cost per procedure or national average) to follow the sort but not be sorted themselves, the solution is not as intuitive. This paper looks at several sorting algorithms to solve this problem. First, we look at the basic bubble sort (which is still effective for smaller data sets), which sets up arrays for each variable and then sorts on just one of them. Next, we discuss the quicksort algorithm, which is effective for large data sets, too. The results of the sorts provide sorted data that is easy to read and makes for effective analysis.
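
The first approach described above, a bubble sort that orders one array and carries the others along, can be sketched in Python as follows. This is an illustration of the idea rather than the paper's SAS array code. The function name is hypothetical, and the example assumes a descending sort on the key.

```python
def bubble_sort_parallel(key, *followers):
    """Sort `key` in descending order with a bubble sort, applying every
    swap to each follower list so that rows stay aligned.

    Returns new lists; the inputs are not modified.
    """
    key = list(key)
    followers = [list(f) for f in followers]
    n = len(key)
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if key[i] < key[i + 1]:
                key[i], key[i + 1] = key[i + 1], key[i]
                for f in followers:
                    f[i], f[i + 1] = f[i + 1], f[i]
                swapped = True
        if not swapped:  # already in order, stop early
            break
    return key, followers


# Example: total cost drives the order; unit cost follows its row.
cost, (unit_cost,) = bubble_sort_parallel([10, 50, 30], [1.0, 2.0, 3.0])
```

Only the key array is ever compared. The follower arrays are reordered by the same swaps, which is what keeps a value such as unit cost attached to its original row.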

Researchers often rely on self-report for survey based studies. The accuracy of this self-reported data is often unknown, particularly in a medical setting that serves an under-insured patient population with varying levels of health literacy. We recruited participants from the waiting room of a St. Louis primary care safety net clinic to participate in a survey investigating the relationship between health environments and health outcomes. The survey included questions regarding personal and family history of chronic disease (diabetes, heart disease, and cancer) as well as BMI and self-perceived weight. We subsequently accessed the participant's electronic medical record (EMR) and collected physician-reported data on the same variables. We calculated concordance rates between participant answers and information gathered from EMRs using McNemar's chi-squared test. Logistic regression was then performed to determine the demographic predictors of concordance. Three hundred thirty-two patients completed surveys as part of the pilot phase of the study; 64% female, 58% African American, 4% Hispanic, 15% with less than high school level education, 76% annual household income less than $20,000, and 29% uninsured. Preliminary findings suggest an 82-94% concordance rate between self-reported and medical record data across outcomes, with the exception of family history of cancer (75%) and heart disease (42%). Our conclusion is that determining the validity of the self-reported data in the pilot phase influences whether self-reported personal and family history of disease and BMI are appropriate for use in this patient population.

The world's first wind resource assessment buoy, residing in Lake Michigan, uses a pulsing laser wind sensor to accurately measure wind speed, direction, and turbulence offshore up to wind turbine hub-height and across the blade span every second. Understanding wind behavior would be tedious and fatiguing with such large data sets. However, SAS/GRAPH^® 9.4 helps the user grasp wind characteristics over time and at different altitudes by exploring the data visually. This paper covers graphical approaches to evaluate wind speed validity, seasonal wind speed variation, and storm systems to inform engineers on the candidacy of Lake Michigan offshore wind farms.

Volatility estimation plays an important role in the fields of statistics and finance. Many different techniques address the problem of estimating volatility of financial assets. Autoregressive conditional heteroscedasticity (ARCH) models and the related generalized ARCH (GARCH) models are popular models for volatility. This talk will introduce the need for volatility modeling as well as the framework of ARCH and GARCH models. The structure of ARCH and GARCH models will then be briefly discussed and compared to other volatility modeling techniques.

Missing observations caused by dropouts or skipped visits present a problem in studies of longitudinal data. When the analysis is restricted to complete cases and the missing data depend on previous responses, the generalized estimating equation (GEE) approach, which is commonly used when the population-average effect is of primary interest, can lead to biased parameter estimates. The new GEE procedure in SAS/STAT^® 13.2 implements a weighted GEE method, which provides consistent parameter estimates when the dropout mechanism is correctly specified. When none of the data are missing, the method is identical to the usual GEE approach, which is available in the GENMOD procedure. This paper reviews the concepts and statistical methods. Examples illustrate how you can apply the GEE procedure to incomplete longitudinal data.

Do you know everything you need to know about missing values? Do you know how to assign a missing value to multiple variables with one statement? Can you display missing values as something other than . or blank? How many types of missing numeric values are there? This paper reviews techniques for assigning, displaying, referencing, and summarizing missing values for numeric variables and character variables.

Over the last year, the SAS^® Enterprise Miner^™ development team has made numerous and wide-ranging enhancements and improvements. New utility nodes that save data, integrate better with open-source software, and register models make your routine tasks easier. The area of time series data mining has three new nodes. There are also new models for Bayesian network classifiers, generalized linear models (GLMs), support vector machines (SVMs), and more.

The DATA step allows one to read, write, and manipulate many types of data. As data evolves to a more free-form state, the ability of SAS^® to handle character data becomes increasingly important. This paper addresses character data from multiple vantage points. For example, what is the default length of a character string, and why does it appear to change under different circumstances? What type of formatting is available for character data? How can we examine and manipulate character data? The audience for this paper is beginner to intermediate, and the goal is to provide an introduction to the numerous character functions available in SAS, including the basic LENGTH and SUBSTR functions, plus many others.

SAS^® Visual Analytics is one of the newer SAS^® products with a lot of excitement surrounding it. But what is SAS Visual Analytics really? By examining the similarities, differences, and synergies between SAS Visual Analytics and other SAS offerings, we can more clearly understand this new product.

We receive a daily file with information about patients who use our drug. It's updated every day so that we have the most current information. Nearly every variable on a patient's record can be different from one day to the next. But what if you wanted to capture information that changed? For example, what if a patient switched doctors sometime along the way, and the original prescribing doctor is different than the patient's present doctor? With this type of daily file, that information is lost. To avoid losing these changes, you have to build a cumulative data set. I'll show you how to build it.
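
One way to picture a cumulative data set is a history that gains a new entry only when a tracked value changes. The Python sketch below illustrates that idea under simplifying assumptions. It is not the author's SAS approach. The function name, the single tracked field (doctor), and the dictionary layout are hypothetical.

```python
def update_cumulative(history, daily_records, as_of):
    """Fold one daily file into a cumulative history.

    history: dict mapping patient id -> list of (date, doctor) entries
    daily_records: dict mapping patient id -> doctor in today's file
    as_of: date label for today's file
    A new entry is appended only when the doctor differs from the most
    recent entry, so earlier values are never overwritten.
    """
    for patient_id, doctor in daily_records.items():
        entries = history.setdefault(patient_id, [])
        if not entries or entries[-1][1] != doctor:
            entries.append((as_of, doctor))
    return history


# Example: the patient switches doctors on the third day.
history = {}
update_cumulative(history, {"P1": "Dr. A"}, "2014-01-01")
update_cumulative(history, {"P1": "Dr. A"}, "2014-01-02")
update_cumulative(history, {"P1": "Dr. B"}, "2014-01-03")
```

After the three updates, the history for `P1` holds both the original prescribing doctor and the present doctor, each with the date it first appeared.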