All Papers A-Z

A
Paper 3282-2015:
A Case Study: Improve Classification of Rare Events with SAS® Enterprise Miner™
Imbalanced data are frequently seen in fraud detection, direct marketing, disease prediction, and many other areas. Rare events are sometimes of primary interest. Classifying them correctly is the challenge that many predictive modelers face today. In this paper, we use SAS® Enterprise Miner™ on a marketing data set to demonstrate and compare several approaches that are commonly used to handle imbalanced data problems in classification models. The approaches are based on cost-sensitive measures and sampling measures. A rather novel technique called SMOTE (Synthetic Minority Over-sampling TEchnique), which has achieved the best result in our comparison, will be discussed.
Read the paper (PDF). | Download the data file (ZIP).
Ruizhe Wang, GuideWell Connect
Novik Lee, GuideWell Connect
Yun Wei, GuideWell Connect
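A minimal Base SAS sketch of one of the sampling measures the paper compares (simple oversampling of the rare class with PROC SURVEYSELECT); SMOTE itself is configured inside SAS Enterprise Miner. The data set, response variable, and sampling rate below are hypothetical.

proc surveyselect data=marketing out=rare_boost
                  method=urs samprate=5   /* sample rare events with replacement, about 5 copies each */
                  outhits seed=20150101;
   where response = 1;                    /* rare-event records only */
run;

data train_balanced;
   set marketing(where=(response = 0))    /* keep all non-events */
       rare_boost;                        /* plus the oversampled events */
run;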
Paper 3184-2015:
A Configurable SAS® Framework for Managing a Reporting System Based on SAS® OLAP Cube Studio
This paper illustrates a high-level infrastructure discussion with some explanation of the SAS® code used to implement a configurable batch framework for managing and updating the data rows and row-level permissions in SAS® OLAP Cube Studio. The framework contains a collection of reusable, parameter-driven Base SAS® macros, Base SAS custom programs, and UNIX or Linux shell scripts. This collection manages the typical steps and processes used for manipulating SAS files and for executing SAS statements. The Base SAS macro collection contains a group of utility macros that includes concurrent/parallel processing macros, SAS® Metadata Repository macros, SAS® Scalable Performance Data Engine table macros, table lookup macros, table manipulation macros, and other macros. There is also a group of OLAP-related macros that includes OLAP utility macros and OLAP permission table processing macros.
Read the paper (PDF).
Ahmed Al-Attar, AnA Data Warehousing Consulting, LLC
Paper 3257-2015:
A Genetic Algorithm for Data Reduction
When large amounts of data are available, choosing the variables for inclusion in model building can be problematic. In this analysis, a subset of variables was required from a larger set. This subset was to be used in a later cluster analysis with the aim of extracting dimensions of human flourishing. A genetic algorithm (GA), written in SAS®, was used to select the subset of variables from a larger set in terms of their association with the dependent variable life satisfaction. Life satisfaction was selected as a proxy for an as yet undefined quantity, human flourishing. The data were divided into subject areas (health, environment). The GA was applied separately to each subject area to ensure adequate representation from each in the future analysis when defining the human flourishing dimensions.
Read the paper (PDF). | Download the data file (ZIP).
Lisa Henley, University of Canterbury
Paper 3103-2015:
A Macro for Computing the Best Transformation
This session is intended to assist analysts in generating the best variables, such as monthly amount paid, daily number of customer service calls received, weekly hours worked on a project, or annual total sales for a specific product, by using simple mathematical transformations (square root, log, log-log, exp, and reciprocal). During a statistical data modeling process, analysts are often confronted with the task of computing derived variables from the existing variables. The advantage of this methodology is that the new variables might be more significant than the original ones. This paper provides a new way to compute all the possible variables using a set of math transformations. The code includes many SAS® features that are very useful tools for SAS programmers to incorporate in their future code, such as %SYSFUNC, SQL, %INCLUDE, CALL SYMPUT, %MACRO, SORT, CONTENTS, MERGE, DATA _NULL_, %DO ... %TO ..., and many more.
Read the paper (PDF). | Download the data file (ZIP).
Nancy Hu, Discover
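A minimal sketch (not the author's macro) of how a macro %DO loop can generate the candidate transformations for a list of numeric variables; the data set and variable names are hypothetical.

%macro make_transforms(data=, out=, vars=);
   %local i v;
   data &out;
      set &data;
      %let i = 1;
      %do %while(%scan(&vars, &i) ne );
         %let v = %scan(&vars, &i);
         &v._sqrt = sqrt(max(&v, 0));           /* square root */
         &v._log  = log(max(&v, 1e-8));         /* natural log (guarded) */
         &v._ll   = log(log(max(&v, 1.0001)));  /* log-log (guarded) */
         &v._exp  = exp(&v);                    /* exponential */
         &v._rcp  = 1 / max(&v, 1e-8);          /* reciprocal */
         %let i = %eval(&i + 1);
      %end;
   run;
%mend make_transforms;

%make_transforms(data=cust, out=cust_t, vars=amount_paid calls_received)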
Paper 3422-2015:
A Macro to Easily Generate a Calendar Report
This paper introduces a macro that generates a calendar report in two different formats. The first format displays the entire month in one plot, which is called a month-by-month calendar report. The second format displays the entire month in one row and is called an all-in-one calendar report. To use the macro, you just need to prepare a simple data set that has three columns: one column identifies the ID, one column contains the date, and one column specifies the notes for the dates. On the generated calendar reports, you can include notes and add different styles to certain dates. Also, the macro provides the option for you to decide whether those months in your data set that do not contain data should be shown on the reports.
Read the paper (PDF).
Ting Sa, Cincinnati Children's Hospital Medical Center
Paper 3218-2015:
A Mathematical Model for Optimizing Product Mix and Customer Lifetime Value
Companies that offer subscription-based services (such as telecom and electric utilities) must evaluate the tradeoff between month-to-month (MTM) customers, who yield a high margin at the expense of lower lifetime, and customers who commit to a longer-term contract in return for a lower price. The objective, of course, is to maximize the Customer Lifetime Value (CLV). This tradeoff must be evaluated not only at the time of customer acquisition, but throughout the customer's tenure, particularly for fixed-term contract customers whose contract is due for renewal. In this paper, we present a mathematical model that optimizes the CLV against this tradeoff between margin and lifetime. The model is presented in the context of a cohort of existing customers, some of whom are MTM customers and others who are approaching contract expiration. The model optimizes the number of MTM customers to be swapped to fixed-term contracts, as well as the number of contract renewals that should be pursued, at various term lengths and price points, over a period of time. We estimate customer life using discrete-time survival models with time-varying covariates related to contract expiration and product changes. Thereafter, an optimization model is used to find the optimal tradeoff between margin and customer lifetime. Although we specifically present the contract expiration case, this model can easily be adapted for customer acquisition scenarios as well.
Read the paper (PDF). | Download the data file (ZIP).
Atul Thatte, TXU Energy
Goutam Chakraborty, Oklahoma State University
Paper 3314-2015:
A Metadata to Physical Table Comparison Tool
When metadata is promoted in large packages from SAS® Data Integration Studio between environments, the metadata and the underlying physical data can become out of sync. This can result in metadata items that users cannot open because SAS® throws an error. It often falls to the SAS administrator to resolve the synchronization issues, even though they might not have been responsible for promoting the metadata items in the first place. In this paper, we discuss a simple macro that can be used to compare the table metadata with the physical tables and note any anomalies.
Read the paper (PDF). | Download the data file (ZIP).
David Moors, Whitehound Limited
Paper 1603-2015:
A Model Family for Hierarchical Data with Combined Normal and Conjugate Random Effects
Non-Gaussian outcomes are often modeled using members of the so-called exponential family. Well-known members are the Bernoulli model for binary data, leading to logistic regression, and the Poisson model for count data, leading to Poisson regression. Two of the main reasons for extending this family are (1) the occurrence of overdispersion, meaning that the variability in the data is not adequately described by the models, which often exhibit a prescribed mean-variance link, and (2) the accommodation of hierarchical structure in the data, stemming from clustering in the data, which, in turn, might result from repeatedly measuring the outcome, for various members of the same family, and so on. The first issue is dealt with through a variety of overdispersion models, such as the beta-binomial model for grouped binary data and the negative-binomial model for counts. Clustering is often accommodated through the inclusion of random subject-specific effects. Such random effects are conventionally, though not always, assumed to be normally distributed. While both of these phenomena might occur simultaneously, models combining them are uncommon. This paper proposes a broad class of generalized linear models accommodating overdispersion and clustering through two separate sets of random effects. We place particular emphasis on so-called conjugate random effects at the level of the mean for the first aspect and normal random effects embedded within the linear predictor for the second aspect, even though our family is more general. The binary, count, and time-to-event cases are given particular emphasis. Apart from model formulation, we present an overview of estimation methods, and then settle for maximum likelihood estimation with analytic-numerical integration. Implications for the derivation of marginal correlation functions are discussed. The methodology is applied to data from a study of epileptic seizures, a clinical trial for a toenail infection named onychomycosis, and survival data in children with asthma.
Read the paper (PDF). | Watch the recording.
Geert Molenberghs, Universiteit Hasselt & KU Leuven
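For the count-data case, the conjugate gamma random effect can be integrated out analytically, so a sketch of the combined model can be fit with PROC NLMIXED as a negative binomial with a normal random intercept. The data set and variable names below are hypothetical, and this is a simplified illustration rather than the authors' program.

proc nlmixed data=seizures;
   parms b0=1 b1=0 k=1 sd_b=0.5;
   eta = b0 + b1*trt + b;      /* normal random effect in the linear predictor */
   mu  = exp(eta);
   p   = k / (k + mu);         /* negative binomial from the integrated gamma effect; mean = mu */
   model y ~ negbin(k, p);
   random b ~ normal(0, sd_b**2) subject=id;
run;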
Paper 2641-2015:
A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis
In big data, many variables are polytomous with many levels. The common way to handle polytomous independent variables when the outcome is binary is to use a series of design variables, which correspond to the CLASS (or BY) specification of the polytomous independent variable in PROC LOGISTIC. If big data contains many polytomous independent variables with many levels, using design variables makes the analysis very complicated in both computation time and results, and might contribute little to the prediction of the outcome. This paper presents a new, simple method for logistic regression with polytomous independent variables in big data analysis. In the proposed method, the first step is an iterative statistical analysis run from a SAS® macro program. Similar to the algorithm used in the creation of spline variables, this analysis searches for aggregation groups with statistically significant differences among all levels of a polytomous independent variable. The iteration starts from level 1, the level with the smallest outcome mean. We then conduct a statistical test comparing the level 1 group with the level 2 group, the level with the second smallest outcome mean. If these two groups differ significantly, we go on to test the level 2 group against the level 3 group. If level 1 and level 2 do not differ significantly, we combine them into a new level group 1 and then test this new group against level 3. The processing continues until all levels have been tested. We then replace the original level values of the polytomous variable with the new, statistically distinct level values. The polytomous variable with the new levels can be described by the outcome means of the new levels because of the one-to-one equivalence, through a piecewise function in the logit, between the variable's levels and the outcome means. It is easy to show, based on maximum likelihood analysis, that the conditional mean of an outcome y given a polytomous variable x is a very good approximation. Compared with design variables, the new piecewise variable, which uses the information from all levels as a single independent variable, captures the impact of all levels in a much simpler way. We have used this method in predictive models of customer attrition with polytomous variables such as state, business type, and customer claim type. All of these polytomous variables show significant improvement in the prediction of customer attrition compared with omitting them or using design variables in model development.
Read the paper (PDF).
Jian Gao, Constant Contact
Jesse Harriot, Constant Contact
Lisa Pimentel, Constant Contact
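The first step of the proposed method, ranking the levels of a polytomous variable by the outcome mean before the iterative significance testing and collapsing begins, might look like the following sketch; the data set and variable names are hypothetical.

proc sql;
   create table level_means as
   select state,
          count(*)       as n,
          mean(attrited) as attrition_rate
   from customers
   group by state
   order by attrition_rate;   /* levels ordered from smallest to largest outcome mean */
quit;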
Paper SAS1682-2015:
A Practical Approach to Managing a Multi-Tenant SAS® Intelligence Platform Deployment
Modernizing SAS® assets within an enterprise is key to reducing costs and improving productivity. Modernization implies consolidating multiple SAS environments into a single shared enterprise SAS deployment. While the benefits of modernization are clear, the management of a single-enterprise deployment is sometimes a struggle between business units who once had autonomy and IT that is now responsible for managing this shared infrastructure. The centralized management and control of a SAS deployment is based on SAS metadata. This paper provides a practical approach to the shared management of a centralized SAS deployment using SAS® Management Console. It takes into consideration the day-to-day needs of the business and IT requirements including centralized security, monitoring, and management. This document defines what resources are contained in SAS metadata, what responsibilities should be centrally controlled, and the pros and cons of distributing the administration of metadata content across the enterprise. This document is intended as a guide for SAS administrators and assumes that you are familiar with the concepts and terminology introduced in SAS® 9.4 Intelligence Platform: Security Administration Guide.
Read the paper (PDF).
Jim Fenton, SAS
Robert Ladd, SAS
Paper 1980-2015:
A Practical Guide to SAS® Extended Attributes
All SAS® data sets and variables have standard attributes. These include items such as creation date, engine, compression and sort information for data sets, and format and length information for variables. However, for the first time in SAS 9.4, the developer can add their own customized attributes to both data sets and variables. This paper shows how these extended attributes can be created, modified, and maintained. It suggests the sort of items that might be candidates for use as extended attributes and explains in what circumstances they can be used. It also provides a worked example of how they can be used to inform and aid the SAS programmer in creating SAS applications.
Read the paper (PDF).
Chris Brooks, Melrose Analytics Ltd
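A short example of the syntax the paper covers: extended attributes are created and maintained with the XATTR statement of PROC DATASETS in SAS 9.4. The library, table, and attribute names are hypothetical.

proc datasets library=work nolist;
   modify sales;
   xattr set ds  owner='Finance BI' refresh_freq='Monthly';    /* data set attributes */
   xattr set var revenue (units='USD' source='Billing feed');  /* variable attributes */
quit;

proc contents data=work.sales;   /* extended attributes appear in the CONTENTS output */
run;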
Paper 3465-2015:
A Quick View of SAS® Views®
Looking for a handy technique to have in your toolkit? Consider SAS® Views®, especially if you work with large data sets. After a brief introduction to SAS Views, I'll show you several cool ways to use them that will streamline your code and save workspace.
Read the paper (PDF).
Elizabeth Axelrod, Abt Associates Inc.
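A quick illustration of the idea: a DATA step view stores only the program, so the subset and derived column are computed when the view is read rather than written to disk. Library and variable names are hypothetical.

data work.east_sales_v / view=work.east_sales_v;
   set bigdb.sales;
   where region = 'EAST';
   margin = revenue - cost;
run;

proc means data=work.east_sales_v sum mean;
   var margin;
run;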
Paper 3560-2015:
A SAS Macro to Calculate the PDC Adjustment of Inpatient Stays
The Centers for Medicare & Medicaid Services (CMS) uses the Proportion of Days Covered (PDC) to measure medication adherence. There is also some PDC-related research based on Medicare Part D Event (PDE) data. However, under Medicare rules, beneficiaries who receive care at an inpatient (IP) facility may receive Medicare-covered medications directly from the IP, rather than by filling prescriptions through their Part D contracts; thus, their medication fills during an IP stay would not be included in the PDE claims used to calculate the Patient Safety adherence measures (Medicare 2014 Part C & D star rating technical notes). Therefore, the previous PDC calculation method underestimated the true PDC value. Starting with the 2013 star rating, the PDC calculation was adjusted for IP stays. That is, when a patient has an inpatient admission during the measurement period, the inpatient stay is censored for the PDC calculation. If the patient also has measured drug coverage during the inpatient stay, the drug supplied during the inpatient stay is shifted to after the inpatient stay. This shift can also cause a chain of subsequent shifts. This paper presents a SAS® macro that uses the SAS hash object to match inpatient stays, censor the inpatient stays, shift the drug start and end dates, and calculate the adjusted PDC.
Read the paper (PDF). | Download the data file (ZIP).
Anping Chang, IHRC Inc.
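A simplified sketch of the matching step: inpatient stays are loaded into a hash object and a drug fill date that falls within a stay is shifted to the day after discharge. The paper's macro also handles multiple stays per beneficiary and the resulting chain of shifts; all names here are hypothetical.

data pde_adjusted;
   if _n_ = 1 then do;
      declare hash ip(dataset: 'ip_stays');
      ip.defineKey('bene_id');
      ip.defineData('admit_dt', 'discharge_dt');
      ip.defineDone();
      call missing(admit_dt, discharge_dt);
   end;
   set pde_claims;
   if ip.find() = 0 and admit_dt <= fill_dt <= discharge_dt then
      fill_dt = discharge_dt + 1;   /* shift the fill past the inpatient stay */
   format fill_dt date9.;
run;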
Paper 2141-2015:
A SAS® Macro to Compare Predictive Values of Diagnostic Tests
Medical tests are used for various purposes, including diagnosis, prognosis, risk assessment, and screening. Statistical methodology is often used to evaluate such tests; the most frequently used measures for binary data are sensitivity, specificity, and positive and negative predictive values. An important goal in diagnostic medicine research is to estimate and compare the accuracies of such tests. In this paper, I give a gentle introduction to measures of diagnostic test accuracy and introduce a SAS® macro that calculates the generalized score statistic and the weighted generalized score statistic for comparing predictive values, using formulas generalized and proposed by Andrzej S. Kosinski.
Read the paper (PDF).
Lovedeep Gondara, University of Illinois Springfield
Paper 2980-2015:
A Set of SAS® Macros for Generating Survival Analysis Reports for Lifetime Data with or without Competing Risks
This paper introduces a set of SAS® macros, %LIFETEST and %LIFETESTEXPORT, that generate survival analysis reports for data with or without competing risks. The macros provide a wrapper of PROC LIFETEST and an enhanced version of the SAS autocall macro %CIF to give users an easy-to-use interface for reporting both survival estimates and cumulative incidence estimates in a unified way. The macros also provide a number of parameters that enable users to flexibly adjust how the final reports look without the need to manually input or format the final reports.
Read the paper (PDF). | Download the data file (ZIP).
Zhen-Huan Hu, Medical College of Wisconsin
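For context, the procedure the macros wrap looks like this: Kaplan-Meier estimates by group with a log-rank test. The data set and variable names are hypothetical.

proc lifetest data=transplant plots=survival(atrisk cb=hw);
   time months_to_event * status(0);   /* 0 = censored */
   strata disease_group;
run;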
Paper 3194-2015:
A Tool That Uses the SAS® PRX Functions to Fix Delimited Text Files
Delimited text files are often plagued by appended and/or truncated records. Writing customized SAS® code to import such a text file and break it out into fields can be challenging. If only there were a way to fix the file before importing it. Enter the file_fixing_tool, a SAS® Enterprise Guide® project that uses the SAS PRX functions to import, fix, and export a delimited text file. The fixed file can then be easily imported and broken out into fields.
Read the paper (PDF). | Download the data file (ZIP).
Paul Genovesi, Henry Jackson Foundation for the Advancement of Military Medicine, Inc.
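A sketch (not the project itself) of how the PRX functions can flag broken records in a pipe-delimited file before any repair is attempted; the file name and expected delimiter count are hypothetical.

data check;
   length masked $32767 status $16;
   infile 'orders.txt' lrecl=32767 truncover;
   input line $char32767.;
   masked = prxchange('s/"[^"]*"/""/', -1, line);   /* mask delimiters inside quoted fields */
   ndelim = countc(masked, '|');
   status = ifc(ndelim = 9, 'OK',
            ifc(ndelim < 9, 'TRUNCATED', 'APPENDED'));
run;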
Paper SAS1877-2015:
Access, Modify, Enhance: Self-Service Data Management in SAS® Visual Analytics
SAS® Visual Analytics provides self-service capabilities for users to analyze, explore, and report on their own data. As users explore their data, there is always a need to bring in more data sources, create new variables, combine data from multiple sources, and even update their data occasionally. SAS Visual Analytics provides targeted user capabilities to access, modify, and enhance data suitable for specific business needs. This paper provides a clear understanding of these capabilities and suggests best practices for self-service data management in SAS Visual Analytics.
Read the paper (PDF).
Gregor Herrmann, SAS
Paper 3311-2015:
Adaptive Randomization Using PROC MCMC
Based on work by Thall et al. (2012), we implement a method for randomizing patients in a Phase II trial. We accumulate evidence that identifies which dose(s) of a cancer treatment provide the most desirable profile, per a matrix of efficacy and toxicity combinations rated by expert oncologists (0-100). Experts also define the region of Good utility scores and criteria of dose inclusion based on toxicity and efficacy performance. Each patient is rated for efficacy and toxicity at a specified time point. Simulation work is done mainly using PROC MCMC in which priors and likelihood function for joint outcomes of efficacy and toxicity are defined to generate posteriors. Resulting joint probabilities for doses that meet the inclusion criteria are used to calculate the mean utility and probability of having Good utility scores. Adaptive randomization probabilities are proportional to the probabilities of having Good utility scores. A final decision of the optimal dose will be made at the end of the Phase II trial.
Read the paper (PDF).
Qianyi Huang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
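A simplified sketch of the posterior simulation for a single dose, assuming independent binary efficacy and toxicity outcomes with Beta(1,1) priors; the paper's model specifies the joint efficacy-toxicity likelihood. Data set and variable names are hypothetical.

proc mcmc data=dose_a nmc=20000 nbi=2000 seed=2015 outpost=post_a;
   parms p_eff 0.3 p_tox 0.2;
   prior p_eff ~ beta(1, 1);
   prior p_tox ~ beta(1, 1);
   model eff ~ binary(p_eff);
   model tox ~ binary(p_tox);
run;

The posterior draws in POST_A could then be mapped to the expert utility matrix to estimate the probability of a Good utility score, which drives the adaptive randomization probabilities.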
Paper SAS2603-2015:
Addressing AML Regulatory Pressures by Creating Customer Risk Rating Models with Ordinal Logistic Regression
With increasing regulatory emphasis on using more scientific statistical processes and procedures in the Bank Secrecy Act/Anti-Money Laundering (BSA/AML) compliance space, financial institutions are being pressured to replace their heuristic, rule-based customer risk rating models with well-established, academically supported, statistically based models. As part of their customer-enhanced due diligence, firms are expected to both rate and monitor every customer for the overall risk that the customer poses. Firms with ineffective customer risk rating models can face regulatory enforcement actions such as matters requiring attention (MRAs); the Office of the Comptroller of the Currency (OCC) can issue consent orders for federally chartered banks; and the Federal Deposit Insurance Corporation (FDIC) can take similar actions against state-chartered banks. Although there is a reasonable amount of information available that discusses the use of statistically based models and adherence to the OCC bulletin Supervisory Guidance on Model Risk Management (OCC 2011-12), there is only limited material about the specific statistical techniques that financial institutions can use to rate customer risk. This paper discusses some of these techniques; compares heuristic, rule-based models and statistically based models; and suggests ordinal logistic regression as an effective statistical modeling technique for assessing customer BSA/AML compliance risk. In discussing the ordinal logistic regression model, the paper addresses data quality and the selection of customer risk attributes, as well as the importance of following the OCC's key concepts for developing and managing an effective model risk management framework. Many statistical models can be used to assign customer risk, but logistic regression, and in this case ordinal logistic regression, is a fairly common and robust statistical method of assigning customers to ordered classifications (such as Low, Medium, High-Low, High-Medium, and High-High risk). Using ordinal logistic regression, a financial institution can create a customer risk rating model that is effective in assigning risk, justifiable to regulators, and relatively easy to update, validate, and maintain.
Read the paper (PDF).
Edwin Rivera, SAS
Jim West, SAS
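A minimal sketch of the modeling technique the paper recommends: PROC LOGISTIC fits a cumulative logit model when the response is an ordered risk rating. The data set, risk attributes, and rating variable are hypothetical.

proc logistic data=customers;
   class occupation_cat geography_cat / param=ref;
   model risk_rating = occupation_cat geography_cat wire_volume cash_volume
         / link=clogit;   /* cumulative logit for the ordered rating */
run;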
Paper SAS1919-2015:
Advanced Techniques for Fitting Mixed Models Using SAS/STAT® Software
Fitting mixed models to complicated data, such as data that include multiple sources of variation, can be a daunting task. SAS/STAT® software offers several procedures and approaches for fitting mixed models. This paper provides guidance on how to overcome obstacles that commonly occur when you fit mixed models using the MIXED and GLIMMIX procedures. Examples are used to showcase procedure options and programming techniques that can help you overcome difficult data and modeling situations.
Read the paper (PDF).
Kathleen Kiernan, SAS
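One of the situations the paper addresses, a binary outcome with a random center effect, fit with adaptive quadrature in PROC GLIMMIX; the data set and variable names are hypothetical.

proc glimmix data=multicenter method=quad;
   class center treatment;
   model response(event='1') = treatment / dist=binary link=logit solution;
   random intercept / subject=center;
run;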
Paper 3492-2015:
Alien Nation: Text Analysis of UFO Sightings in the US Using SAS® Enterprise Miner™ 13.1
Are we alone in this universe? This is a question that undoubtedly passes through every mind several times during a lifetime. We often hear stories about close encounters, Unidentified Flying Object (UFO) sightings, and other mysterious things, but we lack documented evidence for analysis on this topic. UFOs have been a matter of public interest for a long time. The objective of this paper is to analyze a database that contains documented reports of UFO sightings to uncover any fascinating stories in the data. Using SAS® Enterprise Miner™ 13.1, the powerful capabilities of text analytics and topic mining are leveraged to summarize the associations between reported sightings. We used PROC GEOCODE to convert the addresses of sightings to locations on the map. Then we used PROC GMAP to produce a heat map representing the frequency of sightings in various locations. The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values). These geographic coordinates can then be used on a map to calculate distances or to perform spatial analysis. On preliminary analysis of the data associated with sightings, it was found that the most popular words associated with UFOs tell us about their shapes, formations, movements, and colors. The Text Profile node in SAS Enterprise Miner 13.1 was leveraged to build a model and cluster the data into different levels of a segment variable. We also explain how opinions about the UFO sightings change over time using text profiling. Further, this analysis uses the Text Profile node to find interesting terms or topics that were used to describe the UFO sightings. Based on feedback received at the SAS® Analytics Conference, we plan to incorporate a technique to filter duplicate comments and to include the weather at each location.
Read the paper (PDF). | Download the data file (ZIP).
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
Naresh Abburi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Zabiulla Mohammed, Oklahoma State University
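A sketch of the geocoding step described above, assuming the sightings data set contains a ZIP variable; the SASHELP.ZIPCODE lookup data set ships with SAS.

proc geocode data=ufo_sightings out=ufo_geo
             method=zip lookup=sashelp.zipcode;   /* adds latitude and longitude */
run;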
Paper 3276-2015:
All In: Integrated Enterprise-Wide Analytics and Reporting with SAS® Visual Analytics and SAS® Business Intelligence
In 2013, the University of North Carolina (UNC) at Chapel Hill initiated enterprise-wide use of SAS® solutions for reporting and data transformations. Just over one year later, the initial rollout was scheduled to go live to an audience of 5,500 users as part of an adoption of PeopleSoft ERP for Finance, Human Resources, Payroll, and Student systems. SAS® Visual Analytics was used for primary report delivery as an embedded resource within the UNC Infoporte, an existing portal. UNC made the date. With the SAS solutions, UNC delivered the data warehouse and initial reports on the same day that the ERP systems went live. After the success of the initial launch, UNC continues to develop and evolve the solution with additional technologies, data, and reports. This presentation touches on a few of the elements required for a medium to large size organization to integrate SAS solutions such as SAS Visual Analytics and SAS® Enterprise Business Intelligence within their infrastructure.
Read the paper (PDF).
Jonathan Pletzke, UNC Chapel Hill
Paper 3197-2015:
All Payer Claims Databases (APCDs) in Data Transparency and Quality Improvement
Since Maine established the first All Payer Claims Database (APCD) in 2003, 10 additional states have established APCDs and 30 others are in development or show strong interest in establishing APCDs. APCDs are generally mandated by legislation, though voluntary efforts exist. They are administered through various agencies, including state health departments or other governmental agencies and private not-for-profit organizations. APCDs receive funding from various sources, including legislative appropriations and private foundations. To ensure sustainability, APCDs must also consider the sale of data access and reports as a source of revenue. With the advent of the Affordable Care Act, there has been an increased interest in APCDs as a data source to aid in health care reform. The call for greater transparency in health care pricing and quality, development of Patient-Centered Medical Homes (PCMHs) and Accountable Care Organizations (ACOs), expansion of state Medicaid programs, and establishment of health insurance and health information exchanges have increased the demand for the type of administrative claims data contained in an APCD. Data collection, management, analysis, and reporting issues are examined with examples from implementations of live APCDs. The development of data intake, processing, warehousing, and reporting standards is discussed in light of achieving the triple aim of improving the individual experience of care, improving the health of populations, and reducing the per capita costs of care. APCDs are compared and contrasted with other sources of state-level health care data, including hospital discharge databases, state departments of insurance records, and institutional and consumer surveys. The benefits and limitations of administrative claims data are reviewed. Specific issues addressed with examples include implementing transparent reporting of service prices and provider quality, maintaining master patient and provider identifiers, validating APCD data and comparing them with other state health care data available to researchers and consumers, defining data suppression rules to ensure patient confidentiality and HIPAA-compliant data release and reporting, and serving multiple end users, including policy makers, researchers, and consumers, with appropriately consumable information.
Read the paper (PDF). | Watch the recording.
Paul LaBrec, 3M Health Information Systems
Paper 3412-2015:
Alternative Methods of Regression When Ordinary Least Squares Regression Is Not Right
Ordinary least squares regression is one of the most widely used statistical methods. However, it is a parametric model and relies on assumptions that are often not met. Alternative methods of regression for continuous dependent variables relax these assumptions in various ways. This paper explores procedures such as QUANTREG, ADAPTIVEREG, and TRANSREG for these kinds of data.
Read the paper (PDF). | Watch the recording.
Peter Flom, Peter Flom Consulting
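For example, quantile regression with PROC QUANTREG relaxes the ordinary least squares assumptions by modeling conditional percentiles of the response; the data set and variables are hypothetical.

proc quantreg data=study ci=resampling;
   model y = x1 x2 / quantile=0.25 0.5 0.75 plot=quantplot;
run;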
Paper 3140-2015:
An Application of the Cox Proportional Hazards Model to the Construction of Objective Vintages for Credit in Financial Institutions, Using PROC PHREG
At Scotia-Colpatria Bank, the retail segment is very important. The volume of lending applications makes it necessary to use statistical models and analytic tools to make an initial selection of good customers, whom our credit analysts then study in depth to finally approve or deny a credit application. The construction of target vintages using the Cox model will generate past-due alerts in a shorter time, so that mitigation measures can be applied one or two months earlier than they are currently. This can reduce losses by 100 bps in the new vintages. This paper estimates a Cox proportional hazards model and compares the results with a logit model for a specific product of the bank. Additionally, we estimate the objective vintage for the product.
Read the paper (PDF). | Download the data file (ZIP).
Ivan Atehortua Rojas, Scotia - Colpatria Bank
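A minimal sketch of the kind of PROC PHREG specification involved, with hypothetical covariates; 0 marks a censored (non-defaulted) loan.

proc phreg data=vintages;
   class product segment / param=ref;
   model months_on_book*default_flag(0) = ltv income product segment;
run;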
Paper 3371-2015:
An Application of the DEA Optimization Methodology to Make More Effective and Efficient Collection Calls
In our management and collection area, there was no methodology that provided the optimal number of collection calls needed to get the customer to make the minimum payment on his or her financial obligation. We wanted to determine the optimal number of calls using the data envelopment analysis (DEA) optimization methodology. Using this methodology, we obtained results that positively impacted the way our customers are contacted. We can maintain a healthy bank-customer relationship, keep management and collection at an operational level, and obtain a more effective and efficient portfolio recovery. The DEA optimization methodology has been successfully used in various fields of manufacturing production and has solved multi-criteria optimization problems, but it has not been commonly used in the financial sector, especially in the collection area. This methodology requires specialized software, such as SAS® Enterprise Guide® and its robust optimization capabilities. In this presentation, we introduce PROC OPTMODEL and show how to formulate the optimization problem, write the program, and process the available data.
Read the paper (PDF).
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Ana Nieto, Banco Colpatria of Scotiabank Group
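A generic input-oriented DEA (CCR) sketch in PROC OPTMODEL, looping over decision-making units: for each unit, it finds the smallest factor theta by which the unit's inputs can be scaled while a nonnegative combination of peers still matches its outputs. The inputs, outputs, and the data set dea_data are hypothetical placeholders, not the bank's model.

proc optmodel;
   set <str> DMU;
   set <str> INP  = {'calls', 'agent_hours'};
   set <str> OUTP = {'amount_recovered'};
   num x{DMU, INP};
   num y{DMU, OUTP};
   read data dea_data into DMU=[unit]
        {i in INP}  < x[unit, i] = col(i) >
        {o in OUTP} < y[unit, o] = col(o) >;

   str target;                           /* unit being evaluated */
   var lambda{DMU} >= 0;
   var theta;
   min Efficiency = theta;
   con InputCon {i in INP}:
      sum{d in DMU} lambda[d] * x[d, i] <= theta * x[target, i];
   con OutputCon {o in OUTP}:
      sum{d in DMU} lambda[d] * y[d, o] >= y[target, o];

   num eff{DMU};
   for {t in DMU} do;
      target = t;
      solve;
      eff[t] = theta.sol;                /* efficiency score for unit t */
   end;
   print eff;
quit;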
Paper 3300-2015:
An Empirical Comparison of Multiple Imputation Approaches for Treating Missing Data in Observational Studies
Missing data are a common and significant problem that researchers and data analysts encounter in applied research. Because most statistical procedures require complete data, missing data can substantially affect the analysis and the interpretation of results if left untreated. Methods to treat missing data have been developed so that missing values are imputed and analyses can be conducted using standard statistical procedures. Among these missing data methods, multiple imputation has received considerable attention and its effectiveness has been explored (for example, in the context of survey and longitudinal research). This paper compares four multiple imputation approaches for treating missing continuous covariate data under MCAR, MAR, and NMAR assumptions, in the context of propensity score analysis and observational studies. The comparison of the four MI approaches in terms of bias in parameter estimates, Type I error rates, and statistical power is presented. In addition, complete case analysis (listwise deletion) is presented as the default analysis that would be conducted if missing data are not treated. Issues are discussed, and conclusions and recommendations are provided.
Read the paper (PDF).
Patricia Rodriguez de Gil, University of South Florida
Shetay Ashford, University of South Florida
Chunhua Cao, University of South Florida
Eun-Sook Kim, University of South Florida
Rheta Lanehart, University of South Florida
Reginald Lee, University of South Florida
Jessica Montgomery, University of South Florida
Yan Wang, University of South Florida
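A compact sketch of the multiple imputation workflow being compared: impute, fit the treatment-assignment (propensity) model on each completed data set, and combine the estimates. The data set and variable names are hypothetical, and the default imputation settings are shown only for illustration.

proc mi data=study nimpute=20 seed=2015 out=mi_out;
   var age income baseline_score treat outcome;
run;

proc logistic data=mi_out;
   by _imputation_;
   model treat(event='1') = age income baseline_score;
   ods output ParameterEstimates=ps_parms;
run;

proc mianalyze parms=ps_parms;
   modeleffects Intercept age income baseline_score;
run;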
Paper 3439-2015:
An Innovative Method of Customer Clustering
This session will describe an innovative way to identify groupings of customer offerings using SAS® software. The authors investigated the customer enrollments in nine different programs offered by a large energy utility. These programs included levelized billing plans, electronic payment options, renewable energy, energy efficiency programs, a home protection plan, and a home energy report for managing usage. Of the 640,788 residential customers, 374,441 had been solicited for a program and had adequate data for analysis. Nearly half of these eligible customers (49.8%) enrolled in some type of program. To examine the commonality among programs based on characteristics of customers who enroll, cluster analysis procedures and correlation matrices are often used. However, the value of these procedures was greatly limited by the binary nature of enrollments (enroll or no enroll), as well as the fact that some programs are mutually exclusive (limiting cross-enrollments for correlation measures). To overcome these limitations, PROC LOGISTIC was used to generate predicted scores for each customer for a given program. Then, using the same predictor variables, PROC LOGISTIC was used on each program to generate predictive scores for all customers. This provided a broad range of scores for each program, under the assumption that customers who are likely to join similar programs would have similar predicted scores for these programs. PROC FASTCLUS was used to build k-means cluster models based on these predicted logistic scores. Two distinct clusters were identified from the nine programs. These clusters not only aligned with the hypothesized model, but were generally supported by correlations (using PROC CORR) among program predicted scores as well as program enrollments.
Read the paper (PDF).
Brian Borchers, PhD, Direct Options
Ashlie Ossege, Direct Options
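A sketch of the scoring-then-clustering approach for two of the nine programs (the full analysis repeats the scoring step for every program); all data set and variable names are hypothetical.

proc logistic data=customers;
   model enroll_budget_billing(event='1') = age usage_kwh home_age income;
   score data=customers out=s1(rename=(P_1=p_budget_billing));
run;

proc logistic data=customers;
   model enroll_paperless(event='1') = age usage_kwh home_age income;
   score data=s1 out=s2(rename=(P_1=p_paperless));
run;

proc fastclus data=s2 maxclusters=2 out=clustered;
   var p_budget_billing p_paperless;   /* cluster on predicted program scores */
run;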
Paper SAS1836-2015:
An Insider's Guide to ODS LAYOUT Using SAS® 9.4
Whenever you travel, whether it's to a new destination or to your favorite vacation spot, it's nice to have a guide to assist you with planning and setting expectations. The ODS LAYOUT statement became production in SAS® 9.4. For those intrepid programmers who used ODS LAYOUT in an earlier release of SAS®, this paper contains tons of information about changes you need to know about. Programmers new to SAS 9.4 (or new to ODS LAYOUT) need to understand the basics. This paper reviews some common issues customers have reported to SAS Technical Support when migrating to the LAYOUT destination in SAS 9.4 and explores the basics for those who are making their first foray into the adventure that is ODS LAYOUT. This paper discusses some tips and tricks to ensure that your trip through the ODS LAYOUT statement will be a fun and rewarding one.
Read the paper (PDF).
Scott Huntley, SAS
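A small example of the production SAS 9.4 syntax discussed in the paper: a two-column gridded layout in the PDF destination.

ods pdf file='layout_demo.pdf';
ods layout gridded columns=2 column_gutter=0.25in;

ods region;
proc print data=sashelp.class(obs=5) noobs;
run;

ods region;
proc sgplot data=sashelp.class;
   scatter x=height y=weight;
run;

ods layout end;
ods pdf close;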
Paper SAS1777-2015:
An Insider's Guide to SAS/ACCESS® Interface to Hadoop
In the very near future you will likely encounter Hadoop. It is rapidly displacing database management systems in the corporate world and is rearing its head in the SAS® world. If you think now is the time to learn how to use SAS with Hadoop, you are in luck. This workshop is the jump start you need. It introduces Hadoop and shows you how to access it by using SAS/ACCESS® Interface to Hadoop. During the workshop, we show you how to configure your SAS environment so that you can access Hadoop data, how to use the Hadoop FILENAME statement, how to use the HADOOP procedure, and how to use SAS/ACCESS Interface to Hadoop (including performance tuning).
Diane Hatcher, SAS
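A sketch of the LIBNAME access route covered in the workshop: connect to Hive through SAS/ACCESS Interface to Hadoop and let the summarization run inside Hadoop. The server, port, schema, and user values are site-specific placeholders.

libname hdp hadoop server='hadoop-master.example.com' port=10000
        schema=sales user=sasdemo;

proc sql;
   create table work.region_totals as
   select region, count(*) as orders, sum(amount) as revenue
   from hdp.orders
   group by region;
quit;

libname hdp clear;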
Paper 3294-2015:
An Introduction to Testing for Unit Roots Using SAS®: The Case of U.S. National Health Expenditures
Testing for unit roots and determining whether a data set is nonstationary is important for the economist who does empirical work. SAS® enables the user to detect unit roots using an array of tests: the Dickey-Fuller, Augmented Dickey-Fuller, Phillips-Perron, and the Kwiatkowski-Phillips-Schmidt-Shin test. This paper presents a brief overview of unit roots and shows how to test for a unit root using the example of U.S. national health expenditure data.
Read the paper (PDF). | Download the data file (ZIP).
Don McCarthy, Kaiser Permanente
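A brief sketch of the tests mentioned above, using a hypothetical data set of national health expenditures: the augmented Dickey-Fuller test in PROC ARIMA and the Phillips-Perron and KPSS tests in PROC AUTOREG.

proc arima data=nhe;
   identify var=health_exp stationarity=(adf=2);   /* ADF test with up to 2 augmenting lags */
run;

proc autoreg data=nhe;
   model health_exp = / stationarity=(phillips, kpss);
run;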
Paper 2600-2015:
An Introductory Overview of the Features of Complex Survey Data
A complex survey data set is one characterized by any combination of the following four features: stratification, clustering, unequal weights, or finite population correction factors. In this paper, we provide context for why these features might appear in data sets produced from surveys, highlight some of the formulaic modifications they introduce, and outline the syntax needed to properly account for them. Specifically, we explain why you should use the SURVEY family of SAS/STAT® procedures, such as PROC SURVEYMEANS or PROC SURVEYREG, to analyze data of this type. Although many of the syntax examples are drawn from a fictitious expenditure survey, we also discuss the origins of complex survey features in three real-world survey efforts sponsored by statistical agencies of the United States government--namely, the National Ambulatory Medical Care Survey, the National Survey of Family Growth, and the Consumer Building Energy Consumption Survey.
Read the paper (PDF). | Watch the recording.
Taylor Lewis, University of Maryland
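A short example of the syntax the paper outlines, with all four design features declared; the data set, design variables, and stratum totals data set are hypothetical.

proc surveymeans data=expenditures total=strata_sizes mean sum clm;
   strata region;        /* stratification */
   cluster psu;          /* clustering */
   weight samp_wt;       /* unequal weights */
   var annual_spend;     /* TOTAL= supplies the finite population correction */
run;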
Paper SAS1759-2015:
An Overview of Econometrics Tools in SAS/ETS®: Explaining the Past and Modeling the Future
The importance of econometrics in the analytics toolkit is increasing every day. Econometric modeling helps uncover structural relationships in observational data. This paper highlights the many recent changes to the SAS/ETS® portfolio that increase your power to explain the past and predict the future. Examples show how you can use Bayesian regression tools for price elasticity modeling, use state space models to gain insight from inconsistent time series, use panel data methods to help control for unobserved confounding effects, and much more.
Read the paper (PDF).
Mark Little, SAS
Kenneth Sanford, SAS
Paper 1644-2015:
Analysis of Survey Data Using the SAS SURVEY Procedures: A Primer
This paper provides an overview of analysis of data derived from complex sample designs. General discussion of how and why analysis of complex sample data differs from standard analysis is included. In addition, a variety of applications are presented using PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG, PROC SURVEYLOGISTIC, and PROC SURVEYPHREG, with an emphasis on correct usage and interpretation of results.
Read the paper (PDF). | Watch the recording.
Patricia Berglund, University of Michigan
Paper 3473-2015:
Analytics for NHL Hockey
This presentation provides an overview of the advancement of analytics in National Hockey League (NHL) hockey, including how it applies to areas such as player performance, coaching, and injuries. The speaker discusses his analysis on predicting concussions that was featured in the New York Times, as well as other examples of statistical analysis in hockey.
Read the paper (PDF).
Peter Tanner, Capital One
Paper 3497-2015:
Analytics to Inform Name Your Own Price Reserve Setting
Behind an e-commerce site selling many thousands of live events lie inventory from thousands of ticket suppliers who can and do change prices constantly, and all the historical data on prices for each event and similar events; layer in customer bidding behavior and you have a big data opportunity on your hands. I will talk about the evolution of pricing at ScoreBig in this framework and the models we've developed to set our reserve pricing. These models and the underlying data are also used by our inventory partners to continue to refine their pricing. I will also highlight how having a name-your-own-price framework helps with the development of pricing models.
Read the paper (PDF).
Alison Burnham, ScoreBig Inc
Paper 3369-2015:
Analyzing Customer Answers in Calls from Collections Using SAS® Text Miner to Respond in an Efficient and Effective Way
At Multibanca Colpatria of Scotiabank, we offer a broad range of financial services and products in Colombia. In collection management, we currently manage more than 400,000 customers each month. In the call center, agents collect answers from each contact with the customer, and this information is saved in databases. However, this information has not been explored to learn more about our customers and our own operation. The objective of this paper is to develop a classification model using the words in each customer's answers to calls about receiving payment. Using a combination of text mining and clustering methodologies, we identify the possible conversations that can occur in each stage of delinquency. This knowledge makes it possible to develop specialized scripts for collection management.
Read the paper (PDF).
Oscar Ayala, Colpatria
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Paper 3293-2015:
Analyzing Direct Marketing Campaign Performance Using Weight of Evidence Coding and Information Value through SAS® Enterprise Miner™ Incremental Response
Data mining and predictive models are extensively used to find the optimal customer targets in order to maximize the return on investment. Direct marketing techniques target all the customers who are likely to buy, regardless of the customer classification. In a real sense, this mechanism cannot identify the customers who would buy even without a marketing contact, thereby resulting in a loss on investment. This paper focuses on the Incremental Lift modeling approach using Weight of Evidence coding and Information Value, followed by incremental response and outcome model diagnostics. This model identifies the additional purchases that would not have taken place without a marketing campaign. Modeling work was conducted using a combined model. The research was carried out on Travel Center data. The analysis identifies an increase of 2.8% in the average response rate and of 244 fuel gallons when compared with the results from the traditional campaign, which targeted everyone. This paper discusses in detail the implementation of the Incremental Response node to direct the marketing campaigns and its incremental revenue and profit analysis.
Read the paper (PDF). | Download the data file (ZIP).
Sravan Vadigepalli, Best Buy
Paper 3472-2015:
Analyzing Marine Piracy from Structured and Unstructured Data Using SAS® Text Miner
Approximately 80% of world trade at present uses the seaways, with around 110,000 merchant vessels, 1.25 million seafarers, and almost 6 billion tons of goods transported every year. Marine piracy stands as a serious challenge to sea trade. Understanding how pirate attacks occur is crucial in effectively countering marine piracy. Predictive modeling that combines textual data with numeric data provides an effective methodology to derive insights from both structured and unstructured data. 2,266 text descriptions about pirate incidents that occurred over the past seven years, from 2008 to the second quarter of 2014, were collected from the International Maritime Bureau (IMB) website. Analysis of the textual data using SAS® Enterprise Miner™ 12.3, with the help of concept links, answered questions on certain aspects of pirate activities, such as the following: 1. What arms do pirates use for attacks? 2. How do pirates steal the ships? 3. How do pirates escape after the attacks? 4. What are the reasons for occasional unsuccessful attacks? Topics are extracted from the text descriptions using a Text Topic node, and the varying trends of these topics are analyzed with respect to time. Using the Cluster node, attack descriptions are classified into different categories based on the attack style and pirate behavior described by a set of terms. A target variable called Attack Type is derived from the clusters and is combined with other structured input variables such as Ship Type, Status, Region, Part of Day, and Part of Year. A predictive model is built with Attack Type as the target variable and the other structured data variables as input predictors. The predictive model is used to predict the possible type of attack given the details of a ship and its travel. Thus, the results of this paper could be very helpful for the shipping industry to become more aware of possible attack types for different vessel types when traversing different routes, and to devise counter-strategies to reduce the effects of piracy on crews, vessels, and cargo.
Read the paper (PDF).
Raghavender Reddy Byreddy, Oklahoma State University
Nitish Byri, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Tejeshwar Gurram, Oklahoma State University
Anvesh Reddy Minukuri, Oklahoma State University
Paper SAS2082-2015:
Analyzing Messy and Wide Data on a Desktop
Data comes from a rich variety of sources in a rich variety of types, shapes, sizes, and properties. The analysis can be challenged by data that is too tall or too wide; too full of miscodings, outliers, or holes; or that contains funny data types. Wide data, in particular, has many challenges, requiring the analysis to adapt with different methods. Making covariance matrices with 2.5 billion elements is just not practical. JMP® 12 will address these challenges.
Read the paper (PDF).
John Sall, SAS
Paper 3351-2015:
Analyzing Parking Meter Transactions Using SAS® Procedures
This analysis is based on data for all transactions at four parking meters within a small area in central Copenhagen for a period of four years. The observations show the exact minute parking was bought and the amount of time for which parking was bought in each transaction. These series of at most 80,000 transactions are aggregated to the hour, day, week, and month using PROC TIMESERIES. The aggregated series of parking times and the number of transactions are analyzed for seasonality and interdependence by PROC X12, PROC UCM, and PROC VARMAX.
Read the paper (PDF).
Anders Milhoj, Copenhagen University
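A sketch of the aggregation step with PROC TIMESERIES, assuming a hypothetical transaction data set with a datetime stamp and the minutes of parking bought.

proc timeseries data=meter_trans out=hourly;
   id txn_datetime interval=hour accumulate=total;   /* sum transactions within each hour */
   var minutes_bought;
run;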
Paper SAS1332-2015:
Analyzing Spatial Point Patterns Using the New SPP Procedure
In many spatial analysis applications (including crime analysis, epidemiology, ecology, and forestry), spatial point process modeling can help you study the interaction between different events and help you model the process intensity (the rate of event occurrence per unit area). For example, crime analysts might want to estimate where crimes are likely to occur in a city and whether they are associated with locations of public features such as bars and bus stops. Forestry researchers might want to estimate where trees grow best and test for association with covariates such as elevation and gradient. This paper describes the SPP procedure, new in SAS/STAT® 13.2, for exploring and modeling spatial point pattern data. It describes methods that PROC SPP implements for exploratory analysis of spatial point patterns and for log-linear intensity modeling that uses covariates. It also shows you how to use specialized functions for studying interactions between points and how to use specialized analytical graphics to diagnose log-linear models of spatial intensity. Crime analysis, forestry, and ecology examples demonstrate key features of PROC SPP.
Read the paper (PDF).
Pradeep Mohan, SAS
Randy Tobias, SAS
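A rough sketch of a PROC SPP log-linear intensity model with a single covariate, following the statement pattern described in the paper (PROCESS, TREND, MODEL); the study window, coordinates, and covariate are hypothetical, and the exact options should be checked against the SAS/STAT 13.2 documentation.

proc spp data=crimes;
   process incidents = (x, y / area=(0, 0, 10, 10));
   trend   bar_dist  = field(x, y, dist_to_bar);
   model   incidents = bar_dist;   /* log-linear intensity in the covariate */
run;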
Paper 3330-2015:
Analyzing and Visualizing the Sentiment of the Ebola Outbreak via Tweets
The Ebola virus outbreak is producing some of the most significant and fastest trending news throughout the globe today. There is a lot of buzz surrounding the deadly disease and the drastic consequences that it potentially poses to mankind. Social media provides the basic platforms for millions of people to discuss the issue and allows them to openly voice their opinions. There has been a significant increase in the magnitude of responses all over the world since the death of an Ebola patient in a Dallas, Texas hospital. In this paper, we aim to analyze the overall sentiment that is prevailing in the world of social media. For this, we extracted the live streaming data from Twitter at two different times using the Python scripting language. One instance relates to the period before the death of the patient, and the other relates to the period after the death. We used SAS® Text Miner nodes to parse, filter, and analyze the data and to get a feel for the patterns that exist in the tweets. We then used SAS® Sentiment Analysis Studio to further analyze and predict the sentiment of the Ebola outbreak in the United States. In our results, we found that the issue was not taken very seriously until the death of the Ebola patient in Dallas. After the death, we found that prominent personalities across the globe were talking about the disease and then raised funds to fight it. We are continuing to collect tweets. We analyze the locations of the tweets to produce a heat map that corresponds to the intensity of the varying sentiment across locations.
Read the paper (PDF).
Dheeraj Jami, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Shivkanth Lanka, Oklahoma State University
Paper 2123-2015:
Anatomy of a Merge Gone Wrong
The merge is one of the SAS® programmer's most commonly used tools. However, it can be fraught with pitfalls to the unwary user. In this paper, we look under the hood of the DATA step and examine how the program data vector works. We see what's really happening when data sets are merged and how to avoid subtle problems.
Read the paper (PDF).
Joshua Horstman, Nested Loop Consulting
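For illustration, a standard match-merge with IN= flags, which avoids one of the common pitfalls the paper examines; data set and variable names are hypothetical.

proc sort data=demog;  by subject_id; run;
proc sort data=visits; by subject_id; run;

data combined;
   merge demog(in=in_demog) visits(in=in_visits);
   by subject_id;
   if in_demog and in_visits;   /* keep only subjects present in both data sets */
run;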
Paper 3426-2015:
Application of Text Mining on Tweets to Collect Rational Information about Type-2 Diabetes
Twitter is a powerful form of social media for sharing information about various issues and can be used to raise awareness and collect pointers about associated risk factors and preventive measures. Type-2 diabetes is a national problem in the US. We analyzed Twitter feeds about Type-2 diabetes in order to suggest a rational use of social media with respect to an assertive study of any ailment. To accomplish this task, 900 tweets were collected using Twitter API v1.1 in a Python script. Tweets, follower counts, and user information were extracted via the scripts. The tweets were segregated into different groups on the basis of their annotations related to risk factors, complications, preventions and precautions, and so on. We then used SAS® Text Miner to analyze the data. We found that 70% of the tweets stated the status quo, based on marketing and awareness campaigns. The remaining 30% of tweets contained various key terms and labels associated with type-2 diabetes. It was observed that influential users tweeted more about precautionary measures, whereas non-influential people gave suggestions about treatments as well as preventions and precautions.
Read the paper (PDF).
Shubhi Choudhary, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Vijay Singh, Oklahoma State University
Paper 3449-2015:
Applied Analytics in Festival Tourism: A Case Study of Intention-to-Revisit Prediction in an Annual Local Food Festival in Thailand
Improving tourists' satisfaction and intention to revisit the festival is an ongoing area of interest to the tourism industry. Many organizers at the festival site strive very hard to attract and retain attendees by investing heavily in their marketing and promotion strategies for the festival. To meet this challenge, an advanced analytical model using a data mining approach is proposed to answer the following research question: What are the most important factors that influence tourists' intentions to revisit the festival site? Cluster analysis, neural networks, decision trees, stepwise regression, polynomial regression, and support vector machines are applied in this study. The main goal is to determine what it takes not only to retain loyal attendees, but also to attract and encourage new attendees to come back to the site.
Read the paper (PDF).
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Thanathorn Vajirakachorn, Dept. of Tourism Management, School of Business, University of the Thai Chamber of Commerce
Paper 3354-2015:
Applying Data-Driven Analytics to Relieve Suffering Associated with Natural Disasters
Managing the large-scale displacement of people and communities caused by a natural disaster has historically been reactive rather than proactive. Following a disaster, data is collected to inform and prompt operational responses. In many countries prone to frequent natural disasters, such as the Philippines, large amounts of longitudinal data are collected and available to apply to new disaster scenarios. However, because of the nature of natural disasters, it is difficult to analyze all of the data until long after the emergency has passed. For this reason, little research and analysis have been conducted to derive deeper analytical insight for proactive responses. This paper demonstrates the application of SAS® analytics to this data and establishes predictive alternatives that can improve conventional storm responses. Humanitarian organizations can use this data to understand displacement patterns and trends and to optimize evacuation routing and planning. Identifying the main contributing factors and leading indicators for the displacement of communities in a timely and efficient manner prevents detrimental incidents at disaster evacuation sites. Using quantitative and qualitative methods, responding organizations can make data-driven decisions that innovate and improve approaches to managing disaster response on a global basis. The benefits of creating a data-driven analytical model can help reduce response time, improve the health and safety of displaced individuals, and optimize scarce resources in a more effective manner. The International Organization for Migration (IOM), an intergovernmental organization, is one of the first-response organizations on the ground that responds to most emergencies. IOM is the global co-lead for the Camp Coordination and Camp Management (CCCM) cluster in natural disasters. This paper shows how to use SAS® Visual Analytics and SAS® Visual Statistics for the Philippines in response to Super Typhoon Haiyan in November 2013 to develop increasingly accurate models for better emergency preparedness. Using data collected from IOM's Displacement Tracking Matrix (DTM), the final analysis shows how to better coordinate service delivery to evacuation centers sheltering large numbers of displaced individuals, applying accurate hindsight to develop foresight on how to better respond to emergencies and disasters. Predictive models build on patterns found in historical and transactional data to identify risks and opportunities. The capacity to predict trends and behavior patterns related to displacement and mobility has the potential to enable the IOM to respond in a more timely and targeted manner. By predicting the locations of displacement, the numbers of persons displaced, the number of vulnerable groups, and the sites at most risk of security incidents, humanitarians can respond quickly and more effectively with the appropriate resources (material and human) from the outset. The end analysis uses the SAS® Storm Optimization model combined with human mobility algorithms to predict population movement.
Lorelle Yuen, International Organization for Migration
Kathy Ball, Devon Energy
Paper 2220-2015:
Are You a Control Freak? Control Your SAS® Programs - Don't Let Them Control You!
You know that you want to control the process flow of your program. When your program is executed multiple times, with slight variations, you will need to control the changes from iteration to iteration, the timing of the execution, and the maintenance of output and logs. Unfortunately, in order to achieve the control that you know that you need to have, you will need to make frequent, possibly time-consuming and potentially error-prone, manual corrections and edits to your program. Fortunately, the control you seek is available and it does not require the use of time-intensive manual techniques. List processing techniques are available that give you control and peace of mind and enable you to be a successful control freak. These techniques are not new, but there is often hesitancy on the part of some programmers to take full advantage of them. This paper reviews these techniques and demonstrates them through a series of examples.
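As an illustrative sketch of the kind of list processing the authors describe (not their code), the following assumes a hypothetical control data set WORK.PARAMS with one row per COUNTRY to process; CALL EXECUTE then generates one PROC MEANS step per row, so adding an iteration means adding a row rather than editing the program:

    data _null_;
       set work.params;                                   /* hypothetical control table */
       call execute(cats(
          'proc means data=sales(where=(country="', country, '")) n mean sum;',
          ' title "Sales summary for ', country, '";',
          ' run;'));
    run;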
Read the paper (PDF).
Mary Rosenbloom, Edwards Lifesciences, LLC
Art Carpenter, California Occidental Consultants
Paper SAS1803-2015:
Ask Vince: Moving SAS® Data and Analytical Results to Microsoft Excel
This presentation is an open-ended discussion about techniques for transferring data and analytical results from SAS® to Microsoft Excel. There are some introductory comments, but this presentation does not have any set content. Instead, the topics discussed are dictated by attendee questions. Come prepared to ask and get answers to your questions. To submit your questions or suggestions for discussion in advance, go to http://support.sas.com/surveys/askvince.html.
Vince DelGobbo, SAS
Paper 3401-2015:
Assessing the Impact of Communication Channel on Behavior Changes in Energy Efficiency
With the increase in government and commissions incentivizing electric utilities to get consumers to save energy, there has been a large increase in the number of energy saving programs. Some are structural, incentivizing consumers to make improvements to their home that result in energy savings. Some, called behavioral programs, are designed to get consumers to change their behavior to save energy. Within behavioral programs, Home Energy Reports are a good method to achieve behavioral savings as well as to educate consumers on structural energy savings. This paper examines the different Home Energy Report communication channels (direct mail and e-mail) and the marketing channel effect on energy savings, using SAS® for linear models. For consumer behavioral change, we often hear the questions: 1) Are the people that responded via direct mail solicitation saving at a higher rate than people who responded via an e-mail solicitation? 1a) Hypothesis: Because e-mail is easy to respond to, the type of customers that enroll through this channel will exert less effort for the behavior changes that require more time and investment toward energy efficiency changes and thus will save less. 2) Does the mode of that ongoing dialog (mail versus e-mail) impact the amount of consumer savings? 2a) Hypothesis: E-mail is more likely to be ignored and thus these recipients will save less. As savings is most often calculated by comparing the treatment group to a control group (to account for weather and economic impact over time), and by definition you cannot have a dialog with a control group, the answers are not a simple PROC FREQ away. Also, people who responded to mail look very different demographically than people who responded to e-mail. So, is the driver of savings differences the channel, or is it the demographics of the customers that happen to use those chosen channels? This study used clustering (PROC FASTCLUS) to segment the consumers by mail versus e-mail and append cluster assignments to the respective control group. This study also used DID (Difference-in-Differences) as well as Billing Analysis (PROC GLM) to calculate the savings of these groups.
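A minimal sketch of the two analysis steps named above, with made-up data set and variable names (CUSTOMERS, BILLING, USAGE, TREAT, POST, and the demographic inputs are assumptions, not the authors' data):

    /* Segment respondents on demographic inputs */
    proc fastclus data=customers maxclusters=4 out=segments;
       var income home_age household_size;
    run;

    /* Difference-in-differences billing analysis within a segment */
    proc glm data=billing;
       class treat post;
       model usage = treat post treat*post;   /* TREAT*POST carries the savings estimate */
    run;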
Read the paper (PDF).
Angela Wells, Direct Options
Ashlie Ossege, Direct Options
Paper 3327-2015:
Automated Macros to Extract Data from the National (Nationwide) Inpatient Sample (NIS)
The use of administrative databases for understanding practice patterns in the real world has become increasingly apparent. This is essential in the current health-care environment. The Affordable Care Act has helped us to better understand the current use of technology and different approaches to surgery. This paper describes a method for extracting specific information about surgical procedures from the Healthcare Cost and Utilization Project (HCUP) database (also referred to as the National (Nationwide) Inpatient Sample (NIS)). The analyses provide a framework for comparing the different modalities of surgical procedures of interest. Using an NIS database for a single year, we want to identify cohorts based on surgical approach. We do this by identifying the ICD-9 codes specific to robotic surgery, laparoscopic surgery, and open surgery. After we identify the appropriate codes using an ARRAY statement, a similar array is created based on the ICD-9 codes. Any minimally invasive procedure (robotic or laparoscopic) that results in a conversion is flagged as a conversion. Comorbidities are identified by ICD-9 codes representing the severity of each subject and merged with the NIS inpatient core file. Using a FORMAT statement for all diagnosis variables, we create macros that can be regenerated for each type of complication. These macros are compiled in SAS® and stored in a library; four macros are then called to build the tables, each invoked with different macro variables, generating the frequencies for all cohorts and creating the table structure with the title and number of each table. This paper describes a systematic method in SAS/STAT® 9.2 to extract the data from NIS using the ARRAY statement for the specific ICD-9 codes, to format the extracted data for the analysis, to merge the different NIS databases by procedures, and to use automatic macros to generate the report.
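A hedged sketch of the ARRAY-based flagging step described above; the procedure-code variable names follow the usual NIS layout (PR1-PR15), but the ICD-9 codes shown are placeholders, not the authors' actual code lists:

    data cohorts;
       set nis.core;
       array pr {*} pr1-pr15;                               /* NIS procedure code variables  */
       robotic = 0; laparoscopic = 0;
       do i = 1 to dim(pr);
          if pr{i} in ('1742','1744') then robotic = 1;        /* placeholder robotic codes  */
          if pr{i} in ('5421','5451') then laparoscopic = 1;   /* placeholder lap codes      */
       end;
       drop i;
    run;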
Read the paper (PDF).
Ravi Tejeshwar Reddy Gaddameedi, California State University, East Bay
Usha Kreaden, Intuitive Surgical
Paper 2060-2015:
Automating Your Metadata Reporting
There is a goldmine of information that is available to you in SAS® metadata. The challenge, however, is being able to retrieve and leverage that information. While there is useful functionality available in SAS® Management Console as well as a collection of functional macros provided by SAS to help accomplish this, getting a complete metadata picture in an automated way has proven difficult. This paper discusses the methods we have used to find core information within SAS® 9.2 metadata and how we have been able to pull this information in a programmatic way. We used Base SAS®, SAS® Data Integration Studio, PC SAS®, and SAS® XML Mapper to build a solution that now provides daily metadata reporting about our SAS Data Integration Studio jobs, job flows, tables, and so on. This information can now be used for auditing purposes as well as for helping us build our full metadata inventory as we prepare to migrate to SAS® 9.4.
Read the paper (PDF).
Rupinder Dhillon, Bell Canada
Darryl Prebble, Prebble Consulting Inc.
Paper SAS1887-2015:
Automating a SAS® 9.4 Installation without Using Provisioning Software: A Case Study Involving the Setup of Machines for SAS Regional Users Groups
Whether you manage computer systems in a small-to-medium environment (for example, in labs, workshops, or corporate training groups) or in a large-scale deployment, the ability to automate SAS® 9.4 installations is important to the efficiency and success of your software deployments. For large-scale deployments, you can automate the installation process by using third-party provisioning software such as Microsoft System Center Configuration Manager (SCCM) or Symantec Altiris. But what if you have a small-to-medium environment and you do not have provisioning software to package deployment jobs? No worries! There is a solution. This paper presents a case study of just such a situation where a process was developed for SAS regional users groups (RUGs). Along with the case study, the paper offers a process for automating SAS 9.4 installations in workshop, lab, and corporate training (small-to-medium sized) environments. This process incorporates the new -srwonly option with the SAS® Deployment Wizard, deployment-wizard commands that use response files, and batch-file implementation. This combination results in easy automation of an installation, even without provisioning software.
Read the paper (PDF).
Max Blake, SAS
Paper 3050-2015:
Automation of Statistics Summary for Census Data in SAS®
Census data, such as education and income, have been used extensively for various purposes. The data are usually collected as percentages at the census unit level, based on the population sample. Presenting the data this way makes it hard to interpret and compare. A more convenient way of presenting the data is to use the geocoded percentages to produce counts for a pseudo-population. We developed a very flexible SAS® macro to automatically generate descriptive summary tables for the census data as well as to conduct statistical tests to compare the different levels of the variable by groups. The SAS macro is not only useful for census data but can be used to generate summary tables for any data with percentages in multiple categories.
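As a simple sketch of the percentage-to-count conversion described above (data set and variable names are hypothetical):

    data pseudo_counts;
       set census_pct;
       /* convert a tract-level percentage to a pseudo-population count */
       n_college = round(pct_college / 100 * tract_population);
    run;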
Read the paper (PDF).
Janet Lee, Kaiser Permanente Southern California
B
Paper SAS1788-2015:
BI-on-BI for SAS® Visual Analytics
SAS® Visual Analytics is deployed by many customers. IT departments are tasked with efficiently managing the server resources, achieving maximum usage of resources, optimizing availability, and managing costs. Business users expect the system to be available when needed and to perform to their expectations. Business executives who sponsor business intelligence (BI) and analytical projects like to see that their decision to support and finance the project meets business requirements. Business executives also like to know how different people in the organization are using SAS Visual Analytics. With the release of SAS Visual Analytics 7.1, new functionality is added to support the memory management of the SAS® LASR™ Analytic Server. Also, new out-of-the-box usage and audit reporting is introduced. This paper covers BI-on-BI for SAS Visual Analytics, along with the new functionality introduced for SAS Visual Analytics administration and questions about resource management, data compression, and out-of-the-box usage reporting. Key product capabilities are demonstrated.
Read the paper (PDF).
Murali Nori, SAS
Paper 3466-2015:
Bare-bones SAS® Hash Objects
We have to pull data from several data files in creating our working databases. The simplest use of SAS® hash objects greatly reduces the time required to draw data from many sources when compared to the use of multiple proc sorts and merges.
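A minimal sketch of the single-pass lookup the abstract alludes to, assuming a hypothetical WORK.DIAGNOSES lookup table keyed by PATIENT_ID and a larger WORK.VISITS driver file; the hash object replaces a sort-and-merge of both files:

    data combined;
       if _n_ = 1 then do;
          if 0 then set work.diagnoses;            /* defines DX_CODE and DX_DATE in the PDV */
          declare hash dx(dataset: 'work.diagnoses');
          dx.defineKey('patient_id');
          dx.defineData('dx_code', 'dx_date');
          dx.defineDone();
       end;
       set work.visits;
       if dx.find() = 0;                           /* keep only visits with a matching diagnosis */
    run;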
Read the paper (PDF). | Download the data file (ZIP).
Andrew Dagis, City of Hope
Paper 3120-2015:
"BatchStats": SAS® Batch Statistics, A Click Away!
Over the years, the SAS® Business Intelligence platform has proved its importance in this big data world with its suite of applications that enable us to efficiently process, analyze, and transform huge amounts of business data. Within the data warehouse universe, batch execution sits at the heart of SAS Data Integration technologies. On a day-to-day basis, batches run, and the current status of the batch is generally sent out to the team or to the client as a static e-mail or report. From experience, we know that these reports don't provide much insight into the real 'bits and bytes' of a batch run. Imagine if the status of the running batch were automatically captured in one central repository and presented in a web browser on your computer or on your iPad. All of this can be achieved without asking anybody to send reports, and all post-batch queries can be answered automatically with a click. This paper presents a framework designed specifically to automate the reporting aspects of SAS batches; because it is all about collecting statistics on the batch, we call it 'BatchStats.'
Prajwal Shetty, Tesco HSC
Paper 3643-2015:
Before and After Models in Observational Research Using Random Slopes and Intercepts
In observational data analyses, it is often helpful to use patients as their own controls by comparing their outcomes before and after some signal event, such as the initiation of a new therapy. It might be useful to have a control group that does not have the event but that is instead evaluated before and after some arbitrary point in time, such as their birthday. In this context, the change over time is a continuous outcome that can be modeled as a (possibly discontinuous) line, with the same or different slope before and after the event. Mixed models can be used to estimate random slopes and intercepts and compare patients between groups. A specific example published in a peer-reviewed journal is presented.
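A hedged sketch of a piecewise (before/after) mixed model of the sort described, with assumed variable names: TIME is centered at the signal event, POST indicates the period after it, and GROUP distinguishes cases from controls:

    proc mixed data=outcomes;
       class group patient_id;
       model y = group time post time*post group*time group*post group*time*post / solution;
       random intercept time / subject=patient_id type=un;   /* random intercepts and slopes */
    run;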
Read the paper (PDF).
David Pasta, ICON Clinical Research
Paper 3022-2015:
Benefits from the Conversion of Medicaid Encounter Data Reporting System to SAS® Enterprise Guide®
Kaiser Permanente Northwest is contractually obligated for regulatory submissions to Oregon Health Authority, Health Share of Oregon, and Molina Healthcare in Washington. The submissions consist of Medicaid Encounter data for medical and pharmacy claims. SAS® programs are used to extract claims data from Kaiser's claims data warehouse, process the data, and produce output files in HIPAA ASC X12 and NCPDP format. Prior to April 2014, programs were written in SAS® 8.2 running on a VAX server. Several key drivers resulted in the conversion of the existing system to SAS® Enterprise Guide® 5.1 running on UNIX. These drivers were: the need to have a scalable system in preparation for the Affordable Care Act (ACA); performance issues with the existing system; incomplete process reporting and notification to business owners; and a highly manual, labor-intensive process of running individual programs. The upgraded system addressed these drivers. The estimated cost reduction was from $1.30 per reported encounter to $0.13 per encounter. The converted system provides for better preparedness for the ACA. One expected result of ACA is significant Medicaid membership growth. The program has already increased in size by 50% in the preceding 12 months. The updated system allows for the expected growth in membership.
Read the paper (PDF).
Eric Sather, Kaiser Permanente
Paper 2501-2015:
Best Practice for Creation and Maintenance of a SAS® Infrastructure
As the SAS® platform becomes increasingly metadata-driven, it becomes increasingly important to get the structures and controls surrounding the metadata repository correct. This presentation aims to point out some of the considerations and potential pitfalls of working with the metadata infrastructure. It also suggests some solutions that have been used with the aim of making this process as simple as possible.
Read the paper (PDF).
Paul Thomas, ASUP Ltd
Paper SAS1501-2015:
Best Practices for Configuring Your I/O Subsystem for SAS®9 Applications
The power of SAS®9 applications allows information and knowledge creation from very large amounts of data. Analysis that used to consist of 10s-100s of gigabytes (GBs) of supporting data has rapidly grown into the 10s to 100s of terabytes (TBs). This data expansion has resulted in more and larger SAS data stores. Setting up file systems to support these large volumes of data with adequate performance, as well as ensuring adequate storage space for the SAS® temporary files, can be very challenging. Technology advancements in storage and system virtualization, flash storage, and hybrid storage management require continual updating of best practices to configure I/O subsystems. This paper presents updated best practices for configuring the I/O subsystem for your SAS®9 applications, ensuring adequate capacity, bandwidth, and performance for your SAS®9 workloads. We have found that very few storage systems work ideally with SAS with their out-of-the-box settings, so it is important to convey these general guidelines.
Read the paper (PDF).
Tony Brown, SAS
Margaret Crevar, SAS
Paper SAS1801-2015:
Best Practices for Upgrading from SAS® 9.1.3 to SAS® 9.4
We regularly speak with organizations running established SAS® 9.1.3 systems that have not yet upgraded to a later version of SAS®. Often this is because their current SAS 9.1.3 environment is working fine, and no compelling event to upgrade has materialized. Now that SAS 9.1.3 has moved to a lower level of support and some very exciting technologies (Hadoop, cloud, ever-better scalability) are more accessible than ever using SAS® 9.4, the case for migrating from SAS 9.1.3 is strong. Upgrading a large SAS ecosystem with multiple environments, an active development stream, and a busy production environment can seem daunting. This paper aims to demystify the process, suggesting outline migration approaches for a variety of the most common scenarios in SAS 9.1.3 to SAS 9.4 upgrades, and a scalable template project plan that has been proven at a range of organizations.
Read the paper (PDF).
David Stern, SAS
Paper 3162-2015:
Best Practices: Subset without Getting Upset
You've worked for weeks or even months to produce an analysis suite for a project. Then, at the last moment, someone wants a subgroup analysis, and they inform you that they need it yesterday. This should be easy to do, right? So often, the programs that we write fall apart when we use them on subsets of the original data. This paper takes a look at some of the best practice techniques that can be built into a program at the beginning, so that users can subset on the fly without losing categories or creating errors in statistical tests. We review techniques for creating tables and corresponding titles with BY-group processing so that minimal code needs to be modified when more groups are created. And we provide a link to sample code and sample data that can be used to get started with this process.
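One of the practices the paper describes can be sketched as follows: sort by the subgroup variable and let #BYVAL fill the title, so new subgroups flow through without code changes (data set and variable names are illustrative):

    proc sort data=adsl out=adsl_srt;
       by subgroup;
    run;

    options nobyline;
    title "Demographics for subgroup: #BYVAL(subgroup)";
    proc freq data=adsl_srt;
       by subgroup;
       tables sex race / missing;
    run;
    title;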
Read the paper (PDF).
Mary Rosenbloom, Edwards Lifesciences, LLC
Kirk Paul Lafler, Software Intelligence Corporation
Paper 3458-2015:
Better Metadata Through SAS®: %SYSFUNC, PROC DATASETS, and Dictionary Tables
SAS® provides a wealth of resources for creating useful, attractive metadata tables, including PROC CONTENTS listing output (to ODS destinations), the PROC CONTENTS OUT= SAS data set, and PROC CONTENTS ODS Output Objects. This paper and presentation explores some less well-known resources to create metadata such as %SYSFUNC, PROC DATASETS, and Dictionary Tables. All these options in conjunction with the use of the ExcelXP tagset (and, new in the second maintenance release for SAS® 9.4, the Excel tagset) enable the creation of multi-tab metadata workbooks at the click of a mouse.
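As a brief sketch of the dictionary-table approach combined with the ExcelXP tagset (the library and file names are assumptions):

    proc sql;
       create table meta as
       select memname, name, varnum, type, length, format, label
       from dictionary.columns
       where libname = 'MYLIB'
       order by memname, varnum;
    quit;

    ods tagsets.excelxp file='metadata.xml' options(sheet_interval='bygroup');
    proc print data=meta noobs;
       by memname;                 /* one worksheet per data set */
    run;
    ods tagsets.excelxp close;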
Read the paper (PDF).
Louise Hadden, Abt Associates Inc.
Paper SAS4648-2015:
Big Data: Have You Seen It? Using Graph Theory to Improve Your Analytics: Hypergroup Action
Your data analysis projects can leverage the new HYPERGROUP action to mine relationships using Graph Theory. Discover which data entities are related and, conversely, which sets of values are disjoint. In cases when the sets of values are not completely disjoint, HYPERGROUP can identify data that is strongly connected as well as neighboring data that is weakly connected or a greater distance away. Each record is assigned a hypergroup number, and within hypergroups a color, a community, or both. The GROUPBY facility, WHERE clauses, or both can act on hypergroup number, color, or community to conduct analytics using data that is close, related, or more relevant. The algorithms used to determine hypergroups are based on Graph Theory. We show how the results of HYPERGROUP allow equivalent graphs to be displayed and useful information to be seen, and how they aid you in controlling what data is required to perform your analytics. Crucial data structure can be unearthed and seen.
Read the paper (PDF).
Yue Qi, SAS
Trevor Kearney, SAS
Paper SPON4000-2015:
Bringing Order to the Wild World of Big Data and Analytics
To bring order to the wild world of big data, EMC and its partners have joined forces to meet customer challenges and deliver a modern analytic architecture. This unified approach encompasses big data management, analytics discovery, and deployment via end-to-end solutions that solve your big data problems. They are also designed to free up more time for innovation, deliver faster deployments, and help you find new insights from secure and properly managed data. The EMC Business Data Lake is a fully engineered, enterprise-grade data lake built on a foundation of core data technologies. It provides pre-configured building blocks that enable self-service, end-to-end integration, management, and provisioning of the entire big data environment. Major benefits include the ability to make more timely and informed business decisions and realize the vision of analytics in weeks instead of months. SAS enhances the Federation Business Data Lake by providing superior breadth and depth of analytics to tackle any big data analytics problem an organization might have, whether it's fraud detection, risk management, customer intelligence, predictive asset maintenance, or something else. SAS and EMC work together to deliver a robust and comprehensive big data solution that reduces risk, automates provisioning and configuration, and is purpose-built for big data analytics workloads.
Casey James, EMC
Paper SAS4645-2015:
Build Super Models Quickly with SAS® Factory Miner
Learn how a new product from SAS enables you to easily build and compare multiple candidate models for all your business segments.
Steve Sparano, SAS
Paper 3410-2015:
Building Credit Modeling Dashboards
Dashboards are an effective tool for analyzing and summarizing the large volumes of data required to manage loan portfolios. Effective dashboards must highlight the most critical drivers of risk and performance within the portfolios and must be easy to use and implement. Developing dashboards often requires integrating data, analysis, or tools from different software platforms into a single, easy-to-use environment. FI Consulting has developed a Credit Modeling Dashboard in Microsoft Access that integrates complex models based on SAS into an easy-to-use, point-and-click interface. The dashboard integrates, prepares, and executes back-end models based on SAS using command-line programming in Microsoft Access with Visual Basic for Applications (VBA). The Credit Modeling Dashboard developed by FI Consulting represents a simple and effective way to supply critical business intelligence in an integrated, easy-to-use platform without requiring investment in new software or rebuilding existing SAS tools already in use.
Read the paper (PDF).
Jeremy D'Antoni, FI Consulting
Paper 1384-2015:
Building Flexible Jobs in DataFlux®
Creating DataFlux® jobs that can be executed from job scheduling software can be challenging. This presentation guides participants through the creation of a template job that accepts an email distribution list, subject, email verbiage, and attachment file name as macro variables. It also demonstrates how to call this job from a command line.
Read the paper (PDF).
Jeanne Estridge, Sinclair Community College
Paper 2988-2015:
Building a Template from the Ground Up with Graph Template Language
This presentation focuses on building a graph template in an easy-to-follow, step-by-step manner. The presentation begins with using Graph Template Language to re-create a simple series plot, and then moves on to include a secondary y-axis as well as multiple overlaid block plots to tell a more complex and complete story than would be possible using only the SGPLOT procedure.
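A compact sketch of the kind of template the presentation builds up (the data set MONTHLY and its variables are hypothetical), showing a series plot with a secondary y-axis:

    proc template;
       define statgraph series_2axis;
          begingraph;
             entrytitle 'Sales and Margin by Month';
             layout overlay / yaxisopts=(label='Sales')
                              y2axisopts=(label='Margin');
                seriesplot x=month y=sales;
                seriesplot x=month y=margin / yaxis=y2 lineattrs=(pattern=dash);
             endlayout;
          endgraph;
       end;
    run;

    proc sgrender data=monthly template=series_2axis;
    run;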
Read the paper (PDF).
Jed Teres, Verizon Wireless
Paper 4240-2015:
Building the Right Team
What do going to the South Pole at the beginning of the 20th century, winning the 1980 gold medal in Olympic hockey, and delivering a successful project have in common? The answer: Good teams succeed when groups of highly talented individuals often do not. So, what is your Everest and how do you gather the right group to successfully scale it? Success often hinges on not just building a team, but on assembling the right team. Join Scott Sanders, a business and IT veteran who has effectively built, managed, and been part of successful teams throughout his 27-year career. Hear some of his best practices for how to put together a good team and keep them focused, engaged, and motivated to deliver a project.
Scott Sanders, Sears Holdings
Paper SAS1824-2015:
Bust Open That ETL Black Box and Apply Proven Techniques to Successfully Modernize Data Integration
So you are still writing SAS® DATA steps and SAS macros and running them through a command-line scheduler. When work comes in, there is only one person who knows that code, and they are out--what to do? This paper shows how SAS applies extract, transform, load (ETL) modernization techniques with SAS® Data Integration Studio to gain resource efficiencies and to break down the ETL black box. We are going to share the fundamentals (metadata foldering and naming standards) that ensure success, along with steps to ease into the pool while iteratively gaining benefits. Benefits include self-documenting code visualization, impact analysis on jobs and tables impacted by change, and being supportable by interchangeable bench resources. We conclude with demonstrating how SAS® Visual Analytics is being used to monitor service-level agreements and provide actionable insights into job-flow performance and scheduling.
Read the paper (PDF).
Brandon Kirk, SAS
C
Paper 3224-2015:
CHARACTER DATA: Acquisition, Manipulation, and Analysis
The DATA step enables you to read, write, and manipulate many types of data. As data evolves to a more free-form state, the ability of SAS® to handle character data becomes increasingly important. This presentation, expanded and enhanced from an earlier version, addresses character data from multiple vantage points. For example, what is the default length of a character string, and why does it appear to change under different circumstances? Special emphasis is given to the myriad functions that can facilitate the processing and manipulation of character data. This paper is targeted at a beginning to intermediate audience.
Read the paper (PDF).
Andrew Kuligowski, HSN
Swati Agarwal, OptumInsight
Paper 3210-2015:
"Can You Read This into SAS® for Me?" Using INFILE and INPUT to Load Data into SAS
With all the talk of 'big data' and 'visual analytics' we sometimes forget how important it is, and often how hard it is, to get external data into SAS®. In this paper, we review some common data sources such as delimited sources (for example, CSV), as well as structured flat files, and the programming steps needed to successfully load these files into SAS. In addition to examining the INFILE and INPUT statements, we look at some methods for dealing with bad data. This paper assumes only basic SAS skills, although the topic can be of interest to anyone who needs to read external files.
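A small sketch of the pattern discussed, reading a hypothetical comma-delimited file and trapping bad date values rather than letting them pass silently:

    data orders;
       infile 'orders.csv' dsd firstobs=2 truncover;   /* DSD honors commas and quoted fields */
       length customer $20 order_dt_c $10;
       input order_id customer $ order_dt_c $ amount;
       order_dt = input(order_dt_c, ?? mmddyy10.);     /* ?? suppresses notes for bad dates   */
       if missing(order_dt) then
          putlog 'WARNING: unreadable date at record ' _n_= _infile_;
       format order_dt date9.;
       drop order_dt_c;
    run;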
Read the paper (PDF).
Peter Eberhardt, Fernwood Consulting Group Inc.
Audrey Yeo, Athlene
Paper 3148-2015:
Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials
The success of an experimental study almost always hinges on how you design it. Does it provide estimates for everything you're interested in? Does it take all the experimental constraints into account? Does it make efficient use of limited resources? The OPTEX procedure in SAS/QC® software enables you to focus on specifying your interests and constraints, and it takes responsibility for handling them efficiently. With PROC OPTEX, you skip the step of rifling through tables of standard designs to try to find the one that's right for you. You concentrate on the science and the analytics and let SAS® do the computing. This paper reviews the features of PROC OPTEX and shows them in action using examples from field trials and food science experimentation. PROC OPTEX is a useful tool for all these situations, doing the designing and freeing the scientist to think about the food and the biology.
Read the paper (PDF). | Download the data file (ZIP).
Cliff Pereira, Dept of Statistics, Oregon State University
Randy Tobias, SAS
Paper 1329-2015:
Causal Analytics: Testing, Targeting, and Tweaking to Improve Outcomes
This session is an introduction to predictive analytics and causal analytics in the context of improving outcomes. The session covers the following topics: 1) Basic predictive analytics vs. causal analytics; 2) The causal analytics framework; 3) Testing whether the outcomes improve because of an intervention; 4) Targeting the cases that have the best improvement in outcomes because of an intervention; and 5) Tweaking an intervention in a way that improves outcomes further.
Read the paper (PDF).
Jason Pieratt, Humana
Paper 2380-2015:
Chi-Square and T Tests Using SAS®: Performance and Interpretation
Data analysis begins with cleaning up data, calculating descriptive statistics, and examining variable distributions. Before more rigorous statistical analysis begins, many statisticians perform basic inferential statistical tests such as chi-square and t tests to assess unadjusted associations. These tests help guide the direction of the more rigorous statistical analysis. We show how to perform chi-square and t tests, explain how to interpret the output, and describe where to look for the association or difference based on the hypothesis being tested. We propose the next steps for further analysis using example data.
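For reference, the two tests can be requested as follows (the data set and variable names are illustrative only):

    /* Chi-square test of association between two categorical variables */
    proc freq data=study;
       tables exposure*outcome / chisq expected;
    run;

    /* Two-sample t test comparing a continuous measure across two groups */
    proc ttest data=study;
       class exposure;
       var sbp;
    run;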
Read the paper (PDF).
Maribeth Johnson, Georgia Regents University
Jennifer Waller, Georgia Regents University
Paper SAS1913-2015:
Clustering Techniques to Uncover Relative Pricing Opportunities: Relative Pricing Corridors Using SAS® Enterprise Miner and SAS® Visual Analytics
The impact of price on brand sales is not always linear or independent of other brand prices. We demonstrate, using sales information and SAS® Enterprise Miner, how to uncover relative price bands where prices might be increased without losing market share or decreased slightly to gain share.
Read the paper (PDF).
Ryan Carr, SAS
Charles Park, Lenovo
Paper 3291-2015:
Coding Your Own MCMC Algorithm
In Bayesian statistics, Markov chain Monte Carlo (MCMC) algorithms are an essential tool for sampling from probability distributions. PROC MCMC is useful for these algorithms. However, it is often desirable to code an algorithm from scratch. This is especially true in academia, where students are expected to be able to understand and code an MCMC algorithm. The ability of SAS® to accomplish this is relatively unknown yet quite straightforward. We use SAS/IML® to demonstrate methods for coding an MCMC algorithm, with examples of a Gibbs sampler and a Metropolis-Hastings random walk.
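A bare-bones sketch of a Metropolis-Hastings random walk coded in SAS/IML®, here targeting a standard normal density purely for illustration (this is not the authors' example):

    proc iml;
       call randseed(2015);
       nIter = 10000;
       x = j(nIter, 1, 0);
       logpost = -0.5 * x[1]##2;                       /* log target density, up to a constant */
       do i = 2 to nIter;
          prop = x[i-1] + 0.5*rand('normal');          /* random-walk proposal                 */
          logprop = -0.5 * prop##2;
          if log(rand('uniform')) < logprop - logpost then do;
             x[i] = prop;  logpost = logprop;          /* accept                               */
          end;
          else x[i] = x[i-1];                          /* reject: keep the current value       */
       end;
       print (x[:])[label='Posterior mean'];
    quit;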
Read the paper (PDF).
Chelsea Lofland, University of California Santa Cruz
Paper 3367-2015:
Colon(:)izing My Programs
There is a plethora of uses of the colon (:) in SAS® programming. The colon is used as a data set or variable name wildcard, a macro variable creator, an operator modifier, and so forth. The colon helps you write clear, concise, and compact code. The main objective of this paper is to encourage the effective use of the colon in writing crisp code. This paper presents real-world applications of the colon in day-to-day programming. In addition, this paper presents cases where the colon falls short of what programmers might wish for.
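Three of the everyday uses mentioned above, sketched with illustrative data set and variable names:

    /* Variable-name wildcard: all variables whose names begin with REV */
    proc means data=sales sum;
       var rev: ;
    run;

    /* Operator modifier: =: compares only the leading characters */
    data mac_names;
       set customers;
       if upcase(surname) =: 'MAC';
    run;

    /* Macro-variable creation with INTO : in PROC SQL */
    proc sql noprint;
       select distinct region into :regionlist separated by ' '
       from sales;
    quit;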
Read the paper (PDF).
Jinson Erinjeri, EMMES Corporation
Paper 1424-2015:
Competing Risk Survival Analysis Using SAS®: When, Why, and How
Competing risks arise in time-to-event data when the event of interest cannot be observed because a competing event occurs first. For example, if the event of interest is death from a specific cause, death from any other cause is a competing event; if the focus is on relapse, death before relapse constitutes a competing event. It is well documented that, in the presence of competing risks, the standard product-limit methods yield biased results because their basic assumptions are violated. The effect of competing events on parameter estimation depends on their distribution and frequency. Fine and Gray's subdistribution hazard model can be used in the presence of competing events and is available in PROC PHREG with the release of version 9.4 of SAS® software.
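A hedged sketch of the Fine and Gray model as fit with the EVENTCODE= option in SAS 9.4 (the data set, variable names, and status codes are assumptions):

    /* STATUS: 0 = censored, 1 = relapse (event of interest), 2 = death before relapse */
    proc phreg data=cohort;
       class trt / param=ref;
       model time*status(0) = trt age / eventcode=1;   /* subdistribution hazard for event 1 */
    run;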
Read the paper (PDF).
Lovedeep Gondara, University of Illinois Springfield
Paper 1442-2015:
Confirmatory Factor Analysis Using PROC CALIS: A Practical Guide for Survey Researchers
Survey research can provide a straightforward and effective means of collecting input on a range of topics. Survey researchers often like to group similar survey items into construct domains in order to make generalizations about a particular area of interest. Confirmatory Factor Analysis is used to test whether this pre-existing theoretical model underlies a particular set of responses to survey questions. Based on Structural Equation Modeling (SEM), Confirmatory Factor Analysis provides the survey researcher with a means to evaluate how well the actual survey response data fits within the a priori model specified by subject matter experts. PROC CALIS now provides survey researchers the ability to perform Confirmatory Factor Analysis using SAS®. This paper provides a survey researcher with the steps needed to complete Confirmatory Factor Analysis using SAS. We discuss and demonstrate the options available to survey researchers in the handling of missing and not applicable survey responses using an ARRAY statement within a DATA step and imputation of item non-response. A simple demonstration of PROC CALIS is then provided with interpretation of key portions of the SAS output. Using recommendations provided by SAS from the PROC CALIS output, the analysis is then modified to provide a better fit of survey items into survey domains.
Read the paper (PDF).
Lindsey Brown Philpot, Baylor Scott & White Health
Sunni Barnes, Baylor Scott & White Health
Crystal Carel, Baylor Scott & White Health Care System
Paper 2686-2015:
Converting Annotate to ODS Graphics. Is It Possible?
In previous papers I have described how many standard SAS/GRAPH® plots can be converted easily to ODS Graphics by using simple PROC SGPLOT or SGPANEL code. SAS/GRAPH Annotate code would appear, at first sight, to be much more difficult to convert to ODS Graphics, but by using its layering features, many Annotate plots can be replicated in a more flexible and repeatable way. This paper explains how to convert many of your Annotate plots, so they can be reproduced using Base SAS®.
Read the paper (PDF).
Philip Holland, Holland Numerics Limited
Paper SAS1754-2015:
Count Series Forecasting
Many organizations need to forecast large numbers of time series that are discretely valued. These series, called count series, fall approximately between continuously valued time series, for which there are many forecasting techniques (ARIMA, UCM, ESM, and others), and intermittent time series, for which there are a few forecasting techniques (Croston's method and others). This paper proposes a technique for large-scale automatic count series forecasting and uses SAS® Forecast Server and SAS/ETS® software to demonstrate this technique.
Read the paper (PDF).
Michael Leonard, SAS
Paper SAS1700-2015:
Creating Multi-Sheet Microsoft Excel Workbooks with SAS®: The Basics and Beyond, Part 2
This presentation explains how to use Base SAS®9 software to create multi-sheet Excel workbooks. You learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS® output using the ExcelXP Output Delivery System (ODS) tagset. The techniques can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on-demand and in real time using SAS server technology is discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.
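The core pattern can be sketched as follows, using SASHELP.CLASS so the example is self-contained; each BY group lands on its own worksheet:

    proc sort data=sashelp.class out=class;
       by sex;
    run;

    ods tagsets.excelxp file='class_report.xml' style=printer
        options(sheet_interval='bygroup' suppress_bylines='yes');
    proc print data=class noobs;
       by sex;
    run;
    ods tagsets.excelxp close;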
Read the paper (PDF). | Download the data file (ZIP).
Vince DelGobbo, SAS
Paper SAS1854-2015:
Creating Reports in SAS® Visual Analytics Designer That Dynamically Substitute Graph Roles on the Fly Using Parameterized Expressions
With the expansive new features in SAS® Visual Analytics 7.1, you can now take control of the graph data while viewing a report. Using parameterized expressions, calculated items, custom categories, and prompt controls, you can now change the measures or categories on a graph from a mobile device or web viewer. View your data from different perspectives while using the same graph. This paper demonstrates how you can use these features in SAS® Visual Analytics Designer to create reports in which graph roles can be dynamically changed with the click of a button.
Read the paper (PDF).
Kenny Lui, SAS
Paper 2242-2015:
Creative Uses of Vector Plots Using SAS®
Hysteresis loops occur because of a time lag between input and output. In the pharmaceutical industry, hysteresis plots are tools to visualize the time lag between drug concentration and drug effects. Before SAS® 9.2, SAS annotations were used to generate such plots. One of the criticisms was that SAS programmers had to write complex macros to automate the process; code management and validation tasks were not easy. With SAS 9.2, SAS programmers are able to generate such plots with ease. This paper demonstrates the generation of such plots with both Base SAS and SAS/GRAPH® software.
Read the paper (PDF).
Deli Wang, REGENERON
Paper 3217-2015:
Credit Card Holders' Behavior Modeling: Transition Probability Prediction with Multinomial and Conditional Logistic Regression in SAS/STAT®
Because of the variety of card holders' behavior patterns and income sources, each consumer account can move among different states, such as non-active, transactor, revolver, delinquent, and defaulted, and each requires an individual model for predicting generated income. Estimating the transition probability between statuses at the account level helps to avoid the lack of memory in the MDP approach. The key question is which approach gives more accurate results: multinomial logistic regression or a multistage decision tree with binary logistic regressions. This paper investigates approaches to estimating credit card profitability at the account level based on multistate conditional probabilities by using the SAS/STAT® procedure PROC LOGISTIC. Both models show moderate, but not strong, predictive power. Prediction accuracy for the decision tree depends on the order of stages for the conditional binary logistic regressions. Current development is concentrated on discrete choice models, such as nested logit with PROC MDC.
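A minimal sketch of the multinomial (generalized logit) piece of the comparison, with hypothetical account-level variables:

    /* NEXT_STATE has levels such as ACTIVE, REVOLVER, DELINQUENT, DEFAULT */
    proc logistic data=accounts;
       class current_state / param=ref;
       model next_state = current_state utilization balance / link=glogit;
    run;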
Read the paper (PDF).
Denys Osipenko, the University of Edinburgh
Jonathan Crook
Paper 3511-2015:
Credit Scorecard Generation Using the Credit Scoring Node in SAS® Enterprise Miner™
In today's competitive world, acquiring new customers is crucial for businesses, but what if most of the acquired customers turn out to be defaulters? This decision would backfire on the business and might lead to losses. Extant statistical methods have enabled businesses to identify good-risk customers rather than judging them intuitively. The objective of this paper is to build a credit risk scorecard using the Credit Scoring node in SAS® Enterprise Miner™ 12.3, which a manager can use to make an instant decision on whether to accept or reject a customer's credit application. The data set used for credit scoring was extracted from the UCI Machine Learning Repository and consists of 15 variables that capture details such as the status of the customer's existing checking account, purpose of the credit, credit amount, employment status, and property. To ensure generalization of the model, the data set was partitioned using the Data Partition node into two groups of 70:30 as training and validation, respectively. The target is a binary variable that categorizes customers into good-risk and bad-risk groups. After identifying the key variables required to generate the credit scorecard, a particular score was assigned to each of their subgroups. The final model generating the scorecard has a prediction accuracy of about 75%. A cumulative cut-off score of 120 was generated by SAS to make the demarcation between good-risk and bad-risk customers. Even in the case of future variations in the data, model refinement is easy because the whole process is already defined and does not need to be rebuilt from scratch.
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Kushal Kathed, Oklahoma State University
Shilpi Prasad, Oklahoma State University
Paper 3249-2015:
Cutpoint Determination Methods in Survival Analysis Using SAS®: Updated %FINDCUT Macro
Statistical analyses that use data from clinical or epidemiological studies include continuous variables such as patient age, blood pressure, and various biomarkers. Over the years, there has been an increase in studies that focus on assessing associations between biomarkers and disease of interest. Many of the biomarkers are measured as continuous variables. Investigators seek to identify the possible cutpoint to classify patients as high risk versus low risk based on the value of the biomarker. Several data-oriented techniques such as median and upper quartile, and outcome-oriented techniques based on score, Wald, and likelihood ratio tests are commonly used in the literature. Contal and O'Quigley (1999) presented a technique that used the log-rank test statistic in order to estimate the cutpoint. Their method was computationally intensive and hence was overlooked due to the unavailability of built-in options in standard statistical software. In 2003, we provided the %FINDCUT macro that used Contal and O'Quigley's approach to identify a cutpoint when the outcome of interest was measured as time to event. Over the past decade, demand for this macro has continued to grow, which has led us to consider updating the %FINDCUT macro to incorporate new tools and procedures from SAS® such as array processing, Graph Template Language, and the REPORT procedure. New and updated features include: results presented in a much cleaner report format, user-specified cutpoints, macro parameter error checking, temporary data set cleanup, preserving current option settings, and increased processing speed. We present the utility and added options of the revised %FINDCUT macro using a real-life data set. In addition, we critically compare this method to some of the existing methods and discuss the use and misuse of categorizing a continuous covariate.
Read the paper (PDF).
Jay Mandrekar, Mayo Clinic
Jeffrey Meyers, Mayo Clinic
D
Paper 3451-2015:
DATU - Data Automated Transfer Utility: A Program to Migrate Files from Mainframe to Windows
The Centers for Disease Control and Prevention (CDC) went through a large migration from a mainframe to a Windows platform. This e-poster will highlight the Data Automated Transfer Utility (DATU) that was developed to migrate historic files between the two file systems using SAS® macros and SAS/CONNECT®. We will demonstrate how this program identifies the type of file, transfers the file appropriately, verifies the successful transfer, and provides the details in a Microsoft Excel report. SAS/CONNECT code, special system options, and mainframe code will be shown. In 2009, the CDC made the decision to retire a mainframe that was used for years of primarily SAS work. The replacement platform is a SAS grid system, based on Windows, which is referred to as the Consolidated Statistical Platform (CSP). The change from mainframe to Windows required the migration of over a hundred thousand files totaling approximately 20 terabytes. To minimize countless man hours and human error, an automated solution was developed. DATU was developed for users to migrate their files from the mainframe to the new Windows CSP or other Windows destinations. Approximately 95% of the files on the CDC mainframe were one of three file types: SAS data sets, sequential text files, and partitioned data sets (PDS) libraries. DATU dynamically determines the file type and uses the appropriate method to transfer the file to the assigned Windows destination. Variations of files are detected and handled appropriately. File variations include multiple SAS versions of SAS data sets and sequential files that contain binary values such as packed decimal fields. To mitigate the loss of numeric precision during the migration, SAS numeric variables are identified and promoted to account for architectural differences between mainframe and Windows platforms. To aid users in verifying the accuracy of the file transfer, the program compares file information of the source and destination files. When a SAS file is downloaded, PROC CONTENTS is run on both files, and the PROC CONTENTS output is compared. For sequential text files, a checksum is generated for both files and the checksums are compared. A PDS file transfer creates a list of the members in the PDS and destination Windows folder, and the file lists are compared. The development of this program and the file migration were daunting tasks. This paper will share some of our lessons learned along the way and the method of our implementation.
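The transfer step at the heart of such a utility can be sketched with SAS/CONNECT® as follows (the remote session name and libraries are assumptions; the real DATU adds file-type detection and verification around this core):

    signon mainframe;                       /* remote session defined in the configuration */
    rsubmit;
       proc download data=mylib.claims out=work.claims;
       run;
    endrsubmit;
    signoff;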
Read the paper (PDF).
Jim Brittain, National Center for Health Statistics (CDC)
Robert Schwartz, Centers for Disease Control and Prevention
Paper 2523-2015:
DS2 with Both Hands on the Wheel
The DATA step has served SAS® programmers well over the years, and although it remains handy, the new and powerful DS2 language is a significant alternative, introducing an object-oriented programming environment. It enables users to effectively manipulate complex data and efficiently manage programming through additional data types, programming structure elements, user-defined methods, and shareable packages, as well as threaded execution. This tutorial is based on our experiences getting started with DS2 and learning to use it to access, manage, and share data in a scalable and standards-based way. It helps SAS users of all levels easily get started with DS2 and understand its basic functionality by practicing the features of DS2.
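A first DS2 program of the kind such a tutorial starts from might look like this (the table and variable names are illustrative):

    proc ds2;
       data work.scored / overwrite=yes;
          dcl double risk_score;
          method run();
             set work.accounts;
             risk_score = 0.4*utilization + 0.6*delinq_cnt;
          end;
       enddata;
       run;
    quit;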
Read the paper (PDF).
Peter Eberhardt, Fernwood Consulting Group Inc.
Xue Yao, Winnipeg Regional Health Authority
Paper 3336-2015:
Dallas County Midterm Election Exit Polling Research
Texas is one of about 30 states that have recently passed laws requiring voters to produce valid IDs in an effort to prevent illegal voting. This new regulation, however, might negatively affect voting opportunities for students, low-income people, and minorities. To determine the actual effects of the regulation in Dallas County, voters exiting the polling offices during the November midterm election were surveyed about difficulties that they might have encountered in the voting process. The database of the voting history of each registered voter in the county was examined, and the data set was converted into an analyzable structure prior to stratification. All of the polling offices were stratified by the residents' degree of involvement in the past three general elections, namely, the proportion of people who had voted early and the proportion who had voted at least once. A two-phase sampling design was adopted for stratification. On election day, pollsters were sent to selected polling offices and interviewed 20 voters during a selected time period. The number of people who had difficulties was estimated once the data were collected.
Read the paper (PDF).
Yusun Xia, Southern Methodist University
Paper 2000-2015:
Data Aggregation Using the SAS® Hash Object
Soon after the advent of the SAS® hash object in SAS® 9.0, its early adopters realized that the potential functionality of the new structure is much broader than basic O(1)-time lookup and file matching. Specifically, they went on to invent methods of data aggregation based on the ability of the hash object to quickly store and update key summary information. They also demonstrated that DATA step aggregation using the hash object offered significantly lower run time and memory utilization compared to the SUMMARY/MEANS or SQL procedures, coupled with the possibility of eliminating the need to write the aggregation results to interim data files and the programming flexibility that allowed them to combine sophisticated data manipulation and adjustments of the aggregates within a single step. Such developments within the SAS user community did not go unnoticed by SAS R&D, and for SAS® 9.2 the hash object was enriched with tag parameters and methods specifically designed to handle aggregation without the need to write the summarized data to the PDV host variable and update the hash table with new key summaries, thus further improving run-time performance. As more SAS programmers applied these methods in their real-world practice, they developed aggregation techniques suited to various programmatic scenarios and ideas for handling the hash object memory limitations in situations calling for truly enormous hash tables. This paper presents a review of the DATA step aggregation methods and techniques using the hash object. The presentation is intended for all situations in which the final SAS code is either a straight Base SAS DATA step or a DATA step generated by any other SAS product.
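The basic FIND/REPLACE aggregation idiom reviewed in this area can be sketched as follows (WORK.SALES, REGION, and AMOUNT are stand-ins, not the authors' data):

    data _null_;
       if _n_ = 1 then do;
          declare hash agg(ordered: 'a');
          agg.defineKey('region');
          agg.defineData('region', 'total');
          agg.defineDone();
       end;
       set work.sales end=last;
       if agg.find() ne 0 then total = 0;    /* first record for this key: start at zero */
       total = total + amount;
       agg.replace();                        /* write the running sum back to the table  */
       if last then agg.output(dataset: 'work.sales_by_region');
    run;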
Read the paper (PDF).
Paul Dorfman, Dorfman Consulting
Don Henderson, Henderson Consulting Services
Paper 3104-2015:
Data Management Techniques for Complex Healthcare Data
Data sharing through healthcare collaboratives and national registries creates opportunities for secondary data analysis projects. These initiatives provide data for quality comparisons as well as endless research opportunities to external researchers across the country. The possibilities are bountiful when you join data from diverse organizations and look for common themes related to illnesses and patient outcomes. With these great opportunities comes great pain for data analysts and health services researchers tasked with compiling these data sets according to specifications. Patient care data is complex, and, particularly at large healthcare systems, might be managed with multiple electronic health record (EHR) systems. Matching data from separate EHR systems while simultaneously ensuring the integrity of the details of that care visit is challenging. This paper demonstrates how data management personnel can use traditional SAS PROCs in new and inventive ways to compile, clean, and complete data sets for submission to healthcare collaboratives and other data sharing initiatives. Traditional data matching methods such as SPEDIS are uniquely combined with iterative SQL joins using the SAS® functions INDEX, COMPRESS, CATX, and SUBSTR to yield the most out of complex patient and physician name matches. Recoding, correcting missing items, and formatting data can be achieved efficiently by using traditional tools such as the MAX and FIND functions and PROC FORMAT in equally inventive ways.
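The fuzzy-matching idea can be sketched in a single PROC SQL step; the tables, the join key, and the SPEDIS threshold of 15 are illustrative choices, not the authors':

    proc sql;
       create table name_matches as
       select a.patient_id, a.full_name as ehr1_name, b.full_name as ehr2_name
       from ehr1 a inner join ehr2 b
            on a.birth_dt = b.birth_dt
       where spedis(upcase(compress(a.full_name, ' .,-')),
                    upcase(compress(b.full_name, ' .,-'))) <= 15;
    quit;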
Read the paper (PDF).
Gabriela Cantu, Baylor Scott & White Health
Christopher Klekar, Baylor Scott and White Health
Paper 3261-2015:
Data Quality Scorecard
Many users would like to check the quality of data after the data integration process has loaded the data into a data set or table. The approach in this paper shows users how to develop a process that scores columns based on rules judged against a set of standards set by the user. Each rule has a standard that determines whether it passes, fails, or needs review (a green, red, or yellow score). A rule can be as simple as: Is the value for this column missing, or is this column within a valid range? Further, it includes comparing a column to one or more other columns, or checking for specific invalid entries. It also includes rules that compare a column value to a lookup table to determine whether the value is in the lookup table. Users can create their own rules, and each column can have any number of rules. For example, a rule can be created to compare a dollar column against a range of acceptable values. The user can specify that up to two percent of the values are allowed to be out of range. If two to five percent of the values are out of range, the data should be reviewed; if over five percent of the values are out of range, the data is not acceptable. The entire table has a color-coded scorecard showing each rule and its score. Summary reports show columns by score and distributions of key columns. The scorecard enables the user to quickly assess whether the SAS data set is acceptable, or whether specific columns need to be reviewed. Drill-down reports enable the user to drill into the data to examine why the column scored as it did. Based on the scores, the data set can be accepted or rejected, and the user will know where and why the data set failed. The process can store each scorecard's data in a data mart. This data mart enables the user to review the quality of their data over time. It can answer questions such as: Is the quality of the data improving overall? Are there specific columns that are improving or declining over time? What can we do to improve the quality of our data? This scorecard is not intended to replace the quality control of the data integration or ETL process. It is a supplement to the ETL process. The programs are written using only Base SAS® and the Output Delivery System (ODS), macro variables, and formats. This presentation shows how to: (1) use ODS HTML; (2) color-code cells with the use of formats; (3) use formats as lookup tables; (4) use %INCLUDE statements to make use of template code snippets to simplify programming; and (5) use hyperlinks to launch stored processes from the scorecard.
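The format-based traffic-lighting this approach relies on can be sketched as follows (thresholds and column names are placeholders; the colors appear when the report is sent to ODS HTML):

    proc format;
       value scorefmt  low  -  0.02  = 'lightgreen'   /* pass         */
                       0.02<-  0.05  = 'yellow'       /* needs review */
                       0.05<-  high  = 'salmon';      /* fail         */
    run;

    ods html file='scorecard.html';
    proc report data=rule_scores nowd;
       columns rule_name pct_out_of_range;
       define rule_name        / display 'Rule';
       define pct_out_of_range / display 'Pct Out of Range' format=percent8.1
              style(column)={background=scorefmt.};
    run;
    ods html close;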
Read the paper (PDF). | Download the data file (ZIP).
Tom Purvis, Qualex Consulting Services, Inc.
Paper 3483-2015:
Data Sampling Improvement by Developing the SMOTE Technique in SAS®
A common problem when developing classification models is the imbalance of classes in the classification variable. This imbalance means that one class is represented by a large number of cases while the other class is represented by very few. When this happens, the predictive power of the developed model could be biased, because classification methods tend to favor the majority class and are designed to minimize the error on the total data set regardless of the proportions or balance of the classes. Because of this problem, several techniques are used to balance the distribution of the classification variable: one method is to reduce the size of the majority class (under-sampling), another is to increase the number of cases in the minority class (over-sampling), and a third is to combine these two methods. There is also a more complex technique called SMOTE (Synthetic Minority Over-sampling Technique) that consists of intelligently generating new synthetic records of the minority class using a nearest-neighbors approach. In this paper, we present the development in SAS® of a combination of SMOTE and under-sampling techniques as applied to a churn model. Then, we compare the predictive power of the model using this proposed balancing technique against other models developed with different data sampling techniques.
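The interpolation at the core of SMOTE can be sketched in a DATA step; here each minority record is assumed to already carry the values of one pre-selected nearest neighbor in variables suffixed _NN (a full implementation would first search for the k nearest neighbors):

    data synthetic;
       set minority_with_neighbor;
       if _n_ = 1 then call streaminit(2015);
       gap = rand('uniform');                 /* random point on the segment joining the two cases */
       array x   {3} x1-x3;
       array xnn {3} x1_nn-x3_nn;
       do j = 1 to dim(x);
          x{j} = x{j} + gap*(xnn{j} - x{j});
       end;
       drop j gap x1_nn-x3_nn;
    run;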
Read the paper (PDF).
Lina Maria Guzman Cartagena, DIRECTV
Paper 3321-2015:
Data Summarization for a Dissertation: A Grad Student How-To Paper
Graduate students often need to explore data and summarize multiple statistical models into tables for a dissertation. The challenges of data summarization include coding multiple, similar statistical models, and summarizing these models into meaningful tables for review. The default method is to type (or copy and paste) results into tables. This often takes longer than creating and running the analyses. Students might spend hours creating tables, only to have to start over when a change or correction in the underlying data requires the analyses to be updated. This paper gives graduate students the tools to efficiently summarize the results of statistical models in tables. These tools include a macro-based SAS/STAT® analysis and ODS OUTPUT statement to summarize statistics into meaningful tables. Specifically, we summarize PROC GLM and PROC LOGISTIC output. We convert an analysis of hospital-acquired delirium from hundreds of pages of output into three formatted Microsoft Excel files. This paper is appropriate for users familiar with basic macro language.
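The ODS OUTPUT pattern at the heart of this approach can be sketched as follows (model variables are placeholders; exporting to .xlsx assumes SAS/ACCESS® Interface to PC Files is licensed):

    ods output ParameterEstimates=pe_delirium OddsRatios=or_delirium;
    proc logistic data=analysis;
       model delirium(event='1') = age comorbidity_score los;
    run;

    proc export data=or_delirium outfile='delirium_or.xlsx' dbms=xlsx replace;
    run;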
Read the paper (PDF).
Elisa Priest, Texas A&M University Health Science Center
Ashley Collinsworth, Baylor Scott & White Health/Tulane University
Paper SAS4647-2015:
De-Mystifying Deep Learning
Deep Learning is one of the most exciting research areas in machine learning today. While Deep Learning algorithms are typically very sophisticated, you may be surprised how much you can understand about the field with just a basic knowledge of neural networks. Come learn the fundamentals of this exciting new area and see some of SAS' newest technologies for neural networks.
Patrick Hall, SAS
Paper 3305-2015:
Defensive Coding by Example: Kick the Tires, Pump the Breaks, Check Your Blind Spots, and Merge Ahead!
As SAS® programmers and statisticians, we rarely write programs that are run only once and then set aside. Instead, we are often asked to develop programs very early in a project, on immature data, following specifications that may be little more than a guess as to what the data is supposed to look like. These programs will then be run repeatedly on periodically updated data through the duration of the project. This paper offers strategies for not only making those programs more flexible, so they can handle some of the more commonly encountered variations in that data, but also for setting traps to identify unexpected data points that require further investigation. We will also touch upon some good programming practices that can benefit both the original programmer and others who might have to touch the code. In this paper, we will provide explicit examples of defensive coding that will aid in kicking the tires, pumping the breaks, checking your blind spots, and merging ahead for quality programming from the beginning.
Read the paper (PDF).
Donna Levy, Inventiv Health Clinical
Nancy Brucken, inVentiv Health Clinical
Paper 3386-2015:
Defining and Mapping a Reasonable Distance for Consumer Access to Market Locations
Using geocoded addresses from FDIC Summary of Deposits data together with Census geospatial data, including TIGER boundary files and population-weighted centroid shapefiles, we were able to calculate a reasonable distance threshold by metropolitan statistical area (MSA) or, where applicable, metropolitan division (MD) through a series of SAS® DATA steps and SQL joins. We first used PROC SQL to perform a Cartesian join between the data set containing population-weighted centroid coordinates and a data set containing the geocoded coordinates of approximately 91,000 full-service bank branches. Using the GEODIST function in SAS, we were able to calculate the distance to the nearest bank branch from the population-weighted centroid of each Census tract. The tract data set was then grouped by MSA/MD and sorted within each grouping in ascending order by distance to the nearest bank branch; using the RETAIN statement, we calculated the cumulative population and cumulative population percent for each MSA/MD. The reasonable threshold distance is established where the cumulative population percent is closest (in either direction, +/-) to 90%.
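A simplified version of the distance calculation (with hypothetical data set and variable names) might look like this; GEODIST returns miles when the 'M' option is used with coordinates in degrees:

proc sql;
   create table tract_nearest as
   select t.tract_id, t.msa_md,
          min(geodist(t.cent_lat, t.cent_lon, b.branch_lat, b.branch_lon, 'M'))
             as nearest_branch_miles
   from tract_centroids as t, bank_branches as b
   group by t.tract_id, t.msa_md;
quit;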
Read the paper (PDF).
Sarah Campbell, Federal Deposit Insurance Corporation
Paper 2460-2015:
Demystifying Date and Time Intervals
Intervals have been a feature of Base SAS® for a long time, enabling SAS users to work with commonly (and not-so-commonly) defined periods of time such as years, months, and quarters. With the release of SAS®9, there are more options and capabilities for intervals and their functions. This paper first discusses the basics of intervals in detail, and then discusses several of the enhancements to the interval feature, such as the ability to select how the INTCK function defines interval boundaries and the ability to create your own custom intervals beyond multipliers and shift operators.
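For instance, the optional fourth argument of INTCK controls whether boundary crossings or elapsed intervals are counted; a small sketch:

data _null_;
   d1 = '30JAN2015'd;
   d2 = '01FEB2015'd;
   discrete   = intck('month', d1, d2);       /* counts the month boundary crossed: 1 */
   continuous = intck('month', d1, d2, 'c');  /* counts full months elapsed: 0        */
   put discrete= continuous=;
run;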
Read the paper (PDF).
Derek Morgan
Paper 2660-2015:
Deriving Adverse Event Records Based on Study Treatments
This paper puts forward an approach for creating duplicate records from a single Adverse Event (AE) record based on study treatments. To accomplish this, a flag is created to check whether duplicate records need to be produced when an existing AE spans different treatment periods. If so, the duplicate records are created and their AE dates are derived based on the treatment periods or discontinuation dates.
Read the paper (PDF).
Jonson Jiang, inVentiv Health
Paper SAS4780-2015:
Deriving Insight Across the Enterprise from Digital Data
Learn how leading retailers are developing key findings in digital data to be leveraged across marketing, merchandising, and IT.
Rachel Thompson, SAS
Paper 3201-2015:
Designing Big Data Analytics Undergraduate and Postgraduate Programs for Employability by Using National Skills Frameworks
There is a widely forecast skills gap developing between the numbers of Big Data Analytics (BDA) graduates and the predicted jobs market. Many universities are developing innovative programs to increase the numbers of BDA graduates and postgraduates. The University of Derby has recently developed two new programs that aim to be unique and offer the applicants highly attractive and career-enhancing programs of study. One program is an undergraduate Joint Honours program that pairs analytics with a range of alternative subject areas; the other is a Master's program that has specific emphasis on governance and ethics. A critical aspect of both programs is the synthesis of a Personal Development Planning Framework that enables the students to evaluate their current status, identifies the steps needed to develop toward their career goals, and that provides a means of recording their achievements with evidence that can then be used in job applications. In the UK, we have two sources of skills frameworks that can be synthesized to provide a self-assessment matrix for the students to use as their Personal Development Planning (PDP) toolkit. These are the Skills Framework for the Information Age (SFIA-Plus) framework developed by the SFIA Foundation, and the Student Employability Profiles developed by the Higher Education Academy. A new set of National Occupational Skills (NOS) frameworks (Data Science, Data Management, and Data Analysis) have recently been released by the organization e-Skills UK for consultation. SAS® UK has had significant input to this new set of NOSs. This paper demonstrates how curricula have been developed to meet the Big Data Analytics skills shortfall by using these frameworks and how these frameworks can be used to guide students in their reflective development of their career plans.
Read the paper (PDF).
Richard Self, University of Derby
Paper 3368-2015:
Determining the Key Success Factors for Hit Songs in the Billboard Music Charts
Analyzing the key success factors for hit songs in the Billboard music charts is an ongoing area of interest to the music industry. Although there have been many studies over the past decades on predicting whether a song has the potential to become a hit song, the following research question remains: Can hit songs be predicted? And, if the answer is yes, what are the characteristics of those hit songs? This study applies data mining techniques using SAS® Enterprise Miner™ to understand why some music is more popular than other music. In particular, certain songs are considered one-hit wonders, which appear in the Billboard music charts only once, while other songs are acknowledged as masterpieces. With 2,139 data records, the results demonstrate the practical validity of our approach.
Read the paper (PDF).
Piboon Banpotsakun, National Institute of Development Administration
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Paper 1762-2015:
Did Your Join Work Properly?
We have many options for performing merges or joins these days, each with various advantages and disadvantages. Depending on how you perform your joins, different checks can help you verify whether the join was successful. In this presentation, we look at some sample data, use different methods, and see what kinds of tests can be done to ensure that the results are correct. If the join is performed with PROC SQL and two criteria are fulfilled (the number of observations in the primary data set has not changed [presuming a one-to-one or many-to-one situation], and a variable that should be populated is not missing), then the merge was successful.
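A minimal sketch of those two checks after a PROC SQL left join (the table and variable names are illustrative):

proc sql noprint;
   select count(*) into :n_before from primary_data;
   create table joined as
      select a.*, b.region
      from primary_data as a left join lookup_data as b
      on a.id = b.id;
   select count(*) into :n_after     from joined;
   select count(*) into :n_unmatched from joined where region is missing;
quit;
%put NOTE: before=&n_before after=&n_after unmatched=&n_unmatched;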
Read the paper (PDF).
Emmy Pahmer, inVentiv Health
Paper 3021-2015:
Discovering Personality Type through Text Mining in SAS® Enterprise Miner™ 12.3
Data scientists and analytic practitioners have become obsessed with quantifying the unknown. Through text mining third-person posthumous narratives in SAS® Enterprise Miner™ 12.1, we measured tangible aspects of personalities based on the broadly accepted big-five characteristics: extraversion, agreeableness, conscientiousness, neuroticism, and openness. These measurable attributes are linked to common descriptive terms used throughout our data to establish statistical relationships. The data set contains over 1,000 obituaries from newspapers throughout the United States, with individuals who vary in age, gender, demographic, and socio-economic circumstances. In our study, we leveraged existing literature to build the ontology used in the analysis. This literature suggests that a third person's perspective gives insight into one's personality, solidifying the use of obituaries as a source for analysis. We statistically linked target topics such as career, education, religion, art, and family to the five characteristics. With these taxonomies, we developed multivariate models in order to assign scores to predict an individual's personality type. With a trained model, this study has implications for predicting an individual's personality, allowing for better decisions on human capital deployment. Even outside the traditional application of personality assessment for organizational behavior, the methods used to extract intangible characteristics from text enable us to identify valuable information across multiple industries and disciplines.
Read the paper (PDF).
Mark Schneider, Deloitte & Touche
Andrew Van Der Werff, Deloitte & Touche, LLP
Paper SAS1935-2015:
Divide and Conquer--Writing Parallel SAS® Code to Speed Up Your SAS Program
Being able to split SAS® processing over multiple SAS processors on a single machine or over multiple machines running SAS, as in the case of SAS® Grid Manager, enables you to get more done in less time. This paper looks at the methods of using SAS/CONNECT® to process SAS code in parallel, including the SAS statements, macros, and PROCs available to make this processing easier for the SAS programmer. SAS products that automatically generate parallel code are also highlighted.
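A minimal SAS/CONNECT sketch of running two independent steps in parallel on the same machine (this assumes SAS/CONNECT is licensed and that !sascmd can spawn local sessions):

options autosignon sascmd='!sascmd';

rsubmit task1 wait=no;
   proc means data=sashelp.class;
   run;
endrsubmit;

rsubmit task2 wait=no;
   proc freq data=sashelp.cars;
      tables origin;
   run;
endrsubmit;

waitfor _all_ task1 task2;
signoff _all_;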
Read the paper (PDF). | Download the data file (ZIP).
Doug Haigh, SAS
Paper 3457-2015:
Documentation as You Go: aka Dropping Breadcrumbs!
Your project ended a year ago, and now you need to explain what you did, or rerun some of your code. Can you remember the process? Can you describe what you did? Can you even find those programs? Visually presented here are examples of tools and techniques that follow best practices to help us, as programmers, manage the flow of information from source data to a final product.
Read the paper (PDF).
Elizabeth Axelrod, Abt Associates Inc.
Paper 3061-2015:
Does Class Rank Align with Future Potential?
Today, employers and universities alike must choose the most talented individuals from a large pool. However, it is difficult to determine whether a student's A+ in English means that he or she can write as proficiently as another student who writes as a hobby. As a result, there are now dozens of ways to compare individuals along one spectrum or another. For example, the ACT and SAT enable universities to view a student's performance on a test given to all applicants in order to help determine whether they will be successful. High schools use students' GPAs in order to compare them to one another for academic opportunities. The WorkKeys Exam enables employers to rate prospective employees on their abilities to perform day-to-day business activities. Rarely do standardized tests and in-class performance get compared to each other. We used SAS® to analyze the GPAs and WorkKeys Exam results of 285 seniors who attend the Phillip O Berry Academy. The purpose was to compare class standing to what a student can prove he or she knows in a standardized environment. Emphasis is on the use of PROC SQL in SAS® 9.3 rather than DATA step processing.
Read the paper (PDF). | Download the data file (ZIP).
Jonathan Gomez Martinez, Phillip O Berry Academy of Technology
Christopher Simpson, Phillip O Berry Academy of Technology
Paper SAS1573-2015:
Don't Be a Litterbug: Best Practices for Using Temporary Files in SAS®
This paper explains best practices for using temporary files in SAS® programs. These practices include using the TEMP access method, writing to the WORK directory, and ensuring that you leave no litter files behind. An additional special temporary file technique is described for mainframe users.
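For example, the TEMP access method creates a scratch file that SAS deletes automatically at the end of the session, so no litter files are left behind:

filename scratch temp;

data _null_;
   file scratch;
   put 'intermediate results that nobody needs to clean up later';
run;

%put NOTE: WORK location (also cleaned up automatically): %sysfunc(pathname(work));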
Read the paper (PDF).
Rick Langston, SAS
Paper 2442-2015:
Don't Copy and Paste--Use BY Statement Processing with ODS to Make Your Summary Tables
Most manuscripts in medical journals contain summary tables that combine simple summaries and between-group comparisons. These tables typically combine estimates for categorical and continuous variables. The statistician generally summarizes the data using the FREQ procedure for categorical variables and compares percentages between groups using a chi-square or a Fisher's exact test. For continuous variables, the MEANS procedure is used to summarize data as either means and standard deviation or medians and quartiles. Then these statistics are generally compared between groups by using the GLM procedure or NPAR1WAY procedure, depending on whether one is interested in a parametric test or a non-parametric test. The outputs from these different procedures are then combined and presented in a concise format ready for publications. Currently there is no straightforward way in SAS® to build these tables in a presentable format that can then be customized to individual tastes. In this paper, we focus on presenting summary statistics and results from comparing categorical variables between two or more independent groups. The macro takes the data set, the number of treatment groups, and the type of test (either chi-square or Fisher's exact) as input and presents the results in a publication-ready table. This macro automates summarizing data to a certain extent and minimizes risky typographical errors when copying results or typing them into a table.
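The core of such a macro is capturing the procedure output with ODS OUTPUT; a minimal sketch (the data set and variable names are hypothetical):

ods output CrossTabFreqs=freqs ChiSq=chisq FishersExact=fisher;
proc freq data=study;
   tables group*response / chisq fisher;
run;

/* freqs, chisq, and fisher are now ordinary data sets that can be merged and formatted into one table */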
Read the paper (PDF).
Jeff Gossett, University of Arkansas for Medical Sciences
Mallikarjuna Rettiganti, UAMS
Paper 3347-2015:
Donor Sentiment and Characteristic Analysis Using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
It has always been a million-dollar question: What inhibits a donor from donating? Many successful universities have deep roots in annual giving. We know donor sentiment is a key factor in drawing attention to engage donors. This paper is a summary of findings about donor behaviors using textual analysis combined with the power of predictive modeling. In addition to identifying the characteristics of donors, the paper focuses on identifying the characteristics of a first-time donor. It distinguishes the features of the first-time donor from the general donor pattern. It leverages the variations in data to provide deeper insights into behavioral patterns. A data set containing 247,000 records was obtained from the XYZ University Foundation alumni database, Facebook, and Twitter. Solicitation content such as email subject lines sent to the prospect base was considered. Time-dependent data and time-independent data were categorized to make unbiased predictions about the first-time donor. The predictive models use input such as age, educational records, scholarships, events, student memberships, and solicitation methods. Models such as decision trees, Dmine regression, and neural networks were built to predict the prospects. SAS® Sentiment Analysis Studio and SAS® Enterprise Miner™ were used to analyze the sentiment.
Read the paper (PDF).
Ramcharan Kakarla, Comcast
Goutam Chakraborty, Oklahoma State University
Paper 3381-2015:
Double Generalized Linear Models Using SAS®: The %DOUBLEGLM Macro
The purpose of this paper is to introduce a SAS® macro named %DOUBLEGLM that enables users to model the mean and dispersion jointly using the double generalized linear models described in Nelder (1991) and Lee (1998). The R functions FITJOINT and DGLM (R Development Core Team, 2011) were used to verify the suitability of the %DOUBLEGLM macro estimates. The results showed that the macro estimates were close to those obtained from the R functions.
Read the paper (PDF). | Download the data file (ZIP).
Paulo Silva, Universidade de Brasilia
Alan Silva, Universidade de Brasilia
Paper SAS1865-2015:
Drilling for Deepwater Data: A Forensic Analysis of the Gulf of Mexico Deepwater Horizon Disaster
During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS® Asset Performance Analytics and SAS® Enterprise Miner™ to build a model for drilling and well control anomalies, to fingerprint key well control measures of the transient fluid properties, and to operationalize these analytics on the drilling assets with SAS® Event Stream Processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables the rapid differentiation between safe and unsafe modes of operation.
Read the paper (PDF).
Jim Duarte, SAS
Keith Holdaway, SAS
Moray Laing, SAS
Paper SAS1561-2015:
Driving SAS® with Lua
Programming SAS® has just been made easier, now that SAS 9.4 has incorporated the Lua programming language into the heart of the SAS System. With its elegant syntax, modern design, and support for data structures, Lua offers you a fresh way to write SAS programs, getting you past many of the limitations of the SAS macro language. This paper shows how you can get started using Lua to drive SAS, via a quick introduction to Lua and a tour through some of the features of the Lua and SAS combination that make SAS programming easier. SAS macro programming is also compared with Lua, so that you can decide where you might benefit most by using either language.
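A tiny taste of the combination, assuming SAS 9.4 (the data set choice is arbitrary): ordinary Lua runs inside PROC LUA, and sas.submit hands blocks of SAS code back to the session.

proc lua;
submit;
   -- ordinary Lua, running inside SAS
   for i = 1, 3 do
      print("iteration " .. i)
   end
   -- hand a block of SAS code back to the SAS session
   sas.submit([[proc means data=sashelp.class; run;]])
endsubmit;
run;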
Read the paper (PDF).
Paul Tomas, SAS
Paper 3487-2015:
Dynamic Dashboards Using SAS®
Dynamic, interactive visual displays known as dashboards are most effective when they show essential graphs, tables, statistics, and other information where data is the star. The first rule for creating an effective dashboard is to keep it simple. Striking a balance between content and style, a dashboard should be devoid of excessive clutter so as not to distract from and obscure the information displayed. The second rule of effective dashboard design involves displaying data that meets one or more business or organizational objectives. To accomplish this, the elements in a dashboard should convey a format easily understood by its intended audience. Attendees learn how to create dynamic, interactive user- and data-driven dashboards, graphical and table-driven dashboards, statistical dashboards, and drill-down dashboards with a purpose.
Read the paper (PDF).
Kirk Paul Lafler, Software Intelligence Corporation
Paper SAS1787-2015:
Dynamic Decision-Making Web Services Using SAS® Stored Processes and SAS® Business Rules Manager
With the latest release of SAS® Business Rules Manager, decision-making using SAS® Stored Processes is now easier with simplified deployment via a web service for integration with your applications and business processes. This paper shows you how a user can publish analytics and rules as SOAP-based web services, track its usage, and dynamically update these decisions using SAS Business Rules Manager. In addition, we demonstrate how to integrate with SAS® Model Manager using SAS® Workflow to demonstrate how your other SAS® applications and solutions can also simplify real-time decision-making through business rules.
Read the paper (PDF).
Lori Small, SAS
Chris Upton, SAS
E
Paper 3083-2015:
Easing into Analytics Using SAS® Enterprise Guide® 6.1
Do you need to deliver business insight and analytics to support decision-making? Using SAS® Enterprise Guide®, you can access the full power of SAS® for analytics, without needing to learn the details of SAS programming. This presentation focuses on the following uses of SAS Enterprise Guide: exploring and understanding--getting a feel for your data and for its issues and anomalies; visualizing--looking at the relationships, trends, and surprises; consolidating--starting to piece together the story; and presenting--building the insight and analytics into a presentation using SAS Enterprise Guide.
Read the paper (PDF).
Marje Fecht, Prowerk Consulting
Paper 3502-2015:
Edit the Editor: Creating Keyboard Macros in SAS® Enterprise Guide®
Programmers can create keyboard macros to perform common editing tasks in SAS® Enterprise Guide®. This paper introduces how to record keystrokes, save a keyboard macro, edit the commands, and assign a shortcut key. Sample keyboard macros are included. Techniques to share keyboard macros are also covered.
Read the paper (PDF). | Download the data file (ZIP).
Christopher Bost, MDRC
Paper 3190-2015:
Educating Future Business Leaders in the Era of Big Data
At NC State University, our motto is Think and Do. When it comes to educating students in the Poole College of Management, that means that we want them to not only learn to think critically but also to gain hands-on experience with the tools that will enable them to be successful in their careers. And, in the era of big data, we want to ensure that our students develop skills that will help them to think analytically in order to use data to drive business decisions. One method that lends itself well to thinking and doing is the case study approach. In this paper, we discuss the case study approach for teaching analytical skills and highlight the use of SAS® software for providing practical, hands-on experience with manipulating and analyzing data. The approach is illustrated with examples from specific case studies that have been used for teaching introductory and intermediate courses in business analytics.
Read the paper (PDF).
Tonya Balan, NC State University
Paper 2124-2015:
Efficient Ways to Build Up Chains and Waves for Respondent-Driven Sampling Data
Respondent-driven sampling (RDS) is both a sampling method and a data analysis technique. As a sampling method, RDS is a chain referral technique with strategic recruitment quotas and specific data gathering requirements. Like other chain referral techniques (for example, snowball sampling), the chains and waves are the starting point for analysis. But building the chains and waves can still be a daunting task because it involves many transpositions and merges. This paper provides an efficient method of using Base SAS® to build up chains and waves.
Read the paper (PDF).
Wen Song, ICF International
Paper 3329-2015:
Efficiently Using SAS® Data Views
For the Research Data Centers (RDCs) of the United States Census Bureau, the demand for disk space increases substantially with each passing year. Efficient use of SAS® data views can successfully address the disk space challenges within the RDCs. This paper discusses the usage and benefits of SAS data views to save disk space and reduce the time and effort required to manage large data sets. The ability and efficiency of SAS data views to process regular ASCII, compressed ASCII, and other commonly used file formats are analyzed and evaluated in detail. The authors discuss ways in which using SAS data views is more efficient than traditional methods for processing and deploying the large census and survey data in the RDCs.
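A minimal sketch of the idea (the input file and layout are hypothetical): a DATA step view stores only the program, so the raw file is not duplicated on disk, and rows are materialized only when the view is read.

data survey_v / view=survey_v;
   infile 'survey_raw.txt' truncover;
   input person_id 1-9 state $ 10-11 income 12-19;
run;

/* The view is read like any data set */
proc means data=survey_v n mean;
   var income;
run;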
Read the paper (PDF).
Shigui Weng, US Bureau of the Census
Shy Degrace, US Bureau of the Census
Ya Jiun Tsai, US Bureau of the Census
Paper SAS1775-2015:
Encore: Introduction to Bayesian Analysis Using SAS/STAT®
The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS® software has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is designed for general Bayesian modeling. This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then uses examples from the GENMOD and PHREG procedures to describe the built-in Bayesian capabilities, which became available for all platforms in SAS/STAT® 9.3. Discussion includes how to specify prior distributions, evaluate convergence diagnostics, and interpret the posterior summary statistics.
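For example, adding a BAYES statement to an ordinary PROC GENMOD model requests posterior sampling, convergence diagnostics, and posterior summaries (the priors here are the defaults; the variables come from the SASHELP.HEART sample data):

proc genmod data=sashelp.heart;
   model cholesterol = weight height / dist=normal;
   bayes seed=27513 nbi=2000 nmc=10000 outpost=posterior;
run;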
Read the paper (PDF).
Maura Stokes, SAS
Paper 4260-2015:
Enhancing Infrastructure for Growth
As many leadership experts suggest, growth happens only when it is intentional. Growth is vital to the immediate and long-term success of our employees as well as our employers. As SAS® programming leaders, we have a responsibility to encourage individual growth and to provide the opportunity for it. With an increased workload yet fewer resources, initial and ongoing training seem to be deemphasized as we are pressured to meet project timelines. The current workforce continues to evolve with time and technology. More important than simply providing the opportunity for training, individual trainees need the motivation for any training program to be successful. Although many existing principles for growth remain true, how such principles are applied needs to evolve with the current generation of SAS programmers. The primary goal of this poster is to identify the critical components that we feel are necessary for the development of an effective training program, one that meets the individual needs of the current workforce. Rather than proposing a single 12-step program that works for everyone, we think that identifying key components for enhancing the existing training infrastructure is a step in the right direction.
Read the paper (PDF).
Amber Randall, Axio Research
Paper 3920-2015:
Entity Resolution and Master Data Life Cycle Management in the Era of Big Data
Proper management of master data is a critical component of any enterprise information system. However, effective master data management (MDM) requires that both IT and Business understand the life cycle of master data and the fundamental principles of entity resolution (ER). This presentation provides a high-level overview of current practices in data matching, record linking, and entity information life cycle management that are foundational to building an effective strategy to improve data integration and MDM. Particular areas of focus are: 1) The need for ongoing ER analytics--the systematic and quantitative measurement of ER performance; 2) Investing in clerical review and asserted resolution for continuous improvement; and 3) Addressing the large-scale ER challenge through distributed processing.
Read the paper (PDF). | Watch the recording.
John Talburt, Black Oak Analytics, Inc
Paper 3242-2015:
Entropy-Based Measures of Weight of Evidence and Information Value for Variable Reduction and Segmentation for Continuous Dependent Variables
My SAS® Global Forum 2013 paper 'Variable Reduction in SAS® by Using Weight of Evidence (WOE) and Information Value (IV)' has become the most sought-after online article on variable reduction in SAS since its publication. But the methodology provided by the paper is limited to reduction of numeric variables for logistic regression only. Built on a similar process, the current paper adds several major enhancements: 1) The use of WOE and IV has been expanded to the analytics and modeling for continuous dependent variables. After the standardization of a continuous outcome, all records can be divided into two groups: positive performance (outcome y above sample average) and negative performance (outcome y below sample average). This treatment is rigorously consistent with the concept of entropy in Information Theory: the juxtaposition of two opposite forces in one equation, and a stronger contrast between the two suggests a higher intensity, that is, more information delivered by the variable in question. As the standardization keeps the outcome variable continuous and quantified, the revised formulas for WOE and IV can be used in the analytics and modeling for continuous outcomes such as sales volume, claim amount, and so on. 2) Categorical and ordinal variables can be assessed together with numeric ones. 3) Users of big data usually need to evaluate hundreds or thousands of variables, but it is not uncommon that over 90% of variables contain little useful information. We have added a SAS macro that trims these variables efficiently in a broad-brushed manner without a thorough examination. Afterward, we examine the retained variables more carefully on their behavior with respect to the target outcome. 4) We add Chi-Square analysis for categorical/ordinal variables and Gini coefficients for numeric variables in order to provide additional suggestions for segmentation and regression. With the above enhancements added, a SAS macro program is provided at the end of the paper as a complete suite for variable reduction/selection that efficiently evaluates all variables together. The paper provides a detailed explanation for how to use the SAS macro and how to read the SAS outputs that provide useful insights for subsequent linear regression, logistic regression, or scorecard development.
Read the paper (PDF).
Alec Zhixiao Lin, PayPal Credit
Paper SAS1911-2015:
Equivalence and Noninferiority Testing Using SAS/STAT® Software
Proving difference is the point of most statistical testing. In contrast, the point of equivalence and noninferiority tests is to prove that results are substantially the same, or at least not appreciably worse. An equivalence test can show that a new treatment, one that is less expensive or causes fewer side effects, can replace a standard treatment. A noninferiority test can show that a faster manufacturing process creates no more product defects or industrial waste than the standard process. This paper reviews familiar and new methods for planning and analyzing equivalence and noninferiority studies in the POWER, TTEST, and FREQ procedures in SAS/STAT® software. Techniques that are discussed range from Schuirmann's classic method of two one-sided tests (TOST) for demonstrating similar normal or lognormal means in bioequivalence studies, to Farrington and Manning's noninferiority score test for showing that an incidence rate (such as a rate of mortality, side effects, or product defects) is no worse. Real-world examples from clinical trials, drug development, and industrial process design are included.
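For instance, Schuirmann's TOST for bioequivalence on a lognormal scale can be requested directly in PROC TTEST (the data set, variable, and the conventional 0.80-1.25 limits shown here are illustrative):

proc ttest data=pharm dist=lognormal tost(0.8, 1.25);
   class treatment;
   var auc;
run;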
Read the paper (PDF).
John Castelloe, SAS
Donna Watts, SAS
Paper SAS1900-2015:
Establishing a Health Analytics Framework
Medicaid programs are the second largest line item in each state's budget. In 2012, they contributed $421.2 billion, or 15 percent of total national healthcare expenditures. With US health care reform at full speed, state Medicaid programs must establish new initiatives that will reduce the cost of healthcare, while providing coordinated, quality care to the nation's most vulnerable populations. This paper discusses how states can implement innovative reform through the use of data analytics. It explains how to establish a statewide health analytics framework that can create novel analyses of health data and improve the health of communities. With solutions such as SAS® Claims Analytics, SAS® Episode Analytics, and SAS® Fraud Framework, state Medicaid programs can transform the way they make business and clinical decisions. Moreover, new payment structures and delivery models can be successfully supported through the use of healthcare analytics. A statewide health analytics framework can support initiatives such as bundled and episodic payments, utilization studies, accountable care organizations, and all-payer claims databases. Furthermore, integrating health data into a single analytics framework can provide the flexibility to support a unique analysis that each state can customize with multiple solutions and multiple sources of data. Establishing a health analytics framework can significantly improve the efficiency and effectiveness of state health programs and bend the healthcare cost curve.
Read the paper (PDF).
Krisa Tailor, SAS
Jeremy Racine
Paper 3047-2015:
Examination of Three SAS® Tools for Solving a Complicated Data Summary Problem
When faced with a difficult data reduction problem, a SAS® programmer has many options for how to solve the problem. In this presentation, three different methods are reviewed and compared in terms of processing time, debugging, and ease of understanding. The three methods include linearizing the data, using SQL Cartesian joins, and using sequential data processing. Inconsistencies in the raw data caused the data linearization to be problematic. The large number of records and the need for many-to-many merges resulted in a long run time for the SQL code. The sequential data processing, although older technology, provided the most time efficient and error-free results.
Read the paper (PDF).
Carry Croghan, US-EPA
Paper 3335-2015:
Experimental Approaches to Marketing and Pricing Research
Design of experiments (DOE) is an essential component of laboratory, greenhouse, and field research in the natural sciences. It has also been an integral part of scientific inquiry in diverse social science fields such as education, psychology, marketing, pricing, and social work. The principles and practices of DOE are among the oldest and the most advanced tools within the realm of statistics. DOE classification schemes, however, are diverse and, at times, confusing. In this presentation, we provide a simple conceptual classification framework in which experimental methods are grouped into classical and statistical approaches. The classical approach is further divided into pre-, quasi-, and true experiments. The statistical approach is divided into one-factor, two-factor, and more-than-two-factor experiments. Within these broad categories, we review several contemporary and widely used designs and their applications. The optimal use of Base SAS® and SAS/STAT® to analyze, summarize, and report these diverse designs is demonstrated. The prospects and challenges of such diverse and critically important analytics tools for business insight extraction in marketing and pricing research are discussed.
Read the paper (PDF).
Max Friedauer
Jason Greenfield, Cardinal Health
Yuhan Jia, Cardinal Health
Joseph Thurman, Cardinal Health
Paper 3384-2015:
Exploring Number Theory Using Base SAS®
Long before analysts began mining large data sets, mathematicians sought truths hidden within the set of natural numbers. This exploration was formalized into the mathematical subfield known as number theory. Though this discipline has proven fruitful for many applied fields, number theory delights in numerical truth for its own sake. The austere and abstract beauty of number theory prompted nineteenth-century mathematician Carl Friedrich Gauss to dub it 'The Queen of Mathematics.' This session and the related paper demonstrate that the analytical power of the SAS® engine is well-suited for exploring concepts in number theory. In Base SAS®, students and educators will find a powerful arsenal for exploring, illustrating, and visualizing the following: prime numbers, perfect numbers, the Euclidean algorithm, the prime number theorem, Euler's totient function, Chebyshev's theorem, the Chinese remainder theorem, the Gauss circle problem, and more! The paper delivers custom SAS procedures and code segments that generate data sets relevant to the exploration of topics commonly found in elementary number theory texts. The efficiency of these algorithms is discussed, and an emphasis is placed on structuring data sets to maximize flexibility and ease in exploring new conjectures and illustrating known theorems. Last, the power of SAS plotting is put to use in unexpected ways to visualize and convey number-theoretic facts. This session and the related paper are geared toward the academic user or SAS user who welcomes and revels in mathematical diversions.
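As one small example of the kind of exploration the paper describes (not the paper's own code), a DATA step can generate the primes below 1,000 by trial division:

data primes(keep=prime);
   do n = 2 to 1000;
      is_prime = 1;
      do d = 2 to floor(sqrt(n)) while (is_prime);
         if mod(n, d) = 0 then is_prime = 0;
      end;
      if is_prime then do;
         prime = n;
         output;
      end;
   end;
run;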
Read the paper (PDF).
Matthew Duchnowski, Educational Testing Service
Paper SAS4641-2015:
Extending Geospatial Analytics with SAS® Visual Analytics and ESRI
Geospatial analysis plays an important role in data visualization for business intelligence (BI). Using geospatial data with business data on maps provides a visual context for understanding business patterns that are influenced by location-sensitive information. When analytics such as correlation, forecasting, and decision trees are integrated with location-based data, they create new business insights that are not common in traditional BI applications. The advanced analytics of SAS® offered in SAS® Visual Analytics and SAS® Visual Statistics push the limits of the common mapping usage seen in BI applications. SAS is working on a new level of integration with ESRI, a leader in geospatial analytics. The new features from this integration will bring together the best of both technologies and provide new insights to business analysts and BI customers. This session hosts a demo of the new features planned for a future release of SAS Visual Analytics.
Murali Nori, SAS
Paper 3306-2015:
Extending the Scope of Custom Transformations
Building and maintaining a data warehouse can require a complex series of jobs. Having an ETL flow that is reliable and well integrated is one big challenge. An ETL process might need some pre- and post-processing operations on the database to be well integrated and reliable. Some might handle this via maintenance windows. Others like us might generate custom transformations to be included in SAS® Data Integration Studio jobs. Custom transformations in SAS Data Integration Studio can be used to speed ETL process flows and reduce the database administrator's intervention after ETL flows are complete. In this paper, we demonstrate the use of custom transformations in SAS Data Integration Studio jobs to handle database-specific tasks for improving process efficiency and reliability in ETL flows.
Read the paper (PDF).
Emre Saricicek, University of North Carolina at Chapel Hill
Dean Huff, UNC
F
Paper SAS1385-2015:
Federated Security Domains with SAS® and SAML
From large holding companies with multiple subsidiaries to loosely affiliated state educational institutions, security domains are being federated to enable users from one domain to access applications in other domains and ultimately save money on software costs through sharing. Rather than rely on centralized security, applications must accept claims-based authentication from trusted authorities and support open standards such as Security Assertion Markup Language (SAML) instead of proprietary security protocols. This paper introduces SAML 2.0 and explains how the open source SAML implementation known as Shibboleth can be integrated with the SAS® 9.4 security architecture to support SAML. It then describes in detail how to set up Microsoft Active Directory Federation Services (AD FS) as the SAML Identity Provider, how to set up the SAS middle tier as the relying party, and how to troubleshoot problems.
Read the paper (PDF).
Mike Roda, SAS
Paper SAS1750-2015:
Feeling Anxious about Transitioning from Desktop to Server? Key Considerations to Diminish Your Administrators' and Users' Jitters
As organizations strive to do more with fewer resources, many modernize their disparate PC operations to centralized server deployments. Administrators and users share many concerns about using SAS® on a Microsoft Windows server. This paper outlines key guidelines, plus architecture and performance considerations, that are essential to making a successful transition from PC to server. This paper outlines the five key considerations for SAS customers who will change their configuration from PC-based SAS to using SAS on a Windows server: 1) Data and directory references; 2) Interactive and surrounding applications; 3) Usability; 4) Performance; 5) SAS Metadata Server.
Read the paper (PDF).
Kate Schwarz, SAS
Donna Bennett, SAS
Margaret Crevar, SAS
Paper 3049-2015:
Filling your SAS® Efficiency Toolbox: Creating a Stored Process to Interact with Your Shared SAS® Server Using the X and SYSTASK Commands
SAS® Enterprise Guide® is a great interface for businesses running SAS® in a shared server environment. However, interacting with the shared server outside of SAS can require costly third-party software and knowledge of specific server programming languages. This can create a barrier between the SAS program and the server, which can be frustrating for even the best SAS programmers. This paper reviews the X and SYSTASK commands and creates a template of SAS code to pass commands from SAS to the server. By writing the server log to a text file, we demonstrate how to display critical server information in the code results. Using macros and the prompt functionality of SAS Enterprise Guide, we form stored processes, allowing SAS users of all skill levels to interact with the server environment. These stored processes can improve programming efficiency by providing a quick in-program solution to complete common server tasks such as copying folders or changing file permissions. They might also reduce the need for third-party programs to communicate with the server, which could potentially reduce software costs.
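A minimal sketch of the pattern (the server paths are hypothetical, and XCMD must be allowed on the server): SYSTASK runs a command and returns its status, while a pipe brings server output back into the SAS results.

systask command "mkdir ~/projects/archive" wait taskname=mk1 status=mk1_rc;
%put NOTE: server command returned &mk1_rc;

filename srvout pipe "ls -l ~/projects";
data _null_;
   infile srvout;
   input;
   put _infile_;    /* echo the server listing into the SAS log/results */
run;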
Read the paper (PDF).
Cody Murray, Medica Health Plans
Chad Stegeman, Medica
Paper SAS1924-2015:
Find What You Are Looking For And More in SAS® Enterprise Guide®
Are you looking to track changes to your SAS® programs? Do you wish you could easily find errors, warnings, and notes in your SAS logs? Looking for a convenient way to find point-and-click tasks? Want to search your SAS® Enterprise Guide® project? How about a point-and-click way to view SAS system options and SAS macro variables? Or perhaps you want to upload data to the SAS® LASR™ Analytics Server, view SAS® Visual Analytics reports, or run SAS® Studio tasks, all from within SAS Enterprise Guide? You can find these capabilities and more in SAS Enterprise Guide. Knowing what tools are at your disposal and how to use them will put you a step ahead of the rest. Come learn about some of the newer features in SAS Enterprise Guide 7.1 and how you can leverage them in your work.
Read the paper (PDF).
Casey Smith, SAS
Paper 1752-2015:
Fixing Lowercase Acronyms Left Over from the PROPCASE Function
The PROPCASE function is useful when you are cleansing a database of names and addresses in preparation for mailing. But it does not know the difference between a proper name (in which initial capitalization should be used) and an acronym (which should be all uppercase). This paper explains an algorithm that determines with reasonable accuracy whether a word is an acronym and, if it is, converts it to uppercase.
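A simplified sketch of the idea (the input data set and the vowel-free heuristic here are illustrative assumptions; the paper's algorithm is more refined):

data fixed;
   set contacts;                              /* hypothetical input with character variable company */
   length clean $100 word $40;
   clean = propcase(company);
   do i = 1 to countw(clean, ' ');
      word = scan(clean, i, ' ');
      /* crude acronym test: a short word with no vowels, such as LLC or NW */
      if 2 <= length(word) <= 4 and countc(word, 'aeiouyAEIOUY') = 0 then
         clean = tranwrd(clean, strip(word), upcase(word));
   end;
   drop i word;
run;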
Read the paper (PDF). | Download the data file (ZIP).
Joe DeShon, Boehringer Ingelheim Vetmedica
Paper 3471-2015:
Forecasting Vehicle Sharing Demand Using SAS® Forecast Studio
As pollution and population continue to increase, new concepts of eco-friendly commuting evolve. One of the emerging concepts is the bicycle sharing system. It is a bike rental service on a short-term basis at a moderate price. It provides people the flexibility to rent a bike from one location and return it to another location. This business is quickly gaining popularity all over the globe. In May 2011, there were only 375 bike rental schemes consisting of nearly 236,000 bikes. However, this number jumped to 535 bike sharing programs with approximately 517,000 bikes in just a couple of years. It is expected that this trend will continue to grow at a similar pace in the future. Most of the businesses involved in this system of bike rental are faced with the challenge of balancing supply and inconsistent demand. The number of bikes needed on a particular day can vary based on several factors such as season, time, temperature, wind speed, humidity, holidays, and day of the week. In this paper, we have tried to solve this problem using SAS® Forecast Studio. Incorporating the effects of all the above factors and analyzing the demand trends of the last two years, we have been able to precisely forecast the number of bikes needed on any day in the future. Also, we are able to do scenario analysis to observe the effect of particular variables on demand.
Read the paper (PDF).
Kushal Kathed, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ayush Priyadarshi, Oklahoma State University
Paper 3419-2015:
Forest Plotting Analysis Macro %FORESTPLOT
A powerful tool for visually analyzing regression analysis is the forest plot. Model estimates, ratios, and rates with confidence limits are graphically stacked vertically in order to show how they overlap with each other and to show values of significance. The ability to see whether two values are significantly different from each other or whether a covariate has a significant meaning on its own is made much simpler in a forest plot rather than sifting through numbers in a report table. The amount of data preparation needed in order to build a high-quality forest plot in SAS® can be tremendous because the programmer needs to run analyses, extract the estimates to be plotted, structure the estimates in a format conducive to generating a forest plot, and then run the correct plotting procedure or create a graph template using the Graph Template Language (GTL). While some SAS procedures can produce forest plots using Output Delivery System (ODS) Graphics automatically, the plots are not generally publication-ready and are difficult to customize even if the programmer is familiar with GTL. The macro %FORESTPLOT is designed to perform all of the steps of building a high-quality forest plot in order to save time for both experienced and inexperienced programmers, and is currently set up to perform regression analyses common to the clinical oncology research areas, Cox proportional hazards and logistic, as well as calculate Kaplan-Meier event-free rates. To improve flexibility, the user can specify a pre-built data set to transform into a forest plot if the automated analysis options of the macro do not fit the user's needs.
Read the paper (PDF).
Jeffrey Meyers, Mayo Clinic
Qian Shi, Mayo Clinic
Paper SAS1500-2015:
Frequently Asked Questions Regarding Storage Configurations
The SAS® Global Forum paper 'Best Practices for Configuring Your I/O Subsystem for SAS®9 Applications' provides general guidelines for configuring I/O subsystems for your SAS® applications. The paper reflects updated storage and virtualization technology. This companion paper ('Frequently Asked Questions Regarding Storage Configurations') is commensurately updated, including new storage technologies such as storage virtualization, storage tiers (including automated tier management), and flash storage. The subject matter is voluminous, so a frequently asked questions (FAQ) format is used. Our goal is to continually update this paper as additional field needs arise and technology dictates.
Read the paper (PDF).
Tony Brown, SAS
Margaret Crevar, SAS
Paper 3434-2015:
From Backpacks to Briefcases: Making the Transition from Grad School to a Paying Job
During grad school, students learn SAS® in class or on their own for a research project. Time is limited, so faculty have to focus on what they know are the fundamental skills that students need to successfully complete their coursework. However, real-world research projects are often multifaceted and require a variety of SAS skills. When students transition from grad school to a paying job, they might find that in order to be successful, they need more than the basic SAS skills that they learned in class. This paper highlights 10 insights that I've had over the past year during my transition from grad school to a paying SAS research job. I hope this paper will help other students make a successful transition. Top 10 insights: 1. You still get graded, but there is no syllabus. 2. There isn't time for perfection. 3. Learn to use your resources. 4. There is more than one solution to every problem. 5. Asking for help is not a weakness. 6. Working with a team is required. 7. There is more than one type of SAS®. 8. The skills you learned in school are just the basics. 9. Data is complicated and often frustrating. 10. You will continue to learn both personally and professionally.
Read the paper (PDF).
Lauren Hall, Baylor Scott & White Health
Elisa Priest, Texas A&M University Health Science Center
Paper 3348-2015:
From ETL to ETL VCM: Ensure the Success of a Data Migration Project through a Flexible Data Quality Framework Using SAS® Data Management
Data quality is now more important than ever before. According to Gartner (2011), poor data quality is the primary reason why 40% of all business initiatives fail to achieve their targeted benefits. As a response, Deloitte Belgium has created an agile data quality framework using SAS® Data Management to rapidly identify and resolve root causes of data quality issues to jump-start business initiatives, especially a data migration one. Moreover, the approach uses both standard SAS Data Management functionalities (such as standardization, parsing, etc.) and advanced features (such as using macros, dynamic profiling in deployment mode, extracting the profiling results, etc.), allowing the framework to be agile and flexible and to maximize the reusability of specific components built in SAS Data Management.
Read the paper (PDF).
Yves Wouters, Deloitte
Valérie Witters, Deloitte
Paper 3380-2015:
From a One-Horse to a One-Stoplight Town: A Base SAS® Solution to Preventing Data Access Collisions Using Shared and Exclusive File Locks
Data access collisions occur when two or more processes attempt to gain concurrent access to a single data set. Collisions are a common obstacle to SAS® practitioners in multi-user environments. As SAS instances expand to infrastructures and ultimately empires, the inherent increased complexities must be matched with commensurately higher code quality standards. Moreover, permanent data sets will attract increasingly more devoted users and automated processes clamoring for attention. As these dependencies increase, so too does the likelihood of access collisions that, if unchecked or unmitigated, lead to certain process failure. The SAS/SHARE® module offers concurrent file access capabilities, but causes a (sometimes dramatic) reduction in processing speed, must be licensed and purchased separately from Base SAS®, and is not a viable solution for many organizations. Previously proposed solutions in Base SAS use a busy-wait spinlock cycle to repeatedly attempt file access until process success or timeout. While effective, these solutions are inefficient because they generate only read-write locked data sets that unnecessarily prohibit access by subsequent read-only requests. This presentation introduces the %LOCKITDOWN macro that advances previous solutions by affording both read-write and read-only lock testing and deployment. Moreover, recognizing the responsibility for automated data processes to be reliable, robust, and fault tolerant, %LOCKITDOWN is demonstrated in the context of a macro-based exception handling paradigm.
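The underlying mechanism is the Base SAS LOCK statement and its &SYSLCKRC return code; a highly simplified retry loop might look like the following (the library and member names are illustrative, and the %LOCKITDOWN macro itself adds read-only lock testing and exception handling beyond this sketch):

%macro trylock(member=, retries=10, waitsec=5);
   %local i rc;
   %do i = 1 %to &retries;
      lock &member;
      %if &syslckrc = 0 %then %do;
         %put NOTE: exclusive lock obtained on &member;
         %return;
      %end;
      %put NOTE: &member busy, retry &i of &retries;
      %let rc = %sysfunc(sleep(&waitsec, 1));
   %end;
   %put ERROR: could not lock &member after &retries attempts;
%mend trylock;

%trylock(member=perm.master)
/* ... update perm.master ... */
lock perm.master clear;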
Read the paper (PDF).
Troy Hughes, Datmesis Analytics
Paper SAS1580-2015:
Functional Modeling of Longitudinal Data with the SSM Procedure
In many studies, a continuous response variable is repeatedly measured over time on one or more subjects. The subjects might be grouped into different categories, such as cases and controls. The study of resulting observation profiles as functions of time is called functional data analysis. This paper shows how you can use the SSM procedure in SAS/ETS® software to model these functional data by using structural state space models (SSMs). A structural SSM decomposes a subject profile into latent components such as the group mean curve, the subject-specific deviation curve, and the covariate effects. The SSM procedure enables you to fit a rich class of structural SSMs, which permit latent components that have a wide variety of patterns. For example, the latent components can be different types of smoothing splines, including polynomial smoothing splines of any order and all L-splines up to order 2. The SSM procedure efficiently computes the restricted maximum likelihood (REML) estimates of the model parameters and the best linear unbiased predictors (BLUPs) of the latent components (and their derivatives). The paper presents several real-life examples that show how you can fit, diagnose, and select structural SSMs; test hypotheses about the latent components in the model; and interpolate and extrapolate these latent components.
Read the paper (PDF). | Download the data file (ZIP).
Rajesh Selukar, SAS
Paper 3142-2015:
Fuzzy Matching
Quality measurement is increasingly important in the health-care sphere for both performance optimization and reimbursement. Treatment of chronic conditions is a key area of quality measurement. However, medication compendiums change frequently, and health-care providers often free text medications into a patient's record. Manually reviewing a complete medications database is time consuming. In order to build a robust medications list, we matched a pharmacist-generated list of categorized medications to a raw medications database that contained names, name-dose combinations, and misspellings. The matching procedure we used is called PROC COMPGED. We were able to combine a truncation function and an upcase function to optimize the output of PROC COMPGED. Using these combinations and manipulating the scoring metric of PROC COMPGED enabled us to narrow the database list to medications that were relevant to our categories. This process transformed a tedious task for PROC COMPARE or an Excel macro into a quick and efficient method of matching. The task of sorting through relevant matches was still conducted manually, but the time required to do so was significantly decreased by the fuzzy match in our application of PROC COMPGED.
Read the paper (PDF).
Arti Virkud, NYC Department of Health
G
Paper 2787-2015:
GEN_OMEGA2: A SAS® Macro for Computing the Generalized Omega-Squared Effect Size Associated with Analysis of Variance Models
Effect sizes are strongly encouraged to be reported in addition to statistical significance and should be considered in evaluating the results of a study. The choice of an effect size for ANOVA models can be confusing because indices might differ depending on the research design as well as the magnitude of the effect. Olejnik and Algina (2003) proposed the generalized eta-squared and omega-squared effect sizes, which are comparable across a wide variety of research designs. This paper provides a SAS® macro for computing the generalized omega-squared effect size associated with analysis of variance models by using data from PROC GLM ODS tables. The paper provides the macro code, as well as results from an executed example of the macro.
Read the paper (PDF).
Anh Kellermann, University of South Florida
Yi-hsin Chen, USF
Jeffrey Kromrey, University of South Florida
Thanh Pham, USF
Patrice Rasmussen, USF
Patricia Rodriguez de Gil, University of South Florida
Jeanine Romano, USF
Paper SAS1852-2015:
Garbage In, Gourmet Out: How to Leverage the Power of the SAS® Quality Knowledge Base
Companies spend vast amounts of resources developing and enhancing proprietary software to clean their business data. Save time and obtain more accurate results by leveraging the SAS® Quality Knowledge Base (QKB), formerly a DataFlux® Data Quality technology. Tap into the existing QKB rules for cleansing contact information or product data, or easily design your own custom rules using the QKB editing tools. The QKB enables data management operations such as parsing, standardization, and fuzzy matching for contact information such as names, organizations, addresses, and phone numbers, or for product data attributes such as materials, colors, and dimensions. The QKB supports data in native character sets in over 38 locales. A single QKB can be shared by multiple SAS® Data Management installations across your enterprise, ensuring consistent results on workstations, servers, and massive parallel processing systems such as Hadoop. In this breakout, a SAS R&D manager demonstrates the power and flexibility of the QKB, and answers your questions about how to deploy and customize the QKB for your environment.
Read the paper (PDF).
Brian Rineer, SAS
Paper 1886-2015:
Getting Started with Data Governance
While there has been tremendous progress in technologies related to data storage, high-performance computing, and advanced analytic techniques, organizations have only recently begun to comprehend the importance of parallel strategies that help manage the cacophony of concerns around access, quality, provenance, data sharing, and use. While data governance is not new, the drumbeat around it, along with master data management and data quality, is approaching a crescendo. Intensified by the increase in consumption of information, expectations about ubiquitous access, and highly dynamic visualizations, these factors are also circumscribed by security and regulatory constraints. In this paper, we provide a summary of what data governance is and its importance. We go beyond the obvious and provide practical guidance on what it takes to build out a data governance capability appropriate to the scale, size, and purpose of the organization and its culture. Moreover, we discuss best practices in the form of requirements that highlight what we think is important to consider as you provide that tactical linkage between people, policies, and processes to the actual data lifecycle. To that end, our focus includes the organization and its culture, people, processes, policies, and technology. Further, we include discussions of organizational models as well as the role of the data steward, and provide guidance on how to formalize data governance into a sustainable set of practices within your organization.
Read the paper (PDF). | Watch the recording.
Greg Nelson, ThotWave
Lisa Dodson, SAS
Paper SAS2602-2015:
Getting Started with Enabling Your End-User Applications to Use SAS® Grid Manager 9.4
A SAS® Grid Manager environment provides your organization with a powerful and flexible way to manage many forms of SAS® computing workloads. For the business and IT user community, the benefits range from data management jobs that effectively utilize the available processing resources to complex analyses run in parallel, along with reassurance that statutory reports are generated in a highly available environment. This workshop begins the process of familiarizing users with the core concepts of how to grid-enable tasks within SAS® Studio, SAS® Enterprise Guide®, SAS® Data Integration Studio, and SAS® Enterprise Miner™ client applications.
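For a taste of what grid-enabling looks like in code (outside the client GUIs), a SAS/CONNECT session can be routed through the grid with the grid service functions; the application server name below is an assumption:

    %let rc = %sysfunc(grdsvc_enable(_all_, resource=SASApp));  /* route SIGNONs to the grid */

    signon gridjob;                /* session is launched on a grid node       */
    rsubmit gridjob wait=no;       /* run this block there, asynchronously     */
       proc means data=sashelp.cars;
          var msrp invoice;
       run;
    endrsubmit;

    waitfor _all_ gridjob;         /* wait for the grid work to finish         */
    signoff gridjob;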
Edoardo Riva, SAS
Paper SAS4121-2015:
Getting Started with Logistic Regression in SAS
This presentation provides a brief introduction to logistic regression analysis in SAS. Learn the differences between linear regression and logistic regression, including ordinary least squares versus maximum likelihood estimation. Learn how to understand LOGISTIC procedure syntax, use continuous and categorical predictors, and interpret output from ODS Graphics.
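A minimal example of the kind of model covered, using a SASHELP data set; the variables chosen here are only for illustration:

    ods graphics on;
    proc logistic data=sashelp.heart plots(only)=(effect oddsratio);
       class sex / param=ref;                       /* categorical predictor */
       model status(event='Dead') = sex cholesterol ageatstart;
    run;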
Danny Modlin, SAS
Paper SAS4140-2015:
Getting Started with Mixed Models in Business
For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now they are gaining favor in business statistics to give better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn how predictions based on a mixed model differ from predictions in ordinary regression, and see examples of mixed models with business data.
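A sketch of a business-flavored mixed model; the data set and variables are hypothetical. The OUTP= predictions include the random-effect (region-specific) adjustments, while OUTPM= gives the population-average predictions an ordinary regression would produce:

    proc mixed data=sales_history;
       class region rep;
       model monthly_sales = price promo_spend / solution
             outp=pred_region outpm=pred_average;
       random intercept / subject=region;   /* naturally occurring groups */
    run;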
Catherine Truxillo, SAS
Paper SAS4122-2015:
Getting Started with SAS® Contextual Analysis: Easily build models from unstructured data
Text data constitutes more than half of the unstructured data held in organizations. Buried within the narrative of customer inquiries, the pages of research reports, and the notes in servicing transactions are the details that describe concerns, ideas, and opportunities. The manual effort historically needed to develop a training corpus is no longer required, making it simpler to gain insight buried in unstructured text. With the ease of machine learning refined with the specificity of linguistic rules, SAS Contextual Analysis helps analysts identify and evaluate the meaning of the electronic written word. From a single point-and-click GUI, the process of developing text models is guided and visually intuitive. This presentation will walk through the text model development process with SAS Contextual Analysis. The results are in SAS format, ready for text-based insights to be used in any other SAS application.
George Fernandez, SAS
Paper SAS4123-2015:
Getting Started with Time Series Data and Forecasting in SAS
SAS/ETS provides many tools to improve the productivity of the analyst who works with time series data. This tutorial will take an analyst through the process of turning transaction-level data into a time series. The session will then cover some basic forecasting techniques that use past fluctuations to predict future events. We will then extend this modeling technique to include explanatory factors in the prediction equation.
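A hedged sketch of the workflow described: accumulate transaction records into a monthly series, then fit a simple exponential smoothing model; the data set and variable names are assumptions:

    proc timeseries data=transactions out=monthly;
       id txn_date interval=month accumulate=total setmissing=0;
       var amount;
    run;

    proc esm data=monthly outfor=forecasts lead=12;
       id txn_date interval=month;
       forecast amount / model=addwinters;     /* additive Winters smoothing */
    run;

Explanatory factors can then be brought in with, for example, the INPUT= option of PROC ARIMA's ESTIMATE statement.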
Kenneth Sanford, SAS
Paper 3432-2015:
Getting Your Hands on Reproducible Graphs
Learning the Graph Template Language (GTL) might seem like a daunting task. However, creating customized graphics with SAS® is quite easy using many of the tools offered with Base SAS® software. The point-and-click interface of ODS Graphics Designer provides us with a tool that can be used to generate highly polished graphics and to store the GTL-based code that creates them. This opens the door for users who would like to create canned graphics that can be used on various data sources, variables, and variable types. In this hands-on training, we explore the use of ODS Graphics Designer to create sophisticated graphics and to save the template code. We then discuss modifications using basic SAS macros in order to create stored graphics code that is flexible enough to accommodate a wide variety of situations.
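For context, the kind of GTL code that ODS Graphics Designer stores behind a graph looks roughly like this; it is the sort of template the macro-based generalization would wrap with parameters:

    proc template;
       define statgraph fit_by_group;
          begingraph;
             entrytitle 'Weight by Height';
             layout overlay;
                scatterplot x=height y=weight / group=sex name='sc';
                regressionplot x=height y=weight;
                discretelegend 'sc';
             endlayout;
          endgraph;
       end;
    run;

    proc sgrender data=sashelp.class template=fit_by_group;
    run;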
Read the paper (PDF).
Rebecca Ottesen, City of Hope and Cal Poly SLO
Leanne Goldstein, City of Hope
Paper 3474-2015:
Getting your SAS® Program to Do Your Typing for You!
Do you have a SAS® program that requires adding filenames to the input every time you run it? Aren't you tired of having to check for the files, check the names, and type them in? Check out how my SAS® Enterprise Guide® project checks for files, figures out the file names, and saves me from having to type in the file names for the input data files!
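One way such a project can discover its own input files is with the external-file functions; the directory path and file extension below are placeholders:

    filename indir '/data/incoming';            /* path is an assumption */

    data input_files(keep=fname);
       length fname $256;
       did = dopen('indir');
       do i = 1 to dnum(did);
          fname = dread(did, i);
          if upcase(scan(fname, -1, '.')) = 'CSV' then output;
       end;
       rc = dclose(did);
    run;

    proc sql noprint;                           /* hand the newest name to later steps */
       select fname into :latest_file trimmed
       from input_files
       having fname = max(fname);
    quit;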
Read the paper (PDF). | Download the data file (ZIP).
Nancy Wilson, Ally
Paper 2441-2015:
Graphing Made Easy with SGPLOT and SGPANEL Procedures
When ODS Graphics was introduced a few years ago, it gave SAS users an array of new ways to generate graphs. One of those ways is with Statistical Graphics procedures. Now, with just a few lines of simple code, you can create a wide variety of high-quality graphs. This paper shows how to produce single-celled graphs using PROC SGPLOT and paneled graphs using PROC SGPANEL. This paper also shows how to send your graphs to different ODS destinations, how to apply ODS styles to your graphs, and how to specify properties of graphs, such as format, name, height, and width.
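Two short examples in the spirit of the paper, using SASHELP data and the PDF destination as one possible ODS target:

    ods graphics on;

    proc sgplot data=sashelp.class;            /* single-celled graph */
       scatter x=height y=weight / group=sex;
       reg x=height y=weight;
    run;

    ods pdf file='graphs.pdf' style=journal;   /* route output to another destination */
    proc sgpanel data=sashelp.heart;           /* one cell per PANELBY value */
       panelby sex / columns=2;
       histogram cholesterol;
    run;
    ods pdf close;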
Read the paper (PDF).
Susan Slaughter, Avocet Solutions
Paper SAS1780-2015:
Graphs Are Easy with SAS® 9.4
Axis tables, polygon plots, text plots, and more features have been added to Statistical Graphics (SG) procedures and Graph Template Language (GTL) for SAS® 9.4. These additions are a direct result of your feedback and are designed to make creating graphs easier. Axis tables let you add multiple tables of data to your graphs, correctly aligned with the axis values and using the right colors for group values in your data. Text plots can have rotated and aligned text anywhere in the graph. You can overlay jittered markers on box plots, use images and font glyphs as markers, specify group attributes without making style changes, and create entirely new custom graphs using the polygon plot. All this without using the annotation facility, which is now supported for both SG procedures and GTL. This paper guides you through these exciting new features now available in SG procedures and GTL.
Read the paper (PDF). | Download the data file (ZIP).
Sanjay Matange, SAS
Paper 3275-2015:
Great Performances: SAS® Visual Analytics Performance Monitoring and Enhancement
At the University of North Carolina at Chapel Hill, we had the pleasure of rolling out a strong enterprise-wide SAS® Visual Analytics environment in 10 months, with strong support from SAS. We encountered many bumps in the road, moments of both mountain highs and worrisome lows, as we learned what we could and could not do, and new ways to accomplish our goals. Our journey started in December of 2013 when a decision was made to try SAS Visual Analytics for all reporting, and incorporate other solutions only if and when we hit an insurmountable obstacle. We are still strongly using SAS Visual Analytics and are augmenting the tools with additional products. Along the way, we learned a number of things about the SAS Visual Analytics environment that are gems, whether one is relatively new to SAS® or an old hand. Measuring what is happening is paramount to knowing what constraints exist in the system before trying to enhance performance. Targeted improvements help if measurements can be made before and after each alteration. There are a few architectural alterations that can help in general, but we have seen that measuring is the guaranteed way to know what the problems are and whether the cures were effective.
Read the paper (PDF).
Jonathan Pletzke, UNC Chapel Hill
Paper 3198-2015:
Gross Margin Percent Prediction: Using the Power of SAS® Enterprise Miner™ 12.3 to Predict the Gross Margin Percent for a Steel Manufacturing Company
Predicting the profitability of future sales orders in a price-sensitive, highly competitive make-to-order market can create a competitive advantage for an organization. Order size and specifications vary from order to order and customer to customer, and might or might not be repeated. While it is the intent of the sales groups to take orders for a profit, because of the volatility of steel prices and the competitive nature of the markets, gross margins can range dramatically from one order to the next and in some cases can be negative. Understanding the key factors affecting the gross margin percent and their impact can help the organization to reduce the risk of non-profitable orders and at the same time improve their decision-making ability on market planning and forecasting. The objective of this paper is to identify the best model amongst multiple predictive models inside SAS® Enterprise Miner™, which could accurately predict the gross margin percent for future orders. The data used for the project consisted of over 30,000 transactional records and 33 input variables. The sales records have been collected from multiple manufacturing plants of the steel manufacturing company. Variables such as order quantity, customer location, sales group, and others were used to build predictive models. The target variable gross margin percent is the net profit on the sales, considering all the factors such as labor cost, cost of raw materials, and so on. The model comparison node of SAS Enterprise Miner was used to determine the best among different variations of regression models, decision trees, and neural networks, as well as ensemble models. Average squared error was used as the fit statistic to evaluate each model's performance. Based on the preliminary model analysis, the ensemble model outperforms other models with the least average square error.
Read the paper (PDF).
Kushal Kathed, Oklahoma State University
Patti Jordan
Ayush Priyadarshi, Oklahoma State University
H
Paper SAS1722-2015:
HTML5 and SAS® Mobile BI: Empowering Business Managers with Analytics and Business Intelligence
Business managers are seeing the value of incorporating business information and analytics in daily decision-making with real-time information, when and where it is needed during business meetings and customer engagements. Real-time access to customer and business information reduces the latency in decision-making with confidence and accuracy, increasing the overall efficiency of the company. SAS is introducing new product options with HTML5 and adding advanced features in SAS® Mobile BI in SAS® Visual Analytics 7.2 to extend business managers' reach to SAS® analytics and dashboards from SAS Visual Analytics and to enhance their experience. With SAS Mobile BI 7.2, SAS will push the limits of a business user's ability to author and change the content of dashboards and reports on mobile devices. This presentation focuses on both the new HTML5-based product options and the new advancements made with SAS Mobile BI that empower business users. We present in detail the scope and new features that are offered with the HTML5-based viewer and with SAS Mobile BI from SAS Visual Analytics. Since the new HTML5-based viewer and SAS Mobile BI are the viewer options for business users to visualize and consume the content from SAS Visual Analytics, this presentation demonstrates the two products in detail. Key product capabilities are demonstrated.
Read the paper (PDF).
Murali Nori, SAS
Paper SAS1857-2015:
Hands-Off SAS® Administration--Using Batch Tools to Make Your Life Easier
As a SAS® Intelligence Platform Administrator, have your eyes ever glazed over as you performed repetitive tasks in SAS® Management Console or some other administrative user interface? Perhaps you're setting up metadata for a new department, managing a set of backups, or promoting content between dev, test, and prod environments. Did you know there is a large library of batch utilities to help you automate many of these common administration tasks? This paper explores content reporting and management utilities, such as viewing authorizations or relationships between content, as well as administrative tasks such as analyzing, creating, or deleting metadata repositories or performing a backup of the system. The batch utilities can be incorporated into scripts so that you can run them repeatedly on either an ad hoc or scheduled basis. Give your mouse a rest and save yourself some time.
Read the paper (PDF).
Eric Bourn, SAS
Amy Peters, SAS
Bryan Wolfe, SAS
Paper 2506-2015:
Hands-on SAS® Macro Programming Essentials for New Users
The SAS® Macro Language is a powerful tool for extending the capabilities of the SAS® System. This hands-on workshop teaches essential macro coding concepts, techniques, tips, and tricks to help beginning users learn the basics of how the macro language works. Using a collection of proven macro language coding techniques, attendees learn how to write and process macro statements and parameters; replace text strings with macro (symbolic) variables; generate SAS code using macro techniques; manipulate macro variable values with macro functions; create and use global and local macro variables; construct simple arithmetic and logical expressions; interface the macro language with the SQL procedure; store and reuse macros; troubleshoot and debug macros; and develop efficient and portable macro language code.
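A small, self-contained example of the kind of parameter-driven macro the workshop builds up to; the data set names passed in are just examples:

    %macro summarize(dsnames=, stats=mean std);
       %local i ds;
       %let i = 1;
       %do %while (%scan(&dsnames, &i, %str( )) ne );
          %let ds = %scan(&dsnames, &i, %str( ));
          title "Summary statistics for &ds";
          proc means data=&ds &stats;
          run;
          %let i = %eval(&i + 1);
       %end;
       title;
    %mend summarize;

    %summarize(dsnames=sashelp.class sashelp.cars)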
Read the paper (PDF).
Kirk Paul Lafler, Software Intelligence Corporation
Paper 3485-2015:
Health Services Research Using Electronic Health Record Data: A Grad Student How-To Paper
Graduate students encounter many challenges when conducting health services research using real world data obtained from electronic health records (EHRs). These challenges include cleaning and sorting data, summarizing and identifying present-on-admission diagnosis codes, identifying appropriate metrics for risk-adjustment, and determining the effectiveness and cost effectiveness of treatments. In addition, outcome variables commonly used in health service research are not normally distributed. This necessitates the use of nonparametric methods in statistical analyses. This paper provides graduate students with the basic tools for the conduct of health services research with EHR data. We will examine SAS® tools and step-by-step approaches used in an analysis of the effectiveness and cost-effectiveness of the ABCDE (Awakening and Breathing Coordination, Delirium monitoring/management, and Early exercise/mobility) bundle in improving outcomes for intensive care unit (ICU) patients. These tools include the following: (1) ARRAYS; (2) lookup tables; (3) LAG functions; (4) PROC TABULATE; (5) recycled predictions; and (6) bootstrapping. We will discuss challenges and lessons learned in working with data obtained from the EHR. This content is appropriate for beginning SAS users.
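A small illustration of two of the listed tools (arrays and the LAG function) on hypothetical EHR-style data; the diagnosis codes and variable names are placeholders:

    data icu_flags;
       set icu_days;
       by patient_id;
       array dx{10} $ dx1-dx10;                   /* admission diagnosis codes */
       poa_delirium = 0;
       do i = 1 to dim(dx);
          if dx{i} in ('F05', '293.0') then poa_delirium = 1;
       end;
       prev_cost = lag(daily_cost);               /* LAG is evaluated on every row...   */
       if first.patient_id then prev_cost = .;    /* ...so reset it at each new patient */
       cost_change = daily_cost - prev_cost;
       drop i;
    run;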
Read the paper (PDF).
Ashley Collinsworth, Baylor Scott & White Health/Tulane University
Elisa Priest, Texas A&M University Health Science Center
Paper SAS1704-2015:
Helpful Hints for Transitioning to SAS® 9.4
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you are getting oriented and will provide short, straightforward answers. We also share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS® Customer Experience Testing shares insiders' practical tips to ensure that you are ready to begin your transition to SAS 9.4. The target audience for this paper is primarily system administrators who will be installing, configuring, or administering the SAS 9.4 environment. (This paper is an updated version of the paper presented at SAS Global Forum 2014 and includes new features and software changes since the original paper was delivered, plus any relevant content that still applies. This paper includes information specific to SAS 9.4 and SAS 9.4 maintenance releases.)
Read the paper (PDF).
Cindy Taylor, SAS
Paper SAS1747-2015:
Helping You C What You Can Do with SAS®
SAS® users are already familiar with the FCMP procedure and the flexibility it provides them in writing their own functions and subroutines. However, did you know that FCMP also allows you to call functions written in C? Did you know that you can create and populate complex C structures and use C types in FCMP? With the PROTO procedure, you can define function prototypes, structures, enumeration types, and even small bits of C code. This paper gets you started on how to use the PROTO procedure and, in turn, how to call your C functions from within FCMP and SAS.
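A minimal sketch of the pattern: PROC PROTO compiles a small C function, PROC FCMP wraps it, and the DATA step calls the wrapper. The package and function names are arbitrary choices for illustration:

    proc proto package=work.protos.cfun;
       double c_add(double a, double b);
       externc c_add;
       double c_add(double a, double b) {
          return a + b;                 /* C code compiled by PROC PROTO */
       }
       externcend;
    run;

    proc fcmp inlib=work.protos outlib=work.funcs.wrap;
       function add2(x, y);
          return (c_add(x, y));         /* FCMP wrapper around the C routine */
       endsub;
    run;

    options cmplib=(work.funcs work.protos);
    data _null_;
       z = add2(3, 4);
       put z=;                          /* 7 */
    run;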
Read the paper (PDF). | Download the data file (ZIP).
Andrew Henrick, SAS
Karen Croft, SAS
Donald Erdman, SAS
Paper SAS1812-2015:
Hey! SAS® Federation Server Is Virtualizing 'Big Data'!
In this session, we discuss the advantages of SAS® Federation Server and how it makes it easier for business users to access secure data for reports and use analytics to drive accurate decisions. This frees up IT staff to focus on other tasks by giving them a simple method of sharing data using a centralized, governed, security layer. SAS Federation Server is a data server that provides scalable, threaded, multi-user, and standards-based data access technology in order to process and seamlessly integrate data from multiple data repositories. The server acts as a hub that provides clients with data by accessing, managing, and sharing data from multiple relational and non-relational data sources as well as from SAS® data. Users can view data in big data sources like Hadoop, SAP HANA, Netezza, or Teradata, and blend them with existing database systems like Oracle or DB2. Security and governance features, such as data masking, ensure that the right users have access to the data and reduce the risk of exposure. Finally, data services are exposed via a REST API for simpler access to data from third-party applications.
Read the paper (PDF).
Ivor Moan, SAS
Paper 3181-2015:
How Best to Effectively Teach the SAS® Language
Learning a new programming language is not an easy task, especially for someone who does not have any programming experience. Learning the SAS® programming language can be even more challenging. One of the reasons is that the SAS System consists of a variety of languages, such as the DATA step language, SAS macro language, Structured Query Language for the SQL procedure, and so on. Furthermore, each of these languages has its own unique characteristics and simply learning the syntax is not sufficient to grasp the language essence. Thus, it is not unusual to hear about someone who has learned SAS for several years and has never become a SAS programming expert. By using the DATA step language as an example, I would like to share some of my experiences on effectively teaching the SAS language.
Read the paper (PDF).
Arthur Li, City of Hope National Medical Center
Paper SPON1000-2015:
How Big Data Provides Epsilon a Competitive Advantage
Big data is quickly moving from buzzword to critical tool for today's analytics applications. It can be easy to get bogged down by Apache Hadoop terminology, but when you get down to it, big data is about empowering organizations to deliver the right message or product to the right audience at the right time. Find out how Epsilon built a next-generation marketing application, leveraging Cloudera and taking advantage of SAS® capabilities by our data science/analytics team, that provides its clients with a 360-degree view of their customers. Join Bob Zurek, Senior Vice President of Products at Epsilon to hear how this new big data solution is enhancing customer service and providing a significant competitive differentiation.
Bob Zurek, Epsilon
Paper 2685-2015:
How Do You Use Lookup Tables?
No matter what type of programming you do in a pharmaceutical environment, there will eventually be a need to combine your data with a lookup table. This lookup table could be a code list for adverse events, a list of names for visits, one of your own summary data sets containing totals that you will be using to calculate percentages, or you might have your favorite way to incorporate it. This paper describes and discusses the reasons for using five different simple ways to merge data sets with lookup tables, so that when you take over the maintenance of a new program, you will be ready for anything!
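One of the several approaches might look like this hash-object lookup; the data set and variable names are invented for illustration:

    data coded_aes;
       if 0 then set ae_codelist;                   /* defines lookup variables in the PDV */
       if _n_ = 1 then do;
          declare hash lk(dataset: 'ae_codelist');
          lk.defineKey('ae_term');
          lk.defineData('ae_code', 'body_system');
          lk.defineDone();
       end;
       set adverse_events;
       if lk.find() ne 0 then call missing(ae_code, body_system);   /* no match found */
    run;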
Read the paper (PDF).
Philip Holland, Holland Numerics Limited
Paper 3431-2015:
How Latent Analyses within Survey Data Can Be Valuable Additions to Any Regression Model
This study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Three different analyses for latent variable discovery are briefly reviewed and explored. The procedures explored in this paper are PROC LCA, PROC LTA, PROC CATMOD, PROC FACTOR, PROC TRAJ, and PROC SURVEYLOGISTIC. The analyses defined through these procedures are latent profile analyses, latent class analyses, and latent transition analyses. The latent variables are included in three separate regression models. The effect of the latent variables on the fit and use of the regression model compared to a similar model using observed data is briefly reviewed. The data used for this study were obtained from the National Longitudinal Study of Adolescent Health (Add Health). Data were analyzed using SAS® 9.3. This paper is intended for any level of SAS® user. This paper is also aimed at an audience with a background in behavioral science or statistics.
Read the paper (PDF).
Deanna Schreiber-Gregory, National University
Paper SAS1708-2015:
How SAS® Uses SAS to Analyze SAS Blogs
SAS® blogs (hosted at http://blogs.sas.com/content) attract millions of page views annually. With hundreds of authors, thousands of posts, and constant chatter within the blog comments, it's impossible for one person to keep track of all of the activity. In this paper, you learn how SAS technology is used to gather data and report on SAS blogs from the inside out. The beneficiaries include personnel from all over the company, including marketing, technical support, customer loyalty, and executives. The author describes the business case for tracking and reporting on the activity of blogging. You learn how SAS tools are used to access the WordPress database and how to create a 'blog data mart' for reporting and analytics. The paper includes specific examples of the insight that you can gain from examining the blogs analytically, and which techniques are most useful for achieving that insight. For example, the blog transactional data are combined with social media metrics (also gathered by using SAS) to show which blog entries and authors yield the most engagement on Twitter, Facebook, and LinkedIn. In another example, we identified the growing trend of 'blog comment spam' on the SAS blog properties and measured its cost to the business. These metrics helped to justify the investment in a solution. Many of the tools used are part of SAS® Foundation, including SAS/ACCESS®, the DATA step and SQL, PROC REPORT, PROC SGPLOT, and more. The results are shared in static reports, automated daily email summaries, dynamic reports hosted in SAS/IntrNet®, and even a corporate dashboard hosted in SAS® Visual Analytics.
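A hedged sketch of the first step, reading the WordPress tables through SAS/ACCESS (shown here with the MySQL engine; the connection details, and whether MySQL is the engine actually used at SAS, are assumptions):

    libname wp mysql server='blogs-db.example.com' database=wordpress
                     user=reporter password='XXXXXXXX';

    proc sql;
       create table work.posts_by_author as
       select post_author, count(*) as posts
       from wp.wp_posts
       where post_status = 'publish'
       group by post_author;
    quit;

    proc sgplot data=work.posts_by_author;
       vbar post_author / response=posts;
    run;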
Read the paper (PDF).
Chris Hemedinger, SAS
Paper 3214-2015:
How is Your Health? Using SAS® Macros, ODS Graphics, and GIS Mapping to Monitor Neighborhood and Small-Area Health Outcomes
With the constant need to inform researchers about neighborhood health data, the Santa Clara County Health Department created socio-demographic and health profiles for 109 neighborhoods in the county. Data was pulled from many public and county data sets, compiled, analyzed, and automated using SAS®. With over 60 indicators and 109 profiles, an efficient set of macros was used to automate the calculation of percentages, rates, and mean statistics for all of the indicators. Macros were also used to automate individual census tracts into pre-decided neighborhoods to avoid data entry errors. Simple SQL procedures were used to calculate and format percentages within the macros, and output was pushed out using Output Delivery System (ODS) Graphics. This output was exported to Microsoft Excel, which was used to create a sortable database for end users to compare cities and/or neighborhoods. Finally, the automated SAS output was used to map the demographic data using geographic information system (GIS) software at three geographies: city, neighborhood, and census tract. This presentation describes the use of simple macros and SAS procedures to reduce resources and time spent on checking data for quality assurance purposes. It also highlights the simple use of ODS Graphics to export data to an Excel file, which was used to mail merge the data into 109 unique profiles. The presentation is aimed at intermediate SAS users at local and state health departments who might be interested in finding an efficient way to run and present health statistics given limited staff and resources.
Read the paper (PDF).
Roshni Shah, Santa Clara County
Paper 3273-2015:
How to Create a UNIX Space Management Report Using SAS®
Storage space on a UNIX platform is a costly and finite resource to maintain, even under ideal conditions. By regularly monitoring and promptly responding to space limitations that might occur during production, an organization can mitigate the risk of wasted expense, time, and effort caused by this problem. SAS® programmers at Truven Health Analytics have designed a reporting tool to measure space usage by a number of distinct factors over time. Using tabular and graphical output, the tool provides a full picture of what often contributes to critical reductions of available hardware space. It enables managers and users to respond appropriately and effectively whenever this occurs. It also helps to identify ways to encourage more efficient practices, thereby minimizing the likelihood of this occurring in the future. Environment: Red Hat Enterprise Linux (RHEL) 5.4 on an Oracle Sun Fire X4600 M2, running SAS® 9.3 TS1M1.
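One building block for such a tool is reading operating-system output straight into SAS through a pipe; the path below is a placeholder:

    filename duout pipe "du -sk /sasdata/* 2>/dev/null";   /* kilobytes per directory */

    data space_usage;
       infile duout dlm='09'x truncover;
       length dirname $200;
       input kb dirname $;
       gb = kb / (1024*1024);
    run;

    proc sgplot data=space_usage;
       hbar dirname / response=gb;
       xaxis label='Space used (GB)';
    run;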
Matthew Shevrin, Truven Health Analytics
Paper 3143-2015:
How to Cut the Time of Processing of Ryan White Services (RSR) Data by 99% and More.
The widely used method to convert RSR XML data to some standard, ready-to-process database uses a Visual Basic mapper as a buffer tool when reading XML data (for example, into an MS Access database). This paper describes the shortcomings of this method with respect to the different schemas of RSR data and offers a SAS® macro that enables users to read any schema of RSR data directly into a SAS relational database. This macro entirely eliminates the step of creating an MS Access database. Using our macro, the user can cut the time of processing of Ryan White data by 99% and more, depending on the number of files that need to be processed in one run.
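The core idea, reading the XML directly with the XMLV2 LIBNAME engine and an XML map, looks roughly like this (file and map names are placeholders; the paper's macro generalizes this across RSR schemas):

    filename rsrxml '/rsr/client_2014.xml';
    filename rsrmap '/rsr/rsr_schema.map';

    libname rsrxml xmlv2 xmlmap=rsrmap access=readonly;

    proc copy in=rsrxml out=work;      /* one SAS data set per mapped table */
    run;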
Read the paper (PDF).
Michael Costa, Abt Associates
Fizza Gillani, Brown University and Lifespan/Tufts/Brown Center for AIDS Research.
Paper 3185-2015:
How to Hunt for Utility Customer Electric Usage Patterns Armed with SAS® Visual Statistics with Hadoop and Hive
Your electricity usage patterns reveal a lot about your family and routines. Information collected from electrical smart meters can be mined to identify patterns of behavior that can in turn be used to help change customer behavior for the purpose of altering system load profiles. Demand Response (DR) programs represent an effective way to cope with rising energy needs and increasing electricity costs. The Federal Energy Regulatory Commission (FERC) defines demand response as changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to lower electricity use at times of high wholesale market prices or when system reliability is jeopardized. In order to effectively motivate customers to voluntarily change their consumption patterns, it is important to identify customers whose load profiles are similar so that targeted incentives can be directed toward these customers. Hence, it is critical to use tools that can accurately cluster similar time series patterns while providing a means to profile these clusters. In order to solve this problem, though, hardware and software that is capable of storing, extracting, transforming, loading, and analyzing large amounts of data must first be in place. Utilities receive customer data from smart meters, which track and store customer energy usage. The data collected is sent to the energy companies every fifteen minutes or hourly. With millions of meters deployed, this quantity of information creates a data deluge for utilities, because each customer generates about three thousand data points monthly, and more than thirty-six billion reads are collected annually for a million customers. The data scientist is the hunter, and DR candidate patterns are the prey in this cat-and-mouse game of finding customers willing to curtail electrical usage for a program benefit. The data scientist must connect large siloed data sources, external data, and even unstructured data to detect common customer electrical usage patterns, build dependency models, and score them against their customer population. Taking advantage of Hadoop's ability to store and process data on commodity hardware with distributed parallel processing is a game changer. With Hadoop, no data set is too large, and SAS® Visual Statistics leverages machine learning, artificial intelligence, and clustering techniques to build descriptive and predictive models. All data can be usable from disparate systems, including structured, unstructured, and log files. The data scientist can use Hadoop to ingest all available data at rest, and analyze customer usage patterns, system electrical flow data, and external data such as weather. This paper will use Cloudera Hadoop with Apache Hive queries for analysis on platforms such as SAS® Visual Analytics and SAS Visual Statistics. The paper will showcase optionality within Hadoop for querying large data sets with open-source tools and importing these data into SAS® for robust customer analytics, clustering customers by usage profiles, propensity to respond to a demand response event, and an electrical system analysis for Demand Response events.
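A hedged sketch of pulling interval data out of Hive for analysis in SAS (this requires SAS/ACCESS® Interface to Hadoop; the server, schema, and table names are placeholders):

    libname ami hadoop server='hive.example.com' port=10000 schema=metering user=sasdemo;

    proc sql;
       connect using ami;
       create table work.daily_load as
       select * from connection to ami
          (select meter_id,
                  to_date(read_ts) as read_date,
                  sum(kwh)         as kwh_day
           from   interval_reads
           group  by meter_id, to_date(read_ts));   /* HiveQL executed on the cluster */
    quit;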
Read the paper (PDF).
Kathy Ball, SAS
Paper 2883-2015:
How to Implement SAS® 9.4 on an Amazon Web Services Cloud Server Instance
The first task to accomplish our SAS® 9.4 installation goal is to create an Amazon Web Services (AWS) secured EC2 (Elastic Compute Cloud 2) instance called a Virtual Private Cloud (VPC). Through a series of wizard-driven dialog boxes, the SAS administrator selects virtual CPUs (vCPUs, which have about a 2:1 ratio to cores), memory, storage, and network performance considerations via regional availability zones. Then, there is a prompt to create a VPC that will be housed within the EC2 instance, along with a major component called subnets. A step to create a security group is next, which enables the SAS administrator to specify all of the VPC firewall port rules required for the SAS 9.4 application. Next, the EC2 instance is reviewed and a security key pair is either selected or created. Then the EC2 launches. At this point, Internet connectivity to the EC2 instance is granted by attaching an Internet gateway and its route table to the VPC and allocating and associating an elastic IP address along with a public DNS. The second major task involves establishing connectivity to the EC2 instance and a method of download for SAS software. In the case of the Linux Red Hat instance created here, PuTTY is configured to use the EC2's security key pair (.ppk file). In order to transfer files securely to the EC2 instance, a tool such as WinSCP is installed and uses the PuTTY connection for secure FTP. The Linux OS is then updated, and then VNCServer is installed and configured so that the SAS administrator can use a GUI. Finally, a Firefox web browser is installed to download the SAS® Download Manager. After downloading the SAS Download Manager, a SAS depot directory is created on the Linux file system and the SAS Download Manager is run once we have provided the software order number and SAS installation key. Once the SAS software depot has been loaded, we can verify the success of the SAS software depot's download by running the SAS depot checker. The next pre-installation task is to take care of some Linux OS housekeeping. Local users (for example, the SAS installation ID sas) and other IDs such as sassrv, lsfadmin, lsfuser, and sasdemo are created. Specific directory permissions are set for the installer ID sas. The ulimit settings for open files and max user processes are increased and directories are created for a SAS installation home and configuration directory. Some third-party tools such as Python, which are required for SAS 9.4, are installed. Then Korn shell and other required Linux packages are installed. Finally, the SAS Deployment Manager installation wizard is launched and the multiple dialog boxes are filled out, with many defaults accepted and Next clicked. SAS administrators should consider running the SAS Deployment Manager twice, first to solely install the SAS software, and then later to configure. Finally, after SAS Deployment Manager completion, SAS post-installation tasks are completed.
Read the paper (PDF).
Jeff Lehmann, Slalom Consulting
Paper 3446-2015:
How to Implement Two-Phase Regression Analysis to Predict Profitable Revenue Units
Is it a better business decision to determine profitability of all business units/kiosks and then decide to prune the nonprofitable ones? Or does model performance improve if we decide to first find the units that meet the break-even point and then try to calculate their profits? In our project, we did a two-stage regression process due to highly skewed distribution of the variables. First, we performed logistic regression to predict which kiosks would be profitable. Then, we used linear regression to predict the average monthly revenue at each kiosk. We used SAS® Enterprise Guide® and SAS® Enterprise Miner™ for the modeling process. The effectiveness of the linear regression model is much more for predicting the target variable at profitable kiosks as compared to unprofitable kiosks. The two-phase regression model seemed to perform better than simply performing a linear regression, particularly when the target variable has too many levels. In real-life situations, the dependent and independent variables can have highly skewed distributions, and two-phase regression can help improve model performance and accuracy. Some results: The logistic regression model has an overall accuracy of 82.9%, sensitivity of 92.6%, and specificity of 61.1% with comparable figures for the training data set at 81.8%, 90.7%, and 63.8% respectively. This indicates that the regression model seems to be consistently predicting the profitable kiosks at a reasonably good level. Linear regression model: For the training data set, the MAPE (mean absolute percentage errors in prediction) is 7.2% for the kiosks that earn more than $350 whereas the MAPE (mean absolute percentage errors in prediction) for kiosks that earn less than $350 is -102% for the predicted values (not log-transformed) of the target versus the actual value of the target respectively. For the validation data set, the MAPE (mean absolute percentage errors in prediction) is 7.6% for the kiosks that earn more than $350 whereas the MAPE (mean absolute percentage errors in prediction) for kiosks that earn less than $350 is -142% for the predicted values (not log-transformed) of the target versus the actual value of the target respectively. This means that the average monthly revenue figures seem to be better predicted for the model where the kiosks were earning higher than the threshold value of $350--that is, for those kiosk variables with a flag variable of 1. The model seems to be predicting the target variable with lower APE for higher values of the target variable for both the training data set above and the entire data set below. In fact, if the threshold value for the kiosks is moved to even say $500, the predictive power of the model in terms of APE will substantially increase. The validation data set (Selection Indicator=0) has fewer data points, and, therefore, the contrast in APEs is higher and more varied.
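In outline, the two-phase approach looks like this; the variable names, the 0.5 cutoff, and the log transform are illustrative assumptions:

    proc logistic data=kiosks;
       class location_type / param=ref;
       model profitable(event='1') = footfall location_type rent;
       output out=stage1 p=p_profit;
    run;

    data profitable_kiosks;
       set stage1;
       if p_profit >= 0.5;                      /* stage 1: keep predicted break-even kiosks */
    run;

    proc reg data=profitable_kiosks;
       model log_revenue = footfall rent;       /* stage 2: model the skewed revenue target */
    run;
    quit;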
Read the paper (PDF).
Shrey Tandon, Sobeys West
Paper SAS1480-2015:
How to Maintain Happy SAS® Users: A SAS Environment Primer for IT Professionals
Today's SAS® environment has large numbers of concurrent SAS processes and ever-growing data volumes. To help SAS users remain productive, SAS administrators must ensure that SAS applications have sufficient computer resources, properly configured and monitored often. Understanding how all the components of SAS work and how they will be used by your users is the first step. The guidance offered in this paper will help SAS administrators evaluate hardware, operating system, and infrastructure options for a SAS environment that will keep their SAS applications running at optimal performance and their user community happy.
Read the paper (PDF).
Margaret Crevar, SAS
Paper SAS1800-2015:
How to Tell the Best Story with Your Data Using SAS® Visual Analytics Graph Builder
How do you engage your report viewer on an emotional and intellectual level and tell the story of your data? You create a perfect graphic to tell that story using SAS® Visual Analytics Graph Builder. This paper takes you on a journey by combining and manipulating graphs to refine your data's best possible story. This paper shows how layering visualizations can create powerful and insightful viewpoints on your data. You will see how to create multiple overlay graphs, single graphs with custom options, data-driven lattice graphs, and user-defined lattice graphs to vastly enhance the story-telling power of your reports and dashboards. Some examples of custom graphs covered in this paper are: resource timelines combined with scatter plots and bubble plots to enhance project reporting, butterfly charts combined with bubble plots to provide a new way to show demographic data, and bubble change plots to highlight the journey your data has traveled. This paper will stretch your imagination and showcase the art of the possible and will take your dashboard from mediocre to miraculous. You will definitely want to share your creative graph templates with your colleagues in the global SAS® community.
Read the paper (PDF).
Travis Murphy, SAS
Paper 3151-2015:
How to Use Internal and External Data to Realize the Potential for Changing the Game in Handset Campaigns
The telecommunications industry is the fastest changing business ecosystem in this century. Therefore, handset campaigning to increase loyalty is the top issue for telco companies. However, these handset campaigns have great fraud and payment risks if the companies do not have the ability to classify and assess customers properly according to their risk propensity. For many years, telco companies managed the risk with business rules such as customer tenure until the launch of analytics solutions into the market. But few business rules restrict telco companies in the sales of handsets to new customers. On the other hand, with increasing competition pressure in telco companies, it is necessary to use external credit data to sell handsets to new customers. Credit bureau data was a good opportunity to measure and understand the behaviors of the applicants. But using external data required system integration and real-time decision systems. For those reasons, we need a solution that enables us to predict risky customers and then integrate risk scores and all information into one real-time decision engine for optimized handset application vetting. After an assessment period, the SAS® Analytics platform and RTDM were chosen as the most suitable solution because they provide a flexible, user-friendly interface, high integration, and fast deployment capability. In this project, we build a process that includes three main stages to transform the data into knowledge. These stages are data collection, predictive modelling, and deployment and decision optimization. a) Data Collection: We designed a specific daily updated data mart that connects internal payment behavior, demographics, and customer experience data with external credit bureau data. In this way, we can turn data into meaningful knowledge for better understanding of customer behavior. b) Predictive Modelling: To use the company's potential, it is critically important to use an analytics approach that is based on state-of-the-art technologies. We built nine models to predict customer propensity to pay. As a result of better classification of customers, we obtained satisfactory results in designing collection scenarios and the decision model for handset application vetting. c) Deployment and Decision Optimization: Knowledge is not enough to reach success in business. It should be turned into optimized decisions and deployed in real time. For this reason, we have been using SAS® Predictive Analytics Tools and SAS® Real-Time Decision Manager to primarily turn data into knowledge and turn knowledge into strategy and execution. With this system, we are now able to assess customers properly and to sell handsets even to our brand-new customers as part of the application vetting process. As a result of this, while we are decreasing nonpayment risk, we generated extra revenue coming from brand-new contracted customers. In three months, 13% of all handset sales were concluded via RTDM. Another benefit of the RTDM is a 30% cost saving in external data inquiries. Thanks to the RTDM, Avea has become the first telecom operator that uses bureau data in the Turkish telco industry.
Read the paper (PDF).
Hurcan Coskun, Avea
Paper 3252-2015:
How to Use SAS® for GMM Logistic Regression Models for Longitudinal Data with Time-Dependent Covariates
In longitudinal data, it is important to account for the correlation due to repeated measures and time-dependent covariates. Generalized method of moments can be used to estimate the coefficients in longitudinal data, although there are currently limited procedures in SAS® to produce GMM estimates for correlated data. In a recent paper, Lalonde, Wilson, and Yin provided a GMM model for estimating the coefficients in this type of data. SAS PROC IML was used to generate equations that needed to be solved to determine which estimating equations to use. In addition, this study extended classifications of moment conditions to include a type IV covariate. Two data sets were evaluated using this method, including re-hospitalization rates from a Medicare database as well as body mass index and future morbidity rates among Filipino children. Both examples contain binary responses, repeated measures, and time-dependent covariates. However, while this technique is useful, it is tedious and can also be complicated when determining the matrices necessary to obtain the estimating equations. We provide a concise and user-friendly macro to fit GMM logistic regression models with extended classifications.
Read the paper (PDF).
Katherine Cai, Arizona State University
I
Paper 2522-2015:
I Object: SAS® Does Objects with DS2
The DATA step has served SAS® programmers well over the years, and although it is powerful, it has not fundamentally changed. With DS2, SAS introduced a significant alternative to the DATA step by providing an object-oriented programming environment. In this paper, we share our experiences with getting started with DS2 and learning to use it to access, manage, and share data in a scalable, threaded, and standards-based way.
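A first DS2 program, for flavor; it reads a SASHELP table, declares a typed variable, and computes a value per row:

    proc ds2;
       data bmi_out / overwrite=yes;
          dcl double bmi;                          /* explicit ANSI-style declaration */
          method run();
             set sashelp.class;
             bmi = (weight / (height**2)) * 703;
          end;
       enddata;
       run;
    quit;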
Read the paper (PDF).
Peter Eberhardt, Fernwood Consulting Group Inc.
Xue Yao, Winnipeg Regional Health Authority
Paper 3411-2015:
Identifying Factors Associated with High-Cost Patients
Research has shown that the top five percent of patients can account for nearly fifty percent of the total healthcare expenditure in the United States. Using SAS® Enterprise Guide® and PROC LOGISTIC, a statistical methodology was developed to identify factors (for example, patient demographics, diagnostic symptoms, comorbidity, and the type of procedure code) associated with the high cost of healthcare. Analyses were performed using the FAIR Health National Private Insurance Claims (NPIC) database, which contains information about healthcare utilization and cost in the United States. The analyses focused on treatments for chronic conditions, such as trans-myocardial laser revascularization for the treatment of coronary heart disease (CHD) and pressurized inhalation for the treatment of asthma. Furthermore, bubble plots and heat maps were created using SAS® Visual Analytics to provide key insights into potentially high-cost treatments for heart disease and asthma patients across the nation.
Read the paper (PDF). | Download the data file (ZIP).
Jeff Dang, FAIR Health
Paper 2320-2015:
Implementing a Discrete Event Simulation Using the American Community Survey and the SAS® University Edition
SAS® University Edition is a great addition to the world of freely available analytic software, and this 'how-to' presentation shows you how to implement a discrete event simulation using Base SAS® to model future US Veterans population distributions. Features include generating a slideshow using ODS output to PowerPoint.
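The slideshow piece reduces to wrapping output in the ODS POWERPOINT destination; the file and data set names here are examples:

    ods powerpoint file='veteran_projection.pptx';

    title 'Projected Veteran Population by Year';
    proc sgplot data=projection_results;
       series x=year y=projected_count / group=scenario;
    run;

    ods powerpoint close;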
Read the paper (PDF). | Download the data file (ZIP).
Michael Grierson
Paper SAS1968-2015:
Important Things to Know when Deploying SAS® Grid Manager
SAS® Grid Computing promises many benefits that the SAS® community has been demanding for years, including workload management of SAS applications, a highly available infrastructure, higher resource utilization, flexibility for IT infrastructure, and potentially improved performance of SAS applications. But to implement these benefits, you need to have a good definition of what you need and an understanding of what is involved in enabling the SAS tasks to take advantage of all the SAS grid nodes. In addition to having this understanding of SAS, the underlying hardware infrastructure (cores to storage) must be configured and tuned correctly. This paper discusses the most important things (or misunderstandings) that SAS customers need to know before they deploy SAS® Grid Manager.
Read the paper (PDF).
Doug Haigh, SAS
Glenn Horton, SAS
Paper 3343-2015:
Improving SAS® Global Forum Papers
Just as research is built on existing research, the references section is an important part of a research paper. The purpose of this study is to find the differences between professionals and academicians with respect to the references section of a paper. Data is collected from SAS® Global Forum 2014 Proceedings. Two research hypotheses are supported by the data. First, the average number of references in papers by academicians is higher than those by professionals. Second, academicians follow standards for citing references more than professionals. Text mining is performed on the references to understand the actual content. This study suggests that authors of SAS Global Forum papers should include more references to increase the quality of the papers.
Read the paper (PDF).
Vijay Singh, Oklahoma State University
Pankush Kalgotra, Oklahoma State University
Paper SAS1965-2015:
Improving the Performance of Data Mining Models with Data Preparation Using SAS® Enterprise Miner™
In data mining modelling, data preparation is the most crucial, most difficult, and longest part of the mining process. A lot of steps are involved. Consider the simple distribution analysis of the variables, the diagnosis and reduction of the influence of variables' multicollinearity, the imputation of missing values, and the construction of categories in variables. In this presentation, we use data mining models in different areas like marketing, insurance, retail and credit risk. We show how to implement data preparation through SAS® Enterprise Miner™, using different approaches. We use simple code routines and complex processes involving statistical insights, cluster variables, transform variables, graphical analysis, decision trees, and more.
Read the paper (PDF).
Ricardo Galante, SAS
Paper 3356-2015:
Improving the Performance of Two-Stage Modeling Using the Association Node of SAS® Enterprise Miner™ 12.3
Over the years, very few published studies have discussed ways to improve the performance of two-stage predictive models. This study, based on 10 years (1999-2008) of data from 130 US hospitals and integrated delivery networks, is an attempt to demonstrate how we can leverage the Association node in SAS® Enterprise Miner™ to improve the classification accuracy of the two-stage model. We prepared the data with imputation operations and data cleaning procedures. Variable selection methods and domain knowledge were used to choose 43 key variables for the analysis. The prominent association rules revealed interesting relationships between prescribed medications and patient readmission/no-readmission. The rules with lift values greater than 1.6 were used to create dummy variables for use in the subsequent predictive modeling. Next, we used two-stage sequential modeling, where the first stage predicted if the diabetic patient was readmitted and the second stage predicted whether the readmission happened within 30 days. The backward logistic regression model outperformed competing models for the first stage. After including dummy variables from an association analysis, many fit indices improved: for example, the validation ASE fell to 0.228 from 0.238 and the cumulative lift rose to 1.56 from 1.40. Likewise, the performance of the second stage was improved after including dummy variables from an association analysis: the misclassification rate improved to 0.240 from 0.243 and the final prediction error to 0.17 from 0.18.
Read the paper (PDF).
Girish Shirodkar, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ankita Chaudhari, Oklahoma State University
Paper 3295-2015:
Imputing Missing Data using SAS®
Missing data is an unfortunate reality of statistics. However, there are various ways to estimate and deal with missing data. This paper explores the pros and cons of traditional imputation methods versus maximum likelihood estimation as well as singular versus multiple imputation. These differences are displayed through comparing parameter estimates of a known data set and simulating random missing data of different severity. In addition, this paper uses PROC MI and PROC MIANALYZE and shows how to use these procedures in a longitudinal data set.
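The multiple-imputation workflow sketched (variable names are examples): impute, analyze each completed data set BY _IMPUTATION_, then combine the results:

    proc mi data=have nimpute=5 seed=20150426 out=mi_out;
       var age bmi cholesterol;
    run;

    proc reg data=mi_out outest=reg_est covout noprint;
       model cholesterol = age bmi;
       by _imputation_;
    run;
    quit;

    proc mianalyze data=reg_est;
       modeleffects Intercept age bmi;
    run;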
Read the paper (PDF).
Christopher Yim, Cal Poly San Luis Obispo
Paper SAS1756-2015:
Incorporating External Economic Scenarios into Your CCAR Stress Testing Routines
Since the financial crisis of 2008, banks and bank holding companies in the United States have faced increased regulation. One of the recent changes to these regulations is known as the Comprehensive Capital Analysis and Review (CCAR). At the core of these new regulations, specifically under the Dodd-Frank Wall Street Reform and Consumer Protection Act and the stress tests it mandates, are a series of what-if or scenario analyses requirements that involve a number of scenarios provided by the Federal Reserve. This paper proposes frequentist and Bayesian time series methods that solve this stress testing problem using a highly practical top-down approach. The paper focuses on the value of using univariate time series methods, as well as the methodology behind these models.
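A top-down, univariate-with-regressor sketch of the idea (data set and variable names are illustrations, not the paper's models): append the scenario's future driver values to the history so that the forecast is conditional on that scenario:

    data history_plus_scenario;
       set loss_history fed_severe_scenario;   /* scenario rows carry unemployment; loss_rate is missing */
    run;

    proc arima data=history_plus_scenario;
       identify var=loss_rate crosscorr=(unemployment);
       estimate p=1 input=(unemployment) method=ml;
       forecast lead=9 interval=qtr id=date out=ccar_projection;
    run;
    quit;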
Read the paper (PDF).
Kenneth Sanford, SAS
Christian Macaro, SAS
Paper 2120-2015:
Inside the DATA Step: Pearls of Wisdom for the Novice SAS® Programmer
Why did my merge fail? How did that variable get truncated? Why am I getting unexpected results? Understanding how the DATA step actually works is the key to answering these and many other questions. In this paper, two independent consultants with a combined three decades of SAS® programming experience share a treasure trove of knowledge aimed at helping the novice SAS programmer take his or her game to the next level by peering behind the scenes of the DATA step. We touch on a variety of topics, including compilation versus execution, the program data vector, and proper merging techniques, with a focus on good programming practices that help the programmer steer clear of common pitfalls.
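One of the classic pitfalls in that territory, shown with hypothetical data sets: variable lengths are fixed at compile time by the first source encountered, so declare them before the MERGE:

    data combined;
       length name $40;                 /* set the length before either source is read  */
       merge demographics(in=in_demo) visits;
       by subject_id;
       if in_demo;                      /* keep only subjects present in DEMOGRAPHICS   */
    run;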
Read the paper (PDF).
Joshua Horstman, Nested Loop Consulting
Britney Gilbert, Juniper Tree Consulting
Joshua Horstman, Nested Loop Consulting
Paper 3461-2015:
Integrating Data and Business Rules with a Control Data Set in SAS®
In SAS® software development, data specifications and process requirements can be built into a user-defined control data set functioning as a component of ETL routines. A control data set provides a comprehensive definition of the data source, relationship, logic, description, and metadata of each data element. This approach enables SAS code to be auto-generated during program execution to perform data ingestion, transformation, and loading procedures based on rules defined in the table. This paper demonstrates the application of using a control data set for the following: (1) data table initialization and integration; (2) validation and quality control; (3) element transformation and creation; (4) data loading; and (5) documentation. SAS programmers and business analysts would find programming development and maintenance of business rules more efficient with this standardized method.
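A toy version of the idea: a control table holds target variables and the expressions that build them, and CALL EXECUTE turns those rows into generated DATA step code (the table contents and the source table name are illustrative):

    data control;
       length target $32 expression $200;
       target = 'bmi';       expression = 'weight / (height*height) * 703'; output;
       target = 'weight_kg'; expression = 'round(weight / 2.2046, 0.1)';    output;
    run;

    data _null_;
       set control end=last;
       if _n_ = 1 then call execute('data derived; set source_table;');
       call execute(cats(target, ' = ', expression, ';'));
       if last then call execute('run;');
    run;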
Read the paper (PDF).
Edmond Cheng, CACI International Inc
Paper 2500-2015:
Integrating SAS® and the R Language with Microsoft SharePoint
Microsoft SharePoint has been adopted by a number of companies today as their content management tool because of its ability to create and manage documents, records, and web content. It is described as an enterprise collaboration platform with a variety of capabilities, and thus it stands to reason that this platform should also be used to surface content from analytical applications such as SAS® and the R language. SAS provides various methods for surfacing SAS content through SharePoint. This paper describes one such methodology that is both simple and elegant, requiring only SAS Foundation. It also explains how SAS and R can be used together to form a robust solution for delivering analytical results. The paper outlines the approach for integrating both languages into a single security model that uses Microsoft Active Directory as the primary authentication mechanism for SharePoint. It also describes how to extend the authorization to SAS running on a Linux server where LDAP is used. Users of this system are blissfully ignorant of the back-end technology components, as we offer up a seamless interface where they simply authenticate to the SharePoint site and the rest is, as they say, magic.
Read the paper (PDF).
Piyush Singh, Tata Consultancy Services
Prasoon Sangwan, Tata Consultancy Services
Shiv Govind Yadav
Paper 3052-2015:
Introduce a Linear Regression Model by Using the Variable Transformation Method
This paper explains how to build a linear regression model using the variable transformation method. It also covers testing the assumptions required for linear modeling and assessing the fit of the resulting model. This paper is intended for analysts who have limited exposure to building linear models. This paper uses the REG, GLM, CORR, UNIVARIATE, and GPLOT procedures.
Read the paper (PDF). | Download the data file (ZIP).
Nancy Hu, Discover
Paper SAS1742-2015:
Introducing the HPGENSELECT Procedure: Model Selection for Generalized Linear Models and More
Generalized linear models are highly useful statistical tools in a broad array of business applications and scientific fields. How can you select a good model when numerous models that have different regression effects are possible? The HPGENSELECT procedure, which was introduced in SAS/STAT® 12.3, provides forward, backward, and stepwise model selection for generalized linear models. In SAS/STAT 14.1, the HPGENSELECT procedure also provides the LASSO method for model selection. You can specify common distributions in the family of generalized linear models, such as the Poisson, binomial, and multinomial distributions. You can also specify the Tweedie distribution, which is important in ratemaking by the insurance industry and in scientific applications. You can run the HPGENSELECT procedure in single-machine mode on the server where SAS/STAT is installed. With a separate license for SAS® High-Performance Statistics, you can also run the procedure in distributed mode on a cluster of machines that distribute the data and the computations. This paper shows you how to use the HPGENSELECT procedure both for model selection and for fitting a single model. The paper also explains the differences between the HPGENSELECT procedure and the GENMOD procedure.
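For orientation, a minimal call might look like the following (the data set and variables are hypothetical; LASSO selection assumes SAS/STAT 14.1 or later, as noted above):

   proc hpgenselect data=claims;
      model numclaims = age income exposure / distribution=poisson link=log;
      selection method=lasso;   /* or method=forward | backward | stepwise */
   run;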
Read the paper (PDF).
Gordon Johnston, SAS
Bob Rodriguez, SAS
Paper 2986-2015:
Introduction to Output Delivery System (ODS)
This presentation teaches the audience how to use ODS Graphics. Now part of Base SAS®, ODS Graphics are a great way to easily create clear graphics that enable any user to tell their story well. SGPLOT and SGPANEL are two of the procedures that can be used to produce powerful graphics that used to require a lot of work. The core of the procedures is explained, as well as some of the many options available. Furthermore, we explore the ways to combine the individual statements to make more complex graphics that tell the story better. Any user of Base SAS on any platform will find great value in the SAS ODS Graphics procedures.
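As a flavor of how little code is needed, this sketch plots the SASHELP.CLASS sample data shipped with SAS:

   proc sgplot data=sashelp.class;
      scatter x=height y=weight / group=sex;   /* points colored by the GROUP variable */
      reg x=height y=weight;                   /* overlay a regression fit line */
      xaxis label='Height (in)';
      yaxis label='Weight (lb)';
   run;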
Chuck Kincaid, Experis
Paper SAS1845-2015:
Introduction to SAS® Data Loader: The Power of Data Transformation in Hadoop
Organizations are loading data into Hadoop platforms at an extraordinary rate. However, in order to extract value from these platforms, the data must be prepared for analytic use. As the volume of data grows, it becomes increasingly important to reduce data movement, as well as to leverage the computing power of these distributed systems. This paper provides a high-level overview of SAS® Data Loader, a product specifically aimed at these challenges. We cover the underlying mechanisms of how SAS Data Loader works, as well as how it's used to profile, cleanse, transform, and ultimately prepare data for analytics in Hadoop.
Read the paper (PDF).
Keith Renison, SAS
Paper 3024-2015:
Introduction to SAS® Hash Objects
The SAS® hash object is an incredibly powerful technique for integrating data from two or more data sets based on a common key. This session describes the basic methodology for defining, populating, and using a hash object to perform lookups within the DATA step and provides examples of situations in which the performance of SAS programs is improved by their use. Common problems encountered when using hash objects are explained, and tools and techniques for optimizing hash objects within your SAS program are demonstrated.
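A minimal sketch of the basic lookup pattern (the data set and variable names are hypothetical):

   data matched;
      if _n_ = 1 then do;
         if 0 then set lookup;               /* define REGION in the PDV without reading data */
         declare hash h(dataset: 'lookup');  /* load the lookup table into memory once */
         h.defineKey('id');
         h.defineData('region');
         h.defineDone();
      end;
      set transactions;
      if h.find() = 0 then output;           /* keep transactions whose ID exists in LOOKUP */
   run;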
Read the paper (PDF).
Chris Schacherer, Clinical Data Management Systems, LLC
J
Paper 3020-2015:
Jeffreys Interval for One-Sample Proportion with SAS/STAT® Software
This paper introduces Jeffreys interval for one-sample proportion using SAS® software. It compares the credible interval from a Bayesian approach with the confidence interval from a frequentist approach. Different ways to calculate the Jeffreys interval are presented using PROC FREQ, the QUANTILE function, a SAS program of the random walk Metropolis sampler, and PROC MCMC.
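To make the interval concrete: for x successes in n trials, the Jeffreys limits are quantiles of a Beta(x + 1/2, n - x + 1/2) distribution, which the QUANTILE function computes directly; PROC FREQ can request the same limits through its BINOMIAL option. The counts, data set, and variable names below are made up:

   data jeffreys;
      x = 12;  n = 50;
      lower = quantile('BETA', 0.025, x + 0.5, n - x + 0.5);
      upper = quantile('BETA', 0.975, x + 0.5, n - x + 0.5);
      put lower= upper=;
   run;

   proc freq data=mysample;
      tables response / binomial(jeffreys);   /* Jeffreys confidence limits for a proportion */
   run;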
Read the paper (PDF).
Wu Gong, The Children's Hospital of Philadelphia
Paper 2340-2015:
Joining Tables Using SAS® Enterprise Guide®
Examples include how to join when your data is perfect, how to join when your data does not match and has to be manipulated before joining, and how to create and use a subquery.
Read the paper (PDF).
Anita Measey, Bank of Montreal, Risk Capital & Stress Testing
K
Paper 2480-2015:
Kaplan-Meier Survival Plotting Macro %NEWSURV
The research areas of pharmaceuticals and oncology clinical trials greatly depend on time-to-event endpoints such as overall survival and progression-free survival. One of the best graphical displays of these analyses is the Kaplan-Meier curve, which can be simple to generate with the LIFETEST procedure but difficult to customize. Journal articles generally prefer that statistics such as median time-to-event, number of patients, and time-point event-free rate estimates be displayed within the graphic itself, and this was previously difficult to do without an external program such as Microsoft Excel. The macro %NEWSURV takes advantage of the Graph Template Language (GTL) that was added with the SG graphics engine to create this level of customizability without the need for back-end manipulation. Taking this one step further, the macro was improved to be able to generate a lattice of multiple unique Kaplan-Meier curves for side-by-side comparisons or for condensing figures for publications. This paper describes the functionality of the macro and describes how the key elements of the macro work.
Read the paper (PDF).
Jeffrey Meyers, Mayo Clinic
Paper 3023-2015:
Killing Them with Kindness: Policies Not Based on Data Might Do More Harm Than Good
Educational administrators sometimes have to make decisions based on what they believe is in the best interest of their students because they do not have the data they need at the time. Some administrators do not even know that the data exist to help them make their decisions. However, well-intentioned policies that are not based on facts can sometimes do more harm than good for the students and the institution. This presentation discusses the results of the policy analyses conducted by the Office of Institutional Research at Western Kentucky University using Base SAS®, SAS/STAT®, SAS® Enterprise Miner™, and SAS® Visual Analytics. The researchers analyzed Western Kentucky University's math course placement procedure for incoming students and assessed the criteria used for admissions decisions, including those for first-time first-year students, transfer students, and students readmitted to the University after being dismissed for unsatisfactory academic progress--procedures and criteria previously designed with the students' best interests at heart. The presenters discuss the statistical analyses used to evaluate the policies and the use of SAS Visual Analytics to present their results to administrators in a visual manner. In addition, the presenters discuss subsequent changes in the policies, and where possible, the results of the policy changes.
Read the paper (PDF).
Tuesdi Helbig, Western Kentucky University
Matthew Foraker, Western Kentucky University
L
Paper 3297-2015:
Lasso Regularization for Generalized Linear Models in Base SAS® Using Cyclical Coordinate Descent
The cyclical coordinate descent method is a simple algorithm that has been used for fitting generalized linear models with lasso penalties by Friedman et al. (2007). The coordinate descent algorithm can be implemented in Base SAS® to perform efficient variable selection and shrinkage for GLMs with the L1 penalty (the lasso).
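To make the algorithm concrete, here is a deliberately simplified SAS/IML sketch (for illustration only; the paper implements the method in Base SAS) of cyclical coordinate descent for a lasso linear model, where each coefficient update is a soft-thresholding step; the input data set, variables, and lambda value are hypothetical:

   proc iml;
      use work.have;
      read all var {x1 x2 x3} into x;
      read all var {y} into y;
      close work.have;
      n = nrow(x);
      x = (x - repeat(mean(x), n)) / repeat(std(x), n);   /* standardize predictors */
      y = y - mean(y);                                     /* center the response    */
      lambda = 0.1;
      beta = j(ncol(x), 1, 0);
      do sweep = 1 to 50;                                  /* fixed number of sweeps */
         do j = 1 to ncol(x);
            r = y - x*beta + x[, j]*beta[j];               /* partial residual       */
            z = (x[, j]` * r) / n;
            beta[j] = sign(z) * max(abs(z) - lambda, 0);   /* soft thresholding      */
         end;
      end;
      print beta;
   quit;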
Read the paper (PDF).
Robert Feyerharm, Beacon Health Options
Paper 2960-2015:
Lasso Your Business Users by Designing Information Pathways to Optimize Standardized Reporting in SAS® Visual Analytics
SAS® Visual Analytics opens up a world of intuitive interactions, providing report creators the ability to develop more efficient ways to deliver information. Business-related hierarchies can be defined dynamically in SAS Visual Analytics to group data more efficiently--no more going back to the developers. Visualizations can interact with each other, with other objects within other sections, and even with custom applications and SAS® stored processes. This paper provides a blueprint to streamline and consolidate reporting efforts using these interactions available in SAS Visual Analytics. The goal of this methodology is to guide users down information pathways that can progressively subset data into smaller, more understandable chunks of data, while summarizing each layer to provide insight along the way. Ultimately the final destination of the information pathway holds a reasonable subset of data so that a user can take action and facilitate an understood outcome.
Read the paper (PDF).
Stephen Overton, Zencos Consulting
Paper SAS1955-2015:
Latest and Greatest: Best Practices for Migrating to SAS® 9.4
SAS® customers benefit greatly when they are using the functionality, performance, and stability available in the latest version of SAS. However, the task of moving all SAS collateral such as programs, data, catalogs, metadata (stored processes, maps, queries, reports, and so on), and content to SAS® 9.4 can seem daunting. This paper provides an overview of the steps required to move all SAS collateral from systems based on SAS® 9.2 and SAS® 9.3 to the current release of SAS® 9.4.
Read the paper (PDF).
Alec Fernandez, SAS
Paper 3372-2015:
Leads and Lags: Static and Dynamic Queues in the SAS® DATA Step
From stock price histories to hospital stay records, analysis of time series data often requires the use of lagged (and occasionally lead) values of one or more analysis variables. For the SAS® user, the central operational task is typically getting lagged (lead) values for each time point in the data set. Although SAS has long provided a LAG function, it has no analogous lead function--an especially significant problem in the case of large data series. This paper reviews the LAG function (in particular, the powerful but non-intuitive implications of its queue-oriented basis), demonstrates efficient ways to generate leads with the same flexibility as the LAG function (but without the common and expensive recourse of data re-sorting), and shows how to dynamically generate leads and lags through the use of the hash object.
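One widely used way to build a lead without re-sorting, in the spirit of the techniques reviewed here (data set and variable names are hypothetical), is a one-to-one merge of a data set with itself offset by one observation:

   data with_lead;
      merge prices
            prices(firstobs=2 keep=price rename=(price=lead_price));   /* next observation's PRICE */
   run;

For grouped series, extra logic is needed so that a lead does not cross BY-group boundaries, which is where the hash-based approach described in the paper becomes attractive.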
Read the paper (PDF). | Download the data file (ZIP).
Mark Keintz, Wharton Research Data Services
Paper 1408-2015:
Learn Hidden Ideas in Base SAS® to Impress Colleagues
Across the languages of SAS® are many golden nuggets--functions, formats, and programming features just waiting to impress your friends and colleagues. Learning SAS over 30+ years, I have collected a few, and I offer them to you in this presentation.
Read the paper (PDF).
Peter Crawford, Crawford Software Consultancy Limited
Paper 3514-2015:
Learn How Slalom Consulting and Celebrity Cruises Bridge the Marketing Campaign Attribution Divide
Although today's marketing teams enjoy large-scale campaign relationship management systems, many are still left with the task of bridging the well-known gap between campaigns and customer purchasing decisions. During this session, we discuss how Slalom Consulting and Celebrity Cruises decided to take a bold step and bridge that gap. We show how marketing efforts are distorted when a team considers only the last campaign sent to a customer that later booked a cruise. Then we lay out a custom-built SAS 9.3 solution that scales to process thousands of campaigns per month using a stochastic attribution technique. This approach considers all of the campaigns that touch the customer, assigning a single campaign or a set of campaigns that contributed to their decision.
Christopher Byrd, Slalom Consulting
Paper 3203-2015:
Learning Analytics to Evaluate and Confirm Pedagogic Choices
There are many pedagogic theories and practices that academics research and follow as they strive to ensure excellence in their students' achievements. In order to validate the impact of different approaches, there is a need to apply analytical techniques to evaluate the changing levels of achievements that occur as a result of changes in applied pedagogy. The analytics used should be easily accessible to all academics with minimal overhead in terms of the collection of new data. This paper is based on a case study of the changing pedagogical approaches of the author over the past five years, using grade profiles from a wide range of modules taught by the author in both the School of Computing and Maths and the Business School at the University of Derby. Base SAS® and SAS® Studio were used to evaluate and demonstrate the impact of the change from a pedagogical position of 'Academic as Domain Expert' to one of 'Academic as Learning-to-Learn Expert'. This change resulted in greater levels of research that supported learning, along with better writing skills. The application of Learning Analytics in this case study demonstrates a very significant improvement of between 15% and 20% in the grade profiles of all students. More surprisingly, it demonstrates that the change also eliminates a significant grade deficit in the black and minority ethnic student population, which is typically about 15% in a large number of UK universities.
Read the paper (PDF).
Richard Self, University of Derby
Paper SPON2000-2015:
Leveraging In-Database Technology to Enhance Data Governance and Improve Performance
In-database processing refers to the integration of advanced analytics into the data warehouse. With this capability, analytic processing is optimized to run where the data reside, in parallel, without having to copy or move the data for analysis. From a data governance perspective there are many good reasons to embrace in-database processing. Many analytical computing solutions and large databases use this technology because it provides significant performance improvements over more traditional methods. Come learn how Blue Cross Blue Shield of Tennessee (BCBST) uses in-database processing from SAS and Teradata.
Harold Klagstad, BlueCross BlueShield of TN
Paper SAS1960-2015:
Leveraging SAS® Environment Manager for SAS® Customer Intelligence Implementations
SAS® Environment Manager helps SAS® administrators and system administrators manage SAS resources and effectively monitor the environment. SAS Environment Manager provides administrators with a centralized location for accessing and monitoring the SAS® Customer Intelligence environment. This enables administrators to identify problem areas and to maintain an in-depth understanding of the day-to-day activities on the system. It is also an excellent way to predict the usage and growth of the environment for scalability. With SAS Environment Manager, administrators can set up monitoring for CI logs (for example, SASCustIntelCore6.3.log, SASCustIntelStudio6.3.log) and other general logs from the SAS® Intelligence Platform. This paper contains examples for administrators who support SAS Customer Intelligence to set up this type of monitoring. It provides recommendations for approaches and for how to interpret the results from SAS Environment Manager.
Read the paper (PDF).
Daniel Alvarez, SAS
Paper 3342-2015:
Location-Based Association of Customer Sentiment and Retail Sales
There are various economic factors that affect retail sales. One important factor that is expected to correlate is overall customer sentiment toward a brand. In this paper, we analyze how location-specific customer sentiment can vary and correlate with sales at retail stores. In our attempt to find any dependency, we have used location-specific Twitter feeds related to a national-brand chain retail store. We opinion-mine their overall sentiment using SAS® Sentiment Analysis Studio. We estimate the correlation between the opinion index and retail sales within the studied geographic areas. Later in the analysis, using ArcGIS Online from Esri, we estimate whether other location-specific variables that could potentially correlate with customer sentiment toward the brand are significant predictors of the brand's retail sales.
Read the paper (PDF).
Asish Satpathy, University of California, Riverside
Goutam Chakraborty, Oklahoma State University
Tanvi Kode, Oklahoma State University
Paper SAS1748-2015:
Lost in the Forest Plot? Follow the Graph Template Language AXISTABLE Road!
A forest plot is a common visualization for meta-analysis. Some popular versions that use subgroups with indented text and bold fonts can seem outright daunting to create. With SAS® 9.4, the Graph Template Language (GTL) has introduced the AXISTABLE statement, specifically designed for including text data columns into a graph. In this paper, we demonstrate the simplicity of creating various forest plots using AXISTABLE statements. Come and see how to create forest plots as clear as day!
Read the paper (PDF). | Download the data file (ZIP).
Prashant Hebbar, SAS
M
Paper 2180-2015:
Maintaining a 'Look and Feel' throughout a Reporting Package Created with Diverse SAS® Products
SAS® provides a number of tools for creating customized professional reports. While SAS provides point-and-click interfaces through products such as SAS® Web Report Studio, SAS® Visual Analytics, or even SAS® Enterprise Guide®, many users unfortunately do not have access to the high-end tools and require customization beyond the SAS Enterprise Guide point-and-click interface. Fortunately, Base SAS procedures such as the REPORT procedure, combined with graphics procedures, macros, ODS, and Annotate, can be used to create very customized professional reports. When combining different solutions such as SAS Statistical Graphics, the REPORT procedure, ODS, and SAS/GRAPH®, different techniques need to be used to keep the same look and feel throughout the report package. This presentation looks at solutions that can be used to keep a consistent look and feel in a report package created with different SAS products.
Read the paper (PDF).
Barbara Okerson, Anthem
Paper SAS1785-2015:
Make Better Decisions with Optimization
Automated decision-making systems are now found everywhere, from your bank to your government to your home. For example, when you inquire for a loan through a website, a complex decision process likely runs combinations of statistical models and business rules to make sure you are offered a set of options for tantalizing terms and conditions. To make that happen, analysts diligently strive to encode their complex business logic into these systems. But how do you know if you are making the best possible decisions? How do you know if your decisions conform to your business constraints? For example, you might want to maximize the number of loans that you provide while balancing the risk among different customer categories. Welcome to the world of optimization. SAS® Business Rules Manager and SAS/OR® software can be used together to manage and optimize decisions. This presentation demonstrates how to build business rules and then optimize the rule parameters to maximize the effectiveness of those rules. The end result is more confidence that you are delivering an effective decision-making process.
Read the paper (PDF). | Download the data file (ZIP).
David Duling, SAS
Paper 2481-2015:
Managing Extended Attributes With a SAS® Enterprise Guide® Add-In
SAS® 9.4 introduced extended attributes, which are name-value pairs that can be attached to either the data set or to individual variables. Extended attributes are managed through PROC DATASETS and can be viewed through PROC CONTENTS or through Dictionary.XATTRS. This paper describes the development of a SAS® Enterprise Guide® custom add-in that allows for the entry and editing of extended attributes, with the possibility of using a controlled vocabulary. The controlled vocabulary used in the initial application is derived from the lifecycle branch of the Data Documentation Initiative metadata standard (DDI-L).
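A minimal sketch of the syntax the add-in manages behind the scenes (library, data set, and attribute names are hypothetical):

   proc datasets library=work nolist;
      modify survey;
         xattr add ds Source='2014 household wave';                     /* data set level attribute */
         xattr add var income (Unit='USD' Definition='Annual income');  /* variable level attribute */
   quit;

   proc sql;   /* extended attributes are also visible through DICTIONARY.XATTRS */
      select * from dictionary.xattrs
         where libname='WORK' and memname='SURVEY';
   quit;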
Read the paper (PDF).
Larry Hoyle, IPSR, Univ. of Kansas
Paper 2884-2015:
Managing Files by Click and Drag?
File management is a tedious process that can be automated by using SAS® to create and execute a Windows command script. The macro in this paper copies files from one location to another, identifies obsolete files by the version number, and then moves them to an archive folder. Assuming that some basic conditions are met, this macro is intended to be easy to use and robust. Windows users who run routine programs for projects with rework might want to consider this solution.
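A stripped-down sketch of the general approach (all paths are hypothetical): write the command script from a DATA step, then execute it with the X statement.

   filename cmd "C:\temp\archive_files.bat";

   data _null_;
      file cmd;
      put 'copy "C:\projects\reports\*.sas7bdat" "C:\projects\archive\"';   /* one command per PUT */
   run;

   options noxwait;                 /* on Windows, return control to SAS when the script finishes */
   x "C:\temp\archive_files.bat";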
Read the paper (PDF).
Jason Wachsmuth, Public Policy Center at The University of Iowa
Paper 3399-2015:
Managing Qualtrics Survey Distributions and Response Data with SAS®
Qualtrics is an online survey tool that offers a variety of features useful to researchers. In this paper, we show you how to implement the different options available for distributing surveys and downloading survey responses. We use the FILENAME statement (URL access method) and process the API responses with SAS® XML Mapper. In addition, we show an approach for how to keep track of active and inactive respondents.
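As a rough sketch of the URL access method mentioned above (the host, endpoint, and token below are placeholders, not actual Qualtrics API values):

   filename qresp url
      'https://survey.example.com/API/v2/surveys?apiToken=MY_TOKEN&format=xml';

   data raw_response;
      infile qresp lrecl=32767 truncover;
      input line $char32767.;       /* capture the raw reply; parse it later with SAS XML Mapper */
   run;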
Read the paper (PDF).
Faith Parsons, Columbia University Medical Center
Sean Mota, Columbia University Medical Center
Yan Quan, Columbia University
Paper SAS1776-2015:
Managing SAS® Web Infrastructure Platform Data Server High-Availability Clusters
The SAS® Web Application Server is a lightweight server that provides enterprise-class features for running SAS® middle-tier web applications. This server can be configured to use the SAS® Web Infrastructure Platform Data Server for a transactional storage database. You can meet the high-availability data requirement in your business plan by implementing a SAS Web Infrastructure Data Server cluster. This paper focuses on how the SAS Web Infrastructure Data Server on the SAS middle tier can be configured for load balancing, and data replication involving multiple nodes. SAS® Environment Manager and pgpool-II are used to enable these high-availability strategies, monitor the server status, and initiate failover as needed.
Read the paper (PDF).
Ken Young, SAS
Paper 3193-2015:
Mapping out SG Procedures and Using PROC SGPLOT for Mapping
This paper describes the main functions of the SAS® SG procedures and their relations. It also offers a way to create data-colored maps using these procedures. First, the basics of the SG procedures: For a few years, the SG procedures (PROC SGPLOT, PROC SGSCATTER, PROC SGPANEL, and so on) have been part of Base SAS® and thus available for everybody. SG originated as 'Statistical Graphics', but nowadays the procedures are often referred to as SAS® ODS Graphics. With the syntax in a 1000+ page document, it is quite a challenge to start using them. Also, SAS® Enterprise Guide® currently has no graphics tasks that generate code for the SG procedures (except those in the statistical arena). For a long time SAS/GRAPH® has been the vehicle for producing presentation-ready graphs of your data. In particular, SAS users who have experience with those SAS/GRAPH procedures will hesitate to change over. But the SG procedures continue to be enhanced with new features. And, because the appearance of many elements is governed by the ODS styles, they are very well suited to provide a consistent style across all your output, text and graphics. PROC SGPLOT, PROC SGPANEL, and PROC SGSCATTER: The paper first describes the basic procedures that a user will start with, PROC SGPLOT being the natural starting point. Then the more elaborate possibilities of PROC SGPANEL and PROC SGSCATTER are described. Both these procedures can create a matrix or panel of graphs. The different goals of these two procedures are explained: comparing a group of variables versus comparing the levels of two variables. PROC SGPLOT can create many different graphs: histograms, time series, scatter plots, and so on. PROC SGPANEL has essentially the same possibilities. The nature of PROC SGSCATTER (as the name says) limits it to scatter-like graphs. But many statements and options are common to many types of graphs. This paper groups them logically, making clear what the procedures have in common and where they differ. Related to the SG procedures are two utilities (the Graphics Editor and the Graphics Designer), which are delivered as SAS® Foundation applications. The paper describes the relations between these utilities, the objects they produce, and the relevant SG procedures. Creating a map: For virtually all tasks that can be performed with the well-known SAS/GRAPH procedures, the counterpart in the SG procedures is easily pointed out, often with more extensive features. This is not the case, however, for the maps produced with PROC GMAP. This paper shows the few steps that are necessary to convert the data sets that contain your data and your map coordinates into data sets that enable you to use the power and features of PROC SGPLOT to create your map in any projection system and any coordinate window.
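For example, once the map coordinate data set has been prepared, a single POLYGON statement (available in recent SAS 9.4 releases of PROC SGPLOT) can draw the map; the data set and variable names below are hypothetical:

   proc sgplot data=map_xy noautolegend;
      polygon x=proj_x y=proj_y id=region / group=region;   /* one polygon per REGION value */
      xaxis display=none;
      yaxis display=none;
   run;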
Read the paper (PDF). | Download the data file (ZIP).
Frank Poppe, PW Consulting
Paper SAS3383-2015:
Marketing at the Speed of Gaming: Real-Time Decisions for Casinos
There are few business environments more dynamic than that of a casino. Serving a multitude of entertainment options to thousands of patrons every day results in a lot of customer interaction points. All of these interactions occur in a highly competitive environment where, if a patron doesn't feel that he is getting the recognition that he deserves, he can easily walk across the street to a competitor. Add to this the expected amount of reinvestment per patron in the forms of free meals and free play. Making high-quality real-time decisions during each customer interaction is critical to the success of a casino. Such decisions need to be relevant to customers' needs and values, reflect the strategy of the business, and help maximize the organization's profitability. Being able to make those decisions repeatedly is what separates highly successful businesses from those that flounder or fail. Casinos have a great deal of information about a patron's history, behaviors, and preferences. Being able to react in real time to newly gathered information captured in ongoing dialogues opens up new opportunities about what offers should be extended and how patrons are treated. In this session, we provide an overview of real-time decisioning and its capabilities, review the various opportunities for real-time interaction in a casino environment, and explain how to incorporate the outputs of analytics processes into a real-time decision engine.
Read the paper (PDF).
Natalie Osborn, SAS
Paper SAS1822-2015:
Master Data and Command Results: Combine Master Data Management with SAS® Analytics for Improved Insights
It's well known that SAS® is the leader in advanced analytics but often overlooked is the intelligent data preparation that combines information from disparate sources to enable confident creation and deployment of compelling models. Improving data-based decision making is among the top reasons why organizations decide to embark on master data management (MDM) projects and why you should consider incorporating MDM functionality into your analytics-based processes. MDM is a discipline that includes the people, processes, and technologies for creating an authoritative view of core data elements in enterprise operational and analytic systems. This paper demonstrates why MDM functionality is a natural fit for many SAS solutions that need to have access to timely, clean, and unique master data. Because MDM shares many of the same technologies that power SAS analytic solutions, it has never been easier to add MDM capabilities to your advanced analytics projects.
Read the paper (PDF).
Ron Agresta, SAS
Paper 3375-2015:
Maximizing a Churn Campaign's Profitability with Cost-sensitive Predictive Analytics
Predictive analytics has been widely studied in recent years, and it has been applied to solve a wide range of real-world problems. Nevertheless, current state-of-the-art predictive analytics models are not well aligned with managers' requirements in that the models fail to include the real financial costs and benefits during the training and evaluation phases. Churn predictive modeling is one of those examples in which evaluating a model based on a traditional measure such as accuracy or predictive power does not yield the best results when measured by investment per subscriber in a loyalty campaign and the financial impact of failing to detect a real churner versus wrongly predicting a non-churner as a churner. In this paper, we propose a new financially based measure for evaluating the effectiveness of a voluntary churn campaign, taking into account the available portfolio of offers, their individual financial cost, and the probability of acceptance depending on the customer profile. Then, using a real-world churn data set, we compared different cost-insensitive and cost-sensitive predictive analytics models and measured their effectiveness based on their predictive power and cost optimization. The results show that using a cost-sensitive approach yields an increase in profitability of up to 32.5%.
Alejandro Correa Bahnsen, University of Luxembourg
Darwin Amezquita, DIRECTV
Juan Camilo Arias, Smartics
Paper 2240-2015:
Member-Level Regression Using SAS® Enterprise Guide® and SAS® Forecast Studio
The need to measure slight changes in healthcare costs and utilization patterns over time is vital in predictive modeling, forecasting, and other advanced analytics. At BlueCross BlueShield of Tennessee, a method for developing member-level regression slopes creates a better way of identifying these changes across various time spans. The goal is to create multiple metrics at the member level that will indicate when an individual is seeking more or less medical or pharmacy services. Significant increases or decreases in utilization and cost are used to predict the likelihood of acquiring certain conditions, seeking services at particular facilities, and self-engaging in health and wellness. Data setup and compilation consist of calculating a member's eligibility with the health plan and then aggregating cost and utilization of particular services (for example, primary care visits, Rx costs, ER visits, and so on). A member must have at least six months of eligibility for a valid regression slope to be calculated. Linear regression is used to build single-factor models for 6-, 12-, 18-, and 24-month time spans if the appropriate amount of data is available for the member. Models are built at the member-metric-time-period level, resulting in the possibility of over 75 regression coefficients per member per monthly run. The computing power needed to execute such a vast amount of calculations requires in-database processing of various macro processes. SAS® Enterprise Guide® is used to structure the data and SAS® Forecast Studio is used to forecast trends at a member level. Algorithms are run the first of each month. Data is stored so that each metric and corresponding slope is appended on a monthly basis. Because the data is set up for the member regression algorithm, slopes are interpreted in the following manner: a positive value for -1*slope indicates an increase in utilization/cost; a negative value for -1*slope indicates a decrease in utilization/cost. The actual slope value indicates the intensity of the change in cost or utilization. The insight provided by this member-level regression methodology replaces subjective methods that used arbitrary thresholds of change to measure differences in cost and utilization.
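Conceptually, each member-metric slope is a simple regression of monthly cost or utilization on time. A hedged sketch with hypothetical data set and variable names (the production process described above runs in-database via macros):

   proc sort data=member_monthly;
      by member_id;
   run;

   proc reg data=member_monthly outest=slopes noprint;
      by member_id;
      model rx_cost = month_index;   /* the MONTH_INDEX coefficient in SLOPES is each member's trend */
   run;
   quit;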
Read the paper (PDF).
Leigh McCormack, BCBST
Prudhvidhar Perati, BlueCross BlueShield of TN
Paper SAS1957-2015:
Meter Data Analytics--Enabling Actionable Decisions to Derive Business Value from Smart Meter Data
A utility's meter data is a valuable asset that can be daunting to leverage. Consider that one household or premise can produce over 35,000 rows of information, consisting of over 8 MB of data per year. Thirty thousand meters collecting fifteen-minute-interval data with forty variables equates to 1.2 billion rows of data. Using SAS® Visual Analytics, we provide examples of leveraging smart meter data to address business around revenue protection, meter operations, and customer analysis. Key analyses include identifying consumption on inactive meters, potential energy theft, and stopped or slowing meters; and support of all customer classes (for example, residential, small commercial, and industrial) and their data with different time intervals and frequencies.
Read the paper (PDF).
Tom Anderson, SAS
Jennifer Whaley, SAS
Paper 3760-2015:
Methodological and Statistical Issues in Provider Performance Assessment
With the move to value-based benefit and reimbursement models, it is essential to quantify the relative cost, quality, and outcome of a service. Accurately measuring the cost and quality of doctors, practices, and health systems is critical when you are developing a tiered network, a shared savings program, or a pay-for-performance incentive. Limitations in claims payment systems require developing methodological and statistical techniques to improve the validity and reliability of providers' scores on cost and quality of care. This talk discusses several key concepts in the development of a measurement system for provider performance, including measure selection, risk adjustment methods, and peer group benchmark development.
Read the paper (PDF). | Watch the recording.
Daryl Wansink, Qualmetrix, Inc.
Paper 2524-2015:
Methodology of Model Creation
The goal of this session is to describe the whole process of model creation, from the business request through model specification, data preparation, iterative model creation, model tuning, implementation, and model servicing. Each phase consists of several steps, for which we describe the main goal, the expected outcome, the tools used, our own SAS code, useful nodes and settings in SAS® Enterprise Miner™, procedures in SAS® Enterprise Guide®, measurement criteria, and expected duration in man-days. For three steps, we also present deeper insights with examples of practical usage, explanations of the code and settings used, and ways of exploring and interpreting the output. During the actual model creation process, we suggest using Microsoft Excel to keep all input metadata along with information about transformations performed in SAS Enterprise Miner. To obtain information about model results faster, we use an automatic SAS® code generator implemented in Excel, input the generated code into SAS Enterprise Guide, and create a specific profile of results directly from the node output tables of SAS Enterprise Miner. This paper also presents an example of a binary model stability check over time, performed in SAS Enterprise Guide by measuring the optimal cut-off percentage and lift. These measurements are visualized and automated using our own code. With this methodology, users have direct contact with the transformed data and can analyze and explore any intermediate results. Furthermore, the proposed approach can be used for several types of modeling (for example, binary and nominal predictive models or segmentation models). In short, we summarize our best practices for combining specific procedures in SAS Enterprise Guide, SAS Enterprise Miner, and Microsoft Excel to create and interpret models faster and more effectively.
Read the paper (PDF).
Peter Kertys, VÚB a.s.
Paper 1381-2015:
Model Risk and Corporate Governance of Models with SAS®
Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss-given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level built on these models and that are used to help in achieving business targets, risk-sensitive pricing, capital planning, optimizing of ROE/RAROC, managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate and their predictive power can drop dramatically. Since the global financial crisis in 2008, we have faced a tsunami of regulation and accelerated frequency of changes in the business environment, which cause models to deteriorate faster than ever before. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results--without human intervention) might result in a huge error that can have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS® tools: model monitoring, SAS® Model Manager, dashboards, and SAS® Visual Analytics.
Read the paper (PDF).
Boaz Galinson, Bank Leumi
Paper 2400-2015:
Modeling Effect Modification and Higher-Order Interactions: A Novel Approach for Repeated Measures Design Using the LSMESTIMATE Statement in SAS® 9.4
Effect modification occurs when the association between a predictor of interest and the outcome is differential across levels of a third variable--the modifier. Effect modification is statistically tested as the interaction effect between the predictor and the modifier. In repeated measures studies (with more than two time points), higher-order (three-way) interactions must be considered to test effect modification by adding time to the interaction terms. Custom fitting and constructing these repeated measures models are difficult and time consuming, especially with respect to estimating post-fitting contrasts. With the advancement of the LSMESTIMATE statement in SAS®, a simplified approach can be used to custom test for higher-order interactions with post-fitting contrasts within a mixed model framework. This paper provides a simulated example with tips and techniques for using an application of the nonpositional syntax of the LSMESTIMATE statement to test effect modification in repeated measures studies. This approach, which is applicable to exploring modifiers in randomized controlled trials (RCTs), goes beyond the treatment effect on outcome to a more functional understanding of the factors that can enhance, reduce, or change this relationship. Using this technique, we can easily identify differential changes for specific subgroups of individuals or patients that subsequently impact treatment decision making. We provide examples of conventional approaches to higher-order interaction and post-fitting tests using the ESTIMATE statement and compare and contrast this to the nonpositional syntax of the LSMESTIMATE statement. The merits and limitations of this approach are discussed.
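To give a flavor of the nonpositional syntax, here is a simplified two-way sketch (hypothetical data set and variables; the paper extends this to three-way interactions by adding the modifier):

   proc mixed data=trial;
      class trt time subject;
      model y = trt|time;
      repeated time / subject=subject type=cs;
      /* difference in differences: does the TRT effect change from TIME 1 to TIME 2? */
      /* each bracket is [coefficient, TRT-level TIME-level]                          */
      lsmestimate trt*time 'change in trt effect over time'
                  [ 1, 1 2] [-1, 1 1] [-1, 2 2] [ 1, 2 1] / cl;
   run;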
Read the paper (PDF). | Download the data file (ZIP).
Pronabesh DasMahapatra, PatientsLikeMe Inc.
Ryan Black, NOVA Southeastern University
Paper 3406-2015:
Modeling to Improve the Customer Unit Target Selection for Inspections of Commercial Losses in the Brazilian Electric Sector: The case of CEMIG
Electricity is an extremely important product for society. In Brazil, the electric sector is regulated by ANEEL (Agência Nacional de Energia Elétrica), and one of the regulated aspects is power loss in the distribution system. In 2013, 13.99% of all injected energy was lost in the Brazilian system. Commercial loss is one of the power loss classifications, which can be countered by inspections of the electrical installation in a search for irregularities in power meters. CEMIG (Companhia Energética de Minas Gerais) currently serves approximately 7.8 million customers, which makes it unfeasible (in financial and logistic terms) to inspect all customer units. Thus, the ability to select potential inspection targets is essential. In this paper, logistic regression models, decision tree models, and the Ensemble model were used to improve the target selection process in CEMIG. The results indicate an improvement in the positive predictive value from 35% to 50%.
Read the paper (PDF).
Sergio Henrique Ribeiro, Cemig
Iguatinan Monteiro, CEMIG
Paper 3359-2015:
Modelling Operational Risk Using Extreme Value Theory and Skew t-Copulas via Bayesian Inference Using SAS®
Operational risk losses are heavy tailed and likely to be asymmetric and extremely dependent among business lines and event types. We propose a new methodology to assess, in a multivariate way, the asymmetry and extreme dependence between severity distributions and to calculate the capital for operational risk. This methodology simultaneously uses several parametric distributions and an alternative mix distribution (the lognormal for the body of losses and the generalized Pareto distribution for the tail) via the extreme value theory using SAS®; the multivariate skew t-copula applied for the first time to operational losses; and the Bayesian inference theory to estimate new n-dimensional skew t-copula models via Markov chain Monte Carlo (MCMC) simulation. This paper analyzes a new operational loss data set, SAS® Operational Risk Global Data (SAS OpRisk Global Data), to model operational risk at international financial institutions. All of the severity models are constructed in SAS® 9.2. We implement PROC SEVERITY and PROC NLMIXED and this paper describes this implementation.
Read the paper (PDF).
Betty Johanna Garzon Rozo, The University of Edinburgh
Paper SAS4646-2015:
Modern Forecast Server
SAS® Forecast Server provides easy and automatic large-scale forecasting, which enables organizations to commit fewer resources to the process, reduce human touch interaction and minimize the biases that contaminate forecasts. SAS Forecast Server Client represents the modernization of the graphical user interface for SAS Forecast Server. This session will describe and demonstrate this new client, including new features, such as demand classification, and overall functionality.
Udo Sglavo, SAS
Paper 1344-2015:
Modernise Your SAS® Platform
Organisations find SAS® upgrades and migration projects come with risk, costs, and challenges to solve. The benefits are enticing new software capabilities such as SAS® Visual Analytics, which help maintain your competitive advantage. An interesting conundrum. This paper explores how to evaluate the benefits and plan the project, as well as how the cloud option impacts modernisation. The author presents with the experience of leading numerous migration and modernisation projects from the leading UK SAS Implementation Partner.
Read the paper (PDF).
David Shannon, Amadeus Software
Paper SAS1715-2015:
More Data, Less Chatter: Improving Performance on z/OS via IBM zHPF
This paper describes how we reduced elapsed time for the third maintenance release for SAS® 9.4 by as much as 22% by using the High Performance FICON for IBM System z (zHPF) facility to perform I/O for SAS® files on IBM mainframe systems. The paper details the performance improvements, internal testing to quantify improvements, and the customer actions needed to enable zHPF on their system. The benefits of zHPF are discussed within the larger context of other techniques that a customer can use to accelerate processing of SAS files.
Read the paper (PDF).
Lewis King, SAS
Fred Forst
Paper 3430-2015:
Multilevel Models for Categorical Data Using SAS® PROC GLIMMIX: The Basics
Multilevel models (MLMs) are frequently used in social and health sciences where data are typically hierarchical in nature. However, the commonly used hierarchical linear models (HLMs) are appropriate only when the outcome of interest is normally distributed. When you are dealing with outcomes that are not normally distributed (binary, categorical, ordinal), a transformation and an appropriate error distribution for the response variable needs to be incorporated into the model. Therefore, hierarchical generalized linear models (HGLMs) need to be used. This paper provides an introduction to specifying HGLMs using PROC GLIMMIX, following the structure of the primer for HLMs previously presented by Bell, Ene, Smiley, and Schoeneberger (2013). A brief introduction into the field of multilevel modeling and HGLMs with both dichotomous and polytomous outcomes is followed by a discussion of the model-building process and appropriate ways to assess the fit of these models. Next, the paper provides a discussion of PROC GLIMMIX statements and options as well as concrete examples of how PROC GLIMMIX can be used to estimate (a) two-level organizational models with a dichotomous outcome and (b) two-level organizational models with a polytomous outcome. These examples use data from High School and Beyond (HS&B), a nationally representative longitudinal study of American youth. For each example, narrative explanations accompany annotated examples of the GLIMMIX code and corresponding output.
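For orientation, a minimal two-level model with a dichotomous outcome might look like the following (data set and variable names are hypothetical; the paper's worked examples use the HS&B data):

   proc glimmix data=students method=laplace;
      class school_id;
      model passed(event='1') = ses sector / dist=binary link=logit solution;
      random intercept / subject=school_id;   /* random school intercepts */
   run;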
Read the paper (PDF).
Mihaela Ene, University of South Carolina
Bethany Bell, University of South Carolina
Genine Blue, University of South Carolina
Elizabeth Leighton, University of South Carolina
Paper 2720-2015:
Multinomial Logistic Model for Long-Term Value
Customer Long-Term Value (LTV) is a concept that is readily explained at a high level to marketing management of a company, but its analytic development is complex. This complexity involves the need to forecast customer behavior well into the future. This behavior includes the timing, frequency, and profitability of a customer's future purchases of products and services. This paper describes a method for computing LTV. First, a multinomial logistic regression provides probabilities for time-of-first-purchase, time-of-second-purchase, and so on, for each customer. Then the profits for the first purchase, second purchase, and so on, are forecast but only after adjustment for non-purchaser selection bias. Finally, these component models are combined in the LTV formula.
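A hedged sketch of the first component, a generalized logit (multinomial) model for time-of-first-purchase, with hypothetical variables and without the selection-bias adjustment or the final LTV formula:

   proc logistic data=customers;
      class segment / param=ref;
      model first_purchase_period = segment tenure prior_spend / link=glogit;
      output out=probs predprobs=(individual);   /* one predicted probability per response level */
   run;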
Read the paper (PDF).
Bruce Lund, Marketing Associates, LLC
Paper 2081-2015:
Multiple Imputation Using the Fully Conditional Specification Method: A Comparison of SAS®, Stata, IVEware, and R
This presentation emphasizes use of SAS® 9.4 to perform multiple imputation of missing data using the PROC MI Fully Conditional Specification (FCS) method with subsequent analysis using PROC SURVEYLOGISTIC and PROC MIANALYZE. The data set used is based on a complex sample design. Therefore, the examples correctly incorporate the complex sample features and weights. The demonstration is then repeated in Stata, IVEware, and R for a comparison of major software applications that are capable of multiple imputation using FCS or equivalent methods and subsequent analysis of imputed data sets based on complex sample design data.
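A skeletal version of the SAS side of the workflow (variables, strata, cluster, and weight names are hypothetical, and the analysis model is deliberately minimal):

   proc mi data=survey nimpute=5 out=imputed seed=20150426;
      class gender;
      fcs logistic(gender) reg(income age);    /* fully conditional specification */
      var gender income age;
   run;

   proc surveylogistic data=imputed;
      by _imputation_;                         /* analyze each completed data set */
      strata stratum;  cluster psu;  weight wt;
      model outcome(event='1') = income age;
      ods output ParameterEstimates=parmest;
   run;

   proc mianalyze parms=parmest;               /* combine results across imputations */
      modeleffects Intercept income age;
   run;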
Read the paper (PDF).
Patricia Berglund, University of Michigan
Paper 3264-2015:
Multiple Product Affinity Makes Much More Sense
Retailers proactively seek a data-driven approach to provide customized product recommendations to guarantee sales increases and customer loyalty. Product affinity models have been recognized as one of the vital tools for this purpose. The algorithm assigns a customer to a product affinity group when the likelihood of purchasing is the highest and meets a minimum absolute requirement. However, in practice, valuable customers (up to 30% of the total universe) who buy across multiple product categories with two or more balanced product affinity likelihoods remain unassigned and cannot be targeted effectively with product recommendations. This paper presents multiple product affinity models that are developed using SAS® macro language to address the problem. In this paper, we demonstrate how the innovative assignment algorithm successfully assigns the undefined customers to appropriate multiple product affinity groups using nationwide retailer transactional data. In addition, the result shows that potential customers establish loyalty through migration from a single to multiple product affinity groups. This comprehensive and insightful business solution is shared in this paper. The paper also provides a clustering algorithm and nonparametric tree model for model building. The SAS macro code for the customer assignment is provided in an appendix.
Read the paper (PDF).
Hsin-Yi Wang, Alliance Data Systems
Paper 2900-2015:
Multiple Ways to Detect Differential Item Functioning in SAS®
Differential item functioning (DIF), as an assessment tool, has been widely used in quantitative psychology, educational measurement, business management, insurance, and health care. The purpose of DIF analysis is to detect response differences of items in questionnaires, rating scales, or tests across different subgroups (for example, gender) and to ensure the fairness and validity of each item for those subgroups. The goal of this paper is to demonstrate several ways to conduct DIF analysis by using different SAS® procedures (PROC FREQ, PROC LOGISTIC, PROC GENMOD, PROC GLIMMIX, and PROC NLMIXED) and their applications. There are three general methods to examine DIF: generalized Mantel-Haenszel (MH), logistic regression, and item response theory (IRT). The SAS® System provides flexible procedures for all these approaches. There are two types of DIF: uniform DIF, which remains consistent across ability levels, and non-uniform DIF, which varies across ability levels. Generalized MH is a nonparametric method and is often used to detect uniform DIF, while the other two are parametric methods and examine both uniform and non-uniform DIF. In this study, I first describe the underlying theories and mathematical formulations for each method. Then I show the SAS statements, input data format, and SAS output for each method, followed by a detailed demonstration of the differences among the three methods. Specifically, PROC FREQ is used to calculate generalized MH only for dichotomous items. PROC LOGISTIC and PROC GENMOD are used to detect DIF by using logistic regression. PROC NLMIXED and PROC GLIMMIX are used to examine DIF by applying an exploratory item response theory model. Finally, I use SAS/IML® to call two R packages (that is, difR and lordif) to conduct DIF analysis and then compare the results between SAS procedures and R packages. An example data set, the Verbal Aggression assessment, which includes 316 subjects and 24 items, is used in this study. Following the general DIF analysis, the male group is used as the reference group, and the female group is used as the focal group. All the analyses are conducted with SAS® 9.3 and R 2.15.3. The paper closes with the conclusion that the SAS System provides flexible and efficient ways to conduct DIF analysis. However, it is essential for SAS users to understand the underlying theories and assumptions of different DIF methods and apply them appropriately in their DIF analyses.
Read the paper (PDF).
Yan Zhang, Educational Testing Service
N
Paper 3253-2015:
Need Additional Statistics in That Report? ODS OUTPUT to the Rescue!
You might be familiar with or experienced in writing or running reports using PROC REPORT, PROC TABULATE, or other methods of report generation. These reporting methods are often very flexible, but they can be limited in the statistics that are available as options for inclusion in the resulting output. SAS® provides the capability to produce a variety of statistics through Base SAS® and SAS/STAT® procedures by using ODS OUTPUT. These procedures include statistics from PROC CORR, PROC FREQ, and PROC UNIVARIATE in Base SAS, as well as PROC GLM, PROC LIFETEST, PROC MIXED, PROC LOGISTIC, and PROC TTEST in SAS/STAT. A number of other procedures can also produce useful ODS OUTPUT objects. Commonly requested statistics for reports include p-values, confidence intervals, and test statistics. These values can be computed with the appropriate procedure, and ODS OUTPUT can then be used to write the desired information to a data set and combine it with the other data used to produce the report. Examples that demonstrate how to easily generate the desired statistics and include them in the final reports are provided and discussed.
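A small sketch of the pattern (data set and variable names are hypothetical): capture the t-test results in a data set that can then be merged with the reporting data.

   ods output TTests=ttest_results;   /* ODS table containing tValue, DF, and Probt */
   proc ttest data=lab_values;
      class treatment;
      var change_from_baseline;
   run;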
Read the paper (PDF).
Debbie Buck, inVentiv Health Clinical
Paper 1601-2015:
Nesting Multiple Box Plots and BLOCKPLOTs Using Graph Template Language and Lattice Overlay
There are times when the objective is to provide a summary table and graph for several quality improvement measures on a single page to allow leadership to monitor the performance of measures over time. The challenges were to decide which SAS® procedures to use, how to integrate multiple SAS procedures to generate a set of plots and summary tables within one page, and how to determine whether to use box plots or series plots of means or medians. We considered the SGPLOT and SGPANEL procedures, and Graph Template Language (GTL). As a result, given the nature of the request, the decision led us to use GTL and the SGRENDER procedure in the %BXPLOT2 macro. For each measure, we used the BOXPLOTPARM statement to display a series of box plots and the BLOCKPLOT statement for a summary table. Then we used the LAYOUT OVERLAY statement to combine the box plots and summary tables on one page. The results display a summary table (BLOCKPLOT) above each box plot series for each measure on a single page. Within each box plot series, there is an overlay of a system-level benchmark value and a series line connecting the median values of each box plot. The BLOCKPLOT contains descriptive statistics per time period illustrated in the associated box plot. The discussion points focus on techniques for nesting the lattice overlay with box plots and BLOCKPLOTs in GTL and some reasons for choosing box plots versus series plots of medians or means.
Read the paper (PDF).
Greg Stanek, Fannie Mae
Paper SAS1575-2015:
New Macro Features Added in SAS® 9.3 and SAS® 9.4
This paper describes the new features added to the macro facility in SAS® 9.3 and SAS® 9.4. New features described include the /READONLY option for macro variables, the %SYSMACEXIST macro function, the %PUT &= feature, and new automatic macro variables such as &SYSTIMEZONEOFFSET.
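A few one-line illustrations of these features (macro and variable names are hypothetical):

   %global / readonly study_phase = III;    /* SAS 9.3: read-only macro variable        */

   %macro report;  %put Running report...;  %mend report;
   %put %sysmacexist(report);               /* SAS 9.3: 1 if the compiled macro exists  */

   %let cutoff = 2015-04-26;
   %put &=cutoff;                           /* SAS 9.3: writes CUTOFF = 2015-04-26      */

   %put &=systimezoneoffset;                /* SAS 9.4: current time zone offset        */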
Read the paper (PDF).
Rick Langston, SAS
Paper SAS1912-2015:
Next Generation: Using SAS® Decision Manager to Modernize and Improve Your Operational Decisions
This paper takes you through the steps for ways to modernize your analytical business processes using SAS® Decision Manager, a centrally managed, easy-to-use interface designed for business users. See how you can manage your data, business rules, and models, and then combine those components to test and deploy as flexible decisions options within your business processes. Business rules, which usually exist today in SAS® code, Java code, SQL scripts, or other types of scripts, can be managed as corporate assets separate from the business process. This will add flexibility and speed for making decisions as policies, customer base, market conditions, or other business requirements change. Your business can adapt quickly and still be compliant with regulatory requirements and support overall process governance and risk. This paper shows how to use SAS Decision Manager to build business rules using a variety of methods including analytical methods and straightforward explicit methods. In addition, we demonstrate how to manage or monitor your operational analytical models by using automation to refresh your models as data changes over time. Then we show how to combine your data, business rules, and analytical models together in a decision flow, test it, and learn how to deploy in batch or real time to embed decision results directly into your business applications or processes at the point of decision.
Read the paper (PDF).
Steve Sparano, SAS
Paper 3455-2015:
Nifty Uses of SQL Reflexive Join and Subquery in SAS®
SAS® SQL is so powerful that you hardly miss using Oracle PL/SQL. One SAS SQL forte can be found in using the SQL reflexive join. Another area of SAS SQL strength is the SQL subquery concept. The focus of this paper is to show alternative approaches to data reporting and to show how to surface data quality problems using reflexive join and subquery SQL concepts. The target audience for this paper is the intermediate SAS programmer or the experienced ANSI SQL programmer new to SAS programming.
Read the paper (PDF).
Cynthia Trinidad, Theorem Clinical Research
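A minimal sketch of both ideas, with hypothetical table and column names (DEMOG, LABS, SUBJECT_ID):

proc sql;
   /* Reflexive (self) join: the same subject entered under different sites */
   select a.subject_id, a.site as site_a, b.site as site_b
   from demog as a, demog as b
   where a.subject_id = b.subject_id
     and a.site < b.site;

   /* Subquery: lab results above the overall average */
   select subject_id, test, result
   from labs
   where result > (select avg(result) from labs);
quit;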
Paper SAS1866-2015:
Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?
Well, Hadoop community, now that you have your data in Hadoop, how are you staging your analytical base tables? In my discussions with clients about this, we all agree on one thing: Data sizes stored in Hadoop prevent us from moving that data to a different platform in order to generate the analytical base tables. To address this dilemma, I want to introduce to you the SAS® In-Database Code Accelerator for Hadoop.
Read the paper (PDF).
Steven Sober, SAS
Donna DeCapite, SAS
O
Paper 2680-2015:
Object Oriented Program Design in DS2
The DS2 programming language was introduced as part of the SAS® 9.4 release. Although this new language introduced many significant advancements, one of the most overlooked features is the addition of object-oriented programming constructs. Specifically, the addition of user-defined packages and methods enables programmers to create their own objects, greatly increasing the opportunity for code reuse and decreasing both development and QA duration. In addition, using this object-oriented approach provides a powerful design methodology where objects closely resemble the real-world entities that they model, leading to programs that are easier to understand and maintain. This paper introduces the object-oriented programming paradigm in a three-step manner. First, the key object-oriented features found in the DS2 language are introduced, and the value each provides is discussed. Next, these object-oriented concepts are demonstrated through the creation of a blackjack simulation where the players, the dealer, and the deck are modeled and coded as objects. Finally, a credit risk scoring object is presented to demonstrate the application of this approach in a real-world setting.
Read the paper (PDF).
Shaun Kaufmann, Farm Credit Canada
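A minimal DS2 sketch of a user-defined package with methods, in the spirit of the objects described above (the TALLY package is illustrative, not code from the paper):

proc ds2;
   /* A user-defined package: a simple running-total object */
   package tally / overwrite=yes;
      declare double total;

      method tally();              /* constructor */
         total = 0;
      end;

      method add(double amount);
         total = total + amount;
      end;

      method get() returns double;
         return total;
      end;
   endpackage;

   data _null_;
      declare package tally t();   /* instantiate the object */
      declare double result;
      method run();
         t.add(5);
         t.add(7.5);
         result = t.get();
         put 'Running total: ' result;
      end;
   enddata;
run;
quit;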
Paper 3425-2015:
Obtaining a Unique View of a Company: Reports in SAS® Visual Analytics
SAS® Visual Analytics provides users with a unique view of their company by monitoring products, and identifying opportunities and threats, making it possible to hold recommendations, set a price strategy, and accelerate or brake product growth. In SAS Visual Analytics, you can see in one report the return required, a competitor analysis, and a comparison of realized results versus predicted results. Reports can be used to obtain a vision of the whole company and include several hierarchies (for example, by business unit, by segment, by product, by region, and so on). SAS Visual Analytics enables senior executives to easily and quickly view information. You can also use tracking indicators that are used by the insurance market.
Read the paper (PDF).
Jacqueline Fraga, SulAmerica Cia Nacional de Seguros
Paper 3460-2015:
One Check Box to Happiness: Enabling and Analyzing SAS® LASR™ Analytic Server Logs in SAS® Visual Analytics
EBI administrators who are new to SAS® Visual Analytics and used to the logging capability of the SAS® OLAP Server might be wondering how they can get their SAS® LASR™ Analytic Server to produce verbose log files. While the SAS LASR Analytic Server logs differ from those produced by the SAS OLAP Server, the SAS LASR Analytic Server log contains information about each request made to LASR tables and can be a great data source for administrators looking to learn more about how their SAS Visual Analytics deployments are being used. This session will discuss how to quickly enable logging for your SAS LASR Analytic Server in SAS Visual Analytics 6.4. You will see what information is available to a SAS administrator in these logs, how they can be parsed into data sets with SAS code, then loaded back into the SAS LASR Analytic Server to create SAS Visual Analytics explorations and reports.
Read the paper (PDF).
Chris Vincent, Western Kentucky University
Paper SAS1405-2015:
One Report, Many Languages: Using SAS® Visual Analytics 7.1 to Localize Your Reports
Use SAS® to communicate with your colleagues and customers anywhere in the world, even if you do not speak the same language! In today's global economy, most of us can no longer assume that everyone in our company has an office in the same building, works in the same country, or speaks the same language. While it is vital to quickly analyze and report on large amounts of data, we must present our reports in a way that our readers can understand. New features in SAS® Visual Analytics 7.1 give you the power to generate reports quickly and translate them easily so that your readers can comprehend the results. This paper describes how SAS® Visual Analytics Designer 7.1 delivers the Power to Know® in the language preferred by the report reader!
Read the paper (PDF).
Will Ballard, SAS
Paper SAS1520-2015:
Operations Integration, Audits, and Performance Analysis: Getting the Most Out of SAS® Environment Manager
The SAS® Environment Manager Service Architecture expands on the core monitoring capabilities of SAS® Environment Manager delivered in SAS® 9.4. Multiple sources of data available in the SAS® Environment Manager Data Mart--traditional operational performance metrics, events, and ARM, audit, and access logs--together with built-in and custom reports put powerful capabilities into the hands of IT operations. This paper introduces the concept of service-oriented event identification and discusses how to use the new architecture and tools effectively as well as the wealth of data available in the SAS Environment Manager Data Mart. In addition, extensions for importing new data, writing custom reports, instrumenting batch SAS® jobs, and leveraging and extending auditing capabilities are explored.
Read the paper (PDF).
Bob Bonham, SAS
Bryan Ellington, SAS
Paper 2881-2015:
Optimizing Room Assignments at Disney Resorts with SAS/OR®
Walt Disney World Resort is home to four theme parks, two water parks, five golf courses, 26 owned-and-operated resorts, and hundreds of merchandise and dining experiences. Every year millions of guests stay at Disney resorts to enjoy the Disney Experience. Assigning physical rooms to resort and hotel reservations is a key component to maximizing operational efficiency and guest satisfaction. Solutions can range from automation to optimization programs. The volume of reservations and the variety and uniqueness of guest preferences across the Walt Disney World Resort campus pose an opportunity to solve a number of reasonably difficult room assignment problems by leveraging operations research techniques. For example, a guest might prefer a room with specific bedding and adjacent to certain facilities or amenities. When large groups, families, and friends travel together, they often want to stay near each other using specific room configurations. Rooms might be assigned to reservations in advance and upon request at check-in. Using mathematical programming techniques, the Disney Decision Science team has partnered with the SAS® Advanced Analytics R&D team to create a room assignment optimization model prototype and implement it in SAS/OR®. We describe how this collaborative effort has progressed over the course of several months, discuss some of the approaches that have proven to be productive for modeling and solving this problem, and review selected results.
Haining Yu, Walt Disney Parks & Resorts
Hai Chu, Walt Disney Parks & Resorts
Tianke Feng, Walt Disney Parks & Resorts
Matthew Galati, SAS
Ed Hughes, SAS
Ludwig Kuznia, Walt Disney Parks & Resorts
Rob Pratt, SAS
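A toy PROC OPTMODEL sketch of the underlying assignment idea, with made-up guests, rooms, and preference scores (it is not the Disney model):

proc optmodel;
   set <str> GUESTS = {'A','B'};
   set <str> ROOMS  = {'R101','R102','R103'};
   num pref {GUESTS, ROOMS} = [5 3 1
                               2 4 6];   /* hypothetical preference scores */

   var Assign {GUESTS, ROOMS} binary;

   max TotalPref = sum {g in GUESTS, r in ROOMS} pref[g,r] * Assign[g,r];

   con OneRoomPerGuest {g in GUESTS}:
      sum {r in ROOMS} Assign[g,r] = 1;
   con OneGuestPerRoom {r in ROOMS}:
      sum {g in GUESTS} Assign[g,r] <= 1;

   solve with milp;
   print Assign;
quit;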
Paper 4300-2015:
"Out Here" Forecasting: A Retail Case Study
Faced with diminishing forecast returns from the forecast engine within the existing replenishment application, Tractor Supply Company (TSC) engaged SAS® Institute to deliver a fully integrated forecasting solution that promised a significant improvement in chain-wide forecast accuracy. The end-to-end forecast implementation, including problems faced, solutions delivered, and results realized, will be explored.
Read the paper (PDF).
Chris Houck, SAS
Paper 3296-2015:
Out of Control! A SAS® Macro to Recalculate QC Statistics
SAS/QC® provides procedures, such as PROC SHEWHART, to produce control charts with centerlines and control limits. When quality improvement initiatives create an out-of-control process of improvement, centerlines and control limits need to be recalculated. While this is not a complicated process, producing many charts with multiple centerline shifts can quickly become difficult. This paper illustrates the use of a macro to efficiently compute centerlines and control limits when one or more recalculations are needed for multiple charts.
Read the paper (PDF).
Jesse Pratt, Cincinnati Children's Hospital Medical Center
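One possible shape for such a macro, shown here as a hedged sketch with hypothetical parameter values; it simply estimates limits separately before and after a known shift point:

%macro recalc_limits(data=, var=, subgroup=, shiftpoint=);
   /* Phase 1: limits estimated from data before the process shift */
   proc shewhart data=&data(where=(&subgroup < &shiftpoint));
      xchart &var*&subgroup / outlimits=limits_pre;
   run;
   /* Phase 2: limits recalculated from data after the shift */
   proc shewhart data=&data(where=(&subgroup >= &shiftpoint));
      xchart &var*&subgroup / outlimits=limits_post;
   run;
%mend recalc_limits;

/* Example call with hypothetical data set and variable names */
%recalc_limits(data=work.process, var=thickness, subgroup=batch, shiftpoint=25);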
P
Paper 3459-2015:
PROC CATALOG, the Wish Book SAS® Procedure
SAS® data sets have PROC DATASETS, and SAS catalogs have PROC CATALOG. Find out what the little-known PROC CATALOG can do for you!
Read the paper (PDF).
Louise Hadden, Abt Associates Inc.
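For example, a small sketch (the backup catalog name is illustrative):

proc format;                         /* create an entry so WORK.FORMATS exists */
   value yesno 1 = 'Yes' 0 = 'No';
run;

proc catalog catalog=work.formats;
   contents;                         /* list every entry in the catalog */
   copy out=work.fmtbackup;          /* copy the entries to another catalog */
run;
quit;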
Paper 3370-2015:
PROC RANK, PROC SQL, PROC FORMAT, and PROC GMAP Team Up and a (Map) Legend Is Born!
The task was to produce a figure legend that gave the quintile ranges of a continuous measure corresponding to each color on a five-color choropleth US map. Actually, we needed to produce the figures and associated legends for several dozen maps for several dozen different continuous measures and time periods, as well as create the associated alt text for compliance with Section 508. So, the process needed to be automated. A method was devised using PROC RANK to generate the quintiles, PROC SQL to get the data value ranges within each quintile, and PROC FORMAT (with the CNTLIN= option) to generate and store the legend labels. The resulting data files and format catalogs were used to generate both the maps (with legends) and associated alt text. Then, these processes were rolled into a macro to apply the method for the many different maps and their legends. Each part of the method is quite simple--even mundane--but together, these techniques enabled us to standardize and automate an otherwise very tedious process. The same basic strategy could be used whenever you need to dynamically generate data buckets and keep track of the bucket boundaries (for producing labels, map legends, or alt text or for benchmarking future data against the stored categories).
Read the paper (PDF).
Christianna Williams, Self-Employed
Louise Hadden, Abt Associates Inc.
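A condensed sketch of the RANK-SQL-FORMAT pipeline, with hypothetical data set and variable names (STATE_RATES, RATE):

/* Step 1: assign quintiles (0-4) to each state's rate */
proc rank data=state_rates out=ranked groups=5;
   var rate;
   ranks quintile;
run;

/* Step 2: get the data-value range within each quintile,
   shaped as a CNTLIN data set */
proc sql;
   create table cntlin as
   select put(quintile, 1.) as start,
          put(quintile, 1.) as end,
          catx(' - ', put(min(rate), 8.1), put(max(rate), 8.1)) as label,
          'QNTLFMT' as fmtname,
          'N' as type
   from ranked
   group by quintile;
quit;

/* Step 3: build a format from the ranges for use in the map legend */
proc format cntlin=cntlin;
run;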
Paper 3154-2015:
PROC SQL for PROC SUMMARY Stalwarts
One of the fascinating features of SAS® is that the software often provides multiple ways to accomplish the same task. A perfect example of this is the aggregation and summarization of data across multiple rows or BY groups of interest. These groupings can be study participants, time periods, geographical areas, or just about any type of discrete classification that you want. While many SAS programmers might be accustomed to accomplishing these aggregation tasks with PROC SUMMARY (or equivalently, PROC MEANS), PROC SQL can also do a bang-up job of aggregation--often with less code and fewer steps. This step-by-step paper explains how to use PROC SQL for a variety of summarization and aggregation tasks. It uses a series of concrete, task-oriented examples to do so. The presentation style is similar to that used in the author's previous paper, 'PROC SQL for DATA Step Die-Hards.'
Read the paper (PDF).
Christianna Williams, Self-Employed
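For instance, the same aggregation both ways, using SASHELP.CLASS as stand-in data:

/* PROC SUMMARY approach */
proc summary data=sashelp.class nway;
   class sex;
   var height weight;
   output out=stats_means mean= / autoname;
run;

/* Equivalent PROC SQL aggregation */
proc sql;
   create table stats_sql as
   select sex,
          count(*)     as _freq_,
          mean(height) as height_mean,
          mean(weight) as weight_mean
   from sashelp.class
   group by sex;
quit;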
Paper 3340-2015:
Performing Efficient Transposes on Large Teradata Tables Using SQL Explicit Pass-Through
It is a common task to reshape your data from long to wide for the purpose of reporting or analytical modeling and PROC TRANSPOSE provides a convenient way to accomplish this. However, when performing the transpose action on large tables stored in a database management system (DBMS) such as Teradata, the performance of PROC TRANSPOSE can be significantly compromised. In this case, it is more efficient for the DBMS to perform the transpose task. SAS® provides in-database processing technology in PROC SQL, which allows the SQL explicit pass-through method to push some or all of the work to the DBMS. This technique has facilitated integration between SAS and a wide range of data warehouses and databases, including Teradata, EMC Greenplum, IBM DB2, IBM Netezza, Oracle, and Aster Data. This paper uses the Teradata database as an example DBMS and explains how to transpose a large table that resides in it using the SQL explicit pass-through method. The paper begins with comparing the execution time using PROC TRANSPOSE with the execution time using SQL explicit pass-through. From this comparison, it is clear that SQL explicit pass-through is more efficient than the traditional PROC TRANSPOSE when transposing Teradata tables, especially large tables. The paper explains how to use the SQL explicit pass-through method and discusses the types of data columns that you might need to transpose, such as numeric and character. The paper presents a transpose solution for these types of columns. Finally, the paper provides recommendations on packaging the SQL explicit pass-through method by embedding it in a macro. SAS programmers who are working with data stored in an external DBMS and who would like to efficiently transpose their data will benefit from this paper.
Read the paper (PDF).
Tao Cheng, Accenture
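A hedged sketch of the pass-through pattern with a CASE-based pivot; the connection options, library, and column names are placeholders:

proc sql;
   connect to teradata (server=tdprod user=&td_user password=&td_pw);

   create table work.wide as
   select * from connection to teradata (
      select acct_id,
             max(case when mth = 1 then balance end) as bal_jan,
             max(case when mth = 2 then balance end) as bal_feb,
             max(case when mth = 3 then balance end) as bal_mar
      from   mydb.balances_long
      group by acct_id
   );

   disconnect from teradata;
quit;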
Paper 2440-2015:
Permit Me to Permute: A Basic Introduction to Permutation Tests with SAS/IML® Software
If your data do not meet the assumptions for a standard parametric test, you might want to consider using a permutation test. By randomly shuffling the data and recalculating a test statistic, a permutation test can calculate the probability of getting a value equal to or more extreme than an observed test statistic. With the power of matrices, vectors, functions, and user-defined modules, the SAS/IML® language is an excellent option. This paper covers two examples of permutation tests: one for paired data and another for repeated measures analysis of variance. For those new to SAS/IML® software, this paper offers a basic introduction and examples of how effective it can be.
Read the paper (PDF).
John Vickery, North Carolina State University
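A minimal SAS/IML sketch of a paired (sign-flip) permutation test with made-up differences; it illustrates the general approach rather than the paper's code:

proc iml;
   /* Hypothetical paired differences between two conditions */
   diff    = {2.1, -0.4, 1.8, 0.9, 1.5, -0.2, 2.4, 0.7};
   obsStat = mean(diff);

   call randseed(20150426);
   nPerm     = 10000;
   permStats = j(nPerm, 1, .);
   signs     = j(nrow(diff), 1, .);

   do i = 1 to nPerm;
      call randgen(signs, "Bernoulli", 0.5);      /* random 0/1 draws        */
      permStats[i] = mean((2*signs - 1) # diff);  /* random sign flips       */
   end;

   pValue = mean(abs(permStats) >= abs(obsStat)); /* two-sided p-value       */
   print obsStat pValue;
quit;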
Paper 3080-2015:
Picture-Perfect Graphing with Graph Template Language
Do you have reports based on SAS/GRAPH® procedures, customized with multiple GOPTIONS? Do you dream of those same graphs existing in a GOPTIONS and ANNOTATE-free world? Re-creating complex graphs using statistical graphics (SG) procedures is not only possible, but much easier than you think! Using before and after examples, I discuss how the graphs were created using the combination of Graph Template Language (GTL) and the SG procedures. This method produces graphs that are nearly indistinguishable from the original. This method simplifies the code required to make complex graphs, allows for maximum re-usability for graphics code, and enables changes to cascade to multiple reports simultaneously.
Read the paper (PDF).
Julie VanBuskirk, Nurtur
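As a small illustration of the GOPTIONS- and ANNOTATE-free style (not one of the paper's before-and-after examples):

ods graphics / width=5in height=3in;

proc sgplot data=sashelp.class;
   scatter x=height y=weight / group=sex;
   reg x=height y=weight / nomarkers;
   xaxis label='Height (in)';
   yaxis label='Weight (lb)';
run;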
Paper 3241-2015:
Pin the SAS® Tail on the Microsoft Excel Donkey: Automatically Sizing and Positioning SAS Graphics for Excel
Okay, you've read all the books, manuals, and papers and can produce graphics with SAS/GRAPH® and Output Delivery System (ODS) Graphics with the best of them. But how do you handle the Final Mile problem--getting your images generated in SAS® sized just right and positioned just so in Microsoft Excel? This paper presents a method of doing so that employs SAS Integration Technologies and Excel Visual Basic for Applications (VBA) to produce SAS graphics and automatically embed them in Excel worksheets. This technique might be of interest to all skill levels. It uses Base SAS®, SAS/GRAPH, ODS Graphics, the SAS macro facility, SAS® Integration Technologies, Microsoft Excel, and VBA.
Read the paper (PDF).
Ted Conway, Self
Paper SAS1897-2015:
Planning for the Worst--SAS® Grid Manager and Disaster Recovery
Many companies use geographically dispersed data centers running SAS® Grid Manager to provide 24/7 SAS® processing capability with the thought that if a disaster takes out one of the data centers, another data center can take over the SAS processing. To accomplish this, careful planning must take into consideration hardware, software, and communication infrastructure along with the SAS workload. This paper looks into some of the options available, focusing on using SAS Grid Manager to manage the disaster workload shift.
Read the paper (PDF).
Glenn Horton, SAS
Cheryl Doninger, SAS
Doug Haigh, SAS
Paper 3225-2015:
Portfolio Construction with OPTMODEL
Investment portfolios and investable indexes determine their holdings according to stated mandate and methodology. Part of that process involves compliance with certain allocation constraints. These constraints are developed internally by portfolio managers and index providers, imposed externally by regulations, or both. An example of the latter is the U.S. Internal Revenue Code (25/50) concentration constraint, which relates to a regulated investment company (RIC). These codes state that at the end of each quarter of a RIC's tax year, the following constraints should be met: 1) No more than 25 percent of the value of the RIC's assets might be invested in a single issuer. 2) The sum of the weights of all issuers representing more than 5 percent of the total assets should not exceed 50 percent of the fund's total assets. While these constraints result in a non-continuous model, compliance with concentration constraints can be formalized by reformulating the model as a series of continuous non-linear optimization problems solved using PROC OPTMODEL. The model and solution are presented in this paper. The approach discussed has been used in constructing investable equity indexes.
Read the paper (PDF).
Taras Zlupko, CRSP, University of Chicago
Robert Spatz
Paper SAS1770-2015:
Practical Applications of SAS® Simulation Studio
SAS® Simulation Studio, a component of SAS/OR® software for Microsoft Windows environments, provides powerful and versatile capabilities for building, executing, and analyzing discrete-event simulation models in a graphical environment. Its object-oriented, drag-and-drop modeling makes building and working with simulation models accessible to novice users, and its broad range of model configuration options and advanced capabilities makes SAS Simulation Studio suitable also for sophisticated, detailed simulation modeling and analysis. Although the number of modeling blocks in SAS Simulation Studio is small enough to be manageable, the number of ways in which they can be combined and connected is almost limitless. This paper explores some of the modeling methods and constructs that have proven most useful in practical modeling with SAS Simulation Studio. SAS has worked with customers who have applied SAS Simulation Studio to measure, predict, and improve system performance in many different industries, including banking, public utilities, pharmaceuticals, manufacturing, prisons, hospitals, and insurance. This paper looks at some discrete-event simulation modeling needs that arise in specific settings and some that have broader applicability, and it considers the ways in which SAS Simulation Studio modeling can meet those needs.
Read the paper (PDF). | Watch the recording.
Ed Hughes, SAS
Emily Lada, SAS
Paper 1884-2015:
Practical Implications of Sharing Data: A Primer on Data Privacy, Anonymization, and De-Identification
Researchers, patients, clinicians, and other health-care industry participants are forging new models for data-sharing in hopes that the quantity, diversity, and analytic potential of health-related data for research and practice will yield new opportunities for innovation in basic and translational science. Whether we are talking about medical records (for example, EHR, lab, notes), administrative data (claims and billing), social (on-line activity), behavioral (fitness trackers, purchasing patterns), contextual (geographic, environmental), or demographic data (genomics, proteomics), it is clear that as health-care data proliferates, threats to security grow. Beginning with a review of the major health-care data breaches in our recent history, we highlight some of the lessons that can be gleaned from these incidents. In this paper, we talk about the practical implications of data sharing and how to ensure that only the right people have the right access to the right level of data. To that end, we explore not only the definitions of concepts like data privacy, but we discuss, in detail, methods that can be used to protect data--whether inside our organization or beyond its walls. In this discussion, we cover the fundamental differences between encrypted data, 'de-identified', 'anonymous', and 'coded' data, and methods to implement each. We summarize the landscape of maturity models that can be used to benchmark your organization's data privacy and protection of sensitive data.
Read the paper (PDF). | Watch the recording.
Greg Nelson, ThotWave
Paper 3326-2015:
Predicting Hospitalization of a Patient Using SAS® Enterprise Miner™
Inpatient treatment is the most common type of treatment ordered for patients who have a serious ailment and need immediate attention. Using a data set about diabetes patients downloaded from the UCI Network Data Repository, we built a model to predict the probability that the patient will be rehospitalized within 30 days of discharge. The data has about 100,000 rows and 51 columns. In our preliminary analysis, a neural network turned out to be the best model, followed closely by the decision tree model and regression model.
Nikhil Kapoor, Oklahoma State University
Ganesh Kumar Gangarajula, Oklahoma State University
Paper 3254-2015:
Predicting Readmission of Diabetic Patients Using the High-Performance Support Vector Machine Algorithm of SAS® Enterprise Miner™ 13.1
Diabetes is a chronic condition that affects people of all ages and afflicts around 25.8 million people in the U.S. The objective of this research is to predict the probability of a diabetic patient being readmitted. The results from this research will help hospitals design a follow-up protocol to ensure that patients having a higher re-admission probability are doing well in order to promote a healthy doctor-patient relationship. The data was obtained from the Center for Machine Learning and Intelligent Systems at University of California, Irvine. The data set contains over 100,000 instances and 55 variables, such as insulin and length of stay. The data set was split into training and validation to provide an honest assessment of models. Various variable selection techniques such as stepwise regression, forward regression, LARS, and LASSO were used. Using LARS, prominent factors were identified in determining the patient readmission rate. Numerous predictive models were built: Decision Tree, Logistic Regression, Gradient Boosting, MBR, SVM, and others. The model comparison algorithm in SAS® Enterprise Miner™ 13.1 recognized that the High-Performance Support Vector Machine outperformed the other models, having the lowest misclassification rate of 0.363. The chosen model has a sensitivity of 49.7% and a specificity of 75.1% in the validation data.
Read the paper (PDF).
Hephzibah Munnangi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 3501-2015:
Predicting Transformer Lifetime Using Survival Analysis and Modeling Risk Associated with Overloaded Transformers Using SAS® Enterprise Miner™ 12.1
Utility companies in America are always challenged when it comes to knowing when their infrastructure fails. One of the most critical components of a utility company's infrastructure is the transformer. It is important to assess the remaining lifetime of transformers so that the company can reduce costs, plan expenditures in advance, and largely mitigate the risk of failure. It is also equally important to identify the high-risk transformers in advance and to maintain them accordingly in order to avoid sudden loss of equipment due to overloading. This paper uses SAS® to predict the lifetime of transformers, identify the various factors that contribute to their failure, and model the transformer into High, Medium, and Low risk categories based on load for easy maintenance. The data set from a utility company contains around 18,000 observations and 26 variables from 2006 to 2013, and contains the failure and installation dates of the transformers. The data set comprises many transformers that were installed before 2006 (there are 190,000 transformers on which several regression models are built in this paper to identify their risk of failure), but there is no age-related parameter for them. Survival analysis was performed on this left-truncated and right-censored data. The data set has variables such as Age, Average Temperature, Average Load, and Normal and Overloaded Conditions for residential and commercial transformers. Data creation involved merging 12 different tables. Nonparametric models for failure time data were built so as to explore the lifetime and failure rate of the transformers. By building a Cox's regression model, the important factors contributing to the failure of a transformer are also analyzed in this paper. Several risk-based models are then built to categorize transformers into High, Medium, and Low risk categories based on their loads. This categorization can help the utility companies to better manage the risks associated with transformer failures.
Read the paper (PDF).
Balamurugan Mohan, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
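For reference, the counting-process MODEL syntax in PROC PHREG accommodates left-truncated, right-censored data; the variable names below are illustrative, not the paper's:

/* entry_age is the (left-truncated) age at the start of observation,
   exit_age the age at failure or censoring; failed=0 means censored */
proc phreg data=transformers;
   class load_category;
   model (entry_age, exit_age) * failed(0) = avg_load avg_temp load_category;
run;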
Paper SAS1774-2015:
Predictive Modeling Using SAS® Visual Statistics: Beyond the Prediction
Predictions, including regressions and classifications, are the predominant focus of many statistical and machine-learning models. However, in the era of big data, a predictive modeling process contains more than just making the final predictions. For example, a large collection of data often represents a set of small, heterogeneous populations. Identification of these subgroups is therefore an important step in predictive modeling. In addition, big data data sets are often complex, exhibiting high dimensionality. Consequently, variable selection, transformation, and outlier detection are integral steps. This paper provides working examples of these critical stages using SAS® Visual Statistics, including data segmentation (supervised and unsupervised), variable transformation, outlier detection, and filtering, in addition to building the final predictive model using methodology such as linear regressions, decision trees, and logistic regressions. The illustration data was collected from 2010 to 2014, from vehicle emission testing results.
Read the paper (PDF).
Xiangxiang Meng, SAS
Jennifer Ames, SAS
Wayne Thompson, SAS
Paper 3307-2015:
Preparing Output from Statistical Procedures for Publication, Part 1: PROC REG to APA Format
Many scientific and academic journals require that statistical tables be created in a specific format, with one of the most common formats being that of the American Psychological Association (APA). The APA publishes a substantial guide book to writing and formatting papers, including an extensive section on creating tables (Nichol 2010). However, the output generated by SAS® procedures does not match this style. This paper discusses techniques to change the SAS procedure output to match the APA guidelines using SAS ODS (Output Delivery System).
Read the paper (PDF).
Vince DelGobbo, SAS
Peter Flom, Peter Flom Consulting
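A minimal sketch of the first step, capturing PROC REG output with ODS OUTPUT so it can be restyled (the APA-specific styling would be layered on afterward):

ods output ParameterEstimates=parms;

proc reg data=sashelp.class;
   model weight = height age;
run;
quit;

/* Re-present the captured table; further ODS style work would follow */
proc print data=parms noobs label;
   var Variable DF Estimate StdErr tValue Probt;
run;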
Paper 2103-2015:
Preparing Students for the Real World with SAS® Studio
A common complaint of employers is that educational institutions do not prepare students for the types of messy data and multi-faceted requirements that occur on the job. No organization has data that resembles the perfectly scrubbed data sets in the back of a statistics textbook. The objective of the Annual Report Project is to quickly bring new SAS® users to a level of competence where they can use real data to meet real business requirements. Many organizations need annual reports for stockholders, funding agencies, or donors. Or, they need annual reports at the department or division level for an internal audience. Being tapped as part of the team creating an annual report used to mean weeks of tedium, poring over columns of numbers in 8-point font in (shudder) Excel spreadsheets, but no more. No longer painful, using a few SAS procedures and functions, reporting can be easy and, dare I say, fun. All analyses are done using SAS® Studio (formerly SAS® Web Editor) of SAS OnDemand for Academics. This paper uses an example with actual data for a report prepared to comply with federal grant funding requirements as proof that, yes, it really is that simple.
Read the paper (PDF). | Watch the recording.
AnnMaria De Mars, AnnMaria De Mars
Paper 3405-2015:
Prioritizing Feeders for Investments: Performance Analysis Using Data Envelopment Analysis
This paper presents a methodology developed to define and prioritize feeders with the least satisfactory performances for continuity of energy supply, in order to obtain an efficiency ranking that supports a decision-making process regarding investments to be implemented. Data Envelopment Analysis (DEA) was the basis for the development of this methodology, in which the input-oriented model with variable returns to scale was adopted. To perform the analysis of the feeders, data from the utility geographic information system (GIS) and from the interruption control system was exported to SAS® Enterprise Guide®, where data manipulation was possible. Different continuity variables and physical-electrical parameters were consolidated for each feeder for the years 2011 to 2013. They were separated according to the geographical regions of the concession area, according to their location (urban or rural), and then grouped by physical similarity. Results showed that 56.8% of the feeders could be considered as efficient, based on the continuity of the service. Furthermore, the results enable identification of the assets with the most critical performance and their benchmarks, and the definition of preliminary goals to reach efficiency.
Read the paper (PDF).
Victor Henrique de Oliveira, Cemig
Iguatinan Monteiro, CEMIG
Paper 3247-2015:
Privacy, Transparency, and Quality Improvement in the Era of Big Data and Health Care Reform
The era of big data and health care reform is an exciting and challenging time for anyone whose work involves data security, analytics, data visualization, or health services research. This presentation examines important aspects of current approaches to quality improvement in health care based on data transparency and patient choice. We look at specific initiatives related to the Affordable Care Act (for example, the qualified entity program of section 10332 that allows the Centers for Medicare and Medicaid Services (CMS) to provide Medicare claims data to organizations for multi-payer quality measurement and reporting, the open payments program, and state-level all-payer claims databases to inform improvement and public reporting) within the context of a core issue in the era of big data: security and privacy versus transparency and openness. In addition, we examine an assumption that underlies many of these initiatives: data transparency leads to improved choices by health care consumers and increased accountability of providers. For example, recent studies of one component of data transparency, price transparency, show that, although health plans generally offer consumers an easy-to-use cost calculator tool, only about 2 percent of plan members use it. Similarly, even patients with high-deductible plans (presumably those with an increased incentive to do comparative shopping) seek prices for only about 10 percent of their services. Anyone who has worked in analytics, reporting, or data visualization recognizes the importance of understanding the intended audience, and that methodological transparency is as important as the public reporting of the output of the calculation of cost or quality metrics. Although widespread use of publicly reported health care data might not be a realistic goal, data transparency does offer a number of potential benefits: data-driven policy making, informed management of cost and use of services, as well as public health benefits through, for example, the recognition of patterns of disease prevalence and immunization use. Looking at this from a system perspective, we can distinguish five main activities: data collection, data storage, data processing, data analysis, and data reporting. Each of these activities has important components (such as database design for data storage and de-identification and aggregation for data reporting) as well as overarching requirements such as data security and quality assurance that are applicable to all activities. A recent Health Affairs article by CMS leaders noted that the big-data revolution could not have come at a better time, but it also recognizes that challenges remain. Although CMS is the largest single payer for health care in the U.S., the challenges it faces are shared by all organizations that collect, store, analyze, or report health care data. In turn, these challenges are opportunities for database developers, systems analysts, programmers, statisticians, data analysts, and those who provide the tools for public reporting to work together to design comprehensive solutions that inform evidence-based improvement efforts.
Read the paper (PDF).
Paul Gorrell, IMPAQ International
Paper SAS1761-2015:
Proven Practices for Managing the Enterprise Administrators of a SAS® Software Deployment
Sometimes you need to provide multiple administrators with the ability to manage your software. The rationale can be a need to separate roles and responsibilities (such as installer and configuration manager), changing job responsibilities, or even just covering for the primary administrator while on vacation. To meet that need, it's tempting to share the logon credentials of your SAS® installer account, but doing so can potentially compromise your security and cause a corporate audit to fail. This paper focuses on standard IT practices and utilities, explaining how to diligently manage the administration of your SAS software to help you properly ensure that access is secured and that auditability is maintained.
Read the paper (PDF). | Watch the recording.
Rob Collum, SAS
Clifford Meyers, SAS
Paper 2863-2015:
"Puck Pricing": Dynamic Hockey Ticket Price Optimization
Dynamic pricing is a real-time strategy where corporations attempt to alter prices based on varying market demand. The hospitality industry has been doing this for quite a while, altering prices significantly during the summer months or weekends when demand for rooms is at a premium. In recent years, the sports industry has started to catch on to this trend, especially within Major League Baseball (MLB). The purpose of this paper is to explore the methodology of applying this type of pricing to the hockey ticketing arena.
Read the paper (PDF).
Christopher Jones, Deloitte Consulting
Sabah Sadiq, Deloitte Consulting
Jing Zhao, Deloitte Consulting LLP
Paper 3258-2015:
Put Data in the Driver's Seat: A Primer on Data-Driven Programming Using SAS®
One of the hallmarks of a good or great SAS® program is that it requires only a minimum of upkeep. Especially for code that produces reports on a regular basis, it is preferable to minimize user and programmer input and instead have the input data drive the outputs of a program. Data-driven SAS programs are more efficient and reliable, require less hardcoding, and result in less downtime and fewer user complaints. This paper reviews three ways of building a SAS program to create regular Microsoft Excel reports; one method using hardcoded variables, another using SAS keyword macros, and the last using metadata to drive the reports.
Andrew Clapson, MD Financial Management
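A hedged sketch of the metadata-driven approach, using a control data set and CALL EXECUTE; the sheet names and the ODS ExcelXP destination are illustrative choices, not necessarily the author's:

/* A control data set lists the reports to produce */
data report_control;
   length dsname $41 sheet $31;
   input dsname $ sheet $;
   datalines;
sashelp.class Students
sashelp.cars  Vehicles
;
run;

ods tagsets.excelxp file='reports.xml';

/* Generate one PROC PRINT per control row */
data _null_;
   set report_control;
   call execute(cats('ods tagsets.excelxp options(sheet_name="', sheet, '");'));
   call execute(cats('proc print data=', dsname, '; run;'));
run;

ods tagsets.excelxp close;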
Paper 3382-2015:
Put Your Data on the Map
A bubble map can be a useful tool for identifying trends and visualizing the geographic proximity and intensity of events. This session shows how to use PROC GEOCODE and PROC GMAP to turn a data set of addresses and events into a map of the United States with scaled bubbles depicting the location and intensity of the events.
Read the paper (PDF).
Caroline Cutting, Warren Rogers Associates
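For example, ZIP-level geocoding against the SASHELP.ZIPCODE lookup might look like the sketch below (the EVENTS data set is hypothetical and is assumed to contain a ZIP variable); the geocoded X/Y values then feed the scaled bubbles drawn over the map:

proc geocode method=zip data=work.events out=work.events_geo
             lookup=sashelp.zipcode;
run;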
Q
Paper 2221-2015:
Quotes within Quotes: When Single (') and Double (") Quotes Are Not Enough
Although it does not happen every day, it is not unusual to need to place a quoted string within another quoted string. Fortunately, SAS® recognizes both single and double quote marks and either can be used within the other, which gives you the ability to have two-deep quoting. There are situations, however, where two kinds of quotes are not enough. Sometimes you need a third layer or, more commonly, you need to use a macro variable within the layers of quotes. Macro variables can be especially problematic, because they generally do not resolve when they are inside single quotes. However, this is SAS and that implies that there are several things going on at once and that there are several ways to solve these types of quoting problems. The primary goal of this presentation is to assist the programmer with solutions to the quotes-within-quotes problem with special emphasis on the presence of macro variables. The various techniques are contrasted as are the likely situations that call for these types of solutions. A secondary goal of this presentation is to help you understand how SAS works with quote marks and how it handles quoted strings. Without going into the gory details, a high-level understanding can be useful in a number of situations.
Read the paper (PDF).
Art Carpenter, California Occidental Consultants
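A small sketch of the basic behavior, with an illustrative macro variable:

%let dept = Sales;

data _null_;
   double = "Report for the &dept department";      /* macro variable resolves   */
   single = 'Report for the &dept department';      /* no resolution in singles  */
   nested = "He said, 'Run the &dept report now'";  /* quotes within quotes      */
   put double= / single= / nested=;
run;

/* Macro quoting functions mask quote marks inside macro values */
%let title = %str(Year-End %"Final%" Numbers);
%put &=title;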
R
Paper SAS1927-2015:
REST At Ease with SAS®: How to Use SAS to Get Your REST
REST is being used across the industry for designing networked applications to provide lightweight and powerful alternatives to web services such as SOAP and Web Services Description Language (WSDL). Since REST is based entirely around HTTP, SAS® provides everything you need to make REST calls and process structured and unstructured data alike. Learn how PROC HTTP and other SAS language features provide everything you need to simply and securely make use of REST.
Read the paper (PDF).
Joseph Henry, SAS
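A minimal PROC HTTP sketch of a GET request (the URL is a public test endpoint used purely for illustration):

filename resp temp;

proc http
   url="https://httpbin.org/get"
   method="GET"
   out=resp;
run;

/* Dump the raw (JSON) response to the log */
data _null_;
   infile resp;
   input;
   put _infile_;
run;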
Paper 1341-2015:
Random vs. Fixed Effects: Which Technique More Effectively Addresses Selection Bias in Observational Studies
Retrospective case-control studies are frequently used to evaluate health care programs when it is not feasible to randomly assign members to a respective cohort. Without randomization, observational studies are more susceptible to selection bias where the characteristics of the enrolled population differ from those of the entire population. When the participant sample is different from the comparison group, the measured outcomes are likely to be biased. Given this issue, this paper discusses how propensity score matching and random effects techniques can be used to reduce the impact selection bias has on observational study outcomes. All results shown are drawn from an ROI analysis using a participant (cases) versus non-participant (controls) observational study design for a fitness reimbursement program aiming to reduce health care expenditures of participating members.
Read the paper (PDF). | Download the data file (ZIP).
Jess Navratil-Strawn, Optum
Paper SAS1941-2015:
Real Time--Is That Your Final Decision?
Streaming data is becoming more and more prevalent. Everything is generating data now--social media, machine sensors, the 'Internet of Things'. And you need to decide what to do with that data right now. And 'right now' could mean 10,000 times or more per second. SAS® Event Stream Processing provides an infrastructure for capturing streaming data and processing it on the fly--including applying analytics and deciding what to do with that data. All in milliseconds. There are some basic tenets on how SAS® provides this extremely high-throughput, low-latency technology to meet whatever streaming analytics your company might want to pursue.
Read the paper (PDF). | Watch the recording.
Diane Hatcher, SAS
Jerry Baulier, SAS
Steve Sparano, SAS
Paper SAS1958-2015:
Real-Time Risk Aggregation with SAS® High-Performance Risk and SAS® Event Stream Processing Engine
Risk managers and traders know that some knowledge loses its value quickly. Unfortunately, due to the computationally intensive nature of risk, most risk managers use stale data. Knowing your positions and risk intraday can provide immense value. Imagine knowing the portfolio risk impact of a trade before you execute. This paper shows you a path to doing real-time risk analysis leveraging capabilities from SAS® Event Stream Processing Engine and SAS® High-Performance Risk. Event stream processing (ESP) offers the ability to process large amounts of data with high throughput and low latency, including streaming real-time trade data from front-office systems into a centralized risk engine. SAS High-Performance Risk enables robust, complex portfolio valuations and risk calculations quickly and accurately. In this paper, we present techniques and demonstrate concepts that enable you to more efficiently use these capabilities together. We also show techniques for analyzing SAS High-Performance data with SAS® Visual Analytics.
Read the paper (PDF).
Albert Hopping, SAS
Arvind Kulkarni, SAS
Ling Xiang, SAS
Paper 2382-2015:
Reducing the Bias: Practical Application of Propensity Score Matching in Health-Care Program Evaluation
To stay competitive in the marketplace, health-care programs must be capable of reporting the true savings to clients. This is a tall order, because most health-care programs are set up to be available to the client's entire population and thus cannot be conducted as a randomized control trial. In order to evaluate the performance of the program for the client, we use an observational study design that has inherent selection bias due to its inability to randomly assign participants. To reduce the impact of bias, we apply propensity score matching to the analysis. This technique is beneficial to health-care program evaluations because it helps reduce selection bias in the observational analysis and in turn provides a clearer view of the client's savings. This paper explores how to develop a propensity score, evaluate the use of inverse propensity weighting versus propensity matching, and determine the overall impact of the propensity score matching method on the observational study population. All results shown are drawn from a savings analysis using a participant (cases) versus non-participant (controls) observational study design for a health-care decision support program aiming to reduce emergency room visits.
Read the paper (PDF).
Amber Schmitz, Optum
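A hedged sketch of the first steps, estimating propensity scores with PROC LOGISTIC and deriving inverse-probability weights; the data set and variable names are illustrative:

/* Fit the propensity model and score every member */
proc logistic data=members;
   class gender region / param=ref;
   model participant(event='1') = age gender region er_visits_prior;
   output out=ps p=pscore;
run;

/* Inverse-probability weights for the weighting alternative */
data ps_weighted;
   set ps;
   if participant = 1 then ipw = 1 / pscore;
   else ipw = 1 / (1 - pscore);
run;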
Paper SAS1871-2015:
Regulatory Compliance Reporting Using SAS® XML Mapper
As a part of regulatory compliance requirements, banks are required to submit reports based on Microsoft Excel, as per templates supplied by the regulators. This poses several challenges, including the high complexity of templates, the fact that implementation using ODS can be cumbersome, and the difficulty in keeping up with regulatory changes and supporting dynamic report content. At the same time, you need the flexibility to customize and schedule these reports as per your business requirements. This paper discusses an approach to building these reports using SAS® XML Mapper and the Excel XML spreadsheet format. This approach provides an easy-to-use framework that can accommodate template changes from the regulators without needing to modify the code. It is implemented using SAS® technologies, providing you the flexibility to customize to your needs. This approach also provides easy maintainability.
Read the paper (PDF).
Sarita Kannarath, SAS
Phil Hanna, SAS
Amitkumar Nakrani, SAS
Nishant Sharma, SAS
Paper SAS1861-2015:
Regulatory Stress Testing--A Manageable Process with SAS®
As a consequence of the financial crisis, banks are required to stress test their balance sheet and earnings based on prescribed macroeconomic scenarios. In the US, this exercise is known as the Comprehensive Capital Analysis and Review (CCAR) or Dodd-Frank Act Stress Testing (DFAST). In order to assess capital adequacy under these stress scenarios, banks need a unified view of their projected balance sheet, incomes, and losses. In addition, the bar for these regulatory stress tests is very high regarding governance and overall infrastructure. Regulators and auditors want to ensure that the granularity and quality of data, model methodology, and assumptions reflect the complexity of the banks. This calls for close internal collaboration and information sharing across business lines, risk management, and finance. Currently, this process is managed in an ad hoc, manual fashion. Results are aggregated from various lines of business using spreadsheets and Microsoft SharePoint. Although the spreadsheet option provides flexibility, it brings ambiguity into the process and makes the process error prone and inefficient. This paper introduces a new SAS® stress testing solution that can help banks define, orchestrate and streamline the stress-testing process for easier traceability, auditability, and reproducibility. The integrated platform provides greater control, efficiency, and transparency to the CCAR process. This will enable banks to focus on more value-added analysis such as scenario exploration, sensitivity analysis, capital planning and management, and model dependencies. Lastly, the solution was designed to leverage existing in-house platforms that banks might already have in place.
Read the paper (PDF).
Wei Chen, SAS
Shannon Clark
Erik Leaver, SAS
John Pechacek
Paper 2601-2015:
Replication Techniques for Variance Approximation
Replication techniques such as the jackknife and the bootstrap have become increasingly popular in recent years, particularly within the field of complex survey data analysis. The premise of these techniques is to treat the data set as if it were the population and repeatedly sample from it in some systematic fashion. From each sample, or replicate, the estimate of interest is computed, and the variability of the estimate from the full data set is approximated by a simple function of the variability among the replicate-specific estimates. An appealing feature is that there is generally only one variance formula per method, regardless of the underlying quantity being estimated. The entire process can be efficiently implemented after appending a series of replicate weights to the analysis data set. As will be shown, the SURVEY family of SAS/STAT® procedures can be exploited to facilitate both the task of appending the replicate weights and approximating variances.
Read the paper (PDF).
Taylor Lewis, University of Maryland
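For example, PROC SURVEYMEANS can both generate and use jackknife replicate weights (the design variable names are illustrative):

proc surveymeans data=mysurvey varmethod=jackknife(outweights=jkwgts);
   strata stratum;
   cluster psu;
   weight basewgt;
   var income;
run;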
Paper 4282-2015:
Reporting Best Practices
Trina Gladwell, Bealls Inc
Paper 3421-2015:
Reports That Make Decisions: SAS® Visual Analytics
SAS® Visual Analytics provides numerous capabilities to analyze data lightning fast and make key business decisions that are critical for day-to-day operations. Depending on your organization, be it Human Resources, Sales, or Finance, the data can be easily mined by decision makers, providing information that empowers the user to make key business decisions. The right data preparation during report development is the key to success. SAS Visual Analytics provides the ability to explore the data and to make forecasts using automatic charting capabilities with a simple click-and-choose interface. The ability to load all the historical data into memory enables you to make decisions by analyzing the data patterns. The decision is within reach when the report designer uses SAS® Visual Analytics Designer functionality like alerts, display rules, ranks, comments, and others. Planning your data preparation task is critical for the success of the report. Identify the category and measure values in the source data, and convert them appropriately, based on your planned usage. SAS Visual Analytics has capabilities that help perform conversion on the fly. Creating meaningful derived variables on the go and hierarchies on the run reduces development time. Alert notifications are sent to the right decision makers by e-mail when the report objects contain data that meets certain criteria. The system monitors the data, and the report developer can specify how frequently the system checks are made and the frequency at which the notifications are sent. Display rules help in highlighting the right metrics to leadership, which helps focus the decision makers on the right metric in the data maze. For example, color coding the metrics quickly tells the report user which business problems require action. Ranking the metrics, such as top 10 or bottom 10, can help the decision makers focus on a success or on problem areas. They can drill into more details about why they stand out or fall behind. Discussing a report metric in a particular report can be done using the comments feature. Responding to other comments can lead to the right next steps for the organization. Also, data quality is always monitored when you have actionable reports, which helps to create a responsive and reliable reporting environment.
Read the paper (PDF).
Arun Sugumar, Kavi Associates
Vimal Raj Arockiasamy, Kavi Associates
Paper SAS4284-2015:
Retail 2015 - the landscape, trends, and technology
Retailers, among nearly every other consumer business, are under more pressure and competition than ever before. Today's consumer is more connected, informed, and empowered, and the pace of innovation is rapidly changing the way consumers shop. Retailers are expected to sift through and implement digital technology, make sense of their big data with analytics, and change processes and cut costs all at the same time. Today's session, "Retail 2015 - the landscape, trends, and technology," will cover the major issues retailers are facing today as well as the business and technology trends that will shape their future.
Lori Schafer
Paper 2102-2015:
Reversing the IN Operator
The IN operator within the DATA step is used for searching a specific variable for some values, either numeric or character (for example, 'if X in (2), then...'). This brief note explains how the opposite situation can be managed. That is, it explains how to search for a specific value in several variables through applying an array and the IN operator together.
Read the paper (PDF).
Can Tongur, Statistics Sweden
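A minimal sketch of the technique, with hypothetical diagnosis-code variables:

data flagged;
   set visits;                        /* hypothetical input data set      */
   array dx {*} $ dx1-dx5;            /* hypothetical character variables */
   diabetic = ('250' in dx);          /* IN searches every array element  */
run;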
Paper 3740-2015:
Risk-Adjusting Provider Performance Utilization Metrics
Pay-for-performance programs are putting increasing pressure on providers to better manage patient utilization through care coordination, with the philosophy that good preventive services and routine care can prevent the need for some high-resource services. Evaluation of provider performance frequently includes measures such as acute care events (ER and inpatient), imaging, and specialist services, yet rarely are these indicators adjusted for the underlying risk of providers' patient panel. In part, this is because standard patient risk scores are designed to predict costs, not the probability of specific service utilization. As such, Blue Cross Blue Shield of North Carolina has developed a methodology to model our members' risk of these events in an effort to ensure that providers are evaluated fairly and to prevent our providers from adverse selection practices. Our risk modeling takes into consideration members' underlying health conditions and limited demographic factors during the previous 12 month period, and employs two-part regression models using SAS® software. These risk-adjusted measures will subsequently be the basis of performance evaluation of primary care providers for our Accountable Care Organizations and medical home initiatives.
Read the paper (PDF).
Stephanie Poley, Blue Cross Blue Shield of North Carolina
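A hedged sketch of a generic two-part specification (probability of any use, then intensity among users); the variable names are illustrative, and the paper's actual covariates and distributional choices may differ:

/* Part 1: probability of any ER use in the measurement year */
proc logistic data=members;
   class gender / param=ref;
   model any_er(event='1') = age gender risk_condition_count;
run;

/* Part 2: number of visits among members with at least one visit */
proc genmod data=members(where=(any_er = 1));
   class gender;
   model er_visits = age gender risk_condition_count / dist=negbin link=log;
run;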
Paper SAS1779-2015:
Row-Level Security and SAS® Visual Analytics
Given the challenges of data security in today's business environment, how can you protect the data that is used by SAS® Visual Analytics? SAS® has implemented security features in its widely used business intelligence platform, including row-level security in SAS Visual Analytics. Row-level security specifies who can access particular rows in a LASR table. Throughout this paper, we discuss two ways of implementing row-level security for LASR tables in SAS® Visual Analytics--interactively and in batch. Both approaches link table-based permission conditions with identities that are stored in metadata.
Read the paper (PDF).
Zuzu Williams, SAS
S
Paper 3444-2015:
S-M-U (Set, Merge, and Update) Revisited
It is a safe assumption that almost every SAS® user learns how to use the SET statement not long after they're taught the concept of a DATA step. Further, it would probably be reasonable to guess that almost every one of those people covered the MERGE statement soon afterwards. Many, maybe most, also got to try the UPDATE and/or MODIFY statements eventually, as well. It would also be a safe assumption that very few people have taken the time to review the manual since they've learned about those statements. That is most unfortunate, because there are so many options available to the users that can assist them in their tasks of obtaining and combining data sets. This presentation is designed to build onto the basic understanding of SET, MERGE, and UPDATE. It assumes that the attendee or reader has a basic knowledge of those statements, and it introduces various options and usages that extend the utility of these basic commands.
Read the paper (PDF).
Andrew Kuligowski, HSN
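Two small sketches of the kinds of options the paper builds on, with hypothetical data set names:

/* IN= flags identify where each observation came from */
data matched;
   merge customers(in=inCust) orders(in=inOrd);
   by customer_id;
   if inCust and inOrd;            /* keep customers that placed orders */
run;

/* Data set options on SET limit and subset the read */
data recent;
   set transactions(keep=customer_id amount tran_date
                    where=(tran_date >= '01JAN2015'd));
run;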
Paper SAS4800-2015:
SAS Certification Overview
Join us for lunch as we discuss the benefits of being part of the elite group that is SAS Certified Professionals. The SAS Global Certification program has awarded more than 79,000 credentials to SAS users across the globe. Come listen to Terry Barham, Global Certification Manager, give an overview of the SAS Certification program, explain the benefits of becoming SAS certified and discuss exam preparation tips. This session will also include a Q&A section where you can get answers to your SAS Certification questions.
Paper SAS4283-2015:
SAS Retail Roadmap
The goal of this presentation is to provide the user group with an update on retail solution releases over the past year and the roadmap moving forward.
Saurabh Gupta, SAS
Paper SAS4642-2015:
SAS and Hadoop. 4th Annual State of the Union
SAS® 9.4 M2 brings even more progress to the interoperability between SAS and Hadoop, the industry standard for big data. This talk brings you up to date on where we are: more distributions, more data types, and more options. You now have the choice to run a stand-alone cluster or to co-mingle your SAS processing on your general-purpose clusters, managing it all with YARN. Come and learn about our new developments and get a glimpse of what we are looking at in the future.
Read the paper (PDF).
Paul Kent, SAS
Paper SAS4643-2015:
SAS in the Cloud -- Innovating Towards Easy
The real promise of the 'Cloud' is to make things easy. Easy for IT to manage. Easy for users to use. Easy for self-service deployments and access to software. SAS's approach to the Cloud is all about 'easy'. See what you can do today in the Cloud, and where we are going. What 'easy' do you want with SAS?
Diane Hatcher, SAS
Paper 3155-2015:
SAS® Formats Top Ten
SAS® formats can be used in so many different ways! Even the most basic SAS format use (modifying the way a SAS data value is displayed without changing the underlying data value) holds a variety of nifty tricks, such as nesting formats, formats that affect various style attributes, and conditional formatting. Add in picture formats, multi-label formats, using formats for data cleaning, and formats for joins and table look-ups, and we have quite a bag of tricks for the humble SAS format and PROC FORMAT, which are used to generate them. This paper describes a few very useful programming techniques that employ SAS formats. While this paper will be appropriate for the newest SAS user, it will also focus on some of the lesser-known features of formats and PROC FORMAT and so should be useful for even quite experienced users of SAS.
Read the paper (PDF).
Christianna Williams, Self-Employed
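A small runnable sketch of the format-as-lookup idea referenced in the abstract; the grouping values are illustrative only.

/* Define a format once, then reuse it for display and for a table lookup */
proc format;
   value agegrp  low-<13 = 'Child'
                 13-<16  = 'Early teen'
                 16-high = 'Late teen';
run;

proc freq data=sashelp.class;
   tables age;
   format age agegrp.;              /* display grouping without changing the data */
run;

data grouped;
   set sashelp.class;
   age_group = put(age, agegrp.);   /* format as a lookup to derive a new variable */
run;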
Paper SAS1940-2015:
SAS/STAT® 14.1: Methods for Massive, Missing, or Multifaceted Data
The latest release of SAS/STAT® software brings you powerful techniques that will make a difference in your work, whether your data are massive, missing, or somewhere in the middle. New imputation software for survey data adds to an expansive array of methods in SAS/STAT for handling missing data, as does the production version of the GEE procedure, which provides the weighted generalized estimating equation approach for longitudinal studies with dropouts. An improved quadrature method in the GLIMMIX procedure gives you accelerated performance for certain classes of models. The HPSPLIT procedure provides a rich set of methods for statistical modeling with classification and regression trees, including cross validation and graphical displays. The HPGENSELECT procedure adds support for spline effects and lasso model selection for generalized linear models. And new software implements generalized additive models by using an approach that handles large data easily. Other updates include key functionality for Bayesian analysis and pharmaceutical applications.
Read the paper (PDF).
Maura Stokes, SAS
Bob Rodriguez, SAS
Paper 3269-2015:
SASTRACE: Your Key to RDBMS Empowerment
For the many relational database products that SAS/ACCESS® supports (Oracle, Teradata, DB2, MySQL, SQL Server, Hadoop, Greenplum, PC Files, to name but a few), there is a myriad of RDBMS-specific options at your disposal--but how do you know the right options for any given situation? How much data should you transfer at a time? Which SAS® functions can be passed through to the database, and which cannot? How do you verify that your processes are running efficiently? How do you test and validate any changes? The answer lies with the feedback capabilities of the SASTRACE system option.
Read the paper (PDF).
Andrew Howell, ANJ Solutions
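A hedged sketch of typical trace settings; the libref and table are hypothetical, and the exact option string to use depends on what you want logged.

/* Turn on SAS/ACCESS tracing to see the SQL passed to the DBMS in the SAS log */
options sastrace=',,,d'        /* show DBMS SQL statements sent by the engine  */
        sastraceloc=saslog     /* write the trace to the SAS log               */
        nostsuffix;            /* suppress timestamps for easier reading       */

proc sql;
   select count(*) from mydblib.transactions;   /* mydblib = pre-assigned DBMS libref */
quit;

options sastrace=off;          /* turn tracing back off when finished */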
Paper 3290-2015:
SAS® Analytics on IBM FlashSystem Storage: Deployment Scenarios and Best Practices
SAS® Analytics enables organizations to tackle complex business problems using big data and to provide the insights needed to make critical business decisions. A well-architected enterprise storage infrastructure is needed to realize the full potential of SAS Analytics. However, as the need for big data analytics and rapid response times increases, the performance gap between server speeds and traditional hard disk drive (HDD)-based storage systems can be a significant concern. The growing performance gap can have detrimental effects, particularly when it comes to critical business applications. As a result, organizations are looking for newer, smarter, faster storage systems to accelerate business insights. IBM FlashSystem Storage systems store data in flash memory. They are designed for dramatically faster access times and support very high rates of input/output operations per second (IOPS) and throughput, with significantly lower latency than HDD-based solutions. Due to their macro-efficiency design, FlashSystem Storage systems consume less power and have significantly lower cooling and space requirements, while allowing server processors to run SAS Analytics more efficiently. Being an all-flash storage system, IBM FlashSystem provides consistent low-latency response across the IOPS range as the analytics workload scales. This paper introduces the benefits of IBM FlashSystem Storage for deploying SAS Analytics and highlights some of the deployment scenarios and architectural considerations. The paper also describes best practices and tuning guidelines for deploying SAS Analytics on FlashSystem Storage systems, to help SAS Analytics customers architect solutions with FlashSystem Storage.
Read the paper (PDF).
David Gimpl, IBM
Matt Key, IBM
Narayana Pattipati, IBM
Harry Seifert, IBM
Paper 4400-2015:
SAS® Analytics plus Warren Buffett's Wisdom Beats Berkshire Hathaway! Huh?
Individual investors face a daunting challenge. They must select a portfolio of securities composed of a manageable number of individual stocks, bonds, and/or mutual funds. An investor might initiate her portfolio selection process by choosing the number of unique securities to hold in her portfolio. This is both a practical matter and a matter of risk management. It is practical because there are tens of thousands of actively traded securities from which to choose, and it is impractical for an individual investor to own every available security. It is also a risk management measure because investible securities bring with them the potential for financial loss--to the point of becoming valueless in some cases. Increasing the number of securities in a portfolio decreases the probability that an investor will suffer drastically from corporate bankruptcy, for instance. However, holding too many securities in a portfolio can restrict performance. After deciding the number of securities to hold, the investor must determine which securities she will include in her portfolio and what proportion of available cash she will allocate to each security. Once her portfolio is constructed, the investor must manage the portfolio over time. This generally entails periodically reassessing the proportion of each security to maintain as time advances, but it may also involve the elimination of some securities and the initiation of positions in new securities. This paper introduces an analytically driven method for portfolio security selection based on minimizing the mean correlation of returns across the portfolio. It also introduces a method for determining the proportion of each security that should be maintained within the portfolio. The methods for portfolio selection and security weighting described herein work in conjunction to maximize expected portfolio return while minimizing the probability of loss over time. This involves a re-visioning of Harry Markowitz's Nobel Prize-winning concept known as the Efficient Frontier. Resultant portfolios are assessed via Monte Carlo simulation, and results are compared to the Standard & Poor's 500 Index and Warren Buffett's Berkshire Hathaway, which has a well-established history of beating the Standard & Poor's 500 Index over a long period. To those familiar with Dr. Markowitz's Modern Portfolio Theory, this paper may appear simply as a repackaging of old ideas. It is not.
Read the paper (PDF).
Bruce Bedford, Oberweis Dairy
Paper 4560-2015:
SAS® Author Round Up: Meet SAS® Press Authors
Join us for lunch with SAS® Press Authors! They will share some of their favorite SAS® tips, summarize their experiences in writing a SAS Press title, and discuss how to become a published author. Bring your questions! There will be an opportunity for you to learn more about the publishing process and forthcoming titles, to chat with the Acquisition Editors, and to win SAS Press books.
Paper 2620-2015:
SAS® Certification as a Tool for Professional Development
In today's competitive job market, both recent graduates and experienced professionals are looking for ways to set themselves apart from the crowd. SAS® certification is one way to do that. SAS Institute Inc. offers a range of exams to validate your knowledge level. In writing this paper, we have drawn upon our personal experiences, remarks shared by new and longtime SAS users, and conversations with experts at SAS. We discuss what certification is and why you might want to pursue it. Then we share practical tips you can use to prepare for an exam and do your best on exam day.
Read the paper (PDF).
Andra Northup, Advanced Analytic Designs, Inc.
Susan Slaughter, Avocet Solutions
Paper 3195-2015:
SAS® Code for Making Microsoft Excel Files Section 508 Compliant
Can you create hundreds of great-looking Microsoft Excel tables all within SAS® and make them all Section 508 compliant at the same time? This paper examines how to use the ODS TAGSETS.EXCELXP destination and other Base SAS® features to create fantastic-looking Excel worksheet tables that are all Section 508 compliant. This paper demonstrates that there is no need for any outside intervention or pre- or post-meddling with the Excel files to make them Section 508 compliant. We do it all with simple Base SAS code.
Read the paper (PDF).
Chris Boniface, U.S. Census Bureau
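The paper's specific Section 508 techniques are in the full text; as a starting point, here is a minimal hedged sketch of routing a table through the TAGSETS.EXCELXP destination. The file name, sheet name, and suboptions are illustrative, not taken from the paper.

/* Send a PROC REPORT table to an Excel XML workbook via ODS */
ods listing close;
ods tagsets.excelxp file='class_report.xml'
    options(sheet_name='Students' embedded_titles='yes');

title 'Student Listing';
proc report data=sashelp.class nowd;
   columns name sex age height weight;
run;

ods tagsets.excelxp close;
ods listing;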
Paper SAS1907-2015:
SAS® Data Management: Technology Options for Ensuring a Quality Journey Through the Data Management Process
When planning for a journey, one of the main goals is to get the best value possible. The same thing could be said for your corporate data as it journeys through the data management process. It is your goal to get the best data in the hands of decision makers in a timely fashion, with the lowest cost of ownership and the minimum number of obstacles. The SAS® Data Management suite of products provides you with many options for ensuring value throughout the data management process. The purpose of this session is to focus on how the SAS® Data Management solution can be used to ensure the delivery of quality data, in the right format, to the right people, at the right time. The journey is yours, the technology is ours--together, we can make it a fulfilling and rewarding experience.
Read the paper (PDF).
Mark Craver, SAS
Paper SAS2520-2015:
SAS® Does Data Science: How to Succeed in a Data Science Competition
First introduced in 2013, the Cloudera Data Science Challenge is a rigorous competition in which candidates must provide a solution to a real-world big data problem that surpasses a benchmark specified by some of the world's elite data scientists. The Cloudera Data Science Challenge 2 (in 2014) involved detecting anomalies in the United States Medicare insurance system. Finding anomalous patients, procedures, providers, and regions in the competition's large, complex, and intertwined data sets required industrial-strength tools for data wrangling and machine learning. This paper shows how I did it with SAS®.
Read the paper (PDF). | Download the data file (ZIP).
Patrick Hall, SAS
Paper 3102-2015:
SAS® Enterprise Guide® 5.1 and PROC GPLOT--the Power, the Glory and the PROC-tical Limitations
Customer expectations are set high when Microsoft Excel and Microsoft PowerPoint are used to design reports. Using SAS® for reporting has benefits because it generates plots directly from prepared data sets, automates the plotting process, minimizes labor-intensive manual construction using Microsoft products, and does not compromise the presentation value. SAS® Enterprise Guide® 5.1 has a powerful point-and-click method that is quick and easy to use. However, it is limited in its ability to customize the output to mimic manually created Microsoft graphics. This paper demonstrates why SAS Enterprise Guide is the perfect starting point for creating initial code for plots using SAS/GRAPH® point-and-click features and how the code can be enhanced using established PROC GPLOT, ANNOTATE, and ODS options to re-create the look and feel of plots generated by Excel and PowerPoint. Examples show the generation of plots and tables using PROC TABULATE to embed the plot data into the graphical output. Also included are tips for overcoming the ODS limitation of SAS® 9.3, which is used by SAS Enterprise Guide 5.1, to transfer the SAS graphical output to PowerPoint files. These SAS® 9.3 tips are contrasted with the new SAS® 9.4 ODS POWERPOINT statement that enables direct PowerPoint file creation from a SAS program.
Read the paper (PDF).
Christopher Klekar, Baylor Scott and White Health
Gabriela Cantu, Baylor Scott & White Health
Paper 3110-2015:
SAS® Enterprise Guide® Query Builder
The Query Builder in SAS® Enterprise Guide® is an excellent point-and-click tool that generates PROC SQL code and creates queries in SAS®. This hands-on workshop will give an overview of query options, sorting, simple and complex filtering, and joining tables. It is a great workshop for programmers and non-programmers alike.
Jennifer First-Kluge, Systems Seminar Consultants
Paper 2540-2015:
SAS® Enterprise Guide® System Design
A good system should embody the following characteristics: planned, maintainable, flexible, simple, accurate, restartable, reliable, reusable, automated, documented, efficient, modular, and validated. This is true of any system, but how to implement this in SAS® Enterprise Guide® is a unique endeavor. We provide a brief overview of these characteristics and then dive deeper into how a SAS Enterprise Guide user should approach developing both ad hoc and production systems.
Read the paper (PDF).
Steven First, Systems Seminar Consultants
Jennifer First-Kluge, Systems Seminar Consultants
Paper 3109-2015:
SAS® Enterprise Guide® for Managers and Executives
SAS® Enterprise Guide® is an extremely valuable tool for programmers, but it should also be leveraged by managers and executives to explore data, get information on the fly, and take advantage of the powerful analytics and reporting that SAS® has to offer. This can all be done without learning to program. This paper provides an overview of how SAS Enterprise Guide can improve the process of turning real-time data into real-time business decisions by managers.
Jennifer First-Kluge, Systems Seminar Consultants
Paper 2683-2015:
SAS® Enterprise Guide® or SAS® Studio: Which is Best for You?
SAS® Studio (previously known as SAS® Web Editor) was introduced in the first maintenance release of SAS® 9.4 as an alternative programming environment to SAS® Enterprise Guide® and SAS® Display Manager. SAS Studio is different in many ways from SAS Enterprise Guide and SAS Display Manager. As a programmer, I currently use SAS Enterprise Guide to help me code, test, maintain, and organize my SAS® programs. I have SAS Display Manager installed on my PC, but I still prefer to write my programs in SAS Enterprise Guide because I know it saves my log and output whenever I run a program, even if that program crashes and takes the SAS session with it! So should I now be using SAS Studio instead, and should you be using it, too?
Read the paper (PDF).
Philip Holland, Holland Numerics Limited
Paper SAS1928-2015:
SAS® Grid Manager--A Building Block Approach
Wouldn't it be great if there were a way to deploy SAS® Grid Manager in discrete building blocks that have the proper balance of compute capability, RAM, and IO throughput? Well, now you can! This paper discusses the attributes of a well-designed SAS Grid Manager deployment and why it is sometimes difficult to engineer such an environment when IT responsibilities are segregated between server administration, network administration, and storage administration. The paper presents a concrete design that will position the customer for a successful SAS Grid Manager deployment of any size and that can also scale out easily as the needs of the organization grow.
Ken Gahagan, SAS
Paper 3479-2015:
SAS® Metadata Security 101: A Primer for SAS Administrators and Users Not Familiar with SAS
The purpose behind this paper is to provide a high-level overview of how SAS® security works in a way that can be communicated to both SAS administrators and users who are not familiar with SAS. It is not uncommon to hear SAS administrators complain that their IT department and users just don't 'get' it when it comes to metadata and security. For the administrator or user not familiar with SAS, understanding how SAS interacts with the operating system, the file system, external databases, and users can be confusing. Based on a SAS® Enterprise Office Analytics installation in a Windows environment, this paper walks the reader through all of the basic metadata relationships and how they are created, thus unraveling the mystery of how the host system, external databases, and SAS work together to provide users what they need, while reliably enforcing the appropriate security.
Read the paper (PDF).
Charyn Faenza, F.N.B. Corporation
Paper SAS1921-2015:
SAS® Model Manager: An Easy Method for Deploying SAS® Analytical Models into Relational Databases and Hadoop
SAS® Model Manager provides an easy way to deploy analytical models into various relational databases or into Hadoop using either scoring functions or the SAS® Embedded Process publish methods. This paper gives a brief introduction of both the SAS Model Manager publishing functionality and the SAS® Scoring Accelerator. It describes the major differences between using scoring functions and the SAS Embedded Process publish methods to publish a model. The paper also explains how to perform in-database processing of a published model by using SAS applications as well as SQL code outside of SAS. In addition to Hadoop, SAS also supports these databases: Teradata, Oracle, Netezza, DB2, and SAP HANA. Examples are provided for publishing a model to a Teradata database and to Hadoop. After reading this paper, you should feel comfortable using a published model in your business environment.
Read the paper (PDF).
Jifa Wei, SAS
Kristen Aponte, SAS
Paper SAS1757-2015:
SAS® University Edition--Connecting SAS® Software in New Ways to Build the Next Generation of SAS Users
Are you a SAS® software user hoping to convince your organization to move to the latest product release? Has your management team asked how your organization can hire new SAS users familiar with the latest and greatest procedures and techniques? SAS® Studio and SAS® University Edition might provide the answers for you. SAS University Edition was created for teaching and learning. It's a new downloadable package of selected SAS products (Base SAS®, SAS/STAT®, SAS/IML®, SAS/ACCESS® Interface to PC Files, and SAS Studio) that runs on Windows, Linux, and Mac. With the exploding demand for analytical talent, SAS launched this package to grow the next generation of SAS users. Part of the way SAS is helping grow that next generation of users is through the interface to SAS University Edition: SAS Studio. SAS Studio is a developmental web application for SAS that you access through your web browser and--since the first maintenance release of SAS 9.4--is included in Base SAS at no additional charge. The connection between SAS University Edition and commercial SAS means that it's easier than ever to use SAS for teaching, research, and learning, from high schools to community colleges to universities and beyond. This paper describes the product, as well as the intent behind it and other programs that support it, and then talks about some successes in adopting SAS University Edition to grow the next generation of users.
Read the paper (PDF).
Polly Mitchell-Guthrie, SAS
Amy Peters, SAS
Paper SAS1952-2015:
SAS® Visual Analytics Environment Stood Up? Check! Data Automatically Loaded and Refreshed? Not Quite
Once you have a SAS® Visual Analytics environment up and running, the next important piece of the puzzle is to keep your users happy by keeping their data loaded and refreshed on a consistent basis. Loading data from the SAS Visual Analytics UI is a great first step and works well for ad hoc data exploration. But automating this data load so that users can focus on exploring the data and creating reports is where the power of SAS Visual Analytics comes into play. By using tried-and-true SAS® Data Integration Studio techniques (both out-of-the-box and custom transforms), you can easily make this happen. Proven techniques such as sweeping from a source library and stacking similar Hadoop Distributed File System (HDFS) tables into SAS® LASR™ Analytic Server for consumption by SAS Visual Analytics are presented using SAS Visual Analytics and SAS Data Integration Studio.
Read the paper (PDF).
Jason Shoffner, SAS
Brandon Kirk, SAS
Paper SAS1683-2015:
SAS® Visual Analytics for Fun and Profit: A College Football Case Study
SAS® Visual Analytics is a powerful tool for exploring, analyzing, and reporting on your data. Whether you understand your data well or are in need of additional insights, SAS Visual Analytics has the capabilities you need to discover trends, see relationships, and share the results with your information consumers. This paper presents a case study applying the capabilities of SAS Visual Analytics to NCAA Division I college football data from 2005 through 2014. It follows the process from reading raw comma-separated values (csv) files through processing that data into SAS data sets, doing data enrichment, and finally loading the data into in-memory SAS® LASR™ tables. The case study then demonstrates using SAS Visual Analytics to explore detailed play-by-play data to discover trends and relationships, as well as to analyze team tendencies to develop game-time strategies. Reports on player, team, conference, and game statistics can be used for fun (by fans) and for profit (by coaches, agents and sportscasters). Finally, the paper illustrates how all of these capabilities can be delivered via the web or to a mobile device--anywhere--even in the stands at the stadium. Whether you are using SAS Visual Analytics to study college football data or to tackle a complex problem in the financial, insurance, or manufacturing industry, SAS Visual Analytics provides the power and flexibility to score a big win in your organization.
Read the paper (PDF).
John Davis, SAS
Paper 3470-2015:
SAS® Visual Analytics: The Value in Leveraging Your Preexisting Data Sets
As a risk management unit, our team has invested countless hours in developing the processes and infrastructure necessary to produce large, multipurpose analytical base tables. These tables serve as the source for our key reporting and loss provisioning activities, the output of which (reports and GL bookings) is disseminated throughout the corporation. Invariably, questions arise and further insight is desired. Traditionally, any inquiries were returned to the original analyst for further investigation. But what if there was a way for the less technical user base to gain insights independently? Now there is, with SAS® Visual Analytics. SAS Visual Analytics is often thought of as a big data tool, and while it is certainly capable in this space, its usefulness in leveraging the value in your existing data sets should not be overlooked. By using these tried-and-true analytical base tables, you are guaranteed to achieve one version of the truth, since traditional reports match perfectly to the data being explored. SAS Visual Analytics enables your organization to share these proven data assets with an entirely new population of data consumers--people with less traditional data skills but with questions that need to be answered. Finally, all this is achieved without any additional data preparation or testing effort. This paper explores our experience with SAS Visual Analytics and the benefits realized.
Read the paper (PDF).
Shaun Kaufmann, Farm Credit Canada
Paper 3510-2015:
SAS® Visual Analytics: Emerging Trend in Institutional Research
Institutional research and effectiveness offices are often the primary beneficiaries of data warehouse (DW) technologies. However, at many institutions, the need for a data warehouse to support growing accountability, decision support, and institutional effectiveness requirements remains unfulfilled, in part because of growing data volumes and the prohibitive cost of data warehouses built by university IT departments. In recent years, many institutional research offices have been asked to take a leadership role in building the DW, or to partner with the campus IT department to improve the efficiency and effectiveness of DW development. Within this context, the Office of Institutional Research and Effectiveness at a large public research university in the Northeast was entrusted with the responsibility of building the new campus data warehouse for growing needs such as resource allocation, competitive positioning, new program development in emerging STEM disciplines, and accountability reporting. These requirements necessitated the deployment of state-of-the-art analytical decision support applications, such as SAS® Visual Analytics (reporting and analysis) and SAS® Visual Statistics (predictive modeling), in a disparate data environment that includes PeopleSoft (student), Kuali (finance), Genesys (human resources), and a homegrown sponsored-funding database. This presentation focuses on the efforts of institutional research and effectiveness offices in developing decision support applications using SAS® enterprise business intelligence and analytical solutions. With users ranging from nontechnical staff to advanced analysts, greater efficiency lies in the ability to get faster and more elegant reporting from those huge stores of data and to share the resulting discoveries across departments. Most of the reporting applications were developed based on the needs of IPEDS, CUPA, the Common Data Set, U.S. News & World Report, graduation and retention, and faculty activity, and were deployed through an online web-based portal. Participants will learn how the university quickly analyzes institutional data through an easy-to-use, drag-and-drop, web-based application. This presentation demonstrates how to use SAS Visual Analytics to quickly design reports that are attractive, interactive, and meaningful, and then distribute those reports via the web or through SAS® Mobile BI on an iPad® or tablet.
Read the paper (PDF).
Sivakumar Jaganathan, University of Connecticut
Thulasi Kumar Raghuraman, University of Connecticut
Paper 2786-2015:
SAS® Visual Analytics: Supercharging Data Reporting and Analytics
Data visualization can be like a GPS directing us to where in the sea of data we should spend our analytical efforts. In today's big data world, many businesses are still challenged to quickly and accurately distill insights and solutions from ever-expanding information streams. Wells Fargo CEO John Stumpf challenges us with the following: We all work for the customer. Our customers say to us, 'Know me, understand me, appreciate me and reward me.' Everything we need to know about a customer must be available easily, accurately, and securely, as fast as the best Internet search engine. For the Wells Fargo Credit Risk department, we have been focused on delivering more timely, accurate, reliable, and actionable information and analytics to help answer questions posed by internal and external stakeholders. Our group has to measure, analyze, and provide proactive recommendations to support and direct credit policy and strategic business changes, and we were challenged by a high volume of information coming from disparate data sources. This session focuses on how we evaluated potential solutions and created a new go-forward vision using a world-class visual analytics platform with strong data governance to replace manually intensive processes. As a result of this work, our group is on its way to proactively anticipating problems and delivering more dynamic reports.
Read the paper (PDF).
Ryan Marcum, Wells Fargo Home Mortgage
Paper SAS4083-2015:
SAS® Workshop: Data Mining
This workshop provides hands-on experience using SAS® Enterprise Miner. Workshop participants will learn to: open a project, create and explore a data source, build and compare models, and produce and examine score code that can be used for deployment.
Read the paper (PDF).
Chip Wells, SAS
Paper SAS4082-2015:
SAS® Workshop: Forecasting
This workshop provides hands-on experience using SAS® Forecast Server. Workshop participants will learn to: create a project with a hierarchy, generate multiple forecasts automatically, evaluate forecast accuracy, and build a custom model.
Read the paper (PDF).
Catherine Truxillo, SAS
George Fernandez, SAS
Terry Woodfield, SAS
Paper SAS4280-2015:
SAS® Workshop: SAS Data Loader for Hadoop
This workshop provides hands-on experience with SAS® Data Loader for Hadoop. Workshop participants will configure SAS Data Loader for Hadoop and use various directives inside SAS Data Loader for Hadoop to interact with data in the Hadoop cluster.
Read the paper (PDF).
Kari Richardson, SAS
Paper SAS4120-2015:
SAS® Workshop: SAS® Visual Analytics
This workshop provides hands-on experience with SAS® Visual Analytics. Workshop participants will explore data with SAS® Visual Analytics Explorer and design reports with SAS® Visual Analytics Designer.
Read the paper (PDF).
Nicole Ball, SAS
Paper SAS4081-2015:
SAS® Workshop: SAS® Visual Statistics 7.1
This workshop provides hands-on experience with SAS® Visual Statistics. Workshop participants will learn to: move between the Visual Analytics Explorer interface and Visual Statistics, fit automatic statistical models, create exploratory statistical analysis, compare models using a variety of metrics, and create score code.
Read the paper (PDF).
Catherine Truxillo, SAS
Xiangxiang Meng, SAS
Mike Jenista, SAS
Paper SAS4080-2015:
SAS® Workshop: Statistical Analysis with SAS® University Edition and SAS® Studio
This workshop provides hands-on experience performing statistical analysis with SAS University Edition and SAS Studio. Workshop participants will learn to: install and set up the software, perform basic statistical analyses using tasks, connect folders to SAS Studio for data access and results storage, invoke code snippets to import CSV data into SAS, and create a code snippet.
Read the paper (PDF).
Danny Modlin, SAS
Paper SAS4084-2015:
SAS® Workshop: Text Analytics
This workshop provides hands-on experience using SAS® Text Miner. Workshop participants will learn to: read a collection of text documents and convert them for use by SAS Text Miner using the Text Import node, use the simple query language supported by the Text Filter node to extract information from a collection of documents, use the Text Topic node to identify the dominant themes and concepts in a collection of documents, and use the Text Rule Builder node to classify documents having pre-assigned categories.
Read the paper (PDF).
Terry Woodfield, SAS
Paper SAS1856-2015:
SAS® and SAP Business Warehouse on SAP HANA--What's in the Handshake?
Is your company using or considering using SAP Business Warehouse (BW) powered by SAP HANA? SAS® provides various levels of integration with SAP BW in an SAP HANA environment. This integration enables you to not only access SAP BW components from SAS, but to also push portions of SAS analysis directly into SAP HANA, accelerating predictive modeling and data mining operations. This paper explains the SAS toolset for different integration scenarios, highlights the newest technologies contributing to integration, and walks you through examples of using SAS with SAP BW on SAP HANA. The paper is targeted at SAS and SAP developers and architects interested in building a productive analytical environment with the help of the latest SAS and SAP collaborative advancements.
Read the paper (PDF).
Tatyana Petrova, SAS
Paper 2984-2015:
SAS® for Six Sigma--An Introduction
Six Sigma is a business management strategy that seeks to improve the quality of process outputs by identifying and removing the causes of defects (errors) and minimizing variability in manufacturing and business processes. Each Six Sigma project carried out within an organization follows a defined sequence of steps and has quantified financial targets. All Six Sigma project methodologies include an extensive analysis phase in which SAS® software can be applied. JMP® software is widely used for Six Sigma projects. However, this paper demonstrates how Base SAS® (and a bit of SAS/GRAPH® and SAS/STAT® software) can be used to address a wide variety of Six Sigma analysis tasks. The reader is assumed to have a basic knowledge of Six Sigma methodology. Therefore, the focus of the paper is the use of SAS code to produce outputs for analysis.
Read the paper (PDF).
Dan Bretheim, Towers Watson
Paper SAS1947-2015:
SAS® vApps 101
SAS® vApps (virtual applications) are a SAS® construct designed to logically and physically encapsulate a single- or multi-tier software solution into a virtual machine (or sometimes into multiple virtual machines). In this paper, we examine the conceptual, logical, and physical design perspectives that comprise a vApp, giving you a high-level understanding of both the technical and business benefits of vApps and the design decisions that go into envisioning and constructing SAS vApps. These are described in the context of the user roles involved in the life cycle of a vApp, and how those roles interact with a vApp at various points along its continuum.
Read the paper (PDF).
Gary Kohan, SAS
Danny Hamrick, SAS
Connie Robison, SAS
Rob Stephens, SAS
Peter Villiers, SAS
Paper SAS1541-2015:
SSL Configuration Best Practices for SAS® Visual Analytics 7.1 Web Applications and SAS® LASR™ Authorization Service
One of the challenges in Secure Sockets Layer (SSL) configuration for any web environment is SSL certificate management on the client and server sides. The SSL overview covers the structure of the x.509 certificate and the SSL handshake process for the client and server components. There are three distinctive SSL client/server combinations within the SAS® Visual Analytics 7.1 web application configuration. The most common one is the browser accessing the web application. The second one is an internal SAS® web application accessing another SAS web application. The third one is a SAS Workspace Server executing a PROC or LIBNAME statement that accesses the SAS® LASR™ Authorization Service web application. Each SSL client/server scenario in the configuration is explained in terms of the SSL handshake and certificate arrangement. Server identity certificate generation using Microsoft Active Directory Certificate Services (AD CS) for an enterprise-level organization is showcased. The certificates, in the proper format, need to be supplied to the SAS® Deployment Wizard during the configuration process. The prerequisites and configuration steps are shown with examples.
Read the paper (PDF).
Heesun Park, SAS
Jerome Hughes, SAS
Paper 3240-2015:
Sampling Financial Records Using the SURVEYSELECT Procedure
This paper presents an application of the SURVEYSELECT procedure. The objective is to draw a systematic random sample from financial data for review. Topics covered in this paper include a brief review of systematic sampling, variable definitions, serpentine sorting, and an interpretation of the output.
Read the paper (PDF). | Download the data file (ZIP).
Roger L Goodwin, US Government Printing Office
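A minimal sketch of the kind of selection the paper describes; the data set, control variables, sampling rate, and seed are hypothetical.

/* Draw a systematic random sample after serpentine-sorting on control variables */
proc surveyselect data=financial_records
                  method=sys          /* systematic selection                 */
                  samprate=0.05       /* select roughly 5% of the records     */
                  seed=20150426       /* fix the seed for reproducibility     */
                  sort=serp           /* serpentine sort of control variables */
                  out=review_sample;
   control region account_type;       /* hypothetical control variables       */
run;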
Paper SAS1808-2015:
Sankey Diagrams in SAS® Visual Analytics
Before the Internet era, you might not have come across many Sankey diagrams. These diagrams, which contain nodes and links (paths) that cross, intertwine, and have different widths, were named after Captain Sankey. He first created this type of diagram to visualize steam engine efficiency. Sankey diagrams used to have very specialized applications such as mapping out energy, gas, heat, or water distribution and flow, or cost budget flow. These days, it's become very common to display the flow of web traffic or customer actions and reactions through Sankey diagrams as well. Sankey diagrams in SAS® Visual Analytics easily enable users to create meaningful visualizations that represent the flow of data from one event or value to another. In this paper, we take a look at the components that make up a Sankey diagram: 1. Nodes; 2. Links; 3. Drop-off links; 4. A transaction. In addition, we look at a practical example of how Sankey diagrams can be used to evaluate web traffic and influence the design of a website. We use SAS Visual Analytics to demonstrate the best way to build a Sankey diagram.
Read the paper (PDF).
Varsha Chawla, SAS
Renato Luppi, SAS
Paper SAS1844-2015:
Securing Hadoop Clusters while Still Retaining Your Sanity
The Hadoop ecosystem is vast, and there's a lot of conflicting information available about how to best secure any given implementation. It's also difficult to fix any mistakes made early on once an instance is put into production. In this paper, we demonstrate the currently accepted best practices for securing and Kerberizing Hadoop clusters in a vendor-agnostic way, review some of the not-so-obvious pitfalls one could encounter during the process, and delve into some of the theory behind why things are the way they are.
Evan Kinney, SAS
Paper 2687-2015:
Selection and Transformation of Continuous Predictors for Logistic Regression
This paper discusses the selection and transformation of continuous predictor variables for the fitting of binary logistic models. The paper has two parts: (1) A procedure and associated SAS® macro are presented that can screen hundreds of predictor variables and 10 transformations of these variables to determine their predictive power for a logistic regression. The SAS macro passes the training data set twice to prepare the transformations and one more time through PROC TTEST. (2) The FSP (function selection procedure) and a SAS implementation of FSP are discussed. The FSP tests all transformations from among a class of FSP transformations and finds the one with maximum likelihood when fitting the binary target. In a 2008 book, Patrick Royston and Willi Sauerbrei popularized the FSP.
Read the paper (PDF).
Bruce Lund, Marketing Associates, LLC
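A minimal, hand-rolled sketch of the screening idea only--this is neither the paper's macro nor the FSP: derive a few candidate transformations and compare their separation of the binary target with PROC TTEST. Data set and variable names are hypothetical.

/* Create candidate transformations of a predictor, then compare their
   separation of the binary target with two-sample t tests */
data train_x;
   set train;                          /* hypothetical training data          */
   x_log  = log(max(x, 1e-8));         /* guard against log(0)                */
   x_sqrt = sqrt(max(x, 0));
   x_sq   = x*x;
run;

proc ttest data=train_x;
   class target;                       /* binary target: 0/1                  */
   var x x_log x_sqrt x_sq;            /* larger |t| suggests stronger signal */
run;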
Paper 3316-2015:
Sending Text Messages to your Phone via the DATA Step
Text messages (SMS) are a convenient way to receive notifications away from your computer screen. In SAS®, text messages can be sent to mobile phones via the DATA step. This paper briefly describes several methods for sending text messages from SAS and explores possible applications.
Read the paper (PDF).
Matthew Slaughter, Coalition for Compassionate Care of California
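One common approach, offered here as a hedged sketch rather than the paper's exact method, routes a message through a carrier's email-to-SMS gateway using the FILENAME EMAIL access method; the SMTP host and gateway address below are placeholders.

/* Send a short text message through a carrier's email-to-SMS gateway */
options emailsys=smtp emailhost='smtp.example.com';    /* hypothetical SMTP host */

filename mytext email to='5551234567@txt.example.com'  /* hypothetical gateway   */
                      subject='SAS job status';

data _null_;
   file mytext;
   put 'Nightly ETL job finished at ' "%sysfunc(datetime(), datetime20.)";
run;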
Paper SAS1388-2015:
Sensing Demand Signals and Shaping Future Demand Using Multi-tiered Causal Analysis
The two primary objectives of multi-tiered causal analysis (MTCA) are to support and evaluate business strategies based on the effectiveness of marketing actions in both a competitive and holistic environment. By tying the performance of a brand, product, or SKU at retail to internal replenishment shipments at a point in time, the outcome of making a change to the marketing mix (demand) can be simulated and evaluated to determine the full impact on supply (shipments). The key benefit of MTCA is that it captures the entire supply chain by focusing on marketing strategies to shape future demand and to link them, using a holistic framework, to shipments (supply). These relationships are what truly define the marketplace and all marketing elements within the supply chain.
Read the paper (PDF).
Charlie Chase, SAS
Paper SAS1661-2015:
Show Me the Money! Text Analytics for Decision-Making in Government Spending
Understanding organizational trends in spending can help overseeing government agencies make appropriate modifications in spending to best serve the organization and the citizenry. However, given millions of line items for organizations annually, including free-form text, it is unrealistic for these overseeing agencies to succeed by using only a manual approach to this textual data. Using a publicly available data set, this paper explores how business users can apply text analytics using SAS® Contextual Analysis to assess trends in spending for particular agencies, apply subject matter expertise to refine these trends into a taxonomy, and ultimately, categorize the spending for organizations in a flexible, user-friendly manner. SAS® Visual Analytics enables dynamic exploration, including modeling results from SAS® Visual Statistics, in order to assess areas of potentially extraneous spending, providing actionable information to the decision makers.
Read the paper (PDF).
Tom Sabo, SAS
Paper 3309-2015:
Snapshot SNAFU: Preventative Measures to Safeguard Deliveries
Little did you know that your last delivery ran on incomplete data. To make matters worse, the client realized the issue first. Sounds like a horror story, no? A few preventative measures can go a long way toward ensuring that your data are up-to-date and progressing normally. At the data set level, metadata comparisons between the current and previous data cuts help identify observation and variable discrepancies. Comparisons also uncover attribute differences at the variable level. At the subject level, they identify missing subjects. By compiling these comparison results into a comprehensive scheduled e-mail, a data facilitator need only skim the report to confirm that the data are good to go--or in need of some corrective action. This paper introduces a suite of checks contained in a macro that compares data cuts at the data set, variable, and subject levels and produces an e-mail report. Wide use of this macro will help all SAS® users create better deliveries while avoiding rework.
Read the paper (PDF). | Download the data file (ZIP).
Spencer Childress, Rho, Inc.
Alexandra Buck, Rho, Inc.
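A minimal sketch of the underlying idea, not the authors' macro: compare the metadata of the current and previous cuts with PROC CONTENTS and PROC COMPARE. The librefs and data set name are hypothetical.

/* Compare the current data cut against the previous one at the metadata level */
proc contents data=curr.ae out=meta_curr(keep=name type length) noprint; run;
proc contents data=prev.ae out=meta_prev(keep=name type length) noprint; run;

proc sort data=meta_curr; by name; run;
proc sort data=meta_prev; by name; run;

proc compare base=meta_prev compare=meta_curr listall;
   id name;                   /* flags added, dropped, or re-typed variables */
run;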
Paper SAS1972-2015:
Social Media and Open Data Integration through SAS® Visual Analytics and SAS® Text Analytics for Public Health Surveillance
A leading killer in the United States is smoking. Moreover, over 8.6 million Americans live with a serious illness caused by smoking or secondhand smoke. Despite this, over 46.6 million U.S. adults smoke tobacco, cigars, and pipes. The key analytic question in this paper is: how would e-cigarettes affect this public health situation? Can monitoring public opinions of e-cigarettes using SAS® Text Analytics and SAS® Visual Analytics help provide insight into the potential dangers of these new products? Are e-cigarettes an example of Big Tobacco up to its old tricks or, in fact, a cessation product? The research in this paper was conducted on thousands of tweets from April to August 2014. It includes API sources beyond Twitter--for example, indicators from the Health Indicators Warehouse (HIW) of the Centers for Disease Control and Prevention (CDC)--that were used to enrich Twitter data in order to implement a surveillance system developed by SAS® for the CDC. The analysis is especially important to the Office of Smoking and Health (OSH) at the CDC, which is responsible for tobacco control initiatives that help states promote cessation and prevent initiation in young people. To help the CDC succeed with these initiatives, the surveillance system also: 1) automates the acquisition of data, especially tweets; and 2) applies text analytics to categorize these tweets using a taxonomy that provides the CDC with insights into a variety of relevant subjects. Twitter text data can help the CDC look at the public response to the use of e-cigarettes and examine general discussions regarding smoking and public health, as well as potential controversies (involving tobacco exposure to children, increasing government regulations, and so on). SAS® Content Categorization helps health care analysts review large volumes of unstructured data by categorizing tweets in order to monitor and follow what people are saying and why they are saying it. Ultimately, it is a solution intended to help the CDC monitor the public's perception of the dangers of smoking and e-cigarettes; in addition, it can identify areas where OSH can focus its attention in order to fulfill its mission and track the success of CDC health initiatives.
Read the paper (PDF).
Manuel Figallo, SAS
Emily McRae, SAS
Paper 2264-2015:
Some of the Most Useful SAS® Functions
SAS® functions provide amazing power to your DATA step programming. Specific functions are essential--they save you from writing volumes of unnecessary code. This presentation covers some of the most useful SAS functions. A few might be new to you and they can all change how you program and approach common programming tasks. The majority of these functions work with character data. There are functions that search for strings, others that find and replace strings, and some that join strings together. Furthermore, certain functions can measure the spelling distance between two strings (useful for fuzzy matching). Some of the newest and most incredible functions are not functions at all--they are call routines. Did you know that you can sort values within an observation? Did you know that not only can you identify the largest or smallest value in a list of variables, but you can identify the second or third or nth largest or smallest value? A knowledge of these functions will make you a better SAS programmer.
Read the paper (PDF).
Ron Cody, Camp Verde Consulting
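A few functions and call routines of the kind the paper covers, shown in a small runnable sketch; the values are illustrative only.

data _null_;
   s1 = 'Jonathan'; s2 = 'Johnathan';
   dist = compged(s1, s2);                    /* spelling distance for fuzzy matching */

   str = 'error in step 3';
   fixed = tranwrd(str, 'error', 'warning');  /* find and replace a substring         */

   array x{5} (7 2 9 4 1);
   call sortn(of x{*});                       /* sort values within the observation   */
   second_largest = largest(2, of x{*});      /* nth largest value in a variable list */

   put dist= fixed= second_largest=;
run;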
Paper SAS1890-2015:
Someone Changed My SAS® Visual Analytics Report! How an Automated Version Control Process Can Rescue Your Report and Save Your Sanity
Your enterprise SAS® Visual Analytics implementation is on its way to being adopted throughout your organization, unleashing the production of critical business content by business analysts, data scientists, and decision makers from many business units. This content is relied upon to inform decisions and provide insight into the results of those decisions. With the development of SAS Visual Analytics content decentralized into the hands of business users, the use of automated version control is essential to providing protection and recovery in the event of inadvertent changes to that content. Re-creation of complex report objects accidentally modified by a business user is time-consuming and can be eliminated by maintaining a version control repository of report (and other) objects created in SAS Visual Analytics. This paper walks through the steps for implementing an automated process for version control using SAS®. This process can be applied to all types of metadata objects used in multiple SAS application development and analysis environments, such as reports and explorations from SAS Visual Analytics, and jobs, tables, and libraries from SAS® Data Integration Studio. Basic concepts for the process, as well as specific techniques used for our implementation are included. So eliminate the risk of content loss for your business users and the burden of manual version control for your applications developers. Your IT shop will enjoy time savings and greater reliability.
Read the paper (PDF).
Jerry Hosking, SAS
Paper 3496-2015:
Something Old, Something New... Flexible Reporting with DATA Step-based Tools
The report looks simple enough--a bar chart and a table, like something created with the GCHART and REPORT procedures. But there are some twists to the reporting requirements that make those procedures not quite flexible enough. The solution was to mix 'old' and 'new' DATA step-based techniques to solve the problem: Annotate data sets are used to create the bar chart, and the Report Writing Interface (RWI) is used to create the table. Without a whole lot of additional code, an extreme amount of flexibility is gained.
Read the paper (PDF).
Pete Lund, Looking Glass Analytics
Paper 1978-2015:
Something for Nothing? Adding Group Descriptive Statistics Using SAS® PROC SQL Subqueries II
Can you actually get something for nothing? With the SAS® PROC SQL subquery and remerging features, yes, you can. When working with categorical variables, you often need to add group descriptive statistics such as group counts and minimum and maximum values for further BY-group processing. Instead of first creating the group count and minimum or maximum values and then merging the summarized data set to the original data set, why not take advantage of PROC SQL to complete two steps in one? With the PROC SQL subquery and summary functions by the group variable, you can easily remerge the new group descriptive statistics with the original data set. Now with a few DATA step enhancements, you too can include percent calculations.
Read the paper (PDF).
Sunil Gupta, Gupta Programming
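A minimal sketch of the remerge idea; the table and column names are hypothetical. Mixing a detail column with grouped summary functions makes PROC SQL remerge the group statistics onto every row.

/* Attach group-level statistics to every detail row in one step */
proc sql;
   create table with_stats as
   select customer_id,
          purchase_amt,
          count(*)          as grp_n,
          min(purchase_amt) as grp_min,
          max(purchase_amt) as grp_max,
          purchase_amt / sum(purchase_amt) as pct_of_group format=percent8.1
   from   purchases
   group by customer_id;      /* PROC SQL remerges the summaries onto the detail rows */
quit;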
Paper 3153-2015:
Sponsored Free Wi-Fi Using Mobile Marketing and Big Data Analytics
The era of mass marketing is over. Welcome to the new age of relevant marketing, where whispering matters far more than shouting. At ZapFi, using the combination of sponsored free Wi-Fi and real-time consumer analytics, we help businesses to better understand who their customers are. This gives businesses the opportunity to send highly relevant marketing messages based on the profile and the location of the customer. It also leads to new ways to build deeper and more intimate, one-on-one relationships between the business and the customer. During this presentation, ZapFi will use a few real-world examples to demonstrate that the future of mobile marketing is much more about data and far less about advertising.
Read the paper (PDF).
Gery Pollet, ZapFi
Paper 3209-2015:
Standardizing the Standardization Process
A prevalent problem surrounding Extract, Transform, and Load (ETL) development is the ability to apply consistent logic and manipulation of source data when migrating to target data structures. Certain inconsistencies that add a layer of complexity include, but are not limited to, naming conventions and data types associated with multiple sources, numerous solutions applied by an array of developers, and multiple points of updating. In this paper, we examine the evolution of implementing a best practices solution during the process of data delivery, with respect to standardizing data. The solution begins with injecting the transformations of the data directly into the code at the standardized layer via Base SAS® or SAS® Enterprise Guide®. A more robust method that we explore is to apply these transformations with SAS® macros. This provides the capability to apply these changes in a consistent manner across multiple sources. We further explore this solution by implementing the macros within SAS® Data Integration Studio processes on the DataFlux® Data Management Platform. We consider these issues within the financial industry, but the proposed solution can be applied across multiple industries.
Read the paper (PDF).
Avery Long, Financial Risk Group
Frank Ferriola, Financial Risk Group
Paper 3274-2015:
Statistical Analysis of Publicly Released Survey Data with SAS/STAT® Software SURVEY Procedures
Several U.S. Federal agencies conduct national surveys to monitor health status of residents. Many of these agencies release their survey data to the public. Investigators might be able to address their research objectives by conducting secondary statistical analyses with these available data sources. This paper describes the steps in using the SAS SURVEY procedures to analyze publicly released data from surveys that use probability sampling to make statistical inference to a carefully defined population of elements (the target population).
Read the paper (PDF). | Watch the recording.
Donna Brogan, Emory University, Atlanta, GA
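As a minimal sketch of the kind of analysis described (not code from the paper), the design variables released with a public-use file are declared to a SURVEY procedure; all data set and variable names below are hypothetical and differ by survey.

/* Estimate a population mean from complex survey data using the released design variables */
proc surveymeans data=pufdata mean stderr;
   strata  vstrat;          /* stratification variable from the design    */
   cluster vpsu;            /* primary sampling unit (cluster) identifier */
   weight  samplewt;        /* final analysis weight                      */
   var     bmi;             /* analysis variable of interest              */
run;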
Paper 3150-2015:
Statistical Evaluation of the Doughnut Clustering Method for Product Affinity Segmentation
Product affinity segmentation is a powerful technique for marketers and sales professionals to gain a good understanding of customers' needs, preferences, and purchase behavior. Performing product affinity segmentation is quite challenging in practice because product-level data usually has high skewness, high kurtosis, and a large percentage of zero values. The Doughnut Clustering method has been shown to be effective on real data, and was presented at SAS® Global Forum 2013 in the paper titled 'Product Affinity Segmentation Using the Doughnut Clustering Approach.' However, the Doughnut Clustering method is not a panacea for the product affinity segmentation problem. There is a clear need for a comprehensive evaluation of this method in order to develop generic guidelines for practitioners about when to apply it. In this paper, we meet that need by evaluating the Doughnut Clustering method on simulated data with different levels of skewness, kurtosis, and percentage of zero values. We developed a five-step approach based on Fleishman's power method to generate synthetic data with prescribed parameters. Subsequently, we designed and conducted a set of experiments to run the Doughnut Clustering method, as well as the traditional K-means method as a benchmark, on the simulated data. We draw conclusions on the performance of the Doughnut Clustering method by comparing the clustering validity metric (the ratio of between-cluster variance to within-cluster variance), as well as the relative proportion of cluster sizes, against those of K-means.
Read the paper (PDF). | Download the data file (ZIP).
Darius Baer, SAS
Goutam Chakraborty, Oklahoma State University
Paper SAS1864-2015:
Statistics for Gamers--Using SAS® Visual Analytics and SAS® Visual Statistics to Analyze World of Warcraft Logs
Video games used to be child's play. Today, millions of gamers of all ages kill countless in-game monsters and villains every day. Gaming is big business, and the data it generates is even bigger. Massive multi-player online games like World of Warcraft by Blizzard Entertainment not only generate data that Blizzard Entertainment can use to monitor users and their environments, but they can also be set up to log player data and combat logs client-side. Many users spend time analyzing their playing 'rotations' and use the information to adjust their playing style to deal more damage or, more appropriately, to heal themselves and other players. This paper explores World of Warcraft logs by using SAS® Visual Analytics and applies statistical techniques by using SAS® Visual Statistics to discover trends.
Mary Osborne, SAS
Adam Maness
Paper SAS1880-2015:
Staying Relevant in a Competitive World: Using the SAS® Output Delivery System to Enhance, Customize, and Render Reports
Technology is always changing. To succeed in this ever-evolving landscape, organizations must embrace the change and look for ways to use it to their advantage. Even standard business tasks such as creating reports are affected by the rapid pace of technology. Reports are key to organizations and their customers. Therefore, it is imperative that organizations employ current technology to provide data in customized and meaningful reports across a variety of media. The SAS® Output Delivery System (ODS) gives you that edge by providing tools that enable you to package, present, and deliver report data in more meaningful ways, across the most popular desktop and mobile devices. To begin, the paper illustrates how to modify styles in your reports using the ODS CSS style engine, which incorporates the use of cascading style sheets (CSS) and the ODS document object model (DOM). You also learn how you can use SAS ODS to customize and generate reports in the body of e-mail messages. Then the paper discusses methods for enhancing reports and rendering them in desktop and mobile browsers by using the HTML and HTML5 ODS destinations. To conclude, the paper demonstrates the use of selected SAS ODS destinations and features in practical, real-world applications.
Read the paper (PDF).
Chevell Parker, SAS
Paper SAS1789-2015:
Step into the Cloud: Ways to Connect to Amazon Redshift with SAS/ACCESS®
Every day, companies all over the world are moving their data into the Cloud. While there are many options available, much of this data will wind up in Amazon Redshift. As a SAS® user, you are probably wondering, 'What is the best way to access this data using SAS?' This paper discusses the many ways that you can use SAS/ACCESS® to get to Amazon Redshift. We compare and contrast the various approaches and help you decide which is best for you. Topics discussed include building a connection, choosing appropriate data types, and using SQL functions.
Read the paper (PDF).
James (Ke) Wang, SAS
Salman Maher, SAS
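As a hedged illustration of one possible connection route (not necessarily the approach the paper recommends), an ODBC-based LIBNAME connection might look like the following; the DSN, credentials, schema, and table names are hypothetical.

/* Connect to an Amazon Redshift cluster through an ODBC data source name (DSN) */
libname rs odbc datasrc='redshift_dsn' user=myuser password='mypassword' schema=public;

proc sql;
   /* Aggregation is passed to Redshift where possible, reducing data movement */
   select region, count(*) as n_orders
   from   rs.orders
   group  by region;
quit;

libname rs clear;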
Paper 1565-2015:
Strategies for Error Handling and Program Control: Concepts
SAS® provides a complex ecosystem with multiple tools and products that run in a variety of environments and modes. SAS provides numerous error-handling and program control statements, options, and features. These features can function differently according to the run environment, and many of them have constraints and limitations. In this presentation, we review a number of potential error-handling and program control strategies that can be employed, along with some of the inherent limitations of each. The bottom line is that there is no single strategy that will work in all environments, all products, and all run modes. Instead, programmers need to consider the underlying program requirements and choose the optimal strategy for their situation.
Read the paper (PDF). | Download the data file (ZIP).
Thomas Billings, MUFG Union Bank, N.A.
Paper SAS1833-2015:
Strengthening Diverse Retail Business Processes with Forecasting: Practical Application of Forecasting Across the Retail Enterprise
In today's omni-channel world, consumers expect retailers to deliver the product they want, where they want it, when they want it, at a price they accept. A major challenge many retailers face in delighting their customers is successfully predicting consumer demand. Business decisions across the enterprise are affected by these demand estimates. Forecasts used to inform high-level strategic planning, merchandising decisions (planning assortments, buying products, pricing, and allocating and replenishing inventory) and operational execution (labor planning) are similar in many respects. However, each business process requires careful consideration of specific input data, modeling strategies, output requirements, and success metrics. In this session, learn how leading retailers are increasing sales and profitability by operationalizing forecasts that improve decisions across their enterprise.
Read the paper (PDF).
Alex Chien, SAS
Elizabeth Cubbage, SAS
Wanda Shive, SAS
Paper 3478-2015:
Stress Testing for Mid-Sized Banks
In 2014, for the first time, mid-market banks (consisting of banks and bank holding companies with $10-$50 billion in consolidated assets) were required to submit Capital Stress Tests to the federal regulators under the Dodd-Frank Act Stress Testing (DFAST). This is a process large banks have been going through since 2011. However, mid-market banks are not positioned to commit as many resources to their annual stress tests as their largest peers. Limited human and technical resources, incomplete or non-existent detailed historical data, lack of enterprise-wide cross-functional analytics teams, and limited exposure to rigorous model validations are all challenges mid-market banks face. While there are fewer deliverables required from the DFAST banks, the scrutiny the regulators are placing on the analytical models is just as high as their expectations for Comprehensive Capital Analysis and Review (CCAR) banks. This session discusses the differences in how DFAST and CCAR banks execute their stress tests, the challenges facing DFAST banks, and potential ways DFAST banks can leverage the analytics behind this exercise.
Read the paper (PDF).
Charyn Faenza, F.N.B. Corporation
Paper 3187-2015:
Structuring your SAS® Applications for Long-Term Survival: Reproducible Methods in Base SAS® Programming
SAS® users organize their applications in a variety of ways. However, some approaches are more successful than others. In particular, the need to run only part of the code in a file, only some of the time, can be challenging. Reproducible research methods require that SAS applications be understandable by the author and other staff members. In this presentation, you learn how to organize and structure your SAS application to manage the process of data access, data analysis, and data presentation. The approach to structuring applications requires that tasks in the process of data analysis be compartmentalized. This can be done using a well-defined program. The author presents his structuring algorithm, and discusses the characteristics of good structuring methods for SAS applications. Reproducible research methods are becoming increasingly important, and SAS users must keep up with the current developments.
Read the paper (PDF). | Download the data file (ZIP).
Paul Thomas, ASUP Ltd
Paper 1521-2015:
Sums of Squares: The Basics and a Surprise
Most 'Design of Experiment' textbooks cover Type I, Type II, and Type III sums of squares, but many researchers and statisticians fall into the habit of using one type mindlessly. This breakout session reviews the basics and illustrates the importance of the choice of type as well as the variable definitions in PROC GLM and PROC REG.
Read the paper (PDF).
Sheila Barron, University of Iowa
Michelle Mengeling, Comprehensive Access & Delivery Research & Evaluation-CADRE, Iowa City VA Health Care System
Paper 3490-2015:
Survey Sample Designs for Auditing, Fraud, and Forensics
Sampling for audits and forensics presents special challenges: Each survey/sample item requires examination by a team of professionals, so sample size must be contained. Surveys involve estimation, not hypothesis testing, so power is not a helpful concept. Stratification and modeling are often required to keep sampling distributions from being skewed. A precision of alpha is not required to create a confidence interval of 1-alpha, but how small a sample is supportable? Replicated sampling is often required to prove the applicability of the design. Given the robust, programming-oriented approach of SAS®, the random selection, stratification, and optimizing techniques built into SAS can be used to bring transparency and reliability to the sample design process. While a sample that is used in a published audit or as a measure of financial damages must endure special scrutiny, it can be a rewarding process to design a sample whose performance you truly understand and which will stand up under a challenge.
Read the paper (PDF).
Turner Bond, HUD-Office of Inspector General
Paper 2040-2015:
Survival Analysis with Survey Data
Surveys are designed to elicit information about population characteristics. A survey design typically combines stratification and multistage sampling of intact clusters, sub-clusters, and individual units with specified probabilities of selection. A survey sample can produce valid and reliable estimates of population parameters at a fraction of the cost of carrying out a census of the entire population, with clear logistical efficiencies. For analyses of survey data, SAS® software provides a suite of procedures, from SURVEYMEANS and SURVEYFREQ for generating descriptive statistics and conducting inference on means and proportions, to regression-based analysis through SURVEYREG and SURVEYLOGISTIC. For longitudinal surveys and follow-up studies, SURVEYPHREG is designed to incorporate aspects of the survey design for analysis of time-to-event outcomes based on the Cox proportional hazards model, allowing for time-varying explanatory variables. We review the salient features of the SURVEYPHREG procedure with application to survey data from the National Health and Nutrition Examination Survey (NHANES III) Linked Mortality File.
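A minimal sketch of a design-based Cox model with PROC SURVEYPHREG follows; the strata, cluster, weight, follow-up, and mortality variable names are placeholders standing in for the NHANES III Linked Mortality File variables rather than the paper's actual code.

  proc surveyphreg data=nhanes3_mort;
    strata  design_stratum;            /* design strata          */
    cluster design_psu;                /* primary sampling units */
    weight  exam_weight;               /* examination weight     */
    class   sex;
    model   futime*mortstat(0) = age sex bmi;
  run;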
Read the paper (PDF).
Joseph Gardiner, Michigan State University
T
Paper 2219-2015:
Table Lookup Techniques: From the Basic to the Innovative
One of the more commonly needed operations in SAS® programming is to determine the value of one variable based on the value of another. A series of techniques and tools have evolved over the years to make the matching of these values go faster, smoother, and easier. A majority of these techniques require operations such as sorting, searching, and comparing. As it turns out, these types of techniques are some of the more computationally intensive. Consequently, an understanding of the operations involved and a careful selection of the specific technique can often save the user a substantial amount of computing resources. Many of the more advanced techniques can require substantially fewer resources. It is incumbent on the user to have a broad understanding of the issues involved and a more detailed understanding of the solutions available. Even if you do not currently have a BIG data problem, you should at the very least have a basic knowledge of the kinds of techniques that are available for your use.
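As one example of the faster techniques the paper surveys, the sketch below loads a lookup table into a DATA step hash object and retrieves the matching value for each incoming record; the data set and variable names are illustrative.

  data priced;
    if _n_ = 1 then do;
      if 0 then set work.price_list;                 /* defines the host variables */
      declare hash prices(dataset:'work.price_list');
      prices.defineKey('product_id');
      prices.defineData('unit_price');
      prices.defineDone();
    end;
    set work.orders;
    if prices.find() ne 0 then unit_price = .;       /* no match in the lookup     */
  run;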
Read the paper (PDF).
Art Carpenter, California Occidental Consultants
Paper 3352-2015:
Tactical Marketing with SAS® Visual Analytics--Aligning a Customer's Online Journey with In-Store Purchases
Marketers often face a cross-channel challenge in making sense of the behavior of web visitors who spend considerable time researching an item online, even putting the item in a wish list or checkout basket, but failing to follow up with an actual purchase online, instead opting to purchase the item in the store. This research shows the use of SAS® Visual Analytics to address this challenge. This research uses a large data set of simulated web transactional data, combines it with common IDs to attach the data to in-store retail data, and studies it in SAS Visual Analytics. In this presentation, we go over tips and tricks for using SAS Visual Analytics on a non-distributed server. The loaded data set is analyzed step by step to show how to draw correlations in the web browsing behavior of customers and how to link the data to their subsequent in-store behavior. It shows how we can draw inferences between web visits and in-store visits by department. You'll change your marketing strategy as a result of the research.
Read the paper (PDF).
Tricia Aanderud, Zencos
Johann Pasion, 89 Degrees
Paper SAS1804-2015:
Take Your Data Analysis and Reporting to the Next Level by Combining SAS® Office Analytics, SAS® Visual Analytics, and SAS® Studio
SAS® Office Analytics, SAS® Visual Analytics, and SAS® Studio provide excellent data analysis and report generation. When these products are combined, their deep interoperability enables you to take your analysis and reporting to the next level. Build interactive reports in SAS® Visual Analytics Designer, and then view, customize and comment on them from Microsoft Office and SAS® Enterprise Guide®. Create stored processes in SAS Enterprise Guide, and then run them in SAS Visual Analytics Designer, mobile tablets, or SAS Studio. Run your SAS Studio tasks in SAS Enterprise Guide and Microsoft Office using data provided by those applications. These interoperability examples and more will enable you to combine and maximize the strength of each of the applications. Learn more about this integration between these products and what's coming in the future in this session.
Read the paper (PDF).
David Bailey, SAS
Tim Beese, SAS
Casey Smith, SAS
Paper SAS2604-2015:
Take a Bite Out of Crime with SAS® Visual Scenario Designer
The vast and increasing demands of fraud detection and description have promoted the broad application of statistics and machine learning in fields as diverse as banking, credit card application and usage, insurance claims, trader surveillance, health care claims, and government funding and allowance management. SAS® Visual Scenario Designer enables you to derive interactive business rules, along with descriptive and predictive models, to detect and describe fraud. This paper focuses on building interactive decision trees to classify fraud. Attention to optimizing the feature space (candidate predictors) prior to modeling is also covered. Because big data plays an increasingly vital role in fraud detection and description, SAS Visual Scenario Designer leverages the in-memory, parallel, and distributed computing abilities of SAS® LASR™ Analytic Server as a back end to support real-time performance on massive amounts of data.
Read the paper (PDF).
Yue Qi, SAS
Paper SAS1444-2015:
Taking the Path More Travelled--SAS® Visual Analytics and Path Analysis
Understanding the behavior of your customers is key to improving and maintaining revenue streams. It is a critical requirement in the crafting of successful marketing campaigns. Using SAS® Visual Analytics, you can analyze and explore user behavior, click paths, and other event-based scenarios. Flow visualizations help you to best understand hotspots, highlight common trends, and find insights in individual user paths or in aggregated paths. This paper explains the basic concepts of path analysis as well as provides detailed background information about how to use flow visualizations within SAS Visual Analytics.
Read the paper (PDF).
Falko Schulz, SAS
Olaf Kratzsch, SAS
Paper SAS1831-2015:
Teach Them to Fish--How to Use Tasks in SAS® Studio to Enable Co-Workers to Run Your Reports Themselves
How many times has this happened to you? You create a really helpful report and share it with others. It becomes popular and you find yourself running it over and over. Then they start asking, But can't you re-run it and just change ___? (Fill in the blank with whatever simple request you can think of.) Don't you want to just put the report out as a web page with some basic parameters that users can choose themselves and run when they want? Consider writing your own task in SAS® Studio! SAS Studio includes several predefined tasks, which are point-and-click user interfaces that guide the user through an analytical process. For example, tasks enable users to create a bar chart, run a correlation analysis, or rank data. When a user selects a task option, SAS® code is generated and run on the SAS server. Because of the flexibility of the task framework, you can make a copy of a predefined task and modify it or create your own. Tasks use the same common task model and the Velocity Template Language--no Java programming or ActionScript programming is required. Once you have the interface set up to generate the SAS code you need, then you can publish the task for other SAS Studio users to use or you can use a straight URL. Now that others can generate the output themselves, you actually might have time to go fishing!
Read the paper (PDF).
Christie Corcoran, SAS
Amy Peters, SAS
Paper 3042-2015:
Tell Me What You Want: Conjoint Analysis Made Simple Using SAS®
The measurement of factors influencing consumer purchasing decisions is of interest to all manufacturers of goods, retailers selling these goods, and consumers buying these goods. In the past decade, conjoint analysis has become one of the commonly used statistical techniques for analyzing the decisions or trade-offs consumers make when they purchase products. Although recent years have seen increased use of conjoint analysis and conjoint software, there is limited work that has spelled out a systematic procedure on how to do a conjoint analysis or how to use conjoint software. This paper reviews basic conjoint analysis concepts, describes the mathematical and statistical framework on which conjoint analysis is built, and introduces the TRANSREG and PHREG procedures, their syntax, and the output they generate using simplified real-life data examples. This paper concludes by highlighting some of the substantive issues related to the application of conjoint analysis in a business environment and the available autocall macros in SAS/STAT®, SAS/IML®, and SAS/QC® software that can handle more complex conjoint designs and analyses. The paper will benefit basic SAS users, as well as statisticians and research analysts in every industry, especially those in marketing and advertisement.
Read the paper (PDF).
Delali Agbenyegah, Alliance Data Systems
Paper SAS1387-2015:
Ten Tips for Simulating Data with SAS®
Data simulation is a fundamental tool for statistical programmers. SAS® software provides many techniques for simulating data from a variety of statistical models. However, not all techniques are equally efficient. An efficient simulation can run in seconds, whereas an inefficient simulation might require days to run. This paper presents 10 techniques that enable you to write efficient simulations in SAS. Examples include how to simulate data from a complex distribution and how to use simulated data to approximate the sampling distribution of a statistic.
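In the spirit of the efficiency tips described above (a generic sketch, not an excerpt from the paper), the code below generates all samples in a single DATA step and analyzes them with one BY-group procedure call instead of looping over PROC steps:

  %let nsamples = 1000;   /* number of simulated samples  */
  %let n        = 50;     /* observations per sample      */

  data sim;
    call streaminit(12345);
    do sampleid = 1 to &nsamples;
      do i = 1 to &n;
        x = rand('Normal', 100, 15);
        output;
      end;
    end;
  run;

  proc means data=sim noprint;
    by sampleid;
    var x;
    output out=sampdist mean=xbar;   /* approximate sampling distribution of the mean */
  run;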
Read the paper (PDF). | Download the data file (ZIP).
Rick Wicklin, SAS
Paper 3488-2015:
Text Analytics on Electronic Medical Record Data
This session describes our journey from data acquisition to text analytics on clinical, textual data.
Mark Pitts, Highmark Health
Paper 2920-2015:
Text Mining Kaiser Permanente Member Complaints with SAS® Enterprise Miner™
This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.
Read the paper (PDF).
Amanda Pasch, Kaiser Permanente
Paper SAS4644-2015:
The "Waiting Game" is Over. The New Game is "Now" with Event Stream Processing.
Streaming data and real-time analytics are being talked about in a lot of different situations. SAS has been exploring a variety of use cases that meld together streaming data and analytics to drive immediate action. We want to share these stories with you and get your feedback on what's going on in your world, where the waiting game is no longer what's being played.
Steve Sparano, SAS
Paper 3504-2015:
The %ic_mixed Macro: A SAS Macro to Produce Sorted Information Criteria (AIC and BIC) List for PROC MIXED for Model Selection
PROC MIXED is one of the most popular SAS procedures to perform longitudinal analysis or multilevel models in epidemiology. Model selection is one of the fundamental questions in model building. One of the most popular and widely used strategies is model selection based on information criteria, such as Akaike Information Criterion (AIC) and Sawa Bayesian Information Criterion (BIC). This strategy considers both fit and complexity, and enables multiple models to be compared simultaneously. However, there is no existing SAS procedure to perform model selection automatically based on information criteria for PROC MIXED, given a set of covariates. This paper provides information about using the SAS %ic_mixed macro to select a final model with the smallest value of AIC and BIC. Specifically, the %ic_mixed macro will do the following: 1) produce a complete list of all possible model specifications given a set of covariates, 2) use a DO loop to read in one model specification at a time and save it in a macro variable, 3) execute PROC MIXED and use the Output Delivery System (ODS) to output AICs and BICs, 4) append all outputs and use the DATA step to create a sorted list of information criteria with model specifications, and 5) run PROC REPORT to produce the final summary table. Based on the sorted list of information criteria, researchers can easily identify the best model. This paper includes the macro code, as well as examples of the macro calls and outputs.
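A minimal sketch of the core step the macro automates is shown below: capturing the information criteria from PROC MIXED through ODS and stacking them for comparison. The data set and covariate names are illustrative, and METHOD=ML is used so that models with different fixed effects can be compared on AIC and BIC.

  ods output FitStatistics=fit_model1;
  proc mixed data=study method=ml;
    class id;
    model y = time age / solution;
    random intercept time / subject=id type=un;
  run;

  ods output FitStatistics=fit_model2;
  proc mixed data=study method=ml;
    class id;
    model y = time / solution;
    random intercept time / subject=id type=un;
  run;

  data ic_list;
    length model $20;
    set fit_model1(in=a) fit_model2;
    model = ifc(a, 'time + age', 'time only');
    if index(upcase(descr), 'AIC') or index(upcase(descr), 'BIC');  /* keep the information criterion rows */
  run;

  proc sort data=ic_list;
    by descr value;        /* smallest value first within each criterion */
  run;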
Read the paper (PDF).
Qinlei Huang, St Jude Children's Research Hospital
Paper 2882-2015:
The Advantages and Pitfalls of Implementing SAS® in an Amazon Web Services Cloud Instance
With cloud service providers such as Amazon commodifying the process to create a server instance based on desirable OS and sizing requirements for a SAS® implementation, a definite advantage is the speed and simplicity of getting started with a SAS installation. Planning horizons are nonexistent, and initial financial outlay is economized because no server hardware procurement occurs, no data center space reserved, nor any hardware/OS engineers assigned to participate in the initial server instance creation. The cloud infrastructure seems to make the OS irrelevant, an afterthought, and even just an extension of SAS software. In addition, if the initial sizing, memory allocation, or disk space selection results later in some deficiency or errors in SAS processing, the flexibility of the virtual server instance allows the instance image to be saved and restored to a new, larger, or performance-enhanced instance at relatively low cost and minor inconvenience to production users. Once logged on with an authenticated ID, with Internet connectivity established, a SAS installer ID created, and a web browser started, it's just a matter of downloading the SAS® Download Manager to begin the creation of the SAS software depot. Many Amazon cloud instances have download speeds that tend to be greater and processing time that is shorter to create the depot. Installing SAS via the SAS® Deployment Wizard is not dissimilar on a cloud instance versus a server instance, and all the same challenges (for example, SSL, authentication and single sign-on, and repository migration) apply. Overall, SAS administrators have an optimal, straightforward, and low-cost opportunity to deploy additional SAS instances running different versions or more complex configurations (for example, SAS® Grid Computing, resource-based load balancing, and SAS jobs split and run parallel across multiple nodes). While the main advantages of using a cloud instance to deploy a new SAS implementation tend to revolve around efficiency, speed, and affordability, its pitfalls have to do with vulnerability to intrusion and external attack. The same easy, low-cost server instance launch also has a negative flip side that includes a possible lack of experienced OS oversight and basic security precaution. At the moment, Linux administrators around the country are patching their physical and virtual systems to prevent the spread of the Shellshock vulnerability for web servers that originated in cloud instances. Cloud instances have also been targeted and credentials compromised which, in some cases, have allowed thousands of new instances to be spun up and billed to an unsuspecting AWS licensed user. Extra steps have to be taken to prevent the aforementioned attacks and fortunately, there are cloud-based methods available. By creating a Virtual Private Cloud (VPC) instance, AWS users can restrict access by originating IP addresses while also requiring additional administration, including creating entries for application ports that require external access. Moreover, with each step toward more secure cloud implementations, there are additional complexities that arise, including making additional changes or compromises with corporate firewall policy and user authentication methods.
Read the paper (PDF).
Jeff Lehmann, Slalom Consulting
Paper 3278-2015:
The Analytics Behind an NBA Name Change
For the past two academic school years, our SAS® Programming 1 class had a classroom discussion about the Charlotte Bobcats. We wondered aloud: if the Bobcats changed their team name, would the dwindling fan base return? As a class, we created a survey that consisted of 10 questions asking people whether they liked the name Bobcats, whether they attended basketball games, and whether they bought merchandise. Within a one-hour class period, our class surveyed 981 out of 1,733 students at Phillip O. Berry Academy of Technology. After collecting the data, we performed advanced analytics using Base SAS® and concluded that 75% of students and faculty at Phillip O. Berry would prefer any other name except the Bobcats. In other results, 80% of the student body liked basketball, and the most preferred name was the Hornets, followed by the Royals, Flight, Dragons, and finally the Bobcats. The following school year, we conducted another survey to discover if people's opinions had changed since the previous survey and if people were happy with the Bobcats changing their name. During this time period, the Bobcats had recently reported that they were granted the opportunity to change the team name to the Hornets. Once more, we collected and analyzed the data and concluded that 77% of people surveyed were thrilled with the name change. In addition, around 50% of survey respondents were interested in purchasing merchandise. Through the work of this project, SAS® Analytics was applied in the classroom to a real-world scenario. The ability to see how SAS® could be applied to a question of interest and create change inspired the students in our class. This project is significantly important to show the economic impact that sports can have on a city. This project, in particular, focused on the nostalgia that people of the city of Charlotte felt for the name Hornets. The project opened the door for more analysis and questions and continues to spark interest. This is the case because when people have a connection to the team, the more the team flourishes, the more Charlotte benefits.
Read the paper (PDF). | Download the data file (ZIP).
Lauren Cook, Charlotte Mecklenburg School System
Paper 3106-2015:
The Best DATA Step Debugging Feature
Many languages, including the SAS DATA step, have extensive debuggers that can be used to detect logic errors in programs. Another easy way to detect logic errors is to simply display messages and variable content at strategic times. The PUTLOG statement will be discussed and examples given that show how using this statement is probably the easiest and most flexible way to detect and correct errors in your DATA step logic.
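A minimal example of the technique described above: PUTLOG writes a message and the values of selected variables to the log at a strategic point so that unexpected values surface immediately.

  data bmi;
    set sashelp.class;
    bmi = 703 * weight / height**2;
    if missing(bmi) then
      putlog 'WARNING: BMI could not be computed for ' name= height= weight=;
    putlog 'NOTE: processed ' _n_= name= bmi=;
  run;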
Read the paper (PDF).
Steven First, Systems Seminar Consultants
Paper SAS1825-2015:
The Business of Campaign Response Tracking
Tracking responses is one of the most important aspects of the campaign life cycle for a marketing analyst; yet this is often a daunting task. This paper provides guidance for how to determine what is a response, how it is defined for your business, and how you collect data to support it. It provides guidance in the context of SAS® Marketing Automation and beyond.
Read the paper (PDF).
Pamela Dixon, SAS
Paper 3860-2015:
The Challenges with Governing Big Data: How SAS® Can Help
In this session, I discuss an overall approach to governing Big Data. I begin with an introduction to Big Data governance and the governance framework. Then I address the disciplines of Big Data governance: data ownership, metadata, privacy, data quality, and master and reference data management. Finally, I discuss the reference architecture of Big Data, and how SAS® tools can address Big Data governance.
Sunil Soares, Information Asset
Paper 3328-2015:
The Comparative Analysis of Predictive Models for Credit Limit Utilization Rate with SAS/STAT®
Credit card usage modelling is a relatively innovative task of client predictive analytics compared to risk modelling such as credit scoring. The credit limit utilization rate is a problem with limited outcome values and highly dependent on customer behavior. Proportion prediction techniques are widely used for Loss Given Default estimation in credit risk modelling (Belotti and Crook, 2009; Arsova et al, 2011; Van Berkel and Siddiqi, 2012; Yao et al, 2014). This paper investigates some regression models for utilization rate with outcome limits applied and provides a comparative analysis of the predictive accuracy of the methods. Regression models are performed in SAS/STAT® using PROC REG, PROC LOGISTIC, PROC NLMIXED, PROC GLIMMIX, and SAS® macros for model evaluation. The conclusion recommends credit limit utilization rate prediction techniques obtained from the empirical analysis.
Read the paper (PDF).
Denys Osipenko, the University of Edinburgh
Jonathan Crook
Paper 1334-2015:
The Essentials of SAS® Dates and Times
The first thing that you need to know is that SAS® software stores dates and times as numbers. However, this is not the only thing that you need to know, and this presentation gives you a solid base for working with dates and times in SAS. It also introduces you to functions and features that enable you to manipulate your dates and times with surprising flexibility. This paper also shows you some of the possible pitfalls with dates (and with times and datetimes) in your SAS code, and how to avoid them. We show you how SAS handles dates and times through examples, including the ISO 8601 formats and informats, and how to use dates and times in TITLE and FOOTNOTE statements. We close with a brief discussion of Microsoft Excel conversions.
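A minimal sketch of the basics covered above: because a SAS date is a count of days, it can be formatted (including with an ISO 8601 format), shifted with INTNX, and differenced with INTCK.

  data date_demo;
    today     = '28FEB2015'd;                        /* date literal                */
    iso_view  = today;
    next_mon  = intnx('month', today, 1, 'b');       /* first day of the next month */
    days_left = intck('day', today, '31DEC2015'd);   /* days remaining in the year  */
    format today next_mon date9. iso_view e8601da10.;
    put today= iso_view= next_mon= days_left=;
  run;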
Read the paper (PDF). | Download the data file (ZIP).
Derek Morgan
Paper 3298-2015:
The Great Dilemma of Row-Level Permissions for LASR Tables
Many industries are challenged with requirements to protect information and limit its access. In this paper, we will discuss various approaches for row-level access to LASR tables and demonstrate our implementation. Methods discussed in this paper include security joins in data queries, using a star schema with the security table as one dimension, permission conditions based on user information stored in metadata, and user IDs being associated with data as a dedicated column. The paper then identifies shortcomings and strengths of various approaches as well as our iterations to satisfy business needs that led us to our row-level permissions implementation. In addition, the paper offers recommendations and other considerations to keep in mind while working on row-level permissions with LASR tables.
Read the paper (PDF).
Emre Saricicek, University of North Carolina at Chapel Hill
Dean Huff, UNC
Paper SAS1760-2015:
The Impact of Hadoop Resiliency on SAS® LASR™ Analytic Server
The SAS® LASR™ Analytic Server acts as a back-end, in-memory analytics engine for solutions such as SAS® Visual Analytics and SAS® Visual Statistics. It is designed to exist in a massively scalable, distributed environment, often alongside Hadoop. This paper guides you through the impacts of the architecture decisions shared by both software applications and what they specifically mean for SAS®. We then present positive actions you can take to rebound from unexpected outages and resume efficient operations.
Read the paper (PDF).
Rob Collum, SAS
Paper 3060-2015:
The Knight's Tour in Chess--Implementing a Heuristic Solution
The knight's tour is a sequence of moves on a chess board such that a knight visits each square only once. Using a heuristic method, it is possible to find a complete path, beginning from any arbitrary square on the board and landing on the remaining squares only once. However, the implementation poses challenging programming problems. For example, it is necessary to discern viable knight moves, which change throughout the tour. Even worse, the heuristic approach does not guarantee a solution. This paper explains a SAS® solution that finds a knight's tour beginning from every initial square on a chess board...well, almost.
Read the paper (PDF).
John R Gerlach, Dataceutics, Inc.
Paper 3361-2015:
The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS® Enterprise Miner™
Random Forest (RF) is a trademarked term for an ensemble approach to decision trees. RF was introduced by Leo Breiman in 2001. Due to our familiarity with decision trees--one of the intuitive, easily interpretable models that divides the feature space with recursive partitioning and uses sets of binary rules to classify the target--we also know some of its limitations such as over-fitting and high variance. RF uses decision trees, but takes a different approach. Instead of growing one deep tree, it aggregates the output of many shallow trees and makes a strong classifier model. RF significantly improves the accuracy of classification by growing an ensemble of trees and allowing for the selection of the most popular one. Unlike decision trees, RF is robust against over-fitting and high variance, since it randomly selects a subset of variables at each split node. This paper demonstrates this simple yet powerful classification algorithm by building an income-level prediction system. Data extracted from the 1994 Census Bureau database was used for this study. The data set comprises information about 14 key attributes for 45,222 individuals. Using SAS® Enterprise Miner™ 13.1, models such as random forest, decision tree, probability decision tree, gradient boosting, and logistic regression were built to classify the income level (>50K or <50K) of the population. The results showed that the random forest model was the best model for this data, based on the misclassification rate criterion. The RF model predicts the income-level group of the individuals with an accuracy of 85.1%, with the predictors capturing specific characteristic patterns. This demonstration using SAS® can lead to useful insights into RF for solving classification problems.
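Outside the SAS Enterprise Miner interface, the same kind of random forest can be requested in code with the HPFOREST procedure; the sketch below is a hedged illustration with placeholder census-style variable names and option values, not the model settings used in the paper.

  proc hpforest data=work.census maxtrees=200 vars_to_try=4 seed=42;
    target income_level / level=binary;
    input age education_num hours_per_week / level=interval;
    input workclass marital_status occupation / level=nominal;
  run;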
Read the paper (PDF).
Narmada Deve Panneerselvam, OSU
Paper SPON3000-2015:
The New Analytics Experience at SAS®--an Analytics Culture Driven by Millennials
This unique culture has access to lots of data, unstructured and structured; is innovative, experimental, groundbreaking, and doesn't follow convention; and has access to powerful new infrastructure technologies and scalable, industry-standard computing power like never before. The convergence of data, innovative spirit, and the means to process it is what makes this a truly unique culture. In response to that, SAS® proposes The New Analytics Experience. Attend this session to hear more about the New Analytics Experience and the latest Intel technologies that make it possible.
Mark Pallone, Intel
Paper SAS1642-2015:
The REPORT Procedure: A Primer for the Compute Block
It is well-known in the world of SAS® programming that the REPORT procedure is one of the best procedures for creating dynamic reports. However, you might not realize that the compute block is where all of the action takes place! Its flexibility enables you to customize your output. This paper is a primer for using a compute block. With a compute block, you can easily change values in your output with the proper assignment statement and add text with the LINE statement. With the CALL DEFINE statement, you can adjust style attributes such as color and formatting. Through examples, you learn how to apply these techniques for use with any style of output. Understanding how to use the compute-block functionality empowers you to move from creating a simple report to creating one that is more complex and informative, yet still easy to use.
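A minimal example of the two compute-block techniques mentioned above follows: CALL DEFINE changes a style attribute for qualifying cells, and LINE adds explanatory text after the report.

  proc report data=sashelp.class nowd;
    column name age height;
    define name   / display;
    define age    / display;
    define height / display 'Height (in)';
    compute height;
      if height > 65 then
        call define(_col_, 'style', 'style={fontweight=bold}');
    endcomp;
    compute after;
      line 'Heights above 65 inches are shown in bold.';
    endcomp;
  run;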
Read the paper (PDF).
Jane Eslinger, SAS
Paper SAS1956-2015:
The SAS® Scalable Performance Data Engine: Moving Your Data to Hadoop without Giving Up the SAS Features You Depend On
If you are one of the many customers who want to move your SAS® data to Hadoop, one decision you will encounter is what data storage format to use. There are many choices, and all have their pros and cons. One factor to consider is how you currently store your data. If you currently use the Base SAS® engine or the SAS® Scalable Performance Data Engine, then using the SPD Engine with Hadoop will enable you to continue accessing your data with as little change to your existing SAS programs as possible. This paper discusses the enhancements, usage, and benefits of the SPD Engine with Hadoop.
Read the paper (PDF).
Lisa Brown, SAS
Paper 3353-2015:
The SAS® Transformation Project--Deploying SAS Customer Intelligence for a Single View of the Customer
Building a holistic view of the customer is becoming the norm across industries. The financial services industry and retail firms have been at the forefront of striving for this goal. Firm ABC is a large insurance firm based in the United States. It uses multiple campaign management platforms across different lines of business. Marketing campaigns are deployed in isolation. Similarly, responses are tracked and attributed in silos. This prevents the firm from obtaining a holistic view of its customers across products and lines of business and leads to gaps and inefficiencies in data management, campaign management, reporting, and analytics. Firm ABC needed an enterprise-level solution that addressed how to integrate with different types of data sources (both external and internal) and grow as a scalable and agile marketing and analytics organization; how to deploy campaign and contact management using a centralized platform to reduce overlap and redundancies and deliver more coordinated marketing messaging to customers; how to perform more accurate attribution that, in turn, drives marketing measurement and planning; how to implement more sophisticated and visual self-service reporting that enables business users to make marketing decisions; and how to build advanced analytics expertise in-house. The solution needed to support predictive modeling, segmentation, and targeting. Based on these challenges and requirements, the firm conducted an extensive RFP process and reviewed various vendors in the enterprise marketing and business intelligence space. Ultimately, SAS® Customer Intelligence and SAS® Enterprise BI were selected to help the firm achieve its goals and transition to a customer-centric organization. The ability for SAS® to deliver a custom-hosted solution was one of the key drivers for this decision, along with its experience in the financial services and insurance industries. Moreover, SAS can provide the much-needed flexibility and scalability, whether it is around integrating external vendors, credit data, and mail-shop processing, or managing sensitive customer information. This presentation provides detailed insight on the various modules being implemented by the firm, how they will be leveraged to address the goals, and what their roles are in the future architecture. The presentation includes detailed project implementation and provides insights, best practices, and challenges faced during the project planning, solution design, governance, and development and production phases. The project team included marketers, campaign managers, data analysts, business analysts, and developers with sponsorship and participation from the C-suite. The SAS® Transformation Project provides insights and best practices that prove useful for business users, IT teams, and senior management. The scale, timing, and complexity of the solution deployment make it an interesting and relevant case study, not only for financial clients, but also for any large firm that has been tasked with understanding its customers and building a holistic customer profile.
Read the paper (PDF).
Ronak Shah, Slalom Consulting
Minza Zahid, Slalom Consulting
Paper SAS1745-2015:
The SEM Approach to Longitudinal Data Analysis Using the CALIS Procedure
Researchers often use longitudinal data analysis to study the development of behaviors or traits. For example, they might study how an elderly person's cognitive functioning changes over time or how a therapeutic intervention affects a certain behavior over a period of time. This paper introduces the structural equation modeling (SEM) approach to analyzing longitudinal data. It describes various types of latent curve models and demonstrates how you can use the CALIS procedure in SAS/STAT® software to fit these models. Specifically, the paper covers basic latent curve models, such as unconditional and conditional models, as well as more complex models that involve multivariate responses and latent factors. All illustrations use real data that were collected in a study that looked at maternal stress and the relationship between mothers and their preterm infants. This paper emphasizes the practical aspects of longitudinal data analysis. In addition to illustrating the program code, it shows how you can interpret the estimation results and revise the model appropriately. The final section of the paper discusses the advantages and disadvantages of the SEM approach to longitudinal data analysis.
Read the paper (PDF).
Xinming An, SAS
Yiu-Fai Yung, SAS
Paper 3741-2015:
The Spatio-Temporal Impact of Urgent Care Centers on Physician and ER Use
The unsustainable trend in healthcare costs has led to efforts to shift some healthcare services to less expensive sites of care. In North Carolina, the expansion of urgent care centers introduces the possibility that non-emergent and non-life threatening conditions can be treated at a less intensive care setting. BCBSNC conducted a longitudinal study of density of urgent care centers, primary care providers, and emergency departments, and the differences in how members access care near those locations. This talk focuses on several analytic techniques that were considered for the analysis. The model needed to account for the complex relationship between the changes in the population (including health conditions and health insurance benefits) and the changes in the types of services and supply of services offered by healthcare providers proximal to them. Results for the chosen methodology are discussed.
Read the paper (PDF).
Laurel Trantham, Blue Cross and Blue Shield North Carolina
Paper 3200-2015:
The Use Of SAS® Maps For University Retention And Recruiting
Universities often have student data that is difficult to represent, which includes information about the student's home location. Often, student data is represented in tables, and patterns are easily overlooked. This study aimed to represent recruiting and retention data at a county level using SAS® mapping software for a public, land-grant university. Three years of student data from the student records database were used to visually represent enrollment, retention, and other predictors of student success. SAS® Enterprise Guide® was used along with the GMAP procedure to make user-friendly maps. Displaying data using maps on a county level revealed patterns in enrollment, retention, and other factors of interest that might have otherwise been overlooked, which might be beneficial for recruiting purposes.
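A hedged sketch of the mapping approach described above: summarize students by county, then shade the counties with PROC GMAP. The response data set is hypothetical and must carry the same STATE and COUNTY FIPS codes as the MAPS.COUNTIES data set (46 is the FIPS code for South Dakota).

  proc gmap data=work.county_enrollment
            map=maps.counties(where=(state=46));
    id state county;
    choro enrolled / levels=5;    /* shade counties by enrollment count */
  run;
  quit;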
Read the paper (PDF).
Allison Lempola, South Dakota State University
Thomas Brandenburger, South Dakota State University
Paper 1579-2015:
The bookBot Ultimatum!
The bookBot Identity: January 2013. With no memory of it from the past, students and faculty at NC State awake to find the Hunt Library just opened, and inside it, the mysterious and powerful bookBot. A true physical search engine, the bookBot, without thinking, relentlessly pursues, captures, and delivers to the patron any requested book (those things with paper pages--remember?) from the Hunt Library. The bookBot Supremacy: Some books were moved from the central campus library to the new Hunt Library. Did this decrease overall campus circulation or did the Hunt Library and its bookBot reign supreme in increasing circulation? The bookBot Ultimatum: To find out if the opening of the Hunt Library decreased or increased overall circulation. To address the bookBot Ultimatum, the Circulation Statistics Investigation (CSI) team uses the power of SAS® analytics to model library circulation before and after the opening of the Hunt Library. The bookBot Legacy: Join us for the adventure-filled story. Filled with excitement and mystery, this talk is bound to draw a much bigger crowd than had it been more honestly titled Intervention Analysis for Library Data. Tools used are PROC ARIMA, PROC REG, and PROC SGPLOT.
Read the paper (PDF). | Download the data file (ZIP). | Watch the recording.
David Dickey, NC State University
John Vickery, North Carolina State University
Paper 3046-2015:
Thoughts on SAS® Visual Analytics Architecture for Multiple Customer Groups
SAS® Visual Analytics is a product that easily enables the interactive analysis of data. It offers capabilities for analyzing data using a visual approach. This paper discusses architecture options for configuring a SAS Visual Analytics installation that serves multiple customers in parallel. The overall objective is to create an environment that scales with the volume of data and also with the number of customer groups. This paper presents several concepts for serving multiple customer groups and explains the pros and cons of each approach.
Read the paper (PDF).
Jan Bigalke, Allianz Managed Operations and Services SE
Paper 3433-2015:
Three S's in SAS® Visual Analytics: Stored Process, Star Schema, and Security
SAS® Visual Analytics is very responsive in analyzing historical data, and it takes advantage of in-memory data. Data query, exploration, and reports form the basis of the tool, which also has other forward-looking techniques such as star schemas and stored processes. A security model is established by defining the permissions through a web-based application that is stored in a database table. That table is brought to the SAS Visual Analytics environment as a LASR table. Typically, security is established based on the departmental access, geographic region, or other business-defined groups. This permission table is joined with the underlying base table. Security is defined by a data filter expression through a conditional grant using SAS® metadata identities. The in-memory LASR star schema is very similar to a typical star schema. A single fact table that is surrounded by dimension tables is used to create the star schema. The star schema gives you the advantage of loading data quickly on the fly. Each of the dimension tables is joined to the fact table with a dimension key. A SAS application that gives the flexibility and the power of coding is created as a stored process that can be executed as requested by client applications such as SAS Visual Analytics. Input data sources for stored processes can be either LASR tables in the SAS® LASR™ Analytic Server or any other data that can be reached through the stored process code logic.
Read the paper (PDF).
Arun Sugumar, Kavi Associates
Vimal Raj Arockiasamy, Kavi Associates
Paper 3338-2015:
Time Series Modeling and Forecasting--An Application to Stress-Test Banks
Did you ever wonder how large US bank holding companies (BHCs) perform stress testing? I had the pleasure of being a part of this process on the model building end, and now I perform model validation. As with everything that is new and uncertain, there is much room for the discovery process. This presentation explains how banks in general perform time series modeling of different loans and credits to establish the bank's position during simulated stress. You learn the basic process behind model building and validation for Comprehensive Capital Analysis and Review (CCAR) purposes, which includes, but is not limited to, back testing, sensitivity analysis, scenario analysis, and model assumption testing. My goal is to gain your interest in the areas of challenging current modeling techniques and looking beyond standard model assumption testing to assess the true risk behind the formulated model and its consequences. This presentation examines the procedures that happen behind the scenes of any code's syntax to better explore statistics that play crucial roles in assessing model performance and forecasting. Forecasting future periods is the process that needs more attention and a better understanding because this is what the CCAR is really all about. In summary, this presentation engages professionals and students to dig deeper into every aspect of time series forecasting.
Read the paper (PDF).
Ania Supady, KeyCorp
Paper 3820-2015:
Time to Harvest: Operationalizing SAS® Analytics on the SAP HANA Platform
A maximum harvest in farming analytics is achieved only if analytics can also be operationalized at the level of core business applications. Mapped to the use of SAS® Analytics, the fruits of SAS can be shared with Enterprise Business Applications by SAP. Learn how your SAS environment, including the latest SAS® In-Memory Analytics offerings, can be integrated with SAP applications based on the SAP In-Memory Platform SAP HANA. We'll explore how a SAS® Predictive Modeling environment can be embedded inside SAP HANA and how native SAP HANA data management capabilities such as SAP HANA Views, Smart Data Access, and more can be leveraged by SAS applications and contribute to an end-to-end in-memory data management and analytics platform. Come and see how you can extend the reach of your SAS® Analytics efforts with the SAP HANA integration!
Read the paper (PDF).
Morgen Christoph, SAP SE
Paper SAS1905-2015:
Tips and Techniques for Efficiently Updating and Loading Data into SAS® Visual Analytics
So you have big data and need to know how to quickly and efficiently keep your data up-to-date and available in SAS® Visual Analytics? One of the challenges that customers often face is how to regularly update data tables in the SAS® LASR™ Analytic Server, the in-memory analytical platform for SAS Visual Analytics. Is appending data always the right answer? What are some of the key things to consider when automating a data update and load process? Based on proven best practices and existing customer implementations, this paper provides you with answers to those questions and more, enabling you to optimize your update and data load processes. This ensures that your organization develops an effective and robust data refresh strategy.
Read the paper (PDF).
Kerri L. Rivers, SAS
Christopher Redpath, SAS
Paper 3003-2015:
Tips and Tricks for SAS® Program Automation
There are many gotchas when you are trying to automate a well-written program. The details differ depending on the way you schedule the program and the environment you are using. This paper covers system options, error-handling logic, and best practices for logging. Save time and frustration by using these tips as you schedule programs to run.
Read the paper (PDF).
Adam Hood, Slalom Consulting
Paper 3196-2015:
Tips for Managing SAS® Work Libraries
The Work library is at the core of most SAS® programs, but programmers tend to ignore it unless something breaks. This paper first discusses the USER= system option for saving the Work files in a directory. Then, we cover a similar macro-controlled method for saving the files in your Work library, and the interaction of this method with OPTIONS NOREPLACE and the syntax check options. A number of SAS system options that help you to manage Work libraries are discussed: WORK=, WORKINIT, WORKTERM; these options might be restricted by SAS Administrators. Additional considerations in managing Work libraries are discussed: handling large files, file compression, programming style, and macro-controlled deletion of redundant files in SAS® Enterprise Guide®.
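A minimal example of the USER= technique mentioned above: after the option is set, one-level data set names are written to the designated library instead of Work, so they survive the session; the path is illustrative.

  libname scratch '/projects/mystudy/scratch';
  options user=scratch;

  data temp1;            /* lands in SCRATCH rather than WORK    */
    set sashelp.class;
  run;

  data work.temp2;       /* a two-level name still reaches WORK  */
    set sashelp.class;
  run;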
Read the paper (PDF). | Download the data file (ZIP).
Thomas Billings, MUFG Union Bank, N.A.
Avinash Kalwani, Oklahoma State University
Paper 1600-2015:
Tips for Publishing in Health Care Journals with the Medical Expenditure Panel Survey (MEPS) Data Using SAS®
This presentation provides an in-depth analysis, with example SAS® code, of the health care use and expenditures associated with depression among individuals with heart disease using the 2012 Medical Expenditure Panel Survey (MEPS) data. A cross-sectional study design was used to identify differences in health care use and expenditures between depressed (n = 601) and nondepressed (n = 1,720) individuals among patients with heart disease in the United States. Multivariate regression analyses using the SAS survey analysis procedures were conducted to estimate the incremental health services and direct medical costs (inpatient, outpatient, emergency room, prescription drugs, and other) attributable to depression. The prevalence of depression among individuals with heart disease in 2012 was estimated at 27.1% (6.48 million persons) and their total direct medical costs were estimated at approximately $110 billion in 2012 U.S. dollars. Younger adults (< 60 years), women, unmarried, poor, and sicker individuals with heart disease were more likely to have depression. Patients with heart disease and depression had more hospital discharges (relative ratio (RR) = 1.06, 95% confidence interval (CI) [1.02 to 1.09]), office-based visits (RR = 1.27, 95% CI [1.15 to 1.41]), emergency room visits (RR = 1.08, 95% CI [1.02 to 1.14]), and prescribed medicines (RR = 1.89, 95% CI [1.70, 2.11]) than their counterparts without depression. Finally, among individuals with heart disease, overall health care expenditures for individuals with depression was 69% higher than that for individuals without depression (RR = 1.69, 95% CI [1.44, 1.99]). The conclusion is that depression in individuals with heart disease is associated with increased health care use and expenditures, even after adjusting for differences in age, gender, race/ethnicity, marital status, poverty level, and medical comorbidity.
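A hedged sketch of the design-adjusted modeling described above appears below, using the 2012 MEPS design variables (VARSTR, VARPSU, PERWT12F); the outcome, covariates, and subpopulation flag are placeholders rather than the paper's actual specification.

  proc surveylogistic data=meps2012;
    strata  varstr;                        /* variance estimation strata         */
    cluster varpsu;                        /* primary sampling units             */
    weight  perwt12f;                      /* person-level weight                */
    domain  heart_disease;                 /* restrict inference to the subgroup */
    class   depression sex race_eth / param=ref;
    model   any_er_visit(event='1') = depression age sex race_eth;
  run;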
Read the paper (PDF).
Seungyoung Hwang, Johns Hopkins University Bloomberg School of Public Health
Paper 3268-2015:
To %Bquote or Not to %Bquote? That Is the Question (Which Drives SAS® Macro Programmers around the Bend)
The SAS® macro processor is a powerful ally, but it requires respect. There are a myriad of macro functions available, most of which emulate DATA step functions, but some of which require special consideration to fully use their capabilities. Questions to be answered include the following: When should you protect the macro variable? During macro compilation, during macro execution? (What do those phrases even mean?) How do you know when to use which macro function? %BQUOTE(), %NRBQUOTE(), %UNQUOTE(), %SUPERQ(), and so on? What's with the %Q prefix of some macro functions? And more: %SYSFUNC(), %SYSCALL, and so on. Macro developers will no longer be daunted by the complexity of choices. With a little clarification, the power of these macro functions will open up new possibilities.
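A minimal illustration of the kind of protection at issue: when a value containing ampersands arrives from the data, a plain reference rescans it and can trigger warnings, while %SUPERQ retrieves it without any further resolution. The value is illustrative.

  data _null_;
    call symputx('company', 'AT&T and Barnes&Noble');   /* value contains & */
  run;

  %put Unprotected: &company;          /* rescanning may warn that &T and &Noble are unresolved */
  %put Protected:   %superq(company);  /* value retrieved with no further resolution            */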
Read the paper (PDF).
Andrew Howell, ANJ Solutions
Paper 3423-2015:
To Believe or Not To Believe? The Truth of Data Analytics Results
Drawing on the results from machine learning, exploratory statistics, and a variety of related methodologies, data analytics is becoming one of the hottest areas in a variety of global industries. The utility and application of these analyses have been extremely impressive and have led to successes ranging from business value generation to hospital infection control applications. This presentation examines the philosophical foundations (epistemology) associated with scientific discovery and considers whether the currently used analytics techniques rest on a rational philosophy of science. Examples are provided to assist in making the concepts more concrete for the business and scientific user.
Read the paper (PDF). | Watch the recording.
Mike Hardin, The University of Alabama
Paper 2785-2015:
Transpose Data Sets by Merge
Using PROC TRANSPOSE to make wide files wider requires running separate PROC TRANSPOSE steps for each variable that you want transposed, as well as a DATA step using a MERGE statement to combine all of the transposed files. In addition, if you want the variables in a specific order, an extra DATA step is needed to rearrange the variable ordering. This paper presents a method that accomplishes the task in a simpler manner using less code and requiring fewer steps, and which runs n times faster than PROC TRANSPOSE (where n=the number of variables to be transposed).
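For context, here is a sketch of the conventional approach that the paper improves on: one PROC TRANSPOSE per variable plus a MERGE to combine the results. Data set and variable names are illustrative; the paper's single DATA step method replaces all of these steps.

  proc sort data=work.long;
    by id visit;
  run;

  proc transpose data=work.long out=t_height(drop=_name_) prefix=height;
    by id; id visit; var height;
  run;

  proc transpose data=work.long out=t_weight(drop=_name_) prefix=weight;
    by id; id visit; var weight;
  run;

  data work.wide;
    merge t_height t_weight;
    by id;
  run;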
Read the paper (PDF). | Download the data file (ZIP).
Keshan Xia, 3GOLDEN Beijing Technologies Co. Ltd., Beijing, China
Matthew Kastin, I-Behavior
Arthur Tabachneck, AnalystFinder, Inc.
Paper 3081-2015:
Tweet-O-Matic: An Automated Approach to Batch Processing of Tweets
Currently, there are several methods for reading JSON formatted files into SAS® that depend on the version of SAS and which products are licensed. These methods include user-defined macros, visual analytics, PROC GROOVY, and more. The user-defined macro %GrabTweet, in particular, provides a simple way to directly read JSON-formatted tweets into SAS® 9.3. The main limitation of %GrabTweet is that it requires the user to repeatedly run the macro in order to download large amounts of data over time. Manually downloading tweets while conforming to the Twitter rate limits might cause missing observations and is time-consuming overall. Imagine having to sit by your computer the entire day to continuously grab data every 15 minutes, just to download a complete data set of tweets for a popular event. Fortunately, the %GrabTweet macro can be modified to automate the retrieval of Twitter data based on the rate that the tweets are coming in. This paper describes the application of the %GrabTweet macro combined with batch processing to download tweets without manual intervention. Users can specify the phrase parameters they want, run the batch processing macro, leave their computer to automatically download tweets overnight, and return to a complete data set of recent Twitter activity. The batch processing implements an automated retrieval of tweets through an algorithm that assesses the rate of tweets for the specified topic in order to make downloading large amounts of data simpler and effortless for the user.
Read the paper (PDF).
Isabel Litton, California Polytechnic State University, SLO
Rebecca Ottesen, City of Hope and Cal Poly SLO
Paper 3518-2015:
Twelve Ways to Better Graphs
If you are looking for ways to make your graphs more communication-effective, this tutorial can help. It covers both the new ODS Graphics SG (Statistical Graphics) procedures and the traditional SAS/GRAPH® software G procedures. The focus is on management reporting and presentation graphs, but the principles are relevant for statistical graphs as well. Important features unique to SAS® 9.4 are included, but most of the designs and construction methods apply to earlier versions as well. The principles of good graphic design are actually independent of your choice of software.
Read the paper (PDF).
LeRoy Bessler, Bessler Consulting and Research
U
Paper SAS1910-2015:
Unconventional Data-Driven Methodologies Forecast Performance in Unconventional Oil and Gas Reservoirs
How does historical production data relate a story about subsurface oil and gas reservoirs? Business and domain experts must perform accurate analysis of reservoir behavior using only rate and pressure data as a function of time. This paper introduces innovative data-driven methodologies to forecast oil and gas production in unconventional reservoirs that, owing to the nature of the tightness of the rocks, render the empirical functions less effective and accurate. You learn how implementations of the SAS® MODEL procedure provide functional algorithms that generate data-driven type curves on historical production data. Reservoir engineers can now gain more insight to the future performance of the wells across their assets. SAS enables a more robust forecast of the hydrocarbons in both an ad hoc individual well interaction and in an automated batch mode across the entire portfolio of wells. Examples of the MODEL procedure arising in subsurface production data analysis are discussed, including the Duong data model and the stretched exponential data model. In addressing these examples, techniques for pattern recognition and for implementing TREE, CLUSTER, and DISTANCE procedures in SAS/STAT® are highlighted to explicate the importance of oil and gas well profiling to characterize the reservoir. The MODEL procedure analyzes models in which the relationships among the variables comprise a system of one or more nonlinear equations. Primary uses of the MODEL procedure are estimation, simulation, and forecasting of nonlinear simultaneous equation models, and generating type curves that fit the historical rate production data. You will walk through several advanced analytical methodologies that implement the SEMMA process to enable hypotheses testing as well as directed and undirected data mining techniques. SAS® Visual Analytics Explorer drives the exploratory data analysis to surface trends and relationships, and the data QC workflows ensure a robust input space for the performance forecasting methodologies that are visualized in a web-based thin client for interactive interpretation by reservoir engineers.
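In the spirit of the data-driven type curves described above, the sketch below fits a stretched-exponential decline model to monthly production with the MODEL procedure; the data set, variable names, and starting values are placeholders, not the paper's calibration.

  proc model data=work.well_history;
    parms q0 5000 tau 12 beta 0.5;                 /* starting values               */
    rate = q0 * exp(-(month / tau)**beta);         /* stretched exponential decline */
    fit rate / out=work.fitted outpredict;
  run;
  quit;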
Read the paper (PDF).
Keith Holdaway, SAS
Louis Fabbi, SAS
Dan Lozie, SAS
Paper 4241-2015:
Understand Your Customers and Their Business
Understanding your customer will allow you to create, roll out, or implement meaningful, value-added processes and tools that assist in increasing revenue, simplifying or standardizing processes, and increasing productivity. This session will guide you on how to engage your customer, observe and discern their needs, and then ultimately deliver and transition your product.
Brenda Carr, Hudson's Bay Company
Paper 1640-2015:
Understanding Characteristics of Insider Threats by Using Feature Extraction
This paper explores feature extraction from unstructured text variables using Term Frequency-Inverse Document Frequency (TF-IDF) weighting algorithms coded in Base SAS®. Data sets with unstructured text variables can often hold a lot of potential to enable better predictive analysis and document clustering. Each of these unstructured text variables can be used as inputs to build an enriched data set-specific inverted index, and the most significant terms from this index can be used as single word queries to weight the importance of the term to each document from the corpus. This paper also explores the usage of hash objects to build the inverted indices from the unstructured text variables. We find that hash objects provide a considerable increase in algorithm efficiency, and our experiments show that a novel weighting algorithm proposed by Paik (2013) best enables meaningful feature extraction. Our TF-IDF implementations are tested against a publicly available data breach data set to understand patterns specific to insider threats to an organization.
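As a rough sketch of the general idea (not the authors' implementation or Paik's weighting scheme), the steps below tokenize a text variable, compute term and document frequencies, and apply the classic TF-IDF weight, using a hash object for the document-frequency lookup. The data set docs and the variables doc_id and text are assumptions.

  /* 1. Tokenize each document into (doc_id, term) rows */
  data terms;
     set docs;
     length term $40;
     i = 1;
     do while (scan(lowcase(text), i, ' .,;:!?') ne '');
        term = scan(lowcase(text), i, ' .,;:!?');
        output;
        i + 1;
     end;
     keep doc_id term;
  run;

  /* 2. Term frequency per document, document frequency per term */
  proc sql;
     create table tf as
        select doc_id, term, count(*) as tf
        from terms group by doc_id, term;
     create table df as
        select term, count(distinct doc_id) as df
        from terms group by term;
     select count(distinct doc_id) into :ndocs from terms;
  quit;

  /* 3. Look up document frequency with a hash object and weight each term */
  data tfidf;
     if 0 then set df;                      /* define the df variable in the PDV */
     if _n_ = 1 then do;
        declare hash h(dataset: 'df');
        h.defineKey('term');
        h.defineData('df');
        h.defineDone();
     end;
     set tf;
     if h.find() = 0 then tfidf = tf * log(&ndocs / df);
  run;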
Read the paper (PDF). | Watch the recording.
Ila Gokarn, Singapore Management University
Clifton Phua, SAS
Paper 3333-2015:
Understanding Patient Populations in New Hampshire using SAS® Visual Analytics
The NH Citizens Health Initiative and the University of New Hampshire Institute for Health Policy and Practice, in collaboration with Accountable Care Project (ACP) participants, have developed a set of analytic reports to provide systems undergoing transformation a capacity to compare performance on the measures of quality, utilization, and cost across systems and regions. The purpose of these reports is to provide data and analysis on which our ACP learning collaborative can share knowledge and develop action plans that can be adopted by health-care innovators in New Hampshire. This breakout session showcases the claims-based reports, powered by SAS® Visual Analytics and driven by the New Hampshire Comprehensive Health Care Information System (CHIS), which includes commercial, Medicaid, and Medicare populations. With the power of SAS Visual Analytics, hundreds of pages of PDF files were distilled down to a manageable, dynamic, web-based portal that allows users to target information most appealing to them. This streamlined approach reduces barriers to obtaining information, offers that information in a digestible medium, and creates a better user experience. For more information about the ACP or to access the public reports, visit http://nhaccountablecare.org/.
Read the paper (PDF).
Danna Hourani, SAS
Paper 3408-2015:
Understanding Patterns in the Utilization and Costs of Elbow Reconstruction Surgeries: A Healthcare Procedure that is Common among Baseball Pitchers
Athletes in sports, such as baseball and softball, commonly undergo elbow reconstruction surgeries. There is research that suggests that the rate of elbow reconstruction surgeries among professional baseball pitchers continues to rise by leaps and bounds. Given the trend found among professional pitchers, the current study reviews patterns of elbow reconstruction surgery among the privately insured population. The study examined trends (for example, cost, age, geography, and utilization) in elbow reconstruction surgeries among privately insured patients using analytic tools such as SAS® Enterprise Guide® and SAS® Visual Analytics, based on the medical and surgical claims data from the FAIR Health National Private Insurance Claims (NPIC) database. The findings of the study suggested that there are discernable patterns in the prevalence of elbow reconstruction surgeries over time and across specific geographic regions.
Read the paper (PDF). | Download the data file (ZIP).
Jeff Dang, FAIR Health
Paper 3285-2015:
Unraveling the Knot of Ampersands
We've all heard it before: 'If two ampersands don't work, add a third.' But how many of us really know how ampersands work behind the scenes? We show the function of multiple ampersands by going through examples of the common two- and three-ampersand scenarios, and expand to show four, five, six, and even seven ampersands, and explain when they might be (rarely) useful.
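A minimal sketch of the two most common scenarios (the macro variable names and values are made up for illustration):

  %let animal  = cat;
  %let cat     = Felix;
  %let i       = 1;
  %let animal1 = dog;

  %put &animal;      /* cat                                              */
  %put &&animal&i;   /* && -> &, &i -> 1; rescan of &animal1 -> dog      */
  %put &&&animal;    /* && -> &, &animal -> cat; rescan of &cat -> Felix */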
Read the paper (PDF).
Joe Matise, NORC at the University of Chicago
Paper 3141-2015:
Unstructured Data Mining to Improve Customer Experience in Interactive Voice Response Systems
Interactive Voice Response (IVR) systems are likely one of the best and worst gifts to the world of communication, depending on who you ask. Businesses love IVR systems because they take out hundreds of millions of dollars of call center costs in automation of routine tasks, while consumers hate IVRs because they want to talk to an agent! It is a delicate balancing act to manage an IVR system that saves money for the business, yet is smart enough to minimize consumer abrasion by knowing who they are, why they are calling, and providing an easy automated solution or a quick route to an agent. There are many aspects to designing such IVR systems, including engineering, application development, omni-channel integration, user interface design, and data analytics. For larger call volume businesses, IVRs generate terabytes of data per year, with hundreds of millions of rows per day that track all system and customer-facing events. The data is stored in various formats and is often unstructured (lengthy character fields that store API return information or text fields containing consumer utterances). The focus of this talk is the development of a data mining framework based on SAS® that is used to parse and analyze IVR data in order to provide insights into usability of the application across various customer segments. Certain use cases are also provided.
Read the paper (PDF).
Dmitriy Khots, West Corp
Paper 3279-2015:
Update Microsoft Excel While Maintaining Your Formats
There have been many SAS® Global Forum papers written about getting your data into SAS® from Microsoft Excel and getting your data back out to Excel after using SAS to manipulate it. But sometimes you have to update Excel files with special formatting and formulas that would be too hard to replicate with Dynamic Data Exchange (DDE) or that change too often to make DDE worthwhile. But we can still use the output prowess of SAS and a sprinkling of Visual Basic for Applications (VBA) to maintain the existing formatting and formulas in your Excel file! This paper focuses on the possibilities you have in updating Excel files by reading in the data, using SAS to modify it as needed, and then using DDE and a simple Excel VBA macro to output back to Excel and use a formula while maintaining the existing formatting that is present in your source Excel file.
Read the paper (PDF). | Download the data file (ZIP).
Brian Wrobel, Pearson
Paper 3339-2015:
Using Analytics To Help Win The Presidential Election
In 2012, the Obama campaign used advanced analytics to target voters, especially in social media channels. Millions of voters were scored on models each night to predict their voting patterns. These models were used as the driver for all campaign decisions, including TV ads, budgeting, canvassing, and digital strategies. This presentation covers how the Obama campaign strategies worked, what's in store for analytics in future elections, and how these strategies can be applied in the business world.
Read the paper (PDF). | Watch the recording.
Peter Tanner, Capital One
Paper 4640-2015:
Using Analytics to Become The USA Memory Champion
Becoming one of the best memorizers in the world doesn't happen overnight. With hard work, dedication, a bit of obsession, and the assistance of some clever analytics metrics, Nelson Dellis was able to climb to the top of the memory rankings in under a year and become the now three-time USA Memory Champion. In this talk, he explains what it takes to become the best at memory, what is involved in such grueling memory competitions, and how analytics helped him get there.
Nelson Dellis, Climb for Memory
Paper SAS1716-2015:
Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE's Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a category is very small at lower levels of the hierarchy. Blindly applying a standard categorization model is likely to perform poorly if this fact is not taken into consideration. This paper introduces a novel technique for text categorization, Boolean rule extraction, which enables you to address this situation effectively. In addition, models that are generated by a rule-based technique have good interpretability and can be easily modified by a human expert, enabling better human-machine interaction. The paper demonstrates how to use SAS® Text Miner macros and procedures to obtain effective predictive models at all hierarchy levels in a taxonomy.
Read the paper (PDF).
Zheng Zhao, SAS
Russ Albright, SAS
James Cox, SAS
Ning Jin, SAS
Paper 2740-2015:
Using Heat Maps to Compare Clusters of Ontario DWI Drivers
SAS® PROC FASTCLUS generates five clusters for the group of repeat clients of Ontario's Remedial Measures program. Heat map tables are shown for selected variables such as demographics, scales, factor, and drug use to visualize the difference between clusters.
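A minimal sketch of the clustering step, assuming a data set of standardized client measures x1-x8 (the data set and variable names are placeholders):

  proc fastclus data=clients maxclusters=5 maxiter=100 out=clustered;
     var x1-x8;
  run;

  proc means data=clustered mean;     /* cluster profiles that feed the heat map tables */
     class cluster;
     var x1-x8;
  run;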
Rosely Flam-Zalcman, CAMH
Robert Mann, CAMH
Rita Thomas, CAMH
Paper 3320-2015:
Using PROC SURVEYREG and PROC SURVEYLOGISTIC to Assess Potential Bias
The Behavioral Risk Factor Surveillance System (BRFSS) collects data on health practices and risk behaviors via telephone survey. This study focuses on the question, "On average, how many hours of sleep do you get in a 24-hour period?" Recall bias is a potential concern in interviews and questionnaires such as BRFSS. The 2013 BRFSS data is used to illustrate the proper methods for implementing PROC SURVEYREG and PROC SURVEYLOGISTIC, using the complex weighting scheme that BRFSS provides.
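A minimal sketch of the survey-design setup, using the 2013 BRFSS design variables (_STSTR for stratum, _PSU for cluster, _LLCPWT for the final weight) and placeholder analysis variables (sleep_hours, short_sleep, age, sex):

  proc surveyreg data=brfss2013;
     strata _ststr;
     cluster _psu;
     weight _llcpwt;
     model sleep_hours = age sex;
  run;

  proc surveylogistic data=brfss2013;
     strata _ststr;
     cluster _psu;
     weight _llcpwt;
     model short_sleep(event='1') = age sex;
  run;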
Read the paper (PDF).
Lucy D'Agostino McGowan, Vanderbilt University
Alice Toll, Vanderbilt University
Paper SAS1925-2015:
Using SAS to Deliver Web Content Personalization Using Cloud-Based Clickstream Data and On-premises Customer Data
Real-time web content personalization has come into its teen years, but recently a spate of marketing solutions has enabled marketers to finely personalize web content for visitors based on browsing behavior, geo-location, preferences, and so on. In an age where the attention span of a web visitor is measured in seconds, marketers hope that tailoring the digital experience will pique each visitor's interest just long enough to increase corporate sales. The range of solutions spans the entire spectrum from completely cloud-based installations to completely on-premises installations. Marketers struggle to find the most optimal solution that would meet their corporation's marketing objectives, provide them the highest agility and time-to-market, and still keep a low marketing budget. In the last decade or so, marketing strategies that involved personalizing using purely on-premises customer data quickly got replaced by ones that involved personalizing using only web-browsing behavior (a.k.a. clickstream data). This was possible because of a spate of cloud-based solutions that enabled marketers to de-couple themselves from the underlying IT infrastructure and the storage issues of capturing large volumes of data. However, this new trend meant that corporations weren't using much of their treasure trove of on-premises customer data. Of late, however, enterprises have been trying hard to find solutions that give them the best of both--the ease of gathering clickstream data using cloud-based applications and on-premises customer data--to perform analytics that lead to better web content personalization for a visitor. This paper explains a process that attempts to address this rapidly evolving need. The paper assumes that the enterprise already has tools for capturing clickstream data, developing analytical models, and presenting the content. It provides a roadmap to implementing a phased approach where enterprises continue to capture clickstream data, but they bring that data in-house to be merged with customer data to enable their analytics team to build sophisticated predictive models that can be deployed into the real-time web-personalization application. The final phase requires enterprises to continuously improve their predictive models on a periodic basis.
Read the paper (PDF).
Mahesh Subramanian, SAS Institute Inc.
Suneel Grover, SAS
Paper SAS1681-2015:
Using SAS/OR® to Optimize the Layout of Wind Farm Turbines
A Chinese wind energy company designs several hundred wind farms each year. An important step in its design process is micrositing, in which it creates a layout of turbines for a wind farm. The amount of energy that a wind farm generates is affected by geographical factors (such as elevation of the farm), wind speed, and wind direction. The types of turbines and their positions relative to each other also play a critical role in energy production. Currently the company is using an open-source software package to help with its micrositing. As the size of wind farms increases and the pace of their construction speeds up, the open-source software is no longer able to support the design requirements. The company wants to work with a commercial software vendor that can help resolve scalability and performance issues. This paper describes the use of the OPTMODEL and OPTLSO procedures on the SAS® High-Performance Analytics infrastructure together with the FCMP procedure to model and solve this highly nonlinear optimization problem. Experimental results show that the proposed solution can meet the company's requirements for scalability and performance.
Read the paper (PDF).
Sherry (Wei) Xu, SAS
Steven Gardner, SAS
Joshua Griffin, SAS
Baris Kacar, SAS
Jinxin Yi, SAS
Paper 3208-2015:
Using SAS/STAT® Software to Implement a Multivariate Aadaptive Outlier Detection Approach to Distinguish Outliers from Extreme Values
Hawkins (1980) defines an outlier as "an observation that deviates so much from other observations as to arouse the suspicion that it was generated by a different mechanism." To identify data outliers, a classic multivariate outlier detection approach implements the Robust Mahalanobis Distance Method by splitting the distribution of distance values into two subsets (within-the-norm and out-of-the-norm), with the threshold value usually set to the 97.5% quantile of the chi-square distribution with p (number of variables) degrees of freedom; items whose distance values are beyond the threshold are labeled out-of-the-norm. This threshold value is an arbitrary number, however, and it might flag as out-of-the-norm a number of items that are actually extreme values of the baseline distribution rather than outliers. Therefore, it is desirable to identify an additional threshold, a cutoff point that divides the set of out-of-the-norm points into two subsets--extreme values and outliers. One way to do this--in particular for larger databases--is to increase the threshold value to another arbitrary number, but this approach requires taking into consideration the size of the data set, since size affects the threshold separating outliers from extreme values. A 2003 article by Gervini (Journal of Multivariate Statistics) proposes an adaptive threshold that increases with the number of items n if the data is clean but remains bounded if there are outliers in the data. In 2005, Filzmoser, Garrett, and Reimann (Computers & Geosciences) built on Gervini's contribution to derive by simulation a relationship between the number of items n, the number of variables in the data p, and a critical ancillary variable for the determination of outlier thresholds. This paper implements the Gervini adaptive threshold value estimator by using PROC ROBUSTREG and the SAS® chi-square functions CINV and PROBCHI, available in the SAS/STAT® environment. It also provides data simulations to illustrate the reliability and the flexibility of the method in distinguishing true outliers from extreme values.
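A minimal sketch of the classic fixed-threshold flagging step, assuming the robust distances (rd) have already been produced (for example, from PROC ROBUSTREG diagnostics) and that p is the number of variables; the Gervini adaptive refinement itself is in the paper.

  %let p = 6;                                  /* number of variables                */
  data flagged;
     set distances;                            /* one robust distance rd per row    */
     cutoff      = sqrt(cinv(0.975, &p));      /* classic 97.5% chi-square cutoff   */
     pvalue      = 1 - probchi(rd**2, &p);     /* upper-tail chi-square probability */
     out_of_norm = (rd > cutoff);
  run;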
Read the paper (PDF).
Paulo Macedo, Integrity Management Services, LLC
Paper 3503-2015:
Using SAS® Enterprise Guide®, SAS® Enterprise Miner™, and SAS® Marketing Automation to Make a Collection Campaign Smarter
Companies are increasingly relying on analytics as the right solution to their problems. In order to use analytics and create value for the business, companies first need to store, transform, and structure the data to make it available and functional. This paper shows a successful business case where the extraction and transformation of the data combined with analytical solutions were developed to automate and optimize the management of the collections cycle for a TELCO company (DIRECTV Colombia). SAS® Data Integration Studio is used to extract, process, and store information from a diverse set of sources. SAS Information Map is used to integrate and structure the created databases. SAS® Enterprise Guide® and SAS® Enterprise Miner™ are used to analyze the data, find patterns, create profiles of clients, and develop churn predictive models. SAS® Customer Intelligence Studio is the platform on which the collection campaigns are created, tested, and executed. SAS® Web Report Studio is used to create a set of operational and management reports.
Read the paper (PDF).
Darwin Amezquita, DIRECTV
Paulo Fuentes, Directv Colombia
Andres Felipe Gonzalez, Directv
Paper 3101-2015:
Using SAS® Enterprise Miner™ to Predict Breast Cancer at an Early Stage
Breast cancer is the leading cause of cancer-related deaths among women worldwide, and its early detection can reduce mortality rate. Using a data set containing information about breast screening provided by the Royal Liverpool University Hospital, we constructed a model that can provide early indication of a patient's tendency to develop breast cancer. This data set has information about breast screening from patients who were believed to be at risk of developing breast cancer. The most important aspect of this work is that we excluded variables that are in one way or another associated with breast cancer, while keeping the variables that are less likely to be associated with breast cancer or whose associations with breast cancer are unknown as input predictors. The target variable is a binary variable with two values, 1 (indicating a type of cancer is present) and 0 (indicating a type of cancer is not present). SAS® Enterprise Miner™ 12.1 was used to perform data validation and data cleansing, to identify potentially related predictors, and to build models that can be used to predict at an early stage the likelihood of patients developing breast cancer. We compared two models: the first model was built with an interactive node and a cluster node, and the second was built without an interactive node and a cluster node. Classification performance was compared using a receiver operating characteristic (ROC) curve and average squared error. Interestingly, we found significantly improved model performance by using only variables that have a lesser or unknown association with breast cancer. The result shows that the logistic model with an interactive node and a cluster node has better performance with a lower average squared error (0.059614) than the model without an interactive node and a cluster node. Among other benefits, this model will assist inexperienced oncologists in saving time in disease diagnosis.
Read the paper (PDF).
Gibson Ikoro, Queen Mary University of London
Beatriz de la Iglesia, University of East Anglia, Norwich, UK
Paper 3484-2015:
Using SAS® Enterprise Miner™ to Predict the Number of Rings on an Abalone Shell Using Its Physical Characteristics
Abalone is a common name given to sea snails or mollusks. These creatures are highly iridescent, with shells of strong changeable colors. This characteristic makes the shells attractive to humans as decorative objects and jewelry. The abalone structure is being researched to build body armor. The value of a shell varies by its age and the colors it displays. Determining the number of rings on an abalone is a tedious and cumbersome task and is usually done by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. In this poster, I aim to predict the number of rings on any abalone by using its physical characteristics. The data, obtained from the UCI Machine Learning Repository, consists of 4,177 observations with 8 attributes. I considered the number of rings to be my target variable. The abalone's age can be reasonably approximated as 1.5 times the number of rings on its shell. Using SAS® Enterprise Miner™, I have built regression models and neural network models to determine the physical measurements responsible for the number of rings on the abalone. While I have obtained a coefficient of determination of 54.01%, my aim is to improve and expand the analysis using the power of SAS Enterprise Miner. The current initial results indicate that the height, the shucked weight, and the viscera weight of the shell are the three most influential variables in predicting the number of rings on an abalone.
Ganesh Kumar Gangarajula, Oklahoma State University
Yogananda Domlur Seetharam
Paper 1340-2015:
Using SAS® Macros to Flag Claims Based on Medical Codes
Many epidemiological studies use medical claims to identify and describe a population. But finding out who was diagnosed, and who received treatment, isn't always simple. Each claim can have dozens of medical codes, with different types of codes for procedures, drugs, and diagnoses. Even a basic definition of treatment could require a search for any one of 100 different codes. A SAS® macro may come to mind, but generalizing the macro to work with different codes and types allows it to be reused in a variety of different scenarios. We look at a number of examples, starting with a single code type and variable. Then we consider multiple code variables, multiple code types, and multiple flag variables. We show how these macros can be combined and customized for different data with minimal rework. Macro flexibility and reusability are also discussed, along with ways to keep our list of medical codes separate from our program. Finally, we discuss time-dependent medical codes, codes requiring database lookup, and macro performance.
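A minimal sketch of such a flagging macro, with everything made up for illustration (the data set, the diagnosis code variables dx1-dx25, and the hypertension code list are not from the paper):

  %macro flag_codes(data=, out=, codevars=, codelist=, flag=);
     data &out;
        set &data;
        array _codes{*} $ &codevars;
        &flag = 0;
        do _i = 1 to dim(_codes);
           if _codes{_i} in (&codelist) then &flag = 1;   /* any match sets the flag */
        end;
        drop _i;
     run;
  %mend flag_codes;

  %flag_codes(data=claims, out=claims_flagged,
              codevars=dx1-dx25,
              codelist=%str('4019','4011','40290'),
              flag=htn_dx)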
Read the paper (PDF). | Download the data file (ZIP).
Andy Karnopp, Fred Hutchinson Cancer Research Center
Paper 3202-2015:
Using SAS® Mapping Functionality to Measure and Present the Veracity of Location Data
Crowd sourcing of data is growing rapidly, enabled by smart devices equipped with assisted GPS location, tagging of photos, and mapping of other aspects of the users' lives and activities. When such data is used, a fundamental assumption is made that the reported locations are accurate within the usual GPS limitation of approximately 10m. However, as a result of a wide range of technical issues, it turns out that the accuracy of the reported locations is highly variable and cannot be relied on; some locations are accurate, but many are highly inaccurate, and that can affect many of the decisions that are being made based on the data. An analysis of a set of data is presented that demonstrates that this assumption is flawed, and examples are provided of levels of inaccuracy that have significant consequences in a range of contexts. By using Base SAS®, the paper demonstrates the quality and veracity of the data and the scale of the errors that can be present. This analysis has critical significance in fields such as mobile location-based marketing, forensics, and law.
Read the paper (PDF).
Richard Self, University of Derby
Paper 3440-2015:
Using SAS® Text Analytics to Examine Gubernatorial Rhetoric and Mass Incarceration
Throughout the latter part of the twentieth century, the United States of America has experienced an incredible boom in the rate of incarceration of its citizens. This increase arguably began in the 1970s when the Nixon administration oversaw the beginning of the war on drugs in America. The U.S. now has one of the highest rates of incarceration among industrialized nations. However, the citizens who have been incarcerated on drug charges have disproportionately been African American or other racial minorities, even though many studies have concluded that drug use is fairly equal among racial groups. In order to remedy this situation, it is essential to first understand why so many more people have been arrested and incarcerated. In this research, I explore a potential explanation for the epidemic of mass incarceration. I intend to answer the question: does gubernatorial rhetoric have an effect on the rate of incarceration in a state? More specifically, I am interested in examining the language that the governor of a state uses at the annual State of the State address in order to see if there is any correlation between rhetoric and the subsequent rate of incarceration in that state. In order to understand any possible correlation, I use SAS® Text Miner and SAS® Contextual Analysis to examine the attitude towards crime in each speech. The political phenomenon that I am trying to understand is how state government employees are affected by the tone that the chief executive of a state uses towards crime, and whether the actions of these state employees subsequently lead to higher rates of incarceration. The governor is the top government official in charge of employees of a state, so when this official addresses the state, the employees may take the governor's message as an order for how they do their jobs. While many political factors can affect legislation and its enforcement, a governor has the ability to set the tone of a state when it comes to policy issues such as crime.
Catherine Lachapelle, UNC Chapel Hill
Paper SAS1951-2015:
Using SAS® Text Analytics to Examine Labor and Delivery Sentiments on the Internet
In today's society, where seemingly unlimited information is just a mouse click away, many turn to social media, forums, and medical websites to research and understand how mothers feel about the birthing process. Mining the data in these resources helps provide an understanding of what mothers value and how they feel. This paper shows the use of SAS® Text Analytics to gather, explore, and analyze reports from mothers to determine their sentiment about labor and delivery topics. Results of this analysis could aid in the design and development of a labor and delivery survey and be used to understand what characteristics of the birthing process yield the highest levels of importance. These resources can then be used by labor and delivery professionals to engage with mothers regarding their labor and delivery preferences.
Read the paper (PDF).
Michael Wallis, SAS
Paper 3212-2015:
Using SAS® to Combine Regression and Time Series Analysis on U.S. Financial Data to Predict the Economic Downturn
During the financial crisis of 2007-2009, the U.S. labor market lost 8.4 million jobs, causing the unemployment rate to increase from 5% to 9.5%. One of the indicators for economic recession is negative gross domestic product (GDP) for two consecutive quarters. This poster combines quantitative and qualitative techniques to predict the economic downturn by forecasting recession probabilities. Data was collected from the Board of Governors of the Federal Reserve System and the Federal Reserve Bank of St. Louis, containing 29 variables and quarterly observations from 1976-Q1 to 2013-Q3. Eleven variables were selected as inputs based on their effects on recession and to limit multicollinearity: long-term Treasury yield (5-year and 10-year), mortgage rate, CPI inflation rate, prime rate, market volatility index, BBB-rated corporate bond yield, house price index, stock market index, commercial real estate price index, and one calculated variable, yield spread (Treasury yield-curve spread). The target variable was a binary variable depicting the economic recession for each quarter (1=Recession). Data was prepared for modeling by applying imputation and transformation on variables. A two-step analysis was used to forecast the recession probabilities for the short-term period. Predicted recession probabilities were first obtained from the backward-elimination logistic regression model that was selected on the basis of misclassification (validation misclassification=0.115). These probabilities were then forecasted using the exponential smoothing method that was selected on the basis of mean absolute error (MAE=11.04). Results show the recession periods, including the great recession of 2008, and the forecast for eight quarters (up to 2015-Q3).
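A minimal sketch of the two-step approach (variable names are placeholders for the eleven indicators, and quarter is assumed to be a SAS date variable): backward-elimination logistic regression for the recession probabilities, followed by exponential smoothing of the predicted series.

  proc logistic data=macro_qtr descending;
     model recession = t5yr t10yr mortgage cpi prime vix bbb_yield
                       hpi stock crepi spread
           / selection=backward slstay=0.05;
     output out=probs predicted=p_recession;
  run;

  proc esm data=probs out=fcst lead=8;   /* forecast eight quarters ahead */
     id quarter interval=qtr;
     forecast p_recession / model=simple;
  run;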
Read the paper (PDF).
Avinash Kalwani, Oklahoma State University
Nishant Vyas, Oklahoma State University
Paper 3281-2015:
Using SAS® to Create Episodes-of-Hospitalization for Health Services Research
An essential part of health services research is describing the use and sequencing of a variety of health services. One of the most frequently examined health services is hospitalization. A common problem in describing hospitalizations is that a patient might have multiple hospitalizations to treat the same health problem. Specifically, a hospitalized patient might be (1) sent to and returned from another facility in a single day for testing, (2) transferred from one hospital to another, and/or (3) discharged home and re-admitted within 24 hours. In all cases, these hospitalizations are treating the same underlying health problem and should be considered as a single episode. If examined without regard for the episode, a patient would be identified as having 4 hospitalizations (the initial hospitalization, the testing hospitalization, the transfer hospitalization, and the readmission hospitalization). In reality, they had one hospitalization episode spanning multiple facilities. IMPORTANCE: Failing to account for multiple hospitalizations in the same episode has implications for many disciplines including health services research, health services planning, and quality improvement for patient safety. HEALTH SERVICES RESEARCH: Hospitalizations will be counted multiple times, leading to an overestimate of the number of hospitalizations a person had. For example, a person can be identified as having 4 hospitalizations when in reality they had one episode of hospitalization. This will result in a person appearing to be a higher user of health care than is true. RESOURCE PLANNING FOR HEALTH SERVICES: The average time and resources needed to treat a specific health problem may be underestimated. To illustrate, if a patient spends 10 days each in 3 different hospitals in the same episode, the total number of days needed to treat the health problem is 30 days, but each hospital will believe it is only 10, and planned resourcing may be inadequate. QUALITY IMPROVEMENT FOR PATIENT SAFETY: Hospital-acquired infections are a serious concern and a major cause of extended hospital stays, morbidity, and death. As a result, many hospitals have quality improvement programs that monitor the occurrence of infections in order to identify ways to reduce them. If episodes of hospitalizations are not considered, an infection acquired in a hospital that does not manifest until a patient is transferred to a different hospital will incorrectly be attributed to the receiving hospital. PROPOSAL: We have developed SAS® code to identify episodes of hospitalizations, the sequence of hospitalizations within each episode, and the overall duration of the episode. The output clearly displays the data in an intuitive and easy-to-understand format. APPLICATION: The method we will describe and the associated SAS code will be useful to not only health services researchers, but also anyone who works with temporal data that includes nested, overlapping, and subsequent events.
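A minimal sketch of the episode-building logic (not the authors' code): admissions sorted by patient and admission date are grouped into one episode whenever the next admission begins within one day of the previous discharge. The data set hosp and the variables patient_id, admit_dt, and disch_dt are assumptions.

  proc sort data=hosp;
     by patient_id admit_dt disch_dt;
  run;

  data episodes;
     set hosp;
     by patient_id;
     retain episode_id 0 prev_disch .;
     if first.patient_id then prev_disch = .;
     if prev_disch = . or admit_dt > prev_disch + 1 then do;
        episode_id + 1;              /* start a new episode                     */
        episode_seq = 1;             /* first hospitalization in the episode    */
     end;
     else episode_seq + 1;           /* transfer or readmission within 24 hours */
     prev_disch = max(prev_disch, disch_dt);
     drop prev_disch;
  run;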
Read the paper (PDF).
Meriç Osman, Health Quality Council
Jacqueline Quail, Saskatchewan Health Quality Council
Nianping Hu, Saskatchewan Health Quality Council
Nedeene Hudema, Saskatchewan Health Quality Council
Paper 3452-2015:
Using SAS® to Increase Profitability through Cluster Analysis and Simplicity: Follow the Money
Developing a quality product or service, while at the same time improving cost management and maximizing profit, are challenging goals for any company. Finding the optimal balance between efficiency and profitability is not easy. The same can be said in regards to the development of a predictive statistical model. On the one hand, the model should predict as accurately as possible. On the other hand, having too many predictors can end up costing the company money. One of the purposes of this project is to explore the cost of simplicity: when is it worth having a simpler model, and what are some of the costs of using a more complex one? The answer to that question leads us to another one: how can a predictive statistical model be maximized in order to increase a company's profitability? Using data from the consumer credit risk domain provided by CompuCredit (now Atlanticus), we used logistic regression to build binary classification models to predict the likelihood of default. This project compares two of these models. Although the original data set had several hundred predictor variables and more than a million observations, I chose to use rather simple models. My goal was to develop a model with as few predictors as possible, while not going lower than a concordance level of 80%. Two models were evaluated and compared based on efficiency, simplicity, and profitability. Using the selected model, cluster analysis was then performed in order to maximize the estimated profitability. Finally, the analysis was taken one step further through a supervised segmentation process, in order to target the most profitable segment of the best cluster.
Read the paper (PDF).
Sherrie Rodriguez, Kennesaw State University
Paper 3100-2015:
Using SAS® to Manage SAS Users on a UNIX File System
SAS® platform administrators always feel the pinch of not having information about how much storage space is occupied by each user on one specific file system or in the entire environment. Sometimes the platform administrator does not have access to all users' folders, so they have to plan for the worst. There are multiple approaches to tackle this problem. One of the better methods is to initiate an alert mechanism to notify a user when they are in the top 10 file system users on the system.
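One simple variant of such a monitor (a sketch only; the directory path, the units, and the alerting mechanism are assumptions) reads per-user disk usage from a UNIX pipe and reports the top 10 consumers:

  filename du pipe "du -sk /sasdata/users/* 2>/dev/null";

  data usage;
     infile du truncover;
     input kb : 16. path : $200.;
     user = scan(path, -1, '/');        /* folder name as the user ID */
     gb   = kb / 1024 / 1024;
  run;

  proc sort data=usage;
     by descending gb;
  run;

  proc print data=usage(obs=10) noobs;  /* top 10 candidates for an e-mail alert */
     var user gb;
  run;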
Read the paper (PDF).
Venkateswarlu Toluchuri, United Health Group - OPTUM
Paper 3508-2015:
Using Text from Repair Tickets of a Truck Manufacturing Company to Predict Factors that Contribute to Truck Downtime
In this era of big data, the use of text analytics to discover insights is rapidly gaining popularity in businesses. On average, more than 80 percent of the data in enterprises may be unstructured. Text analytics can help discover key insights and extract useful topics and terms from the unstructured data. The objective of this paper is to build a model using textual data that predicts the factors that contribute to downtime of a truck. This research analyzes the data of over 200,000 repair tickets of a leading truck manufacturing company. After the terms were grouped into fifteen key topics using the text topic node of SAS® Text Miner, a regression model was built using these topics to predict truck downtime, the target variable. Data was split into training and validation for developing the predictive models. Knowledge of the factors contributing to downtime and their associations helped the organization to streamline their repair process and improve customer satisfaction.
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 3644-2015:
Using and Understanding the LSMEANS and LSMESTIMATE Statements
The concept of least squares means, or population marginal means, seems to confuse a lot of people. We explore least squares means as implemented by the LSMEANS statement in SAS®, beginning with the basics. Particular emphasis is paid to the effect of alternative parameterizations (for example, whether binary variables are in the CLASS statement) and the effect of the OBSMARGINS option. We use examples to show how to mimic LSMEANS using ESTIMATE statements and the advantages of the relatively new LSMESTIMATE statement. The basics of estimability are discussed, including how to get around the dreaded non-estimable messages. Emphasis is put on using the STORE statement and PROC PLM to test hypotheses without having to redo all the model calculations. This material is appropriate for all levels of SAS experience, but some familiarity with linear models is assumed.
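A minimal sketch of the workflow described above, assuming a data set trial with a three-level treatment trt, a covariate age, and an outcome y:

  proc glm data=trial;
     class trt;
     model y = trt age;
     lsmeans trt / obsmargins pdiff;    /* population marginal means, observed margins */
     store work.fit;                    /* save the fitted model as an item store      */
  run; quit;

  proc plm restore=work.fit;
     /* custom contrast among the LS-means without refitting the model */
     lsmestimate trt 'A vs average of B and C' 2 -1 -1 / divisor=2;
  run;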
Read the paper (PDF). | Watch the recording.
David Pasta, ICON Clinical Research
Paper 1335-2015:
Using the GLIMMIX and GENMOD Procedures to Analyze Longitudinal Data from a Department of Veterans Affairs Multisite Randomized Controlled Trial
Many SAS® procedures can be used to analyze longitudinal data. This study employed a multisite randomized controlled trial design to demonstrate the effectiveness of two SAS procedures, GLIMMIX and GENMOD, for analyzing longitudinal data from five Department of Veterans Affairs Medical Centers (VAMCs). Older male veterans (n = 1222) seen in VAMC primary care clinics were randomly assigned to two behavioral health models, integrated (n = 605) and enhanced referral (n = 617). Data was collected at baseline and at 3-, 6-, and 12-month follow-up. A mixed-effects repeated measures model was used to examine the dependent variable, problem drinking, which was defined as both a count and a dichotomous outcome from baseline to 12-month follow-up. Sociodemographics and depressive symptoms were included as covariates. First, bivariate analyses included general linear model and chi-square tests to examine covariates by group and group by problem-drinking outcomes. All significant covariates were included in the GLIMMIX and GENMOD models. Then, multivariate analysis included mixed models with generalized estimating equations (GEEs). The effects of group, time, and the group-by-time interaction were examined after controlling for covariates. Multivariate results were inconsistent between GLIMMIX and GENMOD using lognormal, Gaussian, Weibull, and gamma distributions. SAS is a powerful statistical program for the analysis of longitudinal data.
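A minimal sketch of the two modeling routes for a repeated count outcome; the data set vamc and the variable names (subj, group, time, drinks, depress) are placeholders, and a Poisson distribution is shown for simplicity.

  /* Marginal (GEE) model with PROC GENMOD */
  proc genmod data=vamc;
     class subj group time;
     model drinks = group time group*time depress / dist=poisson link=log;
     repeated subject=subj / type=exch corrw;     /* GEE working correlation */
  run;

  /* Subject-specific (mixed) model with PROC GLIMMIX */
  proc glimmix data=vamc method=laplace;
     class subj group time;
     model drinks = group time group*time depress / dist=poisson link=log solution;
     random intercept / subject=subj;             /* subject-level random effect */
  run;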
Read the paper (PDF).
Abbas Tavakoli, University of South Carolina/College of Nursing
Marlene Al-Barwani, University of South Carolina
Sue Levkoff, University of South Carolina
Selina McKinney, University of South Carolina
Nikki Wooten, University of South Carolina
Paper SAS1502-2015:
Using the OPTMODEL Procedure in SAS/OR® to Solve Complex Problems
Mathematical optimization is a powerful paradigm for modeling and solving business problems that involve interrelated decisions about resource allocation, pricing, routing, scheduling, and similar issues. The OPTMODEL procedure in SAS/OR® software provides unified access to a wide range of optimization solvers and supports both standard and customized optimization algorithms. This paper illustrates PROC OPTMODEL's power and versatility in building and solving optimization models and describes the significant improvements that result from PROC OPTMODEL's many new features. Highlights include the recently added support for the network solver, the constraint programming solver, and the COFOR statement, which allows parallel execution of independent solver calls. Best practices for solving complex problems that require access to more than one solver are also demonstrated.
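A minimal illustrative model (not from the paper) showing the basic PROC OPTMODEL pattern of declaring data, variables, an objective, and a constraint, and then calling a solver:

  proc optmodel;
     set <str> PRODUCTS = {'a','b','c'};
     num profit {PRODUCTS} = [5 4 3];          /* profit per unit   */
     num hours  {PRODUCTS} = [2 1 1];          /* capacity consumed */
     var Make {PRODUCTS} >= 0;

     max TotalProfit = sum {p in PRODUCTS} profit[p] * Make[p];
     con Capacity: sum {p in PRODUCTS} hours[p] * Make[p] <= 100;

     solve with lp;
     print Make;
  quit;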
Read the paper (PDF). | Download the data file (ZIP).
Rob Pratt, SAS
Paper SAS1855-2015:
Using the PHREG Procedure to Analyze Competing-Risks Data
Competing risks arise in studies in which individuals are subject to a number of potential failure events and the occurrence of one event might impede the occurrence of other events. For example, after a bone marrow transplant, a patient might experience a relapse or might die while in remission. You can use one of the standard methods of survival analysis, such as the log-rank test or Cox regression, to analyze competing-risks data, whereas other methods, such as the product-limit estimator, might yield biased results. An increasingly common practice of assessing the probability of a failure in competing-risks analysis is to estimate the cumulative incidence function, which is the probability subdistribution function of failure from a specific cause. This paper discusses two commonly used regression approaches for evaluating the relationship of the covariates to the cause-specific failure in competing-risks data. One approach models the cause-specific hazard, and the other models the cumulative incidence. The paper shows how to use the PHREG procedure in SAS/STAT® software to fit these models.
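A minimal sketch of both regression approaches; the data set bmt with variables dftime (time), status (0=censored, 1=relapse, 2=death in remission), and covariate group is an assumption.

  /* Cause-specific hazard model for relapse: other causes treated as censored */
  proc phreg data=bmt;
     class group;
     model dftime*status(0,2) = group;
  run;

  /* Fine and Gray model for the cumulative incidence of relapse */
  proc phreg data=bmt;
     class group;
     model dftime*status(0) = group / eventcode=1;
  run;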
Read the paper (PDF).
Ying So, SAS
Paper 3376-2015:
Using the SAS-PIRT Macro for Estimating the Parameters of Polytomous Items
Polytomous items have been widely used in educational and psychological settings. As a result, the demand for statistical programs that estimate the parameters of polytomous items has been increasing. For this purpose, Samejima (1969) proposed the graded response model (GRM), in which category characteristic curves are characterized by the difference of the two adjacent boundary characteristic curves. In this paper, we show how the SAS-PIRT macro (a SAS® macro written in SAS/IML®) was developed based on the GRM and how it performs in recovering the parameters of polytomous items using simulated data.
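In standard notation (a statement of the GRM itself, not of the macro's interface), the boundary characteristic curve for responding in category k or above on item i is a two-parameter logistic function, and each category characteristic curve is the difference of adjacent boundary curves:

  P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_{ik})\}}, \qquad
  P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),

with the conventions P^{*}_{i1}(\theta) = 1 and P^{*}_{i,m_i+1}(\theta) = 0 for an item with m_i ordered categories.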
Read the paper (PDF).
Sung-Hyuck Lee, ACT, Inc.
Paper 2182-2015:
Using the SAS® Output Delivery System (ODS) and the TEMPLATE Procedure to Replace Dynamic Data Exchange (DDE)
Many papers have been written over the years that describe how to use Dynamic Data Exchange (DDE) to pass data from SAS® to Excel. This presentation aims to show you how to do the same exchange with the SAS Output Delivery System (ODS) and the TEMPLATE Procedure.
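As one simple example of the ODS route (the destination, file path, and options shown are illustrative; the paper itself covers the TEMPLATE procedure approach in detail):

  ods excel file="/reports/class_summary.xlsx"
            options(sheet_name='Summary' embedded_titles='yes');

  title 'Class Summary';
  proc print data=sashelp.class noobs;
  run;

  ods excel close;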
Read the paper (PDF). | Download the data file (ZIP).
Peter Timusk, Statistics Canada
V
Paper 1400-2015:
Validation and Monitoring of PD Models for Low Default Portfolios Using PROC MCMC
A bank that wants to use the Internal Ratings Based (IRB) methods to calculate minimum Basel capital requirements has to calculate default probabilities (PDs) for all its obligors. Supervisors are mainly concerned about the credit risk being underestimated. For high-quality exposures or groups with an insufficient number of obligors, calculations based on historical data may not be sufficiently reliable due to infrequent or no observed defaults. In an effort to solve the problem of default data scarcity, modeling assumptions are made, and to control the possibility of model risk, a high level of conservatism is applied. Banks, on the other hand, are more concerned about PDs that are too pessimistic, since this has an impact on their pricing and economic capital. In small samples or where we have few or no defaults, the data provides very little information about the parameters of interest. The incorporation of prior information or expert judgment and the use of Bayesian parameter estimation can potentially be a very useful approach in a situation like this. Using PROC MCMC, we show that a Bayesian approach can serve as a valuable tool for validation and monitoring of PD models for low default portfolios (LDPs). We cover cases ranging from single-period, zero correlation, and zero observed defaults to multi-period, non-zero correlation, and few observed defaults.
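A minimal sketch of the simplest case (single period, zero correlation, zero observed defaults): d defaults among n obligors with a conservative expert prior on the default probability. The data values and the prior are illustrative assumptions only.

  data ldp;
     n = 250; d = 0;                   /* 250 obligors, no observed defaults */
  run;

  proc mcmc data=ldp nmc=50000 seed=27 outpost=post;
     parms pd 0.01;
     prior pd ~ beta(1, 199);          /* prior mean 0.5%, expert judgment   */
     model d ~ binomial(n, pd);
  run;

  proc means data=post mean p95;       /* posterior mean and a conservative PD */
     var pd;
  run;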
Read the paper (PDF).
Machiel Kruger, North-West University
Paper 3475-2015:
Video Killed the Radio Star
How do you serve 25 million video ads a day to Internet users in 25 countries, while ensuring that you target the right ads to the right people on the right websites at the right time? With a lot of help from math, that's how! Come hear how Videology, an Internet advertising company, combines mathematical programming, predictive modeling, and big data techniques to meet the expectations of advertisers and online publishers, while respecting the privacy of online users and combatting fraudulent Internet traffic.
Kaushik Sinha, Videology
Paper SAS1888-2015:
Visualizing Clinical Trial Data: Small Data, Big Insights
Data visualization is synonymous with big data, for which billions of records and millions of variables are analyzed simultaneously. But that does not mean data scientists analyzing clinical trial data that include only thousands of records and hundreds of variables cannot take advantage of data visualization methodologies. This paper presents a comprehensive process for loading standard clinical trial data into SAS® Visual Analytics, an interactive analytic solution. The process implements template reporting for a wide variety of point-and-click visualizations. Data operations required to support this reporting are explained and examples of the actual visualizations are presented so that users can implement this reporting using their own data.
Read the paper (PDF).
Michael Drutar, SAS
Elliot Inman, SAS
Paper 3323-2015:
Visualizing Relationships and Connections in Complex Data Using Network Diagrams in SAS® Visual Analytics
Network diagrams in SAS® Visual Analytics help highlight relationships in complex data by enabling users to visually correlate entire populations of values based on how they relate to one another. Network diagrams are appealing because they enable an analyst to visualize large volumes and relationships of data and to assign multiple roles to represent key factors for analysis such as node size and color and linkage size and color. SAS Visual Analytics can overlay a network diagram on top of a spatial geographic map for an even more appealing visualization. This paper focuses specifically on how to prepare data for network diagrams and how to build network diagrams in SAS Visual Analytics. This paper provides two real-world examples illustrating how to visualize users and groups from SAS® metadata and how banks can visualize transaction flow using network diagrams.
Read the paper (PDF). | Download the data file (ZIP).
Stephen Overton, Zencos Consulting
Benjamin Zenick, Zencos
Paper 3486-2015:
Visualizing Student Enrollment Trends Compared across Calendar Periods and Grouped by Categories with SAS® Visual Analytics
Enrollment management is very important to all colleges. Having the correct tools to help you better understand your enrollment patterns of the past and the future is critical to any school. This session will describe how Valencia College went from manually updating static charts for enrollment management to building dynamic, interactive visualizations that compare how students register across different calendar-date periods (current versus previous period), grouped by different start-of-registration dates--from start of registration, days into registration, and calendar date to previous-year calendar date. This includes being able to see the trend by college campus, instructional method mode (onsite or online), or type of session (part of semester, full, and so on), all available in one visual and sliced and diced via check lists. The report loads 4-6 million rows of data nightly to the SAS® LASR™ Analytic Server in a snap, with no performance issues on the back end or in the presentation visuals. We will give a brief history of how we used to load data into Excel and manually build charts. Then we will describe the current environment, which is an automated approach through SAS® Visual Analytics. We will show pictures of our old, static reports, and then show the audience the power and functionality of our new, interactive reports through SAS Visual Analytics.
Read the paper (PDF).
Juan Olvera, Valencia College
Paper 3581-2015:
Visualizing Your Big Data
Whether you have a few variables to compare or billions of rows of data to explore, seeing the data in visual format can make all the difference in the insights you glean. In this session, learn how to determine which data is best delivered through visualization, understand the myriad types of data visualizations for use with your big data, and create effective data visualizations. If you are new to data visualization, this talk will help you understand how to best communicate with your data.
Read the paper (PDF).
Tricia Aanderud, Zencos
W
Paper SAS1440-2015:
Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows You How
When you are analyzing your data and building your models, you often find out that the data cannot be used in the intended way. Systematic patterns, incomplete data, and inconsistencies from a business point of view are often the reason. You wish you could get a complete picture of the quality status of your data much earlier in the analytic lifecycle. SAS® analytics tools like SAS® Visual Analytics help you to profile and visualize the quality status of your data in an easy and powerful way. In this session, you learn advanced methods for analytic data quality profiling. You will see case studies based on real-life data, where we look at time series data from a bird's-eye view and interactively profile GPS trackpoint data from a sail race.
Read the paper (PDF). | Download the data file (ZIP).
Gerhard Svolba, SAS
Paper SAS1832-2015:
What's New in SAS® Studio?
If you have not had a chance to explore SAS® Studio yet, or if you're anxious to see what's new, this paper gives you an introduction to this new browser-based interface for SAS® programmers and a peek at what's coming. With SAS Studio, you can access your data files, libraries, and existing programs, and you can write new programs while using SAS software behind the scenes. SAS Studio connects to a SAS server in order to process SAS programs. The SAS server can be a hosted server in a cloud environment, a server in your local environment, or a copy of SAS on your local machine. It's a comfortable environment for those used to the traditional SAS windowing environment (SAS® Display Manager) but new features like a query window, process flow diagrams, and tasks have been added to appeal to traditional SAS® Enterprise Guide® users.
Read the paper (PDF).
Mike Porter, SAS
Michael Monaco, SAS
Amy Peters, SAS
Paper SAS1390-2015:
What's New in SAS® Data Management
The latest releases of SAS® Data Integration Studio and DataFlux® Data Management Platform provide an integrated environment for managing and transforming your data to meet new and increasingly complex data management challenges. The enhancements help develop efficient processes that can clean, standardize, transform, master, and manage your data. The latest features include capabilities for building complex job processes, new web-based development and job monitoring environments, enhanced ELT transformation capabilities, big data transformation capabilities for Hadoop, integration with the analytic platform provided by SAS® LASR™ Analytic Server, enhanced features for lineage tracing and impact analysis, and new features for master data and metadata management. This paper provides an overview of the latest features of the products and includes use cases and examples for leveraging product capabilities.
Read the paper (PDF). | Watch the recording.
Nancy Rausch, SAS
Mike Frost, SAS
Paper SAS1572-2015:
When I'm 64-bit: How to Still Use 32-bit DLLs in Microsoft Windows
Now that SAS® users are moving to 64-bit Microsoft Windows platforms, some are discovering that vendor-supplied DLLs might still be 32-bit. Since 64-bit applications cannot use 32-bit DLLs, this would present serious technical issues. This paper explains how the MODULE routines in SAS can be used to call into 32-bit DLLs successfully, using new features added in SAS® 9.3.
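The general MODULE pattern looks like the sketch below, which uses the well-known GetTickCount example from the Windows API purely for illustration; the specifics of bridging from a 64-bit SAS session to a 32-bit DLL are what the paper adds and are not shown here.

  filename sascbtbl temp;
  data _null_;
     file sascbtbl;          /* attribute table describing the DLL routine */
     put 'routine GetTickCount minarg=0 maxarg=0 returns=long module=kernel32;';
  run;

  data _null_;
     ticks = modulen('GetTickCount');   /* call the DLL entry point */
     put 'Milliseconds since boot: ' ticks;
  run;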
Read the paper (PDF).
Rick Langston, SAS
Paper 3600-2015:
When Two Are Better Than One: Fitting Two-Part Models Using SAS
In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling health care costs, some patients have zero costs, and the distribution of positive costs is often extremely right-skewed. When modeling charitable donations, many potential donors give nothing, and the majority of donations are relatively small, with a few very large donors. In the analysis of count data, there are also times when there are more zeros than would be expected using standard methodology, or cases where the zeros might differ substantially from the nonzeros, such as the number of cavities a patient has at a dental appointment or the number of children born to a mother. If the data has such structure and ordinary least squares methods are used, then predictions and estimation might be inaccurate. The two-part model gives us a flexible and useful modeling framework in many situations. Methods for fitting the models with SAS® software are illustrated.
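A minimal sketch of a two-part model for right-skewed cost data with many zeros; the data set costs and the variables cost, age, and sex are assumptions, and a gamma GLM with log link is one common choice for the positive part.

  data costs2;
     set costs;
     any_cost = (cost > 0);            /* part 1 outcome: any cost at all */
  run;

  /* Part 1: probability of any cost */
  proc logistic data=costs2;
     class sex;
     model any_cost(event='1') = age sex;
  run;

  /* Part 2: mean cost among those with positive cost */
  proc genmod data=costs2;
     where cost > 0;
     class sex;
     model cost = age sex / dist=gamma link=log;
  run;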
Read the paper (PDF).
Laura Kapitula, Grand Valley State University
Paper 3216-2015:
Where Did My Students Go?
Many freshmen leave their first college and go on to attend another institution. Some of these students are even successful in earning degrees elsewhere. As there is more focus on college graduation rates, this paper shows how the power of SAS® can pull in data from many disparate sources, including the National Student Clearinghouse, to answer questions on the minds of many institutional researchers. How do we use the data to answer questions such as 'What would my graduation rate be if these students graduated at my institution instead of at another one?', 'What types of schools do students leave to attend?', and 'Are there certain characteristics of students who leave, and are they concentrated in certain programs?' The data-handling capabilities of SAS are perfect for this type of analysis, and this presentation walks you through the process.
Read the paper (PDF).
Stephanie Thompson, Datamum
Paper 3387-2015:
Why Aren't Exception Handling Routines Routine? Toward Reliably Robust Code through Increased Quality Standards in Base SAS®
A familiar adage in firefighting--if you can predict it, you can prevent it--rings true in many circles of accident prevention, including software development. If you can predict that a fire, however unlikely, someday might rage through a structure, it's prudent to install smoke detectors to facilitate its rapid discovery. Moreover, the combination of smoke detectors, fire alarms, sprinklers, fire-retardant building materials, and rapid intervention might not prevent a fire from starting, but it can prevent the fire from spreading and facilitate its immediate and sometimes automatic extinguishment. Thus, as fire codes have grown to incorporate increasingly more restrictions and regulations, and as fire suppression gear, tools, and tactics have continued to advance, even the harrowing business of firefighting has become more reliable, efficient, and predictable. As operational SAS® data processes mature over time, they too should evolve to detect, respond to, and overcome dynamic environmental challenges. Erroneous data, invalid user input, disparate operating systems, network failures, memory errors, and other challenges can surprise users and cripple critical infrastructure. Exception handling describes both the identification of and response to adverse, unexpected, or untimely events that can cause process or program failure, as well as anticipated events or environmental attributes that must be handled dynamically through prescribed, predetermined channels. Rapid suppression and automatic return to functioning is the hopeful end state but, when catastrophic events do occur, exception handling routines can terminate a process or program gracefully while providing meaningful execution and environmental metrics to developers both for remediation and future model refinement. This presentation introduces fault-tolerant Base SAS® exception handling routines that facilitate robust, reliable, and responsible software design.
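A minimal sketch of one such routine (the wrapper design, the threshold, and the step names are illustrative choices, not the paper's code): wrap a step, test the SYSCC automatic macro variable, and terminate gracefully on failure.

  %macro safe_step(code=, step=);
     &code
     %if &syscc > 4 %then %do;
        %put ERROR: Step "&step" failed with SYSCC=&syscc.. Halting job gracefully.;
        /* an alert (e-mail, log summary) could be issued here before stopping */
        %abort cancel;
     %end;
  %mend safe_step;

  %safe_step(code=%nrstr(
     proc sort data=work.claims out=work.claims_sorted;
        by claim_id;
     run;
  ), step=Sort claims)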
Read the paper (PDF).
Troy Hughes, Datmesis Analytics
Paper 3144-2015:
Why Do People Choose to Drive Over Other Travel Modes? Modelling Driving Motivation with SAS®
We demonstrate a method of using SAS® 9.4 to supplement the interpretation of the dimensions of a multidimensional scaling (MDS) model, a process that could be difficult without SAS®. In our paper, we examine why people choose to drive to work (over other means of travel), a question that transportation researchers need to answer in order to encourage drivers to switch to more environmentally friendly travel modes. We applied the MDS approach to a travel survey data set because MDS has the advantage of extracting drivers' motivations in multiple dimensions. To overcome the challenges of dimension interpretation with MDS, we used the logistic regression function of SAS 9.4 to identify the variables that are strongly associated with each dimension, thus greatly aiding our interpretation procedure. Our findings are important to transportation researchers, practitioners, and MDS users.
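The paper's data and models are not shown here, but the general idea can be sketched as follows (all names are hypothetical): fit a two-dimensional MDS configuration to a dissimilarity matrix, then regress a binary travel-mode indicator on the estimated coordinates to see which dimension is most strongly associated with driving.

   /* Fit a two-dimensional configuration to a square dissimilarity matrix */
   proc mds data=work.dissim level=ordinal dimension=2 out=work.coords;
   run;

   /* Keep the object coordinates and attach respondent attributes        */
   /* (assumes respondents appear in the same order in both data sets)    */
   data work.config;
      merge work.coords(where=(_type_='CONFIG')) work.respondents;
   run;

   /* Which dimensions are associated with choosing to drive to work? */
   proc logistic data=work.config;
      model drives_to_work(event='1') = dim1 dim2;
   run;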
Read the paper (PDF).
Jun Neoh, University of Southampton
Paper 3322-2015:
Why Two Good SAS® Programmers Are Better Than One Great SAS® Programmer
The experiences of the programmer role in a large SAS® shop are shared. Shortages in SAS programming talent tend to result in one SAS programmer doing all of the production programming within a unit of a shop. In a real-world example, management realized the problem and brought in new programmers to help do the work. The new programmers actually improved the existing programmers' programs, and it became easier for the experienced programmers to complete other programming assignments within the unit. In addition, the different programs in the shop had a standard structure. As a result, all of the programmers had a clearer picture of the work involved, and knowledge hoarding was eliminated. Experienced programmers were now available when great SAS code needed to be written, yet they were not the only programmers who could do the work. With multiple programmers able to do the same tasks, vacations were possible and didn't threaten deadlines. It was even possible for these programmers to be assigned other tasks outside of the unit and to broaden their own skills in statistical production work.
Read the paper (PDF).
Peter Timusk, Statistics Canada
Paper 3388-2015:
Will You Smell Smoke When Your Data Is on Fire? The SAS® Smoke Detector: A Scalable Quality Control Dashboard for Transactional and Persistent Data
Smoke detectors operate by comparing actual air quality to expected air quality standards and immediately alerting occupants when smoke or particle levels exceed established thresholds. Just as rapid identification of smoke (that is, poor air quality) can detect harmful fire and facilitate its early extinguishment, rapid detection of poor quality data can highlight data entry or ingestion errors, faulty logic, insufficient or inaccurate business rules, or process failure. Aspects of data quality--such as availability, completeness, correctness, and timeliness--should be assessed against stated requirements that account for the scope, objective, and intended use of data products. A single outlier, an accidentally locked data set, or even subtle modifications to a data structure can cause a robust extract-transform-load (ETL) infrastructure to grind to a halt or produce invalid results. Thus, a mature data infrastructure should incorporate quality assurance methods that facilitate robust processing and quality data products, as well as quality control methods that monitor and validate data products against their stated requirements. The SAS® Smoke Detector represents a scalable, generalizable solution that assesses the availability, completeness, and structure of persistent SAS data sets, ideal for finished data products or transactional data sets received with standardized frequency and format. Like a smoke detector, the quality control dashboard is not intended to discover the source of the blaze, but rather to sound an alarm to stakeholders that data have been modified, locked, deleted, or otherwise corrupted. Through rapid detection and response, both the fidelity of data and the responsiveness of developers to threats to data quality and validity are increased.
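The dashboard itself is not reproduced here, but a minimal check in the same spirit (macro name, library, and threshold are hypothetical) tests whether a persistent data set is available, can be opened, and meets an expected row count, and writes an alert to the log when it does not:

   %macro smoke_check(dsn=, min_rows=);
      %local dsid nobs rc;
      %if not %sysfunc(exist(&dsn)) %then %do;
         %put WARNING: ALARM - &dsn is unavailable.;
         %return;
      %end;
      %let dsid = %sysfunc(open(&dsn));
      %if &dsid = 0 %then %do;
         %put WARNING: ALARM - &dsn exists but cannot be opened, possibly locked.;
         %return;
      %end;
      %let nobs = %sysfunc(attrn(&dsid, nlobs));   /* logical observation count */
      %let rc   = %sysfunc(close(&dsid));
      %if &nobs < &min_rows %then
         %put WARNING: ALARM - &dsn has only &nobs rows, expected at least &min_rows..;
      %else
         %put NOTE: &dsn passed availability and completeness checks with &nobs rows.;
   %mend smoke_check;

   %smoke_check(dsn=prod.daily_claims, min_rows=1000)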
Read the paper (PDF).
Troy Hughes, Datmesis Analytics
Paper 3390-2015:
Working with PROC FEDSQL in SAS® 9.4
Working with multiple data sources in SAS® was not straightforward until PROC FEDSQL was introduced in the SAS® 9.4 release. Federated Query Language, or FEDSQL, is a vendor-independent language that provides a common SQL syntax to communicate across multiple relational databases without having to worry about vendor-specific SQL syntax. PROC FEDSQL is a SAS implementation of the FEDSQL language. PROC FEDSQL enables us to write federated queries that perform joins on tables from different databases in a single query, without having to load the tables into SAS individually and combine them using DATA steps and PROC SQL statements. The objective of this paper is to demonstrate how PROC FEDSQL works by fetching data from multiple data sources, such as a Microsoft SQL Server database, a MySQL database, and a SAS data set, and by running federated queries across all of them. Other powerful features of PROC FEDSQL, such as transactions and the FEDSQL pass-through facility, are discussed briefly.
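A minimal sketch of a federated join (connection options and table names are placeholders, not the paper's) might look like the following, with one query drawing on SQL Server, MySQL, and a SAS library at once:

   /* Hypothetical librefs; all connection options are placeholders */
   libname mssrv odbc  dsn=sqlserver_dsn user=myuser password=mypwd;
   libname mydb  mysql server=dbhost database=sales user=myuser password=mypwd;
   libname local '/data/sas';

   proc fedsql;
      create table local.cust_orders as
      select o.order_id, o.order_total, c.segment, d.region
      from mssrv.orders as o
           inner join mydb.customers as c on o.cust_id = c.cust_id
           inner join local.demographics as d on c.cust_id = d.cust_id;
   quit;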
Read the paper (PDF).
Zabiulla Mohammed, Oklahoma State University
Ganesh Kumar Gangarajula, Oklahoma State University
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
Paper SAS1755-2015:
Working with Panel Data: Extracting Value from Multiple Customer Observations
Many retail and consumer packaged goods (CPG) companies are now keeping track of what their customers purchased in the past, often through some form of loyalty program. This record keeping is one example of how modern corporations are building data sets that have a panel structure, a data structure that is also pervasive in insurance and finance organizations. Panel data (sometimes called longitudinal data) can be thought of as the joining of cross-sectional and time series data. Panel data enable analysts to control for factors that cannot be considered by simple cross-sectional regression models that ignore the time dimension. These factors, which are unobserved by the modeler, might bias regression coefficients if they are ignored. This paper compares several methods of working with panel data in the PANEL procedure and discusses how you might benefit from using multiple observations for each customer. Sample code is available.
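As a flavor of the syntax (data set and variable names are hypothetical, and this is only one of the estimators the paper compares), a one-way fixed-effects model on loyalty-card data might be specified as:

   /* data assumed sorted by customer_id and month */
   proc panel data=work.purchases;
      id customer_id month;                       /* cross-section and time index */
      model spend = price promo income / fixone;  /* customer-level fixed effects */
      /* replace FIXONE with RANONE for a one-way random-effects model */
   run;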
Read the paper (PDF). | Download the data file (ZIP).
Bobby Gutierrez, SAS
Kenneth Sanford, SAS
Y
Paper 3262-2015:
Yes, SAS® Can Do! Manage External Files with SAS Programming
Managing and organizing external files and directories play an important part in our data analysis and business analytics work. A good file management system can streamline project management and file organization and significantly improve work efficiency. Therefore, under many circumstances, it is necessary to automate and standardize the file management processes through SAS® programming. Compared with managing SAS files via PROC DATASETS, managing external files is a much more challenging task, which requires advanced programming skills. This paper presents and discusses various methods and approaches to managing external files with SAS programming. The illustrated methods and skills can have important applications in a wide variety of analytic fields.
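A small sketch of the kind of technique involved (the directory path and file extension are hypothetical, not the paper's code) lists the files in a directory with the DOPEN, DNUM, and DREAD functions and deletes temporary files with FDELETE:

   filename dref '/projects/study1/output';      /* hypothetical directory */

   data work.file_list;
      length fname $256;
      did = dopen('dref');                        /* open the directory           */
      if did > 0 then do;
         do i = 1 to dnum(did);                   /* loop over the directory      */
            fname = dread(did, i);
            output;
            if upcase(scan(fname, -1, '.')) = 'TMP' then do;
               rc = filename('tmpref', catx('/', pathname('dref'), fname));
               rc = fdelete('tmpref');            /* delete the temporary file    */
            end;
         end;
         rc = dclose(did);
      end;
      else put 'ERROR: directory could not be opened.';
      keep fname;
   run;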
Read the paper (PDF).
Justin Jia, Trans Union
Amanda Lin, CIBC
Paper 3259-2015:
You Deserve ARRAYS: How to Be More Efficient Using SAS®!
Everyone likes getting a raise, and using ARRAYs in SAS® can help you do just that! Using ARRAYs simplifies processing, allowing repetitive data to be read and analyzed with minimal coding. Improving the efficiency of your coding and, in turn, your SAS productivity is easier than you think! ARRAYs simplify coding by identifying a group of related variables that can then be referred to later in a DATA step. In this quick tip, you learn how to define an ARRAY using an ARRAY statement that establishes the ARRAY name, length, and elements. You also learn the criteria and limitations for the ARRAY name, the requirements for array elements to be processed as a group, and how to reference an ARRAY and specific array elements within a DATA step. This quick tip reveals useful functions and operators, such as the DIM function and the OF operator within existing SAS functions, that make using ARRAYs an efficient and productive way to process data. This paper takes you through an example of how to do the same task with and without ARRAYs in order to illustrate the ease and benefits of using them. Your coding will be more efficient and your data analysis more productive, meaning you will deserve ARRAYs!
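A minimal sketch (with hypothetical variable names, not the paper's example, and assuming the twelve monthly variables are stored adjacently) shows the ARRAY statement, the DIM function, and the OF operator working together:

   data work.sales_k;
      set work.sales;                         /* hypothetical input data set            */
      array m{*} sales_jan--sales_dec;        /* group the related monthly variables    */
      do i = 1 to dim(m);                     /* DIM returns the number of elements     */
         m{i} = m{i} / 1000;                  /* convert dollars to thousands           */
      end;
      annual_total = sum(of m{*});            /* OF operator inside an existing function */
      drop i;
   run;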
Read the paper (PDF).
Kate Burnett-Isaacs, Statistics Canada
Paper 3407-2015:
You Say Day-ta, I Say Dat-a: Measuring Coding Differences, Considering Integrity and Transferring Data Quickly
We all know there are multiple ways to use SAS® language components to generate the same values in data sets and output (for example, using the DATA step versus PROC SQL, IF-THEN/ELSE logic versus format-table conversions, PROC MEANS versus PROC SQL summarizations, and so on). However, do you compare those different ways to determine which are the most efficient in terms of the computer resources used? Do you ever consider the time a programmer takes to develop or maintain code? In addition to determining efficient syntax, do you validate your resulting data sets? Do you ever check data values that must remain the same after being processed by multiple steps and verify that they really don't change? We share some simple coding techniques that have proven to save computer and human resources. We also explain some data validation and comparison techniques that ensure data integrity. In our distributed computing environment, we show a quick way to transfer data from a SAS server to a local client by using PROC DOWNLOAD and then PROC EXPORT on the client to convert the SAS data set to a Microsoft Excel file.
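Two of the techniques mentioned can be sketched as follows (data set names, the ID variable, and the output path are hypothetical; the download step also assumes an existing SAS/CONNECT sign-on to the server):

   /* Verify that key values are unchanged after a multi-step process */
   /* (both data sets sorted by the ID variable)                      */
   proc compare base=work.before_step compare=work.after_step listall;
      id account_id;
      var balance;
   run;

   /* Pull a finished data set down from the server to the local client */
   rsubmit;
      proc download data=server_lib.results out=work.results;
      run;
   endrsubmit;

   /* Convert the downloaded data set to a Microsoft Excel file on the client */
   proc export data=work.results
               outfile='C:\reports\results.xlsx'
               dbms=xlsx replace;
   run;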
Read the paper (PDF).
Jason Beene, Wells Fargo
Mary Katz, Wells Fargo Bank
Paper 2004-2015:
Your Database Can Do SAS®, Too!
How often have you pulled oodles of data out of the corporate data warehouse down into SAS® for additional processing? This additional processing, sometimes thought to be uniquely SAS, might include FIRST. logic, cumulative totals, lag functionality, specialized summarization, or advanced date manipulation. Using the analytical (or OLAP) and windowing functionality available in many databases (for example, in Teradata and IBM Netezza), all of this processing can be performed directly in the database without moving and reprocessing detail data unnecessarily. This presentation illustrates how to increase your coding and execution efficiency by using the database's power through your SAS environment.
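As an illustration of the idea (connection options and table names are placeholders, not the paper's code), a cumulative total that might otherwise be computed with FIRST. logic in a DATA step can be pushed to Teradata with explicit SQL pass-through and a windowed SUM:

   proc sql;
      connect to teradata (user=myuser password=mypwd server=tdprod);  /* placeholders */
      create table work.cume_sales as
      select * from connection to teradata
      (  select customer_id,
                sales_date,
                sales_amt,
                sum(sales_amt) over (partition by customer_id
                                     order by sales_date
                                     rows unbounded preceding) as cume_sales
         from sales_detail
      );
      disconnect from teradata;
   quit;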
Read the paper (PDF).
Harry Droogendyk, Stratia Consulting Inc.
Paper SAS1904-2015:
Your Top Ten SAS® Middle-Tier Questions
As SAS® products become more web-oriented and sophisticated, SAS administrators face an increased challenge in managing their SAS middle-tier environments. They want answers to critical questions when planning, installing, configuring, deploying, and administering their SAS products. They also need to meet requirements for high performance, high availability, increased security, maintainability, and more. In this paper, we identify the most common and challenging questions that our administrators and customers have asked. These questions range across topics such as SAS middle-tier architecture, clustering, performance, security, and administration using SAS® Environment Manager. They come from many sources, such as technical support, consultants, and internal customer experience testing teams. The specific questions include: what is new in the SAS 9.4 middle-tier infrastructure, and why is it better for me; should I use the SAS Web Server, or can I use another third-party web server in my deployment; where can I deploy custom dynamic web applications and static content; what are the upgrade policies and processes for the SAS JRE, SAS Web Server, and SAS Web Application Server; how do I architect and configure for high availability for EBI and VA; how do I install, update, or add products for cluster members; how can I tune middle-tier performance and improve the start-up time of my SAS Web Application Server; what options are available for configuring SSL; what is the security policy, what security patches are available, and how do I apply them; and how can I manage my middle-tier infrastructure and applications, and how are users and accounts managed in SAS Environment Manager? The paper presents detailed answers to these questions and also points out where you can find more information. We believe that with these answers, SAS administrators can better implement and manage their SAS environments with greater confidence and satisfaction.
Read the paper (PDF).
Zhiyong Li, SAS
Mike Thorland, SAS