Government Papers A-Z

A
Paper 1548-2014:
A Framework Based on SAS® for ETL and Reporting
Nowadays, most corporations build and maintain their own data warehouse, and an ETL (Extract, Transform, and Load) process plays a critical role in managing the data. Some people might create a large program and execute this program from top to bottom. Others might generate a SAS® driver with several programs included, and then execute this driver. If some programs can be run in parallel, then developers must write extra code to handle these concurrent processes. If one program fails, then users can either rerun the entire process or comment out the successful programs and resume the job from where the program failed. Usually the programs are deployed in production with read and execute permission only. Users do not have the privilege of modifying code on the fly. In this case, how do you comment out the programs if the job terminates abnormally? This paper illustrates an approach for managing ETL process flows. The approach uses a framework based on SAS, on a UNIX platform. This is a high-level infrastructure discussion with some explanation of the SAS code that is used to implement the framework. The framework supports the rerun or partial run of the entire process without changing any source code. It also supports concurrent processing, so no extra code is needed.
Kevin Chung, Fannie Mae
Paper 1752-2014:
A Note on Type Conversions and Numeric Precision in SAS®: Numeric to Character and Back Again
One of the first lessons that SAS® programmers learn on the job is that numeric and character variables do not play well together, and that type mismatches are one of the more common sources of errors in their otherwise flawless SAS programs. Luckily, converting variables from one type to another in SAS (that is, casting) is not difficult, requiring only the judicious use of either the input() or put() function. There remains, however, the danger of data being lost in the conversion process. This type of error is most likely to occur in cases of character-to-numeric variable conversion, most especially when the user does not fully understand the data contained in the data set. This paper will review the basics of data storage for character and numeric variables in SAS, the use of formats and informats for conversions, and how to ensure accurate type conversion of even high-precision numeric values.
Andrew Clapson, Statistics Canada
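As a quick illustration of the conversions the paper discusses (a sketch with made-up values and variable names, not the author's code), INPUT() with an informat casts character to numeric and PUT() with a format casts numeric to character:

   data convert;
      char_id  = '00123';
      num_id   = input(char_id, 8.);              /* character to numeric */
      char_amt = strip(put(1234.567, 12.3));      /* numeric to character, 3 decimals */
      long_txt = put(123456789.123456, best32.);  /* BEST32. helps retain precision */
   run;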
Paper 1793-2014:
A Poor/Rich SAS® User's PROC EXPORT
Have you ever wished that with one click you could copy any SAS® data set, including variable names, so that you could paste the text into a Microsoft Word file, Microsoft PowerPoint slide, or spreadsheet? You can and, with just Base SAS®, there are some little-known but easy-to-use methods that are available for automating many of your (or your users') common tasks.
Arthur Tabachneck, myQNA, Inc.
Tom Abernathy, Pfizer, Inc.
Matthew Kastin, I-Behavior, Inc.
Paper 1461-2014:
A SAS® Macro to Diagnose Influential Subjects in Longitudinal Studies
Influence analysis in statistical modeling looks for observations that unduly influence the fitted model. Cook's distance is a standard tool for influence analysis in regression. It works by measuring the difference in the fitted parameters as individual observations are deleted. You can apply the same idea to examining the influence of groups of observations (for example, the multiple observations for subjects in longitudinal or clustered data), but you need to adapt it to the fact that different subjects can have different numbers of observations. Such an adaptation is discussed by Zhu, Ibrahim, and Cho (2012), who generalize the subject size factor as the so-called degree of perturbation, and correspondingly generalize Cook's distances as the scaled Cook's distance. This paper presents the %SCDMixed SAS® macro, which implements these ideas for analyzing influence in mixed models for longitudinal or clustered data. The macro calculates the degree of perturbation and scaled Cook's distance measures of Zhu et al. (2012) and presents the results with useful tabular and graphical summaries. The underlying theory is discussed, as well as some of the programming tricks useful for computing these influence measures efficiently. The macro is demonstrated using both simulated and real data to show how you can interpret its results for analyzing influence in your longitudinal modeling.
Grant Schneider, The Ohio State University
Randy Tobias, SAS Institute
Paper 1822-2014:
A Stepwise Algorithm for Generalized Linear Mixed Models
Stepwise regression includes regression models in which the predictive variables are selected by an automated algorithm. The stepwise method involves two approaches: backward elimination and forward selection. Currently, SAS® has three procedures capable of performing stepwise regression: REG, LOGISTIC, and GLMSELECT. PROC REG handles the linear regression model, but does not support a CLASS statement. PROC LOGISTIC handles binary responses and allows for logit, probit, and complementary log-log link functions. It also supports a CLASS statement. The GLMSELECT procedure performs selections in the framework of general linear models. It allows for a variety of model selection methods, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). PROC GLMSELECT also supports a CLASS statement. We present a stepwise algorithm for generalized linear mixed models for both marginal and conditional models. We illustrate the algorithm using data from a longitudinal epidemiology study aimed at investigating parents' beliefs, behaviors, and feeding practices that are positively or negatively associated with indices of sleep quality.
Nagaraj Neerchal, University of Maryland Baltimore County
Jorge Morel, Procter and Gamble
Xuang Huang, University of Maryland Baltimore County
Alain Moluh, University of Maryland Baltimore County
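For orientation, a minimal PROC GLMSELECT call with stepwise selection and a CLASS variable is sketched below; the variable names are hypothetical, and the paper's own algorithm extends this idea to generalized linear mixed models:

   proc glmselect data=study;
      class clinic;
      model sleep_index = clinic age bmi feeding_score
            / selection=stepwise(select=sl slentry=0.15 slstay=0.15);
   run;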
Paper 1204-2014:
A Survey of Some of the Most Useful SAS® Functions
SAS® functions provide amazing power to your DATA step programming. Some of these functions are essential; some of them save you from writing volumes of unnecessary code. This paper covers some of the most useful SAS functions. Some of these functions might be new to you, and they will change the way you program and approach common programming tasks.
Ron Cody, Camp Verde Associates
Paper SAS154-2014:
Accelerated Tests as an Effective Means of Quality Improvement
Accelerated testing is an effective tool for predicting when systems fail, where the system can be as simple as an engine gasket or as complex as a magnetic resonance imaging (MRI) scanner. In particular, you might conduct an experiment to determine how factors such as temperature and voltage impose enough stress on the system to cause failure. Because system components usually meet nominal quality standards, it can take a long time to obtain failure data under normal-use conditions. An effective strategy is to accelerate the experiment by testing under abnormally stressful conditions, such as higher temperatures. Following that approach, you obtain data more quickly, and you can then analyze the data by using the RELIABILITY procedure in SAS/QC® software. The analysis is a three-step process: you establish a probability model, explore the relationship between stress and failure, and then extrapolate to normal-use conditions. Graphs are a key component of all three stages: you choose a model by comparing residual plots from candidate models, use graphs to examine the stress-failure relationship, and then use an appropriately scaled graph to extrapolate along a straight line. This paper guides you through the process, and it highlights features added to the RELIABILITY procedure in SAS/QC 13.1.
Bobby Gutierrez, SAS
Paper 1720-2014:
An Ensemble Approach for Integrating Intuition and Models
Finding groups with similar attributes is at the core of knowledge discovery. To this end, Cluster Analysis automatically locates groups of similar observations. Despite successful applications, many practitioners are uncomfortable with the degree of automation in Cluster Analysis, which causes intuitive knowledge to be ignored. This is even more true in text mining applications, since individual words have meaning beyond the data set. Discovering groups with similar text is extremely insightful. However, blind applications of clustering algorithms ignore intuition and hence are unable to group similar text categories. The challenge is to integrate the power of clustering algorithms with the knowledge of experts. We demonstrate how SAS/STAT® 9.2 procedures and the SAS® Macro Language are used to ensemble the opinions of domain experts with multiple clustering models to arrive at a consensus. The method has been successfully applied to a large data set with structured attributes and unstructured opinions. The result is the ability to discover observations with similar attributes and opinions by capturing the wisdom of the crowds, whether man or model.
Masoud Charkhabi, Canadian Imperial Bank of Commerce (CIBC)
Ling Zhu, Canadian Imperial Bank of Commerce (CIBC)
Paper 1260-2014:
Analyzing Data from Experiments in Which the Treatment Groups Have Different Hierarchical Structures
In randomized experiments, it is generally assumed that the hierarchical structures and variances are the same in the treatment and control groups. In some situations, however, these structures and variance components can differ. Consider a randomized experiment in which individuals randomized to the treatment condition are further assigned to clusters in which the intervention is administered, but no such clustering occurs in the control condition. Such a structure can occur, for example, when the individuals in the treatment condition are randomly assigned to group therapy sessions or to mathematics tutoring groups; individuals in the control condition do not receive group therapy or mathematics tutoring and therefore do not have that level of clustering. In this example, individuals in the treatment condition have a hierarchical structure, but individuals in the control condition do not. If the therapists or tutors differ in efficacy, the clustering in the treatment condition induces an extra source of variability in the data that needs to be accounted for in the analysis. We show how special features of SAS® PROC MIXED and PROC GLIMMIX can be used to analyze data in which one or more treatment groups have a hierarchical structure that differs from that in the control group. We also discuss how to code variables in order to increase the computational efficiency for estimating parameters from these designs.
Sharon Lohr, Westat
Peter Schochet, Mathematica Policy Research
Paper SAS026-2014:
Analyzing Multilevel Models with the GLIMMIX Procedure
Hierarchical data are common in many fields, from pharmaceuticals to agriculture to sociology. As data sizes and sources grow, information is likely to be observed on nested units at multiple levels, calling for the multilevel modeling approach. This paper describes how to use the GLIMMIX procedure in SAS/STAT® to analyze hierarchical data that have a wide variety of distributions. Examples are included to illustrate the flexibility that PROC GLIMMIX offers for modeling within-unit correlation, disentangling explanatory variables at different levels, and handling unbalanced data. Also discussed are enhanced weighting options, new in SAS/STAT 13.1, for both the MODEL and RANDOM statements. These weighting options enable PROC GLIMMIX to handle weights at different levels. PROC GLIMMIX uses a pseudo-likelihood approach to estimate parameters, and it computes robust standard error estimators. This new feature is applied to an example of complex survey data that are collected from multistage sampling and have unequal sampling probabilities.
Min Zhu, SAS
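A minimal sketch of a two-level model of the kind the paper describes, with students nested within schools and a binary response (data set and variable names are hypothetical):

   proc glimmix data=survey method=rspl;
      class school;
      model pass(event='1') = ses study_hours / dist=binary link=logit solution;
      random intercept / subject=school;   /* school-level random intercept */
   run;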
Paper 1565-2014:
Analyzing U.S. Healthcare Cost and Use with SAS®
A central component of discussions of healthcare reform in the U.S. is the estimation of healthcare cost and use at the national or state level, as well as subpopulation analyses for individuals with certain demographic properties or medical conditions. For example, a striking but persistent observation is that just 1% of the U.S. population accounts for more than 20% of total healthcare costs, and 5% account for almost 50% of total costs. In addition to describing the specific data sources underlying this type of observation, we demonstrate how to use SAS® to generate these estimates and to extend the analysis in various ways; that is, to investigate costs for specific subpopulations. The goal is to provide SAS programmers and healthcare analysts with sufficient data-source background and analytic resources to independently conduct analyses on a wide variety of topics in healthcare research. For selected examples, such as the estimates above, we concretely show how to download the data from federal web sites, replicate published estimates, and extend the analysis. An added plus is that most of the data sources we describe are available as free downloads.
Paul Gorrell, IMPAQ International
Paper 1630-2014:
Application of Survey Sampling for Quality Control
Sampling is widely used in different fields for quality control, population monitoring, and modeling. However, the purposes of sampling might be justified by the business scenario, such as legal or compliance needs. This paper uses one probability sampling method, stratified sampling, combined with the business cost of quality control review to determine an optimized sampling procedure that satisfies both statistical selection criteria and business needs. The first step is to determine the total number of strata by grouping the strata with a small number of sample units, using box-and-whisker plot outliers as a whole. Then, the cost to review the sample in each stratum is justified by a corresponding business counterpart, measured in working hours. Lastly, using the determined number of strata and the sample review cost, optimal allocation of the predetermined total sample is applied to allocate the sample across the strata.
Yi Du, Freddie Mac
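The final allocation step can be sketched with PROC SURVEYSELECT, assuming the strata and the per-stratum sample sizes have already been determined (data set, variable, and size values are hypothetical):

   proc sort data=population;
      by stratum;
   run;

   proc surveyselect data=population out=qc_sample
                     method=srs sampsize=(30 50 20) seed=20140101;
      strata stratum;   /* simple random sampling within each stratum */
   run;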
Paper 1605-2014:
Assigning Agents to Districts under Multiple Constraints Using PROC CLP
The Challenge: assigning outbound calling agents in a telemarketing campaign to geographic districts. The districts have a variable number of leads, and each agent needs to be assigned entire districts with the total number of leads being as close as possible to a specified number for each of the agents (usually, but not always, an equal number). In addition, there are constraints concerning the distribution of assigned districts across time zones in order to maximize productivity and availability. Our Solution: use the SAS/OR® procedure PROC CLP to formulate the challenge as a constraint satisfaction problem (CSP) since the objective is not necessarily to minimize a cost function, but rather to find a feasible solution to the constraint set. The input consists of the number of agents, the number of districts, the number of leads in each district, the desired number of leads per agent, the amount by which the actual number of leads can differ from the desired number, and the time zone for each district.
Kevin Gillette, Accenture
Stephen Sloan, Accenture
Paper 1732-2014:
Automatic and Efficient Post-Campaign Analyses By Using SAS® Macro Programs
In our previous work, we often needed to perform large numbers of repetitive and data-driven post-campaign analyses to evaluate the performance of marketing campaigns in terms of customer response. These routine tasks were usually carried out manually by using Microsoft Excel, which was tedious, time-consuming, and error-prone. In order to improve the work efficiency and analysis accuracy, we managed to automate the analysis process with SAS® programming and replace the manual Excel work. Through the use of SAS macro programs and other advanced skills, we successfully automated the complicated data-driven analyses with high efficiency and accuracy. This paper presents and illustrates the creative analytical ideas and programming skills for developing the automatic analysis process, which can be extended to apply in a variety of business intelligence and analytics fields.
Justin Jia, Canadian Imperial Bank of Commerce (CIBC)
Amanda Lin, Bell Canada
B
Paper 1617-2014:
Basic Concepts for Documenting SAS® Projects: Documentation Styles for SAS Projects, Programs, and Variables
This paper kicks off a project to write a comprehensive book of best practices for documenting SAS® projects. The presenter's existing documentation styles are explained. The presenter wants to discuss and gather current best practices used by the SAS user community. The presenter shows documentation styles at three different levels of scope. The first is a style used for project documentation, the second a style for program documentation, and the third a style for variable documentation. This third style enables researchers to repeat the modeling in SAS research, in an alternative language, or conceptually.
Peter Timusk, Statistics Canada
Paper 1449-2014:
Basic SAS® PROCedures for Producing Quick Results
As IT professionals, saving time is critical. Delivering timely and quality-looking reports and information to management, end users, and customers is essential. SAS® provides numerous 'canned' PROCedures for generating quick results to take care of these needs ... and more. In this hands-on workshop, attendees acquire basic insights into the power and flexibility offered by SAS PROCedures, using PRINT, FORMS, and SQL to produce detail output; FREQ, MEANS, and UNIVARIATE to summarize and create tabular and statistical output; and DATASETS to manage data libraries. Additional topics include techniques for informing SAS which data set to use as input to a procedure, how to subset data using a WHERE statement (or WHERE= data set option), and how to perform BY-group processing.
Kirk Paul Lafler, Software Intelligence Corporation
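A few of the quick-results patterns covered in the workshop, sketched here with the SASHELP.CLASS sample data:

   proc print data=sashelp.class noobs;
      where age >= 13;            /* subset with a WHERE statement */
   run;

   proc freq data=sashelp.class;
      tables sex / nocum;         /* one-way frequency table */
   run;

   proc means data=sashelp.class n mean min max;
      class sex;                  /* summary statistics by group */
      var height weight;
   run;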
Paper 1444-2014:
Before You Get Started: A Macro Language Preview in Three Parts. Part 1: What the Language Is, What It Does, and What It Can Do
As complicated as the macro language is to learn, there are very strong reasons for doing so. At its heart, the macro language is a code generator. In its simplest uses, it can substitute simple bits of code like variable names and the names of data sets that are to be analyzed. In more complex situations, it can be used to create entire statements and steps based on information that may even be unavailable to the person writing or even executing the macro. At the time of execution, it can be used to make queries of the SAS® environment as well as the operating system, and to utilize the gathered information to make informed decisions about how it is to further function and execute.
Art Carpenter, California Occidental Consultants
Paper 1445-2014:
Before You Get Started: A Macro Language Preview in Three Parts. Part 2: It's All about the Timing. Why the Macro Language Comes First
Because the macro language is primarily a code generator, it makes sense that the code that it creates must be generated before it can be executed. This implies that execution of the macro language comes first. Simple as this is in concept, timing issues and conflicts are often not so simple to recognize in application. As we use the macro language to take on more complex tasks, it becomes even more critical that we have an understanding of these issues.
Art Carpenter, California Occidental Consultants
Paper 1447-2014:
Before You Get Started: A Macro Language Preview in Three Parts. Part 3: Creating Macro Variables and Demystifying Their Scope
Macro variables and their values are stored in symbol tables, which in turn are held in memory. Not only are there a number of ways to create macro variables, but they can be created in a wide variety of situations. How they are created, and under what circumstances, affects the variable's scope: how and where the macro variable is stored and retrieved. There are a number of misconceptions about macro variable scope and about how macro variables are assigned to symbol tables. These misconceptions can cause problems that the new, and sometimes even the experienced, macro programmer does not anticipate. Understanding the basic rules for macro variable assignment can help the macro programmer solve some of these problems, which are otherwise quite mystifying.
Art Carpenter, California Occidental Consultants
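A small sketch of local versus global symbol tables (the macro and variable names are made up for illustration):

   %let status = global;               /* written to the global symbol table */

   %macro scope_demo;
      %local status;                   /* a new, local STATUS hides the global one */
      %let status = local;
      %put Inside the macro, STATUS = &status;
   %mend scope_demo;

   %scope_demo
   %put Outside the macro, STATUS = &status;   /* still resolves to global */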
Paper 1895-2014:
Best Practices in SAS® Enterprise Guide®
This paper provides an overview of how to create a SAS® Enterprise Guide® process that is well designed, simple, documented, automated, modular, efficient, reliable, and easy to maintain. Topics include how to organize a SAS Enterprise Guide process, how to best document in SAS Enterprise Guide, when to leverage point-and-click functionality, and how to automate and simplify SAS Enterprise Guide processes. This paper has something for any SAS Enterprise Guide user, new or experienced!
Jennifer First-Kluge, Systems Seminar Consultants
Steven First, Systems Seminar Consultants
Paper 1792-2014:
Big Data/Metadata Governance
The emerging discipline of data governance encompasses data quality assurance, data access and use policy, security risks and privacy protection, and longitudinal management of an organization's data infrastructure. In the interests of forestalling another bureaucratic solution to data governance issues, this presentation features database programming tools that provide rapid access to big data and make selective access to and restructuring of metadata practical.
Sigurd Hermansen, Westat
Paper 1794-2014:
Big Data? Faster Cube Builds? PROC OLAP Can Do It
In many organizations the amount of data we deal with increases far faster than the hardware and IT infrastructure to support it. As a result, we encounter significant bottlenecks and I/O-bound processes. However, clever use of SAS® software can help us find a way around them. In this paper we look at the clever use of PROC OLAP to show you how to address I/O-bound processing and spread I/O traffic to different servers to increase cube-building efficiency. This paper assumes experience with SAS® OLAP Cube Studio and/or PROC OLAP.
Yunbo (Jenny) Sun, Canada Post
Michael Brule, SAS
Paper SAS171-2014:
Big Digital Data, Analytic Visualization, and the Opportunity of Digital Intelligence
Digital data has manifested into a classic BIG DATA challenge for marketers who want to push past the retroactive analysis limitations of traditional web analytics. The current groundswell of digital device adoption and variety of digital interactions grows larger year after year. The opportunity for 'digital intelligence' has arrived, as traditional web analytic techniques were not designed for the breadth of channels, devices, and pace that fuels consumer experiences. In parallel, today's landscape for data visualization, advanced analytics, and our ability to process very large amounts of multi-channel information is changing. The democratization of analytics for the masses is upon us, and marketers have the opportunity to take advantage of descriptive, predictive, and (most importantly) prescriptive data-driven insights. This presentation describes how organizations can use SAS® products, specifically SAS® Visual Analytics and SAS® Adaptive Customer Experience, to overcome the limitations of web analytics, and support data-driven integrated marketing objectives.
Suneel Grover, SAS
C
Paper 1558-2014:
%COVTEST: A SAS® Macro for Hypothesis Testing in Linear Mixed Effects Models via Parametric Bootstrap
Inference of variance components in linear mixed effect models (LMEs) is not always straightforward. I introduce and describe a flexible SAS® macro (%COVTEST) that uses the likelihood ratio test (LRT) to test covariance parameters in LMEs by means of the parametric bootstrap. Users must supply the null and alternative models (as macro strings), and a data set name. The macro calculates the observed LRT statistic and then simulates data under the null model to obtain an empirical p-value. The macro also creates graphs of the distribution of the simulated LRT statistics. The program takes advantage of processing accomplished by PROC MIXED and some SAS/IML® functions. I demonstrate the syntax and mechanics of the macro using three examples.
Peter Ott, BC Ministry of Forests, Lands & NRO
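The observed test statistic that the macro bootstraps can be sketched by fitting the null and alternative models with PROC MIXED and differencing the -2 restricted log likelihoods; the variable names below are hypothetical, and the simulation loop that yields the empirical p-value is omitted:

   ods output FitStatistics=fit_alt;
   proc mixed data=growth method=reml;
      class subject;
      model y = time / solution;
      random intercept time / subject=subject;    /* alternative model */
   run;

   ods output FitStatistics=fit_null;
   proc mixed data=growth method=reml;
      class subject;
      model y = time / solution;
      random intercept / subject=subject;         /* null model: no random slope */
   run;

   data lrt;
      set fit_null(rename=(value=m2ll_null) where=(descr=:'-2'));
      set fit_alt (rename=(value=m2ll_alt)  where=(descr=:'-2'));
      lrt = m2ll_null - m2ll_alt;   /* observed likelihood ratio statistic */
   run;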
Paper SAS019-2014:
Case-Level Residual Analysis in the CALIS Procedure
This paper demonstrates the new case-level residuals in the CALIS procedure and how they differ from classic residuals in structural equation modeling (SEM). Residual analysis has a long history in statistical modeling for finding unusual observations in the sample data. However, in SEM, case-level residuals are considerably more difficult to define because of 1) latent variables in the analysis and 2) the multivariate nature of these models. Historically, residual analysis in SEM has been confined to residuals obtained as the difference between the sample and model-implied covariance matrices. Enhancements to the CALIS procedure in SAS/STAT® 12.1 enable users to obtain case-level residuals as well. This enables a more complete residual and influence analysis. Several examples showing mean/covariance residuals and case-level residuals are presented.
Catherine Truxillo, SAS
Paper 1613-2014:
Combining Multiple Date-Ranged Historical Data Sets with Dissimilar Date Ranges into a Single Change History Data Set
This paper describes a method that uses some simple SAS® macros and SQL to merge data sets containing related data with rows that have varying effective date ranges. The data sets are merged into a single data set that represents a serial list of snapshots of the merged data, as of a change in any of the effective dates. While simple conceptually, this type of merge is often problematic when the effective date ranges are not consecutive or consistent, when the ranges overlap, or when there are missing ranges from one or more of the merged data sets. The technique described was used by the Fairfax County Human Resources Department to combine various employee data sets (Employee Name and Personal Data, Personnel Assignment and Job Classification, Personnel Actions, Position-Related data, Pay Plan and Grade, Work Schedule, Organizational Assignment, and so on) from the County's SAP-HCM ERP system into a single Employee Action History/Change Activity file for historical reporting purposes. The technique currently is used to combine fourteen data sets, but is easily expandable by inserting a few lines of code using the existing macros.
James Moon, County of Fairfax, Virginia
Paper 1838-2014:
Creating Formats on the Fly
The Census Bureau conducts the Common Core of Data surveys for the National Center for Education Statistics annually. We have written SAS® programs to automate the database documentation. We try to avoid including hard-coded values in the programs. Thanks to a record layout spreadsheet, the analysts can quickly update the survey metadata outside the SAS programs. This paper explains how SAS can read the record layout spreadsheet to create formats on the fly. The analysts can update the values as changes occur over time without having to worry about writing correct SAS syntax. Behind the scenes, SAS is using dictionary views, macros, ODS OUTPUT, PROC TEMPLATE, PROC FORMAT, the ODS Report Writing Interface, and RTF to create the desired results. This paper uses syntax for SAS® 9.2, written for programmers at the intermediate level.
Suzanne Dorinski, US Census Bureau
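The core of the technique, creating a format from data rather than hard-coded VALUE statements, can be sketched as follows (the code values and labels are hypothetical stand-ins for rows read from the record layout spreadsheet):

   data cntlin;
      retain fmtname 'SCHLVL' type 'C';   /* builds the $SCHLVL. character format */
      length start $ 2 label $ 40;
      input start $ label & $40.;
      datalines;
   01 Elementary school
   02 Middle school
   03 High school
   ;
   run;

   proc format cntlin=cntlin;
   run;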
Paper 2033-2014:
Creating Journal Ready Tables with Special Characters Using ODS LaTeX
LaTeX is a free document creation package that is often used to create journal articles. It provides the capability to create very specific formatting and to write a wide variety of formulas. Using ODS, SAS® can write documents to a LaTeX file, which can then be compiled through LaTeX into PDF files. This paper briefly reviews the basic syntax and options to produce these files. Then, we look at how to create a new tagset to make changes to the standard ODS LaTeX templates to create the non-gridded table appearance that is typically seen in journal articles. We also explore how to write special characters and equations not otherwise available through ODS LaTeX.
Steven Feder, Federal Reserve Board of Governors
Paper 1488-2014:
Custom BI Tools Using SAS® Stored Processes
Business Intelligence platforms provide a bridge between expert data analysts and decision-makers and other end-users. But what do you do when you can identify no system that meets both your needs and your budget? If you are the Consolidated Data Analysis Center in the HHS Office of Inspector General, you use SAS® Enterprise BI Server and the SAS® Stored Process Web Application to build your own. This presentation covers the inception, design, and implementation of the PAYment by Geographic Area (PAYGAR) system, which uses only SAS® Enterprise BI tools, namely the SAS Stored Process Web Application, PROC GMAP, and HTML/JAVA embedded in a DATA step, to create an interactive platform for presenting and exploring data that has a geographic component. In particular, the presentation reviews how we created a system of chained stored processes to enable a user to select the data to be presented, navigate through different geographic levels, and display companion reports related to the current data and geographic selections. It also covers the creation of the HTML front-end that sits over and manages the system. Throughout, the presentation emphasizes the scalability of PAYGAR, which the SAS Stored Process Web Application facilitates.
Scott Hutchison, HHS Office of Inspector General
John Venturini, Piper Enterprise Solutions
D
Paper 1271-2014:
DATA Step Merging Techniques: From Basic to Innovative
Merging or joining data sets is an integral part of the data consolidation process. Within SAS®, there are numerous methods and techniques that can be used to combine two or more data sets. We commonly think that within the DATA step the MERGE statement is the only way to join these data sets, while in fact, the MERGE is only one of numerous techniques available to us to perform this process. Each of these techniques has advantages, and some have disadvantages. The informed programmer needs to have a grasp of each of these techniques if the correct technique is to be applied. This paper covers basic merging concepts and options within the DATA step, as well as a number of techniques that go beyond the traditional MERGE statement. These include fuzzy merges, double SET statements, and the use of key indexing. The discussion will include the relative efficiencies of these techniques, especially when working with large data sets.
Art Carpenter, California Occidental Consultants
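Two of the techniques mentioned above, sketched with hypothetical data sets: a traditional match-merge and a double-SET lookup driven by an index (a minimal illustration, not the paper's full treatment):

   /* Match-merge: both data sets sorted by ID */
   data merged;
      merge customers(in=a) orders(in=b);
      by id;
      if a and b;                      /* keep matches only */
   run;

   /* KEY= lookup: ORDERS is read through a simple index named ID */
   data matched;
      set customers;
      set orders key=id / unique;
      if _iorc_ ne 0 then do;          /* no matching key found */
         _error_ = 0;
         delete;
      end;
   run;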
Paper 1314-2014:
Data Cleaning: Longitudinal Study Cross-Visit Checks
Cross-visit checks are a vital part of data cleaning for longitudinal studies. The nature of longitudinal studies encourages repeatedly collecting the same information. Sometimes, these variables are expected to remain static, go away, increase, or decrease over time. This presentation reviews the naïve and the better approaches to handling one-variable and two-variable consistency checks. For a single-variable check, the better approach features the new ALLCOMB function, introduced in SAS® 9.2. For a two-variable check, the better approach uses the FIRST. automatic variable to flag inconsistencies. This presentation will provide you with the tools to enhance your longitudinal data cleaning process.
Lauren Parlett, Johns Hopkins University
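A small sketch of a cross-visit consistency check using BY-group processing (data set and variable names are hypothetical): flag visits where a value that should never decrease over time does.

   proc sort data=visits;
      by subject visit;
   run;

   data decrease_flags;
      set visits;
      by subject;
      prev_height = lag(height);
      if first.subject then prev_height = .;   /* never compare across subjects */
      if n(prev_height, height) = 2 and height < prev_height then do;
         problem = 'Height decreased since prior visit';
         output;
      end;
   run;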
Paper 1603-2014:
Data Coarsening and Data Swapping Algorithms
With increased concern about privacy and simultaneous pressure to make survey data available, statistical disclosure control (SDC) treatments are performed on survey microdata to reduce disclosure risk prior to dissemination to the public. This situation is all the more problematic in the push to provide data online for immediate user query. Two SDC approaches are data coarsening, which reduces the information collected, and data swapping, which is used to adjust data values. Data coarsening includes recodes, top-codes and variable suppression. Challenges related to creating a SAS® macro for data coarsening include providing flexibility for conducting different coarsening approaches, and keeping track of the changes to the data so that variable and value labels can be assigned correctly. Data swapping includes selecting target records for swapping, finding swapping partners, and swapping data values for the target variables. With the goal of minimizing the impact on resulting estimates, challenges for data swapping are to find swapping partners that are close matches in terms of both unordered categorical and ordered categorical variables. Such swapping partners ensure that enough change is made to the target variables, that data consistency between variables is retained, and that the pool of potential swapping partners is controlled. An example is presented using each algorithm.
Tom Krenzke, Westat
Katie Hubbell, Westat
Mamadou Diallo, Westat
Amita Gopinath, Westat
Sixia Chen, Westat
Paper SAS176-2014:
Data Visualization in Health Care: Optimizing the Utility of Claims Data through Visual Analysis
A revolution is taking place in the U.S. at both the national and state levels in the area of health care transparency. Large amounts of data on the health of communities, the quality of health care providers, and the cost of health care are being collected and made available by both levels of government to a variety of stakeholders. The surfacing of this data and the consumption of it by health care decision makers unfolds a new opportunity to view, explore, and analyze health care data in novel ways. Furthermore, this data provides the health care system an opportunity to advance the achievement of the Triple Aim. Data transparency will bring a sea change to the world of health care by necessitating new ways of communicating information to end users such as payers, providers, researchers, and consumers of health care. This paper examines the information needs of public health care payers such as Medicare and Medicaid, and discusses the convergence of health care and data visualization in creating consumable health insights that will aid in achieving cost containment, quality improvement, and increased accessibility for populations served. Moreover, using claims data and SAS® Visual Analytics, it examines how data visualization can help identify the most critical insights necessary to managing population health. If health care payers can analyze large amounts of claims data effectively, they can improve service and care delivery to their recipients.
Krisa Tailor, SAS
Paper 1302-2014:
Debugging SAS® Code in a Macro
Debugging SAS® code contained in a macro can be frustrating because the SAS error messages refer only to the line in the SAS log where the macro was invoked. This can make it difficult to pinpoint the problem when the macro contains a large amount of SAS code. Using a macro that contains one small DATA step, this paper shows how to use the MPRINT and MFILE options along with the fileref MPRINT to write just the SAS code generated by a macro to a file. The 'de-macroified' SAS code can be easily executed and debugged.
Bruce Gilsen, Federal Reserve Board
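The essence of the technique is only a few lines; the file name and macro below are illustrative placeholders:

   filename mprint 'demacro.sas';    /* macro-generated code will be written here */
   options mprint mfile;

   %macro make_step(dsn);
      data work.copy;
         set &dsn;
      run;
   %mend make_step;

   %make_step(sashelp.class)

   options nomfile;   /* demacro.sas now holds the plain, macro-free SAS code */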
Paper 1728-2014:
Detecting Patterns Using Geo-Temporal Analysis Techniques in Big Data
New, innovative analytical techniques are necessary to extract patterns in big data that have temporal and geo-spatial attributes. An approach to this problem is required when geo-spatial time series data sets, which have billions of rows and the precision of exact latitude and longitude data, make it extremely difficult to locate patterns of interest. The usual temporal bins of years, months, days, hours, and minutes often do not allow the analyst to have control of the precision necessary to find patterns of interest. Geohashing is a string representation of two-dimensional geometric coordinates. Time hashing is a similar representation, which maps time to preserve all temporal aspects of the date and time of the data into a one-dimensional set of data points. Geohashing and time hashing are both forms of a Z-order curve, which maps multidimensional data into single dimensions and preserves the locality of the data points. This paper explores the use of a multidimensional Z-order curve, combining both geohashing and time hashing, that is known as geo-temporal hashing or space-time boxes using SAS®. This technique provides a foundation for reducing the data into bins that can yield new methods for pattern discovery and detection in big data.
Richard La Valley, Leidos
Abraham Usher, Human Geo Group
Don Henderson, Henderson Consulting Services
Paul Dorfman, Dorfman Consulting
Paper 1272-2014:
Developing Web Applications with SAS® Stored Processes
This paper outlines the techniques that I have used with my clients over the last five years to build powerful applications that run from a web browser. The user interface is presented using HTML and JavaScript, which is generated by SAS® Stored Processes. A JavaScript framework called Ext JS is used to build components such as tables and graphs, which have a lot of functionality built in. A range of SAS® macros are used for building HTML and JavaScript, so the generation of the user interface is simplified. This technique has been used to create a medical monitoring system, the UK Census MIS, and a bank's risk management application. I also discuss some techniques involved with integrating a system like this with SAS® Portal, cubes, and web reports.
Philip Mason, Wood Street Consultants
Paper 2030-2014:
Developing the Code to Execute Particle Swarm Optimization in SAS®
Particle swarm optimization is a heuristic global optimization method introduced by James Kennedy and Russell C. Eberhart in 1995. The purpose of this paper is to develop code for particle swarm optimization in SAS® 9.2.
Anurag Srivastava, Decision Quotient
Sangita Kumbharvadiya, Decision Quotient
Paper 1510-2014:
Direct Memory Data Management Using the APP Tools
APP is an unofficial collective abbreviation for the SAS® functions ADDR, PEEK, PEEKC, the CALL POKE routine, and their so-called LONG 64-bit counterparts: the SAS tools designed to directly read from and write to physical memory in the DATA step. APP functions have long been a SAS dark horse. First, the examples of APP usage in SAS documentation amount to a few technical report tidbits intended for mainframe system programming, with nary a hint of how the functions can be used for data management programming. Second, the documentation note on the CALL POKE routine is so intimidating in tone that many potentially receptive folks might decide to avoid the allegedly precarious route altogether. However, little can stand in the way of an inquisitive SAS programmer daring to take a close look, and it turns out that APP functions are very simple and useful tools! They can be used to explore how things really work, to make code more concise, and to implement en masse data movement, and they can often dramatically improve execution efficiency. The author and many other SAS experts (notably Peter Crawford, Koen Vyverman, Richard DeVenezia, Toby Dunn, and the fellow masked by his 'Puddin' Man' sobriquet) have been poking around the SAS APP realm on SAS-L and in their own practices since 1998, occasionally letting the SAS community at large peek at their findings. This opus is an attempt to circumscribe the results in a systematic manner. Welcome to the APP world! You are in for a few glorious surprises.
Paul Dorfman, Dorfman Consulting
E
Paper 1673-2014:
Enhance the ODS HTML Output with JavaScript
For data analysts, one of the most important steps after manipulating and analyzing the data set is to create a report for it. Nowadays, many statistics tables and reports are generated as HTML files that can be easily accessed through the Internet. However, the SAS® Output Delivery System (ODS) HTML output has many limitations on interacting with users. In this paper, we introduce a method to enhance the traditional ODS HTML output by using jQuery (a JavaScript library). A macro was developed to implement this idea. Compared to the standard HTML output, this macro can add sort, pagination, search, and even dynamic drilldown function to the ODS HTML output file.
Yu Fu, Oklahoma State Department of Health
Chao Huang, Oklahoma State University
Paper SAS213-2014:
Ex-Ante Forecast Model Performance with Rolling Simulations
Given a time series data set, you can use automatic time series modeling software to select an appropriate time series model. You can use various statistics to judge how well each candidate model fits the data (in-sample). Likewise, you can use various statistics to select an appropriate model from a list of candidate models (in-sample or out-of-sample or both). Finally, you can use rolling simulations to evaluate ex-ante forecast performance over several forecast origins. This paper demonstrates how you can use SAS® Forecast Server Procedures and SAS® Forecast Studio software to perform the statistical analyses that are related to rolling simulations.
Michael Leonard, SAS
Ashwini Dixit, SAS
Udo Sglavo, SAS
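The flavor of an out-of-sample check can be sketched with PROC ESM and a single holdout; the full rolling-simulation analysis described in the paper uses the SAS Forecast Server procedures and repeats the exercise over many forecast origins:

   proc esm data=sashelp.air back=12 lead=12 outfor=forecasts;
      id date interval=month;
      forecast air / model=addwinters;   /* hold out and forecast the last 12 months */
   run;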
Paper 1764-2014:
Excel with SAS® and Microsoft Excel
SAS® is an outstanding suite of software, but not everyone in the workplace speaks SAS. However, almost everyone speaks Excel. Often, the data you are analyzing, the data you are creating, and the report you are producing take the form of a Microsoft Excel spreadsheet. Every year at SAS® Global Forum, there are SAS and Excel presentations, not just because Excel is so pervasive in the workplace, but because there's always something new to learn (or re-learn)! This paper summarizes and references (and pays homage to!) previous SAS Global Forum presentations, as well as examines some of the latest Excel capabilities with the latest versions of SAS® 9.4 and SAS® Visual Analytics.
Andrew Howell, ANJ Solutions
Paper 1450-2014:
Exploring DATA Step Merges and PROC SQL Joins
Explore the various DATA step merge and PROC SQL join processes. This presentation examines the similarities and differences between merges and joins, and provides examples of effective coding techniques. Attendees examine the objectives and principles behind merges and joins, one-to-one merges (joins), and match-merge (equi-join), as well as the coding constructs associated with inner and outer merges (joins) and PROC SQL set operators.
Kirk Paul Lafler, Software Intelligence Corporation
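The basic equivalence the presentation builds on, sketched with two hypothetical data sets keyed on ID:

   /* DATA step match-merge (inner-join behavior); both inputs sorted by ID */
   data both_ds;
      merge demog(in=a) visits(in=b);
      by id;
      if a and b;
   run;

   /* Equivalent PROC SQL inner join; no pre-sorting required */
   proc sql;
      create table both_sql as
      select d.id, d.age, v.visit_date
      from demog as d
           inner join visits as v
           on d.id = v.id;
   quit;

Note that the two approaches diverge for many-to-many matches, one of the differences the presentation explores.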
Paper 1854-2014:
Exporting Formulas to Microsoft Excel Using the ODS ExcelXP Tagset
SAS® can easily perform calculations and export the result to Microsoft Excel in a report. However, sometimes you need Excel to have a formula or a function in a cell and not just a number. Whether it's for a boss who wants to see a SUM formula in the total cell or for automatically updating reports that can be sent to people who don't use SAS to be completed, exporting formulas to Excel can be very powerful. This paper illustrates how, by using PROC REPORT and PROC PRINT along with the ExcelXP tagset, you can easily export formulas and functions into Excel directly from SAS. The method outlined in this paper requires Base SAS® 9.1 or higher and Excel 2002 or later, and requires a basic understanding of the ExcelXP tagset.
Joseph Skopic, Federal Government
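A minimal sketch of the TAGATTR approach for pushing an Excel formula through the ExcelXP tagset (the file path, sheet name, and R1C1 cell references are illustrative):

   ods tagsets.excelxp file='report.xml' options(sheet_name='Class');

   proc report data=sashelp.class nowd;
      column name height weight total;
      define total / computed 'Height + Weight'
             style(column)={tagattr='formula:=RC[-2]+RC[-1]'};
      compute total;
         total = 0;   /* placeholder value; Excel evaluates the formula */
      endcomp;
   run;

   ods tagsets.excelxp close;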
Paper 1753-2014:
Extracting the Needles from the Haystacks
When you want to know the details about a small subset of a much larger data set, it can take a long time to select the records you need. This paper shows you how to create a user-defined SAS® format to pull only the observations that you want out of a big data source. Even when selecting a million records out of data sets that can have more than 100 million records, this method is much quicker than either a PROC SQL join or a SAS merge.
Sara Boltman, Butterfly Projects
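The idea can be sketched as follows (hypothetical data set and variable names, and assuming the key is a character variable): build a format from the list of wanted keys, then let a WHERE clause apply it to the big table.

   /* 1. Wanted IDs map to 'Y'; everything else maps to 'N' via the OTHER range */
   data keyfmt;
      retain fmtname '$WANTED' type 'C' label 'Y';
      set wanted_ids(rename=(id=start)) end=last;
      output;
      if last then do;
         hlo   = 'O';
         label = 'N';
         output;
      end;
   run;

   proc format cntlin=keyfmt;
   run;

   /* 2. Pull only the needles from the haystack */
   data subset;
      set bigdata;
      where put(id, $wanted.) = 'Y';
   run;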
G
Paper 1765-2014:
Geo Reporting: Integrating ArcGIS Maps in SAS® Reports
This paper shares our experience integrating two leading data analytics and Geographic Information Systems (GIS) software products, SAS® and ArcGIS, to provide integrated reporting capabilities. SAS is a powerful tool for data manipulation and statistical analysis. ArcGIS is a powerful tool for analyzing data spatially and presenting complex cartographic representations. Combining statistical data analytics and GIS provides increased insight into data and allows for new and creative ways of visualizing the results. Although products exist to facilitate the sharing of data between SAS and ArcGIS, there are no ready-made solutions for integrating the output of these two tools in a dynamic and automated way. Our approach leverages the individual strengths of SAS and ArcGIS, as well as the report delivery infrastructure of SAS® Information Delivery Portal.
Nathan Clausen, CACI
Aaron House, CACI
Paper SAS2203-2014:
Getting Started with Mixed Models
This introductory presentation is intended for an audience new to mixed models who wants to get an overview of this useful class of models. Learn about mixed models as an extension of ordinary regression models, and see several examples of mixed models in social, agricultural, and pharmaceutical research.
Catherine Truxillo, SAS
Paper SAS2205-2014:
Getting Started with Survey Procedures
Analyzing data from a complex probability survey involves weighting observations so that inferences are correct. This introductory presentation is intended for an audience new to analyzing survey data. Learn the essentials of using the SURVEYxx procedures in SAS/STAT®.
Chris Daman, SAS
Bob Lucas, SAS
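The essential pattern, sketched with hypothetical design variables: declare the strata, clusters, and weights so that the estimates and standard errors reflect the sample design.

   proc surveymeans data=health_survey mean stderr;
      strata design_stratum;      /* design strata          */
      cluster psu;                /* primary sampling units */
      weight sample_weight;       /* analysis weights       */
      var systolic_bp;
   run;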
Paper 1275-2014:
Got Randomness? SAS® for Mixed and Generalized Linear Mixed Models
It is not uncommon to find models with random components like location, clinic, teacher, etc., not just the single error term we think of in ordinary regression. This paper uses several examples to illustrate the underlying ideas. In addition, the response variable might be Poisson or binary rather than normal, thus taking us into the realm of generalized linear mixed models. These, too, will be illustrated with examples.
David Dickey, NC State University
Paper 1267-2014:
Graphing Made Easy with ODS Graphics Procedures
Beginning with SAS® 9.2, ODS Graphics introduces a whole new way of generating graphs using SAS. With just a few lines of code, you can create a wide variety of high-quality graphs. This paper covers the three basic ODS Graphics procedures: SGPLOT, SGPANEL, and SGSCATTER. SGPLOT produces single-celled graphs. SGPANEL produces multi-celled graphs that share common axes. SGSCATTER produces multi-celled graphs that might use different axes. This paper shows how to use each of these procedures in order to produce different types of graphs, how to send your graphs to different ODS destinations, how to access individual graphs, and how to specify properties of graphs, such as format, name, height, and width.
Lora Delwiche, University of California, Davis
Susan Slaughter, Avocet Solutions
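A few lines really are all it takes; for example, a grouped scatter plot with fitted regression lines from the SASHELP.CLASS sample data:

   ods graphics / width=5in height=3.5in;

   proc sgplot data=sashelp.class;
      reg x=height y=weight / group=sex;   /* points plus a fit line per group */
      xaxis label='Height (in)';
      yaxis label='Weight (lb)';
   run;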
H
Paper 2581-2014:
How SAS® and Teradata® Dramatically Improve Data Analysis at CMS
SAS offers advanced analytics while Teradata has developed one of the fastest databases known to mankind. The Office of the Actuary at CMS uses SAS and Teradata to perform many important tasks such as setting Medicare Advantage rates for providers, forecasting the cost of closing the Part-D Donut Hole, and estimating the future cost of healthcare in America. The major advantage of the Teradata platform is the speed at which data can be summarized as compared to legacy systems (billions of rows of data can be summarized in seconds where it once took days), and the SAS system provides many options for accessing and analyzing this data, whether you're on a mainframe, Windows®, or UNIX, through SAS® Enterprise Guide or Business Intelligence.
Richard Andrews, Centers for Medicare and Medicaid Services
I
Paper SAS369-2014:
I Want the Latest and Greatest! The Top Five Things You Need to Know about Migration
Determining what, when, and how to migrate SAS® software from one major version to the next is a common challenge. SAS provides documentation and tools to help make the assessment, planning, and eventual deployment go smoothly. We describe some of the keys to making your migration a success, including the effective use of the SAS® Migration Utility, both in the analysis mode and the execution mode. This utility is responsible for analyzing each machine in an existing environment, surfacing product-specific migration information, and creating packages to migrate existing product configurations to later versions. We show how it can be used to simplify each step of the migration process, including recent enhancements to flag product version compatibility and incompatibility.
Josh Hames, SAS
Gerry Nelson, SAS
Paper 1828-2014:
Integrated Big Data: Hadoop + DBMS + Discovery for SAS® High-Performance Analytics
SAS® High-Performance Analytics is a significant step forward in the area of high-speed, analytic processing in a scalable clustered environment. However, Big Data problems generally come with data from lots of data sources, at varying levels of maturity. Teradata's innovative Unified Data Architecture (UDA) represents a significant improvement in the way that large companies can think about Enterprise Data Management, including the Teradata Database, Hortonworks Hadoop, and Aster Data Discovery platform in a seamless integrated platform. Together, the two platforms provide business users, analysts, and data scientists with the ideally suited data management platforms, targeted specifically to their analytic needs, based upon analytic use cases, managed in a single integrated enterprise data management environment. The paper will focus on how several companies today are using Teradata's Integrated Hardware and Software UDA Platform to manage a single enterprise analytic environment, fight the ongoing proliferation of analytic data marts, and speed their operational analytic processes.
John Cunningham, Teradata Corporation
Paper 2064-2014:
Integrating and Using Hierarchical Vocabularies in SAS®
Controlled vocabularies define a common set of concepts that retain their meaning across contexts, supporting consistent use of terms to annotate, integrate, retrieve, and interpret information. Controlled vocabularies are large hierarchical structures that cannot be represented using typical SAS® practices (e.g., SAS format statements and hash objects). This paper compares and contrasts three models for representing hierarchical structures using SAS data sets: adjacency list, path enumeration, and nested set (Celko, 2004; Mackey, 2002). Specific controlled vocabularies include a university organizational structure and several biological vocabularies (MeSH, NCBI Taxonomy, and GO). The paper presents data models and SAS code for populating tables and performing queries. The paper concludes with a discussion of implications for data warehouse implementation and future work related to efficiency of update and delete operations.
Glenn Colby, University of Colorado Boulder
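For example, an adjacency-list representation stores each term with a pointer to its parent, and a PROC SQL self-join returns the immediate children of every parent (the vocabulary rows below are hypothetical):

   data vocab;                     /* adjacency list: one row per term */
      length term_id parent_id 8 name $ 40;
      input term_id parent_id name & $40.;
      datalines;
   1 . Organism
   2 1 Eukaryote
   3 2 Fungi
   4 2 Metazoa
   ;
   run;

   proc sql;
      select p.name as parent, c.name as child
      from vocab as p
           inner join vocab as c
           on c.parent_id = p.term_id
      order by parent, child;
   quit;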
L
Paper 1702-2014:
Let SAS® Handle Your Job While You Are Not at Work!
Report automation and scheduling are very hot topics in many industries. They confer many advantages, including reduced workload, elimination of repetitive tasks, generation of accurate results, and better performance. This paper illustrates how to design an appropriate program to automate and schedule reports in SAS® 9.1 and SAS® Enterprise Guide® 5.1 using a SAS® server as well as the Windows Scheduler. The automation part covers formatting Microsoft Excel tables using XML or VBA coding or any other formats, and conditional automatic e-mailing with file attachments. We systematically walk through each step with a clear flow diagram from the data source to the final destination. We also discuss details of server-side and PC-side schedulers and how these schedulers involve invoking batch programs.
Anjan Matlapudi, AmerihealthCaritas
M
Paper 1309-2014:
Make It Possible: Create Customized Graphs with Graph Template Language
Effective graphs are indispensable for modern statistical analysis. They reveal tendencies that are not readily apparent in simple tables and add visual clarity to reports. My client is a big graph fan; he always shows me a lot of high-quality and complex sample graphs that were created by other software and asks me, "Can SAS® duplicate these outputs?" Often, by leveraging the capabilities of the ODS Graph Template Language and the SGRENDER procedure, the answer is "Yes." Graph Template Language offers SAS users a more direct approach to customizing the output and to overlaying graphs at different levels. This paper uses cases drawn from a real work situation to demonstrate how to get the seemingly unattainable results with the power of Graph Template Language: utilizing bubble plots as your distribution density bars, creating refreshing-looking linear regression graphics with the slope information in the legend, and overlaying different plots to create sophisticated analytical bottleneck test output.
Wen Song, ICF International
Ge Wu, Johns Hopkins University
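The basic GTL workflow the paper builds on, define a STATGRAPH template with PROC TEMPLATE and render it with PROC SGRENDER, is sketched below; this is a minimal overlay, not one of the paper's customized graphs:

   proc template;
      define statgraph overlay_demo;
         begingraph;
            entrytitle 'Weight by Height';
            layout overlay;
               scatterplot x=height y=weight;
               regressionplot x=height y=weight / lineattrs=(color=red);
            endlayout;
         endgraph;
      end;
   run;

   proc sgrender data=sashelp.class template=overlay_demo;
   run;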
Paper 1769-2014:
Making It Happen: A Novel Way to Combine, Construct, Customize, and Implement Your Big Data with a SAS® Data Discovery, Analytics, and Fraud Framework in Medicare
A common complaint from users working on identifying fraud and abuse in Medicare is that teams focus on operational applications, static reports, and high-level outliers. But, when faced with the need to constantly evaluate changing Medicare provider and beneficiary or enrollee dynamics, users are clamoring for more dynamic and accurate detection approaches. Providing these organizations with a data discovery and predictive analytics framework that leverages Hadoop and other big data approaches, while providing a clear path for teams to make more fact-based decisions more quickly, is very important in pre- and post-fraud and abuse analysis. Organizations that do pursue a reusable, services-based data discovery and analytics framework and architecture approach enjoy greater success in supporting data management, reporting, and analytics demands. They can quickly turn models into prioritized alerts and avoid improper or fraudulent payments. A successful framework should enable organizations to come up with efficient fraud, waste, and abuse models to address complex schemes; identify fraud, waste, and abuse vulnerabilities; and shorten triage efforts using a variety of data sourced from big data platforms like Hadoop and other relational database management systems. This paper talks about the data management, data discovery, predictive analytics, and social network analysis capabilities that are included in the SAS fraud framework and how a unified approach can significantly reduce the lifecycle of building and deploying fraud models. We hope this paper will provide IT leaders with a clear path for resolving issues from the simple to the incredibly complex, through a measured and scalable approach for delivering value for fraud, waste, and abuse models by providing deep insights to support evidence-based investigations.
Vivek Sethunatesan, Northrop Grumman Corp
Paper 1485-2014:
Measures of Fit for Logistic Regression
One of the most common questions about logistic regression is, "How do I know if my model fits the data?" There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-squared) and goodness-of-fit tests (like the Pearson chi-square). This presentation looks first at R-squared measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.
Paul Allison, University of Pennsylvania
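For reference, both kinds of measures can be requested or derived from a single PROC LOGISTIC run (hypothetical variable names): RSQUARE prints the generalized R-square statistics and LACKFIT requests the Hosmer-Lemeshow test, while the McFadden and Tjur measures discussed in the presentation are computed from the reported log likelihoods and predicted probabilities.

   proc logistic data=study;
      model response(event='1') = age dose severity / rsquare lackfit;
      output out=preds p=phat;   /* predicted probabilities, e.g., for Tjur's R-square */
   run;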
Paper 1300-2014:
Model Variable Selection Using Bootstrap Decision Tree
Bootstrapped Decision Tree is a variable selection method used to identify and eliminate unintelligent variables from a large number of initial candidate variables. Candidates for subsequent modeling are identified by selecting variables consistently appearing at the top of decision trees created using a random sample of all possible modeling variables. The technique is best used to reduce hundreds of potential fields to a short list of 30-50 fields to be used in developing a model. This method for variable selection has recently become available in JMP® under the name Bootstrap Forest; this paper presents an implementation in Base SAS® 9. The method does accept but does not require a specific outcome to be modeled and will therefore work for nearly any type of model, including segmentation, MCMC, and multiple discrete choice, in addition to standard logistic regression. Keywords: Bootstrapped Decision Tree, Variable Selection
David Corliss, Magnify Analytic Solutions
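The paper's Base SAS implementation is not reproduced here; purely as a sketch of the general shape of the technique (all data set and macro names are assumptions, and the tree-fitting step is left as a placeholder for whatever tree procedure is available at your site), the bootstrap resampling can be driven by PROC SURVEYSELECT inside a macro loop.

   %macro boot_tree_select(reps=50);
      %do i = 1 %to &reps;
         /* draw a bootstrap sample (sampling with replacement, same size as the input) */
         proc surveyselect data=work.candidates out=work.boot
              method=urs samprate=1 outhits seed=%eval(&i + 1000) noprint;
         run;

         /* fit a decision tree to WORK.BOOT with whatever tree procedure is available at  */
         /* your site, capture its variable-importance ranking, and append the top-ranked  */
         /* variables to a running tally such as WORK.VAR_TALLY                            */
      %end;

      /* variables that rise to the top of the trees most often become the short list */
      proc freq data=work.var_tally order=freq;
         tables variable;
      run;
   %mend boot_tree_select;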
Paper 1528-2014:
Multivariate Ratio and Regression Estimators
This paper presents the %MRE macro for computing multivariate ratio estimates. We also use PROC REG to compute multivariate regression estimates and show that the regression estimates are superior to the ratio estimates.
Alan Silva, Universidade de Brasilia
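The %MRE macro itself belongs to the paper; as a rough illustration of the two estimators being compared (data set, variable names, and the population total are assumptions), a ratio estimate scales a known auxiliary total by the sample ratio, while the regression estimate comes from a fitted PROC REG model.

   /* sample means of the study variable y and the auxiliary variable x */
   proc means data=sample noprint;
      var y x;
      output out=mns mean=ybar xbar;
   run;

   data ratio_estimate;
      set mns;
      x_pop_total = 500000;                       /* known population total of x (assumed) */
      y_ratio_est = (ybar / xbar) * x_pop_total;  /* ratio estimate of the total of y */
   run;

   /* the competing regression estimate uses the slope fitted by PROC REG */
   proc reg data=sample outest=est noprint;
      model y = x;
   run;
   quit;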
N
Paper 1811-2014:
NIHB Pharmacy Surveillance System
This presentation features implementation leads from SAS® Professional Services and Health Canada's Non-Insured Health Benefits (NIHB) program on a joint implementation of SAS® Fraud Framework for Health Care. The presentation walks through the fast-paced implementation of NIHB's Pharmacy Surveillance System, which guards Canadian taxpayers from undue costs and protects the safety of NIHB clients. This presentation is a blend of project management and technical material, and presents both the client (NIHB) and consultant (SAS) perspectives throughout the story. The presentation converges on several core principles needed to successfully deliver analytical solutions.
Jeffrey Menzies, Health Canada
Ian Ghent, SAS
O
Paper 1751-2014:
Ordering Columns in a SAS® Data Set: Should You Really RETAIN That?
When viewing and working with SAS® data sets, especially wide ones, it's often instinctive to rearrange the variables (columns) into some intuitive order. The RETAIN statement is one of the most commonly cited methods used for ordering variables. Though RETAIN can perform this task, its use as an ordering clause can cause a host of easily missed problems due to its intended function of retaining values across DATA step iterations. This risk is especially great for the more novice SAS programmer. Instead, two equally effective and less risky ways to order data set variables are recommended, namely, the FORMAT and SQL SELECT statements.
Andrew Clapson, Statistics Canada
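A minimal sketch of the two recommended alternatives (the data set and variable names are hypothetical): listing the variables before the SET statement, or selecting them in the desired order with PROC SQL.

   data want;
      /* listing the variables first (here on a FORMAT statement) fixes the column order;      */
      /* note that a FORMAT statement with no format also clears any stored formats, so        */
      /* specify formats here if you need to keep them                                         */
      format id visit_date amount;
      set have;
   run;

   proc sql;
      create table want2 as
      select id, visit_date, amount
      from have;
   quit;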
P
Paper 1730-2014:
PROC TABULATE: Extending This Powerful Tool Beyond Its Limitations
PROC TABULATE is a powerful tool for creating tabular summary reports. Its advantages, over PROC REPORT, are that it requires less code, allows for more convenient table construction, and uses syntax that makes it easier to modify a table's structure. However, its inability to compute the sum, difference, product, and ratio of column sums has hindered its use in many circumstances. This paper illustrates and discusses some creative approaches and methods for overcoming these limitations, enabling users to produce needed reports and still enjoy the simplicity and convenience of PROC TABULATE. These methods and skills can have prominent applications in a variety of business intelligence and analytics fields.
Justin Jia, Canadian Imperial Bank of Commerce (CIBC)
Amanda Lin, Bell Canada
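The paper's workarounds are not reproduced here; as a baseline for readers new to the procedure, a basic PROC TABULATE summary (using the SASHELP.PRDSALE sample data shipped with SAS) looks like this, and it is exactly the derived columns such as differences or ratios of these sums that require the creative approaches described above.

   proc tabulate data=sashelp.prdsale;
      class country year;
      var actual predict;
      table country all, year*(actual predict)*sum;
   run;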
Paper 1240-2014:
Powerful and Hard-to-find PROC SQL Features
The SQL procedure contains many powerful and elegant language features for intermediate and advanced SQL users. This presentation discusses topics that will help SAS® users unlock the many powerful features, options, and other gems found in the SQL universe. Topics include CASE logic; a sampling of summary (statistical) functions; dictionary tables; PROC SQL and the SAS macro language interface; joins and join algorithms; PROC SQL statement options _METHOD, MAGIC=101, MAGIC=102, and MAGIC=103; and key performance (optimization) issues.
Kirk Paul Lafler, Software Intelligence Corporation
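A small taste of a few of the features mentioned, using the SASHELP.CLASS sample table: CASE logic, a dictionary table query, and the _METHOD statement option.

   proc sql _method;                      /* _METHOD writes the chosen query plan to the log */
      select name,
             case when age >= 13 then 'Teen'
                  else 'Child'
             end as age_group
         from sashelp.class;

      /* dictionary tables describe the SAS session itself, such as every column in a library */
      select memname, name, type
         from dictionary.columns
         where libname = 'SASHELP' and memname = 'CLASS';
   quit;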
Q
Paper 1868-2014:
Queues for Newbies: How to Speak LSF in a SAS® World
Can you juggle? Maybe. Can you shuffle a deck of cards? Probably. Can you do both at the same time? Welcome to the world of SAS® and LSF! Very few SAS Administrators start out learning LSF at the same time they learn SAS; most already know SAS, possibly starting out as a programmer or analyst, but now have to step up to an enterprise platform with shared resources. The biggest challenge on an enterprise platform? How to share! How do you maximize the utilization of a SAS platform, yet still ensure everyone gets their fair share? This presentation boils down the 2,000+ pages of LSF documentation to provide an introduction to various LSF concepts: hosts, clusters, nodes, queues, first-come-first-served scheduling, fairshare scheduling, and various configuration settings such as UJOB_LIMIT and PJOB_LIMIT. It also offers some insight on where to configure all of these settings: which ones are set up by the installation process, and which ones can be changed by the SAS or LSF administrator. This session is definitely NOT for experts. It is for those about to step into an enterprise deployment of SAS who want to understand how the SAS server sessions they know so well can run on a shared platform.
Andrew Howell, ANJ Solutions
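As a rough sketch of where settings such as UJOB_LIMIT and PJOB_LIMIT live (the queue name and all values below are illustrative assumptions, not recommendations), queue definitions sit in the lsb.queues configuration file:

   # lsb.queues (fragment) -- queue definitions; values here are illustrative only
   Begin Queue
   QUEUE_NAME   = normal
   PRIORITY     = 30
   UJOB_LIMIT   = 5          # maximum job slots any one user may occupy in this queue
   PJOB_LIMIT   = 2          # maximum job slots per processor
   DESCRIPTION  = Default queue for SAS workspace and batch jobs
   End Queue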
R
Paper 1489-2014:
Reporting Healthcare Data: Understanding Rates and Adjustments
In healthcare, we often express our analytics results as being "adjusted". For example, you might have read a study in which the authors reported the data as age-adjusted or risk-adjusted. The concept of adjustment is widely used in program evaluation, in comparing quality indicators across providers and systems, in forecasting incidence rates, and in cost-effectiveness research. In order to make reasonable comparisons across time, place, or population, we need to account for small sample sizes and case-mix variation; in other words, we need to level the playing field and account for differences in health status and for uniqueness in a given population. If you are new to healthcare, what it really means to adjust the data in order to make comparisons might not be obvious. In this paper, we explore the methods by which we control for potentially confounding variables in our data. We do so through a series of examples from the healthcare literature in both primary care and health insurance. In this survey of methods, we discuss the concept of rates and how they can be adjusted for demographic strata (such as age, gender, and race), as well as for health risk factors such as case mix.
Greg Nelson, ThotWave
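As a small worked illustration of direct age adjustment (the stratum rates and standard-population weights below are made up), the adjusted rate is simply the stratum-specific rates weighted by a standard population:

   data strata;
      input age_group $ rate std_weight;    /* stratum rate per 1,000 and standard population weight */
      weighted_rate = rate * std_weight;
      datalines;
   0-17   2.1 0.24
   18-44  3.8 0.36
   45-64  9.5 0.26
   65+   21.0 0.14
   ;
   run;

   proc means data=strata sum noprint;
      var weighted_rate;
      output out=adjusted sum=age_adjusted_rate;  /* sum of weighted stratum rates = adjusted rate */
   run;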
Paper 1846-2014:
Revealing Human Mobility Behavior and Trips Prediction Based on Mobile Data Records
This paper reveals human mobility behavior in the metropolitan area of Rio de Janeiro, Brazil. The basis for this study is mobile phone data provided by one of the largest mobile carriers in Brazil. Mobile phone data comprise a reasonable variety of information, including data about the time and location of call activity throughout urban areas. This information can be used to build users' trajectories over time, describing the major characteristics of urban mobility within the city. A variety of distribution analyses is presented in this paper, aiming to clearly describe the most relevant characteristics of overall mobility in the metropolitan area of Rio de Janeiro. In addition, methods from physics for describing trip trends, such as gravity and radiation models, were computed and compared in terms of the granularity of the geographic scales and also in relation to traditional data mining approaches such as linear regression. A brief comparison of performance in predicting the number of trips between pairs of locations is presented at the end.
Carlos Andre Reis Pinheiro, KU Leuven
Paper 1783-2014:
Revealing Unwarranted Access to Sensitive Data: A Scenario-based Approach
The project focuses on using analytics to reveal unwarranted use of access to medical records, that is, employees in health organizations who access information about neighbours, friends, celebrities, etc., without a sound reason to do so. The method is based on the natural assumption that the vast majority of lookups are legitimate; lookups that differ from a statistically defined normal behavior will be subject to manual investigation. The work was carried out in collaboration between SAS Institute Norway and the largest Norwegian hospital, Oslo University Hospital (OUS), and was aimed at establishing whether the method is suitable for unveiling unwarranted lookups in medical records. A number of so-called scenarios are used to indicate adverse behaviour, each responsible for looking at one particular aspect of journal access data. For instance, one scenario determines the timeliness of a lookup relative to the patient's admission history; another judges whether the medical competency of the employee is relevant to the situation of the patient at the time of the lookup. We have so far designed and developed a library of around 20 scenarios that together are used in weighted combination to render a final judgment of the appropriateness of the lookup. The approach has proven highly successful, and a further development of these ideas is under way, the aim of which is to establish a joint Norwegian solution to the problem of unwarranted access. Furthermore, we believe that the approach and the framework may be utilised in many other industries where sensitive data are processed, such as financial, police, tax, and social services. In this paper, the method is outlined, as well as the results of its application to data from OUS.
Heidi Thorstensen, Oslo University Hospital
Torulf Mollestad, SAS
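The hospital's actual scenario library is not public; purely as a hypothetical sketch of how weighted scenario scores might be combined into a single judgment per lookup (scenario names, weights, and the threshold are all invented):

   data lookups_scored;
      set lookups_with_scenarios;   /* one row per journal lookup, one 0-1 score per scenario */
      /* weighted combination of the individual scenario scores */
      total_score = 0.30*scn_no_treatment_relation
                  + 0.25*scn_timeliness
                  + 0.20*scn_competency_mismatch
                  + 0.25*scn_patient_relation;
      flag_for_review = (total_score > 0.6);   /* review threshold is illustrative only */
   run;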
S
Paper 1262-2014:
SAS® Installations: So you want to install SAS?
This discussion uses SAS® Office Analytics as an example to demonstrate the importance of preparing for the SAS® installation. There are many nuances as well as requirements that need to be addressed before you do an installation. These requirements are basically similar, yet they differ according to the target installation operating system. In other words, there are some differences in preparation routines for Windows and *Nix flavors. Our discussion focuses on these three topics: 1. Pre-installation considerations such as sizing, storage, proper credentials, and third-party requirements; 2. Installation steps and requirements; and 3. Post-installation configuration. In addition to preparation, this paper also discusses potential issues and pitfalls to watch out for, as well as best practices.
Rafi Sheikh, Analytiks International, Inc.
Paper 1569-2014:
SAS® for Bayesian Mediation Analysis
Statistical mediation analysis is common in business, social sciences, epidemiology, and related fields because it explains how and why two variables are related. For example, mediation analysis is used to investigate how product presentation affects liking the product, which then affects the purchase of the product. Mediation analysis evaluates the mechanism by which a health intervention changes norms that then change health behavior. Mediation analysis methods are an active area of research. Some recent research in statistical mediation analysis focuses on extracting accurate information from small samples by using Bayesian methods. The Bayesian framework offers an intuitive solution to mediation analysis with small samples; namely, incorporating prior information into the analysis when there is existing knowledge about the expected magnitude of mediation effects. Using diffuse prior distributions with no prior knowledge allows researchers to reason in terms of probability rather than in terms of (or in addition to) statistical power. Using SAS® PROC MCMC, researchers can choose one of two simple and effective methods to incorporate their prior knowledge into the statistical analysis, and can obtain the posterior probabilities for quantities of interest such as the mediated effect. This project presents four examples of using PROC MCMC to analyze a single mediator model with real data using: (1) diffuse prior information for each regression coefficient in the model, (2) informative prior distributions for each regression coefficient, (3) diffuse prior distribution for the covariance matrix of variables in the model, and (4) informative prior distribution for the covariance matrix.
Milica Miočević, Arizona State University
David MacKinnon, Arizona State University
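A minimal sketch of a single-mediator model in PROC MCMC with diffuse priors, along the lines of example (1) above (the data set and the variable names x, m, and y are assumptions):

   proc mcmc data=med seed=1234 nbi=5000 nmc=50000
             outpost=post monitor=(a b cprime ab);
      parms a0 0 a 0 b0 0 b 0 cprime 0;
      parms s2m 1 s2y 1;
      prior a0 a b0 b cprime ~ normal(0, var=1e6);   /* diffuse priors on the regression coefficients */
      prior s2m s2y ~ igamma(0.01, scale=0.01);
      mu_m = a0 + a*x;                               /* mediator regression */
      mu_y = b0 + cprime*x + b*m;                    /* outcome regression */
      ab = a*b;                                      /* the mediated (indirect) effect */
      model m ~ normal(mu_m, var=s2m);
      model y ~ normal(mu_y, var=s2y);
   run;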
Paper 1760-2014:
Scenarios Where Utilizing a Spline Model in Developing a Regression Model Is Appropriate
Linear regression has been a widely used approach in the social and medical sciences to model the association between a continuous outcome and explanatory variables. Assessing the model assumptions, such as linearity, normality, and equal variance, is a critical step for choosing the best regression model. If any of the assumptions are violated, one can apply different strategies to improve the regression model, such as transforming the variables or using a spline model. SAS® is commonly used to assess and validate the postulated model, and SAS® 9.3 provides many new features, such as ODS Statistical Graphics, that increase the efficiency and flexibility of developing and analyzing the regression model. This paper aims to demonstrate the necessary steps for finding the best linear regression model in SAS 9.3 in different scenarios where variable transformation and the implementation of a spline model are both applicable. A simulated data set is used to demonstrate the model development steps. Moreover, the critical parameters to consider when evaluating model performance are also discussed, to achieve accuracy and efficiency.
Ning Huang, University of Southern California
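As one hedged illustration of fitting a spline term when linearity fails (the data set and variable names are assumptions), the EFFECT statement in PROC GLMSELECT builds a spline basis directly, with ODS Graphics showing the fit:

   ods graphics on;
   proc glmselect data=sim plots=all;
      effect spl = spline(x / knotmethod=percentiles(5) degree=3);  /* cubic spline basis for x */
      model y = spl / selection=none;
   run;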
Paper 1318-2014:
Secure SAS® OLAP Cubes with Top-Secret Permissions
SAS® OLAP technology is used to organize and present summarized data for business intelligence applications. It features flexible options for creating and storing aggregations to improve performance and brings a powerful multi-dimensional approach to querying data. This paper focuses on managing security features available to OLAP cubes through the combination of SAS metadata and MDX logic.
Stephen Overton, Overton Technologies, LLC
Paper 1800-2014:
Seven Steps to a SAS® Enterprise BI Proof-of-Concept
The manager of the Purchasing Department is considering contracting with your team for a new SAS® Enterprise BI application. He's already met with SAS® and seen the sales pitch, and he is very interested. But the manager is a tightwad and not sure about spending the money. Also, he wants his team to be the primary developers for this new application. Before investing his money in training, programming, and support, he would like a proof-of-concept. This paper walks you through the seven steps to create a SAS Enterprise BI POC project: 1) Develop a kick-off meeting that includes a full demo of the SAS Enterprise BI tools. 2) Set up your UNIX file systems and security. 3) Set up your SAS metadata ACTs, users, groups, folders, and libraries. 4) Make sure the necessary SAS client tools are installed on the developers' machines. 5) Hold a SAS Enterprise BI workshop to introduce them to the basics, including SAS® Enterprise Guide®, SAS® Stored Processes, SAS® Information Maps, SAS® Web Report Studio, SAS® Information Delivery Portal, and SAS® Add-In for Microsoft Office, along with supporting documentation. 6) Work with them to develop a simple project, one that highlights the benefits of SAS Enterprise BI and shows several methods for achieving the desired results. 7) Last but not least, follow up! Remember, your goal is not to launch a full-blown application. Instead, we'll strive toward helping them see the potential in your organization for applying this methodology.
Sheryl Weise, Wells Fargo
Paper 1748-2014:
Simulation of MapReduce with the Hash-of-Hashes Technique
Big data is all the rage these days, with the proliferation of data-accumulating electronic gadgets and instrumentation. At the heart of big data analytics is the MapReduce programming model. As a framework for distributed computing, MapReduce uses a divide-and-conquer approach to allow large-scale parallel processing of massive data. As the name suggests, the model consists of a Map function, which first splits data into key-value pairs, and a Reduce function, which then carries out the final processing of the mapper outputs. It is not hard to see how these functions can be simulated with the SAS® hash objects technique, and in reality, implemented in the new SAS® DS2 language. This paper demonstrates how hash object programming can handle data in a MapReduce fashion and shows some potential applications in physics, chemistry, biology, and finance.
Joseph Hinson, Accenture Life Sciences
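A toy, single-hash simplification of the idea (not the paper's code, and the MAPPED data set is an assumption): a word-count job where the "map" output is one observation per word occurrence and the "reduce" step aggregates counts inside a SAS hash object.

   data _null_;
      length word $32 count 8;
      declare hash h();                    /* the "reduce" side: one entry per distinct key */
      h.defineKey('word');
      h.defineData('word', 'count');
      h.defineDone();

      do until (done);
         set mapped end=done;              /* "map" output: one observation per word occurrence */
         if h.find() = 0 then count = count + 1;
         else count = 1;
         h.replace();                      /* add or update the key-value pair */
      end;

      h.output(dataset: 'word_counts');    /* emit the reduced results */
      stop;
   run;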
Paper SAS349-2014:
Summarizing and Highlighting Differences in Senate Race Data Using SAS® Sentiment Analysis
Contrasting two sets of textual data points out important differences. For example, consider social media data that have been collected on the race between incumbent Kay Hagan and challenger Thom Tillis in the 2014 election for the seat of US Senator from North Carolina. People talk about the candidates in different terms for different topics, and you can extract the words and phrases that are used more in messages about one candidate than about the other. By using SAS® Sentiment Analysis on the extracted information, you can discern not only the most important topics and sentiments for each candidate, but also the most prominent and distinguishing terms that are used in the discussion. Find out if Republicans and Democrats speak different languages!
Hilke Reckman, SAS
Michael Wallis, SAS
Richard Crowell, SAS
Linnea Micciulla, SAS
Cheyanne Baird, SAS
T
Paper 1483-2014:
The Armchair Quarterback: Writing SAS® Code for the Perfect Pivot (Table, That Is)
'Can I have that in Excel?' This is a request that makes many of us shudder. Now your boss has discovered Microsoft Excel pivot tables. Unfortunately, he has not discovered how to make them. So you get to extract the data, massage the data, put the data into Excel, and then spend hours rebuilding pivot tables every time the corporate data is refreshed. In this workshop, you learn to be the armchair quarterback and build pivot tables without leaving the comfort of your SAS® environment. In this workshop, you learn the basics of Excel pivot tables and, through a series of exercises, you learn how to augment basic pivot tables first in Excel, and then using SAS. No prior knowledge of Excel pivot tables is required.
Peter Eberhardt, Fernwood Consulting Group Inc.
Paper 1269-2014:
The Many Ways of Creating Dashboards Using SAS®
For decades, SAS® has been the cornerstone of many organizations for business reporting. In more recent times, the ability to quickly determine the performance of an organization through the use of dashboards has become a requirement. Different ways of providing dashboard capabilities are discussed in this paper: using out-of-the-box solutions such as SAS® Visual Analytics and SAS® BI Dashboard, through to alternative solutions using SAS® Stored Processes, batch processes, and SAS® Integration Technologies. Extending the available indicators is also discussed, using Graph Template Language and KPI indicators provided with Base SAS®, as well as alternatives such as Google Charts and Flash objects. Real-world field experience, problem areas, solutions, and tips are shared, along with live examples of some of the different methods.
Mark Bodt, The Knowledge Warehouse (Knoware)
Paper 1557-2014:
The Query Builder: The Swiss Army Knife of SAS® Enterprise Guide®
The SAS® Enterprise Guide® Query Builder is one of the most powerful components of the software. It enables a user to bring in data, join, drop and add columns, compute new columns, sort, filter data, leverage the advanced expression builder, change column attributes, and more! This presentation provides an overview of the major features of this powerful tool and how to leverage it every day.
Jennifer First-Kluge, Systems Seminar Consultants
Steven First, Systems Seminar Consultants
Paper 1627-2014:
The RAKE-TRIM Algorithm: Reducing Variance and Bias in Sampling Weights
Raking (iterative proportional fitting) is a procedure that takes sampling weights from complex sample surveys and adjusts them so that they add to known control totals. This process reduces variance and adjusts for undercoverage. But raking in multiple dimensions can lead to extreme weights, which increase variance. Trimming is another sample weighting procedure that reduces extreme weights to cutoffs, thereby improving variance properties while potentially introducing bias. The RAKE-TRIM macro combines raking and trimming in an iterative algorithm to achieve these two goals simultaneously. The raking reduces the bias potential from trimming, and the trimming reduces the variance inflation from raking. When convergence occurs, the final weights aggregate to the control totals, as well as respect the trimming limits. SAS® macros are well suited for this kind of envelope program: the larger macro consists of the integration of component macros that were developed for other applications. A parameter specification sheet enables users to provide all of the parameters needed to define the algorithm for their particular situation, and, if necessary, to alter the parameters to facilitate convergence. Diagnostics are included when convergence fails. Microsoft Excel tables are imported to provide the cell structure and are exported to provide statistics for the algorithm's results. This RAKE-TRIM macro was first developed in 2010 for the 2009 National Household Transportation Survey and has been used in other studies as well. The paper describes the algorithm and discusses our experiences with it.
Louis Rizzo, Westat
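The RAKE-TRIM macro itself is Westat's; purely as a sketch of what one raking pass over a single dimension followed by trimming looks like (all data set names, variable names, and the cutoff are assumptions), the same pass is repeated across each dimension and iterated until the weights both hit the control totals and respect the cutoffs:

   proc sql;
      /* current weighted total within each raking cell of one dimension */
      create table cellsum as
         select region, sum(wt) as wtsum
            from sample
            group by region;

      /* scale the weights so that each cell hits its known control total */
      create table sample_raked as
         select s.*, s.wt * (c.total / m.wtsum) as wt_raked
            from sample s, controls c, cellsum m
            where s.region = c.region and s.region = m.region;
   quit;

   /* trim any extreme raked weights back to a cutoff */
   data sample_trimmed;
      set sample_raked;
      wt_trimmed = min(wt_raked, 3500);   /* the cutoff value is purely illustrative */
   run;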
Paper 1482-2014:
The SAS® Hash Object: It's Time to .find() Your Way Around
"This is the way I have always done it and it works fine for me." Have you heard yourself or others say this when someone suggests a new technique to help solve a problem? Most of us have a set of tricks and techniques from which we draw when starting a new project. Over time we might overlook newer techniques because our old toolkit works just fine. Sometimes we actively avoid new techniques because our initial foray leaves us daunted by the steep learning curve to mastery. For me, the PRX functions and the SAS® hash object fell into this category. In this workshop, we address possible objections to learning to use the SAS hash object. We start with the fundamentals of setting up the hash object and work through a variety of practical examples to help you master this powerful technique.
Peter Eberhardt, Fernwood Consulting Group Inc.
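If the .find() in the title is new to you, the canonical starting point is a keyed lookup: load a small table into a hash object once, then look up each incoming record. A minimal sketch, with all data set and variable names assumed:

   data scored;
      if _n_ = 1 then do;
         if 0 then set work.rate_table;                   /* defines STATE and RATE in the PDV with correct attributes */
         declare hash rates(dataset: 'work.rate_table');  /* small lookup table keyed on STATE */
         rates.defineKey('state');
         rates.defineData('rate');
         rates.defineDone();
      end;
      set work.transactions;                              /* one record per transaction, includes STATE and AMOUNT */
      if rates.find() = 0 then taxed_amount = amount * rate;
      else taxed_amount = .;
   run;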
Paper 1890-2014:
Tips for Moving from Base SAS® 9.3 to SAS® Enterprise Guide® 5.1
As a longtime Base SAS® programmer, whether to use a different application for programming is a constant question when powerful applications such as SAS® Enterprise Guide® are available. This paper provides some important tips for a programmer, such as the best way to use the code window and how to take advantage of system-generated code in SAS Enterprise Guide 5.1. This paper also explains the differences between some of the functions and procedures in Base SAS and SAS Enterprise Guide. It highlights features in SAS Enterprise Guide such as process flow, data access management, and report automation, including formatting using XML tag sets.
Anjan Matlapudi, AmerihealthCaritas
Paper 1786-2014:
Tips to Use Character String Functions in Record Lookup
This paper gives you a better idea of how and where to use record lookup functions to locate observations where a variable has some characteristic. Various related functions for searching numeric and character values are illustrated in the process. Code is shown with time comparisons. I discuss three possible ways to retrieve records: using the SAS® DATA step, PROC SQL, and Perl regular expressions. Real and CPU time are compared when records are retrieved with each of these methods. Although the program was written for the PC using SAS® 9.2 in a Windows XP 32-bit environment, all the functions are applicable to any system. All the tools discussed are in Base SAS®. The typical attendee or reader will have some experience with SAS, but not a lot of experience dealing with large amounts of data.
Anjan Matlapudi, AmerihealthCaritas
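A small hedged illustration of the three routes compared in the paper (the data set, variable, and search term are assumptions): a DATA step character function, PROC SQL, and a Perl regular expression.

   /* DATA step: FIND locates a substring (the 'i' modifier ignores case) */
   data hits1;
      set claims;
      if find(diagnosis_text, 'fracture', 'i') > 0;
   run;

   /* PROC SQL: the LIKE operator does the same lookup */
   proc sql;
      create table hits2 as
      select * from claims
      where upcase(diagnosis_text) like '%FRACTURE%';
   quit;

   /* Perl regular expression: PRXMATCH allows more elaborate patterns */
   data hits3;
      set claims;
      if prxmatch('/fractur(e|ed|es)/i', diagnosis_text) > 0;
   run;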
Paper 1403-2014:
Tricks Using SAS® Add-In for Microsoft Office
SAS® Add-In for Microsoft Office remains a popular tool for people who are not SAS® programmers due to its easy interface with the SAS servers. In this session, you'll learn some of the many tricks that other organizations use for getting more value out of the tool.
Tricia Aanderud, And Data Inc
U
Paper SAS061-2014:
Uncovering Trends in Research Using SAS® Text Analytics with Examples from Nanotechnology and Aerospace Engineering
Understanding previous research in key domain areas can help R&D organizations focus new research in non-duplicative areas and ensure that future endeavors do not repeat the mistakes of the past. However, manual analysis of previous research efforts can prove insufficient to meet these ends. This paper highlights how a combination of SAS® Text Analytics and SAS® Visual Analytics can deliver the capability to understand key topics and patterns in previous research and how it applies to a current research endeavor. We will explore these capabilities in two use cases. The first will be in uncovering trends in publicly visible government funded research (SBIR) and how these trends apply to future research in nanotechnology. The second will be visualizing past research trends in publicly available NASA publications, and how these might impact the development of next-generation spacecraft.
Tom Sabo, SAS
Paper SAS096-2014:
Up Your Game with Graph Template Language Layouts
You have built the simple bar chart and mastered the art of layering multiple plot statements to create complex graphs like the Survival Plot using the SGPLOT procedure. You know all about how to use plot statements creatively to get what you need and how to customize the axes to achieve the look and feel you want. Now it's time to up your game and step into the realm of the Graphics Wizard. Behold the magical powers of Graph Template Language Layouts! Here you will learn the esoteric art of creating complex multi-cell graphs using LAYOUT LATTICE. This is the incantation that gives you the power to build complex, multi-cell graphs like the Forest plot, Stock plots with multiple indicators like MACD and Stochastics, Adverse Events by Relative Risk graphs, and more. If you ever wondered how the Diagnostics panel in the REG procedure was built, this paper is for you. Be warned, this is not the realm for the faint of heart!
Sanjay Matange, SAS
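A small taste of LAYOUT LATTICE, far simpler than the panels described above (assuming the SASHELP.STOCKS sample data shipped with SAS): a two-cell graph with a price series on top and volume needles below.

   proc template;
      define statgraph twocell;
         begingraph;
            entrytitle 'IBM: Closing Price and Volume';
            layout lattice / columns=1 rowweights=(0.7 0.3);
               layout overlay;
                  seriesplot x=date y=close;
               endlayout;
               layout overlay;
                  needleplot x=date y=volume;
               endlayout;
            endlayout;
         endgraph;
      end;
   run;

   proc sgrender data=sashelp.stocks template=twocell;
      where stock = 'IBM';
   run;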
Paper 1667-2014:
Using PROC GPLOT and PROC REG Together to Make One Great Graph
Regression is a helpful statistical tool for showing relationships between two or more variables. However, many users can find the barrage of numbers at best unhelpful, and at worst undecipherable. Using the shipments and inventories historical data from the U.S. Census Bureau's office of Manufacturers' Shipments, Inventories, and Orders (M3), we can create a graphical representation of two time series with PROC GPLOT and map out reported and expected results. By combining this output with results from PROC REG, we are able to highlight problem areas that might need a second look. The resulting graph shows which dates have abnormal relationships between our two variables and presents the data in an easy-to-use format that even users unfamiliar with SAS® can interpret. This graph is ideal for analysts finding problematic areas such as outliers and trend-breakers or for managers to quickly discern complications and the effect they have on overall results.
William Zupko II, DHS
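A hedged sketch of the combination (the data set and variable names stand in for the M3 series): PROC REG supplies the predicted values, and PROC GPLOT overlays them on the observed points.

   proc reg data=m3 noprint;
      model inventories = shipments;
      output out=regout predicted=pred;
   run;
   quit;

   proc sort data=regout;
      by shipments;                     /* sort so the fitted line joins cleanly */
   run;

   symbol1 interpol=none value=dot;     /* observed values */
   symbol2 interpol=join value=none;    /* fitted regression line */
   proc gplot data=regout;
      plot inventories*shipments=1 pred*shipments=2 / overlay;
   run;
   quit;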
Paper 1891-2014:
Using SAS® Enterprise Miner to Predict the Injury Risk Involved in Car Accidents
There are 2.35 million road accident cases recorded yearly in the U.S., of which about 37,000 are fatal. Road crashes cost USD 230.6 billion per year, or an average of USD 820 per person. Our efforts are to identify the important factors that lead to vehicle collisions and to predict the injury risk involved in them. Data were collected from the National Automotive Sampling System (NASS), containing 20,247 cases with 19 variables. Input variables describe the factors involved in an accident, such as height, age, weight, gender, vehicle model year, speed limit, energy absorption in collision, and deformation location. The target variable is nominal, showing levels of injury. Missing values in interval variables were imputed using the mean, and in class variables using the count method. Multivariate analysis suggests high correlation between tire footprint and wheelbase (Corr=0.97, P<0.0001) and between the original weight of the car and the curb weight of the car (Corr=0.79, P<0.0001). Variables having high kurtosis values were transformed using range standardization. Variables were ranked by importance using decision tree analysis. Models such as multiple regression, polynomial regression, neural network, and decision tree were applied to the data set to identify the factors that are most significant in predicting injury risk. A multilayer perceptron neural network came out to be the best model for predicting the injury risk index, with the lowest average squared error (0.086) in the validation data set.
Prateek Khare, Oklahoma State University
Vandana Reddy, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 1699-2014:
Using SAS® MDM deployment in Brazil
Dataprev has become the principal owner of social data on the citizens of Brazil, having collected information for over forty years to support the government's processing of pension applications. The use of this data can be expanded to provide new tools that aid policy and help the government optimize the use of its resources. Using SAS® MDM, we are developing a solution that uniquely identifies the citizens of Brazil. Overcoming challenges with multiple government agencies and with the validation of survey records that suggest the same person requires rules for governance and a definition of what represents a particular Brazilian citizen. In short, how do you turn a repository of master data into an efficient catalyst for public policy? This is the goal for creating a repository focused on identifying the citizens of Brazil.
Ielton de Melo Gonçalves, Dataprev
Paper 1785-2014:
Using SAS® Software to Shrink the Data Used in Apache Flex® Applications
This paper discusses the techniques I used at the Census Bureau to overcome the issue of dealing with large amounts of data while modernizing some of their public-facing web applications by using service oriented architecture (SOA) to deploy Flex web applications powered by SAS®. The paper covers techniques that resulted in reducing 142,293 XML lines (3.6 MB) down to 15,813 XML lines (1.8 MB), a 50% size reduction on the server side (HTTP Response), and 196,167 observations down to 283 observations, a reduction of 99.8% in summarized data on the client side (XML Lookup file).
Ahmed Al-Attar, AnA Data Warehousing Consulting, LLC
V
Paper 1744-2014:
VFORMAT Lets SAS® Do the Format Searching
When reading data files or writing SAS® programs, we are often hunting for the right format or informat. There are so many to choose from! Does it seem like too many to search the manual? Let SAS help find the right one! We use the SAS dictionary table VFORMAT and a very small SAS program. This presentation demonstrates how two simple functions unlock the potential of this great resource: SASHELP.VFORMAT.
Peter Crawford, Crawford Software Consultancy Limited
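The "very small SAS program" is easy to imagine; one hedged sketch of querying SASHELP.VFORMAT for, say, everything date-related:

   proc sql;
      select fmtname, source, minw, maxw
         from sashelp.vformat
         where fmtname contains 'DATE'
         order by fmtname;
   quit;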
Paper 1789-2014:
Visualizing Lake Michigan Wind with SAS® Software
The world's first wind resource assessment buoy, residing in Lake Michigan, uses a pulsing laser wind sensor to accurately measure wind speed, direction, and turbulence offshore up to wind turbine hub-height and across the blade span every second. Understanding wind behavior would be tedious and fatiguing with such large data sets. However, SAS/GRAPH® 9.4 helps the user grasp wind characteristics over time and at different altitudes by exploring the data visually. This paper covers graphical approaches to evaluate wind speed validity, seasonal wind speed variation, and storm systems to inform engineers on the candidacy of Lake Michigan offshore wind farms.
Aaron Clark, Grand Valley State University
W
Paper SAS166-2014:
Weighted Methods for Analyzing Missing Data with the GEE Procedures
Missing observations caused by dropouts or skipped visits present a problem in studies of longitudinal data. When the analysis is restricted to complete cases and the missing data depend on previous responses, the generalized estimating equation (GEE) approach, which is commonly used when the population-average effect is of primary interest, can lead to biased parameter estimates. The new GEE procedure in SAS/STAT® 13.2 implements a weighted GEE method, which provides consistent parameter estimates when the dropout mechanism is correctly specified. When none of the data are missing, the method is identical to the usual GEE approach, which is available in the GENMOD procedure. This paper reviews the concepts and statistical methods. Examples illustrate how you can apply the GEE procedure to incomplete longitudinal data.
Guixian Lin, SAS
Bob Rodriguez, SAS
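A minimal sketch of the weighted GEE analysis described above (the data set and all variable names, including the lagged response prev_y in the dropout model, are assumptions): the MISSMODEL statement specifies the model for the probability that an observation is observed, which supplies the weights.

   proc gee data=long;
      class id trt visit;
      model y = trt visit trt*visit / dist=bin link=logit;
      repeated subject=id / within=visit type=ind;
      missmodel prev_y trt visit;   /* dropout (observation-probability) model */
   run;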