SAS Global Forum 2014 Proceedings

SAS^® Grid Computing is a scale-out SAS^® solution that enables SAS applications to better utilize computing resources, which is extremely I/O and compute intensive. It requires the use of a high-performance shared storage (SS) that allows all servers to access the same file systems. SS may be implemented via traditional NFS NAS or clustered file systems (CFS) like GPFS. This paper uses the Lustre* file system, a parallel, distributed CFS, for a case study of performance scalability of SAS Grid Computing nodes on SS. The paper qualifies the performance of a standardized SAS workload running on Lustre at scale. Lustre has been traditionally used for large and sequential I/O. We will record and present the tuning changes necessary for the optimization of Lustre for the SAS applications. In addition, results from the scaling of SAS Cluster jobs running on Lustre will be presented.

Nowadays, most corporations build and maintain their own data warehouse, and an ETL (Extract, Transform, and Load) process plays a critical role in managing the data. Some people might create a large program and execute this program from top to bottom. Others might generate a SAS^® driver with several programs included, and then execute this driver. If some programs can be run in parallel, then developers must write extra code to handle these concurrent processes. If one program fails, then users can either rerun the entire process or comment out the successful programs and resume the job from where the program failed. Usually the programs are deployed in production with read and execute permission only. Users do not have the priviledge of modifying codes on the fly. In this case, how do you comment out the programs if the job terminated abnormally? This paper illustrates an approach for managing ETL process flows. The approach uses a framework based on SAS, on a UNIX platform. This is a high-level infrastructure discussion with some explanation of the SAS codes that are used to implement the framework. The framework supports the rerun or partial run of the entire process without changing any source codes. It also supports the concurrent process, and therefore no extra code is needed.

The current study looks at recent health trends and behavior analyses of youth in America. Data used in this analysis was provided by the Center for Disease Control and Prevention and gathered using the Youth Risk Behavior Surveillance System (YRBSS). A factor analysis was performed to identify and define latent mental health and risk behavior variables. A series of logistic regression analyses were then performed using the risk behavior and demographic variables as potential contributing factors to each of the mental health variables. Mental health variables included disordered eating and depression/suicidal ideation data, while the risk behavior variables included smoking, consumption of alcohol and drugs, violence, vehicle safety, and sexual behavior data. Implications derived from the results of this research are a primary focus of this study. Risks and benefits of using a factor analysis with logistic regression in social science research will also be discussed in depth. Results included reporting differences between the years of 1991 and 2011. All results are discussed in relation to current youth health trend issues. Data was analyzed using SAS^® 9.3.

One of the first lessons that SAS^® programmers learn on the job is that numeric and character variables do not play well together, and that type mismatches are one of the more common source of errors in their otherwise flawless SAS programs. Luckily, converting variables from one type to another in SAS (that is, casting) is not difficult, requiring only the judicious use of either the input() or put() function. There remains, however, the danger of data being lost in the conversion process. This type of error is most likely to occur in cases of character-to-numeric variable conversion, most especially when the user does not fully understand the data contained in the data set. This paper will review the basics of data storage for character and numeric variables in SAS, the use of formats and informats for conversions, and how to ensure accurate type conversion of even high-precision numeric values.

Complex data manipulations can be resource intensive, both in terms of development time and processing duration. However, in recent years SAS has introduced a number of new technologies that, when used together, can produce a dramatic increase in performance while simultaneously simplifying program development and maintenance. This paper presents a development paradigm that utilizes the problem decomposition capabilities of DS2, the flexibility of SQL, and the performance benefits of in-memory storage using hash objects.

Have you ever wished that with one click you could copy any SAS^® data set, including variable names, so that you could paste the text into a Microsoft Word file, Microsoft PowerPoint slide, or spreadsheet? You can and, with just Base SAS^®, there are some little-known but easy-to use methods that are available for automating many of your (or your users ) common tasks.

Influence analysis in statistical modeling looks for observations that unduly influence the fitted model. Cook s distance is a standard tool for influence analysis in regression. It works by measuring the difference in the fitted parameters as individual observations are deleted. You can apply the same idea to examining influence of groups of observations for example, the multiple observations for subjects in longitudinal or clustered data but you need to adapt it to the fact that different subjects can have different numbers of observations. Such an adaptation is discussed by Zhu, Ibrahim, and Cho (2012), who generalize the subject size factor as the so-called degree of perturbation, and correspondingly generalize Cook s distances as the scaled Cook s distance. This paper presents the %SCDMixed SAS^® macro, which implements these ideas for analyzing influence in mixed models for longitudinal or clustered data. The macro calculates the degree of perturbation and scaled Cook s distance measures of Zhu et al. (2012) and presents the results with useful tabular and graphical summaries. The underlying theory is discussed, as well as some of the programming tricks useful for computing these influence measures efficiently. The macro is demonstrated using both simulated and real data to show how you can interpret its results for analyzing influence in your longitudinal modeling.

Stepwise regression includes regression models in which the predictive variables are selected by an automated algorithm. The stepwise method involves two approaches: backward elimination and forward selection. Currently, SAS^® has three procedures capable of performing stepwise regression: REG, LOGISTIC, and GLMSELECT. PROC REG handles the linear regression model, but does not support a CLASS statement. PROC LOGISTIC handles binary responses and allows for logit, probit, and complementary log-log link functions. It also supports a CLASS statement. The GLMSELECT procedure performs selections in the framework of general linear models. It allows for a variety of model selection methods, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). PROC GLMSELECT also supports a CLASS statement. We present a stepwise algorithm for generalized linear mixed models for both marginal and conditional models. We illustrate the algorithm using data from a longitudinal epidemiology study aimed to investigate parents beliefs, behaviors, and feeding practices that associate positively or negatively with indices of sleep quality.

The ExcelXP tagset offers several options for controlling column widths, including Width_Points, Width_Fudge, and Absolute_Column_Width. Although Absolute_Column_Width might seem unpredictable at first, it is possible to fix the first two options so that the Absolute_Column_Width is the exact column width in pixels. This poster presents these settings and suggests how to create and manage the integer string of column widths.

The SAS^® Data Quality Server allows SAS^® programmers to integrate the power of DataFlux^® into their data cleaning programs. The power of SAS Data Quality Server enables programmers to efficiently identify matching records across different datasets when exact matches are not present. During a recent educational research project, the DQMATCH function proved very capable when trying to link records from disparate data sources. Two key insights led to even greater success in linking records. The first insight was acknowledging that the hierarchical structure of data can greatly improve success in matching records. The second insight was that the names of individuals can be restructured to improve the chances of successful matches. This paper provides an overview of how these insights were implemented using the DQMATCH function to link educational data from multiple sources.

The operational tempo of marketing in a digital world seems faster every day. New trends, issues, and ideas appear and spread like wildfire, demanding that sales and marketing adapt plans and priorities on-the-fly. The technology available to us can handle this, but traditional organizational processes often become the bottleneck. The solution is a new management approach called agile marketing. Drawing upon the success of agile software development and the lean start-up movement, agile marketing is a simple but powerful way to make marketing teams more nimble and marketing programs more responsive. You don't have to be small to be agile agile marketing has thrived at large enterprises such as Cisco and EMC. This session covers the basics of agile marketing what it is, why it works, and how to get started with agile marketing in your own team. In particular, we look at how agile marketing dovetails with the explosion of data-driven management in marketing by using the fast feedback from analytics to rapidly iterate and adapt new marketing programs in a rapid yet focused fashion.

Finding groups with similar attributes is at the core of knowledge discovery. To this end, Cluster Analysis automatically locates groups of similar observations. Despite successful applications, many practitioners are uncomfortable with the degree of automation in Cluster Analysis, which causes intuitive knowledge to be ignored. This is more true in text mining applications since individual words have meaning beyond the data set. Discovering groups with similar text is extremely insightful. However, blind applications of clustering algorithms ignore intuition and hence are unable to group similar text categories. The challenge is to integrate the power of clustering algorithms with the knowledge of experts. We demonstrate how SAS/STAT^® 9.2 procedures and the SAS^® Macro Language are used to ensemble the opinion of domain experts with multiple clustering models to arrive at a consensus. The method has been successfully applied to a large data set with structured attributes and unstructured opinions. The result is the ability to discover observations with similar attributes and opinions by capturing the wisdom of the crowds whether man or model.

This paper expands upon A Multilevel Model Primer Using SAS^® PROC MIXED in which we presented an overview of estimating two- and three-level linear models via PROC MIXED. However, in our earlier paper, we, for the most part, relied on simple options available in PROC MIXED. In this paper, we present a more advanced look at common PROC MIXED options used in the analysis of social and behavioral science data, as well introduce users to two different SAS macros previously developed for use with PROC MIXED: one to examine model fit (MIXED_FIT) and the other to examine distributional assumptions (MIXED_DX). Specific statistical options presented in the current paper include (a) PROC MIXED statement options for estimating statistical significance of variance estimates (COVTEST, including problems with using this option) and estimation methods (METHOD =), (b) MODEL statement option for degrees of freedom estimation (DDFM =), and (c) RANDOM statement option for specifying the variance/covariance structure to be used (TYPE =). Given the importance of examining model fit, we also present methods for estimating changes in model fit through an illustration of the SAS macro MIXED_FIT. Likewise, the SAS macro MIXED_DX is introduced to remind users to examine distributional assumptions associated with two-level linear models. To maintain continuity with the 2013 introductory PROC MIXED paper, thus providing users with a set of comprehensive guides for estimating multilevel models using PROC MIXED, we use the same real-world data sources that we used in our earlier primer paper.

The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS^® has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is specifically designed for complex Bayesian modeling (not discussed here). This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then describes the Bayesian capabilities provided in four procedures(the GENMOD, PHREG, FMM, and LIFEREG procedures) including the available prior distributions, posterior summary statistics, and convergence diagnostics. Various sampling methods that are used to sample from the posterior distributions are also discussed. The second part of the paper describes how to use the GENMOD and PHREG procedures to perform Bayesian analyses for real-world examples and how to take advantage of the Bayesian framework to address scientific questions.

Hierarchical data are common in many fields, from pharmaceuticals to agriculture to sociology. As data sizes and sources grow, information is likely to be observed on nested units at multiple levels, calling for the multilevel modeling approach. This paper describes how to use the GLIMMIX procedure in SAS/STAT^® to analyze hierarchical data that have a wide variety of distributions. Examples are included to illustrate the flexibility that PROC GLIMMIX offers for modeling within-unit correlation, disentangling explanatory variables at different levels, and handling unbalanced data. Also discussed are enhanced weighting options, new in SAS/STAT 13.1, for both the MODEL and RANDOM statements. These weighting options enable PROC GLIMMIX to handle weights at different levels. PROC GLIMMIX uses a pseudolikelihood approach to estimate parameters, and it computes robust standard error estimators. This new feature is applied to an example of complex survey data that are collected from multistage sampling and have unequal sampling probabilities.

Most marketers today are trying to use Facebook s network of 1.1 billion plus registered users for social media marketing. Local television stations and newspapers are no exception. This paper investigates what makes a post effective. A Facebook page that is owned by a brand has fans, or people who like the page and follow the stories posted on that page. The posts on a brand page, however, do not appear on all the fans News Feeds. This is determined by EdgeRank, a Facebook proprietary algorithm that determines what content users see and how it s prioritized on their News Feed. If marketers can understand how EdgeRank works, then they can develop more impactful posts and ultimately produce more effective social marketing using Facebook. The objective of this paper is to find the characteristics of a Facebook post that enhance the efficacy of a news outlet s page among their fans using Facebook Power Ratio as the target variable. Power Ratio, a surrogate to EdgeRank, was developed by experts at Frank N. Magid Associates, a research-based media consulting firm. Seventeen variables that describe the characteristics of a post were extracted from more than 8,000 posts, which were encoded by 10 media experts at Magid. Numerous models were built and compared to predict Power Ratio. The most useful model is a polynomial regression with the top three important factors as whether a post asks fans to like the post, content category of a post (including news, weather, etc.), and number of fans of the page.

Do you have an abstract for an idea that you want to submit as a proposal to SAS^® conferences, but you are not sure which section is the most appropriate one? In this paper, we discuss a methodology for automatically identifying the most suitable section or sections for your proposal content. We use SAS^® Text Miner 12.1 and SAS^® Content Categorization Studio 12.1 to develop a rule-based categorization model. This model is used to automatically score your paper abstract to identify the most relevant and appropriate conference sections to submit to for a better chance of acceptance.

In our previous work, we often needed to perform large numbers of repetitive and data-driven post-campaign analyses to evaluate the performance of marketing campaigns in terms of customer response. These routine tasks were usually carried out manually by using Microsoft Excel, which was tedious, time-consuming, and error-prone. In order to improve the work efficiency and analysis accuracy, we managed to automate the analysis process with SAS^® programming and replace the manual Excel work. Through the use of SAS macro programs and other advanced skills, we successfully automated the complicated data-driven analyses with high efficiency and accuracy. This paper presents and illustrates the creative analytical ideas and programming skills for developing the automatic analysis process, which can be extended to apply in a variety of business intelligence and analytics fields.

This is a simple macro that will examine all fields in a SAS^® dataset that are stored as character data to see if they contain real character data or if they could be converted to numeric. It then performs the conversion and reports in the log the names of any fields that could not be converted. This allows the truly numeric data to be analyzed by PROC MEANS or PROC UNIVARIATE. It makes use of the SAS dictionary tables, the SELECT INTO syntax, and the ANYALPHA function.

As IT professionals, saving time is critical. Delivering timely and quality-looking reports and information to management, end users, and customers is essential. SAS^® provides numerous 'canned' PROCedures for generating quick results to take care of these needs ... and more. In this hands-on workshop, attendees acquire basic insights into the power and flexibility offered by SAS PROCedures using PRINT, FORMS, and SQL to produce detail output; FREQ, MEANS, and UNIVARIATE to summarize and create tabular and statistical output; and data sets to manage data libraries. Additional topics include techniques for informing SAS which data set to use as input to a procedure, how to subset data using a WHERE statement (or WHERE= data set option), and how to perform BY-group processing.

As complicated as the macro language is to learn, there are very strong reasons for doing so. At its heart, the macro language is a code generator. In its simplest uses, it can substitute simple bits of code like variable names and the names of data sets that are to be analyzed. In more complex situations, it can be used to create entire statements and steps based on information may even be unavailable to the person writing or even executing the macro. At the time of execution, it can be used to make queries of the SAS^® environment as well as the operating system, and utilize the gathered information to make informed decisions about how it is to further function and execute.

Because the macro language is primarily a code generator, it makes sense that the code that it creates must be generated before it can be executed. This implies that execution of the macro language comes first. Simple as this is in concept, timing issues and conflicts are often not so simple to recognize in application. As we use the macro language to take on more complex tasks, it becomes even more critical that we have an understanding of these issues.

Macro variables and their values are stored in symbol tables, which in turn are held in memory. Not only are there are a number of ways to create macro variables, but they can be created in a wide variety of situations. How they are created and under what circumstances effects the variable s scope how and where the macro variable is stored and retrieved. There are a number of misconceptions about macro variable scope and about how the macro variables are assigned to symbol tables. These misconceptions can cause problems that the new, and sometimes even the experienced, macro programmer does not anticipate. Understanding the basic rules for macro variable assignment can help the macro programmer solve some of these problems that are otherwise quite mystifying.

This paper provides an overview of how to create a SAS^® Enterprise Guide^® process that is well designed, simple, documented, automated, modular, efficient, reliable, and easy to maintain. Topics include how to organize a SAS Enterprise Guide process, how to best document in SAS Enterprise Guide, when to leverage point-and-click functionality, and how to automate and simplify SAS Enterprise Guide processes. This paper has something for any SAS Enterprise Guide user, new or experienced!

Rapid adoption of high-performance scalable systems supporting big data acquisition, absorption, management, and analysis has exposed a potential gap in asserting governance and oversight over the information production flow. In a traditional organizational data management environment, business rules encompassed within data controls can be used to govern the creation and consumption of data across the enterprise. However, as more big data analytics applications absorbing massive data sets from external sources whose creation points are far removed from their various repurposed uses, the ability to control the production of data must give way to a different kind of data governance. In this talk, we discuss a rational approach to scoping data governance for big data. Instead of presuming the ability to validate and cleanse data prior to its loading into the analytical platform, we will explore pragmatic expectations and measures for scoring data utility and believability within the context of big data analytics applications. Attendees will learn about: Expectations for data variation The impact of variance on analytical results Focusing on the quality of your data management processesand infrastructure Measurements for data usability

The emerging discipline of data governance encompasses data quality assurance, data access and use policy, security risks and privacy protection, and longitudinal management of an organization s data infrastructure. In the interests of forestalling another bureaucratic solution to data governance issues, this presentation features database programming tools that provide rapid access to big data and make selective access to and restructuring of metadata practical.

In many organizations the amount of data we deal with increases far faster than the hardware and IT infrastructure to support them. As a result, we encounter significant bottlenecks and I/O bound processes. However, clever use of SAS^® software can help us find a way around. In this paper we will look at the clever use of PROC OLAP to show you how to address I/O bound processing spread I/O traffic to different servers to increase cube building efficiency. This paper assumes experience with SAS^® OLAP Cube Studio and/or PROC OLAP.

Simply using an ODS destination to replay PROC CONTENTS output does not provide the user with attractive, usable metadata. Harness the power of SAS^® and ODS output objects to create designer multi-tab metadata workbooks with the click of a mouse!

Inference of variance components in linear mixed effect models (LMEs) is not always straightforward. I introduce and describe a flexible SAS^® macro (%COVTEST) that uses the likelihood ratio test (LRT) to test covariance parameters in LMEs by means of the parametric bootstrap. Users must supply the null and alternative models (as macro strings), and a data set name. The macro calculates the observed LRT statistic and then simulates data under the null model to obtain an empirical p-value. The macro also creates graphs of the distribution of the simulated LRT statistics. The program takes advantage of processing accomplished by PROC MIXED and some SAS/IML^® functions. I demonstrate the syntax and mechanics of the macro using three examples.

The use of Cohen s kappa has enjoyed a growing popularity in the social sciences as a way of evaluating rater agreement on a categorical scale. The kappa statistic can be calculated as Cohen first proposed it in his 1960 paper or by using any one of a variety of weighting schemes. The most popular among these are the linear weighted kappa and the quadratic weighted kappa. Currently, SAS^® users can produce the kappa statistic of their choice through PROC FREQ and the use of relevant AGREE options. Complications arise however when the data set does not contain a completely square cross-tabulation of data. That is, this method requires that both raters have to have at least one data point for every available category. There have been many solutions offered for this predicament. Most suggested solutions include the insertion of dummy records into the data and then assigning a weight of zero to those records through an additional class variable. The result is a multi-step macro, extraneous variable assignments, and potential data integrity issues. The author offers a much more elegant solution by producing a segment of code which uses brute force to calculate Cohen s kappa as well as all popular variants. The code uses nested PROC SQL statements to provide a single conceptual step which generates kappa statistics of all types even those that the user wishes to define for themselves.

Usually, log files are checked by users only when SAS^® completes the execution of programs. If SAS finds any errors in the current line, it skips the current step and executes the next line. The process is completed only at the execution complete program. There are a few programs that will take more than a day to complete. In this case, the user opens the log file in Read-Only mode frequently to check for errors, warnings, and unexpected notes and terminates the execution of the program manually if any potential messages are identified. Otherwise, the user will be notified with the errors in the log file only at the end of the execution. Our suggestion is to run the parallel utility program along with the production program to check the log file of the currently running program and to notify the user through an e-mail when an error, warning, or unexpected note is found in the log file. Also, the execution can be terminated automatically and the user can be notified when potential messages are identified.

Can clustering discharge records improve a predictive model s overall fit statistic? Do the predictors differ across segments? This talk describes the methods and results of data mining pediatric IBD patient records from my Analytics 2013 poster. SAS^® Enterprise Miner^™ 12.1 was used to segment patients and model important predictors for the length of hospital stay using discharge records from the national Kid s Inpatient Database. Profiling revealed that patient segments were differentiated by primary diagnosis, operating room procedure indicator, comorbidities, and factors related to admission and disposition of patient. Cluster analysis of patient discharges improved the overall average square error of predictive models and distinguished predictors that were unique to patient segments.

When a SAS^® user asked for help scanning words in textual data and then matching them to pre-scored keywords, it struck a chord with SAS programmers! They contributed code that solved the problem using hash structures, SQL, informats, arrays, and PRX routines. Of course, the next question was which program is fastest! This paper compares the different approaches and evaluates the performance of the programs on varying amounts of data. The code for each program is provided to show how SAS has a variety of tools available to solve common problems. While this won t make you an expert on any of these programming techniques, you ll see each of them in action on a common problem.

This paper describes a method that uses some simple SAS^® macros and SQL to merge data sets containing related data that contains rows with varying effective date ranges. The data sets are merged into a single data set that represents a serial list of snapshots of the merged data, as of a change in any of the effective dates. While simple conceptually, this type of merge is often problematic when the effective date ranges are not consecutive or consistent, when the ranges overlap, or when there are missing ranges from one or more of the merged data sets. The technique described was used by the Fairfax County Human Resources Department to combine various employee data sets (Employee Name and Personal Data, Personnel Assignment and Job Classification, Personnel Actions, Position-Related data, Pay Plan and Grade, Work Schedule, Organizational Assignment, and so on) from the County's SAP-HCM ERP system into a single Employee Action History/Change Activity file for historical reporting purposes. The technique currently is used to combine fourteen data sets, but is easily expandable by inserting a few lines of code using the existing macros.

Missing data commonly occurs in medical, psychiatry, and social researches. The SAS^® MI and MIANALYZE procedures are often used to generate multiple imputations and then provide valid statistical inferences based on them. However, MIANALYZE is not applicable to combine type-III analyses obtained using multiple imputed data sets. In this manuscript, we write a macro to combine the type-III analyses generated from the SAS MIXED procedure based on multiple imputations. The proposed method can be extended to other procedures reporting type-III analyses, such as GENMOD and GLM.

Graphic software users are confronted with what I call Options Over-Choice, and with defaults that are designed to easily give you a result, but not necessarily the best result. This presentation and paper focus on guidelines for communication-effective data visualization. It demonstrates their practical implementation, using graphic examples likely to be adaptable to your own work. Code is provided for the examples. Audience members will receive the latest update of my tip sheet compendium of graphic design principles. The examples use SAS^® tools (traditional SAS/GRAPH^® or the newer ODS graphics procedures that are available with Base SAS^®), but the design principles are really software independent. Come learn how to use data visualization to inform and influence, to reveal and persuade, using tips and techniques developed and refined over 34 years of working to get the best out of SAS^® graphic software tools.

People from all over the world are using SAS^® analytics to achieve great things, such as to develop life-saving medicines, detect and prevent financial fraud, and ensure the survival of endangered species. Chris Hemedinger is not one of those people. Instead, Chris has used SAS to optimize his baby name selections, evaluate his movie rental behavior, and analyze his Facebook friends. Join Chris as he reviews some of his personal triumphs over the little problems in life, and learn how these exercises can help to hone your skills for when it really matters.

This presentation explains how to use Base SAS^®9 software to create multi-sheet Microsoft Excel workbooks. You learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS^® output using the ExcelXP ODS tagset. The techniques can be used regardless of the platform on which your SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on-demand and in real time using SAS server technology is discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.

The first table in many research journal articles is a statistical comparison of demographic traits across study groups. It might not be exciting, but it s necessary. And although SAS^® calculates these numbers with ease, it is a time-consuming chore to transfer these results into a journal-ready table. Introducing the time-saving deluxe %MAKETABLE SAS macro it does the heavy work for you. It creates a Microsoft Word table of up to four comparative groups reporting t-tests, chi-square, ANOVA, or median test results, including a p-value. You specify only a one-line macro call for each line in the table, and the macro takes it from there. The result is a tidily formatted journal-ready Word table that you can easily include in a manuscript, report, or Microsoft PowerPoint presentation. For statisticians and researchers needing to summarize group comparisons in a table, this macro saves time and relieves you from the drudgery of trying to make your output neat and pretty. And after all, isn t that what we want computing to do for us?

With increased concern about privacy and simultaneous pressure to make survey data available, statistical disclosure control (SDC) treatments are performed on survey microdata to reduce disclosure risk prior to dissemination to the public. This situation is all the more problematic in the push to provide data online for immediate user query. Two SDC approaches are data coarsening, which reduces the information collected, and data swapping, which is used to adjust data values. Data coarsening includes recodes, top-codes and variable suppression. Challenges related to creating a SAS^® macro for data coarsening include providing flexibility for conducting different coarsening approaches, and keeping track of the changes to the data so that variable and value labels can be assigned correctly. Data swapping includes selecting target records for swapping, finding swapping partners, and swapping data values for the target variables. With the goal of minimizing the impact on resulting estimates, challenges for data swapping are to find swapping partners that are close matches in terms of both unordered categorical and ordered categorical variables. Such swapping partners ensure that enough change is made to the target variables, that data consistency between variables is retained, and that the pool of potential swapping partners is controlled. An example is presented using each algorithm.

Euramax is a global manufacturer of precoated metals who relies on analytics and data visualization for its decision making. Euramax has deployed significant innovations in recent years. SAS^® Visual Analytics fits in the innovative culture of Euramax and its need for information-based decision making. During this presentation, Peter Wijers shares best practices of the implementation process and several application areas.

We used OPTNET to link hedge fund datasets from four vendors, covering overlapping populations, but with no universal identifier. This quick tip shows how to treat data records as nodes, use pairwise identifiers to generate distance measures, and get PROC OPTNET to assign clusters of records from all sources to each hedge fund. This proved to be far faster, and easier, than doing the same task in PROC SQL.

Your company s chronically overloaded SAS^® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS^® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS^® Display Manager on a server machine, runs counter to the expectation of the SAS grid your users are now supposed to switch to SAS^® Enterprise Guide^® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play a hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, and your users get to keep their favorite SAS Display Manager.

New innovative, analytical techniques are necessary to extract patterns in big data that have temporal and geo-spatial attributes. An approach to this problem is required when geo-spatial time series data sets, which have billions of rows and the precision of exact latitude and longitude data, make it extremely difficult to locate patterns of interest The usual temporal bins of years, months, days, hours, and minutes often do not allow the analyst to have control of the precision necessary to find patterns of interest. Geohashing is a string representation of two-dimensional geometric coordinates. Time hashing is a similar representation, which maps time to preserve all temporal aspects of the date and time of the data into a one-dimensional set of data points. Geohashing and time hashing are both forms of a Z-order curve, which maps multidimensional data into single dimensions and preserves the locality of the data points. This paper explores the use of a multidimensional Z-order curve, combining both geohashing and time hashing, that is known as geo-temporal hashing or space-time boxes using SAS^®. This technique provides a foundation for reducing the data into bins that can yield new methods for pattern discovery and detection in big data.

Very often, there is a need to present the analysis output from SAS^® through web applications. On these occasions, it would make a lot of difference to have highly interactive charts over static image charts and graphs. Not only this is visually appealing, with features like zooming, filtering, etc., it enables consumers to have a better understanding of the output. There are a lot of charting libraries available in the market which enable us to develop cool charts without much effort. Some of the packages are Highcharts, Highstock, KendoUI, and so on. They are developed in JavaScript and use the latest HTML5 components, and they also support a variety of chart types such as line, spline, area, area spline, column, bar, pie, scatter, angular gauges, area range, area spline range, column range, bubble, box plot, error bars, funnel, waterfall, polar chart types etc. This paper demonstrates how we can combine the data processing and analytic powers of SAS with the visualization abilities of these charting libraries. Since most of them consume JSON-formatted data, the emphasis is on JSON producing capabilities of SAS, both with PROC JSON and other custom programming methods. The example would show how easy it is to do develop a stored process which produces JSON data which would be consumed by the charting library with minimum change to the sample program.

APP is an unofficial collective abbreviation for the SAS^® functions ADDR, PEEK, PEEKC, the CALL POKE routine, and their so-called LONG 64-bit counterparts the SAS tools designed to directly read from and write to physical memory in the DATA step. APP functions have long been a SAS dark horse. First, the examples of APP usage in SAS documentation amount to a few technical report tidbits intended for mainframe system programming, with nary a hint how the functions can be used for data management programming. Second, the documentation note on the CALL POKE routine is so intimidating in tone that many potentially receptive folks might decide to avoid the allegedly precarious route altogether. However, little can stand in the way of an inquisitive SAS programmer daring to take a close look, and it turns out that APP functions are very simple and useful tools! They can be used to explore how things really work, to make code more concise, to implement en masse data movement, and they can often dramatically improve execution efficiency. The author and many other SAS experts (notably Peter Crawford, Koen Vyverman, Richard DeVenezia, Toby Dunn, and the fellow masked by his 'Puddin' Man' sobriquet) have been poking around the SAS APP realm on SAS-L and in their own practices since 1998, occasionally letting the SAS community at large to peek at their findings. This opus is an attempt to circumscribe the results in a systematic manner. Welcome to the APP world! You are in for a few glorious surprises.

'NOTE: No unequal values were found. All values compared are exactly equal.' Do your eyes automatically drop to the end of your PROC COMPARE output in search of these words? Do you then conclude that your data sets match? Be careful here! Major discrepancies might still lurk in the shadows, and you ll never know about them if you make this common mistake. This paper describes several of PROC COMPARE s blind spots and how to steer clear of them. Watch in horror as PROC COMPARE glosses over important differences while boldly proclaiming that all is well. See the gruesome truth about what PROC COMPARE does, and what it doesn t do! Learn simple techniques that allow you to peer into these blind spots and avoid getting blindsided by PROC COMPARE!

The implicit loop refers to the DATA step repetitively reading data and creating observations, one at a time. The explicit loop, which uses the iterative DO, DO WHILE, or DO UNTIL statements, is used to repetitively execute certain SAS^® statements within each iteration of the DATA step execution. Explicit loops are often used to simulate data and to perform a certain computation repetitively. However, when an explicit loop is used along with array processing, the applications are extended widely, which includes transposing data, performing computations across variables, and so on. To be able to write a successful program that uses loops and arrays, one needs to know the contents in the program data vector (PDV) during the DATA step execution, which is the fundamental concept of DATA step programming. This workshop covers the basic concepts of the PDV, which is often ignored by novice programmers, and then illustrates how to use loops and arrays to transform lengthy code into more efficient programs.

Pipeline parallelism, an extension of MP Connect, is an effective way to speed processing. Piping allows the typical programming sequence of DATA step followed by PROC to execute in parallel. Piping uses TCP ports to pass records directly from the DATA step to the PROC immediately as each individual record is processed. The DATA step in effect becomes a data transformation filter for the PROC , running in parallel and incurring no additional disk storage or related I/O lag. Establishing a pipe with MP Connect typically requires specifying a physical TCP port to be used by the writing and by the reading processes. Coding in this style opens the possibility for users to generate systems conflicts by inadvertently requesting ports that are in use. SAS^® Metadata Server allows one to allocate ports dynamically; that is, users can use a symbolic name for the port with the server dynamically determining an unused port to temporarily assign to the SAS^® job. While this capability is attractive, implementing SAS Metadata Server on a system which does not use any of the other SAS BI technology can be inefficient from a cost perspective. To enable dynamic port allocation without the added cost, we created a UNIX script which can be called from within SAS to ascertain which ports are available at runtime. The script returns a list of available ports which is captured in a SAS macro variable and subsequently used in establishing pipeline parallelism.

This paper is based on the belief that debugging your programs is not only necessary, but also a good way to gain insight into how SAS^® works. Once you understand why you got an error, a warning, or a note, you'll be better able to avoid problems in the future. In other words, people who are good debuggers are good programmers. This paper covers common problems including missing semicolons and character-to-numeric conversions, and the tricky problem of a DATA step that runs without suspicious messages but, nonetheless, produces the wrong results. For each problem, the message is deciphered, possible causes are listed, and how to fix the problem is explained.

Logistic regression is a powerful technique for predicting the outcome of a categorical response variable and is used in a wide range of disciplines. Until recently, however, this methodology was available only for data that were collected using a simple random sample. Thanks to the work of statisticians such as Binder (1983), logistic modeling has been extended to data that are collected from a complex survey design that includes strata, clusters, and weights. Through examples, this paper provides guidance on how to use PROC SURVEYLOGISTIC to apply logistic regression modeling techniques to data that are collected from a complex survey design. The examples relate to calculating odds ratios for models with interactions, scoring data sets, and producing ROC curves. As an extension of these techniques, a final example shows how to fit a Generalized Estimating Equations (GEE) logit model.

SAS^® is an outstanding suite of software, but not everyone in the workplace speaks SAS. However, almost everyone speaks Excel. Often, the data you are analyzing, the data you are creating, and the report you are producing is a form of a Microsoft Excel spreadsheet. Every year at SAS^® Global Forum, there are SAS and Excel presentations, not just because Excel isso pervasive in the workplace, but because there s always something new to learn (or re-learn)! This paper summarizes and references (and pays homage to!) previous SAS Global Forum presentations, as well as examines some of the latest Excel capabilities with the latest versions of SAS^® 9.4 and SAS^® Visual Analytics.

Explore the various DATA step merge and PROC SQL join processes. This presentation examines the similarities and differences between merges and joins, and provides examples of effective coding techniques. Attendees examine the objectives and principles behind merges and joins, one-to-one merges (joins), and match-merge (equi-join), as well as the coding constructs associated with inner and outer merges (joins) and PROC SQL set operators.

The growing adoption of electronic systems for keeping medical records provides an opportunity for health care practitioners and biomedical researchers to access traditionally unstructured data in a new and exciting way. Pathology reports, progress notes, and many other sections of the patient record that are typically written in a narrative format can now be analyzed by employing natural language processing contextual extraction techniques to identify specific concepts contained within the text. Linking these concepts to a standardized nomenclature (for example, SNOMED CT, ICD-9, ICD-10, and so on) frees analysts to explore and test hypotheses using these observational data. Using SAS^® software, we have developed a solution in order to extract data from the unstructured text found in medical pathology reports, link the extracted terms to biomedical ontologies, join the output with more structured patient data, and view the results in reports and graphical visualizations. At its foundation, this solution employs SAS^® Enterprise Content Categorization to perform entity extraction using both manually and automatically generated concept definition rules. Concept definition rules are automatically created using technology developed by SAS, and the unstructured reports are scored using the DS2/SAS^® Content Categorization API. Results are post-processed and added to tables compatible with SAS^® Visual Analytics, thus enabling users to visualize and explore data as required. We illustrate the interrelated components of this solution with examples of appropriate use cases and describe manual validation of performance and reliability with metrics such as precision and recall. We also provide examples of reports and visualizations created with SAS Visual Analytics.

Traditionally, web applications interact with back-end databases by means of JDBC/ODBC connections to retrieve and update data. With the growing need for real-time charting and complex analysis types of data representation on these web applications, SAS computing power can be put to use by adding a SAS web service layer between the application and the database. With the experience that we have with integrating these applications to SAS^® BI Web Services, this is our attempt to point out five things to do when using SAS BI Web Services. 1) Input Data Sources: always enable Allow rewinding stream while creating the stored process. 2) Use LIBNAME statements to define XML filerefs for the Input and Output Streams (Data Sources). 3) Define input prompts and output parameters as global macro variables in the stored process if the stored process calls macros that use these parameters. 4) Make sure that all of the output parameters values are set correctly as defined (data type) before the end of the stored process. 5) The Input Streams (if any) should have a consistent data type; essentially, every instance of the stream should have the same structure. This paper consist of examples and illustrations of errors and warnings associated with the previously mentioned cases.

No way. Not gonna happen. I am a real SAS^® programmer. (Spoken by a Real SAS Programmer.) SAS^® Enterprise Guide^® sometimes gets a bad rap. It was originally promoted as a code generator for non-programmers. The truth is, however, that SAS Enterprise Guide has always allowed programmers to write their own code. In addition, it offers many features that are not included in PC SAS^®. This presentation shows you the top ten features that people who like to write code care about. It will be taught by a programmer who now prefers using SAS Enterprise Guide.

This introductory presentation is intended for an audience new to mixed models who wants to get an overview of this useful class of models. Learn about mixed models as an extension of ordinary regression models, and see several examples of mixed models in social, agricultural, and pharmaceutical research.

For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now, they are gaining favor in business statistics for giving better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn about how predictions based on a mixed model differ from predictions in ordinary regression and see examples of mixed models with business data.

Analyzing data from a complex probability survey involves weighting observations so that inferences are correct. This introductory presentation is intended for an audience new to analyzing survey data. Learn the essentials of using the SURVEYxx procedures in SAS/STAT^®.

Do you need a statistic that is not computed by any SAS^® procedure? Reach for the SAS/IML^® language! Many statistics are naturally expressed in terms of matrices and vectors. For these, you need a matrix-vector language. This hands-on workshop introduces the SAS/IML language to experienced SAS programmers. The workshop focuses on statements that create and manipulate matrices, read and write data sets, and control the program flow. You will learn how to write user-defined functions, interact with other SAS procedures, and recognize efficient programming techniques. Programs are written using the SAS/IML^® Studio development environment. This course covers Chapters 2 4 of Statistical Programming with SAS/IML Software (Wicklin, 2010).

This paper illustrates some SAS^® graphs that can be useful for variable selection in predictive modeling. Analysts are often confronted with hundreds of candidate variables available for use in predictive models, and this paper illustrates some simple SAS graphs that are easy to create and that are useful for visually evaluating candidate variables for inclusion or exclusion in predictive models. The graphs illustrated in this paper are bar charts with confidence intervals using the GCHART procedure and comparative histograms using the UNIVARIATE procedure. The graphs can be used for most combinations of categorical or continuous target variables with categorical or continuous input variables. This paper assumes the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression.

Speed, precision, reliability these are just three of the many challenges that today s banking institutions need to face. Join Austria s ERSTE GROUP Bank on their road from monolithic processing toward a highly flexible processing infrastructure using SAS^® Grid technology. This paper focuses on the central topics and decisions that go beyond the standard material about the product that is presented initially to SAS Grid prospects. Topics covered range from how to choose the correct hardware and critical architecture considerations to the necessary adaptions of existing code and logic all of which have shown to be a common experience for all the members of the SAS Grid community. After making the initial plans and successfully managing the initial hurdles, seeing it all come together makes you realize the endless possibilities for improving your processing landscape.

Have you ever needed additional data that was only accessible via a web service in XML or JSON? In some situations, the web service is set up to only accept parameter values that return data for a single observation. To get the data for multiple values, we need to iteratively pass the parameter values to the web service in order to build the necessary dataset. This paper shows how to combine the SAS^® hash object with the FILEVAR= option to iteratively pass a parameter value to a web service and input the resulting JSON or XML formatted data.

A group tasked with testing SAS^® software from the customer perspective has gathered a number of helpful hints for SAS^® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you're getting oriented and will provide short, straightforward answers. And we can share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS^® Customer Experience Testing shares insiders' practical tips to ensure you are ready to begin your transition to SAS^® 9.4.

Do you create reports via a mainframe? If so, you can use SAS^® as a one-stop shop for all of your data manipulations. SAS can efficiently read data, create data sets without the need for multiple DATA steps, and produce Excel reports without the need for edits. This poster helps novice mainframe programmers by providing helpful tips to efficiently create reports using SAS in the mainframe environment. Topics covered are replacing JCL with SAS for reading data, efficient merging for efficient programming, and using PROQ FREQ for data quality and PROC TABULATE for superior reporting.

The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS^® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.

Recent studies suggest that unstructured data, such as customer comments or feedback, can enhance the power of existing predictive models. SAS^® Text Miner can generate singular value decomposition (SVD) units from text documents, which is a vectorial representation of terms in documents. These SVDs, when used as additional inputs along with the existing structured input variables, often prove to capture the response better. However, SVD units are sort of black box variables and are not easy to interpret or explain. This is a big hindrance to win over the decision makers in the organizations to incorporate these derived textual data components in the models. In this paper, we demonstrate a new and powerful feature in SAS^® Text Miner 12.1 that helps in explaining the SVDs or the text cluster components. We discuss two important methods that are useful to interpreting them. For this purpose, we used data from a television network company that has transcripts of its call center notes from three prior calls of each customer. We are able to extract the key terms from the call center notes in the form of Boolean rules, which have contributed to the prediction of customer churn. These rules provide an intuitive sense of which set of terms, when occurring in either the presence or absence of another set of terms in the call center notes, might lead to a churn. It also provides insights into which customers are at a bigger risk of churning from the company s services and, more importantly, why.

No matter how long you ve been programming in SAS^®, using and manipulating dates still seems to require effort. Learn all about SAS dates, the different ways they can be presented, and how to make them useful. This paper includes excellent examples for dealing with raw input dates, functions to manage dates, and outputting SAS dates into other formats. Included is all the date information you will need: date and time functions, Informats, formats, and arithmetic operations.

Understanding the actual gambling behavior of an individual over the Internet, we develop markers which identify behavioral patterns, which in turn can be used to predict the level of risk a subscriber is prone to gambling. The data set contains 4,056 subscribers. Using SAS^® Enterprise Miner^™ 12.1, a set of models are run to predict which subscriber is likely to become a high-risk internet gambler. The data contains 114 variables such as first active date and first active product used on the website as well as the characteristics of the game such as fixed odds, poker, casino, games, etc. Other measures of a subscriber s data such as money put at stake and what odds are being bet are also included. These variables provide a comprehensive view of a subscriber s behavior while gambling over the website. The target variable is modeled as a binary variable, 0 indicating a risky gambler and 1 indicating a controlled gambler. The data is a typical example of real-world data with many missing values and hence had to be transformed, imputed, and then later considered for analysis. The model comparison algorithm of SAS Enterprise Miner 12.1 was used to determine the best model. The stepwise Regression performs the best among a set of 25 models which were run using over a 100 permutations of each model. The Stepwise Regression model predicts a high-risk Internet gambler at an accuracy of 69.63% with variables such as wk4frequency and wk3frequency of bets.

The NLIN procedure fits a wide variety of nonlinear models. However, some models can be so nonlinear that standard statistical methods of inference are not trustworthy. That s when you need the diagnostic and inferential features that were added to PROC NLIN in SAS/STAT^® 9.3, 12.1, and 13.1. This paper presents these features and explains how to use them. Examples demonstrate how to use parameter profiling and confidence curves to identify the nonlinearcharacteristics of the model parameters. They also demonstrate how to use the bootstrap method to study the sampling distribution of parameter estimates and to make more accurate statistical inferences. This paper highlights how measures of nonlinearity help you diagnose models and decide on potential reparameterization. It also highlights how multithreading is used to tame the large number of nonlinear optimizations that are required for these features.

Traditional SAS^® programs typically consist of a series of SAS DATA steps, which refine input data sets until the final data set or report is reached. SAS DATA steps do not run in-database. However, SAS^® Enterprise Guide^® users can replicate this kind of iterative programming and have the resulting process flow run in-database by linking a series of SAS Enterprise Guide Query Builder tasks that output SAS views pointing at data that resides in a Teradata database, right up to the last Query Builder task, which generates the final data set or report. This session both explains and demonstrates this functionality.

Formats are an often under-valued tool in the SAS^® toolbox. They can be used in just about all domains to improve the readability of a report, or they can be used as a look-up table to recode your data. Out of the box, SAS includes a multitude of ready-defined formats that can be applied without modification to address most recode and redisplay requirements. And if that s not enough, there is also a FORMAT procedure for defining your own custom formats. This paper looks at using some of the formats supplied by SAS in some innovative ways, but primarily focuses on the techniques we can apply in creating our own custom formats.

Report automation and scheduling are very hot topics in many industries. They confer many advantages including reduced work load, elimination of repetitive tasks, generatation of accurate results, and better performance. This paper illustrates how to design an appropriate program to automate and schedule reports in SAS^® 9.1 and SAS^® Enterprise Guide^® 5.1 using a SAS^® server as well as the Windows Scheduler. The automation part includes good aspects of formatting Microsoft Excel tables using XML or VBA coding or any other formats, and conditional auto e-mailing with file attachments. We systematically walk through each step with a clear flow diagram from the data source to the final destination. We also discuss details of server-side and PC-side schedulers and how these schedulers involve invoking batch programs.

Despite its popularity in recent years, .NET development has yet to enjoy the quality, level, and depth of statistical support that has always been provided by SAS^®. And yet, many .NET applications could benefit greatly from the power of SAS and, likewise, some SAS applications could benefit from friendly graphical user interfaces (GUIs) supported by Microsoft s .NET Framework. What the author sets out to do here is to 1) outline the basic mechanics of automating SAS with .NET, 2) provide a framework and specific strategies for maintaining parallelism between the two platforms at runtime, and 3) sketch out put some simple applications that provide an exciting combination of powerful SAS analytics and highly accessible GUIs. The mechanics of automating SAS with .NET will be covered briefly. Attendees will learn the required objects and methods needed to pass information between the two platforms. The attendees will learn some strategies for organizing their projects and for writing SAS code that lends itself to automation. This will include embedding SAS scripts within a .NET project and managing communications between the two platforms. Specifically, the log and listing output will be captured and handled by .NET, and user actions will be interpreted and sent to the SAS engine. Example applications used throughout the session include a tool that converts between SAS variable types through simple drag-and-drop and an application that analyzes the growth of the user s computer hard drive.

Ensemble models combine two or more models to enable a more robust prediction, classification, or variable selection. This paper describes three types of ensemble models: boosting, bagging, and model averaging. It discusses go-to methods, such as gradient boosting and random forest, and newer methods, such as rotational forest and fuzzy clustering. The examples section presents a quick setup that enables you to take fullest advantage of the ensemble capabilities of SAS^® Enterprise Miner^™ by using existing nodes, Start Groups and End Groups nodes, and custom coding.

Effective graphs are indispensable for modern statistical analysis. They reveal tendencies that are not readily apparent in simple tables and add visual clarity to reports. My client is a big graph fan; he always shows me a lot of high-quality and complex sample graphs that were created by other software and asks me Can SAS^® duplicate these outputs? Often, by leveraging the capabilities of the ODS Graph Template Language and the SGRENDER procedure, the answer is Yes . Graph Template Language offers SAS users a more direct approach to customize the output and to overlay graphs in different levels. This paper uses cases drawn from a real work situation to demonstrate how to get the seemingly unattainable results with the power of Graph Template Language: utilizing bubble plots as your distribution density bars creating refreshing looking linear regression graphics with the slop information in the legend overlaying different plots together to create sophisticated analytical bottleneck test output

If you have been programming SAS^® for years, you have probably made Display Manager your own: customized window layout, program text colors, bookmarks, and abbreviations/keyboard macros. Now you are using SAS^® Enterprise Guide^®. Did you know you can have almost all the same modifications you had in Base SAS^® in SAS Enterprise Guide, plus more?

How do you compare group responses when the data are unbalanced or when covariates come into play? Simple averages will not do, but LS-means are just the ticket. Central to postfitting analysis in SAS/STAT^® linear modeling procedures, LS-means generalize the simple average for unbalanced data and complicated models. They play a key role both in standard treatment comparisons and Type III tests and in newer techniques such as sliced interaction effects and diffograms. This paper reviews the definition of LS-means, focusing on their interpretation as predicted population marginal means, and it illustrates their broad range of use with numerous examples.

When we start programming, we simply hope that the log comes out with no errors or warnings. Yet once we have programmed for a while, especially in the area of pharmaceutical research, we realize that having a log with specific, useful information in it improves quality and accountability. We discuss clearing the log, sending the log to an output file, helpful information to put in the log, which messages are permissible, automated log checking, adding messages regarding data changes, whether or not we want to see source code, and a few other log-related ideas. Hopefully, the log will become something that we keep in mind from the moment we start programming.

The capabilities of SAS^® have been extended by the use of macros and custom formats. SAS macro code libraries and custom format libraries can be stored in various locations, some of which may or may not always be easily and efficiently accessed from other operating environments. Code can be in various states of development ranging from global organization-wide approved libraries to very elementary just-getting-started code. Formalized yet flexible file structures for storing code are needed. SAS user environments range from standalone systems such as PC SAS or SAS on a server/mainframe to much more complex installations using multiple platforms. Strictest attention must be paid to (1) file location for macros and formats and (2) management of the lack of cross-platform portability of formats. Macros are relatively easy to run from their native locations. This paper covers methods of doing this with emphasis on: (a) the option sasautos to define the location and the search order for identifying macros being called, and (b) even more importantly the little-known SAS option MAUTOLOCDISPLAY to identify the location of the macro actually called in the saslog. Format libraries are more difficult to manage and cannot be created and run in a different operating system than that in which they were created. This paper will discuss the export, copying and importing of format libraries to provide cross-platform capability. A SAS macro used to identify the source of a format being used will be presented.

One of the most common questions about logistic regression is How do I know if my model fits the data? There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-squared) and goodness of fit tests (like the Pearson chi-square). This presentation looks first at R-squared measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.

In applied statistical practice, incomplete measurement sequences are the rule rather than the exception. Fortunately, in a large variety of settings, the stochastic mechanism governing the incompleteness can be ignored without hampering inferences about the measurement process. While ignorability only requires the relatively general missing at random assumption for likelihood and Bayesian inferences, this result cannot be invoked when non-likelihood methods are used. We will first sketch the framework used for contemporary missing-data analysis. Apart from revisiting some of the simpler but problematic methods, attention will be paid to direct likelihood and multiple imputation. Because popular non-likelihood-based methods do not enjoy the ignorability property in the same circumstances as likelihood and Bayesian inferences, weighted versions have been proposed. This holds true in particular for generalized estimating equations (GEE). Even so-called doubly-robust versions have been derived. Apart from GEE, also pseudo-likelihood based strategies can be adapted appropriately. We describe a suite of corrections to the standard form of pseudo-likelihood, to ensure its validity under missingness at random. Our corrections follow both single and double robustness ideas, and is relatively simple to apply.

Mobile devices are taking over conventional ways of sharing and presenting information in today s businesses and working environments. Accessibility to this information is a key factor for companies and institutions in order to reach wider audiences more efficiently. SAS^® software provides a powerful set of tools that allows developers to fulfill the increasing demand in mobile reporting without needing to upgrade to the latest version of the platform. Here at University of Central Florida (UCF), we were able to create reports targeting our iPad consumers at our executive level by using the SAS^® 9.2 Enterprise Business Intelligence environment, specifically SAS^® Web Report Studio 4.3. These reports provide them with the relevant data for their decision-making process. At UCF, the goal is to provide executive consumers with reports that fit on one screen in order to avoid the need of scrolling and that are easily exportable to PDF. This is done in order to respond to their demand to be able to accomodate their increasing use of portable technology to share sensitive data in a timely manner. The technical challenge is to provide specific data to those executive users requesting access through their iPad devices. Compatibility issues arise but are successfully bypassed. We are able to provide reports that fit on one screen and that can be opened as a PDF if needed. These enhanced capabilities were requested and well received by our users. This paper presents techniques we use in order to create mobile reports.

While survey researchers make great attempts to standardize their questionnaires including the usage of ratings scales in order to collect unbiased data, respondents are still prone to introducing their own interpretation and bias to their responses. This bias can potentially affect the understanding of commonly investigated drivers of customer satisfaction and limit the quality of the recommendations made to management. One such problem is scale use heterogeneity, in which respondents do not employ a panoramic view of the entire scale range as provided, but instead focus on parts of the scale in giving their responses. Studies have found that bias arising from this phenomenon was especially prevalent in multinational research, e.g., respondents of some cultures being inclined to use only the neutral points of the scale. Moreover, personal variability in response tendencies further complicates the issue for researchers. This paper describes an implementation that uses a Bayesian hierarchical model to capture the distribution of heterogeneity while incorporating the information present in the data. More specifically, SAS^® PROC MCMC is used to carry out a comprehensive modeling strategy of ratings data that account for individual level scale usage. Key takeaways include an assessment of differences between key driver analyses that ignore this phenomenon versus the one that results from our implementation. Managerial implications are also emphasized in light of the prevalent use of more simplistic approaches.

For over three decades, SAS^® has provided capabilities for beating your data into submission. In June of 2000, SAS acquired a company called DataFlux in order to add data quality capabilities to its portfolio. Recently, SAS folded DataFlux into the mother ship. With SAS^® 9.4, SAS^® Enterprise Data Integration Server and baby brother SAS^® Data Integration Server were upgraded into a series of new bundles that still include the former DataFlux products, but those products have grown. These new bundles include data management, data governance, data quality, and master data management, and come in advanced and standard packaging. This paper explores these offerings and helps you understand what this means to both new and existing customers of SAS^® Data Management and DataFlux products. We break down the marketing jargon and give you real-world scenarios of what customers are using today (prior to SAS 9.4) and walk you through what that might look like in the SAS 9.4 world. Each scenario includes the software that is required, descriptions of what each of the components do (features and functions), as well as the likely architectures that you might want to consider. Finally, for existing SAS Enterprise Data Integration Server and SAS^® Data Integration Server customers, we discuss implications for migrating to SAS Data Management and detail some of the functionality that may be new to your organization.

This paper considers the %MRE macro for estimating multivariate ratio estimates. Also, we use PROC REG to estimate multivariate regression estimates and to show that regression estimates are superior to the ratio estimates.

Two examples of Vector Autoregressive Moving Average modeling with exogenous variables are given in this presentation. Data is from the real world. One example is about a two-dimensional time series for wages and prices in Denmark that spans more than a hundred years. The other is about the market for agricultural products, especially eggs! These examples give a general overview of the many possibilities offered by PROC VARMAX, such as handling of seasonality, causality testing and Bayesian modeling, and so on.

In business environments, a common obstacle to effective data-informed decision making occurs when key stakeholders are reluctant to embrace statistically derived predicted values or forecasts. If concerns regarding model inputs, underlying assumptions, and limitations are not addressed, decision makers might choose to trust their gut and reject the insight offered by a statistical model. This presentation explores methods for converting potential critics into partners by proactively involving them in the modeling process and by incorporating simple inputs derived from expert judgment, focus groups, market research, or other directional qualitative sources. Techniques include biasing historical data, what-if scenario testing, and Monte Carlo simulations.

It is often the case that parameters in a predictive model should be restricted to an interval that is either reasonable or necessary given the model s application. A simple and classic example of such a restriction is the regression model which requires that all parameters to be positive. In the case of multiple least squares (MLS) regression, the resulting model is therefore strictly additive and, in certain applications, not only appropriate but also intuitive. This special case of an MLS model is commonly referred to as a nonnegative least squares regression. While Base SAS^® contains a multitude of ways to perform a multiple least squares regression (PROC REG and PROC GLM, to name two), there exists no native SAS^® procedure to conduct a nonnegative least squares regression. The author offers a concise way to conduct the nonnegative least squares analysis by using PRON NLIN (proc non-linear ). PROC NLIN offers user restriction on parameter estimates. By fashioning a linear model in the framework of a nonlinear procedure, the end result can be achieved. As an additional corollary, the author will show how to calculate the _RSQUARE_ statistic for the resulting model, which has been left out of the PROC NLIN output for the reason that it is invalid in most cases (though not ours).

When viewing and working with SAS^® data sets especially wide ones it s often instinctive to rearrange the variables (columns) into some intuitive order. The RETAIN statement is one of the most commonly cited methods used for ordering variables. Though RETAIN can perform this task, its use as an ordering clause can cause a host of easily missed problems due to its intended function of retaining values across DATA step iterations. This risk is especially great for the more novice SAS programmer. Instead, two equally effective and less risky ways to order data set variables are recommended, namely, the FORMAT and SQL SELECT statements.

PROC TABULATE is a powerful tool for creating tabular summary reports. Its advantages, over PROC REPORT, are that it requires less code, allows for more convenient table construction, and uses syntax that makes it easier to modify a table s structure. However, its inability to compute the sum, difference, product, and ratio of column sums has hindered its use in many circumstances. This paper illustrates and discusses some creative approaches and methods for overcoming these limitations, enabling users to produce needed reports and still enjoy the simplicity and convenience of PROC TABULATE. These methods and skills can have prominent applications in a variety of business intelligence and analytics fields.

A time-consuming part of statistical analysis is building an analytic data set for statistical procedures. Whether it is recoding input values, transforming variables, or combining data from multiple data sources, the work to create an analytic data set can take time. The DS2 programming language in SAS^® 9.4 simplifies and speeds data preparation with user-defined methods, storing methods and attributes in shareable packages, and threaded execution on multi-core SMP and MPP machines. Come see how DS2 makes your job easier.

The effectiveness of visual interpretation of the differences between pairs of LS-means in a generalized linear model includes the graph's ability to display four inferential and two perceptual tasks. Among the types of graphs which display some or all of these tasks are the forest plot, the mean-mean scatter plot (diffogram), and closely related to it, the mean-mean multiple comparison (MMC) plot. These graphs provide essential visual perspectives for interpretation of the differences among pairs of LS-means from a generalized linear model (GLM). The diffogram is a graphical option now available through ODS statistical graphics with linear model procedures such as GLIMMIX. Through combining ODS output files of the LS-means and their differences, the SGPLOT procedure can efficiently produce forest and MMC plots.

The SQL procedure contains many powerful and elegant language features for intermediate and advanced SQL users. This presentation discusses topics that will help SAS^® users unlock the many powerful features, options, and other gems found in the SQL universe. Topics include CASE logic; a sampling of summary (statistical) functions; dictionary tables; PROC SQL and the SAS macro language interface; joins and join algorithms; PROC SQL statement options _METHOD, MAGIC=101, MAGIC=102, and MAGIC=103; and key performance (optimization) issues.

Over the years, there has been a growing concern about consumption of tobacco among youth. But no concrete studies have been done to find what exactly leads the children to start consuming tobacco. This study is an attempt to figure out the potential reasons for the same. Through our analysis, we have also tried to build A model to predict whether a child would smoke next year or not. This study is based on the 2011 National Youth Tobacco Survey data of 18,867 observations. In order to prepare data for insightful analysis, imputation operations were performed on the data using tree-based imputation methods. From a pool of 197 variables, 48 key variables were selected using variable selection methods, partial least squares, and decision tree models. Logistic Regression and Decision Tree models were built to predict whether a child would smoke in the next year or not. Comparing the models using Misclassification rate as the selection criteria, we found that the Stepwise Logistic Regression Model outperformed other models with a Validation Misclassification of 0.028497, 47.19% Sensitivity and 95.80% Specificity. Factors such as company of friends, cigarette brand ads, accessibility to the tobacco products, and passive smoking turned out to be the most important predictors in determining a child smoker. After this study, we could outline some important findings like the odds of a child taking up smoking are 2.17 times high when his close friends are also smoking.

An increase in sea levels is a potential problem that is affecting the human race and marine ecosystem. Many models are being developed to find out the factors that are responsible for it. In this research, the Memory-Based Reasoning model looks more effective than most other models. This is because this model takes the previous solutions and predicts the solutions for forthcoming cases. The data was collected from NASA. The data contains 1,072 observations and 10 variables such as emissions of carbon dioxide, temperature, and other contributing factors like electric power consumption, total number of industries established, and so on. Results of Memory-Based Reasoning models like RD tree, scan tree, neural networks, decision tree, and logistic regression are compared. Fit statistics, such as misclassification rate and average squared error are used to evaluate the model performance. This analysis is used to predict the rise in sea levels in the near future and to take the necessary actions to protect the environment from global warming and natural disasters.

In this session, you learn how Kaiser Permanente has taken a centralized production support approach to using SAS^® Enterprise Guide^® 4.3 in the healthcare industry. Kaiser Permanente Northwest (KPNW) has designed standardized processes and procedures that have allowed KPNW to streamline the support of production content, which enabled KPNW analytical resources to focus more on new content development rather than on maintenance and support of steady state programs and processes. We started with over 200 individual SAS^® processes across four different SAS platforms, SAS Enterprise Guide, Mainframe SAS^®, PC SAS^® and SAS^® Data Integration Studio, in oder to standardize our development approach on SAS Enterprise Guide and build efficient and scalable processes within our department and across the region. We walk through the need for change, how the team was set up, provide an overview of the UNIX SAS platform, walk through the standard production requirements (developer pack), and review lessons learned.

Do you find it difficult to dress up your graphs for your reports or presentations? SAS^® 9.4 introduced new capabilities in ODS Graphics that give you the ability to style your graphs without creating or modifying ODS styles. Some of the new capabilities include the following: a new option for controling how ODS styles are applied graph syntax for overriding ODS style attributes for grouped plots the ability to define font glyphs and images as plot markers enhanced attribute map support In this presentation, we discuss these new features in detail, showing examples in the context of Graph Template Language and ODS Graphics procedures.

Can you juggle? Maybe. Can you shuffle a deck of cards? Probably. Can you do both at the same time? Welcome to the world of SAS^® and LSF! Very few SAS Administrators start out learning LSF at the same time they learn SAS; most already know SAS, possibly starting out as a programmer or analyst, but now have to step up to an enterprise platform with shared resources. The biggest challenge on an enterprise platform? How to share! How to maximum the utilization of a SAS platform, yet still ensure everyone gets their fair share? This presentation will boil down the 2000+ pages of LSF documentation to provide an introduction into various LSF concepts: * Host * Clusters * Nodes * Queues * First-Come-First-Serve * Fairshare * and various configuration settings: UJOB_LIMIT, PJOB_LIMIT, etc. Plus some insight on where to configure all these settings which are set up by the installation process, and which can be configured by the SAS or LSF administrator. This session is definitely NOT for experts. It is for those about to step into an enterprise deployment of SAS, and want to understand how the SAS server sessions they know so well can run on a shared platform.

Are you time-poor and code-heavy? It's easy to get into a rut with your SAS^® code, and it can be time-consuming to spend your time learning and implementing improved techniques. This presentation is designed to share quick improvements that take five minutes to learn and about the same time to implement. The quick hits are applicable across versions of SAS and require only Base SAS^® knowledge. Included topics are: simple macro tricks little-known functions that get rid of messy coding dynamic conditional logic data summarization tips to reduce data and processing testing and space utilization tips. This presentation has proven valuable to beginner through experienced SAS users.

This presentation takes a look at DirectPay, a company that collects and buys consumer claims of all types. It developed a model with SAS^® Enterprise Miner^™ to determine the risk of fraud by a debtor and a debtor's creditworthiness. This model is focused on the added value of more and better data. Since 2010, all credit and fraud scores have been calculated using DirectPay's own data and models. In addition, the presentation explores the use of SAS^® Visual Analytics as both a management information and an analytical tool since early 2013.

The project focuses on using analytics to reveal unwarranted use of access to medical records, i.e. employees in health organizations that access information about neighbours, friends, celebrities, etc., without a sound reason to do so. The method is based on the natural assumption that the vast majority of lookups are legitimate lookups that differ from a statistically defined normal behavior will be subject to manual investigation. The work was carried out in collaboration between SAS Institute Norway and the largest Norwegian hospital, Oslo University Hospital (OUS) and was aimed at establishing whether the method is suitable for unveiling unwarranted lookups in medical records. A number of so called scenarios are used to indicate adverse behaviour, each responsible for looking at one particular aspect of journal access data. For instance, one scenario determines the timeliness of a lookup relative to the patient's admission history; another judges whether the medical competency of the employee is relevant to the situation of the patient at the time of the lookup. We have so far designed and developed a library of around 20 scenarios that together are used in weighted combination to render a final judgment of the appropriateness of the lookup. The approach has been proven highly successful, and a further development of these ideas is currently being done, the aim of which is to establish a joint Norwegian solution to the problem of unwarranted access. Furthermore, we believe that the approach and the framework may be utilised in many other industries where sensitive data is being processed, such as financial, police, tax and social services. In this paper, the method is outlined, as well as results of its application on data from OUS.

Have you been programming in SAS^® for a while and just aren t sure how SAS^® Enterprise Guide^® can help you? This presentation demonstrates how SAS programmers can use SAS Enterprise Guide 5.1 as their primary interface to SAS, while maintaining the flexibility of writing their own customized code. We explore: navigating and customizing the SAS Enterprise Guide environment using SAS Enterprise Guide to access existing programs and enhance processing exploiting the enhanced development environment including syntax completion and built-in function help using SAS^® Code Analyzer, Report Builder, and Document Builder adding Project Parameters to generalize the usability of programs and processes leveraging built-in capabilities available in SAS Enterprise Guide to further enhance the information you deliver Our audience is SAS users who understand the basics of SAS programming and want to learn how to use SAS Enterprise Guide. This paper is also appropriate for users of earlier versions of SAS Enterprise Guide who want to try the enhanced features available in SAS Enterprise Guide 5.1.

This workshop provides hands-on experience using tools in the SAS^® Data Management offering. Workshop participants will use the following products: SAS^® Data Integration Studio DataFlux^® Data Management Studio SAS^® Data Management Console

This workshop provides hands-on experience using SAS^® Enterprise Miner^™ high-performance nodes. Workshop participants will do the following: learn the similarities and differences between high-performance nodes and standard nodes build a project flow using high-performance nodes extract and save a score code for model deployment

This workshop provides hands-on experience with SAS^® Visual Analytics. Workshop participants will do the following: explore data with SAS^® Visual Analytics Explorer design reports with SAS^® Visual Analytics Designer

Traditionally, Java web applications interact with back-end databases by means of JDBC/ODBC connections to retrieve and update data. With the growing need for real-time charting and complex analysis types of data representation on these types of web applications, SAS^® computing power can be put to use by adding a SAS web service layer between the application and the database. This paper shows how a SAS web service layer can be used to render data to a JAVA application in a summarized form using SAS^® Stored Processes. This paper also demonstrates how inputs can be passed to a SAS Stored Process based on which computations/summarizations are made before output parameter and/or output data streams are returned to the Java application. SAS Stored Processes are then deployed as SAS^® BI Web Services using SAS^® Management Console, which are available to the JAVA application as a URL. We use the SOAP method to interact with the web services. XML data representation is used as a communication medium. We then illustrate how RESTful web services can be used with JSON objects being the communication medium between the JAVA application and SAS in SAS^® 9.3. Once this pipeline communication between the application, SAS engine, and database is set up, any complex manipulation or analysis as supported by SAS can be incorporated into the SAS Stored Process. We then illustrate how graphs and charts can be passed as outputs to the application.

Statistical mediation analysis is common in business, social sciences, epidemiology, and related fields because it explains how and why two variables are related. For example, mediation analysis is used to investigate how product presentation affects liking the product, which then affects the purchase of the product. Mediation analysis evaluates the mechanism by which a health intervention changes norms that then change health behavior. Research on mediation analysis methods is an active area of research. Some recent research in statistical mediation analysis focuses on extracting accurate information from small samples by using Bayesian methods. The Bayesian framework offers an intuitive solution to mediation analysis with small samples; namely, incorporating prior information into the analysis when there is existing knowledge about the expected magnitude of mediation effects. Using diffuse prior distributions with no prior knowledge allows researchers to reason in terms of probability rather than in terms of (or in addition to) statistical power. Using SAS^® PROC MCMC, researchers can choose one of two simple and effective methods to incorporate their prior knowledge into the statistical analysis, and can obtain the posterior probabilities for quantities of interest such as the mediated effect. This project presents four examples of using PROC MCMC to analyze a single mediator model with real data using: (1) diffuse prior information for each regression coefficient in the model, (2) informative prior distributions for each regression coefficient, (3) diffuse prior distribution for the covariance matrix of variables in the model, and (4) informative prior distribution for the covariance matrix.

The scatter plot is a basic tool for examining the relationship between two variables. While the basic plot is good, enhancements can make it better. In addition, there might be problems of overplotting. In this paper, I cover ways to create basic and enhanced scatter plots and to deal with overplotting.

Linear regression has been a widely used approach in social and medical sciences to model the association between a continuous outcome and the explanatory variables. Assessing the model assumptions, such as linearity, normality, and equal variance, is a critical step for choosing the best regression model. If any of the assumptions are violated, one can apply different strategies to improve the regression model, such as performing transformation of the variables or using a spline model. SAS^® has been commonly used to assess and validate the postulated model and SAS^® 9.3 provides many new features that increase the efficiency and flexibility in developing and analyzing the regression model, such as ODS Statistical Graphics. This paper aims to demonstrate necessary steps to find the best linear regression model in SAS 9.3 in different scenarios where variable transformation and the implementation of a spline model are both applicable. A simulated data set is used to demonstrate the model developing steps. Moreover, the critical parameters to consider when evaluating the model performance are also discussed to achieve accuracy and efficiency.

The Purchasing Department is considering contracting with your team for a new SAS^® Enterprise BI application. He's already met with SAS^® and seen the sales pitch, and he is very interested. But the manager is a tightwad and not sure about spending the money. Also, he wants his team to be the primary developers for this new application. Before investing his money on training, programming, and support, he would like a proof-of-concept. This paper will walk you through the seven steps to create a SAS Enterprise BI POC project: Develop a kick-off meeting including a full demo of the SAS Enterprise BI tools. Set up your UNIX file systems and security. Set up your SAS metadata ACTs, users, groups, folders, and libraries. Make sure the necessary SAS client tools are installed on the developers machines. Hold a SAS Enterprise BI workshop to introduce them to the basics, including SAS^® Enterprise Guide^®, SAS^® Stored Processes, SAS^® Information Maps, SAS^® Web Report Studio, SAS^® Information Delivery Portal, and SAS^® Add-In for Microsoft Office, along with supporting documentation. Work with them to develop a simple project, one that highlights the benefits of SAS Enterprise BI and shows several methods for achieving the desired results. Last but not least, follow up! Remember, your goal is not to launch a full-blown application. Instead, we ll strive toward helping them see the potential in your organization for applying this methodology.

Big data is all the rage these days, with the proliferation of data-accumulating electronic gadgets and instrumentation. At the heart of big data analytics is the MapReduce programming model. As a framework for distributed computing, MapReduce uses a divide-and-conquer approach to allow large-scale parallel processing of massive data. As the name suggests, the model consists of a Map function, which first splits data into key-value pairs, and a Reduce function, which then carries out the final processing of the mapper outputs. It is not hard to see how these functions can be simulated with the SAS^® hash objects technique, and in reality, implemented in the new SAS^® DS2 language. This paper demonstrates how hash object programming can handle data in a MapReduce fashion and shows some potential applications in physics, chemistry, biology, and finance.

Distributing SAS^® software to a large number of machines can be challenging at best and exhausting at worst. Common areas of concern for installers are silent automation, network traffic, ease of setup, standardized configurations, maintainability, and simply the sheer amount of time it takes to make the software available to end users. We describe a variety of techniques for easing the pain of provisioning SAS software, including the new standalone SAS^® Enterprise Guide^® and SAS^® Add-in for Microsoft Office installers, as well as the tried and true SAS^® Deployment Wizard record and playback functionality. We also cover ways to shrink SAS Software Depots, like the new 'subsetting recipe' feature, in order to ease scenarios requiring depot redistribution. Finally, we touch on alternate methods for workstation access to SAS client software, including application streaming, desktop virtualization, and Java Web Start.

All the documentation about the creation of graphs with SAS^® software states that ODS Graphics is not intended to replace SAS/GRAPH^®. However, ODS Graphics is included in the Base SAS^® license from SAS^® 9.3, but SAS/GRAPH still requires an additional component license, so there is definitely a financial incentive to convert to ODS Graphics. This paper gives examples that can be used to replace commonly created SAS/GRAPH plots, and highlights the small number of plots that are still very difficult, or impossible, to create in ODS Graphics.

Have you ever needed to use dates as values to loop through a table? For example, how many events occurred by 1, 2 , 3 & n months ahead? Maybe you just changed the dates manually and re-ran the query n times? This is a common need in economic and behavioral sciences. This presentation demonstrates how to create a table of dates that can be used with SAS^® macro variables to loop through a table. Using this dates table in combination with the SAS DO loop ensures accuracy and saves time.

Westat utilizes SAS^® software as a core capability for providing clients in government and private industry with analysis and characterization of survey data. Staff programmers, analysts, and statisticians use SAS to manage, store, and analyze client data, as well as to produce tabulations, reports, graphs, and summary statistics. Because SAS is so widely used at Westat, the organization has built a comprehensive infrastructure to support its deployment and use. This paper provides an overview of Westat s SAS support infrastructure, which supplies resources that are aimed at educating staff, strengthening their SAS skills, providing SAS technical support, and keeping the staff on the cutting edge of SAS programming techniques.

SAS^® platform installations are large, complex, growing, and ever-changing enterprise systems that support many diverse groups of users and content. A reliable metadata security implementation is critical for providing access to business resources in a methodical, organized, partitioned, and protected manner. With natural changes to users, groups, and folders from an organization s day-to-day activities, deviations from an original metadata security plan are very likely and can put protected resources at risk. Regular security testing can ensure compliance, but, given existing administrator commitments and the time consuming nature of manual testing procedures, it doesn't tend to happen. This paper discusses concepts and outlines several example test specifications from an automated metadata security testing framework being developed by Metacoda. With regularly scheduled, automated testing, using a well-defined set of test rules, administrators can focus on their other work, and let alerts notify them of any deviations from a metadata security test specification.

With smartphone and mobile apps market developing so rapidly, the expectations about effectiveness of mobile applications is high. Marketers and app developers need to analyze huge data available much before the app release, not only to better market the app, but also to avoid costly mistakes. The purpose of this poster is to build models to predict the success rate of an app to be released in a particular category. Data has been collected for 540 android apps under the Top free newly released apps category from https://play.google.com/store . The SAS^® Enterprise Miner^™ Text Mining node and SAS^® Sentiment Analysis Studio are used to parse and tokenize the collected customer reviews and also to calculate the average customer sentiment score for each app. Linear regression, neural, and auto-neural network models have been built to predict the rank of an app by considering average rating, number of installations, total number of reviews, number of 1-5 star ratings, app size, category, content rating, and average customer sentiment score as independent variables. A linear regression model with least Average Squared Error is selected as the best model, and number of installations, app maturity content are considered as significant model variables. App category, user reviews, and average customer sentiment score are also considered as important variables in deciding the success of an app. The poster summarizes the app success trends across various factors and also introduces a new SAS^® macro %getappdata, which we have developed for web crawling and text parsing.

'Can I have that in Excel?' This is a request that makes many of us shudder. Now your boss has discovered Microsoft Excel pivot tables. Unfortunately, he has not discovered how to make them. So you get to extract the data, massage the data, put the data into Excel, and then spend hours rebuilding pivot tables every time the corporate data is refreshed. In this workshop, you learn to be the armchair quarterback and build pivot tables without leaving the comfort of your SAS^® environment. In this workshop, you learn the basics of Excel pivot tables and, through a series of exercises, you learn how to augment basic pivot tables first in Excel, and then using SAS. No prior knowledge of Excel pivot tables is required.

The SAS^® Enterprise Guide^® Query Builder is one of the most powerful components of the software. It enables a user to bring in data, join, drop and add columns, compute new columns, sort, filter data, leverage the advanced expression builder, change column attributes, and more! This presentation provides an overview of the major features of this powerful tool and how to leverage it every day.

This is the way I have always done it and it works fine for me. Have you heard yourself or others say this when someone suggests a new technique to help solve a problem? Most of us have a set of tricks and techniques from which we draw when starting a new project. Over time we might overlook newer techniques because our old toolkit works just fine. Sometimes we actively avoid new techniques because our initial foray leaves us daunted by the steep learning curve to mastery. For me, the PRX functions and the SAS^® hash object fell into this category. In this workshop, we address possible objections to learning to use the SAS hash object. We start with the fundamentals of setting up the hash object and work through a variety of practical examples to help you master this powerful technique.

Applying models to analyze sports data has always been done by teams across the globe. The film Moneyball has generated much hype about how a sports team can use data and statistics to build a winning team. The objective of this poster is to use the model comparison algorithm of SAS^® Enterprise Miner^™ to pick the best model that can predict the outcome of a soccer game. It is hence important to determine which factors influence the results of a game. The data set used contains input variables about a team s offensive and defensive abilities and the outcome of a game is modeled as a target variable. Using SAS Enterprise Miner, multinomial regression, neural networks, decision trees, ensemble models and gradient boosting models are built. Over 100 different versions of these models are run. The data contains statistics from the 2012-13 English premier league season. The competition has 20 teams playing each other in a home and away format. The season has a total of 380 games; the first 283 games are used to predict the outcome of the last 97 games. The target variable is treated as both nominal variable and ordinal variable with 3 levels for home win, away win, and tie. The gradient boosting model is the winning model which seems to predict games with 65% accuracy and identifies factors such as goals scored and ball possession as more important compared to fouls committed or red cards received.

In the traveling salesman problem, a salesman must minimize travel distance while visiting each of a given set of cities exactly once. This paper uses the SAS/OR^® OPTMODEL procedure to formulate and solve the traveling baseball fan problem, which complicates the traveling salesman problem by incorporating scheduling constraints: a baseball fan must visit each of the 30 Major League ballparks exactly once, and each visit must include watching a scheduled Major League game. The objective is to minimize the time between the start of the first attended game and the end of the last attended game. One natural integer programming formulation involves a binary decision variable for each scheduled game, indicating whether the fan attends. But a reformulation as a side-constrained network flow problem yields much better solver performance.

This new SAS^® tool is a two-dimensional color chart for visualizing changes in a population or in a system over time. Data for one point in time appear as a thin horizontal band of color. Bands for successive periods are stacked up to make a two-dimensional plot, with the vertical direction showing changes over time. As a system evolves over time, different kinds of events have different characteristic patterns. Creation of Time Contour plots is explained step-by-step. Examples are given in astrostatistics, biostatistics, econometrics, and demographics.

As a longtime Base SAS^® programmer, whether to use a different application for programming is a constant question when powerful applications such as SAS^® Enterprise Guide^® are available. This paper provides some important tips for a programmer, such as the best way to use the code window and how to take advantage of system-generated code in SAS Enterprise Guide 5.1. This paper also explains the differences between some of the functions and procedures in Base SAS and SAS Enterprise Guide. It highlights features in SAS Enterprise Guide such as process flow, data access management, and report automation, including formatting using XML tag sets.

This paper gives you a better idea of how and where to use the record lookup functions to locate observations where a variable has some characteristic. Various related functions are illustrated to search numeric and character values in this process. Code is shown with time comparisons. I will discuss three possible ways to retrieve records using the SAS^® DATA step, PROC SQL, and Perl regular expressions. Real and CPU time processing issues will be highlighted when comparing to retrieve records using these methods. Although the program is written for the PC using SAS^® 9.2 in a Windows XP 32-bit environment, all the functions are applicable to any system. All the tools discussed are in Base SAS^®. The typical attendee or reader will have some experience in SAS, but not a lot of experience dealing with large amount of data.

This paper introduces basic-to-advanced strategies and syntax, the tools of the SAS^® trade, that enable client-quality PDF output to be delivered through a production system of macro programs. A variety of PROC REPORT output with proven client value serves to illustrate a discussion of the fundamental syntax used to create and share formats, macro programs, PROC REPORT output, inline styles, and style templates. The syntax is integrated into basic macro programs that demonstrate the the core functionality of the reporting system. Later sections of the paper describe in detail the macro programs used to start and end a PDF: (a) programs to save all current titles, footnotes, and option settings, establish standard titles, footnotes and option settings, and initially create the PDF document; and (b) programs to create a final standard data documentation page, end the PDF, and restore all original titles, footnotes, and option settings. The paper also shows how macro programs enable the setting of inline styles at the global, macro program, and macro program call-levels. The paper includes the style template syntax and the complete PROC REPORT syntax generated by the macro programs, and is designed for the intermediate to advanced SAS programmer using Foundation SAS^® for Release 9.2 on a Windows operating system.

SAS^® Add-In for Microsoft Office remains a popular tool for people who are not SAS^® programmers due to its easy interface with the SAS servers. In this session, you'll learn some of the many tricks that other organizations use for getting more value out of the tool.

As SAS^® professionals, we often wish our clients would make more use of the many excellent SAS tools at their disposal. However, it remains an indisputable fact that for many business users, Microsoft Excel is still their go-to application when it comes to carrying out any form of data analysis. There have been many attempts to integrate SAS and Excel, but none of these has up to now been entirely seamless. This paper addresses that problem by showing how, with a minimum of VBA (Visual Basic for Applications) code and by using the SAS Integrated Object Model (IOM) together with Microsoft s ActiveX Data Objects (ADO), we can create an Excel User Defined Function (UDF) that can accept parameters, carry out all data manipulations in SAS, and return the result to the spreadsheet in a way that is completely invisible to the user. They can nest or link these functions together just as if they were native Excel functions. We then go on to demonstrate how, using the same techniques, we can create small Excel applications that can perform sophisticated data analyses in SAS while not forcing users out of their Excel comfort zones.

You don't have to be with the CIA to discover why your SAS^® stored process is producing clandestine results. In this talk, you will learn how to use prompts to get the results you want, work with the metadata to ensure correct results, and even pick up simple coding tricks to improve performance. You will walk away with a new decoder ring that allows you to discover the secrets of the SAS logs!

Understanding previous research in key domain areas can help R&D organizations focus new research in non-duplicative areas and ensure that future endeavors do not repeat the mistakes of the past. However, manual analysis of previous research efforts can prove insufficient to meet these ends. This paper highlights how a combination of SAS^® Text Analytics and SAS^® Visual Analytics can deliver the capability to understand key topics and patterns in previous research and how it applies to a current research endeavor. We will explore these capabilities in two use cases. The first will be in uncovering trends in publicly visible government funded research (SBIR) and how these trends apply to future research in nanotechnology. The second will be visualizing past research trends in publicly available NASA publications, and how these might impact the development of next-generation spacecraft.

The DOW-loop is not official terminology that one can find in SAS^® documentation, but it has been well known and widely used among experienced SAS programmers. The DOW-loop was developed over a decade ago by a few SAS gurus, including Don Henderson, Paul Dorfman, and Ian Whitlock. A common construction of the DOW-loop consists of a DO-UNTIL loop with a SET and a BY statement within the loop. This construction isolates actions that are performed before and after the loop from the action within the loop, which results in eliminating the need for retaining or resetting the newly created variables to missing in the DATA step. In this talk, in addition to explaining the DOW-loop construction, we review how to apply the DOW-loop to various applications.

Are you a Java programmer who has been asked to work with SAS^®, or a SAS programmer who has been asked to provide an interface to your IT colleagues? Let s face it, not a lot of Java programmers are heavy SAS users. If this is the case in your company, then you are in luck because SAS provides a couple of really slick features to allow Java programmers to access both SAS data and SAS programming from within a Java program. This paper walks beginner Java or SAS programmers through the simple task of accessing SASdata and SAS programs from a Java program. All that you need is a Java environment and access to a running SAS process, such as a SAS server. This SAS server can either be a SAS/SHARE^® server or an IOM server. However, if you do not have either of these two servers that is okay; with the tools that are provided by SAS, you can start up a remote SAS session within Java and harness the power of SAS.

Have you found OS file permissions to be insufficient to tailor access controls to meet your SAS^® data security requirements? Have you found metadata permissions on tables useful for restricting access to SAS data, but then discovered that SAS programmers can avoid the permissions by issuing LIBNAME statements that do not use the metadata? Would you like to ensure that users have access to only particular rows or columns in SAS data sets, no matter how they access the SAS data sets? Metadata-bound libraries provide the ability to authorize access to SAS data by authenticated Metadata User and Group identities that cannot be bypassed by SAS programmers who attempt to avoid the metadata with direct LIBNAME statements. They also provide the ability to limit the rows and columns in SAS data sets that an authenticated user is allowed to see. The authorization decision is made in the bowels of the SAS^® I/O system, where it cannot be avoided when data is accessed. Metadata-bound libraries were first implemented in the second maintenance release of SAS^® 9.3 and were enhanced in SAS^® 9.4. This paper overviews the feature and discusses best practices for administering libraries bound to metadata and user experiences with bound data. It also discusses enhancements included in the first maintenance release of SAS 9.4.

SAS^® has a wide variety of functions and call routines available. More and more operating system-level functionality has become available as part of SAS language and functions over the versions of SAS. However, there is a wealth of other operating system functionality that can be accessed from within SAS with some preparation on the part of the SAS programmer. Much of the Microsoft Windows functionality is stored in easily re-usable system DLL (Dynamic Link Library) files. This paper describes some of the Windows functionality that might not be available directly as part of SAS language. It also describes methods of accessing that functionality from within SAS code. Using the methods described here, practically any Windows API should become accessible. User-created DLL functionality should also be accessible to SAS programs.

Regression is a helpful statistical tool for showing relationships between two or more variables. However, many users can find the barrage of numbers at best unhelpful, and at worst undecipherable. Using the shipments and inventories historical data from the U.S. Census Bureau's office of Manufacturers' Shipments, Inventories, and Orders (M3), we can create a graphical representation of two time series with PROC GPLOT and map out reported and expected results. By combining this output with results from PROC REG, we are able to highlight problem areas that might need a second look. The resulting graph shows which dates have abnormal relationships between our two variables and presents the data in an easy-to-use format that even users unfamiliar with SAS^® can interpret. This graph is ideal for analysts finding problematic areas such as outliers and trend-breakers or for managers to quickly discern complications and the effect they have on overall results.

The new Markov chain Monte Carlo (MCMC) procedure introduced in SAS/STAT^® 9.2 and further exploited in SAS/STAT^® 9.3 enables Bayesian computations to run efficiently with SAS^®. The MCMC procedure allows one to carry out complex statistical modeling within Bayesian frameworks under a wide spectrum of scientific research; in psychometrics, for example, the estimation of item and ability parameters is a kind. This paper describes how to use PROC MCMC for Bayesian inferences of item and ability parameters under a variety of popular item response models. This paper also covers how the results from SAS PROC MCMC are different from or similar to the results from WinBUGS. For those who are interested in the Bayesian approach to item response modeling, it is exciting and beneficial to shift to SAS, based on its flexibility of data managements and its power of data analysis. Using the resulting item parameter estimates, one can continue to test form constructions, test equatings, etc., with all these test development processes being accomplished with SAS!

Developing a good graph with ODS statistical graphics becomes a challenge when the input data maps to crowded displays with overlapping points or lines. Such is the case with the Framingham Heart Study of 5209 subjects captured in the Sashelp.Heart data set, a series of 100 booking curves for the airline industry, and three interleaving series plots that capture closing stock values over a twenty year period for three giants in the computer industry. In this paper, transparency, layering, data point rounding, and color coding are evaluated for their effectiveness to add visual clarity to graphics output. SAS^® Graph Template Language plotting statements (compatible with SAS^® 9.2) that are referenced in this paper include HISTOGRAM, SCATTERPLOT, BANDPLOT, and SERIESPLOT, as well as the layout statements OVERLAY, DATAPANEL, LATTICE, and GRIDDED, which produce single or multiple-panel graphs. SAS Graph Template Language is chosen over ODS Graphics procedures because of its greater graphics capability. While the original version of the paper used SAS 9.2, the latest version incorporates SAS^® 9.3 updates such as HEATMAPPARM for heat maps that add a third dimension to a graph via color, and the RANGEATTRMAP statement for grouping continuous data in a legend. If you have a license for SAS 9.3, you automatically have access to Graph Template Language. Since this is not a tutorial, you will get more out of this presentation if you have read introductory papers or Warren Kuhfeld s book Statistical Graphics in SAS^®: An Introduction to the Graph Template Language and the Statistical Graphics Procedures .

Dataprev has become the principal owner of social data on the citizens in Brazil by collecting information for over forty years in order to subsidize pension applications for the government. The use of this data can be expanded to provide new tools to aid policy and assist the government to optimize the use of its resources. Using SAS^® MDM, we are developing a solution that uniquely identifies the citizens of Brazil. Overcoming challenges with multiple government agencies and with the validation of survey records that suggest the same person requires rules for governance and a definition of what represents a particular Brazilian citizen. In short, how do you turn a repository of master data into an efficient catalyst for public policy? This is the goal for creating a repository focused on identifying the citizens of Brazil.

This presentation will teach the audience how to use SAS^® ODS Graphics. Now part of Base SAS^®, ODS Graphics is a great way to easily create clear graphics that enable any user to tell their story well. SGPLOT and SGPANEL are two of the procedures that can be used to produce powerful graphics that used to require a lot of work. The core of the procedures are explained, as well as the options available. Furthermore, we explore the ways to combine the individual statements to make more complex graphics that tell the story better. Any user of Base SAS on any platform will find great value from the SAS ODS Graphics procedures.

When reading data files or writing SAS^® programs, we are often hunting for the right format or informat. There are so many to choose from! Does it seem like too many to search the manual? Let SAS help find the right one! We use the SAS dictionary table VFORMAT and a very small SAS program. This presentation demonstrates how two simple functions unlock the potential of this great resource: SASHELP.VFORMAT.

Volatility estimation plays an important role in the elds of statistics and nance. Many different techniques address the problem of estimating volatility of nancial assets. Autoregressive conditional heteroscedasticity (ARCH) models and the related generalized ARCH models are popular models for volatility. This talk will introduce the need for volatility modeling as well as introduce the framework of ARCH and GARCH models. A brief discussion about the structure of ARCH and GARCH models will then be compared to other volatility modeling techniques.

Missing observations caused by dropouts or skipped visits present a problem in studies of longitudinal data. When the analysis is restricted to complete cases and the missing data depend on previous responses, the generalized estimating equation (GEE) approach, which is commonly used when the population-average effect is of primary interest, can lead to biased parameter estimates. The new GEE procedure in SAS/STAT^® 13.2 implements a weighted GEE method, which provides consistent parameter estimates when the dropout mechanism is correctly specified. When none of the data are missing, the method is identical to the usual GEE approach, which is available in the GENMOD procedure. This paper reviews the concepts and statistical methods. Examples illustrate how you can apply the GEE procedure to incomplete longitudinal data.

Do you know everything you need to know about missing values? Do you know how to assign a missing value to multiple variables with one statement? Can you display missing values as something other than . or blank? How many types of missing numeric values are there? This paper reviews techniques for assigning, displaying, referencing, and summarizing missing values for numeric variables and character variables.

Over the last year, the SAS^® Enterprise Miner^™ development team has made numerous and wide-ranging enhancements and improvements. New utility nodes that save data, integrate better with open-source software, and register models make your routine tasks easier. The area of time series data mining has three new nodes. There are also new models for Bayesian network classifiers, generalized linear models (GLMs), support vector machines (SVMs), and more.

The DATA step allows one to read, write, and manipulate many types of data. As data evolves to a more free-form state, the ability of SAS^® to handle character data becomes increasingly important. This paper addresses character data from multiple vantage points. For example, what is the default length of a character string, and why does it appear to change under different circumstances? What type of formatting is available for character data? How can we examine and manipulate character data? The audience for this paper is beginner to intermediate, and the goal is to provide an introduction to the numerous character functions available in SAS, including the basic LENGTH and SUBSTR functions, plus many others.

SAS^® Visual Analytics is one of the newer SAS^® products with a lot of excitement surrounding it. But what is SAS Visual Analytics really? By examining the similarities, differences, and synergies between SAS Visual Analytics and other SAS offerings, we can more clearly understand this new product.