General Skill Papers A-Z

A
Paper 1815-2014:
A Case Study: Performance Analysis and Optimization of SAS® Grid Computing Scaling on a Shared Storage
SAS® Grid Computing is a scale-out SAS® solution that enables SAS applications, which are extremely I/O and compute intensive, to better utilize computing resources. It requires high-performance shared storage (SS) that allows all servers to access the same file systems. SS may be implemented via traditional NFS NAS or clustered file systems (CFS) like GPFS. This paper uses the Lustre file system, a parallel, distributed CFS, for a case study of the performance scalability of SAS Grid Computing nodes on SS. The paper qualifies the performance of a standardized SAS workload running on Lustre at scale. Lustre has traditionally been used for large and sequential I/O. We will record and present the tuning changes necessary for the optimization of Lustre for SAS applications. In addition, results from the scaling of SAS Cluster jobs running on Lustre will be presented.
Suleyman Sair, Intel
Ying Zhang, Intel Corporation
Paper 1876-2014:
A Mental Health and Risk Behavior Analysis of American Youth Using PROC FACTOR and SURVEYLOGISTIC
The current study looks at recent health trends and behavior analyses of youth in America. Data used in this analysis was provided by the Centers for Disease Control and Prevention and gathered using the Youth Risk Behavior Surveillance System (YRBSS). A factor analysis was performed to identify and define latent mental health and risk behavior variables. A series of logistic regression analyses were then performed using the risk behavior and demographic variables as potential contributing factors to each of the mental health variables. Mental health variables included disordered eating and depression/suicidal ideation data, while the risk behavior variables included smoking, consumption of alcohol and drugs, violence, vehicle safety, and sexual behavior data. Implications derived from the results of this research are a primary focus of this study. Risks and benefits of using a factor analysis with logistic regression in social science research will also be discussed in depth. Results included reporting differences between the years of 1991 and 2011. All results are discussed in relation to current youth health trend issues. Data was analyzed using SAS® 9.3.
Deanna Schreiber-Gregory, North Dakota State University
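The two-stage approach described in the abstract can be sketched in SAS as follows; the data set, item, and design variable names here are hypothetical stand-ins, and the actual survey design variables (strata, cluster, weight) would come from the YRBSS documentation:

```
/* Stage 1: identify latent risk-behavior factors from hypothetical items */
proc factor data=yrbs method=principal rotate=varimax nfactors=2;
   var smoke_freq alc_freq drug_use fight_freq seatbelt_use;
run;

/* Stage 2: survey-weighted logistic regression on a mental health outcome */
proc surveylogistic data=yrbs;
   strata stratum;
   cluster psu;
   weight samplewt;
   model depression(event='1') = smoke_freq alc_freq sex grade;
run;
```

The STRATA, CLUSTER, and WEIGHT statements are what distinguish PROC SURVEYLOGISTIC from PROC LOGISTIC: they make the standard errors respect the complex sampling design.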
Paper 1752-2014:
A Note on Type Conversions and Numeric Precision in SAS®: Numeric to Character and Back Again
One of the first lessons that SAS® programmers learn on the job is that numeric and character variables do not play well together, and that type mismatches are one of the more common sources of errors in their otherwise flawless SAS programs. Luckily, converting variables from one type to another in SAS (that is, casting) is not difficult, requiring only the judicious use of either the input() or put() function. There remains, however, the danger of data being lost in the conversion process. This type of error is most likely to occur in cases of character-to-numeric variable conversion, most especially when the user does not fully understand the data contained in the data set. This paper will review the basics of data storage for character and numeric variables in SAS, the use of formats and informats for conversions, and how to ensure accurate type conversion of even high-precision numeric values.
Andrew Clapson, Statistics Canada
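The two casts the abstract refers to can be sketched as follows; the data set and variable names are hypothetical:

```
data converted;
   set mydata;
   /* Character to numeric: INPUT with a suitable informat */
   amount_num = input(amount_char, best32.);
   /* Numeric to character: PUT with a format wide enough to avoid
      silently losing precision */
   id_char = put(id_num, z10.);
run;
```

The choice of informat and format width is where precision is most often lost, which is the pitfall the paper examines.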
Paper 1793-2014:
A Poor/Rich SAS® User's PROC EXPORT
Have you ever wished that with one click you could copy any SAS® data set, including variable names, so that you could paste the text into a Microsoft Word file, Microsoft PowerPoint slide, or spreadsheet? You can and, with just Base SAS®, there are some little-known but easy-to-use methods that are available for automating many of your (or your users') common tasks.
Tom Abernathy, Pfizer, Inc.
Matthew Kastin, I-Behavior, Inc.
Arthur Tabachneck, myQNA, Inc.
Paper 1822-2014:
A Stepwise Algorithm for Generalized Linear Mixed Models
Stepwise regression includes regression models in which the predictive variables are selected by an automated algorithm. The stepwise method involves two approaches: backward elimination and forward selection. Currently, SAS® has three procedures capable of performing stepwise regression: REG, LOGISTIC, and GLMSELECT. PROC REG handles the linear regression model, but does not support a CLASS statement. PROC LOGISTIC handles binary responses and allows for logit, probit, and complementary log-log link functions. It also supports a CLASS statement. The GLMSELECT procedure performs selections in the framework of general linear models. It allows for a variety of model selection methods, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). PROC GLMSELECT also supports a CLASS statement. We present a stepwise algorithm for generalized linear mixed models for both marginal and conditional models. We illustrate the algorithm using data from a longitudinal epidemiology study aimed at investigating parents' beliefs, behaviors, and feeding practices that associate positively or negatively with indices of sleep quality.
Jorge Morel, Procter and Gamble
Nagaraj Neerchal, University of Maryland Baltimore County
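For comparison with the mixed-model algorithm the paper presents, stepwise selection in the (non-mixed) general linear model framework mentioned above looks like this in PROC GLMSELECT; the data set and effects are hypothetical:

```
proc glmselect data=study;
   class treatment;
   /* Significance-level-based stepwise selection, with entry and
      stay thresholds of 0.15 */
   model sleep_quality = treatment age bmi feeding_score
         / selection=stepwise(select=sl slentry=0.15 slstay=0.15);
run;
```

PROC GLMSELECT handles the CLASS effects that PROC REG cannot, which is why the abstract singles it out.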
Paper 1204-2014:
A Survey of Some of the Most Useful SAS® Functions
SAS® functions provide amazing power to your DATA step programming. Some of these functions are essential; others save you from writing volumes of unnecessary code. This paper covers some of the most useful SAS functions. Some of these functions might be new to you, and they will change the way you program and approach common programming tasks.
Ron Cody, Camp Verde Associates
Paper 1636-2014:
A Way to Fetch User Reviews from iTunes Using SAS®
This paper develops a new SAS® macro that allows you to scrape user text reviews from the Apple iTunes Store for iPhone applications. It not only can help you understand your customers' experiences and needs, but also can help you be aware of your competitors' user experiences. The macro uses the iTunes API and PROC HTTP in SAS to extract the reviews and create data sets. This paper also shows how you can use the application ID and country code to extract user reviews.
Goutam Chakraborty, Oklahoma State University
Meizi Jin, Oklahoma State University
Jiawen Liu, Qualex Consulting Services, Inc.
Mantosh Kumar Sarkar, Verizon
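A minimal sketch of the PROC HTTP call such a macro might wrap; the application ID and the iTunes customer-reviews RSS endpoint shown here are illustrative assumptions and should be checked against current Apple documentation:

```
filename resp temp;

/* Fetch one page of customer reviews for a (hypothetical) app ID */
proc http
   url="https://itunes.apple.com/us/rss/customerreviews/id=284882215/json"
   method="GET"
   out=resp;
run;

/* The JSON response held in RESP would then be parsed into SAS data sets */
```

The country code in the URL path (`us` above) is how the macro described in the paper would select reviews by storefront.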
Paper 1277-2014:
Adding Serial Numbers to SQL Data
Structured Query Language (SQL) does not recognize the concept of row order. Instead, query results are thought of as unordered sets of rows. Most workarounds involve including serial numbers, which can then be compared or subtracted. This presentation illustrates and compares five techniques for creating serial numbers.
Howard Schreier, Howles Informatics
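Two of the more common workarounds can be sketched as follows (MONOTONIC is undocumented and unsupported, so the DATA step route is usually the safer choice); data set names are hypothetical:

```
/* DATA step: _N_ records the order in which rows are read */
data withkey;
   set have;
   serial = _n_;
run;

/* PROC SQL: the undocumented MONOTONIC() function */
proc sql;
   create table withkey2 as
   select monotonic() as serial, *
   from have;
quit;
```

Once a serial number exists, rows can be compared or subtracted by joining on offset key values, which is the use case the presentation describes.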
Paper 1850-2014:
Adding the Power of DataFlux® to SAS® Programs Using the DQMATCH Function
The SAS® Data Quality Server allows SAS® programmers to integrate the power of DataFlux® into their data cleaning programs. The power of SAS Data Quality Server enables programmers to efficiently identify matching records across different datasets when exact matches are not present. During a recent educational research project, the DQMATCH function proved very capable when trying to link records from disparate data sources. Two key insights led to even greater success in linking records. The first insight was acknowledging that the hierarchical structure of data can greatly improve success in matching records. The second insight was that the names of individuals can be restructured to improve the chances of successful matches. This paper provides an overview of how these insights were implemented using the DQMATCH function to link educational data from multiple sources.
Lee Branum-Martin, Georgia State University
Pat Taylor, UH
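A minimal sketch of DQMATCH usage; the data set, variable, sensitivity, and locale here are illustrative, and the appropriate Quality Knowledge Base locale must be loaded first (for example, with the %DQLOAD autocall macro):

```
data matched;
   set students;   /* hypothetical source data */
   /* Generate a match code for the name using the Name match
      definition at sensitivity 85 */
   name_mc = dqMatch(full_name, 'Name', 85, 'ENUSA');
run;

/* Records whose match codes agree can then be linked as the same person */
```

Restructuring names before generating match codes (for example, into a consistent last-name-first form) is one of the two insights the paper reports.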
Paper 1570-2014:
Adjusting Clustering: Minimize Your Suffering!
Cluster (group) randomization in trials is increasingly used over patient-level randomization. There are many reasons for this, including more pragmatic trials associated with comparative effectiveness research. Examples of clusters that could be randomized for study are clinics or hospitals, counties within a state, and other geographical areas such as communities. In many of these trials, the number of clusters is relatively small. This can be a problem if there are important covariates at the cluster level that are not balanced across the intervention and control groups. For example, if we randomize eight counties, a simple randomization could put all counties with high socioeconomic status in one group or the other, leaving us without good comparison data. There are strategies to prevent an unlucky cluster randomization. These include matching, stratification, minimization and covariate-constrained randomization. Each method is discussed, and a county-level Health Economics example of covariate-constrained randomization is shown for intermediate SAS® users working with SAS® Foundation for Release 9.2 and SAS/STAT® on a Windows operating system.
Brenda Beaty, University of Colorado
L. Miriam Dickinson, University of Colorado
Paper 1832-2014:
Agile Marketing in a Data-Driven World
The operational tempo of marketing in a digital world seems faster every day. New trends, issues, and ideas appear and spread like wildfire, demanding that sales and marketing adapt plans and priorities on the fly. The technology available to us can handle this, but traditional organizational processes often become the bottleneck. The solution is a new management approach called agile marketing. Drawing upon the success of agile software development and the lean start-up movement, agile marketing is a simple but powerful way to make marketing teams more nimble and marketing programs more responsive. You don't have to be small to be agile; agile marketing has thrived at large enterprises such as Cisco and EMC. This session covers the basics of agile marketing: what it is, why it works, and how to get started with agile marketing in your own team. In particular, we look at how agile marketing dovetails with the explosion of data-driven management in marketing by using the fast feedback from analytics to iterate and adapt new marketing programs in a rapid yet focused fashion.
Scott Brinker, ion interactive, inc.
Paper 1720-2014:
An Ensemble Approach for Integrating Intuition and Models
Finding groups with similar attributes is at the core of knowledge discovery. To this end, Cluster Analysis automatically locates groups of similar observations. Despite successful applications, many practitioners are uncomfortable with the degree of automation in Cluster Analysis, which causes intuitive knowledge to be ignored. This is especially true in text mining applications, since individual words have meaning beyond the data set. Discovering groups with similar text is extremely insightful. However, blind applications of clustering algorithms ignore intuition and hence are unable to group similar text categories. The challenge is to integrate the power of clustering algorithms with the knowledge of experts. We demonstrate how SAS/STAT® 9.2 procedures and the SAS® Macro Language are used to ensemble the opinion of domain experts with multiple clustering models to arrive at a consensus. The method has been successfully applied to a large data set with structured attributes and unstructured opinions. The result is the ability to discover observations with similar attributes and opinions by capturing the wisdom of the crowds, whether man or model.
Masoud Charkhabi, Canadian Imperial Bank of Commerce
Ling Zhu, CIBC
Paper SAS039-2014:
An Insider's Guide to SAS/ACCESS® Interface to ODBC
SAS/ACCESS® Interface to ODBC has been around forever. On one level, ODBC is very easy to use. That ease hides the flexibility that ODBC offers. This presentation uses examples to show you how to increase your program's performance and troubleshoot problems. You will learn: the differences between ODBC and OLE DB; what the odbc.ini file is (and why it is important); how to discover what your ODBC driver is actually doing; and the difference between a native ACCESS engine and SAS/ACCESS Interface to ODBC.
Jeff Bailey, SAS
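A basic DSN-based connection of the kind the presentation covers; the data source name, credentials, and table are illustrative:

```
/* Connect through an ODBC data source defined on the machine
   (in odbc.ini on UNIX, or the ODBC Data Source Administrator on Windows) */
libname mydb odbc datasrc="SalesDSN" user=myuser password=XXXXXXXX;

proc print data=mydb.orders(obs=10);
run;
```

Because the engine passes work to the driver, the same LIBNAME can behave very differently depending on how the underlying ODBC driver is configured, which is where the troubleshooting techniques in the talk come in.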
Paper 1869-2014:
An Intermediate Primer to Estimating Linear Multilevel Models Using SAS® PROC MIXED
This paper expands upon A Multilevel Model Primer Using SAS® PROC MIXED in which we presented an overview of estimating two- and three-level linear models via PROC MIXED. However, in our earlier paper, we, for the most part, relied on simple options available in PROC MIXED. In this paper, we present a more advanced look at common PROC MIXED options used in the analysis of social and behavioral science data, as well as introduce users to two different SAS macros previously developed for use with PROC MIXED: one to examine model fit (MIXED_FIT) and the other to examine distributional assumptions (MIXED_DX). Specific statistical options presented in the current paper include (a) PROC MIXED statement options for estimating statistical significance of variance estimates (COVTEST, including problems with using this option) and estimation methods (METHOD =), (b) MODEL statement option for degrees of freedom estimation (DDFM =), and (c) RANDOM statement option for specifying the variance/covariance structure to be used (TYPE =). Given the importance of examining model fit, we also present methods for estimating changes in model fit through an illustration of the SAS macro MIXED_FIT. Likewise, the SAS macro MIXED_DX is introduced to remind users to examine distributional assumptions associated with two-level linear models. To maintain continuity with the 2013 introductory PROC MIXED paper, thus providing users with a set of comprehensive guides for estimating multilevel models using PROC MIXED, we use the same real-world data sources that we used in our earlier primer paper.
Bethany Bell, University of South Carolina
Genine Blue
Mihaela Ene
Whitney Smiley, University of South Carolina
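The statement options listed in (a) through (c) can be sketched together in a single call; the data set and variables are hypothetical:

```
proc mixed data=school covtest method=reml;        /* (a) COVTEST and METHOD= */
   class schoolid;
   model mathscore = ses hours / solution ddfm=kr; /* (b) Kenward-Roger df */
   random intercept ses / subject=schoolid type=un;/* (c) unstructured G matrix */
run;
```

DDFM=KR and TYPE=UN are only two of the available choices; the paper's point is that each of these options carries modeling assumptions that deserve deliberate selection rather than defaults.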
Paper 1278-2014:
Analysis of Data with Overdispersion Using SAS®
Overdispersion (extra variation) arises in binomial, multinomial, or count data when variances are larger than those allowed by the binomial, multinomial, or Poisson model. This phenomenon is caused by clustering of the data, lack of independence, or both. As pointed out by McCullagh and Nelder (1989), "Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception." Several approaches are found for handling overdispersed data, namely quasi-likelihood and likelihood models, generalized estimating equations, and generalized linear mixed models. Some classical likelihood models are presented. Among them are the beta-binomial, binomial cluster (a.k.a. random clumped binomial), negative-binomial, zero-inflated Poisson, zero-inflated negative-binomial, hurdle Poisson, and the hurdle negative-binomial. We focus on how these approaches or models can be implemented in a practical way using, when appropriate, the procedures GLIMMIX, GENMOD, FMM, COUNTREG, NLMIXED, and SURVEYLOGISTIC. Some real data set examples are discussed in order to illustrate these applications. We also provide some guidance on how to analyze overdispersed generalized linear mixed models and possible scenarios in which we might encounter them.
Jorge Morel, Procter and Gamble
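Two of the models named above can be sketched as follows; data set and variable names are hypothetical:

```
/* Negative binomial for overdispersed counts, with a cluster random effect */
proc glimmix data=counts;
   class clinic treatment;
   model visits = treatment / dist=negbin link=log solution;
   random intercept / subject=clinic;
run;

/* Zero-inflated Poisson via PROC FMM: a Poisson component plus a
   degenerate-at-zero component */
proc fmm data=counts;
   class treatment;
   model visits = treatment / dist=poisson;
   model + / dist=constant;
run;
```

Comparing the negative-binomial dispersion estimate with the Poisson fit is a quick informal check of how severe the overdispersion actually is.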
Paper 1630-2014:
Application of Survey Sampling for Quality Control
Sampling is widely used in different fields for quality control, population monitoring, and modeling. However, the purpose of sampling might also be driven by the business scenario, such as legal or compliance needs. This paper uses one probability sampling method, stratified sampling, combined with the business cost of quality control review to determine an optimized sampling procedure that satisfies both statistical selection criteria and business needs. The first step is to determine the total number of strata by grouping the strata that have a small number of sample units, identified as box-and-whisker plot outliers, into a single stratum. Then, the cost to review the sample in each stratum is quantified by its business counterpart, the human working hour. Lastly, using the determined number of strata and the sample review cost, optimal allocation is applied to distribute the predetermined total sample across the strata.
Yi Du, Freddie Mac
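Once the per-stratum allocation has been worked out, drawing the sample itself is straightforward; the frame, stratum variable, and allocation below are hypothetical:

```
proc sort data=frame;
   by region;
run;

/* Simple random sampling within strata, with a predetermined
   per-stratum allocation */
proc surveyselect data=frame out=sample method=srs
                  sampsize=(40 25 15);
   strata region;
run;
```

The interesting work in the paper happens before this step: choosing the allocation numbers so that the review cost in working hours stays within the business budget.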
Paper 1775-2014:
Application of Text Mining in Tweets Using SAS® and R, and Analysis of Change in Sentiments toward Xbox One Using SAS® Sentiment Analysis Studio
The power of social media has increased to such an extent that businesses that fail to monitor consumer responses on social networking sites are now clearly at a disadvantage. In this paper, we aim to provide some insights on the impact of the Digital Rights Management (DRM) policies of Microsoft and the release of Xbox One on their customers' reactions. We have conducted preliminary research to compare the basic text mining capabilities of SAS® and R, two very diverse yet powerful tools. A total of 6,500 Tweets were collected to analyze the impact of the DRM policies of Microsoft. The Tweets were segmented into three groups based on date: before Microsoft announced its Xbox One policies (May 18 to May 26), after the policies were announced (May 27 to June 16), and after changes were made to the announced policies (June 16 to July 1). Our results suggest that SAS works better than R when it comes to extensive analysis of textual data. In our following work, customers' reactions to the release of Xbox One will be analyzed using SAS® Sentiment Analysis Studio. We will collect Tweets on Xbox posted before and after the release of Xbox One by Microsoft. We will have two categories: Tweets posted between November 15 and November 21 and those posted between November 22 and November 29. Sentiment analysis will then be performed on these Tweets, and the results will be compared between the two categories.
Goutam Chakraborty, Oklahoma State University
Aditya Datla, Oklahoma State University
Reshma Palangat, Oklahoma State University
Paper SAS051-2014:
Ask Vince: Moving SAS® Data and Analytical Results to Microsoft Excel
This presentation is an open-ended discussion about techniques for transferring data and analytical results from SAS® to Microsoft Excel. There will be some introductory comments, but this presentation does not have any set content. Instead, the topics discussed are dictated by attendee questions. Come prepared to ask and get answers to your questions. To submit your questions or suggestions for discussion in advance, go to http://support.sas.com/surveys/askvince.html.
Vince DelGobbo, SAS
Paper 1779-2014:
Assessing the Impact of Factors in a Facebook Post that Influence the EdgeRank Metric of Facebook Using the Power Ratio
Most marketers today are trying to use Facebook's network of 1.1 billion plus registered users for social media marketing. Local television stations and newspapers are no exception. This paper investigates what makes a post effective. A Facebook page that is owned by a brand has fans, or people who like the page and follow the stories posted on that page. The posts on a brand page, however, do not appear on all the fans' News Feeds. This is determined by EdgeRank, a Facebook proprietary algorithm that determines what content users see and how it's prioritized on their News Feed. If marketers can understand how EdgeRank works, then they can develop more impactful posts and ultimately produce more effective social marketing using Facebook. The objective of this paper is to find the characteristics of a Facebook post that enhance the efficacy of a news outlet's page among their fans using Facebook Power Ratio as the target variable. Power Ratio, a surrogate to EdgeRank, was developed by experts at Frank N. Magid Associates, a research-based media consulting firm. Seventeen variables that describe the characteristics of a post were extracted from more than 8,000 posts, which were encoded by 10 media experts at Magid. Numerous models were built and compared to predict Power Ratio. The most useful model is a polynomial regression with the top three important factors as whether a post asks fans to like the post, content category of a post (including news, weather, etc.), and number of fans of the page.
Yogananda Domlur Seetharama, Oklahoma State University
Dinesh Yadav Gaddam, Oklahoma State University
Paper 1605-2014:
Assigning Agents to Districts under Multiple Constraints Using PROC CLP
The Challenge: assigning outbound calling agents in a telemarketing campaign to geographic districts. The districts have a variable number of leads, and each agent needs to be assigned entire districts with the total number of leads being as close as possible to a specified number for each of the agents (usually, but not always, an equal number). In addition, there are constraints concerning the distribution of assigned districts across time zones in order to maximize productivity and availability. Our Solution: use the SAS/OR® procedure PROC CLP to formulate the challenge as a constraint satisfaction problem (CSP) since the objective is not necessarily to minimize a cost function, but rather to find a feasible solution to the constraint set. The input consists of the number of agents, the number of districts, the number of leads in each district, the desired number of leads per agent, the amount by which the actual number of leads can differ from the desired number, and the time zone for each district.
Kevin Gillette, Accenture
Stephen Sloan, Accenture
B
Paper 1617-2014:
Basic Concepts for Documenting SAS® Projects: Documentation Styles for SAS Projects, Programs, and Variables
This paper kicks off a project to write a comprehensive book of best practices for documenting SAS® projects. The presenter's existing documentation styles are explained. The presenter wants to discuss and gather current best practices used by the SAS user community. The presenter shows documentation styles at three different levels of scope. The first is a style used for project documentation, the second a style for program documentation, and the third a style for variable documentation. This third style enables researchers to repeat the modeling in SAS research, in an alternative language, or conceptually.
Peter Timusk, Statistics Canada
Paper 1722-2014:
Bayesian Framework in Early Phase Drug Development with SAS® Examples
There is an ever-increasing number of study designs and analyses of clinical trials using Bayesian frameworks to interpret treatment effects. Many research scientists prefer to understand the power and probability of taking a new drug forward across the whole range of possible true treatment effects, rather than focusing on one particular value to power the study. Examples are used in this paper to show how to compute Bayesian probabilities using the SAS/STAT® MIXED procedure and UNIVARIATE procedure. Particular emphasis is given to the application on efficacy analysis, including the comparison of new drugs to placebos and to standard drugs on the market.
Howard Liang, inVentiv health Clinical
Paper SAS305-2014:
Best Practices for Implementing High Availability for SAS® 9.4
There are many components that make up the middle tier and server tier of a SAS® 9.4 deployment. There is also a variety of technologies that can be used to provide high availability of these components. This paper focuses on a small set of best practices recommended by SAS for a consistent high-availability strategy across the entire SAS 9.4 platform. We focus on two technologies: clustering, as well as the high-availability features of SAS® Grid Manager. For the clustering, we detail newly introduced clustering capabilities in SAS 9.4 such as the middle-tier SAS® Web Application Server and the server-tier SAS® metadata clusters. We also introduce the small, medium, and large deployment scenarios or profiles, which make use of each of these technologies. These deployment scenarios reflect the typical customer's environment and address their high availability, performance, and scalability requirements.
Cheryl Doninger, SAS
Zhiyong Li, SAS
Bryan Wolfe, SAS
Paper 1895-2014:
Best Practices in SAS® Enterprise Guide®
This paper provides an overview of how to create a SAS® Enterprise Guide® process that is well designed, simple, documented, automated, modular, efficient, reliable, and easy to maintain. Topics include how to organize a SAS Enterprise Guide process, how to best document in SAS Enterprise Guide, when to leverage point-and-click functionality, and how to automate and simplify SAS Enterprise Guide processes. This paper has something for any SAS Enterprise Guide user, new or experienced!
Steven First, Systems Seminar Consultants
Jennifer First-Kluge, Systems Seminar Consultants
Paper 1682-2014:
Big Data Analysis for Resource-Constrained Surgical Scheduling
The scheduling of surgical operations in a hospital is a complex problem, with each surgical specialty having to satisfy their demand while competing for resources with other hospital departments. This project extends the construction of a weekly timetable, the Master Surgery Schedule, which assigns surgical specialties to operating theater sessions by taking into account the post-surgery resource requirements, primarily post-operative beds on hospital wards. Using real data from the largest teaching hospital in Wales, UK, this paper describes how SAS® has been used to analyze large data sets to investigate the relationship between the operating theater schedule and the demand for beds on wards in the hospital. By understanding this relationship, a more well-informed and robust operating theater schedule can be produced that delivers economic benefit to the hospital and a better experience for the patients by reducing the number of cancelled operations caused by the unavailability of beds on hospital wards.
Paul Harper, Cardiff University
Elizabeth Rowse, Cardiff University
Paper 1792-2014:
Big Data/Metadata Governance
The emerging discipline of data governance encompasses data quality assurance, data access and use policy, security risks and privacy protection, and longitudinal management of an organization's data infrastructure. In the interests of forestalling another bureaucratic solution to data governance issues, this presentation features database programming tools that provide rapid access to big data and make selective access to and restructuring of metadata practical.
Sigurd Hermansen, Westat
Paper 1794-2014:
Big Data? Faster Cube Builds? PROC OLAP Can Do It
In many organizations, the amount of data we deal with increases far faster than the hardware and IT infrastructure to support it. As a result, we encounter significant bottlenecks and I/O-bound processes. However, clever use of SAS® software can help us find a way around this. In this paper, we look at clever uses of PROC OLAP to show you how to address I/O-bound processing and spread I/O traffic to different servers to increase cube-building efficiency. This paper assumes experience with SAS® OLAP Cube Studio and/or PROC OLAP.
Michael Brule, SAS
Yunbo (Jenny) Sun, Canada Post
Paper 1549-2014:
Build your Metadata with PROC CONTENTS and ODS OUTPUT
Simply using an ODS destination to replay PROC CONTENTS output does not provide the user with attractive, usable metadata. Harness the power of SAS® and ODS output objects to create designer multi-tab metadata workbooks with the click of a mouse!
Louise Hadden, Abt Associates Inc.
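A minimal sketch of capturing a PROC CONTENTS output object as a data set, which is the starting point the abstract describes:

```
/* Route the variable-attributes output object to a data set */
ods output Variables=varmeta;
proc contents data=sashelp.class;
run;
ods output close;

/* VARMETA now holds one row per variable: name, type, length,
   format, and label, ready to be styled into a metadata workbook */
proc print data=varmeta;
run;
```

From there, the data set can be reshaped and written out to a multi-tab workbook with an Excel-targeted ODS destination.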
Paper 1581-2014:
Building Gamer Segmentation in the Credit Card Industry Using SAS® Enterprise Guide®
In the credit card industry, there is a group of people who use credit cards as an interest-free loan by transferring their balances between cards during 0% balance transfer (BT) periods in order to avoid paying interest. These people are called gamers. Gamers generate losses for banks due to their behavior of paying no interest and having no purchases. It is hard to use traditional ways, such as risk scorecards, to identify them since gamers tend to have very good credit histories. This paper uses a Naive Bayes classifier to classify gamers into three segments, according to the proportion of gamers. Using this model, the targeting policy and underwriting policy can be significantly improved, and the proportion of gamers in the population can be tracked. This result has been accomplished by using logistic regression in SAS® combined with a Microsoft Excel pivot table. The procedure is described in detail in this paper.
Yang Ge, Lancaster University
Paper 2162-2014:
Building a Business Justification
At last you found it, the perfect software solution to meet the demands of your business. Now, how do you convince your company and its leaders to spend the money to bring it in-house and reap the business benefit? How do you build a compelling business justification? How should you think through, communicate, and overcome this common business roadblock? Join Scott Sanders, a business and IT veteran who has effectively handled this challenge on numerous occasions. Hear some of his best practices and review some of his effective methods of approach, as he goes through his personal experience in building a business case and shepherding it through to final approval.
Scott Sanders, Sears Holdings
C
Paper 1825-2014:
Calculate All Kappa Statistics in One Step
The use of Cohen's kappa has enjoyed a growing popularity in the social sciences as a way of evaluating rater agreement on a categorical scale. The kappa statistic can be calculated as Cohen first proposed it in his 1960 paper or by using any one of a variety of weighting schemes. The most popular among these are the linear weighted kappa and the quadratic weighted kappa. Currently, SAS® users can produce the kappa statistic of their choice through PROC FREQ and the use of relevant AGREE options. Complications arise, however, when the data set does not contain a completely square cross-tabulation of data. That is, this method requires that both raters have at least one data point for every available category. There have been many solutions offered for this predicament. Most suggested solutions include the insertion of dummy records into the data and then assigning a weight of zero to those records through an additional class variable. The result is a multi-step macro, extraneous variable assignments, and potential data integrity issues. The author offers a much more elegant solution by producing a segment of code that uses brute force to calculate Cohen's kappa as well as all popular variants. The code uses nested PROC SQL statements to provide a single conceptual step that generates kappa statistics of all types, even those that the user wishes to define for themselves.
Matthew Duchnowski, Educational Testing Service (ETS)
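For square tables, the standard PROC FREQ route that the paper takes as its starting point looks like this; the data set and rater variables are hypothetical:

```
proc freq data=ratings;
   /* AGREE requests the simple and weighted kappa statistics;
      WT=FC switches the weighted kappa from the default linear
      (Cicchetti-Allison) weights to quadratic (Fleiss-Cohen) weights */
   tables rater1*rater2 / agree(wt=fc);
   test kappa wtkap;
run;
```

It is precisely when this cross-tabulation is not square, because one rater never used some category, that PROC FREQ cannot produce kappa directly and the paper's PROC SQL approach becomes useful.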
Paper 1661-2014:
Challenges of Processing Questionnaire Data from Collection to SDTM to ADaM and Solutions Using SAS®
Often in a clinical trial, measures are needed to describe pain, discomfort, or physical constraints that are visible but not measurable through lab tests or other vital signs. In these cases, researchers turn to questionnaires to provide documentation of improvement or statistically meaningful change in support of safety and efficacy hypotheses. For example, in studies (like Parkinson's studies) where pain or depression are serious non-motor symptoms of the disease, these questionnaires provide primary endpoints for analysis. Questionnaire data presents unique challenges in both collection and analysis in the world of CDISC standards. The questions are usually aggregated into scale scores, as the underlying questions by themselves provide little additional usefulness. SAS® is a powerful tool for extraction of the raw data from the collection databases and transposition of columns into a basic data structure in SDTM, which is vertical. The data is then processed further as per the instructions in the Statistical Analysis Plan (SAP). This involves translation of the originally collected values into sums, and the values of some questions need to be reversed. Missing values can be computed as means of the remaining questions. These scores are then saved as new rows in the ADaM (analysis-ready) data sets. This paper describes the types of questionnaires, how data collection takes place, the basic CDISC rules for storing raw data in SDTM, and how to create analysis data sets with derived records using ADaM standards, while maintaining traceability to the original question.
Karin LaPann, PRA International
Terek Peterson
Paper 1762-2014:
Chasing the Log File While Running the SAS® Program
Usually, log files are checked by users only after SAS® completes the execution of a program. If SAS finds an error in the current line, it skips the current step and executes the next one, so errors surface only when the entire program finishes. Some programs take more than a day to complete. In such cases, the user frequently opens the log file in read-only mode to check for errors, warnings, and unexpected notes, and manually terminates the execution of the program if any potential messages are identified. Otherwise, the user learns of the errors in the log file only at the end of execution. Our suggestion is to run a parallel utility program alongside the production program to check the log file of the currently running program and to notify the user through an e-mail when an error, warning, or unexpected note is found in the log file. Execution can also be terminated automatically, with the user notified, when potential messages are identified.
Harun Rasheed, Cognizant Technology Solutions
Amarnath Vijayarangan, Genpact
Paper SAS179-2014:
Check It Out! Versioning in SAS® Enterprise Guide®
The life of a SAS® program can be broken down into sets of changes made over time. Programmers are generally focused on the future, but when things go wrong, a look into the past can be invaluable. Determining what changes were made, why they were made, and by whom can save both time and headaches. This paper discusses version control and the current options available to SAS® Enterprise Guide® users. It then highlights the upcoming Program History feature of SAS Enterprise Guide. This feature enables users to easily track changes made to SAS programs. Properly managing the life cycle of your SAS programs will enable you to develop with peace of mind.
Joe Flynn, SAS
Casey Smith, SAS
Alex Song, SAS
Paper 1633-2014:
Clustering and Predictive Modeling of Patient Discharge Records with SAS® Enterprise Miner
Can clustering discharge records improve a predictive model's overall fit statistic? Do the predictors differ across segments? This talk describes the methods and results of data mining pediatric IBD patient records from my Analytics 2013 poster. SAS® Enterprise Miner 12.1 was used to segment patients and model important predictors for the length of hospital stay using discharge records from the national Kids' Inpatient Database. Profiling revealed that patient segments were differentiated by primary diagnosis, operating room procedure indicator, comorbidities, and factors related to admission and disposition of the patient. Cluster analysis of patient discharges improved the overall average square error of predictive models and distinguished predictors that were unique to patient segments.
Goutam Chakraborty, Oklahoma State University
Linda Schumacher, Oklahoma State University
Paper 1717-2014:
Collaborative Problem Solving in the SAS® Community
When a SAS® user asked for help scanning words in textual data and then matching them to pre-scored keywords, it struck a chord with SAS programmers! They contributed code that solved the problem using hash structures, SQL, informats, arrays, and PRX routines. Of course, the next question was: which program is fastest? This paper compares the different approaches and evaluates the performance of the programs on varying amounts of data. The code for each program is provided to show how SAS has a variety of tools available to solve common problems. While this won't make you an expert on any of these programming techniques, you'll see each of them in action on a common problem.
Tom Kari, Tom Kari Consulting
Paper 1499-2014:
Combined SAS® ODS Graphics Procedures with ODS to Create Graphs of Individual Data
The graphical display of the individual data is important in understanding the raw data and the relationship between the variables in the data set. You can explore your data to ensure statistical assumptions hold by detecting and excluding outliers if they exist. Since you can visualize what actually happens to individual subjects, you can make your conclusions more convincing in statistical analysis and interpretation of the results. SAS® provides many tools for creating graphs of individual data. In some cases, multiple tools need to be combined to make a specific type of graph that you need. Examples are used in this paper to show how to create graphs of individual data using the SAS® ODS Graphics procedures (SG procedures).
Howard Liang, inVentiv health Clinical
Paper 1613-2014:
Combining Multiple Date-Ranged Historical Data Sets with Dissimilar Date Ranges into a Single Change History Data Set
This paper describes a method that uses some simple SAS® macros and SQL to merge data sets containing related data whose rows have varying effective date ranges. The data sets are merged into a single data set that represents a serial list of snapshots of the merged data, as of a change in any of the effective dates. While simple conceptually, this type of merge is often problematic when the effective date ranges are not consecutive or consistent, when the ranges overlap, or when ranges are missing from one or more of the merged data sets. The technique described was used by the Fairfax County Human Resources Department to combine various employee data sets (Employee Name and Personal Data, Personnel Assignment and Job Classification, Personnel Actions, Position-Related Data, Pay Plan and Grade, Work Schedule, Organizational Assignment, and so on) from the County's SAP-HCM ERP system into a single Employee Action History/Change Activity file for historical reporting purposes. The technique currently is used to combine fourteen data sets, but is easily expandable by inserting a few lines of code using the existing macros.
James Moon, County of Fairfax, Virginia
Paper 1464-2014:
Communication-Effective Data Visualization: Design Principles, Widely Usable Graphic Examples, and Code for Visual Data Insights
Graphic software users are confronted with what I call Options Over-Choice, and with defaults that are designed to give you a result easily, but not necessarily the best result. This presentation and paper focus on guidelines for communication-effective data visualization. It demonstrates their practical implementation, using graphic examples likely to be adaptable to your own work. Code is provided for the examples. Audience members will receive the latest update of my tip-sheet compendium of graphic design principles. The examples use SAS® tools (traditional SAS/GRAPH® or the newer ODS graphics procedures that are available with Base SAS®), but the design principles are really software-independent. Come learn how to use data visualization to inform and influence, to reveal and persuade, using tips and techniques developed and refined over 34 years of working to get the best out of SAS® graphic software tools.
LeRoy Bessler, Bessler Consulting and Research
Paper 1651-2014:
Confirmatory Factor Analysis and Structural Equation Modeling of Non-cognitive Assessments Using PROC CALIS
Non-cognitive assessments, which measure constructs such as time management, goal-setting, and personality, are becoming more prevalent today in research within the domains of academic performance and workforce readiness. Many instruments that are used for this purpose contain a large number of items that can each be assigned to specific facets of the larger construct. The factor structure of each instrument emerges from a mixture of psychological theory and empirical research, often by doing exploratory factor analysis (EFA) using the SAS® procedure PROC FACTOR. Once an initial model is established, it is important to perform confirmatory factor analysis (CFA) to confirm that the hypothesized model provides a good fit to the data. If outcome data such as grades are collected, structural equation modeling (SEM) should also be employed to investigate how well the assessment predicts these measures. This paper demonstrates how the SAS procedure PROC CALIS is useful for performing confirmatory factor analysis and structural equation modeling. Examples of these methods are demonstrated and proper interpretation of the fit statistics and resulting output is illustrated.
Steven Holtzman, Educational Testing Service
Paper 1451-2014:
Converting Clinical Database to SDTM: The SAS® Implementation
The CDISC Study Data Tabulation Model (SDTM) provides a standardized structure and specification for a broad range of human and animal study data in pharmaceutical research, and is widely adopted in the industry for the submission of clinical trial data. Because SDTM requires additional variables and datasets that are not normally available in the clinical database, further programming is required to convert the clinical database into the SDTM datasets. This presentation introduces the concept and general requirements of SDTM, and the different approaches in the SDTM data conversion process. The author discusses database design considerations, implementation procedures, and SAS® macros that can be used to maximize the efficiency of the process. The creation of the DEFINE.XML metadata and the final SDTM dataset validation are also discussed.
Hong Chen, McDougall Scientific Ltd.
Paper 1719-2014:
Counting Days of the Week - The INTCK Approach
The INTCK function is used to obtain the number of time intervals between two dates. The INTCK function comes with arguments and argument modifiers that enable us to perform a variety of date-related manipulations. This paper deals with a simple real-world use of the INTCK function: calculating how often each day of the week occurs between the start and end day of a trip. The INTCK function with its arguments can directly calculate the number of occurrences of each day of the week, as illustrated in this paper. The same usage of the INTCK function within PROC SQL is also presented. All the code executed and presented in this paper involves Base SAS® Release 9.3 only.
Jinson Erinjeri, D.K. Shifflet & Associates
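The weekday-counting idea can be sketched with shifted WEEK intervals; a minimal example with hypothetical trip dates (the data set and variable names are illustrative, not the paper's):

```sas
/* WEEK.2 intervals begin on Monday, WEEK.3 on Tuesday, and so on,
   so each interval boundary crossed between the two dates is one
   occurrence of that weekday (counted strictly after the start
   date, up to and including the end date). */
data weekday_counts;
   start = '05JAN2014'd;
   end   = '18JAN2014'd;
   /* one weekday directly ... */
   mondays = intck('week.2', start, end);
   /* ... or all seven via the shift index 1-7 (Sunday-Saturday) */
   array cnt{7} sun mon tue wed thu fri sat;
   do i = 1 to 7;
      cnt{i} = intck(cats('week.', i), start, end);
   end;
   drop i;
run;
```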
Paper SAS346-2014:
Create Custom Graphs in SAS® Visual Analytics Using SAS® Visual Analytics Graph Builder
SAS® Visual Analytics Designer enables you to create reports with different layouts. There are several basic graph objects that you can include in these reports. What if you wanted to create a report that wasn't possible with one of the out-of-the-box graph objects? No worries! The new SAS® Visual Analytics Graph Builder available from the SAS® Visual Analytics home page lets you create a custom graph object using built-in sample data. You can then include these graph objects in SAS Visual Analytics Designer and generate reports using them. Come see how you can create custom graph objects such as stock plots, butterfly charts, and more. These custom objects can be easily shared with others for use in SAS Visual Analytics Designer.
Pat Berryman, SAS
Ravi Devarajan, SAS
Lisa Everdyke, SAS
Himesh Patel, SAS
Paper 1749-2014:
Creating Define.xml v2 Using SAS® for FDA Submissions
When submitting clinical data to the Food and Drug Administration (FDA), besides the usual trial results, we need to submit information that helps the FDA understand the data. The FDA requires the CDISC Case Report Tabulation Data Definition Specification (Define-XML), which is based on the CDISC Operational Data Model (ODM), for submissions using the Study Data Tabulation Model (SDTM). Electronic submission to the FDA is therefore a process of following the guidelines from CDISC and the FDA. This paper illustrates how to create an FDA-guidance-compliant define.xml v2 from metadata by using SAS®.
Qinghua (Kathy) Chen, Exelixis Inc.
James Lenihan, Exelixis Inc.
Paper SAS050-2014:
Creating Multi-Sheet Microsoft Excel Workbooks with SAS®: The Basics and Beyond Part 1
This presentation explains how to use Base SAS® 9 software to create multi-sheet Microsoft Excel workbooks. You learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS® output, using the ExcelXP ODS tagset. The techniques can be used regardless of the platform on which your SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on demand and in real time using SAS server technology is discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.
Vince DelGobbo, SAS
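The basic ExcelXP pattern can be sketched as follows; the options shown are a small subset of what the tagset supports, and the file and data set names are illustrative:

```sas
/* Each procedure's output becomes its own worksheet in a single
   SpreadsheetML workbook that Excel opens directly. */
ods _all_ close;
ods tagsets.excelxp file='class.xml'
    options(sheet_interval='proc' embedded_titles='yes');

proc print data=sashelp.class noobs;
run;

proc means data=sashelp.class mean std;
   var height weight;
run;

ods tagsets.excelxp close;
```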
Paper 1555-2014:
Creating a Journal-Ready Group Comparison Table in Microsoft Word with SAS®
The first table in many research journal articles is a statistical comparison of demographic traits across study groups. It might not be exciting, but it's necessary. And although SAS® calculates these numbers with ease, it is a time-consuming chore to transfer these results into a journal-ready table. Introducing the time-saving, deluxe %MAKETABLE SAS macro: it does the heavy work for you. It creates a Microsoft Word table of up to four comparative groups reporting t-test, chi-square, ANOVA, or median test results, including a p-value. You specify only a one-line macro call for each line in the table, and the macro takes it from there. The result is a tidily formatted, journal-ready Word table that you can easily include in a manuscript, report, or Microsoft PowerPoint presentation. For statisticians and researchers needing to summarize group comparisons in a table, this macro saves time and relieves you of the drudgery of trying to make your output neat and pretty. And after all, isn't that what we want computing to do for us?
Alan Elliott, Southern Methodist University
Paper 1361-2014:
Creating a SimNICU: Using Simulation to Model Staffing Needs in Clinical Environments
Patient safety in a neonatal intensive care unit (NICU), as in any hospital unit, is critically dependent on appropriate staffing. We used SAS® Simulation Studio to create a discrete-event simulation model of a specific NICU that can be used to predict the number of nurses needed per shift. This model incorporates the complexities inherent in determining staffing needs, including variations in patient acuity, referral patterns, and length of stay. To build our model, the group first estimated probability distributions for the number and type of patients admitted each day to the unit. Using both internal and published data, the team also estimated distributions for various NICU-specific patient morbidities, including the type and timing of each morbidity event and its temporal effect on a patient's acuity. We then built a simulation model that samples from these input distributions and simulates the flow of individual patients through the NICU (consisting of critical-care and step-down beds) over a one-year time period. The general basis of our model represents a method that can be applied to any unit in any hospital, thereby providing clinicians and administrators with a tool to rigorously and quantitatively support staffing decisions. With additional refinements, the use of such a model over time can provide significant benefits in both patient safety and operational efficiency.
Chris DeRienzo, Duke University Hospital
Emily Lada, SAS
Phillip Meanor, SAS
David Tanaka, Duke University Medical Center
Paper 1686-2014:
Customer Perception and Reality: Unraveling the Energy Customer Equation
Energy companies that operate in a highly regulated environment and are constrained in pricing flexibility must employ a multitude of approaches to maintain high levels of customer satisfaction. Many investor-owned utilities are just starting to embrace a customer-centric business model to improve the customer experience and hold the line on costs while operating in an inflationary business setting. Faced with these challenges, it is natural for utility executives to ask: 'What drives customer satisfaction, and what is the optimum balance between influencing customer perceptions and improving actual process performance in order to be viewed as a top-tier performer by our customers?' J.D. Power, for example, cites power quality and reliability as the top influencer of overall customer satisfaction. But studies have also shown that customer perceptions of reliability do not always match actual reliability experience. This apparent gap between actual and perceived performance raises a conundrum: Should the utility focus its efforts and resources on improving actual reliability performance, or would it be better to concentrate on influencing customer perceptions of reliability? How can this conundrum be unraveled with an analytically driven approach? In this paper, we explore how design-of-experiments techniques can be employed to help understand the relationship between process performance and customer perception, thereby leading to important insights into the energy customer equation and higher customer satisfaction!
Kathy Ball, SAS
Mark Konya, Ameren Missouri
D
Paper 1564-2014:
Dashboards: A Data Lifeline for the Business
The Washington, D.C. aqueduct was completed in 1863, carrying desperately needed clean water to the city's many residents. Just as the aqueduct was vital to those residents, a lifeline if you will, so too is the supply of data to the business. Without the flow of vital information, many businesses would not be able to make important decisions. The task of building my company's first dashboard was brought before us by our CIO; the business had not asked for it. In this poster, I discuss how we were able to bring fresh ideas and data to our business units by converting the data they saw in daily reports into dashboards. The road to success was long, with plenty of struggles: creating our own business requirements, building data marts, syncing SQL to SAS®, and using information maps and SAS® Enterprise Guide® projects to move data around, all while dealing with technology and other I.T. team roadblocks. Then it was on to designing what would become our real-time dashboards, fighting for SharePoint single sign-on, and, oh yeah, user adoption. My story of how dashboards revitalized the business is a refreshing tale for all levels.
Tricia Aanderud, And Data Inc
Jennifer McBride, Virginia Credit Union
Paper 1603-2014:
Data Coarsening and Data Swapping Algorithms
With increased concern about privacy and simultaneous pressure to make survey data available, statistical disclosure control (SDC) treatments are performed on survey microdata to reduce disclosure risk prior to dissemination to the public. This situation is all the more problematic in the push to provide data online for immediate user query. Two SDC approaches are data coarsening, which reduces the information collected, and data swapping, which is used to adjust data values. Data coarsening includes recodes, top-codes and variable suppression. Challenges related to creating a SAS® macro for data coarsening include providing flexibility for conducting different coarsening approaches, and keeping track of the changes to the data so that variable and value labels can be assigned correctly. Data swapping includes selecting target records for swapping, finding swapping partners, and swapping data values for the target variables. With the goal of minimizing the impact on resulting estimates, challenges for data swapping are to find swapping partners that are close matches in terms of both unordered categorical and ordered categorical variables. Such swapping partners ensure that enough change is made to the target variables, that data consistency between variables is retained, and that the pool of potential swapping partners is controlled. An example is presented using each algorithm.
Sixia Chen, Westat
Mamadou Diallo, Westat
Amita Gopinath, Westat
Katie Hubbell, Westat
Tom Krenzke, Westat
Paper SAS176-2014:
Data Visualization in Health Care: Optimizing the Utility of Claims Data through Visual Analysis
A revolution is taking place in the U.S. at both the national and state level in the area of health care transparency. Large amounts of data on the health of communities, the quality of health care providers, and the cost of health care are being collected and made available by both levels of government to a variety of stakeholders. The surfacing of this data and its consumption by health care decision makers unfolds a new opportunity to view, explore, and analyze health care data in novel ways. Furthermore, this data provides the health care system an opportunity to advance the achievement of the Triple Aim. Data transparency will bring a sea change to the world of health care by necessitating new ways of communicating information to end users such as payers, providers, researchers, and consumers of health care. This paper examines the information needs of public health care payers such as Medicare and Medicaid, and discusses the convergence of health care and data visualization in creating consumable health insights that will aid in achieving cost containment, quality improvement, and increased accessibility for populations served. Moreover, using claims data and SAS® Visual Analytics, it examines how data visualization can help identify the most critical insights necessary to managing population health. If health care payers can analyze large amounts of claims data effectively, they can improve service and care delivery to their recipients.
Riley Benson, SAS
Krisa Tailor, SAS
Paper 1302-2014:
Debugging SAS® Code in a Macro
Debugging SAS® code contained in a macro can be frustrating because the SAS error messages refer only to the line in the SAS log where the macro was invoked. This can make it difficult to pinpoint the problem when the macro contains a large amount of SAS code. Using a macro that contains one small DATA step, this paper shows how to use the MPRINT and MFILE options along with the fileref MPRINT to write just the SAS code generated by a macro to a file. The 'de-macroified' SAS code can be easily executed and debugged.
Bruce Gilsen, Federal Reserve Board
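The MPRINT/MFILE pattern described above can be sketched as follows; %MYMACRO and the file path are hypothetical placeholders for the macro being debugged:

```sas
/* Route the SAS code generated by the macro to an external file.
   The fileref must be named MPRINT for the MFILE option to use it. */
filename mprint 'demacro.sas';
options mprint mfile;
%mymacro(indata=sashelp.class)
options nomprint nomfile;

/* The 'de-macroified' code can now be run and debugged directly,
   with error messages pointing at the actual offending lines. */
%include 'demacro.sas' / source2;
```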
Paper 1721-2014:
Deploying a User-Friendly SAS® Grid on Microsoft Windows
Your company's chronically overloaded SAS® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS® Display Manager on a server machine, runs counter to the expectations of the SAS grid: your users are now supposed to switch to SAS® Enterprise Guide® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play the hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, while your users get to keep their favorite SAS Display Manager.
Houliang Li, HL SASBIPros Inc
Paper 1807-2014:
Develop Highly Interactive Web Charts with SAS®
Very often, there is a need to present analysis output from SAS® through web applications. On these occasions, it makes a lot of difference to have highly interactive charts rather than static image charts and graphs. Not only is this visually appealing, but with features like zooming and filtering it enables consumers to better understand the output. There are many charting libraries available in the market that enable us to develop cool charts without much effort, among them Highcharts, Highstock, and KendoUI. They are developed in JavaScript and use the latest HTML5 components, and they support a variety of chart types such as line, spline, area, area spline, column, bar, pie, scatter, angular gauge, area range, area spline range, column range, bubble, box plot, error bar, funnel, waterfall, and polar. This paper demonstrates how we can combine the data processing and analytic powers of SAS with the visualization abilities of these charting libraries. Since most of them consume JSON-formatted data, the emphasis is on the JSON-producing capabilities of SAS, both with PROC JSON and with other custom programming methods. An example shows how easy it is to develop a stored process that produces JSON data to be consumed by the charting library, with minimal change to the sample program.
Rajesh Inbasekaran, Kavi Associates
Naren Mudivarthy, Kavi Associates
Neetha Sindhu, Kavi Associates
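The PROC JSON side of this workflow can be sketched as follows; the output path and WORK.STATS data set are hypothetical stand-ins for whatever summary data feeds the chart:

```sas
/* Write a data set as JSON for consumption by a JavaScript
   charting library. NOSASTAGS suppresses the SAS metadata wrapper
   so the file contains just the rows the chart needs. */
proc json out='chartdata.json' pretty;
   export work.stats / nosastags;
run;
```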
Paper SAS207-2014:
Did My Coupon Campaign Accomplish Anything? An Application of Selection Models to Retailing
Evaluation of the efficacy of an intervention is often complicated because the intervention is not randomly assigned. Usually, interventions in marketing, such as coupons or retention campaigns, are directed at customers because their spending is below some threshold or because the customers themselves make a purchase decision. The presence of nonrandom assignment of the stimulus can lead to over- or underestimating the value of the intervention. This can cause future campaigns to be directed at the wrong customers or cause the impacts of these effects to be over- or understated. This paper gives a brief overview of selection bias, demonstrates how selection in the data can be modeled, and shows how to apply some of the important consistent methods of estimating selection models, including Heckman's two-step procedure, in an empirical example. Sample code is provided in an appendix.
Kenneth Sanford, SAS
Gunce Walton, SAS
Paper 1615-2014:
Don't Get Blindsided by PROC COMPARE
'NOTE: No unequal values were found. All values compared are exactly equal.' Do your eyes automatically drop to the end of your PROC COMPARE output in search of these words? Do you then conclude that your data sets match? Be careful here! Major discrepancies might still lurk in the shadows, and you'll never know about them if you make this common mistake. This paper describes several of PROC COMPARE's blind spots and how to steer clear of them. Watch in horror as PROC COMPARE glosses over important differences while boldly proclaiming that all is well. See the gruesome truth about what PROC COMPARE does, and what it doesn't do! Learn simple techniques that allow you to peer into these blind spots and avoid getting blindsided by PROC COMPARE!
Josh Horstman, Nested Loop Consulting
Roger Muller, Data-To-Events, Inc.
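The core blind spot is easy to demonstrate; a minimal sketch, with OLD and NEW as hypothetical data sets:

```sas
/* The "no unequal values" note covers only the variables and
   observations present in BOTH data sets. The LISTALL option
   reports any unmatched variables and observations that
   PROC COMPARE would otherwise skip silently. */
proc compare base=old compare=new listall;
run;
```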
Paper SAS348-2014:
Doubling Down on Analytics: Using Analytic Results from Other Departments to Enhance Your Approach to Marketing
Response rates, churn models, customer lifetime value: today's marketing departments are more analytically driven than ever. Marketers have had their heads down developing analytic capabilities for some time. The results have been game-changing. But it's time for marketers to look up and discover which analytic results from other departments can enhance the analytics of marketing. What if you knew the demand forecast for your products? What could you do? What if you understood the price sensitivity for your products? How would this impact the actions that your marketing team takes? Using the hospitality industry as an example, we explore how marketing teams can use the analytic outputs from other departments to get better results overall.
Natalie Osborn, SAS
Eric Peterson, Pinnacle Entertainment
E
Paper SAS173-2014:
Ebony and Ivory: SAS® Enterprise BI and SAS® Visual Analytics Living in Perfect Harmony
Ebony and Ivory was a number one song by Paul McCartney and Stevie Wonder about making music together, proper integration, unity, and harmony on a deeper level. With SAS® Visual Analytics, current Enterprise Business Intelligence (BI) customers can rest assured that their years of existing BI work and content can coexist until they can fully transition over to SAS Visual Analytics. This presentation covers 10 inter-operability integration points between SAS® BI and SAS Visual Analytics.
Ted Stolarczyk, SAS
Paper SAS193-2014:
Effective Risk Aggregation and Reporting Using SAS®
Recent banking and insurance risk regulations both require effective aggregation of risks. To determine the total enterprise risk for a financial institution, all risks must be aggregated and analyzed. Typically, there are two approaches: bottom-up and top-down risk aggregation. In either approach, financial institutions face challenges due to the various levels of risk, with differences in metrics, data sources, and availability. First, it is especially complex to aggregate risk: a common view of the dependence between all individual risks can be hard to achieve. Second, the underlying data sources can be updated at different times and can have different horizons, which in turn requires an incremental update of the overall risk view. Third, the risk needs to be analyzed across on-demand hierarchies. This paper presents SAS® solutions to these challenges. To address the first challenge, we consider a mixed approach to specify copula dependence between individual risks and allow step-by-step specification with a minimal amount of information. Next, the solution leverages an event-driven architecture to update results on a continuous basis. Finally, the platform provides a self-service reporting and visualization environment for designing and deploying reports across any hierarchy and granularity on the fly. These capabilities enable institutions to create an accurate, timely, comprehensive, and adaptive risk-aggregation and reporting system.
Wei Chen, SAS
Srinivasan Iyer, SAS
Jimmy Skoglund, SAS
Paper 1700-2014:
Embedding Critical Content in an E-mail Message Body
Data quality depends on review of the content by its operational stewards, but volumes of complex data disappear as e-mail attachments. Is there a critical data shift that might be missed? Embedding a summary image in the message body drove expert data review from 15% to 87%, significantly reducing the downstream error rate. The result is increased accuracy in variable physician compensation measures.
Amy Swartz, Kaiser Permanente
Paper 1673-2014:
Enhance the ODS HTML Output with JavaScript
For data analysts, one of the most important steps after manipulating and analyzing a data set is to create a report for it. Nowadays, many statistical tables and reports are generated as HTML files that can be easily accessed through the Internet. However, the SAS® Output Delivery System (ODS) HTML output has many limitations on interaction with users. In this paper, we introduce a method to enhance the traditional ODS HTML output by using jQuery (a JavaScript library). A macro was developed to implement this idea. Compared to the standard HTML output, this macro can add sorting, pagination, search, and even dynamic drill-down functionality to the ODS HTML output file.
Yu Fu, Oklahoma State Department of Health
Chao Huang, Oklahoma State University
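One simple way to get a JavaScript library into an ODS HTML page, independent of the paper's macro, is the HEADTEXT= option; a minimal sketch, where the file names are placeholders for a local or CDN copy of the script:

```sas
/* Inject jQuery (or any script) into the <head> of the generated
   page; the script can then enhance the report's tables. */
ods html file='report.html'
    headtext='<script src="jquery.min.js"></script>';
proc print data=sashelp.class;
run;
ods html close;
```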
Paper 2042-2014:
Estimating Ordinal Reliability Using SAS®
In evaluation instruments and tests, individual items are often collected using an ordinal measurement or Likert-type scale. Typically, measures such as Cronbach's alpha are estimated using the standard Pearson correlation. Gadermann and Zumbo (2012) illustrate how using standard Pearson correlations may yield biased estimates of reliability when the data are ordinal, and present methodology for using the polychoric correlation in reliability estimates as an alternative. This session shows how to implement the methods of Gadermann and Zumbo using SAS® software. An example incorporates these methods in estimating the reliability of an active-learning post-occupancy evaluation instrument developed by Steelcase Education Solutions researchers.
Laura Kapitula, Grand Valley State University
Paper 1764-2014:
Excel with SAS® and Microsoft Excel
SAS® is an outstanding suite of software, but not everyone in the workplace speaks SAS. However, almost everyone speaks Excel. Often, the data you are analyzing, the data you are creating, and the report you are producing take the form of a Microsoft Excel spreadsheet. Every year at SAS® Global Forum, there are SAS and Excel presentations, not just because Excel is so pervasive in the workplace, but because there's always something new to learn (or re-learn)! This paper summarizes and references (and pays homage to!) previous SAS Global Forum presentations, as well as examines some of the latest Excel capabilities with the latest versions of SAS® 9.4 and SAS® Visual Analytics.
Andrew Howell, ANJ Solutions
Paper 1493-2014:
Experiences in Using Academic Data for SAS® BI Dashboard Development
Business Intelligence (BI) dashboards serve as an invaluable, high-level, visual reference tool for decision-making processes in many business industries. A request was made to our department to develop some BI dashboards that could be incorporated in an academic setting. These dashboards would aim to serve various undergraduate executive and administrative staff at the university. While most business data may lend itself to work very well and easily in the development of dashboards, academic data is typically modeled differently and, therefore, faces unique challenges. In this paper, the authors detail and share the design and development process of creating dashboards for decision making in an academic environment utilizing SAS® BI Dashboard 4.3 and other SAS® Enterprise Business Intelligence 9.2 tools. The authors also provide lessons learned as well as recommendations for future implementations of BI dashboards utilizing academic data.
Evangeline Collado, University of Central Florida
Michelle Parente, University of Central Florida
Paper 1450-2014:
Exploring DATA Step Merges and PROC SQL Joins
Explore the various DATA step merge and PROC SQL join processes. This presentation examines the similarities and differences between merges and joins, and provides examples of effective coding techniques. Attendees examine the objectives and principles behind merges and joins, one-to-one merges (joins), and match-merge (equi-join), as well as the coding constructs associated with inner and outer merges (joins) and PROC SQL set operators.
Kirk Paul Lafler, Software Intelligence Corporation
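To give a flavor of the comparison, here is a match-merge and its PROC SQL inner-join equivalent; the data set and variable names (customers, orders, custid, amount) are hypothetical:

```sas
/* DATA step match-merge requires sorted inputs */
proc sort data=work.orders;    by custid; run;
proc sort data=work.customers; by custid; run;

data merged;
   merge work.customers (in=c) work.orders (in=o);
   by custid;
   if c and o;                  /* keep matches only = inner join */
run;

/* Equivalent PROC SQL inner join; no pre-sorting required */
proc sql;
   create table joined as
   select c.*, o.amount
   from work.customers as c
        inner join work.orders as o
        on c.custid = o.custid;
quit;
```

One practical difference worth noting: the DATA step merge depends on sort order and BY-group processing, while PROC SQL handles the matching internally.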
Paper SAS165-2014:
Extracting Key Concepts from Unstructured Medical Reports Using SAS® Text Analytics and SAS® Visual Analytics
The growing adoption of electronic systems for keeping medical records provides an opportunity for health care practitioners and biomedical researchers to access traditionally unstructured data in a new and exciting way. Pathology reports, progress notes, and many other sections of the patient record that are typically written in a narrative format can now be analyzed by employing natural language processing contextual extraction techniques to identify specific concepts contained within the text. Linking these concepts to a standardized nomenclature (for example, SNOMED CT, ICD-9, ICD-10, and so on) frees analysts to explore and test hypotheses using these observational data. Using SAS® software, we have developed a solution in order to extract data from the unstructured text found in medical pathology reports, link the extracted terms to biomedical ontologies, join the output with more structured patient data, and view the results in reports and graphical visualizations. At its foundation, this solution employs SAS® Enterprise Content Categorization to perform entity extraction using both manually and automatically generated concept definition rules. Concept definition rules are automatically created using technology developed by SAS, and the unstructured reports are scored using the DS2/SAS® Content Categorization API. Results are post-processed and added to tables compatible with SAS® Visual Analytics, thus enabling users to visualize and explore data as required. We illustrate the interrelated components of this solution with examples of appropriate use cases and describe manual validation of performance and reliability with metrics such as precision and recall. We also provide examples of reports and visualizations created with SAS Visual Analytics.
Eric Brinsfield, SAS
Greg Massey, SAS
Adrian Mattocks, SAS
Radhikha Myneni, SAS
F
Paper 2029-2014:
Five Things to Do when Using SAS® BI Web Services
Traditionally, web applications interact with back-end databases by means of JDBC/ODBC connections to retrieve and update data. With the growing need for real-time charting and complex analysis types of data representation on these web applications, SAS computing power can be put to use by adding a SAS web service layer between the application and the database. With the experience that we have with integrating these applications to SAS® BI Web Services, this is our attempt to point out five things to do when using SAS BI Web Services. 1) Input data sources: always enable "Allow rewinding stream" when creating the stored process. 2) Use LIBNAME statements to define XML filerefs for the input and output streams (data sources). 3) Define input prompts and output parameters as global macro variables in the stored process if the stored process calls macros that use these parameters. 4) Make sure that all of the output parameter values are set correctly, as defined (data type), before the end of the stored process. 5) The input streams (if any) should have a consistent data type; essentially, every instance of the stream should have the same structure. This paper consists of examples and illustrations of errors and warnings associated with the previously mentioned cases.
Vimal Raj Arockiasamy, Kavi Associates
Neetha Sindhu, Kavi Associates
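Item 2 above, for example, can be sketched like this; the fileref and table names are hypothetical, and in a real stored process the stream filerefs are assigned by the server rather than by FILENAME TEMP:

```sas
/* Sketch: bind the XML engine to a stream fileref. When no physical path
   is given, the XML libref reuses the fileref of the same name. */
filename instrm temp;      /* stands in for the input stream fileref */
libname  instrm xml;

data work.params;
   set instrm.parameters;  /* element name in the XML stream (assumed) */
run;
```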
Paper 1441-2014:
For Coders Only: The SAS® Enterprise Guide® Experience
No way. Not gonna happen. I am a real SAS® programmer. (Spoken by a Real SAS Programmer.) SAS® Enterprise Guide® sometimes gets a bad rap. It was originally promoted as a code generator for non-programmers. The truth is, however, that SAS Enterprise Guide has always allowed programmers to write their own code. In addition, it offers many features that are not included in PC SAS®. This presentation shows you the top ten features that people who like to write code care about. It will be taught by a programmer who now prefers using SAS Enterprise Guide.
Christopher Bost, MDRC
Paper 1448-2014:
From Providing Support to Driving Decisions: Improving the Value of Institutional Research
For almost two decades, Western Kentucky University's Office of Institutional Research (WKU-IR) has used SAS® to help shape the future of the institution by providing faculty and administrators with information they can use to make a difference in the lives of their students. This presentation provides specific examples of how WKU-IR has shaped the policies and practices of our institution and discusses how WKU-IR moved from a support unit to a key strategic partner. In addition, the presentation covers the following topics: How the WKU Office of Institutional Research developed over time; Why WKU abandoned reactive reporting for a more accurate, convenient system using SAS® Enterprise Intelligence Suite for Education; How WKU shifted from investigating what happened to predicting outcomes using SAS® Enterprise Miner and SAS® Text Miner; How the office keeps the system relevant and utilized by key decision makers; What the office has accomplished and key plans for the future.
Tuesdi Helbig, Western Kentucky University
Gina Huff, Western Kentucky University
G
Paper 1668-2014:
Generate Cloned Output with a Loop or Splitter Transformation
Based on selection criteria, the SAS® Data Integration Studio loop or splitter transformations can be used to generate multiple output files. The ETL developer or SAS® administrator can decide which transformation is better suited for the design, priorities, and SAS configuration at their site. Factors to consider are the setup, maintenance, and performance of the ETL job. The loop transformation requires an understanding of macros and a control table. The splitter transformation is more straightforward and self-documenting. If time allows, creating and running a job with each transformation can provide benchmarking to measure performance. For a comparison of these two options, this paper shows an example of the same job using the loop or splitter transformation. For added testing metrics, one can adapt the LOGPARSE SAS macro to parse the job logs.
Laura Liotus, Community Care Behavioral Health
Paper SAS2221-2014:
Getting Started with the SAS/IML® Language
Do you need a statistic that is not computed by any SAS® procedure? Reach for the SAS/IML® language! Many statistics are naturally expressed in terms of matrices and vectors. For these, you need a matrix-vector language. This hands-on workshop introduces the SAS/IML language to experienced SAS programmers. The workshop focuses on statements that create and manipulate matrices, read and write data sets, and control the program flow. You will learn how to write user-defined functions, interact with other SAS procedures, and recognize efficient programming techniques. Programs are written using the SAS/IML® Studio development environment. This course covers Chapters 2-4 of Statistical Programming with SAS/IML Software (Wicklin, 2010).
Rick Wicklin, SAS
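As a small taste of the language the workshop covers, the sketch below shows a matrix literal, a subscript reduction, a user-defined module, and reading a data set into a matrix (SASHELP.CLASS is used as a stand-in data set):

```sas
proc iml;
   x = {1 2, 3 4, 5 6};            /* 3x2 matrix literal                 */
   colMeans = x[:,];               /* subscript reduction: column means  */

   start center(m);                /* user-defined module                */
      return( m - repeat(m[:,], nrow(m), 1) );  /* subtract column means */
   finish;

   c = center(x);
   print colMeans, c;

   use sashelp.class;              /* read variables into a matrix       */
   read all var {height weight} into y;
   close sashelp.class;
quit;
```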
Paper 1601-2014:
Graphs Useful for Variable Selection in Predictive Modeling
This paper illustrates some SAS® graphs that can be useful for variable selection in predictive modeling. Analysts are often confronted with hundreds of candidate variables available for use in predictive models, and this paper illustrates some simple SAS graphs that are easy to create and that are useful for visually evaluating candidate variables for inclusion or exclusion in predictive models. The graphs illustrated in this paper are bar charts with confidence intervals using the GCHART procedure and comparative histograms using the UNIVARIATE procedure. The graphs can be used for most combinations of categorical or continuous target variables with categorical or continuous input variables. This paper assumes the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression.
Bob Moore, Thrivent Financial
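For instance, a comparative histogram of a candidate input by target level takes only a few lines; SASHELP.HEART is used here as a stand-in data set, not one from the paper:

```sas
/* One histogram panel per level of the target variable */
proc univariate data=sashelp.heart noprint;
   class status;                     /* binary target: Alive vs. Dead */
   var cholesterol;                  /* candidate input variable      */
   histogram cholesterol / nrows=2;  /* stack the panels vertically   */
run;
```

Visibly different distributions across the target levels suggest the candidate variable may be worth including in the model.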
Paper 1684-2014:
Grid--What They Didn't Tell You
Speed, precision, reliability: these are just three of the many challenges that today's banking institutions need to face. Join Austria's ERSTE GROUP Bank on its road from monolithic processing toward a highly flexible processing infrastructure using SAS® Grid technology. This paper focuses on the central topics and decisions that go beyond the standard material about the product that is presented initially to SAS Grid prospects. Topics covered range from how to choose the correct hardware and critical architecture considerations to the necessary adaptations of existing code and logic, all of which have proven to be a common experience for all members of the SAS Grid community. After making the initial plans and successfully managing the initial hurdles, seeing it all come together makes you realize the endless possibilities for improving your processing landscape.
Phillip Manschek, SAS
Manuel Nitschinger, sIT-Solutions
H
Paper SAS257-2014:
Handling Missing Data Using SAS® Enterprise Guide®
Missing data is an ever-present issue, and analysts should exercise proper care when dealing with it. Depending on the data and the analytical approach, this problem can be addressed by simply removing records with missing data. However, in most cases, this is not the best approach. In fact, this can potentially result in inaccurate or biased analyses. The SAS® programming language offers many DATA step processes and functions for handling missing values. However, some analysts might not like or be comfortable with programming. Fortunately, SAS® Enterprise Guide® can provide those analysts with a number of simple built-in tasks for discovering missing data and diagnosing their distribution across fields. In addition, various techniques are available in SAS Enterprise Guide for imputing missing values, varying from simple built-in tasks to more advanced tasks that might require some customized SAS code. The focus of this presentation is to demonstrate how SAS Enterprise Guide features such as Query Builder, Filter and Sort Wizard, Describe Data, Standardize Data, and Create Time Series address missing data issues through the point-and-click interface. As an example of code integration, we demonstrate the use of a code node for more advanced handling of missing data. Specifically, this demonstration highlights the power and programming simplicity of PROC EXPAND (SAS/ETS® software) in imputing missing values for time series data.
Matt Hall, SAS
Elena Shtern, SAS
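The PROC EXPAND step highlighted above might look like the following sketch; the data set and variable names are invented for illustration:

```sas
/* Impute embedded missing values in a time series by interpolation.
   METHOD=JOIN fits line segments between points, so embedded missing
   observations are filled by linear interpolation. */
proc expand data=work.series out=work.filled method=join;
   id date;                        /* time ID variable        */
   convert sales = sales_filled;   /* imputed copy of SALES   */
run;
```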
Paper 1646-2014:
Hash It Out with a Web Service and Get the Data You Need
Have you ever needed additional data that was only accessible via a web service in XML or JSON? In some situations, the web service is set up to only accept parameter values that return data for a single observation. To get the data for multiple values, we need to iteratively pass the parameter values to the web service in order to build the necessary dataset. This paper shows how to combine the SAS® hash object with the FILEVAR= option to iteratively pass a parameter value to a web service and input the resulting JSON or XML formatted data.
John Vickery, North Carolina State University
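The core pattern the paper describes might look like the following sketch; the URL, data set, and variable names are all assumptions for illustration:

```sas
/* Sketch: iterate over parameter values, re-point INFILE at a new URL
   via FILEVAR=, and collect each response in a hash object. */
data _null_;
   length id $8 url $256 response $32767;
   declare hash h(ordered:'a');          /* one row per parameter value */
   h.defineKey('id');
   h.defineData('id', 'response');
   h.defineDone();

   do until (done);
      set work.param_list end=done;      /* one parameter value per row */
      url = cats('http://example.com/api?item=', id);
      infile resp url filevar=url end=eof;  /* FILEVAR= swaps the source */
      response = '';
      do until (eof);
         input;
         response = cats(response, _infile_);  /* accumulate JSON/XML text */
      end;
      rc = h.add();
   end;
   rc = h.output(dataset:'work.responses');   /* write collected rows */
run;
```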
Paper SAS385-2014:
Help Me! Switch to SAS® Enterprise Guide® from Traditional SAS®
When first presented with SAS® Enterprise Guide®, many existing SAS® programmers don't know where to begin. They want to understand, 'What's in it for me?' if they switch over. These longtime users of SAS are accustomed to typing all of their code into the Program Editor window and clicking Submit. This beginning tutorial introduces SAS Enterprise Guide 6.1 to old and new users of SAS who need to code. It points out advantages and tips that demonstrate why a user should be excited about the switch. This tutorial focuses on the key points of a session involving coding and introduces new features. It covers the top three items for a user to consider when switching over to a server-based environment. Attendees will return to the office with a new motivation and confidence to start coding with SAS Enterprise Guide.
Andy Ravenna, SAS
Paper SAS117-2014:
Helpful Hints for Transitioning to SAS® 9.4
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you're getting oriented and will provide short, straightforward answers. And we can share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS® Customer Experience Testing shares insiders' practical tips to ensure you are ready to begin your transition to SAS® 9.4.
Cindy Taylor, SAS
Paper 1385-2014:
How Predictive Analytics Turns Mad Bulls into Predictable Animals
Portfolio segmentation is key in all forecasting projects. Not all products are equally predictable. Nestlé uses animal names for its segmentation, and the animal behavior translates well into how the planners should plan these products. Mad Bulls are those products that are tough to predict if we don't know what is causing their unpredictability. The Horses are easier to deal with. Modern time-series-based statistical forecasting methods can tame Mad Bulls, as they allow planners to add explanatory variables to the models. Nestlé now complements its Demand Planning solution based on SAP with predictive analytics technology provided by SAS®, to overcome these issues in an industry that is highly promotion-driven. In this talk, we will provide an overview of the relationship Nestlé is building with SAS, and provide concrete examples of how modern statistical forecasting methods available in SAS® Demand-Driven Planning and Optimization help us to increase forecasting performance, and therefore to provide high service to our customers with optimized stock, the primary goal of Nestlé's supply chains.
Marcel Baumgartner, Nestlé SA
Paper 1855-2014:
How SAS® BI May Help You to Optimize Business Processes
This case study shows how SAS® Enterprise Guide® and SAS® Enterprise BI made it possible to easily implement fraud prevention reports at BF Financial Services, and how they help operational areas increase efficiency through automation of information delivery. The fraud alert report was built with a program developed in SAS Enterprise Guide to detect fraud in loan applications and was later published in SAS® Web Report Studio to be analyzed by a team. The second example is the automation with SAS BI of a payment report that previously consumed 30% of the time of a six-person staff.
Plinio Faria, Bradesco
Paper 1486-2014:
How to Be A Data Scientist Using SAS®
The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.
Chuck Kincaid, Experis Business Analytics
Paper 1623-2014:
How to Read, Write, and Manipulate SAS® Dates
No matter how long you've been programming in SAS®, using and manipulating dates still seems to require effort. Learn all about SAS dates, the different ways they can be presented, and how to make them useful. This paper includes excellent examples for dealing with raw input dates, functions to manage dates, and outputting SAS dates in other formats. Included is all the date information you will need: date and time functions, informats, formats, and arithmetic operations.
Jenine Milum, Equifax Inc.
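The basics the paper covers can be sketched in a few lines; the variable names are invented for illustration:

```sas
/* Common date handling: read, extract, shift, count, and format */
data dates;
   raw  = '14MAR2014';
   d    = input(raw, date9.);        /* character -> SAS date value      */
   mon  = month(d);                  /* extract a date part              */
   next = intnx('month', d, 1, 'b'); /* first day of the following month */
   yrs  = intck('year', '01JAN1990'd, d);  /* elapsed year boundaries    */
   format d next yymmdd10.;          /* display dates in another format  */
   put d= mon= next= yrs=;
run;
```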
Paper SAS212-2014:
How to Separate Regular Prices from Promotional Prices?
Retail price setting is influenced by two distinct factors: the regular price and the promotion price. Together, these factors determine the list price for a specific item at a specific time. These data are often reported only as a singular list price. Separating this one price into two distinct prices is critical for accurate price elasticity modeling in retail. These elasticities are then used to make sales forecasts, manage inventory, and evaluate promotions. This paper describes a new time-series feature extraction utility within SAS® Forecast Server that allows for automated separation of promotional and regular prices.
Michael Leonard, SAS
Michele Trovero, SAS
I
Paper 1283-2014:
I Object: SAS® Does Objects with DS2
The DATA step has served SAS® programmers well over the years, and although it is powerful, it has not fundamentally changed. With DS2, SAS has introduced a significant alternative to the DATA step by introducing an object-oriented programming environment. In this paper, we share our experiences with getting started with DS2 and learning to use it to access, manage, and share data in a scalable, threaded, and standards-based way.
Peter Eberhardt, Fernwood Consulting Group Inc.
Xue Yao, University of Manitoba
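A minimal sketch of the object-oriented flavor of DS2 appears below; the package and variable names are invented, and SASHELP.CLASS's AGE variable is reused simply to show a method call on real data:

```sas
proc ds2;
   /* A user-defined package is DS2's object: data plus methods */
   package converter /overwrite;
      method f_to_c(double f) returns double;
         return (f - 32) * 5 / 9;    /* Fahrenheit to Celsius */
      end;
   endpackage;

   data celsius /overwrite;
      dcl package converter cv();    /* instantiate the object */
      dcl double tempC;
      method run();
         set sashelp.class;          /* stand-in input table   */
         tempC = cv.f_to_c(age);     /* call the object method */
      end;
   enddata;
   run;
quit;
```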
Paper SAS369-2014:
I Want the Latest and Greatest! The Top Five Things You Need to Know about Migration
Determining what, when, and how to migrate SAS® software from one major version to the next is a common challenge. SAS provides documentation and tools to help make the assessment, planning, and eventual deployment go smoothly. We describe some of the keys to making your migration a success, including the effective use of the SAS® Migration Utility, both in the analysis mode and the execution mode. This utility is responsible for analyzing each machine in an existing environment, surfacing product-specific migration information, and creating packages to migrate existing product configurations to later versions. We show how it can be used to simplify each step of the migration process, including recent enhancements to flag product version compatibility and incompatibility.
Josh Hames, SAS
Gerry Nelson, SAS
Paper 1805-2014:
Improving the Thermal Efficiency of Coal-Fired Power Plants: A Data Mining Approach
Power producers are looking for ways not only to improve the efficiency of power plant assets but also to address growing concerns about the environmental impacts of power generation, without compromising their market competitiveness. To meet this challenge, this study demonstrates the application of data mining techniques for process optimization in a coal-fired power plant in Thailand with 97,920 data records. The main purpose is to determine which factors have a great impact on both (1) the heat rate (kJ/kWh) of electrical energy output and (2) the opacity of the flue gas exhaust emissions. As opposed to the traditional regression analysis currently employed at the plant, which is based on Microsoft Excel, more complex analytical models using SAS® Enterprise Miner help support managerial decisions to improve the overall performance of the existing energy infrastructure while reducing emissions through a change in the energy supply structure.
Jongsawas Chongwatpol, National Institute of Development Administration
Thanrawee Phurithititanapong, National Institute of Development Administration
Paper 1638-2014:
Institutional Research: Serving University Deans and Department Heads
Administrators at Western Kentucky University rely on the Institutional Research department to perform detailed statistical analyses to deepen the understanding of issues associated with enrollment management, student and faculty performance, and overall program operations. This paper presents several instances of analyses performed for the university to help it identify and recruit suitable candidates, uncover root causes in grade and enrollment trends, evaluate faculty effectiveness, and assess the impact of student characteristics, programs, or student activities on retention and graduation rates. The paper briefly discusses the data infrastructure created and used by Institutional Research. For each analysis performed, it reviews the SAS® program and key components of the SAS code involved. The studies presented include the use of SAS® Enterprise Miner to create a retention model incorporating dozens of student background variables. It shows an examination of grade trends in the same courses taught by different faculty and subsequent student behavior and success, providing insights into the nuances and subtleties of evaluating faculty performance. Another analysis uncovers the possible influence of fraternities and sororities in freshmen algebra courses. Two investigations explore the impact of programs on student retention and graduation rates. Each example and its findings illustrate how Institutional Research can support the administration of university operations. The target audience is any SAS professional interested in learning more about Institutional Research in higher education and how SAS software is used by an Institutional Research department to serve its organization.
Matthew Foraker, Western Kentucky University
Paper SAS036-2014:
Intermittent Demand Forecasting and Multi-tiered Causal Analysis
The use, limits, and misuse of statistical models in different industries are propelling new techniques and best practices in forecasting. Until recently, many factors such as data collection and storage constraints, poor data synchronization capabilities, technology limitations, and limited internal analytical expertise have made it impossible to forecast intermittent demand. In addition, integrating consumer demand data (that is, point-of-sale [POS]/syndicated scanner data from ACNielsen/ Information Resources Inc. [IRI]/Intercontinental Marketing Services [IMS]) to shipment forecasts was a challenge. This presentation gives practical how-to advice on intermittent forecasting and outlines a framework, using multi-tiered causal analysis (MTCA), that links demand to supply. The framework uses a process of nesting causal models together by using data and analytics.
Edward Katz, SAS
Paper 1320-2014:
Internal Credit Ratings: Industry's Norms and How To Get There with SAS®
This presentation addresses two main topics. The first focuses on the industry's norms and best practices for building internal credit ratings (PD, EAD, and LGD). Although there is no capital relief for local US banks using internal credit ratings (the US has not adopted the Internal Ratings-Based approach of Basel II, with the exception of the top 10 banks), there has been increased interest in credit ratings modeling over the last two years in the US banking industry. The main reason is the added value a bank can achieve from these ratings, and that is the focus of the second part of this presentation. It describes our journey (a client story) for getting there, introducing the SAS® project. Even more importantly, it describes how we use credit ratings to achieve effective credit risk management and get real added value out of that investment. The key success factor is to effectively implement ratings within the credit process and throughout decision making. Only then can ratings be used to improve risk-adjusted return on capital, which is the high-end objective of all of us.
Boaz Galinson, Bank Leumi
Paper 1863-2014:
Internet Gambling Behavioral Markers: Using the Power of SAS® Enterprise Miner 12.1 to Predict High-Risk Internet Gamblers
By analyzing the actual gambling behavior of individuals over the Internet, we develop markers that identify behavioral patterns, which in turn can be used to predict the level of gambling risk a subscriber is prone to. The data set contains 4,056 subscribers. Using SAS® Enterprise Miner 12.1, a set of models is run to predict which subscribers are likely to become high-risk Internet gamblers. The data contains 114 variables, such as the first active date and first active product used on the website, as well as characteristics of the game, such as fixed odds, poker, casino, and games. Other measures of a subscriber's activity, such as money put at stake and the odds being bet, are also included. These variables provide a comprehensive view of a subscriber's behavior while gambling on the website. The target variable is modeled as binary, with 0 indicating a risky gambler and 1 indicating a controlled gambler. The data is a typical example of real-world data with many missing values and hence had to be transformed and imputed before analysis. The model comparison algorithm of SAS Enterprise Miner 12.1 was used to determine the best model. Stepwise regression performs the best among a set of 25 models that were run using over 100 permutations of each model. The stepwise regression model predicts a high-risk Internet gambler with an accuracy of 69.63%, using variables such as wk4frequency and wk3frequency of bets.
Goutam Chakraborty, Oklahoma State University
Sai Vijay Kishore Movva, Oklahoma State University
Vandana Reddy, Oklahoma State University
J
Paper 1495-2014:
Jazz It Up a Little with Formats
Formats are an often undervalued tool in the SAS® toolbox. They can be used in just about all domains to improve the readability of a report, or they can be used as a look-up table to recode your data. Out of the box, SAS includes a multitude of ready-defined formats that can be applied without modification to address most recode and redisplay requirements. And if that's not enough, there is also a FORMAT procedure for defining your own custom formats. This paper looks at using some of the formats supplied by SAS in some innovative ways, but primarily focuses on the techniques we can apply in creating our own custom formats.
Brian Bee, The Knowledge Warehouse Ltd
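The dual role of a custom format, both display and look-up table, can be sketched in a few lines; the format and variable names are invented, with SASHELP.CLASS as a stand-in data set:

```sas
/* Define a custom format with PROC FORMAT */
proc format;
   value agegrp
      low -< 13  = 'Child'
      13  -< 18  = 'Teen'
      18  - high = 'Adult';
run;

/* Use it for display... */
proc freq data=sashelp.class;
   tables age;
   format age agegrp.;
run;

/* ...or as a look-up table to recode the data */
data recoded;
   set sashelp.class;
   group = put(age, agegrp.);   /* PUT applies the format as a recode */
run;
```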
L
Paper SAS119-2014:
Lessons Learned from SAS® 9.4 High-Availability and Failover Testing
SAS® 9.4 has improved clustering capabilities that allow for scalability and failover for middle-tier servers and the metadata server. In this presentation, we share our experiences with high-availability and failover testing done prior to SAS 9.4 availability. We discuss what we tested and lessons learned (good and bad) while doing the testing.
Susan Bartholow, SAS
Arthur Hunt, SAS
Renee Lorden, SAS
Paper 1702-2014:
Let SAS® Handle Your Job While You Are Not at Work!
Report automation and scheduling are very hot topics in many industries. They confer many advantages, including reduced workload, elimination of repetitive tasks, generation of accurate results, and better performance. This paper illustrates how to design an appropriate program to automate and schedule reports in SAS® 9.1 and SAS® Enterprise Guide® 5.1 using a SAS® server as well as the Windows Scheduler. The automation part covers formatting Microsoft Excel tables using XML, VBA coding, or other formats, and conditional automatic e-mailing with file attachments. We systematically walk through each step with a clear flow diagram from the data source to the final destination. We also discuss details of server-side and PC-side schedulers and how these schedulers invoke batch programs.
Anjan Matlapudi, AmerihealthCaritas
Paper 1845-2014:
Let the CAT Catch a STYLE
Being flexible and highlighting important details in your output is critical. The use of ODS ESCAPECHAR allows the SAS® programmer to insert inline formatting functions into variable values through the DATA step, and it makes for a quick and easy way to highlight specific data values or modify the style of the table cells in your output. What is an easier and more efficient way to concatenate those inline formatting functions to the variable values? This paper shows how the CAT functions can simplify this task.
Justin Bates, Cincinnati Children's Hospital Medical Center
Yanhong Liu, Cincinnati Children's Hospital Medical Center
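The combination the paper describes might look like the following sketch; the style attributes and the use of SASHELP.CLASS are illustrative assumptions:

```sas
/* CATS builds the ODS ESCAPECHAR inline-formatting string in one call */
ods escapechar='^';

data highlight;
   set sashelp.class;
   length name_fmt $80;
   if age > 14 then
      name_fmt = cats('^{style [color=red font_weight=bold]', name, '}');
   else name_fmt = name;
run;

proc print data=highlight noobs;
   var name_fmt age;
run;
```

CATS strips leading and trailing blanks as it concatenates, which keeps the ^{style ...} function and the variable value flush against each other without TRIM/LEFT gymnastics.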
Paper SAS1583-2014:
Leveraging Advanced Analytics to Create Customer-Centric Assortments
Traditional merchandise planning processes have been primarily product and location focused, with decisions about assortment selection, breadth and depth, and distribution based on the historical performance of merchandise in stores. However, retailers are recognizing that in order to compete and succeed in an increasingly complex marketplace, assortments must become customer-centric. Advanced analytics can be leveraged to generate actionable insights into the relevance of merchandise to a retailer's various customer segments and purchase channel preferences. These insights enrich the merchandise and assortment planning process. This paper describes techniques for using advanced analytics to impact customer-centric assortments. Topics covered include approaches for scoring merchandise based on customer relevance and preferences, techniques for gaining insight into customer relevance without customer data, and an overall approach to a customer-driven merchandise planning process.
Christopher Matz, SAS
Paper SAS216-2014:
Leveraging SAS® Visualization Technologies to Increase the Global Competency of the US Workforce
U.S. educators face a critical new imperative: to prepare all students for work and civic roles in a globalized environment, in which success increasingly requires the ability to compete, connect, and cooperate on an international scale. The Asia Society and the Longview Foundation are collaborating on a project to show both the need for and the supply of globally competent graduates. This presentation shows you how SAS assisted these organizations with a solution that leverages SAS® visualization technologies to produce a heatmap application. The application draws on over 300 indicators and surfaces over a quarter million data points in a highly interactive heatmap. It features a drillable map that shows data at the state level as well as at the county level for all 50 states. This endeavor involves new SAS® 9.4 technology to both combine the data and create the interface. You'll see how SAS procedures, such as PROC JSON, which came out in SAS 9.4, were used to prepare the data for the web application. The user interface demonstrates how SAS/GRAPH® output can be combined with popular JavaScript frameworks like Dojo and Twitter Bootstrap to create an HTML5 application that works on desktop, mobile, and tablet devices.
Jim Bauer, SAS
Paper SAS064-2014:
Log Entries, Events, Performance Measures, and SLAs: Understanding and Managing your SAS® Deployment by Leveraging the SAS® Environment Manager Data Mart
SAS® Environment Manager is included with the release of SAS® 9.4. This exciting new product enables administrators to monitor the performance and operation of their SAS® deployments. What very few people are aware of is that the data collected by SAS Environment Manager is stored in a centralized data mart that's designed to help administrators better understand the behavior and performance of the components of their SAS solution stack. This data mart could also be used to help organizations to meet their ITIL reporting and measurement requirements. In addition to the information about alerts, events, and performance metrics collected by the SAS Environment Manager agent technology, this data mart includes the metadata audit and content usage data previously available only from the SAS® Audit, Performance and Measurement Package.
Bob Bonham, SAS
Greg Smith, SAS
M
Paper 1769-2014:
Making It Happen: A Novel Way to Combine, Construct, Customize, and Implement Your Big Data with a SAS® Data Discovery, Analytics, and Fraud Framework in Medicare
A common complaint from users working on identifying fraud and abuse in Medicare is that teams focus on operational applications, static reports, and high-level outliers. But, when faced with the need to constantly evaluate changing Medicare provider and beneficiary or enrollee dynamics, users are clamoring for more dynamic and accurate detection approaches. Providing these organizations with a data discovery and predictive analytics framework that leverages Hadoop and other big data approaches, while giving teams a clear path to making more fact-based decisions more quickly, is very important in pre- and post-fraud and abuse analysis. Organizations that pursue a reusable, services-based data discovery and analytics framework and architecture enjoy greater success in supporting data management, reporting, and analytics demands. They can quickly turn models into prioritized alerts and avoid improper or fraudulent payments. A successful framework should enable organizations to come up with efficient fraud, waste, and abuse models to address complex schemes; identify fraud, waste, and abuse vulnerabilities; and shorten triage efforts using a variety of data sourced from big data platforms like Hadoop and other relational database management systems. This paper discusses the data management, data discovery, predictive analytics, and social network analysis capabilities that are included in the SAS fraud framework and how a unified approach can significantly reduce the lifecycle of building and deploying fraud models. We hope this paper will provide IT leaders with a clear path for resolving issues from the simple to the incredibly complex, through a measured and scalable approach for delivering value for fraud, waste, and abuse models by providing deep insights to support evidence-based investigations.
Vivek Sethunatesan, Northrop Grumman Corp
Paper 1556-2014:
Making the Log a Forethought Rather Than an Afterthought
When we start programming, we simply hope that the log comes out with no errors or warnings. Yet once we have programmed for a while, especially in the area of pharmaceutical research, we realize that having a log with specific, useful information in it improves quality and accountability. We discuss clearing the log, sending the log to an output file, helpful information to put in the log, which messages are permissible, automated log checking, adding messages regarding data changes, whether or not we want to see source code, and a few other log-related ideas. Hopefully, the log will become something that we keep in mind from the moment we start programming.
Emmy Pahmer, inVentiv Health Clinical
Paper SAS255-2014:
Managing Large Data with SAS® Scalable Performance Data Server Cluster Table Transactions
Today's business needs require 24/7 access to your data in order to perform complex queries against your analytical store. In addition, you might need to periodically update your analytical store to append new data, delete old data, or modify some existing data. Historically, the size of your analytical tables or the window in which the table must be updated can cause unacceptable downtime for queries. This paper describes how you can use new SAS® Scalable Performance Data Server 5.1 cluster table features to simulate transaction isolation in order to modify large sections of your cluster table. These features are optimized for extremely fast operation and can be used without affecting any ongoing queries. This satisfies both the continuous query access and the periodic update requirements that your business model places on your analytical store.
Guy Simpson, SAS
Paper 1300-2014:
Model Variable Selection Using Bootstrap Decision Tree
Bootstrapped Decision Tree is a variable selection method used to identify and eliminate uninformative variables from a large number of initial candidate variables. Candidates for subsequent modeling are identified by selecting the variables that consistently appear at the top of decision trees created using random samples of all possible modeling variables. The technique is best used to reduce hundreds of potential fields to a short list of 30 to 50 fields to be used in developing a model. This method for variable selection has recently become available in JMP® under the name BootstrapForest; this paper presents an implementation in Base SAS® 9. The method accepts but does not require a specific outcome to be modeled, and will therefore work for nearly any type of model, including segmentation, MCMC, and multiple discrete choice, in addition to standard logistic regression. Keywords: Bootstrapped Decision Tree, Variable Selection
David Corliss, Magnify Analytic Solutions
Paper 1873-2014:
Modeling Scale Usage Differences for a Better Understanding of Drivers of Customer Satisfaction
While survey researchers make great attempts to standardize their questionnaires, including the usage of rating scales, in order to collect unbiased data, respondents are still prone to introducing their own interpretation and bias to their responses. This bias can potentially affect the understanding of commonly investigated drivers of customer satisfaction and limit the quality of the recommendations made to management. One such problem is scale use heterogeneity, in which respondents do not employ a panoramic view of the entire scale range as provided, but instead focus on parts of the scale in giving their responses. Studies have found that bias arising from this phenomenon was especially prevalent in multinational research, e.g., respondents of some cultures being inclined to use only the neutral points of the scale. Moreover, personal variability in response tendencies further complicates the issue for researchers. This paper describes an implementation that uses a Bayesian hierarchical model to capture the distribution of heterogeneity while incorporating the information present in the data. More specifically, SAS® PROC MCMC is used to carry out a comprehensive modeling strategy of ratings data that accounts for individual-level scale usage. Key takeaways include an assessment of differences between key driver analyses that ignore this phenomenon versus the one that results from our implementation. Managerial implications are also emphasized in light of the prevalent use of more simplistic approaches.
Jorge Alejandro, Market Probe
Sharon Kim, Market Probe
Paper 1491-2014:
Modernizing Your Data Strategy: Understanding SAS® Solutions for Data Integration, Data Quality, Data Governance, and Master Data Management
For over three decades, SAS® has provided capabilities for beating your data into submission. In June of 2000, SAS acquired a company called DataFlux in order to add data quality capabilities to its portfolio. Recently, SAS folded DataFlux into the mother ship. With SAS® 9.4, SAS® Enterprise Data Integration Server and baby brother SAS® Data Integration Server were upgraded into a series of new bundles that still include the former DataFlux products, but those products have grown. These new bundles include data management, data governance, data quality, and master data management, and come in advanced and standard packaging. This paper explores these offerings and helps you understand what this means to both new and existing customers of SAS® Data Management and DataFlux products. We break down the marketing jargon and give you real-world scenarios of what customers are using today (prior to SAS 9.4) and walk you through what that might look like in the SAS 9.4 world. Each scenario includes the software that is required, descriptions of what each of the components does (features and functions), as well as the likely architectures that you might want to consider. Finally, for existing SAS Enterprise Data Integration Server and SAS® Data Integration Server customers, we discuss implications for migrating to SAS Data Management and detail some of the functionality that may be new to your organization.
Lisa Dodson, SAS
Greg Nelson, ThotWave
Paper 1790-2014:
Money Basketball: Optimizing Basketball Player Selection Using SAS®
Over the past decade, sports analytics has seen an explosion in research and model development to calculate wins, reaching cult popularity with the release of the film 'Moneyball.' The purpose of this paper is to explore the methodology of solving a real-life Moneyball problem in basketball. An optimal basketball lineup will be selected in an attempt to maximize the total points per game while maximizing court coverage. We will briefly review some of the literature that has explored this type of problem, traditionally called the maximum coverage problem (MCP) in operations research. An exploratory data analysis will be performed, including visualizations and clustering, in order to prepare the modeling data set for optimization. Finally, SAS® will be used to formulate an MCP problem, and additional constraints will be added to run different business scenarios.
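As a rough illustration of this kind of formulation (not the paper's actual model, which adds court-coverage terms), a minimal lineup-selection problem might look like this in PROC OPTMODEL, assuming a hypothetical work.players data set with name, ppg, and salary columns:

```sas
/* Minimal lineup-selection sketch in PROC OPTMODEL (SAS/OR).      */
/* Data set, variables, and constraint values are illustrative.    */
proc optmodel;
   set <str> PLAYERS;
   num ppg {PLAYERS};          /* points per game     */
   num salary {PLAYERS};       /* cost of each player */
   read data work.players into PLAYERS=[name] ppg salary;

   var Pick {PLAYERS} binary;  /* 1 = player in lineup */

   max TotalPoints = sum {p in PLAYERS} ppg[p] * Pick[p];
   con LineupSize: sum {p in PLAYERS} Pick[p] = 5;
   con SalaryCap:  sum {p in PLAYERS} salary[p] * Pick[p] <= 70000;

   solve;
   print Pick;
quit;
```

The business scenarios the abstract mentions would be run by adding or relaxing constraints like SalaryCap and re-solving.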
Sabah Sadiq, Deloitte
Jing Zhao, Deloitte Consulting
N
Paper 1811-2014:
NIHB Pharmacy Surveillance System
This presentation features implementation leads from SAS® Professional Services and Health Canada's Non-Insured Health Benefits (NIHB) program, on a joint implementation of SAS® Fraud Framework for Health Care. The presentation walks through the fast-paced implementation of NIHB's Pharmacy Surveillance System, which guards Canadian taxpayers from undue costs and protects the safety of NIHB clients. This presentation is a blend of project management and technical material, and presents both the client (NIHB) and consultant (SAS) perspectives throughout the story. The presentation converges onto several core principles needed to successfully deliver analytical solutions.
Ian Ghent, SAS
Jeffrey Menzies, Health Canada
Paper SAS164-2014:
Nitty Gritty Data Set Attributes
Most programmers are familiar with the directive "Know your data." But not everyone knows about all the data and metadata that a SAS® data set holds or understands what to do with this information. This presentation talks about the majority of these attributes, how to obtain them, why they are important, and what you can do with them. For example, data sets that have been around for a while might have an inordinate number of deleted observations that you are carrying around unnecessarily. Or you might be able to quickly check whether the data set is indexed and, if so, by what variables, in order to increase your program's performance. Also, engine-dependent data such as owner name and file size is found in PROC CONTENTS output, which is useful for understanding and managing your data. You can also use ODS output in order to use the values of these many attributes programmatically. This presentation shows you how.
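A minimal sketch of the ODS OUTPUT approach the abstract alludes to; the table names shown (Attributes, EngineHost) are the ODS tables PROC CONTENTS produces in recent releases:

```sas
/* Capture PROC CONTENTS attribute tables as data sets. */
ods output Attributes=work.attrs EngineHost=work.enghost;
proc contents data=sashelp.class;
run;

/* work.attrs now holds attribute/value pairs (observations,  */
/* indexes, and so on) that later program logic can test.     */
proc print data=work.attrs;
run;
```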
Diane Olson, SAS
Paper 1628-2014:
Non-Empirical Modeling: Incorporating Expert Judgment as a Model Input
In business environments, a common obstacle to effective data-informed decision making occurs when key stakeholders are reluctant to embrace statistically derived predicted values or forecasts. If concerns regarding model inputs, underlying assumptions, and limitations are not addressed, decision makers might choose to trust their gut and reject the insight offered by a statistical model. This presentation explores methods for converting potential critics into partners by proactively involving them in the modeling process and by incorporating simple inputs derived from expert judgment, focus groups, market research, or other directional qualitative sources. Techniques include biasing historical data, what-if scenario testing, and Monte Carlo simulations.
John Parker, GSK
Paper 1829-2014:
Nonnegative Least Squares Regression in SAS®
It is often the case that parameters in a predictive model should be restricted to an interval that is either reasonable or necessary given the model's application. A simple and classic example of such a restriction is a regression model that requires all parameters to be positive. In the case of multiple least squares (MLS) regression, the resulting model is therefore strictly additive and, in certain applications, not only appropriate but also intuitive. This special case of an MLS model is commonly referred to as nonnegative least squares regression. While Base SAS® contains a multitude of ways to perform a multiple least squares regression (PROC REG and PROC GLM, to name two), there exists no native SAS® procedure to conduct a nonnegative least squares regression. The author offers a concise way to conduct the nonnegative least squares analysis by using PROC NLIN, the nonlinear regression procedure, which allows the user to place restrictions on parameter estimates. By fashioning a linear model in the framework of a nonlinear procedure, the end result can be achieved. As an additional corollary, the author shows how to calculate the _RSQUARE_ statistic for the resulting model, which has been left out of the PROC NLIN output because it is invalid in most cases (though not ours).
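A minimal sketch of the approach, with hypothetical data set and variable names; the BOUNDS statement supplies the nonnegativity restriction that PROC REG and PROC GLM lack:

```sas
/* Nonnegative least squares via PROC NLIN: a linear predictor  */
/* fit in a nonlinear procedure so that BOUNDS can restrict the */
/* parameters. Data set and variable names are illustrative.    */
proc nlin data=work.train;
   parms b0=0 b1=0.1 b2=0.1;
   bounds b1 >= 0, b2 >= 0;       /* the nonnegativity restriction */
   model y = b0 + b1*x1 + b2*x2;
   output out=work.pred p=yhat r=resid;
run;
```

The _RSQUARE_ statistic the author discusses can then be computed by hand from the error and corrected total sums of squares that PROC NLIN reports.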
Matthew Duchnowski, Educational Testing Service (ETS)
O
Paper SAS023-2014:
OLAP Drill-through Table Considerations
When creating an OLAP cube, you have the option of specifying a drill-through table, also known as a Show Details table. This quick tip discusses the implications of using your detail table as your drill-through table and explores some viable alternatives.
Michelle Buchecker, SAS
Paper 1751-2014:
Ordering Columns in a SAS® Data Set: Should You Really RETAIN That?
When viewing and working with SAS® data sets, especially wide ones, it's often instinctive to rearrange the variables (columns) into some intuitive order. The RETAIN statement is one of the most commonly cited methods used for ordering variables. Though RETAIN can perform this task, its use as an ordering clause can cause a host of easily missed problems due to its intended function of retaining values across DATA step iterations. This risk is especially great for the more novice SAS programmer. Instead, two equally effective and less risky ways to order data set variables are recommended, namely, the FORMAT and SQL SELECT statements.
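The two recommended alternatives can be sketched as follows, using sashelp.class for illustration:

```sas
/* 1. A FORMAT statement (with no actual formats) placed before  */
/*    SET fixes the order in which the compiler first sees the   */
/*    variables, and therefore their column order.               */
data class_ordered;
   format name age sex height weight;
   set sashelp.class;
run;

/* 2. PROC SQL simply lists the columns in the desired order. */
proc sql;
   create table class_sql as
   select name, age, sex, height, weight
   from sashelp.class;
quit;
```

Unlike RETAIN, neither technique changes how values are carried across DATA step iterations.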
Andrew Clapson, Statistics Canada
P
Paper 1723-2014:
P-values: Democrats or Dictators?
Part of being a good analyst and statistician is being able to understand the output of a statistical test in SAS®. P-values are ubiquitous in statistical output as well as medical literature and can be the deciding factor in whether a paper gets published. This shows a somewhat dictatorial side of them. But do we really know what they mean? In a democratic process, people vote for another person to represent them, their values, and their opinions. In this sense, the sample of research subjects, their characteristics, and their experience, are combined and represented to a certain degree by the p-value. This paper discusses misconceptions about and misinterpretations of the p-value, as well as how things can go awry in calculating a p-value. Alternatives to p-values are described, with advantages and disadvantages of each. Finally, some thoughts about p-value interpretation are given. To disarm the dictator, we need to understand what the democratic p-value can tell us about what it represents...and what it doesn't. This presentation is aimed at beginning to intermediate SAS statisticians and analysts working with SAS/STAT®.
Brenda Beaty, University of Colorado
Michelle Torok, University of Colorado
Paper 1738-2014:
PROC STREAM and SAS® Server Pages: Generating Custom HTML Reports
ODS is a powerful tool for generating HTML-based reports. Quite often, however, there are exacting requirements for report content, layout, and placement that can be met with HTML (and especially HTML5) but can't be met with ODS. This presentation shows several examples that use PROC STREAM and SAS® Server Pages in batch (for example, scheduled tasks), in SAS® Display Manager, or in SAS® Enterprise Guide® to generate such custom reports. And yes, despite the name SAS Server Pages, this technology, including the use of jQuery widgets, does apply to batch environments. This paper describes and shows several examples that are similar to those presented in the SAS® Press book SAS Server Pages: Generating Dynamic Content (http://support.sas.com/publishing/authors/extras/64993b.html) and on the author's blog Jurassic SAS in the BI/EBI World (http://hcsbi.blogspot.com/): creating a custom calendar; a sample mail-merge application; generating a custom Microsoft Excel-based report; and generating an expanding drill-down table.
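A bare-bones PROC STREAM sketch (the macro variable and file name are hypothetical): the free-form text after BEGIN is streamed to the output file with macro references resolved, and the four-semicolon token ends the input:

```sas
/* Stream arbitrary HTML, with macro references resolved,  */
/* to a file. Names here are illustrative only.            */
%let reptitle = Monthly Sales;
filename rpt "custom_report.html";

proc stream outfile=rpt;
BEGIN
<html><head><title>&reptitle</title></head>
<body><h1>&reptitle</h1></body></html>
;;;;
```

Because the text is not parsed as SAS code, any HTML5 markup, CSS, or JavaScript can appear between BEGIN and the terminator.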
Don Henderson, Henderson Consulting Services
Paper 1737-2014:
PROC STREAM and SAS® Server Pages: Generating Custom User Interfaces
Quite often when building web applications that use either the SAS® Stored Process Server or the SAS/IntrNet® Applications Dispatcher, it is necessary to create a custom user interface to prompt for the needed parameters. For example, generating a custom user interface can be accomplished by chaining stored processes together. The first stored process generates the user interface where the user selects the desired options, using PROC STREAM to process and input SAS® Server Pages that display the user interface. The second (or later) stored process in the chain generates the desired output. This paper describes and shows several examples similar to those presented in the SAS® Press book SAS Server Pages: Generating Dynamic Content (http://support.sas.com/publishing/authors/extras/64993b.html) and on the author's blog Jurassic SAS in the BI/EBI World (http://hcsbi.blogspot.com/).
Don Henderson, Henderson Consulting Services
Paper SAS329-2014:
Parallel Data Preparation with the DS2 Programming Language
A time-consuming part of statistical analysis is building an analytic data set for statistical procedures. Whether it is recoding input values, transforming variables, or combining data from multiple data sources, the work to create an analytic data set can take time. The DS2 programming language in SAS® 9.4 simplifies and speeds data preparation with user-defined methods, storing methods and attributes in shareable packages, and threaded execution on multi-core SMP and MPP machines. Come see how DS2 makes your job easier.
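A minimal threaded DS2 sketch, with illustrative data set and variable names: the thread program holds the row-level transformation, and the data program runs several copies of it in parallel:

```sas
/* Threaded DS2 sketch: recode rows in parallel.            */
/* Data set and variable names are illustrative only.       */
proc ds2;
   thread recode / overwrite=yes;
      method run();
         set work.raw_claims;
         /* example transformation */
         if amount < 0 then amount = 0;
      end;
   endthread;

   data work.analytic / overwrite=yes;
      dcl thread recode t;
      method run();
         set from t threads=4;   /* run 4 copies of the thread */
      end;
   enddata;
run;
quit;
```

On an SMP machine the four thread instances divide the input rows among themselves, so the transformation work overlaps rather than running serially.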
Greg Otto, Teradata Corporation
Robert Ray, SAS
Jason Secosky, SAS
Paper SAS130-2014:
Plotting Against Cancer: Creating Oncology Plots Using SAS®
Graphs in oncology studies are essential for getting more insight about the clinical data. This presentation demonstrates how ODS Graphics can be effectively and easily used to create graphs used in oncology studies. We discuss some examples and illustrate how to create plots like drug concentration versus time plots, waterfall charts, comparative survival plots, and other graphs using Graph Template Language and ODS Graphics procedures. These can be easily incorporated into a clinical report.
Debpriya Sarkar, SAS
Paper 1851-2014:
Predicting a Child Smoker Using SAS® Enterprise Miner 12.1
Over the years, there has been a growing concern about consumption of tobacco among youth, but no concrete studies have been done to find what exactly leads children to start consuming tobacco. This study is an attempt to figure out the potential reasons. Through our analysis, we have also tried to build a model to predict whether a child will smoke next year or not. This study is based on the 2011 National Youth Tobacco Survey data of 18,867 observations. In order to prepare the data for insightful analysis, imputation operations were performed using tree-based imputation methods. From a pool of 197 variables, 48 key variables were selected using variable selection methods, partial least squares, and decision tree models. Logistic regression and decision tree models were built to predict whether a child will smoke in the next year or not. Comparing the models using misclassification rate as the selection criterion, we found that the stepwise logistic regression model outperformed the other models with a validation misclassification rate of 0.028497, 47.19% sensitivity, and 95.80% specificity. Factors such as the company of friends, cigarette brand ads, accessibility to tobacco products, and passive smoking turned out to be the most important predictors in determining a child smoker. After this study, we could outline some important findings, such as that the odds of a child taking up smoking are 2.17 times higher when his close friends are also smoking.
Goutam Chakraborty, Oklahoma State University
Jin Ho Jung, Oklahoma State University
Gaurav Pathak, Oklahoma State University
Paper 1859-2014:
Prediction of the Rise in Sea Level by the Memory-Based Reasoning Model Using SAS® Enterprise Miner 12.1
An increase in sea levels is a potential problem that is affecting the human race and the marine ecosystem. Many models are being developed to find the factors that are responsible for it. In this research, the Memory-Based Reasoning model looks more effective than most other models, because it takes previous solutions and predicts the solutions for forthcoming cases. The data was collected from NASA. It contains 1,072 observations and 10 variables, such as emissions of carbon dioxide, temperature, and other contributing factors like electric power consumption, total number of industries established, and so on. Results of Memory-Based Reasoning models (RD tree and scan tree) are compared with neural network, decision tree, and logistic regression models. Fit statistics, such as misclassification rate and average squared error, are used to evaluate model performance. This analysis is used to predict the rise in sea levels in the near future and to take the necessary actions to protect the environment from global warming and natural disasters.
Prasanna K S Sailaja Bhamidi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 1634-2014:
Productionalizing SAS® for Enterprise Efficiency At Kaiser Permanente
In this session, you learn how Kaiser Permanente has taken a centralized production support approach to using SAS® Enterprise Guide® 4.3 in the healthcare industry. Kaiser Permanente Northwest (KPNW) has designed standardized processes and procedures that have allowed KPNW to streamline the support of production content, enabling KPNW analytical resources to focus more on new content development than on maintenance and support of steady-state programs and processes. We started with over 200 individual SAS® processes across four different SAS platforms (SAS Enterprise Guide, mainframe SAS®, PC SAS®, and SAS® Data Integration Studio) and worked to standardize our development approach on SAS Enterprise Guide and build efficient and scalable processes within our department and across the region. We walk through the need for change and how the team was set up, provide an overview of the UNIX SAS platform, walk through the standard production requirements (developer pack), and review lessons learned.
Ryan Henderson, Kaiser Permanente
Karl Petith, Kaiser Permanente
Paper 1772-2014:
Programming in a Distributed Data Network Environment: A Perspective from the Mini-Sentinel Pilot Project
Multi-site, health science-related distributed data networks are becoming increasingly popular, particularly at a time when big data and privacy are often competing priorities. Distributed data networks allow individual-level data to remain behind the firewall of the data holder, permitting the secure execution of queries against those local data and the return of aggregated data produced from those queries to the requester. These networks allow the use of multiple, varied sources of data for study purposes ranging from public health surveillance to comparative effectiveness research, without compromising data holders' concerns surrounding data security, patient privacy, or proprietary interests. This paper focuses on the experiences of the Mini-Sentinel pilot project as a case study for using SAS® to design and build infrastructure for a successful multi-site, collaborative, distributed data network. Mini-Sentinel is a pilot project sponsored by the U.S. Food and Drug Administration (FDA) to create an active surveillance system, the Sentinel System, to monitor the safety of FDA-regulated medical products. The paper focuses on the data and programming aspects of distributed data networks but also visits governance and administrative issues as they relate to the maintenance of a technical network.
Jennifer Popovic, Harvard Pilgrim Health Care Institute/Harvard Medical School
Paper SAS156-2014:
Putting on the Ritz: New Ways to Style Your ODS Graphics to the Max
Do you find it difficult to dress up your graphs for your reports or presentations? SAS® 9.4 introduced new capabilities in ODS Graphics that give you the ability to style your graphs without creating or modifying ODS styles. Some of the new capabilities include the following: a new option for controlling how ODS styles are applied; graph syntax for overriding ODS style attributes for grouped plots; the ability to define font glyphs and images as plot markers; and enhanced attribute map support. In this presentation, we discuss these new features in detail, showing examples in the context of Graph Template Language and ODS Graphics procedures.
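For example, the SAS 9.4 STYLEATTRS statement lets a graph override grouped-plot attributes without touching any ODS style; this minimal sketch uses sashelp.class:

```sas
/* Override grouped-plot colors and markers in place,  */
/* leaving the active ODS style unchanged.             */
proc sgplot data=sashelp.class;
   styleattrs datacontrastcolors=(green red)
              datasymbols=(circlefilled trianglefilled);
   scatter x=height y=weight / group=sex;
run;
```

The listed colors and symbols are cycled through the group values in order, replacing the style's own rotation lists for this one graph.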
Dan Heath, SAS
Q
Paper 1868-2014:
Queues for Newbies: How to Speak LSF in a SAS® World
Can you juggle? Maybe. Can you shuffle a deck of cards? Probably. Can you do both at the same time? Welcome to the world of SAS® and LSF! Very few SAS administrators start out learning LSF at the same time they learn SAS; most already know SAS, possibly starting out as a programmer or analyst, but now have to step up to an enterprise platform with shared resources. The biggest challenge on an enterprise platform? How to share! How do you maximize the utilization of a SAS platform, yet still ensure everyone gets their fair share? This presentation boils down the 2000+ pages of LSF documentation to provide an introduction to various LSF concepts: hosts, clusters, nodes, queues, first-come-first-served versus fairshare scheduling, and various configuration settings (UJOB_LIMIT, PJOB_LIMIT, and so on). It also offers some insight into where to configure these settings: which are set up by the installation process, and which can be configured by the SAS or LSF administrator. This session is definitely NOT for experts. It is for those about to step into an enterprise deployment of SAS who want to understand how the SAS server sessions they know so well can run on a shared platform.
Andrew Howell, ANJ Solutions
Paper 1459-2014:
Quick Hits: My Favorite SAS® Tricks
Are you time-poor and code-heavy? It's easy to get into a rut with your SAS® code, and it can be time-consuming to spend your time learning and implementing improved techniques. This presentation is designed to share quick improvements that take five minutes to learn and about the same time to implement. The quick hits are applicable across versions of SAS and require only Base SAS® knowledge. Included topics are: simple macro tricks; little-known functions that get rid of messy coding; dynamic conditional logic; data summarization; tips to reduce data and processing; and testing and space utilization tips. This presentation has proven valuable to beginner through experienced SAS users.
Marje Fecht, Prowerk Consulting
R
Paper 1401-2014:
Reading In Data Directly from Microsoft Word Questionnaire Forms
If someone comes to you with hundreds of questionnaire forms in Microsoft Word file format and asks you to extract the data from the forms into a SAS® data set, you might have several ways to handle this. However, as a SAS programmer, the shortcut is to write a SAS program to read the data directly from the Word files into a SAS data set. This paper shows how it can be done with simple SAS programming skills, such as using FILENAME with the PIPE option, DDE, the CALL EXECUTE routine, and so on.
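One building block for this task is using FILENAME with the PIPE option to collect the list of Word files to process; the directory path here is purely illustrative (Windows syntax):

```sas
/* List the questionnaire files so each can be processed in   */
/* turn. The directory path is a hypothetical example.        */
filename flist pipe 'dir /b "C:\forms\*.doc"';

data word_files;
   infile flist truncover;
   input fname $256.;
run;
```

Each file name in word_files could then drive later steps, for example through the CALL EXECUTE routine or a DDE connection to Word.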
Sijian Zhang, VA Pittsburgh Healthcare System
Paper 1274-2014:
Reducing Medical Costs Using a System Based on SAS® for Provider and Facility Efficiency Scoring
Healthcare expenditure growth continues to be a prominent healthcare policy issue, and the uncertain impact of the Affordable Care Act (ACA) has put increased pressure on payers to find ways to exercise control over costs. Fueled by provider performance analytics, BCBSNC has developed innovative strategies that recognize, reward, and assist providers delivering high-quality and efficient care. A leading strategy has been the introduction of a new tiered network product called Blue Select, which was launched in 2013 and will be featured in the State Health Exchange. Blue Select is a PPO with differential member cost-sharing for tiered providers. Tier status of providers is determined by comparing providers to their peers on the efficiency and quality of the care they deliver. Providers who meet or exceed the standard for quality, measured using Healthcare Effectiveness Data and Information Set (HEDIS) adherence rates and potentially avoidable complication rates for certain procedures, are then evaluated on their case-mix adjusted costs for total episodic costs. Each practice's performance is compared, through indirect standardization, to expected performance, given the patients and conditions treated within the practice. A ratio of observed to expected performance is calculated for both cost and quality to use in determining the tier status of providers for Blue Select. While the primary goal of provider tiering is cost containment through member steerage, the initiative has also resulted in new and strengthened collaborative relationships between BCBSNC and providers. The strategy offers the opportunity to bend the cost curve and provide meaningful change in the quality of healthcare delivery.
Stephanie Poley, BCBSNC
Paper 1808-2014:
Reevaluating Policy and Claims Analytics: A Case of Corporate Fleet and Non-Fleet Customers in the Automobile Insurance Industry
Analyzing automobile policies and claims is an ongoing area of interest to the insurance industry. Although there have been many data mining projects in the insurance sector over the past decade, common questions remain: How can insurance firms retain their best customers? Will this damaged car be covered and receive a claim payment? How large will the loss of claims associated with this policy be? This study applies data mining techniques using SAS® Enterprise Miner to enhance insurance policies and claims. The main focus is on assessing how corporate fleet customers' policy characteristics and claim behavior differ from those of non-fleet customers. With more than 100,000 data records, implementing advanced analytics helps create better planning for policy and claim management strategy.
Jongsawas Chongwatpol, National Institute of Development Administration
Kittipong Trongsawad, National Institute of Development Administration
Paper 1846-2014:
Revealing Human Mobility Behavior and Trips Prediction Based on Mobile Data Records
This paper reveals the human mobility behavior in the metropolitan area of Rio de Janeiro, Brazil. The base for this study is the mobile phone data provided by one of the largest mobile carriers in Brazil. Mobile phone data comprises a reasonable variety of information, including data about time and location for call activity throughout urban areas. This information can be used to build users' trajectories over time, describing the major characteristics of urban mobility within the city. A variety of distribution analyses is presented in this paper, aiming to clearly describe the most relevant characteristics of overall mobility in the metropolitan area of Rio de Janeiro. In addition, methods from physics that describe trends in trips, such as gravity and radiation models, were computed and compared in terms of granularity of the geographic scales and also in relation to traditional data mining approaches such as linear regression. A brief comparison of performance in predicting the number of trips between pairs of locations is presented at the end.
Carlos Andre Reis Pinheiro, KU Leuven
Paper 1783-2014:
Revealing Unwarranted Access to Sensitive Data: A Scenario-based Approach
The project focuses on using analytics to reveal unwarranted use of access to medical records, i.e. employees in health organizations that access information about neighbours, friends, celebrities, etc., without a sound reason to do so. The method is based on the natural assumption that the vast majority of lookups are legitimate; lookups that differ from statistically defined normal behavior will be subject to manual investigation. The work was carried out in collaboration between SAS Institute Norway and the largest Norwegian hospital, Oslo University Hospital (OUS), and was aimed at establishing whether the method is suitable for unveiling unwarranted lookups in medical records. A number of so-called scenarios are used to indicate adverse behaviour, each responsible for looking at one particular aspect of journal access data. For instance, one scenario determines the timeliness of a lookup relative to the patient's admission history; another judges whether the medical competency of the employee is relevant to the situation of the patient at the time of the lookup. We have so far designed and developed a library of around 20 scenarios that together are used in weighted combination to render a final judgment of the appropriateness of the lookup. The approach has proven highly successful, and a further development of these ideas is currently under way, the aim of which is to establish a joint Norwegian solution to the problem of unwarranted access. Furthermore, we believe that the approach and the framework may be utilised in many other industries where sensitive data is being processed, such as financial, police, tax and social services. In this paper, the method is outlined, as well as results of its application on data from OUS.
Torulf Mollestad, SAS
Heidi Thorstensen, Oslo University Hospital
S
Paper SAS181-2014:
SAS/STAT® 13.1 Round-Up
SAS/STAT® 13.1 brings valuable new techniques to all sectors of the audience for SAS statistical software. Updates for survival analysis include nonparametric methods for interval censoring and models for competing risks. Multiple imputation methods are extended with the addition of sensitivity analysis. Bayesian discrete choice models offer a modern approach for consumer research. Path diagrams are a welcome addition to structural equation modeling, and item response models are available for educational assessment. This paper provides overviews and introductory examples for each of the new focus areas in SAS/STAT 13.1. The paper also provides a sneak preview of the follow-up release, SAS/STAT 13.2, which brings additional strategies for missing data analysis and other important updates to statistical customers.
Bob Rodriguez, SAS
Maura Stokes, SAS
Paper 1265-2014:
SAS® Enterprise Guide®--Your Gateway to SAS®
SAS® Enterprise Guide® has become the place through which many SAS® users access the power of SAS. Some like it, some loathe it, some have never known anything else. In my experience, the following attitudes prevail regarding the product: 1) I don't know what SAS is, but I can use a mouse and I know what my business needs are. 2) I've used SAS before, but now my company has moved to SAS Enterprise Guide and I love it! 3) I've used SAS before, but now my company has done something really stupid. SAS Enterprise Guide offers a place to learn as well as work. The product offers environments for point-and-click for those who want that, and a type-your-code-with-semi-colons environment for those who want that. Even better, a user can mix and match, using the best of both worlds. I show that SAS Enterprise Guide is a great place for building up business solutions using a step-by-step method, how we can make the best of both environments, and how we can dip our toes into parts of SAS that might have frustrated us in the past and made us run away and cry "I'll do it in Excel!" I demonstrate that there are some very nice aspects to SAS Enterprise Guide, out of the box, that are often ignored but that can improve the overall SAS experience. We look at my personal nemeses, SAS/GRAPH® and PROC TABULATE, with a side-trip to the mysterious world that is ODS, or the Output Delivery System.
Dave Shea, Skylark Limited
Paper SAS1585-2014:
SAS® Retail Road Map
This presentation provides users with an update on retail solution releases that have occurred in the past year and a roadmap for moving forward.
Saurabh Gupta, SAS
Paper SAS111-2014:
SAS® UNIX Utilities and What They Can Do for You
The UNIX host group delivers many utilities that go unnoticed. What are these utilities, and what can they tell you about your SAS® system? Are you having authentication problems? Are you unable to get a result from a workspace server? What hot fixes have you applied? These are subjects that come up during a tech support call. It would be good to have background information about these tools before you have to use them.
Jerry Pendergrass, SAS
Paper 2027-2014:
SAS® and Java Application Integration for Dummies
Traditionally, Java web applications interact with back-end databases by means of JDBC/ODBC connections to retrieve and update data. With the growing need for real-time charting and complex analysis types of data representation on these types of web applications, SAS® computing power can be put to use by adding a SAS web service layer between the application and the database. This paper shows how a SAS web service layer can be used to render data to a Java application in a summarized form using SAS® Stored Processes. This paper also demonstrates how inputs can be passed to a SAS Stored Process based on which computations/summarizations are made before output parameters and/or output data streams are returned to the Java application. SAS Stored Processes are then deployed as SAS® BI Web Services using SAS® Management Console, which are available to the Java application as a URL. We use the SOAP method to interact with the web services, with XML data representation as the communication medium. We then illustrate how RESTful web services can be used with JSON objects being the communication medium between the Java application and SAS in SAS® 9.3. Once this pipeline communication between the application, SAS engine, and database is set up, any complex manipulation or analysis as supported by SAS can be incorporated into the SAS Stored Process. We then illustrate how graphs and charts can be passed as outputs to the application.
Hariharasudhan Duraidhayalu, Kavi Associates
Neetha Sindhu, Kavi Associates
Mingming Wang, Kavi Associates
Paper 1631-2014:
SAS® as a Code Manipulation Language: An Example of Writing a Music Exercise Book with Lilypond and SAS.
Using Lilypond typesetting software, you can write publication-grade music scores. The input for Lilypond is a text file that can be written once and then transferred to SAS® for patterned repetition, so that you can cycle through patterns that occur in music. The author plays a sequence of notes and then writes this into Lilypond code. The sequence starts in the key of C with only a two-note sequence. Then the sequence is extended to three-, four-, then five-note sequences, always contained in one octave. SAS is then used to write the same code for all other eleven keys and in seven scale modes. The method is very simple and not advanced programming. Lookup files are used in the programming, demonstrating efficient lookup techniques. The result is a lengthy book or exercise for practicing music in a PDF file, and a sound source file in midi format is created that you can hear. This method shows how various programming languages can be used to write other programming languages.
Peter Timusk, Statistics Canada
Paper SAS072-2014:
SAS® in the Enterprise: A Primer on SAS® Architecture for IT
How does the SAS® server architecture fit within your IT infrastructure? What functional aspects does the architecture support? This session helps attendees understand the logical server topology of the SAS technology stack: resource and process management, in-memory architecture, and in-database processing. The session also discusses process flows from data acquisition through analytical information to visual insight. IT architects, data administrators, and IT managers from all industries should leave with an understanding of how SAS has evolved to better fit into the IT enterprise and to help IT's internal customers make better decisions.
Gary Spakes, SAS
Paper 1321-2014:
Scatterplots: Basics, Enhancements, Problems, and Solutions
The scatter plot is a basic tool for examining the relationship between two variables. While the basic plot is good, enhancements can make it better. In addition, there might be problems of overplotting. In this paper, I cover ways to create basic and enhanced scatter plots and to deal with overplotting.
Peter Flom, Peter Flom Consulting
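Editor's sketch (not from the paper): one common ODS Graphics remedy for overplotting combines marker transparency with jitter and an overlaid smoother. The JITTER option assumes SAS 9.4 or later.

```sas
/* Basic scatter plot enhanced to ease overplotting */
proc sgplot data=sashelp.cars;
   scatter x=horsepower y=mpg_city /
           transparency=0.7                       /* see through stacked points */
           jitter                                  /* nudge coincident points apart */
           markerattrs=(symbol=circlefilled size=6);
   loess x=horsepower y=mpg_city / nomarkers;      /* overlay a trend smoother */
run;
```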
Paper 1760-2014:
Scenarios Where Utilizing a Spline Model in Developing a Regression Model Is Appropriate
Linear regression has been a widely used approach in social and medical sciences to model the association between a continuous outcome and the explanatory variables. Assessing the model assumptions, such as linearity, normality, and equal variance, is a critical step for choosing the best regression model. If any of the assumptions are violated, one can apply different strategies to improve the regression model, such as performing transformation of the variables or using a spline model. SAS® has been commonly used to assess and validate the postulated model and SAS® 9.3 provides many new features that increase the efficiency and flexibility in developing and analyzing the regression model, such as ODS Statistical Graphics. This paper aims to demonstrate necessary steps to find the best linear regression model in SAS 9.3 in different scenarios where variable transformation and the implementation of a spline model are both applicable. A simulated data set is used to demonstrate the model developing steps. Moreover, the critical parameters to consider when evaluating the model performance are also discussed to achieve accuracy and efficiency.
Ning Huang, University of Southern California
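As an illustration of the spline strategy the abstract mentions (a sketch, not the paper's code; the data set SIM and variables X, Y are hypothetical), the EFFECT statement in PROC GLMSELECT can generate a spline basis for a predictor:

```sas
/* Fit a spline term for x with knots at percentiles of its distribution */
proc glmselect data=sim plots=all;
   effect spl = spline(x / knotmethod=percentiles(5));  /* 5 interior knots */
   model y = spl / selection=none;                      /* fit as specified */
run;
```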
Paper SAS291-2014:
Secret Experts Exposed: Using Text Analytics to Identify and Surface Subject Matter Experts in the Enterprise
All successful organizations seek ways of communicating the identity of subject matter experts to employees. This information exists as common knowledge when an organization is first starting out, but the common knowledge becomes fragmented as the organization grows. SAS® Text Analytics can be used on an organization's internal unstructured data to reunite these knowledge fragments. This paper demonstrates how to extract and surface this valuable information from within an organization. First, the organization's unstructured textual data are analyzed by SAS® Enterprise Content Categorization to develop a topic taxonomy that associates subject matter with subject matter experts in the organization. Then, SAS Text Analytics can be used successfully to build powerful semantic models that enhance an organization's unstructured data. This paper shows how to use those models to process and deliver real-time information to employees, increasing the value of internal company information.
Richard Crowell, SAS
Saratendu Sethi, SAS
Paper SAS177-2014:
Secrets from a SAS® Technical Support Guy: Combining the Power of the Output Delivery System with Microsoft Excel Worksheets
Business analysts commonly use Microsoft Excel with the SAS® System to answer difficult business questions. While you can use these applications independently of each other to obtain the information you need, you can also combine the power of those applications, using the SAS Output Delivery System (ODS) tagsets, to completely automate the process. This combination delivers a more efficient process that enables you to create fully functional and highly customized Excel worksheets within SAS. This paper starts by discussing common questions and problems that SAS Technical Support receives from users when they try to generate Excel worksheets. The discussion continues with methods for automating Excel worksheets using ODS tagsets and customizing your worksheets using the CSS style engine and extended tagsets. In addition, the paper discusses tips and techniques for moving from the current MSOffice2K and ExcelXP tagsets to the new Excel destination, which generates output in the native Excel 2010 format.
Chevell Parker, SAS
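For orientation, a minimal example of the newer Excel destination the paper discusses (shipped with SAS 9.4; the suboptions shown are common ones and should be checked against current documentation):

```sas
/* Write a native .xlsx file directly from ODS */
ods excel file="class.xlsx"
    options(sheet_name="Students"
            frozen_headers="on"          /* keep the header row visible */
            autofilter="all");           /* add drop-down column filters */
proc print data=sashelp.class noobs;
run;
ods excel close;
```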
Paper SAS105-2014:
So Much Software, So Little Time: Deploying SAS® Onto Oodles of Machines
Distributing SAS® software to a large number of machines can be challenging at best and exhausting at worst. Common areas of concern for installers are silent automation, network traffic, ease of setup, standardized configurations, maintainability, and simply the sheer amount of time it takes to make the software available to end users. We describe a variety of techniques for easing the pain of provisioning SAS software, including the new standalone SAS® Enterprise Guide® and SAS® Add-in for Microsoft Office installers, as well as the tried and true SAS® Deployment Wizard record and playback functionality. We also cover ways to shrink SAS Software Depots, like the new 'subsetting recipe' feature, in order to ease scenarios requiring depot redistribution. Finally, we touch on alternate methods for workstation access to SAS client software, including application streaming, desktop virtualization, and Java Web Start.
Mark Schneider, SAS
Paper 1645-2014:
Speed Dating: Looping Through a Table Using Dates
Have you ever needed to use dates as values to loop through a table? For example, how many events occurred by 1, 2, 3, ... n months ahead? Maybe you just changed the dates manually and re-ran the query n times? This is a common need in economic and behavioral sciences. This presentation demonstrates how to create a table of dates that can be used with SAS® macro variables to loop through a table. Using this dates table in combination with the SAS macro %DO loop ensures accuracy and saves time.
Scott Fawver, Arch Mortgage Insurance Company
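One way to sketch the technique described above (table and variable names are hypothetical): build a dates table with INTNX, load each date into a macro variable, and loop the query over the rows.

```sas
/* Build a table of month-end cutoff dates */
data cutoffs;
   do i = 1 to 6;
      cutoff = intnx('month', '01JAN2014'd, i, 'end');
      output;
   end;
   format cutoff date9.;
run;

%macro events_by_cutoff;
   proc sql noprint;
      select count(*) into :n trimmed from cutoffs;
   quit;
   %do j = 1 %to &n;
      data _null_;                        /* fetch the j-th cutoff date */
         set cutoffs(firstobs=&j obs=&j);
         call symputx('cut', cutoff);
      run;
      proc sql;                           /* same query, new cutoff */
         select count(*) as events_to_date
            from events                   /* hypothetical events table */
            where event_date <= &cut;
      quit;
   %end;
%mend events_by_cutoff;
%events_by_cutoff
```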
Paper 1586-2014:
Stylish Waterfall Graphs Using SAS® 9.3 and SAS® 9.4 Graph Template Language
One beautiful graph provides visual clarity of data summaries reported in tables and listings. Waterfall graphs show, at a glance, the increase or decrease of data analysis results from various industries. The introduction of SAS® 9.2 ODS Statistical Graphics enables SAS® programmers to produce high-quality results with less coding effort. Also, SAS programmers can create sophisticated graphs in stylish custom layouts using the SAS® 9.3 Graph Template Language and ODS style template. This poster presents two sets of example waterfall graphs in the setting of clinical trials using SAS® 9.3 and later. The first example displays colorful graphs using new SAS 9.3 options. The second example displays simple graphs with gray-scale color coding and patterns. SAS programmers of all skill levels can create these graphs on UNIX or Windows.
Setsuko Chiba, Exelixis Inc.
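Although the paper works in Graph Template Language, the core waterfall idea can be sketched more briefly with PROC SGPLOT (SAS 9.4 syntax; the data set TUMOR and its variables are hypothetical):

```sas
/* Waterfall: per-subject % change from baseline, bars sorted high to low */
proc sgplot data=tumor;
   vbar subjid / response=pctchg             /* % change from baseline */
                 group=resptype              /* color bars by response type */
                 categoryorder=respdesc;     /* sort bars by response, descending */
   refline -30 / lineattrs=(pattern=dash);   /* e.g., partial-response threshold */
   xaxis display=(novalues noticks) label='Subjects';
run;
```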
Paper 1443-2014:
Summarizing Data for a Systematic Review
Systematic reviews have become increasingly important in healthcare, particularly when there is a need to compare new treatment options and to justify clinical effectiveness versus cost. This paper describes a method in SAS/STAT® 9.2 for computing weighted averages and weighted standard deviations of clinical variables across treatment options while correctly using these summary measures to make accurate statistical inference. The analyses of data from systematic reviews typically involve computations of weighted averages and comparisons across treatment groups. However, the application of the TTEST procedure does not currently take into account weighted standard deviations when computing p-values. The use of a default non-weighted standard deviation can lead to incorrect statistical inference. This paper introduces a method for computing correct p-values using weighted averages and weighted standard deviations. Given a data set containing variables for three treatment options, we want to make pairwise comparisons of three independent treatments. This is done by creating two temporary data sets using PROC MEANS, which yields the weighted means and weighted standard deviations. We then perform a t-test on each temporary data set. The resultant data sets containing all comparisons of each treatment option are merged and then transposed to obtain the necessary statistics. The resulting output provides pairwise comparisons of each treatment option and uses the weighted standard deviations to yield the correct p-values in a desired format. This method allows the use of correct weighted standard deviations using PROC MEANS and PROC TTEST in summarizing data from a systematic review while providing correct p-values.
Ravi Tejeshwar Reddy Gaddameedi, California State University, East Bay
Usha Kreaden, Intuitive Surgical
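The workflow this abstract outlines might be sketched as follows (data set and variable names are hypothetical): PROC MEANS computes the weighted statistics, which are then reshaped into the _STAT_ layout that PROC TTEST accepts as summary-statistics input.

```sas
/* Weighted means and standard deviations by treatment group */
proc means data=review noprint;
   class trt;
   var outcome;
   weight w;                                /* study-size weights */
   output out=wstats(where=(_type_=1)) mean=mean std=std n=n;
run;

/* Reshape into the _STAT_ form PROC TTEST reads as summary input */
data summ;
   set wstats;
   length _stat_ $8;
   _stat_ = 'N';    outcome = n;    output;
   _stat_ = 'MEAN'; outcome = mean; output;
   _stat_ = 'STD';  outcome = std;  output;
   keep trt _stat_ outcome;
run;

/* t-test computed from the weighted summary statistics */
proc ttest data=summ;
   class trt;
   var outcome;
run;
```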
Paper SAS349-2014:
Summarizing and Highlighting Differences in Senate Race Data Using SAS® Sentiment Analysis
Contrasting two sets of textual data points out important differences. For example, consider social media data that have been collected on the race between incumbent Kay Hagan and challenger Thom Tillis in the 2014 election for the seat of US Senator from North Carolina. People talk about the candidates in different terms for different topics, and you can extract the words and phrases that are used more in messages about one candidate than about the other. By using SAS® Sentiment Analysis on the extracted information, you can discern not only the most important topics and sentiments for each candidate, but also the most prominent and distinguishing terms that are used in the discussion. Find out if Republicans and Democrats speak different languages!
Cheyanne Baird, SAS
Richard Crowell, SAS
Linnea Micciulla, SAS Institute
Hilke Reckman, SAS
Michael Wallis, SAS
Paper 1505-2014:
Supporting SAS® Software in a Research Organization
Westat utilizes SAS® software as a core capability for providing clients in government and private industry with analysis and characterization of survey data. Staff programmers, analysts, and statisticians use SAS to manage, store, and analyze client data, as well as to produce tabulations, reports, graphs, and summary statistics. Because SAS is so widely used at Westat, the organization has built a comprehensive infrastructure to support its deployment and use. This paper provides an overview of Westat's SAS support infrastructure, which supplies resources that are aimed at educating staff, strengthening their SAS skills, providing SAS technical support, and keeping the staff on the cutting edge of SAS programming techniques.
Michael Raithel, Westat
Paper 1892-2014:
Survival of Your Heart: Analyzing the Effect of Stress on a Cardiac Event and Predicting the Survival Chances
One in every four people in the United States dies of heart disease, and stress is an important factor that contributes to a cardiac event. As the condition of the heart gradually worsens with age, we analyze the factors that lead to a myocardial infarction when patients are subjected to stress. The data used for this project were obtained from a survey conducted through the Department of Biostatistics at Vanderbilt University. The objective of this poster is to predict the chance of survival of a patient after a cardiac event. Using decision trees, neural networks, regression models, bootstrap decision trees, and ensemble models, we predict the target, which is modeled as a binary variable indicating whether a person is likely to survive or die. The top 15 models, each with an accuracy of over 70%, were considered. The model gives important survival characteristics of a patient, including history of diabetes, smoking, hypertension, and angioplasty.
Yogananda Domlur Seetharama, Oklahoma State University
Sai Vijay Kishore Movva, Oklahoma State University
T
Paper 1893-2014:
Text Analytics: Predicting the Success of Newly Released Free Android Apps Using SAS® Enterprise Miner and SAS® Sentiment Analysis Studio
With the smartphone and mobile apps market developing so rapidly, expectations about the effectiveness of mobile applications are high. Marketers and app developers need to analyze the huge amount of data available well before the app release, not only to better market the app, but also to avoid costly mistakes. The purpose of this poster is to build models to predict the success rate of an app to be released in a particular category. Data has been collected for 540 Android apps under the "Top free newly released apps" category from https://play.google.com/store. The SAS® Enterprise Miner Text Mining node and SAS® Sentiment Analysis Studio are used to parse and tokenize the collected customer reviews and also to calculate the average customer sentiment score for each app. Linear regression, neural, and auto-neural network models have been built to predict the rank of an app by considering average rating, number of installations, total number of reviews, number of 1-5 star ratings, app size, category, content rating, and average customer sentiment score as independent variables. A linear regression model with the least Average Squared Error is selected as the best model, and number of installations and app maturity content are considered significant model variables. App category, user reviews, and average customer sentiment score are also considered important variables in deciding the success of an app. The poster summarizes app success trends across various factors and also introduces a new SAS® macro %getappdata, which we have developed for web crawling and text parsing.
Chinmay Dugar, Oklahoma State University
Vandana Reddy, Oklahoma State University
Paper 1483-2014:
The Armchair Quarterback: Writing SAS® Code for the Perfect Pivot (Table, That Is)
'Can I have that in Excel?' This is a request that makes many of us shudder. Now your boss has discovered Microsoft Excel pivot tables. Unfortunately, he has not discovered how to make them. So you get to extract the data, massage the data, put the data into Excel, and then spend hours rebuilding pivot tables every time the corporate data is refreshed. In this workshop, you learn to be the armchair quarterback and build pivot tables without leaving the comfort of your SAS® environment. In this workshop, you learn the basics of Excel pivot tables and, through a series of exercises, you learn how to augment basic pivot tables first in Excel, and then using SAS. No prior knowledge of Excel pivot tables is required.
Peter Eberhardt, Fernwood Consulting Group Inc.
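One hedged illustration of driving a pivot table from SAS: the downloadable TableEditor tagset (from the SAS support site) can emit output with pivot-table instructions. The data set and option values here are hypothetical and should be verified against the tagset's documentation.

```sas
/* HTML output with a button that builds an Excel pivot table */
ods tagsets.tableeditor file="sales.html"
    options(button_text="Excel"            /* export-to-Excel button */
            pivotrow="Region"              /* pivot table row field */
            pivotcol="Year"                /* pivot table column field */
            pivotdata="Sales"              /* pivot table data field */
            pivotdata_stats="sum");        /* summary statistic */
proc print data=work.sales noobs;          /* hypothetical sales data */
run;
ods tagsets.tableeditor close;
```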
Paper SAS252-2014:
The Desert and the Dunes: Finding Oases and Avoiding Mirages with the SAS® Visual Analytics Explorer
Once upon a time, a writer compared a desert to a labyrinth. A desert has no walls or stairways, but you can still find yourself utterly lost in it. And oftentimes, when you think you found that oasis you were looking for, what you are really seeing is an illusion, a mirage. Similarly, logical fallacies and misleading data patterns can easily deceive the unaware data explorer. In this paper, we discuss how they can be recognized and neutralized with the power of the SAS® Visual Analytics Explorer. Armed with this knowledge, you will be able to safely navigate the dunes to find true insights and avoid false conclusions.
Nascif Abousalh-Neto, SAS
Paper 1557-2014:
The Query Builder: The Swiss Army Knife of SAS® Enterprise Guide®
The SAS® Enterprise Guide® Query Builder is one of the most powerful components of the software. It enables a user to bring in data, join, drop and add columns, compute new columns, sort, filter data, leverage the advanced expression builder, change column attributes, and more! This presentation provides an overview of the major features of this powerful tool and how to leverage it every day.
Steven First, Systems Seminar Consultants
Jennifer First-Kluge, Systems Seminar Consultants
Paper 1713-2014:
The Role of Customer Response Models in Customer Solicitation Center's Direct Marketing Campaign
Direct marketing is the practice of delivering promotional messages directly to potential customers on an individual basis rather than by using mass medium. In this project, we build a finely tuned response model that helps a financial services company to select high-quality receptive customers for their future campaigns and to identify the important factors that influence marketing to effectively manage their resources. This study was based on the customer solicitation center's marketing campaign data (45,211 observations and 18 variables) available on UC Irvine's web site with attributes of present and past campaign information (communication type, contact duration, previous campaign outcome, and so on) and the customer's personal and banking information. As part of data preparation, we performed mean imputation to handle missing values and categorical recoding for reducing levels of class variables. In this study, we built several predictive models using the SAS® Enterprise Miner models Decision Tree, Neural Network, Logistic Regression, and SVM to predict whether the customer responds to the loan offer by subscribing. The results showed that the Stepwise Logistic Regression model was the best when chosen based on the misclassification rate criteria. When the top 3 decile customers were selected based on the best model, the cumulative response rate was 14.5% in contrast to the baseline response rate of 5%. Further analysis showed that customers are more likely to subscribe to the loan offer if they have the following characteristics: never been contacted in the past, no default history, and provided a cell phone as primary contact information.
Goutam Chakraborty, Oklahoma State University
Amit Kushwah, Oklahoma State University
Arun Mandapaka, Oklahoma State University
Paper 1482-2014:
The SAS® Hash Object: It's Time to .find() Your Way Around
"This is the way I have always done it, and it works fine for me." Have you heard yourself or others say this when someone suggests a new technique to help solve a problem? Most of us have a set of tricks and techniques from which we draw when starting a new project. Over time we might overlook newer techniques because our old toolkit works just fine. Sometimes we actively avoid new techniques because our initial foray leaves us daunted by the steep learning curve to mastery. For me, the PRX functions and the SAS® hash object fell into this category. In this workshop, we address possible objections to learning to use the SAS hash object. We start with the fundamentals of setting up the hash object and work through a variety of practical examples to help you master this powerful technique.
Peter Eberhardt, Fernwood Consulting Group Inc.
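A minimal example of the pattern the workshop builds on, assuming hypothetical MASTER and LOOKUP tables keyed by ID:

```sas
/* Load a lookup table into a hash, then match master rows by key */
data matched;
   if _n_ = 1 then do;
      declare hash h(dataset: 'work.lookup');  /* load lookup once */
      h.defineKey('id');
      h.defineData('descr');
      h.defineDone();
   end;
   set work.master;
   length descr $40;
   call missing(descr);                        /* clear before each lookup */
   if h.find() = 0 then output;                /* rc 0 means key was found */
run;
```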
Paper 1883-2014:
The Soccer Oracle: Predicting Soccer Game Outcomes Using SAS® Enterprise Miner
Applying models to analyze sports data has always been done by teams across the globe. The film Moneyball has generated much hype about how a sports team can use data and statistics to build a winning team. The objective of this poster is to use the model comparison algorithm of SAS® Enterprise Miner to pick the best model that can predict the outcome of a soccer game. It is therefore important to determine which factors influence the results of a game. The data set used contains input variables about a team's offensive and defensive abilities, and the outcome of a game is modeled as a target variable. Using SAS Enterprise Miner, multinomial regression, neural networks, decision trees, ensemble models and gradient boosting models are built. Over 100 different versions of these models are run. The data contains statistics from the 2012-13 English Premier League season. The competition has 20 teams playing each other in a home and away format. The season has a total of 380 games; the first 283 games are used to predict the outcome of the last 97 games. The target variable is treated as both a nominal variable and an ordinal variable with 3 levels for home win, away win, and tie. The gradient boosting model is the winning model, predicting games with 65% accuracy and identifying factors such as goals scored and ball possession as more important compared to fouls committed or red cards received.
Sai Vijay Kishore Movva, Oklahoma State University
Vandana Reddy, Oklahoma State University
Paper 1837-2014:
The Use of Analytics for Insurance Claim Fraud Detection: A Unique Challenge
Identifying claim fraud using predictive analytics represents a unique challenge. 1. Predictive analytics generally requires that you have a target variable that can be analyzed. Fraud is unique in this regard in that a lot of fraud that has occurred historically has not been identified. Therefore, the definition of the target variable is difficult. 2. There is also a natural assumption that the past will bear some resemblance to the future. In the case of fraud, methods of defrauding insurance companies change quickly and can make the analysis of a historical database less valuable for identifying future fraud. 3. In an underlying database of claims that may have been determined to be fraudulent by an insurance company, there is often an inconsistency between different claim adjusters regarding which claims are referred for investigation. This inconsistency can lead to erroneous model results due to data that are not homogeneous. This paper demonstrates how analytics can be used in several ways to help identify fraud: 1. More consistent referral of suspicious claims. 2. Better identification of new types of suspicious claims. 3. Incorporating claim adjuster insight into the analytics results. As part of this paper, we demonstrate the application of several approaches to fraud identification: 1. Clustering. 2. Association analysis. 3. PRIDIT (Principal Component Analysis of RIDIT scores).
Nick Kucera, Pinnacle Actuarial Resources
Roosevelt Mosley, Pinnacle Actuarial Resources, Inc.
Paper SAS379-2014:
Three Different Ways to Import JSON from the Facebook Graph API
HTML5 has become the de facto standard for web applications. As a result, the lingua franca object notation of the web services that the web applications call has switched from XML to JSON. JSON is remarkably easy to parse in JavaScript, but so far SAS doesn't have any native JSON parsers. The Facebook Graph API dropped XML support a few years ago. This paper shows how we can parse the JSON in SAS by calling an external script, using PROC GROOVY to parse it inside of SAS, or by parsing the JSON manually with a DATA step. We'll extract the data from the Facebook Graph API and import it into an OLAP data mart to report and analyze a marketing campaign's effectiveness.
Philihp Busby, SAS
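For readers outside SAS, the flattening step the paper performs with an external script, PROC GROOVY, or a DATA step can be sketched generically. This Python example parses a hypothetical Graph-API-style payload; the field names are invented for illustration, not the actual Facebook response schema:

```python
import json

# A hypothetical Graph-API-style response: a "data" array of post records.
payload = '''
{"data": [
  {"id": "1", "message": "Launch day!", "likes": {"summary": {"total_count": 42}}},
  {"id": "2", "message": "Thanks, fans", "likes": {"summary": {"total_count": 7}}}
]}
'''

def flatten_posts(raw):
    """Parse the JSON and flatten each post to a row suitable for a data mart."""
    doc = json.loads(raw)
    rows = []
    for post in doc.get("data", []):
        rows.append({
            "id": post["id"],
            "message": post["message"],
            "likes": post.get("likes", {}).get("summary", {}).get("total_count", 0),
        })
    return rows

rows = flatten_posts(payload)
```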
Paper 1422-2014:
Time Series Mapping with SAS®: Visualizing Geographic Change over Time in the Health Insurance Industry
Changes in health insurance and other industries often have a spatial component. Maps can be used to convey this type of information to the user more quickly than tabular reports and other non-graphical formats. SAS® provides programmers and analysts with the tools to not only create professional and colorful maps, but also the ability to display spatial data on these maps in a meaningful manner that aids in the understanding of the changes that have transpired. This paper illustrates the creation of a number of different maps for displaying change over time with examples from the health insurance arena.
Barbara Okerson, WellPoint
Paper 1890-2014:
Tips for Moving from Base SAS® 9.3 to SAS® Enterprise Guide® 5.1
For a longtime Base SAS® programmer, whether to use a different application for programming is a constant question, especially when powerful applications such as SAS® Enterprise Guide® are available. This paper provides some important tips for a programmer, such as the best way to use the code window and how to take advantage of system-generated code in SAS Enterprise Guide 5.1. This paper also explains the differences between some of the functions and procedures in Base SAS and SAS Enterprise Guide. It highlights features in SAS Enterprise Guide such as process flow, data access management, and report automation, including formatting using XML tag sets.
Anjan Matlapudi, AmerihealthCaritas
Paper 1786-2014:
Tips to Use Character String Functions in Record Lookup
This paper gives you a better idea of how and where to use the record lookup functions to locate observations where a variable has some characteristic. Various related functions are illustrated to search numeric and character values in this process. Code is shown with time comparisons. I will discuss three possible ways to retrieve records: the SAS® DATA step, PROC SQL, and Perl regular expressions. Real and CPU time processing issues will be highlighted when comparing record retrieval across these methods. Although the program is written for the PC using SAS® 9.2 in a Windows XP 32-bit environment, all the functions are applicable to any system. All the tools discussed are in Base SAS®. The typical attendee or reader will have some experience in SAS, but not a lot of experience dealing with large amounts of data.
Anjan Matlapudi, AmerihealthCaritas
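The substring-versus-regular-expression distinction the paper draws (for example, FIND versus PRXMATCH in a DATA step) can be illustrated outside SAS. A small Python sketch with invented records:

```python
import re

records = [
    {"id": 1, "diagnosis": "Type 2 diabetes"},
    {"id": 2, "diagnosis": "Hypertension"},
    {"id": 3, "diagnosis": "Diabetes insipidus"},
]

# Plain substring lookup (analogous to INDEX/FIND in a DATA step).
substring_hits = [r["id"] for r in records if "diabetes" in r["diagnosis"].lower()]

# Regular-expression lookup (analogous to PRXMATCH), anchored to whole words.
pattern = re.compile(r"\bdiabetes\b", re.IGNORECASE)
regex_hits = [r["id"] for r in records if pattern.search(r["diagnosis"])]
```

Both lookups flag records 1 and 3 here; the regular expression adds word-boundary control that a plain substring test cannot express.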
Paper 1403-2014:
Tricks Using SAS® Add-In for Microsoft Office
SAS® Add-In for Microsoft Office remains a popular tool for people who are not SAS® programmers due to its easy interface with the SAS servers. In this session, you'll learn some of the many tricks that other organizations use for getting more value out of the tool.
Tricia Aanderud, And Data Inc
U
Paper SAS061-2014:
Uncovering Trends in Research Using SAS® Text Analytics with Examples from Nanotechnology and Aerospace Engineering
Understanding previous research in key domain areas can help R&D organizations focus new research in non-duplicative areas and ensure that future endeavors do not repeat the mistakes of the past. However, manual analysis of previous research efforts can prove insufficient to meet these ends. This paper highlights how a combination of SAS® Text Analytics and SAS® Visual Analytics can deliver the capability to understand key topics and patterns in previous research and how it applies to a current research endeavor. We will explore these capabilities in two use cases. The first will be in uncovering trends in publicly visible government funded research (SBIR) and how these trends apply to future research in nanotechnology. The second will be visualizing past research trends in publicly available NASA publications, and how these might impact the development of next-generation spacecraft.
Tom Sabo, SAS
Paper SAS396-2014:
Understanding Change in the Enterprise
SAS® provides a wide variety of products and solutions that address analytics, data management, and reporting. It can be challenging to understand how the data and processes in a SAS deployment relate to each other and how changes in your processes affect downstream consumers. This paper presents visualization and reporting tools for lineage and impact analysis. These tools enable you to understand where the data for any report or analysis originates or how data is consumed by data management, analysis, or reporting processes. This paper introduces new capabilities to import metadata from third-party systems to provide lineage and impact analysis across your enterprise.
Michael Ames, SAS
Liz McIntosh, SAS
Nancy Rausch, SAS
Bryan Wolfe, SAS
Paper SAS398-2014:
Unlock the Power of SAS® Visual Analytics Starting with Multiple Microsoft Excel Files
SAS® Visual Analytics is a unique tool that provides both exploratory and predictive data analysis capabilities. As the visual part of the name suggests, the rendering of this analysis in the form of visuals (crosstabs, line charts, histograms, scatter plots, geo maps, treemaps, and so on) make this a very useful tool. Join me as I walk you down the path of exploring the capabilities of SAS Visual Analytics 6.3, starting with data stored in a desktop application as multiple Microsoft Excel files. Together, we import the data into SAS Visual Analytics, prepare the data using the data builder, load the data into SAS® LASR™ Analytic Server, explore data, and create reports.
Beena Mathew, SAS
Michelle Wilkie, SAS
Paper 2037-2014:
Using Java to Harness the Power of SAS®
Are you a Java programmer who has been asked to work with SAS®, or a SAS programmer who has been asked to provide an interface to your IT colleagues? Let's face it, not a lot of Java programmers are heavy SAS users. If this is the case in your company, then you are in luck because SAS provides a couple of really slick features to allow Java programmers to access both SAS data and SAS programming from within a Java program. This paper walks beginner Java or SAS programmers through the simple task of accessing SAS data and SAS programs from a Java program. All that you need is a Java environment and access to a running SAS process, such as a SAS server. This SAS server can either be a SAS/SHARE® server or an IOM server. However, if you do not have either of these two servers, that is okay; with the tools that are provided by SAS, you can start up a remote SAS session within Java and harness the power of SAS.
Jeremy Palbicki, Mayo Clinic
Paper 1667-2014:
Using PROC GPLOT and PROC REG Together to Make One Great Graph
Regression is a helpful statistical tool for showing relationships between two or more variables. However, many users can find the barrage of numbers at best unhelpful, and at worst undecipherable. Using the shipments and inventories historical data from the U.S. Census Bureau's Office of Manufacturers' Shipments, Inventories, and Orders (M3), we can create a graphical representation of two time series with PROC GPLOT and map out reported and expected results. By combining this output with results from PROC REG, we are able to highlight problem areas that might need a second look. The resulting graph shows which dates have abnormal relationships between our two variables and presents the data in an easy-to-use format that even users unfamiliar with SAS® can interpret. This graph is ideal for analysts finding problematic areas such as outliers and trend-breakers or for managers to quickly discern complications and the effect they have on overall results.
William Zupko II, DHS
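The underlying idea of combining a fitted regression with a plot to flag trend-breakers can be sketched without SAS. This Python example fits a simple least-squares line and flags points whose residuals exceed two standard deviations; the series values are invented:

```python
def ols_fit(x, y):
    """Simple least-squares fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

def flag_outliers(x, y, k=2.0):
    """Return indices of points whose residual exceeds k standard deviations."""
    a, b = ols_fit(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in resid) / len(resid)) ** 0.5
    return [i for i, r in enumerate(resid) if abs(r) > k * sd]

# Invented shipments series: the fifth observation breaks the trend.
x = [1, 2, 3, 4, 5, 6]
y = [10, 12, 14, 16, 40, 20]
flagged = flag_outliers(x, y)
```

In a reporting workflow, the flagged indices are the dates you would highlight on the plot for a second look.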
Paper 1882-2014:
Using PROC MCMC for Bayesian Item Response Modeling
The new Markov chain Monte Carlo (MCMC) procedure introduced in SAS/STAT® 9.2 and further exploited in SAS/STAT® 9.3 enables Bayesian computations to run efficiently with SAS®. The MCMC procedure allows one to carry out complex statistical modeling within Bayesian frameworks under a wide spectrum of scientific research; in psychometrics, for example, the estimation of item and ability parameters is one such application. This paper describes how to use PROC MCMC for Bayesian inference on item and ability parameters under a variety of popular item response models. This paper also covers how the results from SAS PROC MCMC differ from or resemble the results from WinBUGS. For those who are interested in the Bayesian approach to item response modeling, it is exciting and beneficial to shift to SAS, based on its flexibility in data management and its power in data analysis. Using the resulting item parameter estimates, one can continue with test form construction, test equating, and so on, with all these test development processes being accomplished with SAS!
Yi-Fang Wu, Department of Educational Measurement and Statistics, Iowa Testing Programs, University of Iowa
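The MCMC machinery that PROC MCMC automates can be illustrated with a toy example. The sketch below is not PROC MCMC syntax; it is a random-walk Metropolis sampler in Python for a single examinee's ability under a Rasch model, assuming known item difficulties and a standard normal prior, with invented response data:

```python
import math
import random

def rasch_loglik(theta, difficulties, responses):
    """Log-likelihood of a 0/1 response vector under the Rasch model."""
    ll = 0.0
    for b, u in zip(difficulties, responses):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def sample_theta(difficulties, responses, draws=2000, step=0.5, seed=1):
    """Random-walk Metropolis with a standard normal prior on theta."""
    rng = random.Random(seed)
    log_post = lambda t: rasch_loglik(t, difficulties, responses) - 0.5 * t * t
    theta, cur, chain = 0.0, log_post(0.0), []
    for _ in range(draws):
        prop = theta + rng.gauss(0.0, step)
        new = log_post(prop)
        if math.log(rng.random()) < new - cur:   # Metropolis accept/reject
            theta, cur = prop, new
        chain.append(theta)
    return chain

# Three items of difficulty -1, 0, 1; examinee misses only the hardest one.
chain = sample_theta([-1.0, 0.0, 1.0], [1, 1, 0])
posterior_mean = sum(chain[500:]) / len(chain[500:])   # discard burn-in
```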
Paper 1494-2014:
Using SAS/STAT® Software to Validate a Health Literacy Prediction Model in a Primary Care Setting
Existing health literacy assessment tools developed for research purposes have constraints that limit their utility for clinical practice. The measurement of health literacy in clinical practice can be impractical due to the time requirements of existing assessment tools. Single Item Literacy Screener (SILS) items, which are self-administered brief screening questions, have been developed to address this constraint. We developed a model to predict limited health literacy that consists of two SILS and demographic information (for example, age, race, and education status) using a sample of patients in a St. Louis emergency department. In this paper, we validate this prediction model in a separate sample of patients visiting a primary care clinic in St. Louis. We use SAS/STAT® software to validate the prediction model developed in the previous study based on three goodness-of-fit criteria: rescaled R-squared, AIC, and BIC. We compare models using two different measures of health literacy, Newest Vital Sign (NVS) and Rapid Assessment of Health Literacy in Medicine Revised (REALM-R). We evaluate the prediction model by examining the concordance, area under the ROC curve, sensitivity, specificity, kappa, and gamma statistics. Preliminary results show 69% concordance when comparing the model results to the REALM-R and 66% concordance when comparing to the NVS. Our conclusion is that validating a prediction model for inadequate health literacy would provide a feasible way to assess health literacy in fast-paced clinical settings. This would allow us to reach patients with limited health literacy with educational interventions and better meet their information needs.
Lucy D'Agostino, Washington University School of Medicine
Melody Goodman, Washington University in St. Louis School of Medicine
Kimberly Kaphingst, Washington University School of Medicine
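Several of the evaluation statistics named above (sensitivity, specificity, and overall concordance) reduce to simple confusion-matrix arithmetic. A Python sketch with an invented 10-patient comparison of observed versus predicted limited health literacy:

```python
def classification_stats(actual, predicted):
    """Sensitivity, specificity, and concordance from paired 0/1 vectors."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return {
        "sensitivity": tp / (tp + fn),   # true positives among actual positives
        "specificity": tn / (tn + fp),   # true negatives among actual negatives
        "concordance": (tp + tn) / len(actual),
    }

actual    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
stats = classification_stats(actual, predicted)
```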
Paper 1785-2014:
Using SAS® Software to Shrink the Data Used in Apache Flex® Applications
This paper discusses the techniques I used at the Census Bureau to overcome the issue of dealing with large amounts of data while modernizing some of their public-facing web applications by using service oriented architecture (SOA) to deploy Flex web applications powered by SAS®. The paper covers techniques that resulted in reducing 142,293 XML lines (3.6 MB) down to 15,813 XML lines (1.8 MB), a 50% size reduction on the server side (HTTP Response), and 196,167 observations down to 283 observations, a reduction of 99.8% in summarized data on the client side (XML Lookup file).
Ahmed Al-Attar, AnA Data Warehousing Consulting, LLC
Paper 1707-2014:
Using SAS® to Examine Internal Consistency and to Develop Community Engagement Scores
Comprehensive cancer centers have been mandated to engage communities in their work; thus, measurement of community engagement is a priority area. Siteman Cancer Center's Program for the Elimination of Cancer Disparities (PECaD) projects seek to align with 11 Engagement Principles (EP) previously developed in the literature. Participants in a PECaD pilot project were administered a survey with questions on community engagement in order to evaluate how well the project aligns with the EPs. Internal consistency is examined using PROC CORR with the ALPHA option to calculate Cronbach's alpha for questions that relate to the same EP. This allows items that have a lack of internal consistency to be identified and to be edited or removed from the assessment. EP-specific scores are developed on quantity and quality scales. Lack of internal consistency was found for the items of six of the 16 EPs examined (alpha<.70). After editing the items, all EP question groups had strong internal consistency (alpha>.85). There was a significant positive correlation between quantity and quality scores (r=.918, P<.001). Average EP-specific scores ranged from 6.87 to 8.06; this suggests researchers adhered to the 11 EPs between 'sometimes' and 'most of the time' on the quantity scale and between 'good' and 'very good' on the quality scale. Examining internal consistency is necessary to develop measures that accurately determine how well PECaD projects align with EPs. Using SAS® to determine internal consistency is an integral step in the development of community engagement scores.
Renee Gennarelli, Washington University in St. Louis School of Medicine
Melody Goodman, Washington University in St. Louis School of Medicine
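The Cronbach's alpha statistic computed here by PROC CORR with the ALPHA option follows a simple formula: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), for k items. A Python sketch with invented survey items:

```python
def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score columns of equal length."""
    k = len(items)
    n = len(items[0])

    def var(xs):
        """Sample variance (n-1 denominator)."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Total score for each respondent across all items.
    totals = [sum(col[i] for col in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(col) for col in items) / var(totals))

# Three hypothetical survey items scored 1-5 for five respondents.
items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 1],
    [5, 5, 2, 4, 2],
]
alpha = cronbach_alpha(items)
```

These invented items track each other closely, so alpha comes out high, above the paper's .85 threshold for strong internal consistency.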
Paper 1431-2014:
Using SAS® to Get More for Less
Especially in this current financial climate, many of us are being asked to do more with less. For several years, the Office of Institutional Research and Testing at Baylor University has been using SAS® software to increase the efficiency of the office and of the University as a whole. Reports that were once prepared manually have been automated. Data quality processes have been implemented in order to reduce the number of duplicate mailings. Predictive modeling is used to focus recruiting efforts on those prospective students most likely to respond. A web-based portal has been created to provide self-service report generation for many administrators across campus. Along with this, a number of data processing functions have been centralized, eliminating the need for additional programming skills and software support. This presentation discusses these improvements in more detail and provides examples of the end results.
Faron Kincheloe, Baylor University
Paper 2025-2014:
Using Sorting Algorithms to Create Sorted Lists
When providing lengthy cost and utilization data to medical providers, it is ideal to sort the report by descending cost (or utilization) so that the important categories are at the top. This task can be easily solved using PROC SORT. However, when you need other variables (such as unit cost per procedure or national average) to follow the sort but not be sorted themselves, the solution is not as intuitive. This paper looks at several sorting algorithms to solve this problem. First, we look at the basic bubble sort, still effective for smaller data sets, which sets up arrays for each variable and then sorts on just one of them. Next, we discuss the quicksort algorithm, which is effective for large data sets as well. The results of the sorts provide sorted data that is easy to read and makes for effective analysis.
Matthew Neft, Highmark
Chelle Pronko, Highmark Inc
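The 'companion variables follow the sort but are not themselves sorted' requirement can also be met by computing the sort permutation once from the key column and applying it to every column. A Python sketch (not the paper's SAS array code) with invented cost data:

```python
def sort_by_key(key, *companions, descending=True):
    """Sort the key column and reorder companion columns by the same
    permutation, so each row stays intact while only the key drives order."""
    order = sorted(range(len(key)), key=key.__getitem__, reverse=descending)
    return [[col[i] for i in order] for col in (key, *companions)]

procedure = ["MRI", "X-ray", "Lab panel"]
cost      = [1200,  150,     300]
unit_cost = [400,   75,      25]

cost_sorted, procedure_sorted, unit_sorted = sort_by_key(cost, procedure, unit_cost)
```

After the call, costs appear in descending order and each procedure keeps its own unit cost beside it, which is exactly the report layout described above.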
V
Paper 1744-2014:
VFORMAT Lets SAS® Do the Format Searching
When reading data files or writing SAS® programs, we are often hunting for the right format or informat. There are so many to choose from! Does it seem like too many to search the manual? Let SAS help find the right one! We use the SAS dictionary table VFORMAT and a very small SAS program. This presentation demonstrates how two simple functions unlock the potential of this great resource: SASHELP.VFORMAT.
Peter Crawford, Crawford Software Consultancy Limited
Paper 1675-2014:
Validating Self-Reported Survey Measures Using SAS®
Researchers often rely on self-report for survey-based studies. The accuracy of this self-reported data is often unknown, particularly in a medical setting that serves an under-insured patient population with varying levels of health literacy. We recruited participants from the waiting room of a St. Louis primary care safety net clinic to participate in a survey investigating the relationship between health environments and health outcomes. The survey included questions regarding personal and family history of chronic disease (diabetes, heart disease, and cancer) as well as BMI and self-perceived weight. We subsequently accessed the participant's electronic medical record (EMR) and collected physician-reported data on the same variables. We calculated concordance rates between participant answers and information gathered from EMRs using McNemar's chi-squared test. Logistic regression was then performed to determine the demographic predictors of concordance. Three hundred thirty-two patients completed surveys as part of the pilot phase of the study; 64% female, 58% African American, 4% Hispanic, 15% with less than high school level education, 76% annual household income less than $20,000, and 29% uninsured. Preliminary findings suggest an 82-94% concordance rate between self-reported and medical record data across outcomes, with the exception of family history of cancer (75%) and heart disease (42%). Our conclusion is that determining the validity of the self-reported data in the pilot phase influences whether self-reported personal and family history of disease and BMI are appropriate for use in this patient population.
Melody Goodman, Washington University in St. Louis School of Medicine
Kimberly Kaphingst, Washington University School of Medicine
Sarah Lyons, Washington University School of Medicine
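McNemar's chi-squared statistic used above depends only on the two discordant cells of the 2x2 agreement table: chi-squared = (b - c)^2 / (b + c), with one degree of freedom. A Python sketch with invented counts:

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-squared statistic from the two discordant cell counts:
    b = yes on survey / no in EMR, c = no on survey / yes in EMR."""
    return (b - c) ** 2 / (b + c)

# Hypothetical discordant counts from a self-report vs. EMR comparison.
stat = mcnemar_statistic(b=25, c=10)

# With 1 degree of freedom, values above 3.84 are significant at the .05 level.
significant = stat > 3.84
```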
Paper 1789-2014:
Visualizing Lake Michigan Wind with SAS® Software
The world's first wind resource assessment buoy, residing in Lake Michigan, uses a pulsing laser wind sensor to accurately measure wind speed, direction, and turbulence offshore up to wind turbine hub-height and across the blade span every second. Understanding wind behavior would be tedious and fatiguing with such large data sets. However, SAS/GRAPH® 9.4 helps the user grasp wind characteristics over time and at different altitudes by exploring the data visually. This paper covers graphical approaches to evaluate wind speed validity, seasonal wind speed variation, and storm systems to inform engineers on the candidacy of Lake Michigan offshore wind farms.
Aaron Clark, Grand Valley State University
W
Paper SAS390-2014:
Washing the Elephant: Cleansing Big Data Without Getting Trampled
Data quality is at the very heart of accurate, relevant, and trusted information, but traditional techniques that require the data to be moved, cleansed, and repopulated simply can't scale up to cover the ultra-jumbo nature of big data environments. This paper describes how SAS® Data Quality accelerators for databases like Teradata and Hadoop deliver data quality for big data by operating in situ and in parallel on each of the nodes of these clustered environments. The paper shows how data quality operations can be easily modified to leverage these technologies. It examines the results of performance benchmarks that show how in-database operations can scale to meet the demands of any use case, no matter how big a big data mammoth you have.
Mike Frost, SAS
Paper SAS166-2014:
Weighted Methods for Analyzing Missing Data with the GEE Procedures
Missing observations caused by dropouts or skipped visits present a problem in studies of longitudinal data. When the analysis is restricted to complete cases and the missing data depend on previous responses, the generalized estimating equation (GEE) approach, which is commonly used when the population-average effect is of primary interest, can lead to biased parameter estimates. The new GEE procedure in SAS/STAT® 13.2 implements a weighted GEE method, which provides consistent parameter estimates when the dropout mechanism is correctly specified. When none of the data are missing, the method is identical to the usual GEE approach, which is available in the GENMOD procedure. This paper reviews the concepts and statistical methods. Examples illustrate how you can apply the GEE procedure to incomplete longitudinal data.
Guixian Lin, SAS
Bob Rodriguez, SAS
Paper 1440-2014:
What You're Missing About Missing Values
Do you know everything you need to know about missing values? Do you know how to assign a missing value to multiple variables with one statement? Can you display missing values as something other than . or blank? How many types of missing numeric values are there? This paper reviews techniques for assigning, displaying, referencing, and summarizing missing values for numeric variables and character variables.
Christopher Bost, MDRC
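SAS distinguishes 28 kinds of numeric missing values (., ._, and .A through .Z), so a data set can record why a value is missing, not just that it is. Python has no direct analog, but the same bookkeeping idea can be sketched with sentinel labels; the category names here are invented for illustration:

```python
from math import isnan

# Sentinel labels standing in for SAS special missing values such as .A, .B.
MISSING_REFUSED = "refused"
MISSING_SKIPPED = "skipped"

responses = [42, MISSING_REFUSED, 17, MISSING_SKIPPED, MISSING_REFUSED, 99]

def summarize(values):
    """Count valid numeric values and each missing category separately."""
    summary = {"valid": 0}
    for v in values:
        if isinstance(v, (int, float)) and not (isinstance(v, float) and isnan(v)):
            summary["valid"] += 1
        else:
            summary[v] = summary.get(v, 0) + 1
    return summary

summary = summarize(responses)
```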
Paper SAS034-2014:
What's New in SAS® Data Management
The latest releases of SAS® Data Integration Studio and SAS® Data Management provide an integrated environment for managing and transforming your data to meet new and increasingly complex data management challenges. The enhancements help develop efficient processes that can clean, standardize, transform, master, and manage your data. The latest features include capabilities for building complex job processes; web and tablet environments for managing your data; enhanced ELT transformation capabilities; big data transformation capabilities for Hadoop; integration with the SAS® LASR™ platform; enhanced features for lineage tracing and impact analysis; and new features for master data and metadata management. This paper provides an overview of the latest features of the products and includes use cases and examples for leveraging product capabilities.
Michael Ames, SAS
Mike Frost, SAS
Nancy Rausch, SAS
Paper SAS311-2014:
What's New in SAS® Enterprise Miner 13.1
Over the last year, the SAS® Enterprise Miner development team has made numerous and wide-ranging enhancements and improvements. New utility nodes that save data, integrate better with open-source software, and register models make your routine tasks easier. The area of time series data mining has three new nodes. There are also new models for Bayesian network classifiers, generalized linear models (GLMs), support vector machines (SVMs), and more.
Jared Dean, SAS
Jonathan Wexler, SAS
Paper 1743-2014:
Wow, I Could Have Had a VA! - A Comparison Between SAS® Visual Analytics and Other SAS® Products
SAS® Visual Analytics is one of the newer SAS® products with a lot of excitement surrounding it. But what is SAS Visual Analytics really? By examining the similarities, differences, and synergies between SAS Visual Analytics and other SAS offerings, we can more clearly understand this new product.
Brian Varney, Experis Business Analytics
Y
Paper 1692-2014:
You Can Have It All: Building Cumulative Data Sets
We receive a daily file with information about patients who use our drug. It's updated every day so that we have the most current information. Nearly every variable on a patient's record can be different from one day to the next. But what if you wanted to capture information that changed? For example, what if a patient switched doctors sometime along the way, and the original prescribing doctor is different than the patient's present doctor? With this type of daily file, that information is lost. To avoid losing these changes, you have to build a cumulative data set. I'll show you how to build it.
Myra Oltsik, Acorda Therapeutics
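The core of a cumulative data set is appending a row only when something changed, stamped with the snapshot date, so a doctor switch is preserved instead of overwritten. The paper builds this in SAS; here is a Python sketch with invented patient records:

```python
def update_cumulative(history, snapshot, date):
    """Append a row to history only when a patient's record changed,
    stamping it with the snapshot date; unchanged patients are skipped."""
    latest = {}
    for row in history:
        latest[row["patient"]] = row          # history is kept in date order
    for patient, fields in snapshot.items():
        prev = latest.get(patient)
        if prev is None or any(prev.get(k) != v for k, v in fields.items()):
            history.append({"patient": patient, "date": date, **fields})
    return history

history = []
update_cumulative(history, {"P01": {"doctor": "Dr. Lee"}}, "2014-01-01")
update_cumulative(history, {"P01": {"doctor": "Dr. Lee"}}, "2014-01-02")   # no change
update_cumulative(history, {"P01": {"doctor": "Dr. Shah"}}, "2014-01-03")  # switch kept
```

After the three daily updates, the history holds two rows for patient P01: the original prescribing doctor and the later switch, each with the date it first appeared.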
back to top