Data Management Products Papers A-Z

A
Paper 1815-2014:
A Case Study: Performance Analysis and Optimization of SAS® Grid Computing Scaling on a Shared Storage
SAS® Grid Computing is a scale-out SAS® solution that enables SAS applications, which are extremely I/O- and compute-intensive, to better utilize computing resources. It requires high-performance shared storage (SS) that allows all servers to access the same file systems. SS may be implemented via traditional NFS NAS or clustered file systems (CFS) like GPFS. This paper uses the Lustre* file system, a parallel, distributed CFS, for a case study of the performance scalability of SAS Grid Computing nodes on SS. The paper qualifies the performance of a standardized SAS workload running on Lustre at scale. Lustre has traditionally been used for large, sequential I/O. We record and present the tuning changes necessary to optimize Lustre for SAS applications. In addition, results from the scaling of SAS Cluster jobs running on Lustre are presented.
Suleyman Sair, Intel Corporation
Brett Lee, Intel Corporation
Ying M. Zhang, Intel Corporation
Paper SAS039-2014:
An Insider's Guide to SAS/ACCESS® Interface to ODBC
SAS/ACCESS® Interface to ODBC has been around forever. On one level, ODBC is very easy to use. That ease hides the flexibility that ODBC offers. This presentation uses examples to show you how to increase your program's performance and troubleshoot problems. You will learn: the differences between ODBC and OLE DB; what the odbc.ini file is (and why it is important); how to discover what your ODBC driver is actually doing; and the difference between a native SAS/ACCESS engine and SAS/ACCESS Interface to ODBC.
Jeff Bailey, SAS
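For readers who want to experiment before reading the paper, here is a minimal sketch (not code from the paper; the DSN, credentials, and table are hypothetical) that connects through SAS/ACCESS Interface to ODBC and turns on SASTRACE to watch what the driver is actually doing:

    /* Trace the SQL that SAS sends to the ODBC driver */
    options sastrace=',,,d' sastraceloc=saslog nostsuffix;

    /* DATASRC= names an entry in odbc.ini (or a Windows DSN) */
    libname mydb odbc datasrc="MyDSN" user=dbuser password=dbpass;

    proc sql;
       select count(*) from mydb.orders;   /* watch the generated SQL in the log */
    quit;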
Paper 1732-2014:
Automatic and Efficient Post-Campaign Analyses By Using SAS® Macro Programs
In our previous work, we often needed to perform large numbers of repetitive, data-driven post-campaign analyses to evaluate the performance of marketing campaigns in terms of customer response. These routine tasks were usually carried out manually in Microsoft Excel, which was tedious, time-consuming, and error-prone. In order to improve work efficiency and analysis accuracy, we automated the analysis process with SAS® programming and replaced the manual Excel work. Through the use of SAS macro programs and other advanced techniques, we successfully automated the complicated data-driven analyses with high efficiency and accuracy. This paper presents and illustrates the creative analytical ideas and programming techniques behind the automated analysis process, which can be extended to a variety of business intelligence and analytics fields.
Justin Jia, Canadian Imperial Bank of Commerce (CIBC)
Amanda Lin, Bell Canada
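As one illustration of the data-driven approach the abstract describes (a generic sketch, not code from the paper; the data set, variable, and macro names are hypothetical), a control table of campaign IDs can drive one analysis per campaign via CALL EXECUTE:

    /* Hypothetical analysis macro: summarizes response for one campaign */
    %macro analyze(campaign_id);
       proc freq data=responses;
          where campaign = "&campaign_id";
          tables response_flag / nocum;
          title "Response summary for campaign &campaign_id";
       run;
    %mend analyze;

    /* The control table drives the loop: one macro call per row */
    data _null_;
       set campaign_list;
       call execute(cats('%nrstr(%analyze)(', campaign_id, ')'));
    run;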
B
Paper SAS347-2014:
Big Data Everywhere! Easily Loading and Managing Your Data in the SAS® LASR™ Analytic Server
SAS® Visual Analytics and the SAS® LASR™ Analytic Server provide many capabilities to analyze data fast. Depending on your organization, data can be loaded as a self-service operation, or your data can be large and shared with many people. As data gets large, loading it effectively and keeping it updated become important. This presentation discusses the range of data scenarios, from self-service spreadsheets to very large databases, from single-subject data to large star schema topologies, and from single-use data to continually updated data that requires high levels of resilience and monitoring. Fast and easy access to big data is important for empowering your organization to make better business decisions. Understanding how to build a responsive and reliable data tier on which to make these decisions is within your reach.
Gary Mehler, SAS
Donna Bennett, SAS
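For readers who load data with code rather than through the self-service interface, the basic shape of loading a table into a LASR Analytic Server looks roughly like this (a minimal single-machine sketch under assumed settings; the port and path are hypothetical choices, not from the presentation):

    /* Start a single-machine LASR Analytic Server on a hypothetical port */
    proc lasr create port=10010 path="/tmp";
    run;

    /* Load a table into the running server */
    proc lasr add data=sashelp.cars port=10010;
    run;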
Paper 2069-2014:
Big Data and Data Governance: Managing Expectations for Rampant Information Consumption
Rapid adoption of high-performance scalable systems supporting big data acquisition, absorption, management, and analysis has exposed a potential gap in asserting governance and oversight over the information production flow. In a traditional organizational data management environment, business rules encompassed within data controls can be used to govern the creation and consumption of data across the enterprise. However, as more big data analytics applications absorb massive data sets from external sources whose creation points are far removed from their various repurposed uses, the ability to control the production of data must give way to a different kind of data governance. In this talk, we discuss a rational approach to scoping data governance for big data. Instead of presuming the ability to validate and cleanse data prior to its loading into the analytical platform, we explore pragmatic expectations and measures for scoring data utility and believability within the context of big data analytics applications. Attendees will learn about: expectations for data variation; the impact of variance on analytical results; focusing on the quality of your data management processes and infrastructure; and measurements for data usability.
David Loshin, Knowledge Integrity, Inc.
Paper 1792-2014:
Big Data/Metadata Governance
The emerging discipline of data governance encompasses data quality assurance, data access and use policy, security risks and privacy protection, and longitudinal management of an organization's data infrastructure. In the interests of forestalling another bureaucratic solution to data governance issues, this presentation features database programming tools that provide rapid access to big data and make selective access to and restructuring of metadata practical.
Sigurd Hermansen, Westat
Paper 1794-2014:
Big Data? Faster Cube Builds? PROC OLAP Can Do It
In many organizations, the amount of data we deal with increases far faster than the hardware and IT infrastructure to support it. As a result, we encounter significant bottlenecks and I/O-bound processes. However, clever use of SAS® software can help us find a way around these constraints. In this paper, we look at clever use of PROC OLAP to show you how to address I/O-bound processing and how to spread I/O traffic across different servers to increase cube-building efficiency. This paper assumes experience with SAS® OLAP Cube Studio and/or PROC OLAP.
Yunbo (Jenny) Sun, Canada Post
Michael Brule, SAS
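For orientation, the generic shape of a PROC OLAP cube build is sketched below; this is a hedged, simplified example of common usage, not the paper's I/O-spreading technique, and the cube, schema, path, and column names are all hypothetical:

    /* Generic cube build; metadata connection options for the METASVR
       statement are assumed to come from the SAS session */
    proc olap data=work.sales cube=SalesCube path="/data/cubes";
       metasvr olap_schema="SASApp - OLAP Schema" repository=Foundation;
       dimension Geography hierarchies=(Geo);
       hierarchy Geo levels=(Region State);
       measure SalesSum column=sales stat=sum;
    run;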
C
Paper 1693-2014:
Conditional Execution "Switch Path" Logic in SAS® Data Integration Studio 4.6
With the growth in size and complexity of organizations investing in SAS® platform technologies, the size and complexity of ETL subsystems and data integration (DI) jobs is growing at a rapid rate. Developers are pushed to come up with new and innovative ways to improve process efficiency in their DI jobs to meet increasingly demanding service level agreements (SLAs). The ability to conditionally execute or switch paths in a DI job is an extremely useful technique for improving process efficiency. How can a SAS® Data Integration developer design a job to best suit conditional execution? This paper discusses a technique for providing a parameterized, dynamic execution custom transformation that can be easily incorporated into SAS® Data Integration Studio jobs to provide process path switching capabilities. The aim of any data integration task is to ensure that all sources of business data are integrated as efficiently as possible. Data integration is concerned with the repurposing of data via transformation, should be a value-adding process, and should be the product of collaboration. Modularization of common or repeatable processes is a fundamental part of the collaboration process in DI design and development. Switch Path, a custom transformation built to conditionally execute branches or nodes in SAS Data Integration Studio, provides a reusable module for overcoming the conditional execution limitations of standard SAS Data Integration Studio transformations and jobs. Switch Path logic in SAS Data Integration Studio can serve many purposes in the day-to-day business needs of a SAS data integration developer because it is completely reusable.
Prajwal Shetty, Tesco
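Under the hood, conditional path switching in generated DI job code typically reduces to macro logic. As a hedged sketch of the general idea only (not the paper's actual transformation; the parameter, library, and table names are hypothetical):

    /* A job parameter decides which branch of the flow runs */
    %let run_mode = incremental;   /* or: full */

    %macro switch_path;
       %if &run_mode = full %then %do;
          /* full-refresh branch */
          data warehouse.fact_sales;
             set staging.sales_all;
          run;
       %end;
       %else %do;
          /* incremental branch */
          proc append base=warehouse.fact_sales data=staging.sales_delta;
          run;
       %end;
    %mend switch_path;

    %switch_path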
D
Paper 1568-2014:
Data Quality Governance for Analytics Teams
Having data that are consistent, reliable, and well linked is one of the biggest challenges faced by financial institutions. The paper describes how the SAS® Data Management offering helps to connect people, processes, and technology to deliver consistent results for data sourcing and analytics teams, and minimizes the cost and time involved in the development life cycle. The paper concludes with best practices learned from various enterprise data initiatives.
Anand Jagarapu, Arunam Technologies LLC
Paper 1721-2014:
Deploying a User-Friendly SAS® Grid on Microsoft Windows
Your company's chronically overloaded SAS® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS® Display Manager on a server machine, runs counter to the expectation of the SAS grid: your users are now supposed to switch to SAS® Enterprise Guide® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play the hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, so that your users get to keep their favorite SAS Display Manager.
Houliang Li, HL SASBIPros Inc
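One commonly documented way to let traditional SAS sessions send work to a grid is through grid-launched SAS/CONNECT sessions; whether the paper uses exactly this approach is not stated in the abstract. A minimal sketch (the logical grid server name SASApp is an assumption):

    /* Enable grid launching for all SAS/CONNECT sessions */
    %let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));

    /* Sign on to a grid-launched session and run work there */
    signon gridwork;
    rsubmit gridwork;
       /* your SAS code runs on a grid node here */
    endrsubmit;
    signoff gridwork;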
Paper 1510-2014:
Direct Memory Data Management Using the APP Tools
APP is an unofficial collective abbreviation for the SAS® functions ADDR, PEEK, PEEKC, the CALL POKE routine, and their so-called LONG 64-bit counterparts (ADDRLONG, PEEKLONG, PEEKCLONG, and CALL POKELONG): the SAS tools designed to directly read from and write to physical memory in the DATA step. APP functions have long been a SAS dark horse. First, the examples of APP usage in SAS documentation amount to a few technical report tidbits intended for mainframe system programming, with nary a hint of how the functions can be used for data management programming. Second, the documentation note on the CALL POKE routine is so intimidating in tone that many potentially receptive folks might decide to avoid the allegedly precarious route altogether. However, little can stand in the way of an inquisitive SAS programmer daring to take a close look, and it turns out that APP functions are very simple and useful tools! They can be used to explore how things really work, to make code more concise, and to implement en masse data movement, and they can often dramatically improve execution efficiency. The author and many other SAS experts (notably Peter Crawford, Koen Vyverman, Richard DeVenezia, Toby Dunn, and the fellow masked by his 'Puddin' Man' sobriquet) have been poking around the SAS APP realm on SAS-L and in their own practices since 1998, occasionally letting the SAS community at large peek at their findings. This opus is an attempt to circumscribe the results in a systematic manner. Welcome to the APP world! You are in for a few glorious surprises.
Paul Dorfman, Dorfman Consulting
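As a tiny taste of the technique (a hedged sketch, not an example from the paper), the 64-bit APP functions can read bytes directly from a DATA step variable by address; CALL POKELONG writes to an address the same way:

    data _null_;
       length src dst $ 8 a $ 20;
       src = 'SASWORLD';
       a = addrlong(src);          /* address of SRC as a character string */
       dst = peekclong(a, 8);      /* read 8 raw bytes from that address   */
       put dst=;                   /* prints dst=SASWORLD                  */
    run;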
E
Paper SAS193-2014:
Effective Risk Aggregation and Reporting Using SAS®
Recent banking and insurance risk regulations alike require effective aggregation of risks. To determine the total enterprise risk for a financial institution, all risks must be aggregated and analyzed. Typically, there are two approaches: bottom-up and top-down risk aggregation. In either approach, financial institutions face challenges due to various levels of risks with differences in metrics, data sources, and availability. First, aggregating risk is especially complex: a common view of the dependence between all individual risks can be hard to achieve. Second, the underlying data sources can be updated at different times and can have different horizons, which in turn requires an incremental update of the overall risk view. Third, the risk needs to be analyzed across on-demand hierarchies. This paper presents SAS® solutions to these challenges. To address the first challenge, we consider a mixed approach that specifies copula dependence between individual risks and allows step-by-step specification with a minimal amount of information. Next, the solution leverages an event-driven architecture to update results on a continuous basis. Finally, the platform provides a self-service reporting and visualization environment for designing and deploying reports across any hierarchy and granularity on the fly. These capabilities enable institutions to create an accurate, timely, comprehensive, and adaptive risk-aggregation and reporting system.
Wei Chen, SAS
Jimmy Skoglund, SAS
Srinivasan Iyer, SAS
Paper 1493-2014:
Experiences in Using Academic Data for SAS® BI Dashboard Development
Business Intelligence (BI) dashboards serve as an invaluable, high-level, visual reference tool for decision-making processes in many business industries. A request was made to our department to develop BI dashboards that could be incorporated in an academic setting. These dashboards would serve various undergraduate executive and administrative staff at the university. While most business data lends itself well and easily to dashboard development, academic data is typically modeled differently and therefore presents unique challenges. In this paper, the authors detail and share the design and development process of creating dashboards for decision making in an academic environment utilizing SAS® BI Dashboard 4.3 and other SAS® Enterprise Business Intelligence 9.2 tools. The authors also provide lessons learned as well as recommendations for future implementations of BI dashboards utilizing academic data.
Evangeline Collado, University of Central Florida
Michelle Parente, University of Central Florida
Paper SAS394-2014:
Exploring Data Access Control Strategies for Securing and Strengthening Your Data Assets Using SAS® Federation Server
Potential of One, Power of All. That has a really nice ring to it, especially as it pertains to accessing all of your corporate data through one single data access point. It means the potential of having a single source for all of your data connections from throughout the enterprise. It also means that the complexities of connecting to these data assets from the various source systems throughout the enterprise are hidden from the end user. With this, however, comes the possibility of placing personally identifiable information in the hands of a user who should not have access to it. The bottom line is that there is risk and uncertainty with allowing users to have access to data that is disallowed by your existing data governance strategy. Blocking these data elements from specific users or groups of users is a challenge that many corporations face today, whether it is secure financial information, confidential personnel records, or personal medical information protected by strict regulations. How do you surface All necessary data to All necessary users, while at the same time maintaining the security of the data? SAS® Federation Server Manager is an easy-to-use interface that allows the data administrator to manage your data assets in such a way that it alleviates this risk by controlling access to critical data elements and maintaining the proper level of data disclosure control. This session focuses on how to employ various data access control strategies from within SAS Federation Server Manager.
Mark Craver, SAS
Mike Frost, SAS
G
Paper 1668-2014:
Generate Cloned Output with a Loop or Splitter Transformation
Based on selection criteria, the SAS® Data Integration Studio loop or splitter transformations can be used to generate multiple output files. The ETL developer or SAS® administrator can decide which transformation is better suited to the design, priorities, and SAS configuration at their site. Factors to consider are the setup, maintenance, and performance of the ETL job. The loop transformation requires an understanding of macros and a control table. The splitter transformation is more straightforward and self-documenting. If time allows, creating and running a job with each transformation can provide benchmarks to measure performance. For a comparison of these two options, this paper shows an example of the same job implemented with each transformation. For added testing metrics, one can adapt the LOGPARSE SAS macro to parse the job logs.
Laura Liotus, Community Care Behavioral Health
H
Paper SAS117-2014:
Helpful Hints for Transitioning to SAS® 9.4
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you're getting oriented and will provide short, straightforward answers. We can also share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, these practical tips from SAS® Customer Experience Testing insiders will ensure you are ready to begin your transition to SAS® 9.4.
Cindy Taylor, SAS
Paper 1486-2014:
How to Be A Data Scientist Using SAS®
The role of the Data Scientist is the viral job description of the decade. And like LOLcats, there are many types of Data Scientists. What is this new role? Who is hiring them? What do they do? What skills are required to do their job? What does this mean for the SAS® programmer and the statistician? Are they obsolete? And finally, if I am a SAS user, how can I become a Data Scientist? Come learn about this job of the future and what you can do to be part of it.
Chuck Kincaid, Experis Business Analytics
I
Paper SAS276-2014:
Introduction to SAS® Decision Management
Organizations today make numerous decisions within their businesses that affect almost every aspect of their daily operations. Many of these decisions are now automatically generated by sophisticated enterprise decision management systems. These decisions include what offers to make to customers, sales transaction processing, payment processing, call center interactions, industrial maintenance, transportation scheduling, and thousands of other applications that all have a significant impact on the business bottom line. Concurrently, many of these same companies have developed or are now developing analytics that provide valuable insight into their customers, their products, and their markets. Unfortunately, many of the decision systems cannot maximize the power of analytics in the business processes at the point where the decisions are made. SAS® Decision Manager is a new product that integrates analytical models with business rules and deploys them to operational systems where the decisions are made. Analytically driven decisions can be monitored, assessed, and improved over time. This paper describes the new product and its use and shows how models and business rules can be joined into a decision process and deployed to either batch processes or to real-time web processes that can be consumed by business applications.
Steve Sparano, SAS
Charlotte Crain, SAS
David Duling, SAS
L
Paper 1426-2014:
Leveraging Publicly Available Data in the Classroom Using SAS® PROC SURVEYLOGISTIC
The soaring number of publicly available data sets across disciplines has allowed for increased access to real-life data for use in both research and educational settings. These data often leverage cost-effective complex sampling designs, including stratification and clustering, which allow for increased efficiency in survey data collection and analysis. Weighting becomes a necessary component of these survey data in order to properly calculate variance estimates and arrive at sound inferences through statistical analysis. Generally speaking, these weights are included with the variables provided in the public-use data, though an explanation of how and when to use them is often lacking. This paper presents an analysis using the California Health Interview Survey to compare weighted and non-weighted results using SAS® PROC LOGISTIC and PROC SURVEYLOGISTIC.
Besa Smith, Analydata
Tyler Smith, National University
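A hedged sketch of the contrast the paper examines (the variable names are hypothetical stand-ins for the survey's design variables, and this is not code from the paper):

    /* Naive model: ignores strata, clusters, and weights */
    proc logistic data=chis;
       model outcome(event='1') = age gender;
    run;

    /* Design-based model: accounts for the complex sampling design */
    proc surveylogistic data=chis;
       strata stratum;        /* sampling strata        */
       cluster psu;           /* primary sampling units */
       weight srvy_wt;        /* survey weight          */
       model outcome(event='1') = age gender;
    run;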
M
Paper SAS255-2014:
Managing Large Data with SAS® Scalable Performance Data Server Cluster Table Transactions
Today's business needs require 24/7 access to your data in order to perform complex queries against your analytical store. In addition, you might need to periodically update your analytical store to append new data, delete old data, or modify some existing data. Historically, the size of your analytical tables or the window in which the table must be updated can cause unacceptable downtime for queries. This paper describes how you can use new SAS® Scalable Performance Data Server 5.1 cluster table features to simulate transaction isolation in order to modify large sections of your cluster table. These features are optimized for extremely fast operation and can be used without affecting any ongoing queries. This satisfies both the continuous query access and the periodic update requirements that your business model places on your analytical store.
Guy Simpson, SAS
Paper SAS283-2014:
Managing the Data Governance Lifecycle
Data governance combines the disciplines of data quality, data management, data policy management, business process management, and risk management into a methodology that ensures important data assets are formally managed throughout an enterprise. SAS® has developed a cohesive suite of technologies that can be used to implement efficient and effective data governance initiatives, thereby improving an enterprise's overall data management efficiency. This paper discusses data governance use cases and challenges, and provides an example of how to manage the data governance lifecycle to ensure success.
Scott Gidley, SAS
Paper 1485-2014:
Measures of Fit for Logistic Regression
One of the most common questions about logistic regression is "How do I know if my model fits the data?" There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-squared) and goodness-of-fit tests (like the Pearson chi-square). This presentation looks first at R-squared measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.
Paul Allison, University of Pennsylvania
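For instance, Tjur's R-square is simply the difference between the average predicted probability for events and for non-events. A hedged sketch of computing it after PROC LOGISTIC (the data set and variable names are hypothetical, and this is not code from the presentation):

    proc logistic data=mydata;
       model y(event='1') = x1 x2 / rsquare;  /* RSQUARE prints the generalized R-squares */
       output out=pred p=phat;
    run;

    /* Tjur's R-square: mean(phat when y=1) minus mean(phat when y=0) */
    proc means data=pred mean;
       class y;
       var phat;
    run;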
Paper 1491-2014:
Modernizing Your Data Strategy: Understanding SAS® Solutions for Data Integration, Data Quality, Data Governance, and Master Data Management
For over three decades, SAS® has provided capabilities for beating your data into submission. In June of 2000, SAS acquired a company called DataFlux in order to add data quality capabilities to its portfolio. Recently, SAS folded DataFlux into the mother ship. With SAS® 9.4, SAS® Enterprise Data Integration Server and baby brother SAS® Data Integration Server were upgraded into a series of new bundles that still include the former DataFlux products, but those products have grown. These new bundles include data management, data governance, data quality, and master data management, and come in advanced and standard packaging. This paper explores these offerings and helps you understand what this means to both new and existing customers of SAS® Data Management and DataFlux products. We break down the marketing jargon and give you real-world scenarios of what customers are using today (prior to SAS 9.4) and walk you through what that might look like in the SAS 9.4 world. Each scenario includes the software that is required, descriptions of what each of the components does (features and functions), as well as the likely architectures that you might want to consider. Finally, for existing SAS Enterprise Data Integration Server and SAS® Data Integration Server customers, we discuss implications for migrating to SAS Data Management and detail some of the functionality that may be new to your organization.
Greg Nelson, ThotWave Technologies
Lisa Dodson, SAS
N
Paper 1811-2014:
NIHB Pharmacy Surveillance System
This presentation features implementation leads from SAS® Professional Services and Health Canada's Non-Insured Health Benefits (NIHB) program discussing a joint implementation of the SAS® Fraud Framework for Health Care. The presentation walks through the fast-paced implementation of NIHB's Pharmacy Surveillance System, which guards Canadian taxpayers from undue costs and protects the safety of NIHB clients. This presentation is a blend of project management and technical material, and presents both the client (NIHB) and consultant (SAS) perspectives throughout the story. The presentation converges on several core principles needed to successfully deliver analytical solutions.
Jeffrey Menzies, Health Canada
Ian Ghent, SAS
P
Paper 1730-2014:
PROC TABULATE: Extending This Powerful Tool Beyond Its Limitations
PROC TABULATE is a powerful tool for creating tabular summary reports. Its advantages over PROC REPORT are that it requires less code, allows for more convenient table construction, and uses syntax that makes it easier to modify a table's structure. However, its inability to compute the sum, difference, product, and ratio of column sums has hindered its use in many circumstances. This paper illustrates and discusses some creative approaches and methods for overcoming these limitations, enabling users to produce needed reports and still enjoy the simplicity and convenience of PROC TABULATE. These methods and skills can have prominent applications in a variety of business intelligence and analytics fields.
Justin Jia, Canadian Imperial Bank of Commerce (CIBC)
Amanda Lin, Bell Canada
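To make the limitation concrete (a generic sketch using SASHELP data, not code from the paper): PROC TABULATE readily sums columns side by side, but it offers no direct syntax for, say, the ratio of two column sums, which is the kind of gap the paper's techniques address.

    proc tabulate data=sashelp.prdsale;
       class country;
       var actual predict;
       /* one row per country plus a total row; sums of two measures side by side */
       table country all, (actual predict)*sum;
       /* there is no built-in syntax here for sum(actual)/sum(predict) */
    run;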
Paper SAS329-2014:
Parallel Data Preparation with the DS2 Programming Language
A time-consuming part of statistical analysis is building an analytic data set for statistical procedures. Whether it is recoding input values, transforming variables, or combining data from multiple data sources, the work to create an analytic data set can take time. The DS2 programming language in SAS® 9.4 simplifies and speeds data preparation with user-defined methods, shareable packages that store methods and attributes, and threaded execution on multi-core SMP and MPP machines. Come see how DS2 makes your job easier.
Jason Secosky, SAS
Robert Ray, SAS
Greg Otto, Teradata Corporation
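A hedged sketch of the threaded execution the abstract mentions (the input table and columns are hypothetical, and this is not code from the paper):

    proc ds2;
       /* A thread program: each thread reads a slice of the input */
       thread compute / overwrite=yes;
          dcl double bmi;
          method run();
             set work.patients;              /* hypothetical input table */
             bmi = weight / (height * height);
             output;
          end;
       endthread;

       /* Run the thread program on four threads and collect the rows */
       data work.patients_prepped / overwrite=yes;
          dcl thread compute t;
          method run();
             set from t threads=4;
             output;
          end;
       enddata;
    run;
    quit;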
S
Paper SAS004-2014:
SAS® Predictive Asset Maintenance: Find Out Why Before It's Too Late!
Are you wondering what is causing your valuable machine asset to fail? What could those drivers be, and what is the likelihood of failure? Do you want to be proactive rather than reactive? Answers to these questions have arrived with SAS® Predictive Asset Maintenance. The solution provides an analytical framework to reduce the amount of unscheduled downtime and optimize maintenance cycles and costs. An all-new (R&D-based) version of this offering is now available. Key aspects of this paper include a discussion of the key business drivers for and capabilities of SAS Predictive Asset Maintenance, followed by a detailed analysis of the solution: the data model; explorations; data selections; Path I, the analysis workbench for maintenance analysis and stability monitoring; Path II, the analysis workbench with JMP®, SAS® Enterprise Guide®, and SAS® Enterprise Miner; analytical case development using SAS Enterprise Miner, SAS® Model Manager, and SAS® Data Integration Studio; and the SAS Predictive Asset Maintenance portlet for reports. A realistic business example in the oil and gas industry is used.
George Habek, SAS
Paper SAS1423-2014:
SAS® Workshop: Data Management
This workshop provides hands-on experience using tools in the SAS® Data Management offering. Workshop participants will use the following products: SAS® Data Integration Studio, DataFlux® Data Management Studio, and SAS® Data Management Console.
Kari Richardson, SAS
Paper SAS072-2014:
SAS® in the Enterprise: A Primer on SAS® Architecture for IT
How does the SAS® server architecture fit within your IT infrastructure? What functional aspects does the architecture support? This session helps attendees understand the logical server topology of the SAS technology stack: resource and process management; in-memory architecture; and in-database processing. The session also discusses process flows from data acquisition through analytical information to visual insight. IT architects, data administrators, and IT managers from all industries should leave with an understanding of how SAS has evolved to better fit into the IT enterprise and to help IT's internal customers make better decisions.
Gary Spakes, SAS
Paper 1318-2014:
Secure SAS® OLAP Cubes with Top-Secret Permissions
SAS® OLAP technology is used to organize and present summarized data for business intelligence applications. It features flexible options for creating and storing aggregations to improve performance and brings a powerful multi-dimensional approach to querying data. This paper focuses on managing security features available to OLAP cubes through the combination of SAS metadata and MDX logic.
Stephen Overton, Overton Technologies, LLC
Paper SAS286-2014:
Star Wars and the Art of Data Science: An Analytical Approach to Understanding Large Amounts of Unstructured Data
Businesses today are inundated with unstructured data: not just social media but books, blogs, articles, journals, manuscripts, and even detailed legal documents. Manually managing unstructured data can be time-consuming and frustrating, and might not yield accurate results. Having an analyst read documents often introduces bias because analysts have their own experiences, and those experiences help shape how the text is interpreted. The fact that people become fatigued can also affect the way that the text is interpreted. Is the analyst as motivated at the end of the day as at the beginning? Data science involves using data management, analytical, and visualization strategies to uncover the story that the data is trying to tell in a more automated fashion. This is important with structured data but becomes even more vital with unstructured data. Introducing automated processes for managing unstructured data can significantly increase the value and meaning gleaned from the data. This paper outlines the data science processes necessary to ingest, transform, analyze, and visualize three Star Wars movie scripts: A New Hope, The Empire Strikes Back, and Return of the Jedi. It focuses on the need to create structure from unstructured data using SAS® Data Management, SAS® Text Miner, and SAS® Content Categorization. The results are featured using SAS® Visual Analytics.
Adam Maness, SAS
Mary Osborne, SAS
T
Paper SAS033-2014:
Techniques in Processing Data on Hadoop
Before you can analyze your big data, you need to prepare the data for analysis. This paper discusses capabilities and techniques for using the power of SAS® to prepare big data for analytics. It focuses on how a SAS user can write code that will run in a Hadoop cluster and take advantage of the massive parallel processing power of Hadoop.
Donna De Capite, SAS
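As a hedged flavor of the kind of code involved (a generic SAS/ACCESS® Interface to Hadoop sketch with hypothetical server, schema, and table names; the paper's own techniques may differ), work can be pushed into the cluster via explicit Hive pass-through:

    /* Hypothetical Hive server */
    libname hdp hadoop server="hive.example.com" port=10000 user=sasuser;

    proc sql;
       connect to hadoop (server="hive.example.com" port=10000 user=sasuser);
       /* This aggregation runs inside the cluster, not in SAS */
       create table work.daily_totals as
       select * from connection to hadoop
          (select txn_date, sum(amount) as total
             from sales
             group by txn_date);
       disconnect from hadoop;
    quit;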
U
Paper SAS396-2014:
Understanding Change in the Enterprise
SAS® provides a wide variety of products and solutions that address analytics, data management, and reporting. It can be challenging to understand how the data and processes in a SAS deployment relate to each other and how changes in your processes affect downstream consumers. This paper presents visualization and reporting tools for lineage and impact analysis. These tools enable you to understand where the data for any report or analysis originates or how data is consumed by data management, analysis, or reporting processes. This paper introduces new capabilities to import metadata from third-party systems to provide lineage and impact analysis across your enterprise.
Liz McIntosh, SAS
Nancy Rausch, SAS
Bryan Wolfe, SAS
Paper 1882-2014:
Using PROC MCMC for Bayesian Item Response Modeling
The Markov chain Monte Carlo (MCMC) procedure introduced in SAS/STAT® 9.2 and further enhanced in SAS/STAT® 9.3 enables Bayesian computations to run efficiently in SAS®. The MCMC procedure allows one to carry out complex statistical modeling within Bayesian frameworks across a wide spectrum of scientific research; in psychometrics, for example, the estimation of item and ability parameters is one such case. This paper describes how to use PROC MCMC for Bayesian inferences about item and ability parameters under a variety of popular item response models. This paper also covers how the results from PROC MCMC differ from or resemble the results from WinBUGS. For those who are interested in the Bayesian approach to item response modeling, it is exciting and beneficial to shift to SAS, given its flexible data management and powerful data analysis capabilities. Using the resulting item parameter estimates, one can continue with test form construction, test equating, and so on, with all of these test development processes accomplished in SAS!
Yi-Fang Wu, Department of Educational Measurement and Statistics, Iowa Testing Programs, University of Iowa
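For flavor, here is a hedged sketch of a two-parameter logistic (2PL) item response model in PROC MCMC, assuming long-format data with one row per person-item response; the data set, variable names, item count, and priors are illustrative assumptions, not taken from the paper:

    proc mcmc data=responses nmc=20000 nbi=2000 seed=1234;
       array a[10];                        /* item discriminations (a1-a10) */
       array b[10];                        /* item difficulties   (b1-b10)  */
       parms a: 1;
       parms b: 0;
       prior a: ~ lognormal(0, var=1);
       prior b: ~ normal(0, var=4);
       random theta ~ normal(0, var=1) subject=person;   /* person ability */
       p = logistic(a[item] * (theta - b[item]));
       model y ~ binary(p);
    run;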
Paper 1699-2014:
Using SAS® MDM deployment in Brazil
Dataprev has become the principal owner of social data on the citizens of Brazil, having collected information for over forty years in order to support the government's pension applications. The use of this data can be expanded to provide new tools to aid policy and to assist the government in optimizing the use of its resources. Using SAS® MDM, we are developing a solution that uniquely identifies the citizens of Brazil. Overcoming challenges with multiple government agencies and with the validation of survey records that suggest the same person requires rules for governance and a definition of what represents a particular Brazilian citizen. In short, how do you turn a repository of master data into an efficient catalyst for public policy? This is the goal in creating a repository focused on identifying the citizens of Brazil.
Ielton de Melo Gonçalves, Dataprev
W
Paper SAS390-2014:
Washing the Elephant: Cleansing Big Data Without Getting Trampled
Data quality is at the very heart of accurate, relevant, and trusted information, but traditional techniques that require the data to be moved, cleansed, and repopulated simply can't scale up to cover the ultra-jumbo nature of big data environments. This paper describes how SAS® Data Quality accelerators for databases like Teradata and Hadoop deliver data quality for big data by operating in situ and in parallel on each of the nodes of these clustered environments. The paper shows how data quality operations can be easily modified to leverage these technologies. It examines the results of performance benchmarks that show how in-database operations can scale to meet the demands of any use case, no matter how big a big data mammoth you have.
Mike Frost, SAS
Paper SAS034-2014:
What's New in SAS® Data Management
The latest releases of SAS® Data Integration Studio and SAS® Data Management provide an integrated environment for managing and transforming your data to meet new and increasingly complex data management challenges. The enhancements help develop efficient processes that can clean, standardize, transform, master, and manage your data. The latest features include: capabilities for building complex job processes; web and tablet environments for managing your data; enhanced ELT transformation capabilities; big data transformation capabilities for Hadoop; integration with the SAS® LASR™ platform; enhanced features for lineage tracing and impact analysis; and new features for master data and metadata management. This paper provides an overview of the latest features of the products and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
Mike Frost, SAS
Michael Ames, SAS