SAS® Visual Analytics provides self-service capabilities for users to analyze, explore, and report on their own data. As users explore their data, there is always a need to bring in more data sources, create new variables, combine data from multiple sources, and occasionally update the data. SAS Visual Analytics provides targeted user capabilities to access, modify, and enhance data to suit specific business needs. This paper provides a clear understanding of these capabilities and suggests best practices for self-service data management in SAS Visual Analytics.
Gregor Herrmann, SAS
The value of administrative databases for understanding real-world practice patterns has become increasingly apparent, and such understanding is essential in the current health-care environment. The Affordable Care Act has helped us to better understand the current use of technology and the different approaches to surgery. This paper describes a method for extracting specific information about surgical procedures from the Healthcare Cost and Utilization Project (HCUP) database (also referred to as the National (Nationwide) Inpatient Sample (NIS)). The analyses provide a framework for comparing the different modalities of the surgical procedures of interest. Using an NIS database for a single year, we identify cohorts based on surgical approach by identifying the ICD-9 codes specific to robotic surgery, laparoscopic surgery, and open surgery. After we identify the appropriate codes using an ARRAY statement, a similar array is created based on the ICD-9 codes. Any minimally invasive procedure (robotic or laparoscopic) that results in a conversion is flagged as a conversion. Comorbidities are identified by ICD-9 codes representing the severity of each subject and are merged with the NIS inpatient core file. Using a FORMAT statement for all diagnosis variables, we create macros that can be regenerated for each type of complication. These macros are compiled in SAS® and stored in a library containing the four macros that the reporting tables call with different macro variables. In addition, the macros create the frequencies for all cohorts and build the table structure with the title and number of each table. This paper describes a systematic method in SAS/STAT® 9.2 to extract the data from the NIS using the ARRAY statement for the specific ICD-9 codes, to format the extracted data for analysis, to merge the different NIS databases by procedure, and to use automated macros to generate the report.
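As a rough, hedged sketch of the ARRAY-based flagging described above (the data set names, the PR1-PR15 procedure-code variables, the ICD-9 code lists, and the conversion rule shown here are illustrative placeholders, not the study's actual values):

data work.cohorts;
   set nis.core;                                /* illustrative NIS input              */
   array pr{*} pr1-pr15;                        /* NIS procedure-code variables        */
   robotic = 0; laparoscopic = 0; open_surg = 0;
   do i = 1 to dim(pr);
      if pr{i} in ('1741' '1742' '1744' '1749') then robotic      = 1;   /* placeholder code list */
      if pr{i} in ('5421' '5451')               then laparoscopic = 1;   /* placeholder code list */
      if pr{i} in ('4501' '4502')               then open_surg    = 1;   /* placeholder code list */
   end;
   /* simplified assumption: a minimally invasive case that also carries an            */
   /* open-approach code is flagged as a conversion                                    */
   conversion = ((robotic or laparoscopic) and open_surg);
   drop i;
run;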
Ravi Tejeshwar Reddy Gaddameedi, California State University, East Bay
Usha Kreaden, Intuitive Surgical
There is a goldmine of information that is available to you in SAS® metadata. The challenge, however, is being able to retrieve and leverage that information. While there is useful functionality available in SAS® Management Console as well as a collection of functional macros provided by SAS to help accomplish this, getting a complete metadata picture in an automated way has proven difficult. This paper discusses the methods we have used to find core information within SAS® 9.2 metadata and how we have been able to pull this information in a programmatic way. We used Base SAS®, SAS® Data Integration Studio, PC SAS®, and SAS® XML Mapper to build a solution that now provides daily metadata reporting about our SAS Data Integration Studio jobs, job flows, tables, and so on. This information can now be used for auditing purposes as well as for helping us build our full metadata inventory as we prepare to migrate to SAS® 9.4.
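As a hedged illustration of this kind of programmatic metadata retrieval (not the authors' solution), the DATA step metadata functions can list objects such as libraries registered in metadata, assuming a metadata connection is already established through the METASERVER=, METAPORT=, METAUSER=, and METAPASS= system options:

data work.meta_libraries (keep=libname engine);
   length uri libname engine $256;
   /* how many SASLibrary objects are registered in metadata? */
   nlibs = metadata_getnobj("omsobj:SASLibrary?@Id contains '.'", 1, uri);
   do i = 1 to nlibs;
      rc  = metadata_getnobj("omsobj:SASLibrary?@Id contains '.'", i, uri);
      rc1 = metadata_getattr(uri, "Name",   libname);
      rc2 = metadata_getattr(uri, "Engine", engine);
      output;
   end;
run;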
Rupinder Dhillon, Bell Canada
Darryl Prebble, Prebble Consulting Inc.
Over the years, the SAS® Business Intelligence platform has proved its importance in this big data world with its suite of applications that enable us to efficiently process, analyze, and transform huge amounts of business data. Within the data warehouse universe, 'batch execution' sits at the heart of SAS Data Integration technologies. On a day-to-day basis, batches run, and the current status of the batch is generally sent out to the team or to the client as a 'static' e-mail or report. From experience, we know that these don't provide much insight into the real 'bits and bytes' of a batch run. Imagine if the status of the running batch were automatically captured in one central repository and presented beautifully in a web browser on your computer or on your iPad. All this can be achieved without asking anybody to send reports and with all 'post-batch' queries answered automatically with a click. This paper answers these questions with a framework designed specifically to automate the reporting aspects of SAS batches. And, yes, it is all about collecting statistics of the batch, so we call it 'BatchStats.'
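The central idea, capturing one status row per job in a shared repository, can be sketched in Base SAS as a small macro; the BATCHSTATS libref, table name, and parameters below are purely illustrative:

%macro log_batch_status(jobname=, status=, started=, ended=);
   data work._new_row;
      length jobname $64 status $16;
      jobname  = "&jobname";
      status   = "&status";
      started  = &started;
      ended    = &ended;
      duration = ended - started;
      format started ended datetime20. duration time8.;
   run;

   /* append to the central repository that the reporting layer reads */
   proc append base=batchstats.run_log data=work._new_row force;
   run;
%mend log_batch_status;

/* example call wrapped around a batch job */
%let job_start = %sysfunc(datetime());
/* ... ETL job steps run here ... */
%log_batch_status(jobname=load_sales, status=SUCCESS,
                  started=&job_start, ended=%sysfunc(datetime()));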
Prajwal Shetty, Tesco HSC
The power of SAS®9 applications allows information and knowledge creation from very large amounts of data. Analysis that used to rely on tens to hundreds of gigabytes (GB) of supporting data has rapidly grown to rely on tens to hundreds of terabytes (TB). This data expansion has resulted in more and larger SAS data stores. Setting up file systems to support these large volumes of data with adequate performance, as well as ensuring adequate storage space for the SAS® temporary files, can be very challenging. Technology advancements in storage and system virtualization, flash storage, and hybrid storage management require continual updating of best practices for configuring I/O subsystems. This paper presents updated best practices for configuring the I/O subsystem for your SAS®9 applications, ensuring adequate capacity, bandwidth, and performance for your SAS®9 workloads. We have found that very few storage systems work ideally with SAS with their out-of-the-box settings, so it is important to convey these general guidelines.
Tony Brown, SAS
Margaret Crevar, SAS
SAS® has been an early leader in big data technology architecture that more easily integrates unstructured files across multi-tier data system platforms. By using SAS® Data Integration Studio and SAS® Enterprise Business Intelligence software, you can easily automate big data using SAS® system accommodations for Hadoop open-source standards. At the same time, another seminal technology has emerged, which involves real-time multi-sensor data integration using Arduino microprocessors. This break-out session demonstrates the use of SAS® 9.4 coding to define Hadoop clusters and to automate Arduino data acquisition to convert custom unstructured log files into structured tables, which can be analyzed by SAS in near real time. Examples include the use of SAS Data Integration Studio to create and automate stored processes, as well as tips for C language object coding to integrate with SAS data management, using a simple Arduino-to-Hadoop temperature monitoring application built with SAS.
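As a simple, hedged illustration of converting an unstructured sensor log into a structured table (not the session's actual code; the file path and log layout are assumed), a Base SAS DATA step can parse timestamped Arduino temperature readings:

/* assume each log record looks like: 2015-03-01T14:22:05 sensor=3 tempC=21.7 */
data work.arduino_temps;
   infile '/data/arduino/temperature.log' truncover;
   input reading_ts :e8601dt19. sensor_txt :$12. temp_txt :$12.;
   sensor = input(scan(sensor_txt, 2, '='), 8.);    /* numeric sensor id        */
   temp_c = input(scan(temp_txt,  2, '='), 8.);     /* temperature in Celsius   */
   format reading_ts datetime19.;
   drop sensor_txt temp_txt;
run;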
Keith Allan Jones, PhD, QUALIMATIX.com
So you are still writing SAS® DATA steps and SAS macros and running them through a command-line scheduler. When work comes in, there is only one person who knows that code, and they are out--what to do? This paper shows how SAS applies extract, transform, load (ETL) modernization techniques with SAS® Data Integration Studio to gain resource efficiencies and to break down the ETL black box. We share the fundamentals (metadata foldering and naming standards) that ensure success, along with steps to ease into the pool while iteratively gaining benefits. Benefits include self-documenting code visualization, impact analysis on jobs and tables affected by change, and supportability by interchangeable bench resources. We conclude by demonstrating how SAS® Visual Analytics is being used to monitor service-level agreements and provide actionable insights into job-flow performance and scheduling.
Brandon Kirk, SAS
There is a plethora of uses for the colon (:) in SAS® programming. The colon is used as a data set or variable name wildcard, a macro variable creator, an operator modifier, and so forth. The colon helps you write clear, concise, and compact code. The main objective of this paper is to encourage the effective use of the colon in writing crisp code. This paper presents real-world applications of the colon in day-to-day programming. In addition, this paper presents cases where the colon falls short of programmers' expectations.
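A few of the common colon idioms mentioned above can be sketched as follows (data set and variable names are illustrative):

/* 1. Variable-name wildcard: drop every variable whose name starts with TMP */
data work.clean;
   set work.raw;
   drop tmp: ;
run;

/* 2. Operator modifier: =: compares only the leading characters */
data work.new_cities;
   set work.sites;
   if city =: 'New';                 /* keeps New York, Newark, New Haven, ... */
run;

/* 3. Macro variable creation with INTO : in PROC SQL */
proc sql noprint;
   select count(*) into :nobs trimmed
   from work.clean;
quit;
%put NOTE: WORK.CLEAN has &nobs rows.;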
Jinson Erinjeri, EMMES Corporation
The Centers for Disease Control and Prevention (CDC) went through a large migration from a mainframe to a Windows platform. This e-poster will highlight the Data Automated Transfer Utility (DATU) that was developed to migrate historic files between the two file systems using SAS® macros and SAS/CONNECT®. We will demonstrate how this program identifies the type of file, transfers the file appropriately, verifies the successful transfer, and provides the details in a Microsoft Excel report. SAS/CONNECT code, special system options, and mainframe code will be shown. In 2009, the CDC made the decision to retire a mainframe that had been used for years, primarily for SAS work. The replacement platform is a SAS grid system, based on Windows, which is referred to as the Consolidated Statistical Platform (CSP). The change from mainframe to Windows required the migration of over a hundred thousand files totaling approximately 20 terabytes. To minimize countless man-hours and human error, an automated solution was developed. DATU was developed for users to migrate their files from the mainframe to the new Windows CSP or other Windows destinations. Approximately 95% of the files on the CDC mainframe were one of three file types: SAS data sets, sequential text files, and partitioned data set (PDS) libraries. DATU dynamically determines the file type and uses the appropriate method to transfer the file to the assigned Windows destination. Variations of files are detected and handled appropriately. File variations include SAS data sets from multiple SAS versions and sequential files that contain binary values such as packed-decimal fields. To mitigate the loss of numeric precision during the migration, SAS numeric variables are identified and promoted to account for architectural differences between the mainframe and Windows platforms. To aid users in verifying the accuracy of the file transfer, the program compares file information for the source and destination files. When a SAS file is downloaded, PROC CONTENTS is run on both files, and the PROC CONTENTS output is compared. For sequential text files, a checksum is generated for both files and the checksums are compared. A PDS file transfer creates a list of the members in the PDS and in the destination Windows folder, and the file lists are compared. The development of this program and the file migration was a daunting task. This paper will share some of the lessons we learned along the way and the method of our implementation.
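In simplified form, the SAS data set verification step described above can be sketched by comparing PROC CONTENTS output from the source and destination copies; the librefs and table name are placeholders, and the transfer itself (for example, PROC DOWNLOAD over SAS/CONNECT) is assumed to have happened already:

/* capture attribute metadata for the mainframe copy and the Windows copy */
proc contents data=mflib.claims out=work.src_meta (keep=name type length format) noprint;
run;
proc contents data=winlib.claims out=work.tgt_meta (keep=name type length format) noprint;
run;

proc sort data=work.src_meta; by name; run;
proc sort data=work.tgt_meta; by name; run;

/* any attribute differences show up in the comparison report */
proc compare base=work.src_meta compare=work.tgt_meta;
   id name;
run;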
Jim Brittain, National Center for Health Statistics (CDC)
Robert Schwartz, National Centers for Disease Control and Prevention
Soon after the advent of the SAS® hash object in SAS® 9.0, its early adopters realized that the potential functionality of the new structure is much broader than basic O(1)-time lookup and file matching. Specifically, they went on to invent methods of data aggregation based on the ability of the hash object to quickly store and update key summary information. They also demonstrated that DATA step aggregation using the hash object offered significantly lower run time and memory utilization compared to the SUMMARY/MEANS or SQL procedures, coupled with the possibility of eliminating the need to write the aggregation results to interim data files and the programming flexibility that allowed them to combine sophisticated data manipulation and adjustments of the aggregates within a single step. Such developments within the SAS user community did not go unnoticed by SAS R&D, and in SAS® 9.2 the hash object was enriched with tag parameters and methods specifically designed to handle aggregation without the need to write the summarized data to the PDV host variables and update the hash table with new key summaries, thus further improving run-time performance. As more SAS programmers applied these methods in their real-world practice, they developed aggregation techniques fit to various programmatic scenarios, along with ideas for handling the hash object's memory limitations in situations calling for truly enormous hash tables. This paper presents a review of the DATA step aggregation methods and techniques using the hash object. The presentation is intended for all situations in which the final SAS code is either a straight Base SAS DATA step or a DATA step generated by any other SAS product.
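The basic DATA step aggregation pattern reviewed in the paper can be sketched as follows; this is a minimal example with illustrative data set and variable names, not the full set of techniques discussed:

data _null_;
   /* one pass over the detail data, keeping a running summary per key in a hash table */
   if _n_ = 1 then do;
      declare hash agg();
      agg.definekey ('cust_id');
      agg.definedata('cust_id', 'total_amt', 'n_trans');
      agg.definedone();
   end;

   set work.trans end=last;            /* detail data: cust_id, amount                  */

   if agg.find() ne 0 then do;         /* key not seen yet: start a new summary         */
      total_amt = 0;
      n_trans   = 0;
   end;
   total_amt = total_amt + amount;
   n_trans   = n_trans + 1;
   agg.replace();                      /* add or update the key summary in the hash     */

   if last then agg.output(dataset: 'work.cust_totals');   /* write aggregates once     */
run;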
Paul Dorfman, Dorfman Consulting
Don Henderson, Henderson Consulting Services
Using geocoded addresses from FDIC Summary of Deposits data together with Census geospatial data, including TIGER boundary files and population-weighted centroid shapefiles, we were able to calculate a reasonable distance threshold by metropolitan statistical area (MSA) (or, where applicable, metropolitan division (MD)) through a series of SAS® DATA steps and SQL joins. We first used a Cartesian join in PROC SQL between the data set containing population-weighted centroid coordinates and the data set of geocoded coordinates for approximately 91,000 full-service bank branches. Using the GEODIST function in SAS, we then calculated the distance from the population-weighted centroid of each Census tract to the nearest bank branch. The tract data set was then grouped by MSA/MD and sorted within each grouping in ascending order of distance to the nearest bank branch, and, using the RETAIN statement, we calculated the cumulative population and cumulative population percent for each MSA/MD. The reasonable threshold distance is established where the cumulative population percent is closest (in either direction, +/-) to 90%.
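The calculation can be sketched with a PROC SQL Cartesian join, the GEODIST function, and a BY-group DATA step; table and column names are illustrative, and the real inputs are far larger:

/* distance in miles from each tract's population-weighted centroid to the nearest branch */
/* ('DM' = coordinates in degrees, result in miles)                                       */
proc sql;
   create table nearest_branch as
   select t.tract_id,
          t.msa_md,
          t.population,
          min(geodist(t.cent_lat, t.cent_lon, b.branch_lat, b.branch_lon, 'DM')) as dist_miles
   from tract_centroids as t, bank_branches as b
   group by t.tract_id, t.msa_md, t.population;
quit;

proc sort data=nearest_branch;
   by msa_md dist_miles;
run;

data cumulative;
   set nearest_branch;
   by msa_md;
   if first.msa_md then cum_pop = 0;
   cum_pop + population;      /* retained running total within each MSA/MD                */
   /* dividing by the MSA/MD total population (not shown) gives the cumulative percent    */
   /* used to locate the distance closest to 90%                                          */
run;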
Sarah Campbell, Federal Deposit Insurance Corporation
During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS® Asset Performance Analytics and SAS® Enterprise Miner™ to build a model for drilling and well-control anomalies, fingerprint key well-control measures of the transient fluid properties, and operationalize these analytics on the drilling assets with SAS® Event Stream Processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables the rapid differentiation between safe and unsafe modes of operation.
Jim Duarte, SAS
Keith Holdaway, SAS
Moray Laing, SAS
Proper management of master data is a critical component of any enterprise information system. However, effective master data management (MDM) requires that both IT and Business understand the life cycle of master data and the fundamental principles of entity resolution (ER). This presentation provides a high-level overview of current practices in data matching, record linking, and entity information life cycle management that are foundational to building an effective strategy to improve data integration and MDM. Particular areas of focus are: 1) The need for ongoing ER analytics--the systematic and quantitative measurement of ER performance; 2) Investing in clerical review and asserted resolution for continuous improvement; and 3) Addressing the large-scale ER challenge through distributed processing.
John Talburt, Black Oak Analytics, Inc
Building and maintaining a data warehouse can require a complex series of jobs. Having an ETL flow that is reliable and well integrated is one big challenge. An ETL process might need some pre- and post-processing operations on the database to be well integrated and reliable. Some might handle this via maintenance windows. Others like us might generate custom transformations to be included in SAS® Data Integration Studio jobs. Custom transformations in SAS Data Integration Studio can be used to speed ETL process flows and reduce the database administrator's intervention after ETL flows are complete. In this paper, we demonstrate the use of custom transformations in SAS Data Integration Studio jobs to handle database-specific tasks for improving process efficiency and reliability in ETL flows.
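As a hedged example of the kind of database-specific pre-processing such a custom transformation might wrap (the generated code, not the transformation itself; the Oracle connection options and table name are placeholders), explicit SQL pass-through lets the database do the work before the load begins:

/* pre-load step: have Oracle empty the staging table before the job repopulates it */
proc sql;
   connect to oracle (user=&ora_user password=&ora_pass path=dwprod);
   execute (truncate table dw.stage_sales) by oracle;
   disconnect from oracle;
quit;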
Emre Saricicek, University of North Carolina at Chapel Hill
Dean Huff, UNC
The SAS® Global Forum paper 'Best Practices for Configuring Your I/O Subsystem for SAS®9 Applications' provides general guidelines for configuring I/O subsystems for your SAS® applications. The paper reflects updated storage and virtualization technology. This companion paper ('Frequently Asked Questions Regarding Storage Configurations') is commensurately updated, including new storage technologies such as storage virtualization, storage tiers (including automated tier management), and flash storage. The subject matter is voluminous, so a frequently asked questions (FAQ) format is used. Our goal is to continually update this paper as additional field needs arise and technology dictates.
Tony Brown, SAS
Margaret Crevar, SAS
Data quality is now more important than ever before. According to Gartner (2011), poor data quality is the primary reason why 40% of all business initiatives fail to achieve their targeted benefits. In response, Deloitte Belgium has created an agile data quality framework using SAS® Data Management to rapidly identify and resolve root causes of data quality issues to jump-start business initiatives, especially data migration initiatives. Moreover, the approach uses both standard SAS Data Management functionality (such as standardization and parsing) and advanced features (such as macros, dynamic profiling in deployment mode, and extraction of the profiling results), allowing the framework to be agile and flexible and to maximize the reusability of specific components built in SAS Data Management.
Yves Wouters, Deloitte
Valérie Witters, Deloitte
Companies spend vast amounts of resources developing and enhancing proprietary software to clean their business data. Save time and obtain more accurate results by leveraging the SAS® Quality Knowledge Base (QKB), formerly a DataFlux® Data Quality technology. Tap into the existing QKB rules for cleansing contact information or product data, or easily design your own custom rules using the QKB editing tools. The QKB enables data management operations such as parsing, standardization, and fuzzy matching for contact information such as names, organizations, addresses, and phone numbers, or for product data attributes such as materials, colors, and dimensions. The QKB supports data in native character sets in over 38 locales. A single QKB can be shared by multiple SAS® Data Management installations across your enterprise, ensuring consistent results on workstations, servers, and massively parallel processing systems such as Hadoop. In this breakout, a SAS R&D manager demonstrates the power and flexibility of the QKB, and answers your questions about how to deploy and customize the QKB for your environment.
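With SAS® Data Quality Server and a QKB locale loaded, QKB definitions can be called from DATA step code roughly as follows; the setup path, locale, definition names, and sensitivity are common examples that should be checked against your own QKB:

/* load the English (United States) locale from the QKB; the path varies by site */
%dqload(dqlocale=(ENUSA), dqsetuploc='/opt/sas/qkb/ci');

data work.standardized;
   set work.contacts;
   length std_name std_phone name_matchcd $64;
   std_name     = dqstandardize(name,  'Name',  'ENUSA');
   std_phone    = dqstandardize(phone, 'Phone', 'ENUSA');
   /* matchcode used for fuzzy matching on name at sensitivity 85 */
   name_matchcd = dqmatch(name, 'Name', 85, 'ENUSA');
run;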
Brian Rineer, SAS
While there has been tremendous progress in technologies related to data storage, high-performance computing, and advanced analytic techniques, organizations have only recently begun to comprehend the importance of parallel strategies that help manage the cacophony of concerns around access, quality, provenance, data sharing, and use. While data governance is not new, the drumbeat around it, along with master data management and data quality, is approaching a crescendo. Intensified by the increase in consumption of information, expectations about ubiquitous access, and highly dynamic visualizations, these factors are also circumscribed by security and regulatory constraints. In this paper, we provide a summary of what data governance is and its importance. We go beyond the obvious and provide practical guidance on what it takes to build out a data governance capability appropriate to the scale, size, and purpose of the organization and its culture. Moreover, we discuss best practices in the form of requirements that highlight what we think is important to consider as you provide that tactical linkage between people, policies, and processes to the actual data lifecycle. To that end, our focus includes the organization and its culture, people, processes, policies, and technology. Further, we include discussions of organizational models as well as the role of the data steward, and provide guidance on how to formalize data governance into a sustainable set of practices within your organization.
Greg Nelson, ThotWave
Lisa Dodson, SAS
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you are getting oriented and will provide short, straightforward answers. We also share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS® Customer Experience Testing shares insiders' practical tips to ensure that you are ready to begin your transition to SAS 9.4. The target audience for this paper is primarily system administrators who will be installing, configuring, or administering the SAS 9.4 environment. (This paper is an updated version of the paper presented at SAS Global Forum 2014 and includes new features and software changes since the original paper was delivered, plus any relevant content that still applies. This paper includes information specific to SAS 9.4 and SAS 9.4 maintenance releases.)
Cindy Taylor, SAS
In this session, we discuss the advantages of SAS® Federation Server and how it makes it easier for business users to access secure data for reports and use analytics to drive accurate decisions. This frees up IT staff to focus on other tasks by giving them a simple method of sharing data using a centralized, governed security layer. SAS Federation Server is a data server that provides scalable, threaded, multi-user, and standards-based data access technology in order to process and seamlessly integrate data from multiple data repositories. The server acts as a hub that provides clients with data by accessing, managing, and sharing data from multiple relational and non-relational data sources as well as from SAS® data. Users can view data in big data sources like Hadoop, SAP HANA, Netezza, or Teradata, and blend them with existing database systems like Oracle or DB2. Security and governance features, such as data masking, ensure that the right users have access to the data and reduce the risk of exposure. Finally, data services are exposed via a REST API for simpler access to data from third-party applications.
Ivor Moan, SAS
Organizations are loading data into Hadoop platforms at an extraordinary rate. However, in order to extract value from these platforms, the data must be prepared for analytic use. As the volume of data grows, it becomes increasingly important to reduce data movement, as well as to leverage the computing power of these distributed systems. This paper provides a cursory overview of SAS® Data Loader, a product specifically aimed at these challenges. We cover the underlying mechanisms of how SAS Data Loader works, as well as how it's used to profile, cleanse, transform, and ultimately prepare data for analytics in Hadoop.
Keith Renison, SAS
SAS® customers benefit greatly when they are using the functionality, performance, and stability available in the latest version of SAS. However, the task of moving all SAS collateral such as programs, data, catalogs, metadata (stored processes, maps, queries, reports, and so on), and content to SAS® 9.4 can seem daunting. This paper provides an overview of the steps required to move all SAS collateral from systems based on SAS® 9.2 and SAS® 9.3 to the current release of SAS® 9.4.
Alec Fernandez, SAS
In-database processing refers to the integration of advanced analytics into the data warehouse. With this capability, analytic processing is optimized to run where the data reside, in parallel, without having to copy or move the data for analysis. From a data governance perspective there are many good reasons to embrace in-database processing. Many analytical computing solutions and large databases use this technology because it provides significant performance improvements over more traditional methods. Come learn how Blue Cross Blue Shield of Tennessee (BCBST) uses in-database processing from SAS and Teradata.
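From the SAS side, a minimal hedged sketch of in-database processing looks like the following (connection options are site-specific placeholders): with SQLGENERATION=DBMS in effect, supported Base SAS procedures generate SQL that Teradata executes, so only the results travel back to SAS.

/* push eligible PROC processing into Teradata instead of moving the data to SAS */
options sqlgeneration=dbms;

libname tdw teradata server='tdprod' user=&td_user password=&td_pass database=claims;

proc means data=tdw.claim_lines sum mean maxdec=2;
   class plan_code;
   var paid_amount;
run;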
Harold Klagstad, BlueCross BlueShield of TN
It's well known that SAS® is the leader in advanced analytics, but often overlooked is the intelligent data preparation that combines information from disparate sources to enable confident creation and deployment of compelling models. Improving data-based decision making is among the top reasons why organizations decide to embark on master data management (MDM) projects and why you should consider incorporating MDM functionality into your analytics-based processes. MDM is a discipline that includes the people, processes, and technologies for creating an authoritative view of core data elements in enterprise operational and analytic systems. This paper demonstrates why MDM functionality is a natural fit for many SAS solutions that need to have access to timely, clean, and unique master data. Because MDM shares many of the same technologies that power SAS analytic solutions, it has never been easier to add MDM capabilities to your advanced analytics projects.
Ron Agresta, SAS
Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss-given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level built on these models and that are used to help in achieving business targets, risk-sensitive pricing, capital planning, optimizing of ROE/RAROC, managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate and their predictive power can drop dramatically. Since the global financial crisis in 2008, we have faced a tsunami of regulation and accelerated frequency of changes in the business environment, which cause models to deteriorate faster than ever before. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results--without human intervention) might result in a huge error that can have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS® tools: model monitoring, SAS® Model Manager, dashboards, and SAS® Visual Analytics.
Boaz Galinson, Bank Leumi
Organisations find that SAS® upgrades and migration projects come with risks, costs, and challenges to solve. The benefits are enticing: new software capabilities such as SAS® Visual Analytics help maintain your competitive advantage. An interesting conundrum. This paper explores how to evaluate the benefits and plan the project, as well as how the cloud option impacts modernisation. The author draws on the experience of leading numerous migration and modernisation projects at the leading UK SAS Implementation Partner.
David Shannon, Amadeus Software
Well, Hadoop community, now that you have your data in Hadoop, how are you staging your analytical base tables? In my discussions with clients about this, we all agree on one thing: Data sizes stored in Hadoop prevent us from moving that data to a different platform in order to generate the analytical base tables. To address this dilemma, I want to introduce to you the SAS® In-Database Code Accelerator for Hadoop.
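Conceptually, the SAS In-Database Code Accelerator runs DS2 thread programs inside the Hadoop cluster, next to the data. The following is a hedged sketch only; the Hadoop libref, table names, derived column, and the DS2ACCEL= option setting are assumptions to verify against your installation:

proc ds2 ds2accel=yes;                       /* request in-database execution (option name assumed) */
   thread build_abt / overwrite=yes;
      dcl double risk_flag;
      method run();
         set hdp.transactions;               /* Hive table through a Hadoop libref (assumed)        */
         risk_flag = (amount > 10000);       /* toy derived column for the analytical base table    */
         output;
      end;
   endthread;

   data hdp.abt_transactions (overwrite=yes);
      dcl thread build_abt t;
      method run();
         set from t;
         output;
      end;
   enddata;
run;
quit;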
Steven Sober, SAS
Donna DeCapite, SAS
Evaluation of the impact of critical or high-risk events or periods in longitudinal studies of growth might provide clues to the long-term effects of life events and the efficacies of preventive and therapeutic interventions. Conventional linear longitudinal models typically involve a single growth profile to represent linear changes in an outcome variable across time, which sometimes does not fit the empirical data. Piecewise linear mixed-effects models allow different linear functions of time for the pre- and post-critical time point trends. This presentation shows: 1) how to fit piecewise linear mixed-effects models in SAS step by step, in the context of a clinical trial with two-arm interventions and a predictive covariate of interest; 2) how to obtain the slopes and corresponding p-values for the intervention and control groups during the pre- and post-critical periods, conditional on different values of the predictive covariate; and 3) how to make meaningful comparisons and present the results in a scientific manuscript. A SAS macro that generates summary tables to assist the interpretation of the results is also provided.
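A minimal hedged sketch of the piecewise specification (not the presenter's macro or full model): with a known critical time point, two derived time variables encode the pre- and post-event slopes, and PROC MIXED estimates them with subject-level random effects. Variable names and the knot value of 6 are illustrative:

/* split time at the critical point (knot) t0 = 6, e.g., months */
data long2;
   set long;
   time_pre  = min(time, 6);            /* grows until the critical point     */
   time_post = max(time - 6, 0);        /* 0 before it, grows afterward       */
run;

proc mixed data=long2 method=reml;
   class id group;
   model y = group time_pre time_post group*time_pre group*time_post / solution ddfm=kr;
   random intercept time_pre time_post / subject=id type=un;
   /* slopes by group; coefficient order follows the CLASS level order of GROUP */
   estimate 'pre-period slope, 1st group level'  time_pre  1 group*time_pre  1 0;
   estimate 'pre-period slope, 2nd group level'  time_pre  1 group*time_pre  0 1;
   estimate 'post-period slope, 1st group level' time_post 1 group*time_post 1 0;
   estimate 'post-period slope, 2nd group level' time_post 1 group*time_post 0 1;
run;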
Qinlei Huang, St Jude Children's Research Hospital
Researchers, patients, clinicians, and other health-care industry participants are forging new models for data-sharing in hopes that the quantity, diversity, and analytic potential of health-related data for research and practice will yield new opportunities for innovation in basic and translational science. Whether we are talking about medical records (for example, EHR, lab, notes), administrative data (claims and billing), social (on-line activity), behavioral (fitness trackers, purchasing patterns), contextual (geographic, environmental), or demographic data (genomics, proteomics), it is clear that as health-care data proliferates, threats to security grow. Beginning with a review of the major health-care data breaches in our recent history, we highlight some of the lessons that can be gleaned from these incidents. In this paper, we talk about the practical implications of data sharing and how to ensure that only the right people have the right access to the right level of data. To that end, we explore not only the definitions of concepts like data privacy, but we discuss, in detail, methods that can be used to protect data--whether inside our organization or beyond its walls. In this discussion, we cover the fundamental differences between encrypted data, 'de-identified', 'anonymous', and 'coded' data, and methods to implement each. We summarize the landscape of maturity models that can be used to benchmark your organization's data privacy and protection of sensitive data.
Greg Nelson, ThotWave
The era of big data and health care reform is an exciting and challenging time for anyone whose work involves data security, analytics, data visualization, or health services research. This presentation examines important aspects of current approaches to quality improvement in health care based on data transparency and patient choice. We look at specific initiatives related to the Affordable Care Act (for example, the qualified entity program of section 10332 that allows the Centers for Medicare and Medicaid Services (CMS) to provide Medicare claims data to organizations for multi-payer quality measurement and reporting, the open payments program, and state-level all-payer claims databases to inform improvement and public reporting) within the context of a core issue in the era of big data: security and privacy versus transparency and openness. In addition, we examine an assumption that underlies many of these initiatives: data transparency leads to improved choices by health care consumers and increased accountability of providers. For example, recent studies of one component of data transparency, price transparency, show that, although health plans generally offer consumers an easy-to-use cost calculator tool, only about 2 percent of plan members use it. Similarly, even patients with high-deductible plans (presumably those with an increased incentive to do comparative shopping) seek prices for only about 10 percent of their services. Anyone who has worked in analytics, reporting, or data visualization recognizes the importance of understanding the intended audience, and that methodological transparency is as important as the public reporting of the output of the calculation of cost or quality metrics. Although widespread use of publicly reported health care data might not be a realistic goal, data transparency does offer a number of potential benefits: data-driven policy making, informed management of cost and use of services, as well as public health benefits through, for example, the recognition of patterns of disease prevalence and immunization use. Looking at this from a system perspective, we can distinguish five main activities: data collection, data storage, data processing, data analysis, and data reporting. Each of these activities has important components (such as database design for data storage and de-identification and aggregation for data reporting) as well as overarching requirements such as data security and quality assurance that are applicable to all activities. A recent Health Affairs article by CMS leaders noted that the big-data revolution could not have come at a better time, but it also recognizes that challenges remain. Although CMS is the largest single payer for health care in the U.S., the challenges it faces are shared by all organizations that collect, store, analyze, or report health care data. In turn, these challenges are opportunities for database developers, systems analysts, programmers, statisticians, data analysts, and those who provide the tools for public reporting to work together to design comprehensive solutions that inform evidence-based improvement efforts.
Paul Gorrell, IMPAQ International
Streaming data is becoming more and more prevalent. Everything is generating data now--social media, machine sensors, the 'Internet of Things'. And you need to decide what to do with that data right now. And 'right now' could mean 10,000 times or more per second. SAS® Event Stream Processing provides an infrastructure for capturing streaming data and processing it on the fly--including applying analytics and deciding what to do with that data. All in milliseconds. There are some basic tenets on how SAS® provides this extremely high-throughput, low-latency technology to meet whatever streaming analytics your company might want to pursue.
Diane Hatcher, SAS
Jerry Baulier, SAS
Steve Sparano, SAS
When planning for a journey, one of the main goals is to get the best value possible. The same thing could be said for your corporate data as it journeys through the data management process. It is your goal to get the best data in the hands of decision makers in a timely fashion, with the lowest cost of ownership and the minimum number of obstacles. The SAS® Data Management suite of products provides you with many options for ensuring value throughout the data management process. The purpose of this session is to focus on how the SAS® Data Management solution can be used to ensure the delivery of quality data, in the right format, to the right people, at the right time. The journey is yours, the technology is ours--together, we can make it a fulfilling and rewarding experience.
Mark Craver, SAS
Once you have a SAS® Visual Analytics environment up and running, the next important piece of the puzzle is to keep your users happy by keeping their data loaded and refreshed on a consistent basis. Loading data from the SAS Visual Analytics UI is both a great first step and great for ad hoc data exploration. But automating this data load so that users can focus on exploring the data and creating reports is where the power of SAS Visual Analytics comes into play. By using tried-and-true SAS® Data Integration Studio techniques (both out-of-the-box and custom transforms), you can easily make this happen. Proven techniques such as sweeping from a source library and stacking similar Hadoop Distributed File System (HDFS) tables into SAS® LASR™ Analytic Server for consumption by SAS Visual Analytics are presented using SAS Visual Analytics and SAS Data Integration Studio.
Jason Shoffner, SAS
Brandon Kirk, SAS
SAS® Visual Analytics is a powerful tool for exploring, analyzing, and reporting on your data. Whether you understand your data well or are in need of additional insights, SAS Visual Analytics has the capabilities you need to discover trends, see relationships, and share the results with your information consumers. This paper presents a case study applying the capabilities of SAS Visual Analytics to NCAA Division I college football data from 2005 through 2014. It follows the process from reading raw comma-separated values (csv) files through processing that data into SAS data sets, doing data enrichment, and finally loading the data into in-memory SAS® LASR™ tables. The case study then demonstrates using SAS Visual Analytics to explore detailed play-by-play data to discover trends and relationships, as well as to analyze team tendencies to develop game-time strategies. Reports on player, team, conference, and game statistics can be used for fun (by fans) and for profit (by coaches, agents and sportscasters). Finally, the paper illustrates how all of these capabilities can be delivered via the web or to a mobile device--anywhere--even in the stands at the stadium. Whether you are using SAS Visual Analytics to study college football data or to tackle a complex problem in the financial, insurance, or manufacturing industry, SAS Visual Analytics provides the power and flexibility to score a big win in your organization.
John Davis, SAS
This workshop provides hands-on experience with SAS® Data Loader for Hadoop. Workshop participants will configure SAS Data Loader for Hadoop and use various directives inside SAS Data Loader for Hadoop to interact with data in the Hadoop cluster.
Kari Richardson, SAS
Is your company using or considering using SAP Business Warehouse (BW) powered by SAP HANA? SAS® provides various levels of integration with SAP BW in an SAP HANA environment. This integration enables you to not only access SAP BW components from SAS, but to also push portions of SAS analysis directly into SAP HANA, accelerating predictive modeling and data mining operations. This paper explains the SAS toolset for different integration scenarios, highlights the newest technologies contributing to integration, and walks you through examples of using SAS with SAP BW on SAP HANA. The paper is targeted at SAS and SAP developers and architects interested in building a productive analytical environment with the help of the latest SAS and SAP collaborative advancements.
Tatyana Petrova, SAS
A leading killer in the United States is smoking. Moreover, over 8.6 million Americans live with a serious illness caused by smoking or second-hand smoke. Despite this, over 46.6 million U.S. adults smoke tobacco, cigars, and pipes. The key analytic question in this paper is: how would e-cigarettes affect this public health situation? Can monitoring public opinions of e-cigarettes using SAS® Text Analytics and SAS® Visual Analytics help provide insight into the potential dangers of these new products? Are e-cigarettes an example of Big Tobacco up to its old tricks or, in fact, a cessation product? The research in this paper was conducted on thousands of tweets from April to August 2014. It includes API sources beyond Twitter--for example, indicators from the Health Indicators Warehouse (HIW) of the Centers for Disease Control and Prevention (CDC)--that were used to enrich Twitter data in order to implement a surveillance system developed by SAS® for the CDC. The analysis is especially important to the Office of Smoking and Health (OSH) at the CDC, which is responsible for tobacco control initiatives that help states to promote cessation and prevent initiation in young people. To help the CDC succeed with these initiatives, the surveillance system also: 1) automates the acquisition of data, especially tweets; and 2) applies text analytics to categorize these tweets using a taxonomy that provides the CDC with insights into a variety of relevant subjects. Twitter text data can help the CDC look at the public response to the use of e-cigarettes, examine general discussions regarding smoking and public health, and identify potential controversies (involving tobacco exposure to children, increasing government regulations, and so on). SAS® Content Categorization helps health care analysts review large volumes of unstructured data by categorizing tweets in order to monitor and follow what people are saying and why they are saying it. Ultimately, it is a solution intended to help the CDC monitor the public's perception of the dangers of smoking and e-cigarettes; in addition, it can identify areas where OSH can focus its attention in order to fulfill its mission and track the success of CDC health initiatives.
Manuel Figallo, SAS
Emily McRae, SAS
A prevalent problem surrounding Extract, Transform, and Load (ETL) development is the difficulty of applying consistent logic and manipulation to source data when migrating to target data structures. Certain inconsistencies that add a layer of complexity include, but are not limited to, naming conventions and data types associated with multiple sources, numerous solutions applied by an array of developers, and multiple points of updating. In this paper, we examine the evolution of implementing a best-practices solution during the process of data delivery, with respect to standardizing data. The solution begins with injecting the transformations of the data directly into the code at the standardized layer via Base SAS® or SAS® Enterprise Guide®. A more robust method that we explore is to apply these transformations with SAS® macros. This provides the capability to apply these changes in a consistent manner across multiple sources. We further explore this solution by implementing the macros within SAS® Data Integration Studio processes on the DataFlux® Data Management Platform. We consider these issues within the financial industry, but the proposed solution can be applied across multiple industries.
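The macro-based approach can be sketched as follows: one macro encapsulates the standardization rules (here just trimming, case-folding, and renaming to a target convention, assuming character source fields) so that every source feed applies identical logic. All names and rules shown are illustrative:

%macro standardize(in=, out=, idvar=, namevar=);
   data &out;
      set &in;
      length cust_id $20 cust_name $100;
      /* map source-specific fields onto the target naming convention */
      cust_id   = strip(upcase(&idvar));
      cust_name = propcase(strip(&namevar));
      drop &idvar &namevar;
   run;
%mend standardize;

/* the same rules applied consistently to differently named source feeds */
%standardize(in=src1.accounts,  out=stage.accounts_std,  idvar=acct_no,  namevar=acct_nm);
%standardize(in=src2.customers, out=stage.customers_std, idvar=customer, namevar=fullname);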
Avery Long, Financial Risk Group
Frank Ferriola, Financial Risk Group
In 2014, for the first time, mid-market banks (consisting of banks and bank holding companies with $10-$50 billion in consolidated assets) were required to submit Capital Stress Tests to the federal regulators under the Dodd-Frank Act Stress Testing (DFAST). This is a process large banks have been going through since 2011. However, mid-market banks are not positioned to commit as many resources to their annual stress tests as their largest peers. Limited human and technical resources, incomplete or non-existent detailed historical data, lack of enterprise-wide cross-functional analytics teams, and limited exposure to rigorous model validations are all challenges mid-market banks face. While there are fewer deliverables required from the DFAST banks, the scrutiny the regulators are placing on the analytical models is just as high as their expectations for Comprehensive Capital Analysis and Review (CCAR) banks. This session discusses the differences in how DFAST and CCAR banks execute their stress tests, the challenges facing DFAST banks, and potential ways DFAST banks can leverage the analytics behind this exercise.
Charyn Faenza, F.N.B. Corporation
PROC MIXED is one of the most popular SAS procedures for performing longitudinal analysis or multilevel models in epidemiology. Model selection is one of the fundamental questions in model building. One of the most popular and widely used strategies is model selection based on information criteria, such as the Akaike Information Criterion (AIC) and the Sawa Bayesian Information Criterion (BIC). This strategy considers both fit and complexity, and enables multiple models to be compared simultaneously. However, there is no existing SAS procedure to perform model selection automatically based on information criteria for PROC MIXED, given a set of covariates. This paper provides information about using the SAS %ic_mixed macro to select a final model with the smallest values of AIC and BIC. Specifically, the %ic_mixed macro will do the following: 1) produce a complete list of all possible model specifications given a set of covariates, 2) use a DO loop to read in one model specification at a time and save it in a macro variable, 3) execute PROC MIXED and use the Output Delivery System (ODS) to output the AICs and BICs, 4) append all outputs and use the DATA step to create a sorted list of information criteria with model specifications, and 5) run PROC REPORT to produce the final summary table. Based on the sorted list of information criteria, researchers can easily identify the best model. This paper includes the macro programming language, as well as examples of the macro calls and outputs.
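The core mechanism inside such a macro, capturing AIC and BIC from PROC MIXED through ODS OUTPUT and stacking them for comparison, can be sketched for two candidate models as follows (the %ic_mixed macro generalizes this over all covariate combinations; data set and variable names are illustrative):

/* ML (rather than REML) so that models with different fixed effects are comparable */
ods output FitStatistics=fit1;
proc mixed data=mydata method=ml;
   class id;
   model y = x1 x2 / solution;
   random intercept / subject=id;
run;

ods output FitStatistics=fit2;
proc mixed data=mydata method=ml;
   class id;
   model y = x1 x2 x3 / solution;
   random intercept / subject=id;
run;

/* stack the criteria and keep only the AIC and BIC rows */
data model_compare;
   length model $40;
   set fit1 (in=in1) fit2;
   if in1 then model = 'y = x1 x2';
   else        model = 'y = x1 x2 x3';
   if upcase(scan(descr, 1)) in ('AIC', 'BIC');
run;

proc sort data=model_compare;
   by descr value;
run;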
Qinlei Huang, St Jude Children's Research Hospital
In this session, I discuss an overall approach to governing Big Data. I begin with an introduction to Big Data governance and the governance framework. Then I address the disciplines of Big Data governance: data ownership, metadata, privacy, data quality, and master and reference data management. Finally, I discuss the reference architecture of Big Data, and how SAS® tools can address Big Data governance.
Sunil Soares, Information Asset
So you have big data and need to know how to quickly and efficiently keep your data up-to-date and available in SAS® Visual Analytics? One of the challenges that customers often face is how to regularly update data tables in the SAS® LASR™ Analytic Server, the in-memory analytical platform for SAS Visual Analytics. Is appending data always the right answer? What are some of the key things to consider when automating a data update and load process? Based on proven best practices and existing customer implementations, this paper provides you with answers to those questions and more, enabling you to optimize your update and data load processes. This ensures that your organization develops an effective and robust data refresh strategy.
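One common refresh pattern, sketched here under assumptions about your environment (the host, port, LASR tag, and table names are placeholders), is to reload a prepared table through the SASIOLA engine so that report consumers always see a complete, current copy:

/* connect to the LASR Analytic Server used by SAS Visual Analytics */
libname valasr sasiola host="lasrhost.example.com" port=10010 tag="vapublic";

/* unload the old in-memory copy, then load the freshly prepared table */
proc datasets lib=valasr nolist nowarn;
   delete sales_summary;
quit;

data valasr.sales_summary;
   set work.sales_summary;        /* staged and cleansed earlier in the job */
run;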
Kerri L. Rivers, SAS
Christopher Redpath, SAS
How does historical production data relate a story about subsurface oil and gas reservoirs? Business and domain experts must perform accurate analysis of reservoir behavior using only rate and pressure data as a function of time. This paper introduces innovative data-driven methodologies to forecast oil and gas production in unconventional reservoirs that, owing to the tightness of the rocks, render the empirical functions less effective and accurate. You learn how implementations of the SAS® MODEL procedure provide functional algorithms that generate data-driven type curves on historical production data. Reservoir engineers can now gain more insight into the future performance of the wells across their assets. SAS enables a more robust forecast of the hydrocarbons both in ad hoc individual well interactions and in an automated batch mode across the entire portfolio of wells. Examples of the MODEL procedure arising in subsurface production data analysis are discussed, including the Duong data model and the stretched exponential data model. In addressing these examples, techniques for pattern recognition and for implementing the TREE, CLUSTER, and DISTANCE procedures in SAS/STAT® are highlighted to explicate the importance of oil and gas well profiling to characterize the reservoir. The MODEL procedure analyzes models in which the relationships among the variables comprise a system of one or more nonlinear equations. Primary uses of the MODEL procedure are estimation, simulation, and forecasting of nonlinear simultaneous equation models, and generating type curves that fit the historical rate production data. You will walk through several advanced analytical methodologies that implement the SEMMA process to enable hypothesis testing as well as directed and undirected data mining techniques. SAS® Visual Analytics Explorer drives the exploratory data analysis to surface trends and relationships, and the data QC workflows ensure a robust input space for the performance forecasting methodologies that are visualized in a web-based thin client for interactive interpretation by reservoir engineers.
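For example, a stretched exponential type curve of the form rate = qi * exp(-(t/tau)**n) can be fit to historical rate data with the MODEL procedure roughly as follows; the data set, variable names, and starting values are illustrative:

proc model data=well_history;
   parms qi 1000 tau 12 n 0.5;             /* starting values: rough guesses          */
   rate = qi * exp(-(t / tau)**n);         /* stretched exponential decline equation  */
   fit rate / out=fitted outpredict;       /* estimates qi, tau, n from observed rate */
run;
quit;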
Keith Holdaway, SAS
Louis Fabbi, SAS
Dan Lozie, SAS
Real-time web content personalization has come into its teen years, but recently a spate of marketing solutions has enabled marketers to finely personalize web content for visitors based on browsing behavior, geo-location, preferences, and so on. In an age where the attention span of a web visitor is measured in seconds, marketers hope that tailoring the digital experience will pique each visitor's interest just long enough to increase corporate sales. The range of solutions spans the entire spectrum from completely cloud-based installations to completely on-premises installations. Marketers struggle to find the optimal solution that would meet their corporation's marketing objectives, provide the highest agility and time-to-market, and still keep a low marketing budget. In the last decade or so, marketing strategies that involved personalizing using purely on-premises customer data quickly got replaced by ones that involved personalizing using only web-browsing behavior (a.k.a. clickstream data). This was possible because of a spate of cloud-based solutions that enabled marketers to de-couple themselves from the underlying IT infrastructure and the storage issues of capturing large volumes of data. However, this new trend meant that corporations weren't using much of their treasure trove of on-premises customer data. Of late, however, enterprises have been trying hard to find solutions that give them the best of both--the ease of gathering clickstream data using cloud-based applications and the richness of on-premises customer data--to perform analytics that lead to better web content personalization for a visitor. This paper explains a process that attempts to address this rapidly evolving need. The paper assumes that the enterprise already has tools for capturing clickstream data, developing analytical models, and presenting the content. It provides a roadmap to implementing a phased approach where enterprises continue to capture clickstream data, but they bring that data in-house to be merged with customer data to enable their analytics team to build sophisticated predictive models that can be deployed into the real-time web-personalization application. The final phase requires enterprises to refine their predictive models on a periodic basis.
Mahesh Subramanian, SAS Institute Inc.
Suneel Grover, SAS
Companies are increasingly relying on analytics as the right solution to their problems. In order to use analytics and create value for the business, companies first need to store, transform, and structure the data to make it available and functional. This paper shows a successful business case where the extraction and transformation of the data combined with analytical solutions were developed to automate and optimize the management of the collections cycle for a TELCO company (DIRECTV Colombia). SAS® Data Integration Studio is used to extract, process, and store information from a diverse set of sources. SAS Information Map is used to integrate and structure the created databases. SAS® Enterprise Guide® and SAS® Enterprise Miner™ are used to analyze the data, find patterns, create profiles of clients, and develop churn predictive models. SAS® Customer Intelligence Studio is the platform on which the collection campaigns are created, tested, and executed. SAS® Web Report Studio is used to create a set of operational and management reports.
Darwin Amezquita, DIRECTV
Paulo Fuentes, Directv Colombia
Andres Felipe Gonzalez, Directv
The latest releases of SAS® Data Integration Studio and DataFlux® Data Management Platform provide an integrated environment for managing and transforming your data to meet new and increasingly complex data management challenges. The enhancements help develop efficient processes that can clean, standardize, transform, master, and manage your data. The latest features include capabilities for building complex job processes, new web-based development and job monitoring environments, enhanced ELT transformation capabilities, big data transformation capabilities for Hadoop, integration with the analytic platform provided by SAS® LASR™ Analytic Server, enhanced features for lineage tracing and impact analysis, and new features for master data and metadata management. This paper provides an overview of the latest features of the products and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
Mike Frost, SAS
Managing and organizing external files and directories play an important part in our data analysis and business analytics work. A good file management system can streamline project management and file organization and significantly improve work efficiency. Therefore, under many circumstances, it is necessary to automate and standardize the file management processes through SAS® programming. Compared with managing SAS files via PROC DATASETS, managing external files is a much more challenging task that requires advanced programming skills. This paper presents and discusses various methods and approaches to managing external files with SAS programming. The illustrated methods and skills can have important applications in a wide variety of analytic fields.
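As a small, hedged example of the kind of technique the paper covers, the FILENAME statement and the directory I/O functions can inventory the files in a folder into a SAS data set (the path is a placeholder):

%let dir = /project/incoming;              /* folder to inventory (placeholder)   */
filename dlist "&dir";

data work.file_list (keep=member_name);
   length member_name $256;
   did = dopen('dlist');                   /* open the directory                  */
   if did > 0 then do;
      do i = 1 to dnum(did);               /* loop over its members               */
         member_name = dread(did, i);
         output;
      end;
      rc = dclose(did);
   end;
   else put 'ERROR: could not open directory ' "&dir";
run;

filename dlist clear;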
Justin Jia, Trans Union
Amanda Lin, CIBC