Healthcare is weird. Healthcare data is even more so. The digitization of data describing the patient experience is a modern phenomenon, and most healthcare organizations are still early in that journey. While the business of healthcare is already a century old, most organizations have focused their efforts on the financial aspects of healthcare rather than on stakeholder experience or clinical outcomes. Think of the workflow you might have experienced: scheduling an appointment, visiting the doctor, obtaining lab tests, or receiving prescriptions for interventions such as surgery or physical therapy. The modern healthcare system creates a digital footprint of administrative, process, quality, epidemiological, financial, clinical, and outcome measures, which range widely in size, cleanliness, and usefulness. Whether you are new to healthcare data or are looking to advance your knowledge of healthcare data and the techniques used to analyze it, this paper serves as a practical guide to understanding and using healthcare data. We explore common methods for structuring and accessing data, discuss common challenges such as aggregating data into episodes of care, describe how to reverse engineer real-world events from the data, and address the myriad forms of unstructured data found in nursing notes. Finally, we discuss the ethical uses of healthcare data and the limits of informed consent, which are critically important for those of us in analytics.
Greg Nelson, Thotwave Technologies, LLC.
Interactively redeploying SAS® Data Integration Studio jobs can be a slow and tedious process. The updated batch deployment utility gives the ETL Tech Lead a more efficient and repeatable method for administering batch jobs. This improved tool became available in SAS® Data Integration Studio 4.901.
Jeff Dyson, The Financial Risk Group
UNIX and Linux SAS® administrators, have you ever been greeted by one of these statements as you walk into the office, before you have even gotten your first cup of coffee? "Power outage! The SAS servers are down." "I cannot access my reports." Have you frantically tried to restart the SAS servers to avoid lost productivity, missed one of the steps in the process, and caused further delays while other work continued to pile up? If you have had this experience, you understand the benefit to be gained from a utility that automates the management of these multi-tiered deployments. Until recently, there was no method for automatically starting and stopping multi-tiered services in an orchestrated fashion. Instead, you had to use time-consuming manual procedures to manage SAS services. These procedures were also prone to human error, which could result in corrupted services and additional time lost debugging and resolving issues introduced along the way. To address this challenge, SAS Technical Support created the SAS Local Services Management (SAS_lsm) utility, which provides automated, orderly management of your SAS® multi-tiered deployments. This paper demonstrates the deployment and usage of the SAS_lsm utility. Now, go grab a coffee, and let's see how SAS_lsm can make life less chaotic.
Clifford Meyers, SAS
As an information security or data professional, you have seen and heard how advanced analytics has impacted nearly every business domain. You recognize the potential of insights derived from advanced analytics to improve the information security of your organization, and you want to realize these benefits while understanding their pitfalls. To successfully apply advanced analytics to the information security business problem, proper application of data management processes and techniques is of paramount importance. Based on professional services experience in implementing SAS® Cybersecurity, this session teaches you about the data sources used, the activities involved in properly managing this data, and the ways in which these processes address information security business problems. You will come to appreciate that using advanced analytics in the information security domain requires more than just the application of tools or modeling techniques. A data management regime for information security can benefit your organization by providing insights into IT infrastructure, enabling successful data science activities, and providing greater resilience by way of improved information security investigations.
Alex Anglin, SAS
This paper presents considerations for deploying SAS® Foundation across software-defined storage (SDS) infrastructures and within virtualized storage environments. There are many new products on the market that offer easy, point-and-click creation of storage entities with simplified management. Internal storage area network (SAN) virtualization also removes much of the hands-on work of defining storage device pools, and automated tiering software attempts to optimize data placement across performance tiers without manual intervention. Virtual storage provisioning and automated tier placement have many time-saving and management benefits. In some cases, they have also caused serious unintended performance issues with heavy large-block workloads, such as those found in SAS Foundation. You must follow best practices to get the benefit of these new technologies while still maintaining performance. For SDS infrastructures, this paper offers specific considerations for SAS Foundation application performance, workload management and segregation, replication, high availability, and disaster recovery. Architecture and performance ramifications and advice are offered for virtualized and tiered storage systems. General virtual storage pros and cons are also discussed in detail.
Tony Brown, SAS
Margaret Crevar, SAS
The ever-growing volume of data challenges us to keep pace in ensuring that we use it to its full advantage. Unfortunately, our response to new data sources, data types, and applications is often reactionary, driven by the misperception that organizations have precious little time to consider a purposeful strategy without disrupting business continuity. Strategy is a term that is often misused and ill-defined; it is nothing more than a set of integrated choices that help position an initiative for future success. This presentation covers the key elements that define a data strategy, including: What data should we keep or toss? How should we structure data (warehouse versus data lake versus real-time event streaming)? How do we store data (cloud, virtualization, federation, Hadoop)? What approach do we use to integrate and cleanse data (ETL versus cognitive/automated profiling)? How do we protect and share data? These topics help ensure that an organization gets the most value from its data, and they explore how to prioritize and adapt the strategy to meet unanticipated needs in the future. As with any strategy, we need a roadmap for execution, so we also talk specifically about the tools, technologies, methods, and processes that are useful in designing a data strategy that is both relevant and actionable for your organization.
Greg Nelson, Thotwave Technologies, LLC.
This paper demonstrates how to use the capabilities of SAS® Data Integration Studio to extract, load, and transform your data within a Hadoop environment. A sample use case, stepping through the process from source to target, illustrates which transformations can be used in each layer of the ELT process and describes the functionality of each.
Darryl Yewchin, SAS
Todd Foreman, SAS
Session 2001-2017:
Hands-On Workshop: SAS® Data Loader for Hadoop
This workshop provides hands-on experience with some basic functionality of SAS Data Loader for Hadoop. You will learn how to copy data to Hadoop, profile data in Hadoop, and cleanse data in Hadoop.
Kari Richardson, SAS
The clock is ticking! Is your company ready for May 25, 2018, when the General Data Protection Regulation that governs data privacy across Europe comes into force? Companies that fail to comply incur very large fines and might lose customer trust if sensitive information is compromised. With data streaming in from multiple channels in different formats, sizes, and varying quality, it is increasingly difficult to keep track of personal data so that you can protect it. SAS® Data Management helps companies on their journey toward governance and compliance, involving tasks such as detection, quality assurance, and protection of personal data. This paper focuses on using SAS® Federation Server and SAS® Data Management Studio in the SAS® data management suite of products to surface and manage that hard-to-find personal data. SAS Federation Server provides a universal way to access data in Hadoop, Teradata, SQL Server, Oracle, SAP HANA, and other sources without data movement during processing. Its advanced data masking and encryption capabilities can be used when virtualizing data for users, and purpose-built data quality functions perform identification analysis, parsing, matching, and extraction of personal data in real time. We also provide insight into how the exploratory data analysis capability of SAS® Data Management Studio enables you to scan your investigation hub to identify and categorize personal data.
Cecily Hoffritz, SAS
Many SAS® environments are set up for single-byte character sets (SBCS), but many organizations now have to process names of people and companies that contain characters outside that set. You can solve this problem by changing the configuration to the UTF-8 encoding, which is a multi-byte character set (MBCS). However, commonly used text-manipulation functions such as SUBSTR, INDEX, and FIND act on bytes and should no longer be used; SAS has provided new functions (the K-functions) to replace them. Character fields also have to be enlarged to make room for multi-byte characters. This paper describes the problems and gives guidelines for a migration strategy. It also presents code that analyzes existing programs for functions that might cause problems. For those interested, a short historical background and a description of UTF-8 encoding are also provided. Conclusions focus on the positioning of SAS environments configured with UTF-8 versus single-byte encodings, the strategy for organizations faced with a necessary change, and the relevant documentation.
Frank Poppe, PW Consulting
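As a quick illustration of the byte-versus-character issue described in the abstract above, the following sketch (assuming a SAS session already started with UTF-8 encoding; the data values are made up) contrasts the byte-oriented functions with their K-function counterparts:

data _null_;
   name = 'Jürgen Müller';
   first_bytes = substr(name, 1, 4);    /* byte-oriented: can split a multi-byte character */
   first_chars = ksubstr(name, 1, 4);   /* character-oriented: safe under UTF-8            */
   pos_bytes   = index(name, 'M');      /* byte position                                   */
   pos_chars   = kindex(name, 'M');     /* character position                              */
   len_bytes   = length(name);          /* number of bytes                                 */
   len_chars   = klength(name);         /* number of characters                            */
   put first_bytes= first_chars= pos_bytes= pos_chars= len_bytes= len_chars=;
run;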
A growing need in higher education is the more effective use of analytics when evaluating the success of a postsecondary institution. The metrics currently used are simplistic measures of graduation rates, publications, and external funding. These measures offer a limited view of the effectiveness of postsecondary institutions. This paper provides a global perspective of the academic progress of students and the business of higher education. It presents innovative metrics that are more effective in evaluating postsecondary-institutional effectiveness.
Sean Mulvenon, University of Arkansas
Developing software using agile methodologies has become common practice in many organizations. We use the SCRUM methodology to prepare, plan, and implement changes in our analytics environment. Preparing for the deployment of a new release usually took two days of creating packages, promoting them, deploying jobs, creating migration scripts, and correcting errors made in the first attempt; a sprint of 10 working days (two weeks) was effectively reduced to barely seven days of development time. By automating this process, we reduced the time needed to prepare a deployment to less than half a day, increasing the time we can spend developing by 25%. In this paper, we present the process and system prerequisites for automating the deployment process, and we describe the process, code, and scripts required to automate metadata promotion and physical table comparison and update.
Laurent de Walick, PW Consulting
Bas Marsman, NN Bank
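To give a flavor of the physical table comparison step mentioned above, here is a minimal sketch (library and table names are assumptions, not the authors' actual code) that compares the column definitions of a deployed table with the newly promoted definition; any differences signal that a migration script is needed:

proc contents data=prod.customer  out=work.prod_cols(keep=name type length) noprint;
run;
proc contents data=stage.customer out=work.new_cols(keep=name type length) noprint;
run;
proc sort data=work.prod_cols; by name; run;
proc sort data=work.new_cols;  by name; run;

/* report any structural differences between the two column lists */
proc compare base=work.prod_cols compare=work.new_cols listall;
run;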
Geofencing is one of the most promising and exciting concepts to develop with the advent of the Internet of Things. Like John Anderton in the 2002 movie Minority Report, you can now enter a mall and immediately receive commercial ads and offers based on your personal taste and past purchases. Authorities can track vessels' positions and detect when a ship is not where it should be, or they can forecast and optimize harbor arrivals. When a truck driver deviates from the route, the dispatcher can be alerted and act immediately. And there are countless examples from manufacturing, industry, security, and even households. All of these applications are based on the core concept of geofencing: detecting whether a device's position is within a defined geographical boundary. Geofencing requires real-time processing in order to react appropriately. In this session, we explain how to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve high-performance processing, on the order of millions of events per second, over hundreds of millions of geofences.
Frederic Combaneyre, SAS
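The core geofence test referred to above is a point-in-polygon check. The sketch below shows the idea as a standard ray-casting test in a DATA step with made-up coordinates; the paper itself implements this logic on streaming events with SAS Event Stream Processing, which is not shown here.

data _null_;
   array px[4] _temporary_ (0 0 10 10);   /* longitudes of the fence vertices */
   array py[4] _temporary_ (0 10 10 0);   /* latitudes of the fence vertices  */
   lon = 4.5;                             /* device position                  */
   lat = 5.2;
   inside = 0;
   j = 4;                                 /* index of the previous vertex     */
   do i = 1 to 4;
      if (py[i] > lat) ne (py[j] > lat) then do;
         /* longitude where this fence edge crosses the device latitude */
         xcross = px[i] + (lat - py[i]) * (px[j] - px[i]) / (py[j] - py[i]);
         if lon < xcross then inside = not inside;
      end;
      j = i;
   end;
   put inside=;   /* 1 when the device is within the geofence */
run;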
No Batch Scheduler? No problem! This paper describes the use of a SAS® Data Integration Studio job that can be started by a time-dependent scheduler like Windows Scheduler (or crontab in UNIX) to mimic a batch scheduler using SAS® Grid Manager.
Patrick Cuba, Cuba BI Consulting
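One common pattern for this kind of scheduled driver job is to fan work out to the grid with SAS/CONNECT statements. The sketch below is an assumption about that pattern, not the author's actual job; the server name, include paths, and the grdsvc_enable options follow SAS Grid Manager documentation examples and may differ in your deployment.

%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));  /* route sign-ons to the grid */
options autosignon;

rsubmit task1 wait=no;
   %include "/sasjobs/deployed/load_customers.sas";   /* hypothetical deployed job */
endrsubmit;

rsubmit task2 wait=no;
   %include "/sasjobs/deployed/load_orders.sas";      /* hypothetical deployed job */
endrsubmit;

waitfor _all_ task1 task2;   /* wait for both grid sessions to finish */
signoff _all_;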
Making optimal use of SAS® Grid Computing relies on the ability to spread the workload effectively across all of the available nodes. With SAS® Scalable Performance Data Server (SPD Server), it is possible to partition your data and spread the processing across the SAS Grid Computing environment. In an ideal world, it would be possible to adjust the size and number of partitions according to the data volumes being processed on any given day. This paper discusses a technique that enables the grid processing to be dynamically and automatically reconfigured at run time, optimizing the use of SAS Grid Computing and providing significant performance benefits.
Andy Knight, SAS
Bayesian inference has become ubiquitous in applied science because of its flexibility in modeling data and because advances in computation allow simulation methods to obtain sound estimates when exact mathematical approaches are intractable. However, when the sample size is small, the choice of a prior distribution becomes difficult: computationally convenient priors can overstate prior beliefs and bias the estimates. We propose a simple form of prior distribution, a mixture of two uniform distributions, that is weakly informative in that it has a relatively large standard deviation. This choice leads to closed-form expressions for the posterior distribution when the observed data follow a normal, binomial, or Poisson distribution, and the explicit formulas are easily encoded in SAS®. For a small sample size of 10, we illustrate how to elicit the mixture prior and show that the resulting posterior distribution is insensitive to minor misspecification of input values. Weakly informative prior distributions suitable for small sample sizes are easy to specify and appear to provide robust inference.
Robert Lew, U.S. Department of Veterans Affairs
Hongsheng Wu, Wentworth Institute of Technology
Jones Yu, Wentworth Institute of Technology
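To make the closed-form claim above concrete, here is how it plays out in the binomial case, sketched with hypothetical notation (the authors' exact parameterization may differ). With prior

\[
\pi(p) \;=\; \sum_{k=1}^{2} \frac{w_k}{b_k - a_k}\,\mathbf{1}_{[a_k,\,b_k]}(p), \qquad w_1 + w_2 = 1,
\]

and x successes observed in n trials, the posterior is a mixture of truncated Beta(x+1, n-x+1) densities,

\[
\pi(p \mid x) \;=\; \sum_{k=1}^{2} \tilde{w}_k \,
\frac{p^{x}(1-p)^{n-x}\,\mathbf{1}_{[a_k,\,b_k]}(p)}
     {B_{b_k}(x{+}1,\,n{-}x{+}1) - B_{a_k}(x{+}1,\,n{-}x{+}1)},
\qquad
\tilde{w}_k \;\propto\; \frac{w_k}{b_k - a_k}\,\bigl[B_{b_k}(x{+}1,\,n{-}x{+}1) - B_{a_k}(x{+}1,\,n{-}x{+}1)\bigr],
\]

where \(B_z(\alpha,\beta) = \int_0^z t^{\alpha-1}(1-t)^{\beta-1}\,dt\) is the incomplete beta function, which SAS can evaluate as BETA(alpha, beta) * CDF('BETA', z, alpha, beta).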
SAS® Data Integration Studio jobs are not always linear. While Loop transformations have been part of SAS Data Integration Studio for ages, only more recently has it included the Conditional Control transformations to control logic flow within a job. This paper demonstrates the use of both the Loop and Conditional transformations in a real-world example.
Harry Droogendyk, Stratia Consulting Inc
A common issue in data integration is that the documentation and the SAS® data integration job source code often start to diverge and eventually become out of sync. At Capgemini, working for a specific client, we developed a solution to address this challenge: we proposed moving all necessary documentation into the SAS® Data Integration Studio job itself. All documentation then becomes part of the metadata we have created, with the possibility of automatically generating job and release documentation from that metadata. This presentation therefore focuses on the metadata documentation generator. Specifically, it 1) looks at how to use programming and documentation standards in SAS data integration jobs to enable the generation of documentation from the metadata, and 2) shows how the documentation is generated from the metadata and the challenges that were encountered creating the code. I draw on our hands-on experience; Capgemini has implemented this for a customer in the Netherlands, and we are rolling it out as an accelerator in other SAS data integration projects worldwide. I share examples of the generated documentation, which contains functional and technical designs, including a list of all source tables, a list of the target tables, all transformations with their own documentation, job dependencies, and more.
Richard Hogenberg, Capgemini
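The following minimal sketch (not Capgemini's actual generator; the connection options, attribute choices, and data set name are assumptions) shows the kind of metadata query such documentation can be built from, listing each job registered in metadata together with its description:

options metaserver="meta.example.com" metaport=8561
        metauser="sasadm@saspw" metapass="XXXXXXXX"
        metarepository="Foundation";

data job_documentation;
   length uri name desc $256;
   call missing(name, desc);
   query = "omsobj:Job?@Id contains '.'";      /* all Job objects in the repository */
   n = 1;
   nobj = metadata_getnobj(query, n, uri);     /* returns the number of matching objects */
   do while (nobj > 0);
      rc = metadata_getattr(uri, "Name", name);
      rc = metadata_getattr(uri, "Desc", desc);
      output;
      n = n + 1;
      nobj = metadata_getnobj(query, n, uri);
   end;
   keep name desc;
run;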
Session 0911-2017:
Self-Service Data Management for Analytics Users across the Enterprise
With the proliferation of analytics expanding across every function of the enterprise, the need for broader access to data by experienced data scientists and non-technical users to produce reports and do discovery is growing exponentially. The unintended consequence of this trend is a bottleneck within IT to deliver the necessary data while still maintaining the necessary governance and data security standards required to safeguard this critical corporate asset. This presentation illustrates how organizations are solving this challenge and enabling users to both access larger quantities of existing data and add new data to their own models without negatively impacting the quality, security, or cost to store that data. It also highlights some of the cost and performance benefits achieved by enabling self-service data management.
Ken Pikulik, Teradata
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®, gleaned from actual deployments across a variety of implementations and distributions. Topics include tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS components that run inside Hadoop, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Wilbram Hazejager, SAS
Nancy Rausch, SAS
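As one example of pushing work into Hadoop rather than pulling data back to SAS, the sketch below (server, port, schema, and table names are placeholders) uses explicit SQL pass-through so that only the aggregated result returns to the SAS session:

proc sql;
   connect to hadoop (server="hive.example.com" port=10000 schema=sales);
   create table work.sales_by_region as
   select * from connection to hadoop
      ( select region, count(*) as order_count
          from transactions
         group by region );      /* aggregation runs in Hive */
   disconnect from hadoop;
quit;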
Testing is a weak spot in many data warehouse environments. Most testing focuses on the correct implementation of requirements, but due to the complex nature of analytics environments, a change in one data integration process can lead to unexpected results in totally different and untouched areas. We developed a method to identify unexpected changes early and often by running a nightly regression test that performs a full ETL run, compares all output from the test to a baseline, and reports all the changes. This paper describes the process and the SAS® code needed to back up existing data, trigger ETL flows, compare results, and restore the original situation after a nightly regression test. We also discuss the challenges we experienced while implementing the nightly regression test framework.
Laurent de Walick, PW Consulting
Bas Marsman, NN Bank
Stephan Minnaert, PW Consulting
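A minimal sketch of the comparison step described above (library names are assumptions; the actual framework also backs up data and triggers the ETL flows): one PROC COMPARE is generated for every table in the baseline library against the corresponding table produced by the test run.

proc contents data=baseline._all_ out=work.members(keep=memname) noprint;
run;
proc sort data=work.members nodupkey;
   by memname;
run;

data _null_;
   set work.members;
   /* generate one comparison step per baseline table */
   call execute(cats('proc compare base=baseline.', memname,
                     ' compare=testout.', memname,
                     ' listall maxprint=100; run;'));
run;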
Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between SAS® data sets. In projects that span multiple years (for example, longitudinal studies), thousands of new variables are usually introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables, differing only in a digit or two that signifies the year of the project. So every year there is an extensive task of comparing thousands of new variables to older variables in order to carry forward key database elements defined for the previous variables, such as the length of the variable, data type, format, and discrete-or-continuous flag. In our SAS program, hash objects are used to cut down not only the time but also the number of DATA and PROC steps needed to accomplish the task, and the resulting clean, lean code is much easier to understand. A macro creates the data set containing the new and older variables, and for each new variable the FIND method of a hash object is used in a loop to locate the match to the most recent older variable. What previously took around a dozen PROC SQL steps is now a single DATA step using hash tables.
Raghav Adimulam, Westat
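The following sketch (data set and variable names are made up, not the authors' actual code) shows the essence of the hash lookup: last year's variable metadata is loaded into a hash object keyed on the variable stem, and each new variable is matched with the FIND method so its attributes can be carried forward.

data carried_forward;
   if _n_ = 1 then do;
      if 0 then set meta.old_vars;              /* define host variables and types */
      declare hash old(dataset: 'meta.old_vars');
      old.definekey('stem');
      old.definedata('stem', 'varlen', 'vartype', 'varfmt');
      old.definedone();
   end;
   set meta.new_vars;                           /* one row per new variable name   */
   stem = prxchange('s/\d+$//', 1, name);       /* drop the trailing year digits   */
   if old.find() = 0 then output;               /* attributes copied from old year */
run;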
If you run SAS® and Teradata software with default application and database client encodings, some operations with international character sets will appear to work because you are actually treating character strings as streams of bytes instead of streams of characters. As long as no one in the chain of custody tries to interpret the data as anything other than a stream of bytes, the data can sometimes flow in and out of a database without being altered, giving the appearance of correct operation. But when you need to compare a particular character to an international character, or when your data approaches the maximum size of a character field, you will run into trouble. To correctly handle international character sets, every layer of software that touches the data needs to agree on the encoding of the data. UTF-8 encoding is a flexible way to handle many single-byte and multi-byte international character sets. To use UTF-8 encoding with international character sets, the SAS session encoding, Teradata client encoding, and Teradata server encoding must all agree that they are handling UTF-8 encoded data. This paper shows you how to configure SAS and Teradata so that your applications run successfully with international characters.
Greg Otto, Teradata
Salman Maher, SAS
Austin Swift, SAS
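A small sanity check in the spirit of the paper above (the Teradata client and server charset settings are configured outside SAS and are not shown here): confirm that the SAS session itself is running UTF-8 before testing database round trips, and inspect how many bytes an international string actually occupies.

%put NOTE: session encoding is &sysencoding;   /* expect utf-8 */

data _null_;
   name = 'Müller Straße';
   len_bytes = length(name);      /* bytes      */
   len_chars = klength(name);     /* characters */
   put name= len_bytes= len_chars=;
run;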
Data is a valuable corporate asset that, when managed improperly, can detract from a company's ability to achieve strategic goals. At 1-800-Flowers.com, Inc. (18F), we have embarked on a journey toward data governance by embracing Master Data Management (MDM). Along the path, we have recognized that in order to protect and increase the value of our data, we must take data quality into consideration in all aspects of data movement in the organization. This presentation discusses the ways the team at 18F is leveraging SAS® Data Management to create and enhance our data quality strategy and ensure data quality for MDM.
Brian Smith, 1800Flowers.com
The latest releases of SAS® Data Management software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud data sources, RDBMS, files, unstructured data, and streaming, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS