Data Integration / Data Federation Papers A-Z

A
Session 2260-2016:
A Practical Introduction to SAS® Data Integration Studio
A useful and often overlooked tool released with the SAS® Business Intelligence suite to aid in ETL is SAS® Data Integration Studio. This product gives users the ability to extract, transform, join, and load data from various database management systems (DBMSs), data marts, and other data stores by using a graphical interface, without having to code different credentials for each schema. It enables seamless promotion of code to a production system without the need to alter the code, and it is quite useful for deploying and scheduling jobs through the schedule manager in SAS® Management Console, because all code created by Data Integration Studio is optimized. Although the tool enables users to create code from scratch, one of its most useful capabilities is that it can take legacy SAS® code and, with minimal alteration, generate the code's data associations and give it all the properties of a job coded from scratch.
Read the paper (PDF)
Erik Larsen, Independent Consultant
Session SAS3960-2016:
An Insider's Guide to SAS/ACCESS® Interface to Impala
Impala is an open source SQL engine designed to bring real-time, concurrent, ad hoc query capability to Hadoop. SAS/ACCESS® Interface to Impala allows SAS® to take advantage of this exciting technology. This presentation uses examples to show you how to increase your program's performance and troubleshoot problems. We discuss how Impala fits into the Hadoop ecosystem and how it differs from Hive. Learn the differences between the Hadoop and Impala SAS/ACCESS engines.
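As a taste of the two access styles the paper compares, here is a minimal sketch; the host name, credentials, and table are hypothetical.

    /* LIBNAME access: SAS generates SQL and pushes it down where possible */
    libname imp impala server="impala-host" schema=default
            user=sasdemo password=XXXXXXXX;

    proc sql;
       create table work.product_counts as
       select product_id, count(*) as n
       from imp.sales
       group by product_id;
    quit;

    /* Explicit pass-through: the inner query runs entirely in Impala */
    proc sql;
       connect to impala (server="impala-host" user=sasdemo password=XXXXXXXX);
       create table work.product_counts as
       select * from connection to impala
          (select product_id, count(*) as n
           from sales group by product_id);
       disconnect from impala;
    quit;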
Read the paper (PDF)
Jeff Bailey, SAS
Session 11769-2016:
Analytics and Data Management in a Box: Dramatically Increase Performance
Organizations are collecting more structured and unstructured data than ever before, and that data presents both great opportunities and great challenges for analysis. In this volatile and competitive economy, there has never been a bigger need for proactive and agile strategies that overcome these challenges by applying the analytics directly to the data rather than shuffling data around. Teradata and SAS® have joined forces to revolutionize your business by providing enterprise analytics in a unified data management platform, delivering critical strategic insights by applying advanced analytics inside the database or data warehouse where the vast volume of data is fed and resides. This paper highlights how Teradata and SAS strategically integrate in-memory processing, in-database processing, and the Teradata relational database management system (RDBMS) in a single box, giving IT departments a simplified footprint and reporting infrastructure as well as lower total cost of ownership (TCO), while offering SAS users increased analytic performance by eliminating the shuffling of data over external network connections.
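To give a flavor of the in-database approach described above, here is a minimal sketch; the server, credentials, and table are hypothetical, and in-database translation of eligible Base SAS procedures is governed by the SQLGENERATION system option.

    /* Ask SAS to translate eligible procedures into DBMS-side SQL */
    options sqlgeneration=dbms;

    libname td teradata server="tdprod" database=analytics
            user=sasdemo password=XXXXXXXX;

    /* The summarization executes inside Teradata; only results return to SAS */
    proc means data=td.transactions mean sum;
       class region;
       var amount;
    run;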
Read the paper (PDF)
Greg Otto, Teradata Corporation
Tho Nguyen, Teradata
Paul Segal, Teradata
Session 10662-2016:
Avoid Change Control by Using Control Tables
Developers working on a production process need to think carefully about ways to avoid future changes that require change control, so it's always important to make the code dynamic rather than hardcoding items into the code. Even if you are a seasoned programmer, the hardcoded items might not always be apparent. This paper assists in identifying the harder-to-reach hardcoded items and addresses ways to effectively use control tables within the SAS® software tools to deal with sticky areas of coding such as formats, parameters, grouping/hierarchies, and standardization. The paper presents examples of several ways to use the control tables and demonstrates why this usage prevents the need for coding changes. Practical applications are used to illustrate these examples.
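The core pattern can be sketched in a few lines: a control table feeds macro variables, so the value changes in data rather than in code. The control table CTRL.PARAMETERS and its columns are hypothetical.

    /* Control row: param_name='CUTOFF_DATE', param_value='31JAN2016' */
    proc sql noprint;
       select param_value into :cutoff_date trimmed
       from ctrl.parameters
       where param_name = 'CUTOFF_DATE';
    quit;

    /* New cutoff dates arrive as table updates; the code never changes */
    data work.current;
       set dw.transactions;
       where trans_date <= "&cutoff_date"d;
    run;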
Read the paper (PDF) | Watch the recording
Frank Ferriola, Financial Risk Group
E
Session 4140-2016:
Event Stream Processing for Enterprise Data Exploitation
With the big data throughput generated by event streams, organizations can respond to opportunities with low latency. Propagating identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the innovative ability to consolidate real-time data ingestion with controlled, disciplined, universal data access from SAS® and Teradata.
Read the paper (PDF)
Tho Nguyen, Teradata
Fiona McNeill, SAS
F
Session 8520-2016:
From SAS® Data Management to Big Data Appliances: How SAS/ACCESS® Makes Life Easier
In the fast-changing world of data management, new technologies for handling increasing volumes of (big) data emerge every day. Many companies struggle to adopt these technologies and, more importantly, to integrate them into their existing data management processes. They also want to keep relying on the knowledge their teams have built with existing products, so implementing change to learn new technologies can be a costly undertaking. SAS® is a perfect fit in this situation, offering a suite of software that can be set up to work with any third-party database through the corresponding SAS/ACCESS® interface. Indeed, for each new database technology, SAS releases a specific SAS/ACCESS interface, allowing users to develop and migrate SAS solutions almost transparently. Your users need to know only a few techniques in order to combine the power of SAS with a third-party (big) database. This paper helps companies integrate emerging technologies into their current SAS® Data Management platform using SAS/ACCESS. More specifically, the paper focuses on best practices for success in such an implementation, using the integration of SAS Data Management with the IBM PureData for Analytics appliance as an example.
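As an illustration of how uniform the SAS/ACCESS pattern is, the following minimal sketch connects to an IBM PureData for Analytics (Netezza) appliance; the host, database, and credentials are hypothetical.

    libname nz netezza server="puredata-host" database=edw
            user=sasdemo password=XXXXXXXX;

    /* The libref behaves like any other SAS library */
    data nz.customer_dim;
       set work.customer_dim;
    run;

    proc sql;
       select count(*) as nrows from nz.customer_dim;
    quit;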
Read the paper (PDF) | View the e-poster or slides (PDF)
Magali Thesias, Deloitte
H
Session SAS2180-2016:
How to Leverage the Hadoop Distributed File System as a Storage Format for the SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine
The SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine are data formats from SAS® that support the creation of analytical base tables with tens of thousands of columns. These analytical base tables are used to support daily predictive analytical routines. Traditionally, storage area network (SAN) storage has been, and continues to be, the primary storage platform for the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine formats. Because of the cost constraints associated with SAN storage, companies have added Hadoop to their environments to help minimize storage costs. In this paper, we explore how the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine leverage the Hadoop Distributed File System.
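A minimal sketch of an SPD Engine library stored in HDFS follows; the HDFS path is hypothetical, and HDFSHOST=DEFAULT assumes the SAS session is already configured with the Hadoop cluster settings.

    /* SPD Engine library whose data partitions live in HDFS */
    libname spdehdfs spde '/user/sas/spde_data' hdfshost=default;

    data spdehdfs.analytic_base;     /* wide analytical base table */
       set work.analytic_base;
    run;

    proc contents data=spdehdfs.analytic_base;
    run;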
Read the paper (PDF)
Steven Sober, SAS
I
Session 9320-2016:
Importing Data Directly from PDF into SAS® Data Sets
In the pharmaceutical industry, generating summary tables or listings in PDF format is becoming increasingly popular. You can create various PDF output files with different styles by combining the ODS PDF statement and the REPORT procedure in SAS® software. In order to validate those outputs, you need to import the PDF summary tables or listings into SAS data sets. This paper presents an overview of a utility that reads in an uncompressed PDF file that is generated by SAS, extracts the data, and converts it to SAS data sets.
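The utility itself is more elaborate, but the enabling idea can be sketched: write the PDF uncompressed so its text-showing operators remain readable, then scan the raw file. The file names here are hypothetical, and real parsing must also handle the PDF positioning operators.

    /* Create an uncompressed PDF so the content streams stay readable */
    ods pdf file='summary.pdf' compress=0;
    proc report data=sashelp.class;
    run;
    ods pdf close;

    /* Read the raw file and keep lines containing Tj (show-text) operators */
    data work.pdf_text;
       infile 'summary.pdf' lrecl=32767 truncover;
       input line $char32767.;
       if index(line, 'Tj') > 0;
    run;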
Read the paper (PDF)
William Wu, Herodata LLC
Yun Guan, Theravance Biopharm
Session SAS5200-2016:
It's raining data! Harnessing the Cloud with Amazon Redshift and SAS/ACCESS® Software
Every day, companies all over the world are moving their data into the cloud. While there are many options available, much of this data will wind up in Amazon Redshift. As a SAS® user, you are probably wondering, 'What is the best way to access this data using SAS?' This paper discusses the many ways that you can use SAS/ACCESS® software to get to Amazon Redshift. We compare and contrast the various approaches and help you decide which is best for you. Topics include building a connection, moving data into Amazon Redshift, and applying performance best practices.
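A minimal sketch of the two connection styles the paper weighs; the cluster endpoint, database, and credentials are hypothetical.

    libname rs redshift
            server="mycluster.abc123.us-east-1.redshift.amazonaws.com"
            port=5439 database=dev user=sasdemo password=XXXXXXXX;

    /* Move data into Amazon Redshift */
    data rs.sales;
       set work.sales;
    run;

    /* Explicit pass-through keeps the whole query inside Redshift */
    proc sql;
       connect using rs;
       select * from connection to rs
          (select region, sum(amount) as total
           from sales group by region);
       disconnect from rs;
    quit;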
Read the paper (PDF)
Chris DeHart, SAS
Jeff Bailey, SAS
M
Session 10863-2016:
Migrating from SAS® 9.3 to SAS® 9.4: SAS Communicating with Microsoft Office Products
Microsoft Office products play an important role in most enterprises, and SAS® is often combined with Microsoft Office to assist in everyday decision making. Most SAS users have moved from SAS® 9.3, or 32-bit SAS, to SAS® 9.4 for the exciting new features. This paper describes a few things that do not work quite the same in SAS 9.4 as they did in the 32-bit version, discusses the reasons for the differences, and presents the new approaches that SAS 9.4 provides. The paper focuses on how SAS 9.4 works with Excel and Access, summarizing methods that use LIBNAME engines as well as the IMPORT and EXPORT procedures, and demonstrating how to interactively import or export PC files with SAS 9.4. The issue of incompatible format catalogs is addressed as well.
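One difference the paper covers can be sketched quickly: under 64-bit SAS 9.4, DBMS=XLSX reads and writes Excel workbooks without the 32-bit Microsoft data providers that DBMS=EXCEL depended on. The file and sheet names here are hypothetical.

    /* No Microsoft ACE/Jet bitness dependency with the XLSX engine */
    proc import datafile='C:\reports\budget.xlsx'
         out=work.budget dbms=xlsx replace;
       sheet='FY2016';
    run;

    proc export data=work.budget
         outfile='C:\reports\budget_reviewed.xlsx'
         dbms=xlsx replace;
    run;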
Read the paper (PDF)
Hong Zhang, Mathematica Policy Research
Session SAS6431-2016:
Modernizing Data Management with Event Streams
Specialized access requirements to tap into event streams vary depending on the source of the events. Open-source approaches from Spark, Kafka, Storm, and others can connect event streams to big data lakes, like Hadoop and other common data management repositories. But a different approach is needed to ensure that latency and throughput are not adversely affected when streaming data is processed with different open-source frameworks; in other words, these pipelines need to scale. This talk distinguishes the advantages of adapters and connectors and shows how SAS® can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.
Read the paper (PDF)
Evan Guarnaccia, SAS
Fiona McNeill, SAS
Steve Sparano, SAS
N
Session SAS5562-2016:
New for SAS® 9.4: Integrating Third-Party Metadata for Use with Lineage and Impact Analysis
Customers are under increasing pressure to determine how data, reports, and other assets are interconnected, and to assess the implications of making changes to these objects. Of special interest is performing lineage and impact analysis on data from disparate systems, for example SAS®, IBM InfoSphere DataStage, Informatica PowerCenter, SAP BusinessObjects, and Cognos. SAS provides utilities to identify relationships between assets that exist within the SAS system, and beginning with the third maintenance release of SAS® 9.4, these utilities support third-party products. This paper provides step-by-step instructions, tips, and techniques for using these utilities to collect metadata from these disparate systems and to perform lineage and impact analysis.
Read the paper (PDF)
Liz McIntosh, SAS
P
Session 11206-2016:
Performance Issues and Solutions: SAS® with Hadoop
Today many analysts in information technology face challenges working with large amounts of data. Most analysts are smart enough to write queries using correct table join conditions, text mining, index keys, and hash objects for quick data retrieval, and the SAS® system can now work with Hadoop data storage to bring efficient processing power to SAS analytics. In this paper we demonstrate differences in data retrieval between the SQL procedure, the DATA step, and the Hadoop environment. To test data retrieval time, we randomly generate ten million observations with character and numeric variables using the RANUNI function: DO LOOP=1 TO 1E7 produces ten million records, and the code can generate any number of records by changing the exponent. We use the most commonly used functions and procedures to retrieve records from a test data set and report both real and CPU processing time. All PROC SQL and Hadoop queries were run in SAS® Enterprise Guide® 6.1 to record processing time. This paper includes an overview of Hadoop data architecture and describes how to connect to a Hadoop environment through SAS. It provides sample queries for data retrieval, explains the differences between Hadoop explicit pass-through and the Hadoop LIBNAME engine, and covers the main challenges in comparing real and CPU time across the most commonly used SAS functions and procedures.
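A minimal version of the generation step described above might look like this; the variable names are illustrative.

    /* Ten million observations with one numeric and one character variable */
    data work.testdata;
       do loop = 1 to 1e7;          /* change the exponent for other sizes */
          num_var  = ranuni(12345);
          char_var = put(ceil(num_var * 100), z3.);
          output;
       end;
    run;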
Read the paper (PDF)
Anjan Matlapudi, AmeriHealth Caritas Family of Companies
Session 9660-2016:
Performing Efficient Wide-to-Long Transposes on Teradata Tables Using SAS® Explicit Pass-Through
SAS® provides in-database processing technology in the SQL procedure, whose explicit pass-through method can push some or all of the work to a database management system (DBMS). This paper focuses on using SAS SQL explicit pass-through to transform Teradata table columns into rows. There are two common approaches. The first is to create narrow tables, one for each column that requires transposition, and then use UNION or UNION ALL to append them together. This approach is straightforward but can be quite cumbersome, especially when a large number of columns need to be transposed. The second is to use the Teradata TD_UNPIVOT function, which makes wide-to-long table transposition an easy job; however, TD_UNPIVOT can transpose only columns that share the same data type. This paper presents a SAS macro solution to wide-to-long transposition involving different column data types, with several examples illustrating its usage. It complements the author's SAS paper 'Performing Efficient Transposes on Large Teradata Tables Using SQL Explicit Pass-Through,' in which the corresponding long-to-wide transposition method is discussed. SAS programmers who work with data stored in an external DBMS and would like to transpose their data efficiently will benefit from this paper.
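For orientation, here is a minimal explicit pass-through sketch built on TD_UNPIVOT; the server, database, table, and column names are hypothetical, and the paper's macro generalizes this to columns of mixed data types.

    proc sql;
       connect to teradata (server=tdprod user=sasdemo password=XXXXXXXX);
       create table work.sales_long as
       select * from connection to teradata
          (select id, quarter, amount
           from td_unpivot(
                   on (select id, q1, q2, q3, q4 from mydb.wide_sales)
                   using
                      value_columns('amount')
                      unpivot_column('quarter')
                      column_list('q1', 'q2', 'q3', 'q4')
                ) as t);
       disconnect from teradata;
    quit;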
Read the paper (PDF) | Watch the recording
Tao Cheng, Accenture
Session 7560-2016:
Processing CDC and SCD Type 2 for Sources without CDC: A Hybrid Approach
In a data warehousing system, change data capture (CDC) plays an important part, not just in making the data warehouse (DWH) aware of a change but also in providing a means of flowing the change to the DWH marts and reporting tables, so that we see the current and latest version of the truth. Together with slowly changing dimensions (SCD), this creates a cycle that runs the DWH and provides valuable insight into history for future decision making. But what if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth. If these two processes can be combined into a single process, where one transform does both jobs of identifying the change and applying it to the DWH, then we can save significant processing time and valuable system resources. Hence, I came up with a hybrid SCD-with-CDC approach. My paper focuses on sources that do NOT provide CDC and that need SCD Type 2 performed on such records without data duplication or increased processing time.
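The general shape of such a combined change-detection step (not the author's exact transform; the tables, columns, and hash convention are hypothetical) can be sketched with a row-hash comparison followed by the SCD Type 2 expiry:

    /* Detect new and changed rows against the current dimension records */
    proc sql;
       create table work.delta as
       select s.*
       from staging.customer s
            left join dwh.dim_customer d
              on s.customer_id = d.customer_id and d.current_flag = 'Y'
       where d.customer_id is null                      /* new row     */
          or md5(catx('|', s.name, s.segment, s.region))
             ne d.row_hash;                             /* changed row */
    quit;

    /* SCD Type 2: expire the superseded versions before inserting new ones */
    proc sql;
       update dwh.dim_customer
          set current_flag = 'N', valid_to = today()
        where current_flag = 'Y'
          and customer_id in (select customer_id from work.delta);
    quit;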
Read the paper (PDF) | Watch the recording
Vishant Bhat, University of Newcastle
Tony Blanch, SAS Consultant
R
Session 1660-2016:
Reading JSON in SAS® Using Groovy
JavaScript Object Notation (JSON) has quickly become the de facto standard for data transfer on the Internet, due to an increase in both web data and the use of full-stack JavaScript. JSON has become dominant in the emerging technologies of the web today, such as the Internet of Things and the mobile cloud, and it offers a light and flexible format that can be processed directly from JavaScript without an external parser. The SAS® JSON procedure, however, can write JSON but cannot read it. The addition of the GROOVY procedure in SAS® 9.3 allows Java code to be executed from within SAS, so JSON data can be read into a SAS data set through XML conversion. This paper demonstrates the method for parsing JSON into data sets with Groovy and the XML LIBNAME engine, all within Base SAS®.
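A compact sketch of the round trip follows; the jar location and file names are hypothetical, and the JSON is assumed to be a flat array of records (the paper's version is more general).

    proc groovy classpath='groovy-all-2.4.5.jar';
       submit;
          import groovy.json.JsonSlurper
          import groovy.xml.MarkupBuilder

          // Parse the JSON array, then rewrite it as row-oriented XML
          def records = new JsonSlurper().parse(new File('in.json'))
          new File('out.xml').withWriter { w ->
             new MarkupBuilder(w).TABLE {
                records.each { rec ->
                   RECORD { rec.each { k, v -> "$k"(v) } }
                }
             }
          }
       endsubmit;
    quit;

    /* The generic XML LIBNAME engine reads the RECORD elements as rows */
    libname injson xml 'out.xml';
    data work.from_json;
       set injson.record;
    run;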
Read the paper (PDF) | Download the data file (ZIP)
John Kennedy, Mesa Digital
S
Session 12700-2016:
SAS and EMC Deliver Analytics Modernization
Analyzing massive amounts of big data quickly to get at answers that increase agility (an organization's ability to sense change and respond) and drive time-sensitive business decisions is a competitive differentiator in today's market. Customers gain from EMC's deep understanding of storage architecture combined with SAS's deep expertise in analytics. This session provides an overview of how your mixed analytics SAS® workloads can be transformed on EMC XtremIO, DSSD, and Pivotal solutions. Whether you are working with one of the three primary SAS file systems or SAS® Grid, performance and capacity scale linearly, application latency drops significantly, and the complexity of storage tuning is removed from the equation. SAS and EMC have partnered to meet customer challenges and deliver a modern analytic architecture: a unified approach encompassing big data management, analytics discovery, and deployment via end-to-end solutions that solve your big data problems. Learn about best practices from fellow customers and how they deliver new levels of SAS business value.
Read the paper (PDF)
T
Session SAS2560-2016:
Ten Tips to Unlock the Power of Hadoop with SAS®
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®. Topics include recommendations gleaned from actual deployments across a variety of implementations and distributions: techniques for improving performance, tips for working with complex Hadoop technologies such as Kerberos, ways to improve efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
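For context, the starting point most of the tips build on is a Hive connection through the SAS/ACCESS Hadoop engine; a minimal sketch (host, port, and credentials hypothetical, Kerberos set aside) looks like this:

    libname hdp hadoop server='hive-node.example.com' port=10000
            schema=default user=sasdemo password=XXXXXXXX;

    /* Aggregate inside Hadoop instead of pulling the detail rows back */
    proc sql;
       create table work.daily_counts as
       select event_date, count(*) as n
       from hdp.weblogs
       group by event_date;
    quit;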
Read the paper (PDF)
Nancy Rausch, SAS
Wilbram Hazejager, SAS
U
Session SAS5244-2016:
Unleashing High-Performance Risk Data with the Hadoop Custom File Reader
SAS® High-Performance Risk distributes financial risk data and big data portfolios with complex analyses across a networked Hadoop Distributed File System (HDFS) grid to support rapid in-memory queries for hundreds of simultaneous users. This data is extremely complex and must be stored in a proprietary format to guarantee data affinity for rapid access. However, customers still desire the ability to view and process this data directly. This paper demonstrates how to use the HPRISK custom file reader to directly access risk data in Hadoop MapReduce jobs, using the HPDS2 procedure and the LASR procedure.
Read the paper (PDF) | Download the data file (ZIP)
Mike Whitcher, SAS
Stacey Christian, SAS
Phil Hanna, SAS
Don McAlister, SAS
W
Session SAS2400-2016:
What's New in SAS® Data Management
The latest releases of SAS® Data Integration Studio, SAS® Data Management Studio, SAS® Data Integration Server, SAS® Data Governance, and SAS/ACCESS® software provide a comprehensive, integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud, RDBMS, files, unstructured data, and streaming, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Read the paper (PDF)
Nancy Rausch, SAS
Session SAS6223-2016:
What's New in SAS® Federation Server 4.2
SAS® Federation Server is a scalable, threaded, multi-user data access server that provides seamlessly integrated data from various data sources. When your data becomes too large to move or copy and too sensitive to allow direct access, its powerful set of data virtualization capabilities allows you to effectively and efficiently manage and manipulate data from many sources without moving or copying it. This agile data integration framework gives business users the ability to connect to more data, reduce risk, respond faster, and make better business decisions. For technical users, the framework provides central data control and security, reduces complexity, optimizes commodity hardware, promotes rapid prototyping, and increases staff productivity. This paper provides an overview of the latest features of the product and includes use cases and examples for leveraging product capabilities.
Read the paper (PDF)
Tatyana Petrova, SAS
Session 9960-2016:
Working with Big Data in Near Real Time Using SAS® Event Stream Processing
In the world of big data, real-time processing and event stream processing are becoming the norm, yet few tools available today can do this type of processing. SAS® Event Stream Processing aims to fill that gap. In this paper, we use SAS Event Stream Processing to read multiple data sets stored in big data platforms such as Hadoop and Cassandra in real time and to perform transformations on the data, such as joining data sets, filtering data based on preset business rules, and creating new variables as required. We also look at how to score the data with a machine learning algorithm. The paper shows how to use the provided Hadoop Distributed File System (HDFS) publisher and subscriber to read data from and push data to Hadoop, discusses the HDFS adapter in detail, and uses Streamviewer to watch data flow through SAS Event Stream Processing.
View the e-poster or slides (PDF)
Krishna Sai Kishore Konudula, Kavi Associates