This presentation illustrates techniques and technology for managing complex, large-scale ETL workloads using SAS® and IBM Platform LSF. A rules-engine approach that parses pending workloads and current activity provides greater control of flow triggering and supports more complex triggering logic. Key techniques, configuration steps, and design and development patterns are demonstrated, and the pros and cons of the techniques are discussed. Benefits include increased throughput, reduced support effort, and the ability to support more sophisticated inter-flow dependencies and conflicts.
Angus Looney, Capgemini
Seven principles frequently manifest themselves in data management projects. These principles improve two aspects: the development process, which enables the programmer to deliver better code faster with fewer issues; and the actual performance of programs and jobs. Using examples from Base SAS®, SAS® Macro Language, DataFlux®, and SQL, this paper presents the seven principles: environments, job control data, test data, improvement, data locality, minimal passes, and indexing. Readers with an intermediate knowledge of SAS DataFlux Management Platform and/or Base SAS and SAS Macro Language as well as SQL will understand these principles and find ways to use them in their data management projects.
Bob Janka, Modern Analytics
Developers working on a production process need to think carefully about ways to avoid future changes that require change control, so it's always important to make the code dynamic rather than hardcoding items into the code. Even if you are a seasoned programmer, the hardcoded items might not always be apparent. This paper assists in identifying the harder-to-reach hardcoded items and addresses ways to effectively use control tables within the SAS® software tools to deal with sticky areas of coding such as formats, parameters, grouping/hierarchies, and standardization. The paper presents examples of several ways to use the control tables and demonstrates why this usage prevents the need for coding changes. Practical applications are used to illustrate these examples.
Frank Ferriola, Financial Risk Group
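As a minimal illustration of the control-table idea in the abstract above, the following Python sketch (not SAS, and not taken from the paper) keeps grouping rules in a table rather than in the program logic. The table contents, the column names, and the `load_control_table` and `assign_group` helpers are all hypothetical.

```python
import csv
import io

# Hypothetical control table. In practice this would live in a maintained
# data set or database table, not in the program source.
CONTROL_TABLE_CSV = """low,high,risk_group
0,579,High risk
580,669,Medium risk
670,850,Low risk
"""


def load_control_table(text):
    """Parse the control table into a list of (low, high, label) rules."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(r["low"]), int(r["high"]), r["risk_group"]) for r in reader]


def assign_group(score, rules, default="Unclassified"):
    """Return the label of the first rule whose range contains score."""
    for low, high, label in rules:
        if low <= score <= high:
            return label
    return default


rules = load_control_table(CONTROL_TABLE_CSV)
scores = [550, 600, 700, 900]
groups = [assign_group(s, rules) for s in scores]
print(groups)
```

Changing a boundary or adding a group then means editing a row in the table, and the code itself stays untouched.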
As Data Management professionals, you have to comply with new regulations and controls. One such regulation is Basel Committee on Banking Supervision (BCBS) 239. To respond to these new demands, you have to put processes and methods in place to automate metadata collection and analysis, and to provide rigorous documentation around your data flows. You also have to deal with many aspects of data management including data access, data manipulation (ETL and other), data quality, data usage, and data consumption, often from a variety of toolsets that are not necessarily from a single vendor. This paper shows you how to use SAS® technologies to support data governance requirements, including third-party metadata collection and data monitoring. It highlights best practices such as implementing a business glossary and establishing controls for monitoring data. Attend this session to become familiar with the SAS tools used to meet the new requirements and to implement a more managed environment.
Jeff Stander, SAS
With the high data throughput generated by event streams, organizations can respond opportunistically and with low latency. Propagating identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the ability to consolidate real-time data ingestion with controlled and disciplined universal data access from SAS® and Teradata.
Tho Nguyen, Teradata
Fiona McNeill, SAS
With millions of users and peak traffic of thousands of requests a second for complex user-specific data, fantasy football offers many data design challenges. Not only is there a high volume of data transfers, but the data is also dynamic and of diverse types. We need to process data originating on the stadium playing field and user devices and make it available to a variety of different services. The system must be nimble and must produce accurate and timely responses. This talk discusses the strategies employed by and lessons learned from one of the primary architects of the National Football League's fantasy football system. We explore general data design considerations with specific examples of high availability, data integrity, system performance, and some other random buzzwords. We review some of the common pitfalls facing large-scale databases and the systems using them. And we cover some of the tips and best practices to take your data-driven applications from fantasy to reality.
Clint Carpenter, Carpenter Programming
SAS® programmers spend hours developing DATA step code to clean up their data. Save time and money and enhance your data quality results by leveraging the power of the SAS Quality Knowledge Base (QKB). Knowledge Engineers at SAS have developed advanced context-specific data quality operations and packaged them in callable objects in the QKB. In this session, a Senior Software Development Manager at SAS explains the QKB and shows how to invoke QKB operations in a DATA step and in various SAS® Data Management offerings.
Brian Rineer, SAS
SAS® Data Management is not a dating application. However, as a data analyst, you do strive to find the best matches for your data. Similar to a dating application, when trying to find matches for your data, you need to specify the criteria that constitute a suitable match. You want to strike a balance between being too stringent with your criteria, and so under-matching your data, and being too loose with your criteria, and so over-matching your data. This paper highlights various SAS Data Management matching techniques that you can use to strike the right balance and help your data find its perfect match, and as a result improve your data for reporting and analytics purposes.
Mary Kathryn Queen, SAS
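The trade-off between under-matching and over-matching described in the abstract above can be illustrated with a small Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; this is an assumption for illustration only and is not one of the SAS Data Management matching techniques the paper covers. The sample names and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample values, for illustration only.
names = ["Jon Smith", "John Smith", "Jane Smith", "J. Smythe", "Maria Lopez"]


def similarity(a, b):
    """Similarity in [0, 1] between two case-folded strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_pairs(values, threshold):
    """Return all pairs whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if similarity(a, b) >= threshold
    ]


strict = match_pairs(names, 1.0)     # too stringent: only identical strings
loose = match_pairs(names, 0.0)      # too loose: every pair "matches"
balanced = match_pairs(names, 0.85)  # a hypothetical middle ground

print(len(strict), len(balanced), len(loose))
```

Raising the threshold can only remove pairs and lowering it can only add them, so tuning it moves the result between the two failure modes.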
Working with big data is often time consuming and challenging. The primary goal in programming is to maximize throughput while minimizing the use of computer processing time, real time, and programmers' time. By using the Multiprocessing (MP) CONNECT method on a symmetric multiprocessing (SMP) computer, a programmer can divide a job into independent tasks and execute the tasks as threads in parallel on several processors. This paper demonstrates the development and application of a parallel processing program on a large amount of health-care data.
Shuhua Liang, Kaiser Permanente
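The divide, run concurrently, and combine pattern described in the abstract above can be sketched in Python as an analogy. This is not MP CONNECT and not the paper's program: the records, the `split` and `summarize` helpers, and the choice of four workers are hypothetical. Note also that Python threads do not run CPU-bound work on several processors at once, so a real CPU-bound workload would likely use separate processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (member_id, cost). Illustrative values only.
records = [(i, float(i % 7)) for i in range(1, 1001)]


def split(data, n_tasks):
    """Divide data into n_tasks roughly equal, independent slices."""
    return [data[i::n_tasks] for i in range(n_tasks)]


def summarize(chunk):
    """Independent task: total cost and row count for one slice."""
    return sum(cost for _, cost in chunk), len(chunk)


chunks = split(records, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each slice is submitted as its own task; none depends on another.
    partials = list(pool.map(summarize, chunks))

# Combine step: merge the partial results once every task has finished.
total_cost = sum(t for t, _ in partials)
total_rows = sum(n for _, n in partials)
print(total_cost, total_rows)
```

The pattern only pays off when the tasks share no intermediate state, which is why the slices are built before any task starts.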
We spend so much time talking about GRID environments, distributed jobs, and huge data volumes that we ignore the thousands of relatively tiny programs scheduled to run every night, which produce only a few small data sets, but make all the difference to the users who depend on them. Individually, these jobs might place a negligible load on the system, but by their sheer number they can often account for a substantial share of overall resources, sometimes even impacting the performance of the bigger, more important jobs. SAS® programs, by their varied nature, use available resources in a varied way. Depending on whether a SAS® procedure is CPU-, disk-, or memory-intensive, chunks of memory or CPU can sometimes remain unused for quite a while. Bigger jobs leave bigger chunks, and this is where being small and able to effectively exploit those leftovers can be a great advantage. We call our main optimization technique the Fast Lane, which is a queue configuration dedicated to jobs with consistently small SASWORK directories that, when available, lets them use unused RAM in place of their SASWORK disk. The approach improves overall CPU saturation considerably while taking load off the I/O subsystem, and without fail results in improved runtimes for big jobs and small jobs alike, without requiring any changes to deployed code. This paper explores the practical aspects of implementing the Fast Lane in your environment: time-series profiling of the overnight workload to find available resource gaps, identifying candidate jobs to fill those gaps, scheduling and queue triggers, application server configuration changes, and the monitoring and control techniques that keep everything running smoothly. And, of course, it covers the very real, tangible gains in performance that Fast Lane has delivered in real-world scenarios.
Nikola Markovic, Boemska Ltd.
Greg Nelson, ThotWave
Specialized access requirements to tap into event streams vary depending on the source of the events. Open-source approaches from Spark, Kafka, Storm, and others can connect event streams to big data lakes, like Hadoop and other common data management repositories. But a different approach is needed to ensure latency and throughput are not adversely affected when processing streaming data with different open-source frameworks, that is, they need to scale. This talk distinguishes the advantages of adapters and connectors and shows how SAS® can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.
Evan Guarnaccia, SAS
Fiona McNeill, SAS
Steve Sparano, SAS
Customers are under increasing pressure to determine how data, reports, and other assets are interconnected, and to determine the implications of making changes to these objects. Of special interest is performing lineage and impact analysis on data from disparate systems, for example, SAS®, IBM InfoSphere DataStage, Informatica PowerCenter, SAP BusinessObjects, and Cognos. SAS provides utilities to identify relationships between assets that exist within the SAS system, and beginning with the third maintenance release of SAS® 9.4, these utilities support third-party products. This paper provides step-by-step instructions, tips, and techniques for using these utilities to collect metadata from these disparate systems, and to perform lineage and impact analysis.
Liz McIntosh, SAS
In a data warehousing system, change data capture (CDC) plays an important part not just in making the data warehouse (DWH) aware of the change but also in providing a means of flowing the change to the DWH marts and reporting tables so that we see the current and latest version of the truth. Together with slowly changing dimensions (SCD), this creates a cycle that runs the DWH and provides valuable insights into history and for future decision making. What if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth. If these two processes can be combined into a single process in which one transform both identifies the change and applies it to the DWH, then we can save significant processing time and valuable system resources. Hence, I came up with a hybrid SCD with CDC approach. My paper focuses on sources that do not have CDC and need SCD Type 2 performed on their records without worrying about data duplication or increased processing times.
Vishant Bhat, University of Newcastle
Tony Blanch, SAS Consultant
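To make the idea of combining change detection and SCD Type 2 loading in one pass more concrete, here is a minimal Python sketch. It detects changes by comparing a digest of the tracked attributes, which is one common way to cope with a source that has no CDC; the abstract does not state that the paper uses this method, so treat it as an assumption. The column names (`valid_from`, `valid_to`, `digest`), the `HIGH_DATE` convention, and the sample rows are hypothetical, and deleted source rows are not handled.

```python
import hashlib
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for the open, current version


def row_digest(row, tracked):
    """Digest of the tracked attributes, used to detect a change."""
    payload = "|".join(str(row[c]) for c in tracked)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_scd2(dimension, source_rows, key, tracked, load_date):
    """Single pass: find new or changed rows and apply SCD Type 2.

    dimension is a list of dicts and is modified in place.
    """
    current = {r[key]: r for r in dimension if r["valid_to"] == HIGH_DATE}
    for src in source_rows:
        digest = row_digest(src, tracked)
        existing = current.get(src[key])
        if existing is not None and existing["digest"] == digest:
            continue  # unchanged: write nothing, so no duplicate versions
        if existing is not None:
            existing["valid_to"] = load_date  # close the old version
        dimension.append(
            dict(src, valid_from=load_date, valid_to=HIGH_DATE, digest=digest)
        )
    return dimension


dim = []
apply_scd2(dim, [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "York"}],
           "id", ["city"], date(2016, 1, 1))
apply_scd2(dim, [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "Hull"}],
           "id", ["city"], date(2016, 2, 1))
for r in dim:
    print(r["id"], r["city"], r["valid_from"], r["valid_to"])
```

In the second load only the changed record gains a new version; the unchanged record is skipped, which is what keeps reloading a full extract from duplicating history.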
With advances in technology, the world of sports is now offering rich data sets that are of interest to statisticians. This talk concerns some research problems in various sports that are based on large data sets. In baseball, PITCHf/x data is used to help quantify the quality of pitches. From this, questions about pitcher evaluation and effectiveness are addressed. In cricket, match commentaries are parsed to yield ball-by-ball data in order to assist in building a match simulator. The simulator can then be used to investigate optimal lineups, player evaluation, and the assessment of fielding.
Sixty percent of organizations will have Hadoop in production by 2016, per a recent TDWI survey, and it has become increasingly challenging to access, cleanse, and move these huge volumes of data. How can data scientists and business analysts clean up their data, manage it where it lives, and overcome the big data skills gap? It all comes down to accelerating the data preparation process. SAS® Data Loader leverages years of expertise in data quality and data integration, a simplified user interface, and the parallel processing power of Hadoop to overcome the skills gap and compress the time taken to prepare data for analytics or visualization. We cover some of the new capabilities of SAS Data Loader for Hadoop including in-memory Spark processing of data quality and master data management functions, faster profiling and unstructured data field extraction, and chaining multiple transforms together for improved productivity.
Matthew Magne, SAS
Scott Gidley, SAS
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®. The recommendations are gleaned from actual deployments across a variety of implementations and distributions. Topics include tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Nancy Rausch, SAS
Wilbram Hazejager, SAS
Over the years, complex medical data has been captured and retained in a variety of legacy platforms that are characterized by special formats, hierarchical/network relationships, and relational databases. Due to the complex nature of the data structures and the capacity constraints associated with outdated technologies, high-value healthcare data locked in legacy systems has been restricted in its use for analytics. With the emergence of highly scalable, big data technologies such as Hadoop, now it's possible to move, transform, and enable previously locked legacy data economically. In addition, sourcing such data once on an enterprise data hub enables the ability to efficiently store, process, and analyze the data from multiple perspectives. This presentation illustrates how legacy data is not only unlocked but enabled for rapid visual analytics by using Cloudera's enterprise data hub for data transformation and the SAS® ecosystem.
Multivariate statistical analysis plays an increasingly important role as the number of variables being measured increases in educational research. In both cognitive and noncognitive assessments, many instruments that researchers aim to study contain a large number of variables, with each measured variable assigned to a specific factor of the bigger construct. In line with educational theory and empirical research, the factors of each instrument usually emerge in the same way. Two types of factor analysis are widely used to understand the latent relationships among these variables in different scenarios. (1) Exploratory factor analysis (EFA), which is performed by using the SAS® procedure PROC FACTOR, is an advanced statistical method used to probe deeply into the relationship among the variables and the larger construct and then develop a customized model for the specific assessment. (2) When a model is established, confirmatory factor analysis (CFA) is conducted by using the SAS procedure PROC CALIS to examine the model fit of specific data and then make adjustments to the model as needed. This paper presents the application of SAS to conduct these two types of factor analysis to fulfill various research purposes. Examples using real noncognitive assessment data are demonstrated, and the interpretation of the fit statistics is discussed.
Jun Xu, Educational Testing Service
Steven Holtzman, Educational Testing Service
Kevin Petway, Educational Testing Service
Lili Yao, Educational Testing Service
The latest releases of SAS® Data Integration Studio, SAS® Data Management Studio and SAS® Data Integration Server, SAS® Data Governance, and SAS/ACCESS® software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types including Hadoop, cloud, RDBMS, files, unstructured data, streaming, and others, and the ability to perform ETL and ELT transformations in diverse run-time environments including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features for enhancements to master data and support metadata management. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
SAS® Federation Server is a scalable, threaded, multi-user data access server providing seamlessly integrated data from various data sources. When your data becomes too large to move or copy and too sensitive to allow direct access, the powerful set of data virtualization capabilities allows you to effectively and efficiently manage and manipulate data from many sources, without moving or copying the data. This agile data integration framework gives business users the ability to connect to more data, reduce risks, respond faster, and make better business decisions. For technical users, the framework provides central data control and security, reduces complexity, optimizes commodity hardware, promotes rapid prototyping, and increases staff productivity. This paper provides an overview of the latest features of the product and includes use cases and examples for leveraging product capabilities.
Tatyana Petrova, SAS