In the clinical research world, data accuracy plays a significant role in delivering quality results. Various validation methods are available to confirm data accuracy. Of these, double programming is the most highly recommended and commonly used method to demonstrate a perfect match between production and validation output. PROC COMPARE is one of the SAS® procedures used to compare two data sets and confirm their accuracy. In current practice, whenever a program rerun happens, the programmer must manually review the output file to ensure an exact match. This is tedious, time-consuming, and error-prone because many LST files must be reviewed and manual intervention is required. The proposed approach programmatically validates the PROC COMPARE output from all the programs and generates an HTML file with a Pass/Fail status flag for each output file, along with the following: 1. Data set name, label, and number of variables and observations 2. Number of observations not in both compared and base data sets 3. Number of variables not in both compared and base data sets The status is flagged as Pass whenever the output file meets the following 3N criteria: 1. NOTE: No unequal values were found. All values compared are exactly equal 2. The numbers of observations in the base and compared data sets are equal 3. The numbers of variables in the base and compared data sets are equal The SAS® macro %threeN validates all the output files generated by PROC COMPARE quickly and accurately, reducing the time spent on manual review of PROC COMPARE output files by up to 90%.
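A rough illustration of the kind of check that %threeN automates (a minimal sketch, not the actual macro): route the PROC COMPARE listing to a file, then scan that LST file for the equality note. The file paths and data set names are hypothetical, and only the first of the three 3N criteria is tested here.

    proc printto print='/compare_out/adsl_compare.lst' new;
    run;

    proc compare base=prod.adsl compare=qc.adsl listall;
    run;

    proc printto;   /* restore the default listing destination */
    run;

    /* Flag the listing as Pass if the "exactly equal" note is present */
    data compare_status;
       infile '/compare_out/adsl_compare.lst' truncover end=eof;
       input line $char200.;
       retain pass 0;
       if index(line, 'No unequal values were found') then pass = 1;
       if eof then do;
          status = ifc(pass, 'Pass', 'Fail');
          output;
       end;
       keep status;
    run;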
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
Common tasks that we need to perform are merging or appending SAS® data sets. During this process, we sometimes get error or warning messages saying that the same fields in different SAS data sets have different lengths or different types. If the problems involve many fields and data sets, we need to spend a lot of time identifying those fields and writing extra SAS code to resolve the issues. The macro in this paper, however, can help you identify the fields that have inconsistent data types or lengths. It also resolves the length issues automatically by finding the maximum length of each field among the current data sets and assigning that length to the field. After the macro runs, it generates an HTML report that lists which fields' lengths have been changed and which fields have inconsistent data types.
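A sketch of the core idea behind such a macro (not the paper's macro itself): look up the maximum defined length of a shared character field in DICTIONARY.COLUMNS and apply it before the data sets are combined. The data set and variable names below are hypothetical.

    proc sql noprint;
       select max(length) into :maxlen trimmed
       from dictionary.columns
       where libname = 'WORK'
         and memname in ('DEMOG_A', 'DEMOG_B')
         and upcase(name) = 'SITE';
    quit;

    data work.combined;
       length site $ &maxlen;   /* longest definition wins, so nothing is truncated */
       set work.demog_a work.demog_b;
    run;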
Ting Sa, Cincinnati Children's Hospital Medical Center
A useful and often overlooked tool released with the SAS® Business Intelligence suite to aid in ETL is SAS® Data Integration Studio. This product gives users the ability to extract, transform, join, and load data from various database management systems (DBMSs), data marts, and other data stores by using a graphical interface and without having to code different credentials for each schema. It enables seamless promotion of code to a production system without the need to alter the code. And it is quite useful for deploying and scheduling jobs by using the schedule manager in SAS® Management Console, because all code created by Data Integration Studio is optimized. Although this tool enables users to create code from scratch, one of its most useful capabilities is that it can take legacy SAS® code and, with minimal alterations, create its data associations and give it all the properties of a job coded from scratch.
Erik Larsen, Independent Consultant
With increasing data needs, it becomes more and more unwieldy to ensure that all scheduled jobs are running successfully and on time. Worse, maintaining your reputation as an information provider becomes a precarious prospect as the likelihood increases that your customers alert you to reporting issues before you are even aware of them yourself. By combining various SAS® capabilities and tying them together with concise visualizations, it is possible to track jobs actively and alert customers of issues before they become a problem. This paper introduces a report tracking framework that helps achieve this goal and improve customer satisfaction. The report tracking starts by obtaining table and job statuses and color-coding them by severity level. Based on the job status, it then goes deeper into the log to search for potential errors or other relevant information to help assess the processes. The last step is to send proactive alerts to users informing them of possible delays or potential data issues.
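A minimal sketch of the log-scanning step described above, assuming the job logs are plain-text files in a known folder (the path and file name are hypothetical): it keeps only ERROR and WARNING lines and summarizes them by severity.

    data work.log_issues;
       length severity $ 7 message $ 300;
       infile '/batch/logs/daily_sales_load.log' truncover;
       input message $char300.;
       if upcase(message) =: 'ERROR' then severity = 'ERROR';
       else if upcase(message) =: 'WARNING' then severity = 'WARNING';
       else delete;
    run;

    proc freq data=work.log_issues;
       tables severity / nocum;
    run;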
Jia Heng, Wyndham Destination Network
The Waze application, purchased by Google in 2013, alerts millions of users about traffic congestion, collisions, construction, and other complexities of the road that can stymie motorists' attempts to get from A to B. From jackknifed rigs to jackalope carcasses, roads can be gnarled by gridlock or littered with obstacles that impede traffic flow and efficiency. Waze algorithms automatically reroute users to more efficient routes based on user-reported events as well as on historical norms that demonstrate typical road conditions. Extract, transform, load (ETL) infrastructures often represent serialized process flows that can mimic highways and that can become similarly snarled by locked data sets, slow processes, and other factors that introduce inefficiency. The LOCKITDOWN SAS® macro, introduced at the Western Users of SAS® Software Conference 2014, detects and prevents data access collisions that occur when two or more SAS processes or users simultaneously attempt to access the same SAS data set. Moreover, the LOCKANDTRACK macro, introduced at the conference in 2015, provides real-time tracking of and historical performance metrics for locked data sets through a unified control table, enabling developers to hone processes in order to optimize efficiency and data throughput. This paper demonstrates the implementation of LOCKANDTRACK and its lock performance metrics to create data-driven, fuzzy logic algorithms that preemptively reroute program flow around inaccessible data sets. Thus, rather than needlessly waiting for a data set to become available or for a process to complete, the software anticipates the wait time based on historical norms, performs other (independent) functions, and returns to the original process when the data set becomes available.
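A generic sketch of the kind of lock test that such macros build on (this is not the LOCKITDOWN or LOCKANDTRACK code): the LOCK statement attempts an exclusive lock and sets the automatic macro variable &SYSLCKRC, so a nonzero value signals that the data set is in use and the program could branch to independent work instead. The data set name is hypothetical.

    %macro check_lock(dsn);
       lock &dsn;
       %if &syslckrc = 0 %then %do;
          %put NOTE: &dsn is available.;
          lock &dsn clear;      /* release the test lock for the real process */
       %end;
       %else %put WARNING: &dsn is locked by another process - rerouting.;
    %mend check_lock;

    %check_lock(mylib.transactions)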
Troy Hughes, Datmesis Analytics
Impala is an open source SQL engine designed to bring real-time, concurrent, ad hoc query capability to Hadoop. SAS/ACCESS® Interface to Impala allows SAS® to take advantage of this exciting technology. This presentation uses examples to show you how to increase your program's performance and troubleshoot problems. We discuss how Impala fits into the Hadoop ecosystem and how it differs from Hive. Learn the differences between the Hadoop and Impala SAS/ACCESS engines.
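A hedged connection example (the host name, schema, and table are placeholders, and the available options depend on your SAS/ACCESS to Impala release and driver configuration):

    libname imp impala server='impalad01.example.com' schema=sales;

    proc sql;
       select count(*) as order_count
       from imp.orders;       /* simple aggregations can be passed down to Impala */
    quit;

    libname imp clear;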
Jeff Bailey, SAS
Organizations are collecting more structured and unstructured data than ever before, and analyzing all of that complex data presents great opportunities and challenges. In this volatile and competitive economy, there has never been a bigger need for proactive and agile strategies to overcome these challenges by applying the analytics directly to the data rather than shuffling data around. Teradata and SAS® have joined forces to revolutionize your business by providing enterprise analytics in a harmonious data management platform, delivering critical strategic insights by applying advanced analytics inside the database or data warehouse where the vast volume of data is fed and resides. This paper highlights how Teradata and SAS strategically integrate SAS in-memory and in-database processing with the Teradata relational database management system (RDBMS) in a single box, giving IT departments a simplified footprint and reporting infrastructure, as well as lower total cost of ownership (TCO), while offering SAS users increased analytic performance by eliminating the shuffling of data over external network connections.
Greg Otto, Teradata Corporation
Tho Nguyen, Teradata
Paul Segal, Teradata
Seven principles frequently manifest themselves in data management projects. These principles improve two aspects: the development process, which enables the programmer to deliver better code faster with fewer issues, and the actual performance of programs and jobs. Using examples from Base SAS®, the SAS® Macro Language, DataFlux®, and SQL, this paper presents seven principles: environments, job control data, test data, improvement, data locality, minimal passes, and indexing. Readers with an intermediate knowledge of the SAS DataFlux Management Platform and/or Base SAS and the SAS Macro Language, as well as SQL, will understand these principles and find ways to use them in their data management projects.
Bob Janka, Modern Analytics
Developers working on a production process need to think carefully about ways to avoid future changes that require change control, so it's always important to make the code dynamic rather than hardcoding items into the code. Even if you are a seasoned programmer, the hardcoded items might not always be apparent. This paper assists in identifying the harder-to-reach hardcoded items and addresses ways to effectively use control tables within the SAS® software tools to deal with sticky areas of coding such as formats, parameters, grouping/hierarchies, and standardization. The paper presents examples of several ways to use the control tables and demonstrates why this usage prevents the need for coding changes. Practical applications are used to illustrate these examples.
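A small illustration of the control-table idea, as it applies to formats: the format is built from a maintained table with CNTLIN=, so adding a category becomes a data change rather than a code change. The table contents below are hypothetical.

    data work.ctl_region;      /* control table, normally maintained outside the program */
       length fmtname $ 8 start $ 2 label $ 20;
       input start $ label $20.;
       fmtname = 'REGFMT';
       type = 'C';
       datalines;
    NE Northeast
    SE Southeast
    MW Midwest
    WE West
    ;
    run;

    proc format cntlin=work.ctl_region;
    run;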
Frank Ferriola, Financial Risk Group
During the course of a clinical trial study, large numbers of new and modified data records are received on an ongoing basis. Providing end users with an approach to continuously review and monitor study data, while enabling them to focus reviews on new or modified (incremental) data records, allows for greater efficiency in identifying potential data issues. In addition, supplying data reviewers with a familiar machine-readable output format (for example, Microsoft Excel) allows for greater flexibility in filtering, highlighting, and retaining reviewer comments. In this paper, we outline an approach using SAS® in a Windows server environment and a shared folder structure to automatically refresh data review listings. Upon each execution, the listings are compared against previously reviewed data to flag new and modified records, as well as to carry forward any reviewer comments made during the previous review. In addition, we highlight the use and capabilities of the SAS® ExcelXP tagset, which enables greater control over data formatting, including management of Microsoft Excel's sometimes undesired automatic formatting. Overall, this approach provides a significantly improved end-user experience above and beyond the more traditional approach of performing cumulative or incremental data reviews using PDF listings.
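A hedged sketch of the ExcelXP output described above (the file path, sheet name, and data set are hypothetical): the tagset writes SpreadsheetML that Excel opens as a workbook, and options such as SHEET_NAME, AUTOFILTER, and FROZEN_HEADERS keep the listing review-friendly.

    ods tagsets.excelxp file='/review/ae_listing.xml' style=minimal
        options(sheet_name='AE Review' autofilter='all' frozen_headers='yes');

    proc report data=work.ae_review nowd;
       columns usubjid aestdtc aeterm review_flag reviewer_comment;
       define review_flag / display 'New or Modified';
    run;

    ods tagsets.excelxp close;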
Victor Lopez, Samumed, LLC
Heli Ghandehari, Samumed, LLC
Bill Knowlton, Samumed, LLC
Christopher Swearingen, Samumed, LLC
As data management professionals, you have to comply with new regulations and controls. One such regulation is the Basel Committee on Banking Supervision's standard BCBS 239. To respond to these new demands, you have to put processes and methods in place to automate metadata collection and analysis, and to provide rigorous documentation around your data flows. You also have to deal with many aspects of data management including data access, data manipulation (ETL and other), data quality, data usage, and data consumption, often from a variety of toolsets that are not necessarily from a single vendor. This paper shows you how to use SAS® technologies to support data governance requirements, including third-party metadata collection and data monitoring. It highlights best practices such as implementing a business glossary and establishing controls for monitoring data. Attend this session to become familiar with the SAS tools used to meet the new requirements and to implement a more managed environment.
Jeff Stander, SAS
With the big data throughput generated by event streams, organizations can respond opportunistically and with low latency. Having the ability to permeate identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the innovative ability to consolidate real-time data ingestion with controlled and disciplined universal data access from SAS® and Teradata.
Tho Nguyen, Teradata
Fiona McNeill, SAS
In the fast-changing world of data management, new technologies to handle increasing volumes of (big) data are emerging every day. Many companies struggle to deal with these technologies and, more importantly, to integrate them into their existing data management processes. Moreover, they also want to rely on the knowledge their teams have built in existing products, so implementing change to learn new technologies can be a costly procedure. SAS® is a perfect fit in this situation because it offers a suite of software that can be set up to work with any third-party database by using the corresponding SAS/ACCESS® interface. Indeed, for new database technologies, SAS releases a specific SAS/ACCESS interface, allowing users to develop and migrate SAS solutions almost transparently. Your users need to know only a few techniques in order to combine the power of SAS with a third-party (big) database. This paper helps companies integrate emerging technologies into their current SAS® Data Management platform using SAS/ACCESS. More specifically, the paper focuses on best practices for success in such an implementation. Integrating SAS Data Management with the IBM PureData for Analytics appliance is provided as an example.
Magali Thesias, Deloitte
SAS® programmers spend hours developing DATA step code to clean up their data. Save time and money and enhance your data quality results by leveraging the power of the SAS Quality Knowledge Base (QKB). Knowledge Engineers at SAS have developed advanced context-specific data quality operations and packaged them in callable objects in the QKB. In this session, a Senior Software Development Manager at SAS explains the QKB and shows how to invoke QKB operations in a DATA step and in various SAS® Data Management offerings.
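A hedged example of calling a QKB definition from a DATA step. It assumes a SAS® Data Quality Server installation with an English (United States) QKB; the setup path, input data set, and definition name are illustrative.

    %dqload(dqlocale=(ENUSA), dqsetuploc='/opt/sas/qkb/ci/27');   /* load the locale */

    data work.std_names;
       set work.customers;                               /* hypothetical input           */
       name_std = dqStandardize(name, 'Name', 'ENUSA');  /* QKB standardization definition */
    run;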
Brian Rineer, SAS
SAS® programs can be very I/O intensive. SAS data sets with inappropriate variable attributes can degrade the performance of SAS programs. Using SAS compression offers some relief but does not eliminate the issues caused by inappropriately defined SAS variables. This paper examines the problems that inappropriate SAS variable attributes can cause and introduces a macro to tackle the problem of minimizing the footprint of a SAS data set.
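A sketch of the idea behind such a macro (not the paper's macro itself): measure the longest value actually stored in a character variable and rebuild the data set with a right-sized LENGTH. The data set and variable names are hypothetical.

    proc sql noprint;
       select max(lengthn(comment)) into :need trimmed
       from work.big_table;
    quit;

    data work.big_table_small;
       length comment $ &need;   /* shrink to the longest value actually used */
       set work.big_table;
    run;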
Brian Varney, Experis BI & Analytics Practice
Annotating a blank Case Report Form (blankcrf.pdf, which is required for a new drug application (NDA) submission to the US Food and Drug Administration) by hand is an arduous process that can take many hours of precious time. Once again SAS® comes to the rescue! You can have SAS use the Study Data Tabulation Model (SDTM) specifications data to create all the necessary annotations and place them on the appropriate pages of the blankcrf.pdf. In addition you can dynamically color and adjust the font size of these annotations. This approach uses SAS, Adobe Acrobat's forms definition format (FDF) language, and Adobe Reader to complete the process. In this paper I go through each of the steps needed and explain in detail exactly how to accomplish each task.
Steven Black, Agility Clinical
SAS® Data Management is not a dating application. However, as a data analyst, you do strive to find the best matches for your data. Similar to a dating application, when trying to find matches for your data, you need to specify the criteria that constitute a suitable match. You want to strike a balance between being too stringent with your criteria and under-matching your data and being too loose with your criteria and over-matching your data. This paper highlights various SAS Data Management matching techniques that you can use to strike the right balance and help your data find its perfect match, and as a result improve your data for reporting and analytics purposes.
Mary Kathryn Queen, SAS
The SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine are data formats from SAS® that support the creation of analytical base tables with tens of thousands of columns. These analytical base tables are used to support daily predictive analytical routines. Traditionally Storage Area Network (SAN) storage has been and continues to be the primary storage platform for the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine formats. Due to cost constraints associated with SAN storage, companies have added Hadoop to their environments to help minimize storage costs. In this paper we explore how the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine leverage the Hadoop Distributed File System.
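A hedged example of pointing the SPD Engine at HDFS (the configuration paths, HDFS directory, and data set names are illustrative, and the exact setup depends on your Hadoop distribution):

    options set=SAS_HADOOP_CONFIG_PATH='/opt/sas/hadoop/conf';
    options set=SAS_HADOOP_JAR_PATH='/opt/sas/hadoop/jars';

    libname spdat spde '/user/sasdemo/spde' hdfshost=default;

    data spdat.abt;          /* the analytical base table lands in HDFS */
       set work.abt;
    run;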
Steven Sober, SAS
This paper provides tips and techniques to speed up the validation process both without and with automation. For validation without automation, it introduces both standard and clever uses of options and statements in the COMPARE procedure that can speed up the validation process. For validation with automation, a macro named %QCDATA is introduced for individual data set validation, and a macro named %QCDIR is introduced for comparison of data sets in two different directories. This section also defines the &SYSINFO automatic macro variable and explains how it can be used to interpret the result of the comparison.
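A short illustration of the &SYSINFO check that such automation builds on: PROC COMPARE sets &SYSINFO to a bit-mapped return code, and a value of 0 means the two data sets match exactly. The data set names are hypothetical, and the test must run before another step resets &SYSINFO.

    proc compare base=prod.ae compare=qc.ae;
    run;

    %macro report_match;
       %if &sysinfo = 0 %then %put NOTE: Data sets match exactly.;
       %else %put WARNING: Mismatch found. SYSINFO=&sysinfo (bit-mapped code).;
    %mend report_match;
    %report_match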
Alice Cheng, Portola Pharmaceuticals
Justina Flavin, Independent Consultant
Michael Wise, Experis BI & Analytics Practice
In the pharmaceutical industry, generating summary tables or listings in PDF format is becoming increasingly popular. You can create various PDF output files with different styles by combining the ODS PDF statement and the REPORT procedure in SAS® software. In order to validate those outputs, you need to import the PDF summary tables or listings into SAS data sets. This paper presents an overview of a utility that reads in an uncompressed PDF file that is generated by SAS, extracts the data, and converts it to SAS data sets.
William Wu, Herodata LLC
Yun Guan, Theravance Biopharm
Working with big data is often time consuming and challenging. The primary goal in programming is to maximize throughput while minimizing the use of computer processing time, real time, and programmers' time. By using the Multiprocessing (MP) CONNECT method on a symmetric multiprocessing (SMP) computer, a programmer can divide a job into independent tasks and execute the tasks as threads in parallel on several processors. This paper demonstrates the development and application of a parallel processing program on a large amount of health-care data.
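A minimal MP CONNECT sketch, assuming SAS/CONNECT is licensed (the library and data set names are hypothetical): two independent summaries run as parallel child sessions on the same SMP machine, and the parent session waits for both before continuing.

    options sascmd='!sascmd';             /* spawn child SAS sessions on this machine */

    signon task1;
    rsubmit task1 wait=no inheritlib=(perm);
       proc summary data=perm.claims_2015 nway;
          class plan_type;  var paid_amount;
          output out=perm.sum_2015 sum=;
       run;
    endrsubmit;

    signon task2;
    rsubmit task2 wait=no inheritlib=(perm);
       proc summary data=perm.claims_2016 nway;
          class plan_type;  var paid_amount;
          output out=perm.sum_2016 sum=;
       run;
    endrsubmit;

    waitfor _all_ task1 task2;            /* block until both tasks finish */
    signoff _all_;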
Shuhua Liang, Kaiser Permanente
Every day, companies all over the world are moving their data into the cloud. While there are many options available, much of this data will wind up in Amazon Redshift. As a SAS® user, you are probably wondering, 'What is the best way to access this data using SAS?' This paper discusses the many ways that you can use SAS/ACCESS® software to get to Amazon Redshift. We compare and contrast the various approaches and help you decide which is best for you. Topics include building a connection, moving data into Amazon Redshift, and applying performance best practices.
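A hedged connection example (the cluster endpoint, credentials, and table are placeholder values, and the exact options depend on your SAS/ACCESS to Amazon Redshift release):

    libname rs redshift
            server='mycluster.abc123.us-east-1.redshift.amazonaws.com'
            port=5439 database=analytics user=myuser password='XXXXXXXX';

    proc sql;
       select count(*) as row_count
       from rs.web_orders;       /* aggregation is passed to Redshift where possible */
    quit;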
Chris DeHart, SAS
Jeff Bailey, SAS
In this paper, a SAS® macro is introduced that can help users find and access their folders and files very easily. You provide a path to the macro and tell it which folders and files you are looking for under that path, and the macro creates an HTML report that lists the matching folders and files. The best part of this HTML report is that it also creates a hyperlink for each folder and file, so that when a user clicks the hyperlink, it directly opens the folder or file. Users can also ask the macro to find certain folders or files by providing part of the folder or file name as the search criterion. The results shown in the report can be sorted in different ways, which further helps users quickly find and access their folders and files.
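A compact sketch of the directory-reading technique such a macro relies on (the path is hypothetical): DOPEN, DNUM, and DREAD list the members of a folder into a data set that could then be reported with hyperlinks.

    data work.folder_contents;
       length member $ 256;
       rc  = filename('dref', '/projects/study01');
       did = dopen('dref');
       if did > 0 then do i = 1 to dnum(did);
          member = dread(did, i);     /* file or subfolder name */
          output;
       end;
       rc = dclose(did);
       keep member;
    run;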
Ting Sa, Cincinnati Children's Hospital Medical Center
When I help users design or debug their SAS® programs, they are sometimes unable to provide relevant SAS data sets because they contain confidential information. Sometimes, confidential data values are intrinsic to their problem, but often the problem could still be identified or resolved with innocuous data values that preserve some of the structure of the confidential data. Or the confidential values are in variables that are unrelated to the problem. While techniques for masking or disguising data exist, they are often complex or proprietary. In this paper, I describe a very simple macro, REVALUE, that can change the values in a SAS data set. REVALUE preserves some of the structure of the original data by ensuring that for a given variable, observations with the same real value have the same replacement value, and if possible, observations with a different real value have a different replacement value. REVALUE enables the user to specify which variables to change and whether to order the replacement values for each variable by the sort order of the real values or by observation order. I discuss the REVALUE macro in detail and provide a copy of the macro.
Bruce Gilsen, Federal Reserve Board
Microsoft Office products play an important role in most enterprises. SAS® is combined with Microsoft Office to assist in decision making in everyday life. Most SAS users have moved from SAS® 9.3, or 32-bit SAS, to SAS® 9.4 for the exciting new features. This paper describes a few things that do not work quite the same in SAS 9.4 as they did in the 32-bit version. It discusses the reasons for the differences and the new approaches that SAS 9.4 provides. The paper focuses on how SAS 9.4 works with Excel and Access. Furthermore, this presentation summarizes methods by LIBNAME engines and by import or export procedures. It also demonstrates how to interactively import or export PC files with SAS 9.4. The issue of incompatible format catalogs is addressed as well.
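A hedged illustration of two of the 64-bit-friendly approaches discussed (the file paths and server name are hypothetical): the XLSX LIBNAME engine reads a workbook directly without Microsoft Office, while the PCFILES engine routes Microsoft Access data through the SAS PC Files Server when bitness differs.

    libname xl xlsx '/data/budget2016.xlsx';        /* SAS 9.4: reads the workbook directly */

    data work.budget;
       set xl.'Sheet1'n;
    run;

    libname xl clear;

    libname acc pcfiles path='/data/tracking.accdb'
            server='pcfsrv.example.com' port=9621;  /* PC Files Server host and port */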
Hong Zhang, Mathematica Policy Research
Specialized access requirements to tap into event streams vary depending on the source of the events. Open-source approaches from Spark, Kafka, Storm, and others can connect event streams to big data lakes, such as Hadoop and other common data management repositories. But a different approach is needed to ensure that latency and throughput are not adversely affected when processing streaming data with different open-source frameworks; that is, they need to scale. This talk distinguishes the advantages of adapters and connectors and shows how SAS® can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.
Evan Guarnaccia, SAS
Fiona McNeill, SAS
Steve Sparano, SAS
Customers are under increasing pressure to determine how data, reports, and other assets are interconnected, and to determine the implications of making changes to these objects. Of special interest is performing lineage and impact analysis on data from disparate systems, for example, SAS®, IBM InfoSphere DataStage, Informatica PowerCenter, SAP BusinessObjects, and Cognos. SAS provides utilities to identify relationships between assets that exist within the SAS system, and beginning with the third maintenance release of SAS® 9.4, these utilities support third-party products. This paper provides step-by-step instructions, tips, and techniques for using these utilities to collect metadata from these disparate systems, and to perform lineage and impact analysis.
Liz McIntosh, SAS
Today many analysts in information technology are facing challenges working with large amounts of data. Most analysts are smart enough to write queries using correct table join conditions, text mining, index keys, and hash objects for quick data retrieval. The SAS® system is now able to work with Hadoop data storage to provide efficient processing power to SAS analytics. In this paper we demonstrate differences in data retrieval between the SQL procedure, the DATA step, and the Hadoop environment. In order to test data retrieval time, we used the following code to randomly generate ten million observations with character and numeric variables using the RANUNI function. DO LOOP=1 to 1e7 generates ten million records; however, this code can generate any number of records by changing the exponent. We use the most commonly used functions and procedures to retrieve records from a test data set, and we illustrate real-time and CPU-time processing. All PROC SQL and Hadoop queries are run in SAS® Enterprise Guide® 6.1 to record processing time. This paper includes an overview of Hadoop data architecture and describes how to connect to a Hadoop environment through SAS. It provides sample queries for data retrieval, explains differences between using Hadoop pass-through and the Hadoop LIBNAME engine, and covers the big challenges of real-time and CPU-time comparison using the most commonly used SAS functions and procedures.
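The kind of test-data generator described above (variable names are illustrative): 1e7 iterations yield ten million observations, and changing the exponent scales the test up or down.

    data work.testdata;
       do loop = 1 to 1e7;
          num_var  = ranuni(12345);                        /* numeric variable   */
          char_var = put(ceil(ranuni(6789) * 100), z3.);   /* character variable */
          output;
       end;
    run;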
Anjan Matlapudi, AmeriHealth Caritas Family of Companies
SAS® provides in-database processing technology in the SQL procedure, which allows the SQL explicit pass-through method to push some or all of the work to a database management system (DBMS). This paper focuses on using the SAS SQL explicit pass-through method to transform Teradata table columns into rows. There are two common approaches for transforming table columns into rows. The first approach is to create narrow tables, one for each column that requires transposition, and then use UNION or UNION ALL to append all the tables together. This approach is straightforward but can be quite cumbersome, especially when a large number of columns need to be transposed. The second approach is to use the Teradata TD_UNPIVOT function, which makes the wide-to-long table transposition an easy job. However, TD_UNPIVOT allows you to transpose only columns with the same data type from wide to long. This paper presents a SAS macro solution to the wide-to-long table transposition involving different column data types. Several examples are provided to illustrate the usage of the macro solution. This paper complements the author's SAS paper Performing Efficient Transposes on Large Teradata Tables Using SQL Explicit Pass-Through, in which a solution for the long-to-wide table transposition is discussed. SAS programmers who are working with data stored in an external DBMS and would like to efficiently transpose their data will benefit from this paper.
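A heavily hedged sketch of the explicit pass-through pattern the paper is based on (the connection options, table, and column names are placeholders, and the exact TD_UNPIVOT argument syntax should be confirmed against the Teradata documentation for your release):

    proc sql;
       connect to teradata (server=tdprod user=myuser password='XXXXXXXX');
       create table work.sales_long as
       select * from connection to teradata (
          select * from TD_UNPIVOT(
             ON (select cust_id, jan_sales, feb_sales, mar_sales from sales_wide)
             USING
                VALUE_COLUMNS('sales')
                UNPIVOT_COLUMN('sales_month')
                COLUMN_LIST('jan_sales', 'feb_sales', 'mar_sales')
          ) AS unpvt
       );
       disconnect from teradata;
    quit;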
Tao Cheng, Accenture
In a data warehousing system, change data capture (CDC) plays an important part, not just in making the data warehouse (DWH) aware of a change but also in providing a means of flowing the change to the DWH marts and reporting tables so that we see the current and latest version of the truth. Together with slowly changing dimensions (SCD), this creates a cycle that runs the DWH, provides valuable insight into history, and supports future decision-making. But what if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth. If these two processes can be combined into a single process, where one transform does both jobs of identifying the change and applying it to the DWH, then we can save significant processing time and valuable system resources. Hence, I came up with a hybrid SCD-with-CDC approach. My paper focuses on sources that do NOT provide CDC and shows how to perform SCD Type 2 on such records without worrying about data duplication or increased processing time.
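One common way to detect change when the source provides no CDC (a generic sketch, not the author's hybrid transform): compare an MD5 digest of the tracked columns against the digest stored on the current DWH row, and only the rows flagged here ever reach the SCD Type 2 logic. The table and column names are hypothetical.

    data work.changes;
       merge src.customer (in=insrc)
             dwh.dim_customer (in=indwh keep=cust_id current_flag row_hash
                               where=(current_flag='Y'));
       by cust_id;
       new_hash = put(md5(catx('|', name, address)), $hex32.);
       if insrc and not indwh then change_type = 'INSERT';       /* brand-new key     */
       else if insrc and new_hash ne row_hash then change_type = 'UPDATE';
       else delete;                                              /* unchanged: ignore */
    run;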
Vishant Bhat, University of Newcastle
Tony Blanch, SAS Consultant
JavaScript Object Notation (JSON) has quickly become the de facto standard for data transfer on the Internet, due to an increase in both web data and the usage of full-stack JavaScript. JSON has become dominant in the emerging technologies of the web today, such as the Internet of Things and the mobile cloud. JSON offers a lightweight and flexible format for data transfer, and it can be processed directly from JavaScript without the need for an external parser. The SAS® JSON procedure lacks the ability to read in JSON. However, the addition of the GROOVY procedure in SAS® 9.3 allows Groovy (Java-based) code to be executed from within SAS, so JSON data can be read into a SAS data set through XML conversion. This paper demonstrates the method for parsing JSON into data sets with Groovy and the XML LIBNAME engine, all within Base SAS®.
John Kennedy, Mesa Digital
Analyzing massive amounts of big data quickly to get at answers that increase agility (an organization's ability to sense change and respond) and drive time-sensitive business decisions is a competitive differentiator in today's market. Customers gain from the deep understanding of storage architecture provided by EMC combined with the deep expertise in analytics provided by SAS. This session provides an overview of how your mixed analytics SAS® workloads can be transformed on EMC XtremIO, DSSD, and Pivotal solutions. Whether you're working with one of the three primary SAS file systems or SAS® Grid, performance and capacity scale linearly, application latency is significantly reduced, and the complexity of storage tuning is removed from the equation. SAS and EMC have partnered to meet customer challenges and deliver a modern analytic architecture. This unified approach encompasses big data management, analytics discovery, and deployment via end-to-end solutions that solve your big data problems. Learn about best practices from fellow customers and how they deliver new levels of SAS business value.
Big data is often distinguished as encompassing high volume, high velocity, or high variability of data. While big data can signal big business intelligence and big business value, it can also wreak havoc on systems and software ill-prepared for its profundity. Scalability describes the ability of a system or software to adequately meet the needs of additional users or its ability to use additional processors or resources to fulfill those added requirements. Scalability also describes the adequate and efficient response of a system to increased data throughput. Because sorting data is one of the most common and most resource-intensive operations in any software language, inefficiencies or failures caused by big data often are first observed during sorting routines. Much SAS® literature has been dedicated to optimizing big data sorts for efficiency, including minimizing execution time and, to a lesser extent, minimizing resource usage (that is, memory and storage consumption). However, less attention has been paid to implementing big data sorting that is reliable and robust even when confronted with resource limitations. To that end, this text introduces the SAFESORT macro, which facilitates a priori exception-handling routines (which detect environmental and data set attributes that could cause process failure) and post hoc exception-handling routines (which detect actual failed sorting routines). If exception handling is triggered, SAFESORT automatically reroutes program flow from the default sort routine to a less resource-intensive routine, thus sacrificing execution speed for reliability. Moreover, macro modularity enables developers to select their favorite sort procedure and, for data-driven disciples, to build fuzzy logic routines that dynamically select a sort algorithm based on environmental and data set attributes.
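A simplified illustration of the reroute-on-failure idea (this is not the SAFESORT macro itself): try an ordinary sort, and if it fails, fall back to a slower but less resource-hungry TAGSORT. The data set names are hypothetical; a production version would also handle batch syntax-check mode and other failure modes.

    %macro safe_sort(data=, out=, by=);
       proc sort data=&data out=&out;
          by &by;
       run;
       %if &syserr > 4 %then %do;            /* 0 = OK, 4 = warning, >4 = error */
          %put WARNING: Default sort failed (SYSERR=&syserr). Retrying with TAGSORT.;
          proc sort data=&data out=&out tagsort;
             by &by;
          run;
       %end;
    %mend safe_sort;

    %safe_sort(data=work.bigdata, out=work.bigdata_sorted, by=id)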
Troy Hughes, Datmesis Analytics
Sixty percent of organizations will have Hadoop in production by 2016, per a recent TDWI survey, and it has become increasingly challenging to access, cleanse, and move these huge volumes of data. How can data scientists and business analysts clean up their data, manage it where it lives, and overcome the big data skills gap? It all comes down to accelerating the data preparation process. SAS® Data Loader leverages years of expertise in data quality and data integration, a simplified user interface, and the parallel processing power of Hadoop to overcome the skills gap and compress the time taken to prepare data for analytics or visualization. We cover some of the new capabilities of SAS Data Loader for Hadoop including in-memory Spark processing of data quality and master data management functions, faster profiling and unstructured data field extraction, and chaining multiple transforms together for improved productivity.
Matthew Magne, SAS
Scott Gidley, SAS
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®. Topics include recommendations gleaned from actual deployments from a variety of implementations and distributions. Techniques cover tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Nancy Rausch, SAS
Wilbram Hazejager, SAS
SAS® High-Performance Risk distributes financial risk data and big data portfolios with complex analyses across a networked Hadoop Distributed File System (HDFS) grid to support rapid in-memory queries for hundreds of simultaneous users. This data is extremely complex and must be stored in a proprietary format to guarantee data affinity for rapid access. However, customers still desire the ability to view and process this data directly. This paper demonstrates how to use the HPRISK custom file reader to directly access risk data in Hadoop MapReduce jobs, using the HPDS2 procedure and the LASR procedure.
Mike Whitcher, SAS
Stacey Christian, SAS
Phil Hanna, SAS Institute
Don McAlister, SAS
The Add Health Parent Study is using a new and innovative method to augment our other interview verification strategies. Typical verification strategies include calling respondents to ask questions about their interview, recording pieces of the interaction (CARI, computer-aided recorded interviewing), and analyzing timing data to confirm that each interview was of a reasonable length. Geocoding adds another tool to the toolbox for verifications. By applying street-level geocoding to the address where an interview is reported to have been conducted and comparing that to a latitude/longitude reading captured from a GPS tracking device, we are able to compute the distance between the two points. If that distance is very small and the time stamps are close to each other, then the evidence points to the field interviewer being present at the respondent's address during the interview. For our project, the street-level geocoding of an address is done using SAS® PROC GEOCODE. Our paper describes how to obtain a US address database from the SAS website and how it can be used in PROC GEOCODE. We also briefly compare this technique to using the Google Maps API and Python as an alternative.
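A hedged sketch of the two steps described (the street lookup data location and the variable names are placeholders): PROC GEOCODE assigns coordinates to each interview address, and the GEODIST function measures how far the captured GPS reading fell from that geocoded point.

    proc geocode method=street
                 data=work.interview_addresses      /* address, city, state, zip   */
                 out=work.geocoded
                 lookupstreet=lookup.usm;           /* downloaded US street lookup */
    run;

    data work.verification;
       merge work.geocoded work.gps_readings;       /* both keyed by case_id       */
       by case_id;
       dist_miles = geodist(y, x, gps_lat, gps_lon, 'M');   /* Y = latitude, X = longitude */
    run;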
Chris Carson, RTI International
Lilia Filippenko, RTI International
Mai Nguyen, RTI International
By default, the SAS® hash object permits only entries whose keys, defined in its key portion, are unique. While in certain programming applications this is a rather utile feature, there are also others for which being able to insert and manipulate entries with duplicate keys is imperative. Such an ability, available since SAS® 9.2, was a welcome development: It vastly expanded the functionality of the hash object and eliminated the necessity to work around the distinct-key limitation using custom code. However, nothing comes without a price, and the ability of the hash object to store duplicate-key entries is no exception. In particular, additional hash object methods had to be--and were--developed to handle specific entries sharing the same key. The extra price is that using these methods is surely not quite as straightforward as the simple corresponding operations on distinct-key tables, and the documentation alone is rather poor help for making them work in practice. Rather extensive experimentation and investigative coding are necessary to make that happen. This paper is a result of such an endeavor, and hopefully, it will save those who delve into it a good deal of time and frustration.
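A small, self-contained illustration of the duplicate-key methods in question: with the MULTIDATA:'Y' argument tag, FIND retrieves the first entry for the current key and FIND_NEXT walks through the remaining entries that share it.

    data work.orders;
       input cust $ amount;
       datalines;
    A 10
    A 25
    B 7
    ;
    run;

    data work.totals(keep=cust total);
       if _n_ = 1 then do;
          declare hash h(dataset:'work.orders', multidata:'y');
          h.defineKey('cust');
          h.defineData('amount');
          h.defineDone();
       end;
       set work.orders;
       by cust;
       if first.cust;                /* one pass per distinct key       */
       total = 0;
       rc = h.find();                /* first entry for the current key */
       do while (rc = 0);
          total + amount;
          rc = h.find_next();        /* next entry sharing the same key */
       end;
    run;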
Paul Dorfman, Dorfman Consulting
The latest releases of SAS® Data Integration Studio, SAS® Data Management Studio and SAS® Data Integration Server, SAS® Data Governance, and SAS/ACCESS® software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud, RDBMS, files, unstructured data, and streaming, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
SAS® Federation Server is a scalable, threaded, multi-user data access server providing seamlessly integrated data from various data sources. When your data becomes too large to move or copy and too sensitive to allow direct access, the powerful set of data virtualization capabilities allows you to effectively and efficiently manage and manipulate data from many sources, without moving or copying the data. This agile data integration framework allows business users the ability to connect to more data, reduce risks, respond faster, and make better business decisions. For technical users, the framework provides central data control and security, reduces complexity, optimizes commodity hardware, promotes rapid prototyping, and increases staff productivity. This paper provides an overview of the latest features of the product and includes use cases and examples for leveraging product capabilities.
Tatyana Petrova, SAS
In the world of big data, real-time processing and event stream processing are becoming the norm. However, not many tools available today can do this type of processing. SAS® Event Stream Processing is designed to process this data. In this paper, we look at using SAS Event Stream Processing to read multiple data sets stored in big data platforms such as Hadoop and Cassandra in real time and to perform transformations on the data, such as joining data sets, filtering data based on preset business rules, and creating new variables as required. We also look at how we can score the data based on a machine learning algorithm. This paper shows you how to use the provided Hadoop Distributed File System (HDFS) publisher and subscriber to read data from and push data to Hadoop. The HDFS adapter is discussed in detail. Finally, we use the Streamviewer to see how data flows through SAS Event Stream Processing.
Krishna Sai Kishore Konudula, Kavi Associates