In the clinical research world, data accuracy plays a significant role in delivering quality results. Various validation methods are available to confirm data accuracy. Of these, double programming is the most highly recommended and commonly used method to demonstrate a perfect match between production and validation output, and PROC COMPARE is one of the SAS® procedures used to compare two data sets and confirm their agreement. In current practice, whenever a program is rerun, the programmer must manually review the output file to ensure an exact match. This is tedious, time-consuming, and error prone because many LST files must be reviewed and manual intervention is required. The proposed approach programmatically validates the PROC COMPARE output of all the programs and generates an HTML file with a Pass/Fail status flag for each output file, along with the following: (1) the data set name, label, and number of variables and observations; (2) the number of observations not in both the compared and base data sets; and (3) the number of variables not in both the compared and base data sets. The status is flagged as Pass whenever the output file meets the following 3N criteria: (1) the message "NOTE: No unequal values were found. All values compared are exactly equal." appears; (2) the numbers of observations in the base and compared data sets are equal; and (3) the numbers of variables in the base and compared data sets are equal. The SAS® macro %threeN validates all the output files generated by PROC COMPARE quickly and accurately, reducing the time spent on manual review of PROC COMPARE output files by up to 90%.
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
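A minimal sketch of the kind of check %threeN automates, assuming each PROC COMPARE result has been routed to an LST file; the paths are hypothetical, and the strings scanned are illustrative because the wording of the listing varies with PROC COMPARE options:

    /* Scan one listing for the 3N criteria and flag Pass/Fail */
    data check3n;
       infile "/study/output/compare_dm.lst" truncover end=eof;
       input line $char200.;
       length status $ 4;
       retain noteok mismatch 0;
       if index(line, 'No unequal values were found') then noteok = 1;
       if index(line, 'but not in') then mismatch = 1;  /* obs or vars missing from one side */
       if eof then do;
          status = ifc(noteok and not mismatch, 'Pass', 'Fail');
          output;
       end;
       keep status;
    run;

    ods html file="/study/output/threen_report.html";
    proc print data=check3n noobs; run;
    ods html close;

Looping this logic over every LST file in a directory, and adding the data set attributes listed above, yields the kind of consolidated HTML status report the macro produces.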
Common tasks that we need to perform include merging or appending SAS® data sets. During this process, we sometimes get error or warning messages saying that the same fields in different SAS data sets have different lengths or different types. If the problems involve many fields and data sets, we have to spend a lot of time identifying those fields and writing extra SAS code to resolve the issues. The macro in this paper, however, identifies the fields that have inconsistent data types or lengths. It also resolves the length issues automatically by finding the maximum length of each field among the current data sets and assigning that length to the field. Running the macro generates an HTML report that lists the fields whose lengths have been changed and the fields that have inconsistent data types.
Ting Sa, Cincinnati Children's Hospital Medical Center
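A minimal sketch of the idea behind the macro, not the macro itself: query DICTIONARY.COLUMNS for the maximum declared length of a shared character field across the data sets, then declare that length before the SET statement. The data set and variable names are hypothetical.

    proc sql noprint;
       select max(length) into :maxlen trimmed
       from dictionary.columns
       where libname = 'WORK'
         and memname in ('VISIT1', 'VISIT2')
         and upcase(name) = 'SITENAME';
    quit;

    data combined;
       length sitename $ &maxlen;  /* declare before SET so nothing is truncated */
       set visit1 visit2;
    run;

The full macro generalizes this across all common fields and also reports any fields whose types conflict, which must be fixed by an explicit conversion.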
With increasing data needs, it becomes more and more unwieldy to ensure that all scheduled jobs are running successfully and on time. Worse, maintaining your reputation as an information provider becomes a precarious prospect as the likelihood increases that your customers alert you to reporting issues before you are aware of them yourself. By combining various SAS® capabilities and tying them together with concise visualizations, it is possible to track jobs actively and alert customers of issues before they become a problem. This paper introduces a report tracking framework that helps achieve this goal and improve customer satisfaction. The report tracking starts by obtaining table and job statuses and color-coding them by severity level. Based on the job status, it then goes deeper into the log to search for potential errors or other relevant information that helps assess the processes. The last step is to send proactive alerts to users, informing them of possible delays or potential data issues.
Jia Heng, Wyndham Destination Network
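A minimal sketch of the log-scanning step described above, assuming the scheduled jobs write their logs to disk; the path is hypothetical.

    data logscan;
       infile "/batch/logs/nightly_job.log" truncover;
       input line $char256.;
       length severity $ 8;
       if line =: 'ERROR:'        then severity = 'ERROR';
       else if line =: 'WARNING:' then severity = 'WARNING';
       else delete;
    run;

    proc freq data=logscan;
       tables severity / nocum;
    run;

The severity counts can then be color-coded in the status report, and the FILENAME EMAIL access method can deliver the proactive alerts.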
Seven principles frequently manifest themselves in data management projects. These principles improve two things: the development process, enabling the programmer to deliver better code faster with fewer issues, and the actual performance of programs and jobs. Using examples from Base SAS®, the SAS® Macro Language, DataFlux®, and SQL, this paper presents seven principles: environments, job control data, test data, improvement, data locality, minimal passes, and indexing. Readers with an intermediate knowledge of the DataFlux® Data Management Platform and/or Base SAS and the SAS Macro Language, as well as SQL, will understand these principles and find ways to use them in their own data management projects.
Bob Janka, Modern Analytics
During the course of a clinical trial study, large numbers of new and modified data records are received on an ongoing basis. Providing end users with an approach to continuously review and monitor study data, while enabling them to focus reviews on new or modified (incremental) data records, allows for greater efficiency in identifying potential data issues. In addition, supplying data reviewers with a familiar machine-readable output format (for example, Microsoft Excel) allows for greater flexibility in filtering, highlighting, and retention of data reviewers' comments. In this paper, we outline an approach using SAS® in a Windows server environment and a shared folder structure to automatically refresh data review listings. Upon each execution, the listings are compared against previously reviewed data to flag new and modified records, as well as carry forward any data reviewers' comments made during the previous review. In addition, we highlight the use and capabilities of the SAS® ExcelXP tagset, which enables greater control over data formatting, including management of Microsoft Excel's sometimes undesired automatic formatting. Overall, this approach provides a significantly improved end-user experience above and beyond the more traditional approach of performing cumulative or incremental data reviews using PDF listings.
Victor Lopez, Samumed, LLC
Heli Ghandehari, Samumed, LLC
Bill Knowlton, Samumed, LLC
Christopher Swearingen, Samumed, LLC
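A minimal sketch of the ExcelXP step described above; the listing data set, file path, and variable names are hypothetical, and the tagattr value shown is one common way to stop Excel from reformatting values such as leading-zero subject IDs.

    ods tagsets.excelxp file="/review/ae_review.xml"
        options(sheet_name='AE Listing' autofilter='all' frozen_headers='yes');

    proc report data=work.ae_review nowd;
       column subjid aedecod astdt status comment;
       define subjid  / display 'Subject'
              style(column)={tagattr='type:String'};  /* keep leading zeros */
       define status  / display 'New/Modified';
       define comment / display 'Reviewer Comment';
    run;

    ods tagsets.excelxp close;

The tagset writes XML that Microsoft Excel opens as a workbook, which is what makes this control over filtering and formatting possible.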
As data management professionals, you have to comply with new regulations and controls. One such regulation is the Basel Committee on Banking Supervision's standard BCBS 239. To respond to these new demands, you have to put processes and methods in place to automate metadata collection and analysis and to provide rigorous documentation around your data flows. You also have to deal with many aspects of data management, including data access, data manipulation (ETL and other), data quality, data usage, and data consumption, often with a variety of toolsets that are not necessarily from a single vendor. This paper shows you how to use SAS® technologies to support data governance requirements, including third-party metadata collection and data monitoring. It highlights best practices such as implementing a business glossary and establishing controls for monitoring data. Attend this session to become familiar with the SAS tools used to meet the new requirements and to implement a more managed environment.
Jeff Stander, SAS
SAS® programmers spend hours developing DATA step code to clean up their data. Save time and money and enhance your data quality results by leveraging the power of the SAS Quality Knowledge Base (QKB). Knowledge Engineers at SAS have developed advanced context-specific data quality operations and packaged them in callable objects in the QKB. In this session, a Senior Software Development Manager at SAS explains the QKB and shows how to invoke QKB operations in a DATA step and in various SAS® Data Management offerings.
Brian Rineer, SAS
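A minimal sketch of invoking QKB operations in a DATA step, assuming SAS® Data Quality Server with the ENUSA locale; the setup-file location is site-specific, and the definition names depend on the QKB release in use.

    %dqload(dqlocale=(ENUSA), dqsetuploc='C:\SAS\dqsetup.txt');  /* site-specific */

    data work.standardized;
       set work.contacts;                        /* hypothetical input */
       name_std = dqStandardize(name, 'Name');   /* QKB standardization definition */
       gender   = dqGender(name, 'Name');        /* QKB gender-analysis definition */
    run;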
SAS® programs can be very I/O intensive. SAS data sets with inappropriate variable attributes can degrade the performance of SAS programs. Using SAS compression offers some relief but does not eliminate the issues caused by inappropriately defined SAS variables. This paper examines the problems that inappropriate SAS variable attributes can cause and introduces a macro to tackle the problem of minimizing the footprint of a SAS data set.
Brian Varney, Experis BI & Analytics Practice
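A minimal sketch of the two remedies discussed above: trim an over-declared character length down to the longest value actually stored, and apply compression. The names are hypothetical; a full macro would derive a LENGTH statement for every variable from the data.

    proc sql noprint;
       select max(lengthn(comments)) into :need trimmed
       from work.big;
    quit;

    data work.smaller(compress=yes);
       length comments $ &need;   /* shrink to the longest stored value */
       set work.big;
    run;

    proc contents data=work.smaller;  /* verify the reduced footprint */
    run;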
Annotating a blank Case Report Form (blankcrf.pdf, which is required for a New Drug Application submission to the US Food and Drug Administration) by hand is an arduous process that can take many hours of precious time. Once again SAS® comes to the rescue! You can have SAS use the Study Data Tabulation Model (SDTM) specifications data to create all the necessary annotations and place them on the appropriate pages of the blankcrf.pdf. In addition, you can dynamically color and adjust the font size of these annotations. This approach uses SAS, Adobe's Forms Data Format (FDF) language, and Adobe Reader to complete the process. In this paper, I go through each of the steps needed and explain in detail exactly how to accomplish each task.
Steven Black, Agility Clinical
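A minimal sketch of generating FDF from SAS with a DATA _NULL_ step; the annotation data set (page, coordinates, text) is hypothetical, the object structure is simplified (real FreeText annotations usually carry color and font entries as well), and FDF page numbers are zero-based.

    data _null_;
       file "/crf/blankcrf_annotations.fdf";
       set work.sdtm_annotations end=last;   /* page, x1, y1, x2, y2, text */
       if _n_ = 1 then put '%FDF-1.2' / '1 0 obj' / '<< /FDF << /Annots [';
       put '<< /Type /Annot /Subtype /FreeText /Page ' page
           ' /Rect [' x1 y1 x2 y2 '] /Contents (' text +(-1) ') >>';
       if last then put '] >> >>' / 'endobj' / 'trailer' / '<< /Root 1 0 R >>' / '%%EOF';
    run;

Importing the resulting file into the blankcrf.pdf (for example, through Acrobat's comment import) places each annotation at its Rect coordinates on the stated page.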
SAS® Data Management is not a dating application. However, as a data analyst, you do strive to find the best matches for your data. As in a dating application, when trying to find matches for your data, you need to specify the criteria that constitute a suitable match. You want to strike a balance between criteria so stringent that you under-match your data and criteria so loose that you over-match it. This paper highlights various SAS Data Management matching techniques that you can use to strike the right balance, help your data find its perfect match, and, as a result, improve your data for reporting and analytics purposes.
Mary Kathryn Queen, SAS
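A minimal sketch of one such technique, match codes with a tunable sensitivity, assuming SAS® Data Quality Server with the ENUSA locale already loaded via %dqload; higher sensitivity means stricter matching, so the two extremes below bracket the under-matching/over-matching trade-off.

    data matchcodes;
       set work.customers;   /* hypothetical input */
       mc_loose  = dqMatch(name, 'Name', 55, 'ENUSA');  /* more matches, more false positives */
       mc_strict = dqMatch(name, 'Name', 95, 'ENUSA');  /* fewer matches, fewer false positives */
    run;

    proc sort data=matchcodes;   /* records sharing a match code are candidate matches */
       by mc_strict;
    run;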
This paper provides tips and techniques to speed up the validation process, both without and with automation. For validation without automation, it introduces standard and clever uses of COMPARE procedure options and statements that can speed up the validation process. For validation with automation, a macro named %QCDATA is introduced for individual data set validation, and a macro named %QCDIR for comparing the data sets in two different directories. Also introduced is the &SYSINFO automatic macro variable, along with an explanation of how it can be used to interpret the result of a comparison.
Alice Cheng, Portola Pharmaceuticals
Justina Flavin, Independent Consultant
Michael Wise, Experis BI & Analytics Practice
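A minimal sketch of the &SYSINFO usage described above; the libraries are hypothetical. PROC COMPARE sets &SYSINFO to 0 for an exact match and otherwise to a sum of documented bit values (for example, 4096 means some compared values were unequal), so it must be tested before any later step resets it.

    proc compare base=prod.demog compare=qc.demog;
    run;

    %macro qcstatus;
       %if &sysinfo = 0 %then %put NOTE: QC PASS - data sets match exactly.;
       %else %do;
          %put WARNING: QC FAIL - PROC COMPARE return code &sysinfo..;
          %if %sysfunc(band(&sysinfo, 4096)) %then
             %put WARNING: Some compared values are unequal.;
       %end;
    %mend qcstatus;
    %qcstatus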
Sixty percent of organizations will have Hadoop in production by 2016, according to a recent TDWI survey, and it has become increasingly challenging to access, cleanse, and move these huge volumes of data. How can data scientists and business analysts clean up their data, manage it where it lives, and overcome the big data skills gap? It all comes down to accelerating the data preparation process. SAS® Data Loader leverages years of expertise in data quality and data integration, a simplified user interface, and the parallel processing power of Hadoop to overcome the skills gap and compress the time it takes to prepare data for analytics or visualization. We cover some of the new capabilities of SAS Data Loader for Hadoop, including in-memory Spark processing of data quality and master data management functions, faster profiling and unstructured-data field extraction, and chaining multiple transforms together for improved productivity.
Matthew Magne, SAS
Scott Gidley, SAS
The Add Health Parent Study is using a new and innovative method to augment its other interview verification strategies. Typical verification strategies include calling respondents to ask questions about their interview, recording pieces of the interaction (computer-aided recorded interviewing, or CARI), and analyzing timing data to confirm that each interview took a reasonable length of time. Geocoding adds another tool to the verification toolbox. By applying street-level geocoding to the address where an interview is reported to have been conducted and comparing that location to a latitude/longitude reading captured from a GPS tracking device, we are able to compute the distance between the two points. If that distance is very small and the time stamps are close to each other, the evidence points to the field interviewer being present at the respondent's address during the interview. For our project, the street-level geocoding of an address is done using SAS® PROC GEOCODE. Our paper describes how to obtain a US address database from the SAS website and how to use it in PROC GEOCODE. We also briefly compare this technique to using the Google Maps API with Python as an alternative.
Chris Carson, RTI International
Lilia Filippenko, RTI International
Mai Nguyen, RTI International
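A minimal sketch of the geocoding and distance check described above, assuming the street-level lookup data (for example, the USM data set) has been downloaded from the SAS website into the LOOKUP library; the input data set with its ADDRESS, CITY, STATE, and ZIP variables and the GPS readings are hypothetical.

    proc geocode method=street
                 data=work.interviews out=work.geo
                 lookupstreet=lookup.usm;
    run;

    data work.verify;
       set work.geo;   /* PROC GEOCODE adds X (longitude) and Y (latitude) */
       dist_km = geodist(y, x, gps_lat, gps_long, 'K');  /* kilometers */
       suspect = (dist_km > 0.2);   /* hypothetical 200 m threshold */
    run;

Combined with a comparison of time stamps, the suspect flag isolates the interviews that need follow-up verification.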