Healthcare is weird. Healthcare data is even more so. The digitization of data describing the patient experience is a modern phenomenon, and most healthcare organizations are still early in that journey. While the business of healthcare is already a century old, most organizations have focused their efforts on the financial aspects of healthcare rather than on stakeholder experience or clinical outcomes. Think of the workflow you might have experienced: scheduling an appointment, visiting the doctor, obtaining lab tests, or receiving prescriptions for interventions such as surgery or physical therapy. The modern healthcare system creates a digital footprint of administrative, process, quality, epidemiological, financial, clinical, and outcome measures, which range widely in size, cleanliness, and usefulness. Whether you are new to healthcare data or are looking to advance your knowledge of healthcare data and the techniques used to analyze it, this paper serves as a practical guide to understanding and using healthcare data. We explore common methods for structuring and accessing data, discuss common challenges such as aggregating data into episodes of care, describe how to reverse engineer real-world events from the data, and address the myriad forms of unstructured data found in nursing notes. Finally, we discuss the ethical uses of healthcare data and the limits of informed consent, which are critically important for those of us in analytics.
Greg Nelson, Thotwave Technologies, LLC.
Interactively redeploying SAS® Data Integration Studio jobs can be a slow and tedious process. The updated batch deployment utility gives the ETL Tech Lead a more efficient and repeatable method for administering batch jobs. This improved tool became available in SAS® Data Integration Studio 4.901.
Jeff Dyson, The Financial Risk Group
UNIX and Linux SAS® administrators, have you ever been greeted by one of these statements as you walk into the office, before you have even gotten your first cup of coffee? "Power outage! The SAS servers are down." "I cannot access my reports." Have you frantically tried to restart the SAS servers to avoid lost productivity, missed one of the steps in the process, and caused further delays while other work continued to pile up? If you have had this experience, you understand the benefit to be gained from a utility that automates the management of these multi-tiered deployments. Until recently, there was no method for automatically starting and stopping multi-tiered services in an orchestrated fashion. Instead, you had to use time-consuming manual procedures to manage SAS services. These procedures were also prone to human error, which could result in corrupted services and additional time lost debugging and resolving issues introduced along the way. To address this challenge, SAS Technical Support created the SAS Local Services Management (SAS_lsm) utility, which provides automated, orderly management of your SAS® multi-tiered deployments. This paper demonstrates the deployment and usage of the SAS_lsm utility. Now, go grab a coffee, and let's see how SAS_lsm can make life less chaotic.
Clifford Meyers, SAS
As an information security or data professional, you have seen and heard how advanced analytics has impacted nearly every business domain. You recognize the potential of insights derived from advanced analytics to improve the information security of your organization, and you want to realize these benefits while understanding their pitfalls. To successfully apply advanced analytics to the information security business problem, proper application of data management processes and techniques is of paramount importance. Based on professional services experience in implementing SAS® Cybersecurity, this session teaches you about the data sources used, the activities involved in properly managing this data, and the ways in which these processes address information security business problems. You will come to appreciate that using advanced analytics in the information security domain requires more than just the application of tools or modeling techniques. A data management regime for information security can benefit your organization by providing insights into IT infrastructure, enabling successful data science activities, and providing greater resilience by way of improved information security investigations.
Alex Anglin, SAS
This paper presents considerations for deploying SAS® Foundation across software-defined storage (SDS) infrastructures and within virtualized storage environments. There are many new products on the market that offer easy, point-and-click creation of storage entities with simplified management. Internal storage area network (SAN) virtualization also removes much of the hands-on work of defining storage device pools, and automated tiering software attempts to optimize data placement across performance tiers without manual intervention. Virtual storage provisioning and automated tier placement have many time-saving and management benefits. In some cases, they have also caused serious unintended performance issues with heavy large-block workloads, such as those found in SAS Foundation. You must follow best practices to get the benefit of these new technologies while still maintaining performance. For SDS infrastructures, this paper offers specific considerations for SAS Foundation application performance, workload management and segregation, replication, high availability, and disaster recovery. Architecture and performance ramifications and advice are offered for virtualized and tiered storage systems. General virtual storage pros and cons are also discussed in detail.
Tony Brown, SAS
Margaret Crevar, SAS
The ever-growing volume of data challenges us to keep pace in ensuring that we use it to its full advantage. Unfortunately, our response to new data sources, data types, and applications is often reactionary, driven by the misperception that organizations have precious little time to consider a purposeful strategy without disrupting business continuity. Strategy is a term that is often misused and ill-defined; it is nothing more than a set of integrated choices that help position an initiative for future success. This presentation covers the key elements that define a data strategy, including: What data should we keep or toss? How should we structure data (warehouse versus data lake versus real-time event streaming)? How do we store data (cloud, virtualization, federation, Hadoop)? What approach do we use to integrate and cleanse data (ETL versus cognitive/automated profiling)? How do we protect and share data? These topics help ensure that an organization gets the most value from its data, and they explore how to prioritize and adapt the strategy to meet unanticipated needs in the future. As with any strategy, we need a roadmap for execution, so we also talk specifically about the tools, technologies, methods, and processes that are useful in designing a data strategy that is both relevant and actionable for your organization.
Greg Nelson, Thotwave Technologies, LLC.
This paper demonstrates how to use the capabilities of SAS® Data Integration Studio to extract, load, and transform your data within a Hadoop environment. A sample use case, stepping through the process from source to target, illustrates which transformations can be used in each layer of the ELT process and describes the functionality of each.
Darryl Yewchin, SAS
Todd Foreman, SAS
Session 2001-2017:
Hands-On Workshop: SAS® Data Loader for Hadoop
This workshop provides hands-on experience with some basic functionality of SAS Data Loader for Hadoop. You will learn how to copy data to Hadoop, profile data in Hadoop, and cleanse data in Hadoop.
Kari Richardson, SAS
The clock is ticking! Is your company ready for May 25, 2018, when the General Data Protection Regulation that governs data privacy across Europe comes into force? Companies that fail to comply incur very large fines and might lose customer trust if sensitive information is compromised. With data streaming in from multiple channels in different formats, sizes, and varying quality, it is increasingly difficult to keep track of personal data so that you can protect it. SAS® Data Management helps companies on their journey toward governance and compliance, involving tasks such as detection, quality assurance, and protection of personal data. This paper focuses on using SAS® Federation Server and SAS® Data Management Studio in the SAS® data management suite of products to surface and manage that hard-to-find personal data. SAS Federation Server provides a universal way to access data in Hadoop, Teradata, SQL Server, Oracle, SAP HANA, and other sources without data movement during processing. Its advanced data masking and encryption capabilities can be used when virtualizing data for users, and purpose-built data quality functions perform identification analysis, parsing, matching, and extraction of personal data in real time. We also provide insight into how the exploratory data analysis capability of SAS® Data Management Studio enables you to scan your investigation hub to identify and categorize personal data.
Cecily Hoffritz, SAS
Many SAS® environments are set up for single-byte character sets (SBCS), but many organizations now have to process names of people and companies that contain characters outside that set. You can solve this problem by changing the configuration to the UTF-8 encoding, which is a multi-byte character set (MBCS). However, commonly used text-manipulation functions such as SUBSTR, INDEX, and FIND act on bytes and should no longer be used; SAS has provided new functions (the K-functions) to replace them. Character fields also have to be enlarged to make room for multi-byte characters. This paper describes the problems and gives guidelines for a migration strategy. It also presents code that analyzes existing programs for functions that might cause problems. For those interested, a short historical background and a description of UTF-8 encoding are also provided. Conclusions focus on the positioning of SAS environments configured with UTF-8 versus single-byte encodings, the strategy for organizations faced with a necessary change, and the relevant documentation.
Frank Poppe, PW Consulting
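As a quick illustration of the byte-versus-character issue described in the abstract above, the following sketch (assuming a SAS session already started with UTF-8 encoding; the data values are made up) contrasts the byte-oriented functions with their K-function counterparts:

data _null_;
   name = 'Jürgen Müller';
   first_bytes = substr(name, 1, 4);    /* byte-oriented: can split a multi-byte character */
   first_chars = ksubstr(name, 1, 4);   /* character-oriented: safe under UTF-8            */
   pos_bytes   = index(name, 'M');      /* byte position                                   */
   pos_chars   = kindex(name, 'M');     /* character position                              */
   len_bytes   = length(name);          /* number of bytes                                 */
   len_chars   = klength(name);         /* number of characters                            */
   put first_bytes= first_chars= pos_bytes= pos_chars= len_bytes= len_chars=;
run;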
A growing need in higher education is the more effective use of analytics when evaluating the success of a postsecondary institution. The metrics currently used are simplistic measures of graduation rates, publications, and external funding. These measures offer a limited view of the effectiveness of postsecondary institutions. This paper provides a global perspective of the academic progress of students and the business of higher education. It presents innovative metrics that are more effective in evaluating postsecondary-institutional effectiveness.
Sean Mulvenon, University of Arkansas
Developing software using agile methodologies has become common practice in many organizations. We use the SCRUM methodology to prepare, plan, and implement changes in our analytics environment. Preparing for the deployment of a new release usually took two days of creating packages, promoting them, deploying jobs, creating migration scripts, and correcting errors made in the first attempt; a sprint of 10 working days (two weeks) was effectively reduced to barely seven days of development time. By automating this process, we reduced the time needed to prepare a deployment to less than half a day, increasing the time we can spend developing by 25%. In this paper, we present the process and system prerequisites for automating the deployment process, and we describe the process, code, and scripts required to automate metadata promotion and physical table comparison and update.
Laurent de Walick, PW Consulting
Bas Marsman, NN Bank
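To give a flavor of the physical table comparison step mentioned above, here is a minimal sketch (library and table names are assumptions, not the authors' actual code) that compares the column definitions of a deployed table with the newly promoted definition; any differences signal that a migration script is needed:

proc contents data=prod.customer  out=work.prod_cols(keep=name type length) noprint;
run;
proc contents data=stage.customer out=work.new_cols(keep=name type length) noprint;
run;
proc sort data=work.prod_cols; by name; run;
proc sort data=work.new_cols;  by name; run;

/* report any structural differences between the two column lists */
proc compare base=work.prod_cols compare=work.new_cols listall;
run;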
Geofencing is one of the most promising and exciting concepts to develop with the advent of the Internet of Things. Like John Anderton in the 2002 movie Minority Report, you can now enter a mall and immediately receive commercial ads and offers based on your personal taste and past purchases. Authorities can track vessels' positions and detect when a ship is not where it should be, or they can forecast and optimize harbor arrivals. When a truck driver deviates from the route, the dispatcher can be alerted and act immediately. And there are countless examples from manufacturing, industry, security, and even households. All of these applications are based on the core concept of geofencing: detecting whether a device's position is within a defined geographical boundary. Geofencing requires real-time processing in order to react appropriately. In this session, we explain how to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve high-performance processing, on the order of millions of events per second, over hundreds of millions of geofences.
Frederic Combaneyre, SAS
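The core geofence test referred to above is a point-in-polygon check. The sketch below shows the idea as a standard ray-casting test in a DATA step with made-up coordinates; the paper itself implements this logic on streaming events with SAS Event Stream Processing, which is not shown here.

data _null_;
   array px[4] _temporary_ (0 0 10 10);   /* longitudes of the fence vertices */
   array py[4] _temporary_ (0 10 10 0);   /* latitudes of the fence vertices  */
   lon = 4.5;                             /* device position                  */
   lat = 5.2;
   inside = 0;
   j = 4;                                 /* index of the previous vertex     */
   do i = 1 to 4;
      if (py[i] > lat) ne (py[j] > lat) then do;
         /* longitude where this fence edge crosses the device latitude */
         xcross = px[i] + (lat - py[i]) * (px[j] - px[i]) / (py[j] - py[i]);
         if lon < xcross then inside = not inside;
      end;
      j = i;
   end;
   put inside=;   /* 1 when the device is within the geofence */
run;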
No Batch Scheduler? No problem! This paper describes the use of a SAS® Data Integration Studio job that can be started by a time-dependent scheduler like Windows Scheduler (or crontab in UNIX) to mimic a batch scheduler using SAS® Grid Manager.
Patrick Cuba, Cuba BI Consulting
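One common pattern for this kind of scheduled driver job is to fan work out to the grid with SAS/CONNECT statements. The sketch below is an assumption about that pattern, not the author's actual job; the server name, include paths, and the grdsvc_enable options follow SAS Grid Manager documentation examples and may differ in your deployment.

%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));  /* route sign-ons to the grid */
options autosignon;

rsubmit task1 wait=no;
   %include "/sasjobs/deployed/load_customers.sas";   /* hypothetical deployed job */
endrsubmit;

rsubmit task2 wait=no;
   %include "/sasjobs/deployed/load_orders.sas";      /* hypothetical deployed job */
endrsubmit;

waitfor _all_ task1 task2;   /* wait for both grid sessions to finish */
signoff _all_;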
Making optimal use of SAS® Grid Computing relies on the ability to spread the workload effectively across all of the available nodes. With SAS® Scalable Performance Data Server (SPD Server), it is possible to partition your data and spread the processing across the SAS Grid Computing environment. In an ideal world, it would be possible to adjust the size and number of partitions according to the data volumes being processed on any given day. This paper discusses a technique that enables the grid processing to be dynamically and automatically reconfigured at run time, optimizing the use of SAS Grid Computing and providing significant performance benefits.
Andy Knight, SAS
Bayesian inference has become ubiquitous in applied science because of its flexibility in modeling data and because advances in computation allow simulation methods to obtain sound estimates when exact mathematical approaches are intractable. However, when the sample size is small, the choice of a prior distribution becomes difficult: computationally convenient priors can overstate prior beliefs and bias the estimates. We propose a simple form of prior distribution, a mixture of two uniform distributions, that is weakly informative in that it has a relatively large standard deviation. This choice leads to closed-form expressions for the posterior distribution when the observed data follow a normal, binomial, or Poisson distribution, and the explicit formulas are easily encoded in SAS®. For a small sample size of 10, we illustrate how to elicit the mixture prior and show that the resulting posterior distribution is insensitive to minor misspecification of input values. Weakly informative prior distributions suitable for small sample sizes are easy to specify and appear to provide robust inference.
Robert Lew, U.S. Department of Veterans Affairs
Hongsheng Wu, Wentworth Institute of Technology
Jones Yu, Wentworth Institute of Technology
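To make the closed-form claim above concrete, here is how it plays out in the binomial case, sketched with hypothetical notation (the authors' exact parameterization may differ). With prior

\[
\pi(p) \;=\; \sum_{k=1}^{2} \frac{w_k}{b_k - a_k}\,\mathbf{1}_{[a_k,\,b_k]}(p), \qquad w_1 + w_2 = 1,
\]

and x successes observed in n trials, the posterior is a mixture of truncated Beta(x+1, n-x+1) densities,

\[
\pi(p \mid x) \;=\; \sum_{k=1}^{2} \tilde{w}_k \,
\frac{p^{x}(1-p)^{n-x}\,\mathbf{1}_{[a_k,\,b_k]}(p)}
     {B_{b_k}(x{+}1,\,n{-}x{+}1) - B_{a_k}(x{+}1,\,n{-}x{+}1)},
\qquad
\tilde{w}_k \;\propto\; \frac{w_k}{b_k - a_k}\,\bigl[B_{b_k}(x{+}1,\,n{-}x{+}1) - B_{a_k}(x{+}1,\,n{-}x{+}1)\bigr],
\]

where \(B_z(\alpha,\beta) = \int_0^z t^{\alpha-1}(1-t)^{\beta-1}\,dt\) is the incomplete beta function, which SAS can evaluate as BETA(alpha, beta) * CDF('BETA', z, alpha, beta).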
SAS® Data Integration Studio jobs are not always linear. While Loop transformations have been part of SAS Data Integration Studio for ages, only more recently has it included the Conditional Control transformations to control logic flow within a job. This paper demonstrates the use of both the Loop and Conditional transformations in a real-world example.
Harry Droogendyk, Stratia Consulting Inc
A common issue in data integration is that the documentation and the SAS® data integration job source code often start to diverge and eventually become out of sync. At Capgemini, working for a specific client, we developed a solution to address this challenge: we proposed moving all necessary documentation into the SAS® Data Integration Studio job itself. All documentation then becomes part of the metadata we have created, with the possibility of automatically generating job and release documentation from that metadata. This presentation therefore focuses on the metadata documentation generator. Specifically, it 1) looks at how to use programming and documentation standards in SAS data integration jobs to enable the generation of documentation from the metadata, and 2) shows how the documentation is generated from the metadata and the challenges that were encountered creating the code. I draw on our hands-on experience; Capgemini has implemented this for a customer in the Netherlands, and we are rolling it out as an accelerator in other SAS data integration projects worldwide. I share examples of the generated documentation, which contains functional and technical designs, including a list of all source tables, a list of the target tables, all transformations with their own documentation, job dependencies, and more.
Richard Hogenberg, Capgemini
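The following minimal sketch (not Capgemini's actual generator; the connection options, attribute choices, and data set name are assumptions) shows the kind of metadata query such documentation can be built from, listing each job registered in metadata together with its description:

options metaserver="meta.example.com" metaport=8561
        metauser="sasadm@saspw" metapass="XXXXXXXX"
        metarepository="Foundation";

data job_documentation;
   length uri name desc $256;
   call missing(name, desc);
   query = "omsobj:Job?@Id contains '.'";      /* all Job objects in the repository */
   n = 1;
   nobj = metadata_getnobj(query, n, uri);     /* returns the number of matching objects */
   do while (nobj > 0);
      rc = metadata_getattr(uri, "Name", name);
      rc = metadata_getattr(uri, "Desc", desc);
      output;
      n = n + 1;
      nobj = metadata_getnobj(query, n, uri);
   end;
   keep name desc;
run;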
Session 0911-2017:
Self-Service Data Management for Analytics Users across the Enterprise
With the proliferation of analytics expanding across every function of the enterprise, the need for broader access to data by experienced data scientists and non-technical users to produce reports and do discovery is growing exponentially. The unintended consequence of this trend is a bottleneck within IT to deliver the necessary data while still maintaining the necessary governance and data security standards required to safeguard this critical corporate asset. This presentation illustrates how organizations are solving this challenge and enabling users to both access larger quantities of existing data and add new data to their own models without negatively impacting the quality, security, or cost to store that data. It also highlights some of the cost and performance benefits achieved by enabling self-service data management.
Ken Pikulik, Teradata
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®, gleaned from actual deployments across a variety of implementations and distributions. Topics include tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS components that run inside Hadoop, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Wilbram Hazejager, SAS
Nancy Rausch, SAS
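As one example of pushing work into Hadoop rather than pulling data back to SAS, the sketch below (server, port, schema, and table names are placeholders) uses explicit SQL pass-through so that only the aggregated result returns to the SAS session:

proc sql;
   connect to hadoop (server="hive.example.com" port=10000 schema=sales);
   create table work.sales_by_region as
   select * from connection to hadoop
      ( select region, count(*) as order_count
          from transactions
         group by region );      /* aggregation runs in Hive */
   disconnect from hadoop;
quit;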
Testing is a weak spot in many data warehouse environments. Most testing focuses on the correct implementation of requirements, but due to the complex nature of analytics environments, a change in one data integration process can lead to unexpected results in totally different and untouched areas. We developed a method to identify unexpected changes early and often by running a nightly regression test that performs a full ETL run, compares all output from the test to a baseline, and reports all the changes. This paper describes the process and the SAS® code needed to back up existing data, trigger ETL flows, compare results, and restore the original situation after a nightly regression test. We also discuss the challenges we experienced while implementing the nightly regression test framework.
Laurent de Walick, PW Consulting
Bas Marsman, NN Bank
Stephan Minnaert, PW Consulting
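A minimal sketch of the comparison step described above (library names are assumptions; the actual framework also backs up data and triggers the ETL flows): one PROC COMPARE is generated for every table in the baseline library against the corresponding table produced by the test run.

proc contents data=baseline._all_ out=work.members(keep=memname) noprint;
run;
proc sort data=work.members nodupkey;
   by memname;
run;

data _null_;
   set work.members;
   /* generate one comparison step per baseline table */
   call execute(cats('proc compare base=baseline.', memname,
                     ' compare=testout.', memname,
                     ' listall maxprint=100; run;'));
run;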
Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between SAS® data sets. In projects that span multiple years (for example, longitudinal studies), thousands of new variables are usually introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables, differing only in a digit or two that signifies the year of the project. So every year there is an extensive task of comparing thousands of new variables to older variables in order to carry forward key database elements defined for the previous variables, such as the length of the variable, data type, format, and discrete-or-continuous flag. In our SAS program, hash objects are used to cut down not only the time but also the number of DATA and PROC steps needed to accomplish the task, and the resulting clean, lean code is much easier to understand. A macro creates the data set containing the new and older variables, and for each new variable the FIND method of a hash object is used in a loop to locate the match to the most recent older variable. What previously took around a dozen PROC SQL steps is now a single DATA step using hash tables.
Raghav Adimulam, Westat
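The following sketch (data set and variable names are made up, not the authors' actual code) shows the essence of the hash lookup: last year's variable metadata is loaded into a hash object keyed on the variable stem, and each new variable is matched with the FIND method so its attributes can be carried forward.

data carried_forward;
   if _n_ = 1 then do;
      if 0 then set meta.old_vars;              /* define host variables and types */
      declare hash old(dataset: 'meta.old_vars');
      old.definekey('stem');
      old.definedata('stem', 'varlen', 'vartype', 'varfmt');
      old.definedone();
   end;
   set meta.new_vars;                           /* one row per new variable name   */
   stem = prxchange('s/\d+$//', 1, name);       /* drop the trailing year digits   */
   if old.find() = 0 then output;               /* attributes copied from old year */
run;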
If you run SAS® and Teradata software with default application and database client encodings, some operations with international character sets will appear to work because you are actually treating character strings as streams of bytes instead of streams of characters. As long as no one in the chain of custody tries to interpret the data as anything other than a stream of bytes, the data can sometimes flow in and out of a database without being altered, giving the appearance of correct operation. But when you need to compare a particular character to an international character, or when your data approaches the maximum size of a character field, you will run into trouble. To correctly handle international character sets, every layer of software that touches the data needs to agree on the encoding of the data. UTF-8 encoding is a flexible way to handle many single-byte and multi-byte international character sets. To use UTF-8 encoding with international character sets, the SAS session encoding, Teradata client encoding, and Teradata server encoding must all agree that they are handling UTF-8 encoded data. This paper shows you how to configure SAS and Teradata so that your applications run successfully with international characters.
Greg Otto, Teradata
Salman Maher, SAS
Austin Swift, SAS
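A small sanity check in the spirit of the paper above (the Teradata client and server charset settings are configured outside SAS and are not shown here): confirm that the SAS session itself is running UTF-8 before testing database round trips, and inspect how many bytes an international string actually occupies.

%put NOTE: session encoding is &sysencoding;   /* expect utf-8 */

data _null_;
   name = 'Müller Straße';
   len_bytes = length(name);      /* bytes      */
   len_chars = klength(name);     /* characters */
   put name= len_bytes= len_chars=;
run;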
Data is a valuable corporate asset that, when managed improperly, can detract from a company's ability to achieve strategic goals. At 1-800-Flowers.com, Inc. (18F), we have embarked on a journey toward data governance by embracing Master Data Management (MDM). Along the path, we have recognized that in order to protect and increase the value of our data, we must take data quality into consideration in all aspects of data movement in the organization. This presentation discusses the ways the team at 18F is leveraging SAS® Data Management to create and enhance our data quality strategy and ensure data quality for MDM.
Brian Smith, 1800Flowers.com
The latest releases of SAS® Data Management software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud data sources, RDBMS, files, unstructured data, and streaming, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS