In the clinical research world, data accuracy plays a significant role in delivering quality results. Various validation methods are available to confirm data accuracy. Of these, double programming is the most highly recommended and commonly used method to demonstrate a perfect match between production and validation output. PROC COMPARE is one of the SAS® procedures used to compare two data sets and confirm their accuracy. In current practice, whenever a program rerun happens, the programmer must manually review the output file to ensure an exact match. This is tedious, time-consuming, and error-prone because there are many LST files to review and manual intervention is required. The proposed approach programmatically validates the output of PROC COMPARE in all the programs and generates an HTML output file with a Pass/Fail status flag for each output file, along with the following: 1. Data set name, label, and number of variables and observations 2. Number of observations not in both compared and base data sets 3. Number of variables not in both compared and base data sets The status is flagged as Pass whenever the output file meets the following 3N criteria: 1. NOTE: No unequal values were found. All values compared are exactly equal 2. Number of observations in base and compared data sets are equal 3. Number of variables in base and compared data sets are equal The SAS® macro %threeN validates all the output files generated by PROC COMPARE quickly and accurately, reducing up to 90% of the time spent on the manual review of PROC COMPARE output files.
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
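As a hedged illustration of the kind of automated check the abstract above describes (not the %threeN macro itself), the automatic macro variable SYSINFO set by PROC COMPARE can be inspected programmatically; the data set names below are placeholders:

proc compare base=prod.adsl compare=qc.adsl listall;
run;
%let rc = &sysinfo;   /* capture immediately: 0 means an exact match */

data _null_;
   status = ifc(&rc = 0, 'Pass', 'Fail');
   put "PROC COMPARE validation status: " status;
run;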
There are currently thousands of Jamaican citizens who lack access to basic health care. In order to improve the health-care system, I collect and analyze data from two clinics in remote locations of the island. This report analyzes data collected from Clarendon Parish, Jamaica. In order to create a descriptive analysis, I use SAS® Studio 9.4. The procedures I use include PROC IMPORT, PROC MEANS, PROC FREQ, and PROC GCHART. After conducting the aforementioned procedures, I am able to produce a descriptive analysis of the health issues plaguing the island.
Verlin Joseph, Florida A&M University
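As a hedged sketch of the descriptive steps listed in the abstract above (the file path and variable names are assumptions):

proc import datafile='/data/clarendon_clinic.csv' out=clinic dbms=csv replace;
run;

proc means data=clinic n mean std;
   var age systolic_bp;       /* example numeric measures */
run;

proc freq data=clinic;
   tables diagnosis / nocum;  /* frequency of reported health issues */
run;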
Disability rights groups are using the court system to exact change. As a result, the enforcement of Section 508 and similar laws around the world has become a priority. When you generate SAS® output, you need to protect your organization from litigation. This paper describes concrete steps that help you use SAS® 9.4 Output Delivery System (ODS) to create SAS output that complies with accessibility standards. It also provides recommendations and code samples that are aligned with the accessibility standards defined by Section 508 and the Web Content Accessibility Guidelines (WCAG 2.0).
Glen Walker, SAS
Merging or appending SAS® data sets are common tasks that we need to perform. During this process, we sometimes get error or warning messages saying that the same fields in different SAS data sets have different lengths or different types. If the problems involve a lot of fields and data sets, we need to spend a lot of time identifying those fields and writing extra SAS code to solve the issues. However, the macro in this paper can help you identify the fields that have inconsistent data type or length issues. It also solves the length issues automatically by finding the maximum field length among the current data sets and assigning that length to the field. An HTML report is generated after running the macro that lists which fields' lengths have been changed and which fields have inconsistent data type issues.
Ting Sa, Cincinnati Children's Hospital Medical Center
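As a hedged sketch of the detection step described above (the macro itself also repairs the lengths), DICTIONARY.COLUMNS can be queried for variables whose type or length differs across the data sets in a library; the library name WORKLIB is an assumption:

proc sql;
   create table inconsistent as
   select name,
          count(distinct type)   as n_types,
          count(distinct length) as n_lengths,
          max(length)            as max_length   /* candidate common length */
   from dictionary.columns
   where libname = 'WORKLIB'
   group by name
   having count(distinct type) > 1 or count(distinct length) > 1;
quit;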
Although rapid development of information technologies in the past decade has provided forecasters with both huge amounts of data and massive computing capabilities, these advancements do not necessarily translate into better forecasts. Different industries and products have unique demand patterns. There is not yet a one-size-fits-all forecasting model or technique. A good forecasting model must be tailored to the data in order to capture the salient features and satisfy the business needs. This paper presents a multistage modeling strategy for demand forecasting. The proposed strategy, which has no restrictions on the forecasting techniques or models, provides a general framework for building a forecasting system in multiple stages.
Pu Wang, SAS
A useful and often overlooked tool released with the SAS® Business Intelligence suite to aid in ETL is SAS® Data Integration Studio. This product gives users the ability to extract, transform, join, and load data from various database management systems (DBMSs), data marts, and other data stores by using a graphical interface and without having to code different credentials for each schema. It enables seamless promotion of code to a production system without the need to alter the code. And it is quite useful for deploying and scheduling jobs by using the schedule manager in SAS® Management Console, because all code created by Data Integration Studio is optimized. Although this tool enables users to create code from scratch, one of its most useful capabilities is that it can take legacy SAS® code and, with minimal alteration, create its data associations and give it all the properties of a job coded from scratch.
Erik Larsen, Independent Consultant
Two of the powerful features of the ODS Graphics procedures are the ability to create forest plots and the ability to add inner margin tables to graphics output. The drawback, however, is that the PROC TEMPLATE syntax this requires of the programmer is complex and tedious. A prompted application, or even a parameterized stored process, that connects PROC TEMPLATE code to a point-and-click application definitely makes life easier for coders in many industries who frequently create these types of graphic output.
Ted Durie, SAS
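While the paper works with PROC TEMPLATE, a hedged, simplified sketch of a forest-style display can also be built with PROC SGPLOT; the data set and variable names (study, oddsratio, lcl, ucl) are assumptions:

proc sgplot data=studies noautolegend;
   scatter y=study x=oddsratio / xerrorlower=lcl xerrorupper=ucl
                                 markerattrs=(symbol=squarefilled);
   refline 1 / axis=x lineattrs=(pattern=shortdash);
   xaxis type=log label='Odds Ratio (95% CI)';
   yaxis discreteorder=data label=' ';
run;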
Although the statistical foundations of predictive analytics have large overlaps, the business objectives, data availability, and regulations are different across the property and casualty insurance, life insurance, banking, pharmaceutical, and genetics industries. A common process in property and casualty insurance companies with large data sets is introduced, including data acquisition, data preparation, variable creation, variable selection, model building (also known as fitting), model validation, and model testing. Variable selection and model validation stages are described in more detail. Some successful models in the insurance companies are introduced. Base SAS®, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process.
Mei Najim, Sedgwick
With the advent of the exciting new hybrid field of Data Science, programming and data management skills are in greater demand than ever and have never been easier to attain. Online resources like Codecademy and W3Schools offer a host of tutorials and assistance to those looking to develop their programming abilities and knowledge. Though their content is limited to languages and tools suited mostly for web developers, the value and quality of these sites are undeniable. To this end, similar tutorials for other free-to-use software applications are springing up. The interactivity of these tutorials elevates them above most, if not all, other out-of-classroom learning tools. The process of learning programming or a new language can be quite disjointed when trying to pair a textbook or similar walk-through material with matching coding tasks and problems. These sites unify these pieces for users by presenting them with a series of short, simple lessons that always require the user to demonstrate their understanding in a coding exercise before progressing. After teaching SAS® in a classroom environment, I became fascinated by the potential for a similar student-driven approach to learning SAS. This could afford me more time to provide individualized attention, as well as open up additional class time for more advanced topics. In this talk, I discuss my development of a series of SAS scripts that walk the user through learning the basics of SAS and that involve programming at every step of the process. This collection of scripts should serve as a self-contained, pseudo-interactive course in SAS basics that students could be asked to complete on their own in a few weeks, leaving the remainder of the term to be spent on more challenging, realistic tasks.
Hunter Glanz, California Polytechnic State University
With increasing data needs, it becomes more and more unwieldy to ensure that all scheduled jobs are running successfully and on time. Worse, maintaining your reputation as an information provider becomes a precarious prospect as the likelihood increases that your customers alert you to reporting issues before you are even aware yourself. By combining various SAS® capabilities and tying them together with concise visualizations, it is possible to track jobs actively and alert customers of issues before they become a problem. This paper introduces a report tracking framework that helps achieve this goal and improve customer satisfaction. The report tracking starts by obtaining table and job statuses and then color-codes them by severity levels. Based on the job status, it then goes deeper into the log to search for potential errors or other relevant information to help assess the processes. The last step is to send proactive alerts to users informing them of possible delays or potential data issues.
Jia Heng, Wyndham Destination Network
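As a hedged sketch of the log-scanning and alerting steps described above (the log path, addresses, and mail setup are assumptions, and FILENAME EMAIL requires the EMAILSYS/EMAILHOST system options to be configured):

data errors;
   infile '/batch/logs/nightly_job.log' truncover;
   input logline $200.;
   if upcase(logline) =: 'ERROR' then output;   /* keep lines that start with ERROR */
run;

data _null_;
   if 0 then set errors nobs=n;
   call symputx('nerr', n);
   stop;
run;

%macro alert;
   %if &nerr > 0 %then %do;
      filename mail email to='ops-team@example.com'
                          subject="Nightly job: &nerr ERROR line(s) found";
      data _null_;
         file mail;
         set errors;
         put logline;
      run;
   %end;
%mend alert;
%alert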
The new and highly anticipated SAS® Output Delivery System (ODS) destination for Microsoft Excel is finally here! Available as a production feature in the third maintenance release of SAS® 9.4 (TS1M3), this new destination generates native Excel (XLSX) files that are compatible with Microsoft Office 2010 or later. This paper is written for anyone, from entry-level programmers to business analysts, who uses the SAS® System and Microsoft Excel to create reports. The discussion covers features and benefits of the new Excel destination, differences between the Excel destination and the older ExcelXP tagset, and functionality that exists in the ExcelXP tagset that is not available in the Excel destination. These topics are all illustrated with meaningful examples. The paper also explains how you can bridge the gap that exists as a result of differences in the functionality between the destination and the tagset. In addition, the discussion outlines when it is beneficial for you to use the Excel destination versus the ExcelXP tagset, and vice versa. After reading this paper, you should be able to make an informed decision about which tool best meets your needs.
Chevell Parker, SAS
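As a hedged, minimal example of the new destination discussed above (the output path is an assumption):

ods excel file='/reports/class_report.xlsx'
          options(sheet_name='Class Listing' embedded_titles='yes');
title 'Student Listing';
proc print data=sashelp.class noobs;
run;
ods excel close;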
This paper presents a SAS® macro for generating random numbers from skew-normal and skew-t distributions, as well as the quantiles of these distributions. The results are similar to those generated by the sn package of the R software.
Alan Silva, University of Brasilia
Paulo Henrique Dourado da Silva, Banco do Brasil
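As a hedged sketch of one standard way to simulate skew-normal variates (the Azzalini representation; the shape parameter value and sample size are assumptions, and the paper's macro may use a different construction):

%let alpha = 4;                                    /* shape parameter */
data skewnorm;
   call streaminit(2016);
   delta = &alpha / sqrt(1 + &alpha**2);
   do i = 1 to 10000;
      u0 = rand('normal');
      u1 = rand('normal');
      z  = delta*abs(u0) + sqrt(1 - delta**2)*u1;  /* z ~ SN(0, 1, alpha) */
      output;
   end;
   keep z;
run;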
This work presents a macro for generating a ternary graph that can be used to solve a problem in the ceramics industry. The ceramics industry uses mixtures of clay types to generate a product with good properties that can be assessed according to breakdown pressure after incineration, porosity, and water absorption. Beyond these properties, the industry is concerned with managing the geological reserves of each type of clay. Thus, it is important to seek alternative compositions that present properties similar to the default mixture. This can be done by analyzing the surface response of these properties according to the clay composition, which is easily done in a ternary graph. SAS® documentation does not describe how to adjust an analysis grid in nonrectangular forms on graphs. A triaxial macro, however, can generate ternary graphical analyses by creating a special grid, using an annotated data set, and applying linear transformations to the original data. The GCONTOUR procedure is used to generate the three-dimensional analysis.
Igor Nascimento, UNB
Zacarias Linhares, IFPI
Roberto Soares, IFPI
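As a hedged sketch of the linear transformation underlying a ternary plot (placing a composition (a, b, c) inside an equilateral triangle with unit sides; the input data set and variable names are assumptions):

data ternary_xy;
   set claymix;                       /* a, b, c: proportions of the three clays */
   total = a + b + c;                 /* normalize so the composition sums to 1  */
   x = (b + c/2) / total;
   y = (c * sqrt(3)/2) / total;       /* vertex C sits at (0.5, sqrt(3)/2)       */
run;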
Geographically Weighted Negative Binomial Regression (GWNBR) was developed by Silva and Rodrigues (2014). It is a generalization of the Geographically Weighted Poisson Regression (GWPR) proposed by Nakaya and others (2005) and of the Poisson and negative binomial regressions. This paper presents a SAS® macro, written in SAS/IML® software, to estimate the GWNBR model and shows how to use the GMAP procedure to draw the maps.
Alan Silva, University of Brasilia
Thais Rodrigues, University of Brasilia
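As a hedged sketch of one ingredient of any geographically weighted model, a Gaussian kernel weight matrix computed in SAS/IML® (the coordinate data set and bandwidth are assumptions; the paper's macro may use a different kernel or an adaptive bandwidth):

proc iml;
   use coords;  read all var {xcoord ycoord} into s;  close coords;
   n = nrow(s);
   h = 50;                                              /* bandwidth */
   W = j(n, n, 0);
   do i = 1 to n;
      d = sqrt((s[,1]-s[i,1])##2 + (s[,2]-s[i,2])##2);  /* distances to location i */
      W[i,] = t(exp(-0.5 # (d/h)##2));                  /* Gaussian kernel weights */
   end;
quit;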
It is common for hundreds or even thousands of clinical endpoints to be collected from individual subjects, but events from the majority of clinical endpoints are rare. The challenge of analyzing high dimensional sparse data is in balancing analytical consideration for statistical inference and clinical interpretation with precise meaningful outcomes of interest at intra-categorical and inter-categorical levels. Lumping or grouping similar rare events into a composite category has the statistical advantage of increasing testing power, reducing multiplicity size, and avoiding competing-risk problems. However, too much or inappropriate lumping would jeopardize the clinical meaning of interest and external validity, whereas splitting, or keeping each individual event at its basic clinically meaningful category, can overcome the drawbacks of lumping. This practice might create analytical issues of increasing type II errors, multiplicity size, competing risks, and having a large proportion of endpoints with rare events. It seems that lumping and splitting are diametrically opposite approaches, but in fact, they are complementary. Both are essential for high dimensional data analysis. This paper describes the steps required for the lumping and splitting analysis and presents SAS® code that can be used to implement each step.
Shirley Lu, VAMHCS, Cooperative Studies Program
This paper presents how Norway, the world's second-largest seafood-exporting country, shares valuable seafood insight using the SAS® Visual Analytics Designer. Three complementary data sources (trade statistics, consumption panel data, and consumer survey data) are used to strengthen the knowledge and understanding about the important markets for seafood, which is a potential competitive advantage for the Norwegian seafood industry. The need for information varies across users, and as the amount of data available grows, the challenge is to make the information available for everyone, everywhere, at any time. Some users are interested in only the latest trade developments, while others working with product innovation are in need of deeper consumer insights. Some have quite advanced analytical skills, while others do not. Thus, one of the most important things is to make the information understandable for everyone, and at the same time provide in-depth insights for the advanced user. SAS Visual Analytics Designer makes it possible to provide both basic reporting and more in-depth analyses on trends and relationships to cover the various needs. This paper demonstrates how the functionality in SAS Visual Analytics Designer is fully used for this purpose, and presents how data from different sources is visualized in SAS Visual Analytics Designer reports located in the SAS® Information Delivery Portal. The main challenges and suggestions for improvements that have been uncovered during the process are also presented in this paper.
Kia Uuskartano, Norwegian Seafood Council
Tor Erik Somby, Norwegian Seafood Council
This paper demonstrates how to use the ODS destination for PowerPoint to create attractive presentations from your SAS® output. Packed with examples, this paper gives you a behind-the-scenes tour of how ODS creates Microsoft PowerPoint presentations. You get an in-depth look at how to customize the ODS PowerPoint style templates that control the appearance of your presentation. With this information you can quickly turn your SAS output into an engaging and informative presentation. This paper is a follow-on to the SAS® Global Forum 2013 paper "A First Look at the ODS Destination for PowerPoint."
Tim Hunter, SAS
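As a hedged, minimal example of the destination discussed above (the output path and style choice are assumptions):

ods powerpoint file='/reports/summary.pptx' style=PowerPointLight;
title 'Average City Mileage by Vehicle Type';
proc sgplot data=sashelp.cars;
   vbar type / response=mpg_city stat=mean;
run;
ods powerpoint close;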
Longitudinal and repeated measures data are seen in nearly all fields of analysis. Examples of such data include weekly lab test results of patients or performance on test scores by children from the same class. Statistics students and analysts alike might be overwhelmed when it comes to repeated measures or longitudinal data analyses. They might try to educate themselves by diving into text books or taking semester-long or intensive weekend courses, resulting in even more confusion. Some might try to ignore the repeated nature of data and take shortcuts such as analyzing all data as independent observations or analyzing summary statistics such as averages or changes from first to last points and ignoring all the data in-between. This hands-on presentation introduces longitudinal and repeated measures analyses without heavy emphasis on theory. Students in the workshop will have the opportunity to get hands-on experience graphing longitudinal and repeated measures data. They will learn how to approach these analyses with tools like PROC MIXED and PROC GENMOD. Emphasis will be on continuous outcomes, but categorical outcomes will briefly be covered.
Leanne Goldstein, City of Hope
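As a hedged sketch of a repeated measures analysis of the kind introduced in the workshop (the data set and variable names are assumptions; one row per subject per visit):

proc mixed data=labs;
   class subject treatment visit;
   model result = treatment visit treatment*visit / solution;
   repeated visit / subject=subject type=cs;   /* compound-symmetry covariance */
run;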
Amazon Web Services (AWS) as a platform for analytics and data warehousing has gained significant adoption over the years. With SAS® Visual Analytics being one of the preferred tools for data visualization and analytics, it is imperative to be able to deploy SAS Visual Analytics on AWS. This ensures swift analysis and reporting on large amounts of data with SAS Visual Analytics by minimizing the movement of data across environments. This paper focuses on installing SAS Visual Analytics 7.2 in an Amazon Web Services environment, migration of metadata objects and content from previous versions to the deployment on the cloud, and ensuring data security.
Vimal Raj Arockiasamy, Kavi Associates
Rajesh Inbasekaran, Kavi Associates
SAS® functions provide amazing power to your DATA step programming. Some of these functions are essential--they save you from writing volumes of unnecessary code. This talk covers a number of the most useful SAS functions. Some may be new to you, and they will change the way you program and approach common programming tasks. The majority of the functions discussed in this talk work with character data. Some functions search for strings, and others find and replace strings or join strings together. Still others measure the spelling distance between two strings (useful for fuzzy matching). Some of the newest and most amazing functions are not functions at all, but call routines. Did you know that you can sort values within an observation? Did you know that you can identify not only the largest or smallest value in a list of variables, but the second- or third- or nth-largest or smallest value? Knowledge of the functions described here will make you a much better SAS programmer.
Ron Cody, Camp Verde Associates
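As a hedged illustration of a few of the functions and call routines mentioned above:

data demo;
   x1=7; x2=3; x3=9; x4=1; x5=5;
   second_largest = largest(2, of x1-x5);    /* 7: the second-largest value  */
   call sortn(of x1-x5);                     /* sorts values within the row  */
   distance = complev('MILLER', 'MILNER');   /* spelling (edit) distance = 1 */
run;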
Government funding is a highly sought-after financing medium by entrepreneurs and researchers aspiring to make a breakthrough in the market and compete with larger organizations. The funding is channeled via federal agencies that seek out promising research and products that improve the extant work. This study analyzes the project abstracts that earned government funding over the past three and a half decades in the fields of Defense and Health and Human Services through a text analytic approach. This helps us understand the factors that might be responsible for awards made to a project. We collected 100,279 records from the Small Business Innovation and Research (SBIR) website (https://www.sbir.gov). Initially, we analyze the trends in government funding of research by small businesses. Then we perform text mining on the 55,791 abstracts of the projects that are funded by the Department of Defense and the Department of Health and Human Services. From the text mining, we portray the key topics and patterns related to the past research and determine the changes in their trends over time. Through our study, we highlight the research trends to enable organizations to shape their business strategy and make better investments in future research.
Sairam Tanguturi, Oklahoma State University
Sagar Rudrake, Oklahoma State University
Nithish Reddy Yeduguri, Oklahoma State University
The Waze application, purchased by Google in 2013, alerts millions of users about traffic congestion, collisions, construction, and other complexities of the road that can stymie motorists' attempts to get from A to B. From jackknifed rigs to jackalope carcasses, roads can be gnarled by gridlock or littered with obstacles that impede traffic flow and efficiency. Waze algorithms automatically reroute users to more efficient routes based on user-reported events as well as on historical norms that demonstrate typical road conditions. Extract, transform, load (ETL) infrastructures often represent serialized process flows that can mimic highways and that can become similarly snarled by locked data sets, slow processes, and other factors that introduce inefficiency. The LOCKITDOWN SAS® macro, introduced at the Western Users of SAS® Software Conference 2014, detects and prevents data access collisions that occur when two or more SAS processes or users simultaneously attempt to access the same SAS data set. Moreover, the LOCKANDTRACK macro, introduced at the conference in 2015, provides real-time tracking of and historical performance metrics for locked data sets through a unified control table, enabling developers to hone processes in order to optimize efficiency and data throughput. This paper demonstrates the implementation of LOCKANDTRACK and its lock performance metrics to create data-driven, fuzzy logic algorithms that preemptively reroute program flow around inaccessible data sets. Thus, rather than needlessly waiting for a data set to become available or for a process to complete, the software actually anticipates the wait time based on historical norms, performs other (independent) functions, and returns to the original process when it becomes available.
Troy Hughes, Datmesis Analytics
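As a hedged sketch of the basic lock test that the macros described above build on (LOCKITDOWN and LOCKANDTRACK add tracking, metrics, and rerouting well beyond this; the data set name is a placeholder):

%macro try_lock(ds);
   lock &ds;
   %if &syslckrc = 0 %then %do;
      %put NOTE: &ds is available - proceeding.;
      lock &ds clear;
   %end;
   %else %put WARNING: &ds is locked by another process - deferring.;
%mend try_lock;
%try_lock(mylib.transactions)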
Although business intelligence experts agree that empowering businesses through a well-constructed semantic layer has undisputed benefits, a successful implementation has always been a formidable challenge. This presentation highlights the best practices to follow and mistakes to avoid, leading to a successful semantic layer implementation by using SAS® Visual Analytics. A correctly implemented semantic layer provides business users with quick and easy access to information for analytical and fact-based decision-making. Today, everyone talks about how the modern data platform enables businesses to store and analyze big data, but we still see most businesses trying to generate value from the data that they already store. From self-service to data visualization, business intelligence and descriptive analytics are still the key requirements for any business, and we discuss how to use SAS Visual Analytics to address them all. We also describe the key considerations in strategy, people, process, and data for a successful semantic layer rollout that uses SAS Visual Analytics.
Arun Sugumar, Kavi Associates
Vignesh Balasubramanian, Kavi Global
Harsh Sharma, Kavi Global
In health sciences and biomedical research we often search for scientific journal abstracts and publications within MEDLINE, which is a suite of indexed databases developed and maintained by the National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM). PubMed is a free search engine within MEDLINE and has become one of the standard databases to search for scientific abstracts. Entrez is the information retrieval system that gives you direct access to the 40 databases with over 1.3 billion records within the NCBI. You can access these records by using the eight e-utilities (einfo, esearch, esummary, efetch, elink, espell, epost, and egquery), which are the NCBI application programming interfaces (APIs). In this paper I will focus on using three of the e-utilities to retrieve data from PubMed, as opposed to the standard method of manually searching and exporting the information from the PubMed website. I will demonstrate how to use the SAS HTTP procedure along with the XML mapper to develop a SAS® macro that generates all the required data from PubMed based on search term parameters. Using SAS to extract information from PubMed searches and save it directly into a data set allows the process to be automated (which is good for running routine updates) and eliminates manual searching and exporting from PubMed.
Craig Hansen, South Australian Health and Medical Research Institute
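As a hedged, minimal example of calling the esearch e-utility with PROC HTTP (the search term and output path are assumptions; the returned XML would then be read with the XMLV2 LIBNAME engine and a map built in SAS XML Mapper, as the paper describes):

filename resp '/tmp/pubmed_esearch.xml';
proc http
   url='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=asthma&retmax=20'
   method='GET'
   out=resp;
run;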
The presentation illustrates techniques and technology to manage complex and large-scale ETL workloads using SAS® and IBM Platform LSF to provide greater control of flow triggering and more complex triggering logic based on a rules-engine approach that parses pending workloads and current activity. Key techniques, configuration steps, and design and development patterns are demonstrated. Pros and cons of the techniques that are used are discussed. Benefits include increased throughput, reduced support effort, and the ability to support more sophisticated inter-flow dependencies and conflicts.
Angus Looney, Capgemini
Since it was first released three years ago, SAS® Environment Manager has been used widely in many customers' production environments. Customers are now familiar with the basic functions of the product. They are asking for more information about advanced topics such as how to manage users, roles, and permissions; how to secure the product; and how to interact with their existing system management tools in their enterprise environment. This paper addresses those advanced but practical issues from a real customer's perspective. The paper first briefly lists what's new in the most current release of SAS Environment Manager. It then discusses the new user management design introduced in the third maintenance release of SAS 9.4. We explain how that design addresses customers' concerns and the best practices for using that design and for setting up user roles and permissions. We also discuss the new process and best practices for configuring SSL for SAS Environment Manager. More and more customers have expressed an interest in integrating SAS Environment Manager with their own system management tools. Our discussion ends with explaining how customers can do that through the SNMP interface built into SAS Environment Manager.
Zhiyong Li, SAS
Gilles Chrzaszcz, SAS Institute
Eandis is a rapidly growing energy distribution grid operator in the heart of Europe, with requirements to manage power distribution on behalf of 229 municipalities in Belgium. With a legacy SAP data warehouse and other diverse data sources, business leaders at Eandis faced challenges with timely analysis of key issues such as power quality, investment planning, and asset management. To face those challenges, a new agile way of thinking about Business Intelligence (BI) was necessary. A sandbox environment was introduced where business key-users could explore and manipulate data. It allowed them to have approachable analytics and to build prototypes. Many pitfalls appeared and the greatest challenge was the change in mindset for both IT and business users. This presentation addresses those issues and possible solutions.
Olivier Goethals, Eandis
For far too long, anti-money laundering and terrorist financing solutions have forced analysts to wade through oceans of transactions and alerted work items (alerts). Alert-centered analysis is both ineffective and costly. The goal of an anti-money laundering program is to reduce risk for your financial institution, and to do this most effectively, you must start with analysis at the customer level, rather than simply troll through volumes of alerts and transactions. In this session, discover how a customer-centric approach leads to increased analyst efficiency and streamlined investigations. Rather than starting with alerts and transactions, starting with a customer-centric view allows your analysts to rapidly triage suspicious activities, prioritize work, and quickly move into investigating the highest risk customer activities.
Kathy Hart, SAS
Over the last few years, both Microsoft Excel file formats and the SAS® interfaces to those Excel formats have changed. SAS® has worked hard to make the interface between the two systems easier to use. The interfaces have progressed from comma-separated values (CSV) files to PROC IMPORT and PROC EXPORT, LIBNAME processing, SQL processing, SAS® Enterprise Guide®, JMP®, and then on to the HTML and XML tagsets like MSOFFICE2K and EXCELXP. Well, there is now a new entry into the processes available for SAS users to send data directly to Excel. This new entry into the ODS arena of data transfer to Excel is the ODS destination called EXCEL. This process is included within SAS ODS and produces native-format Excel files for version 2007 of Excel and later. It was first shipped as an experimental version with the first maintenance release of SAS® 9.4. This ODS destination has many features similar to the EXCELXP tagset.
William E Benjamin Jr, Owl Computer Consultancy LLC
In an effort to increase transparency and accountability in the US health care system, the Obama administration mandated the Centers for Medicare & Medicaid Services (CMS) to make data available for use by researchers and interested parties from the general public. Among the more well-known uses of this data are analyses published by the Wall Street Journal showing a large, and in some cases shocking, discrepancy between what hospitals potentially charge the uninsured and what they are paid by Medicare for the same procedure. Analyses such as these highlight both potential inequities in the US health care system and, more importantly, potential opportunities for its reform. However, while capturing the public imagination, analyses such as these are but one means to capitalize on the remarkable wealth of information this data provides. Specifically, the publicly distributed CMS data can help both researchers and the public better understand the burden specific conditions and medical treatments place on the US health care system. It was this simple, but important, objective that motivated the present study. Our specific analyses focus on two of what we believe to be important questions. First, using the total number of hospital discharges as a proxy for incidence of a condition or treatment, which conditions or treatments have the highest incidence rates nationally? Does their incidence remain stable, or is it increasing or decreasing? And is there variability in these incidence rates across states? Second, as psychologists, we are necessarily interested in understanding the state of mental health care. To date, and to the best of our knowledge, there has been no study utilizing the public inpatient Medicare provider utilization and payment data set to explore the utilization of mental illness services funded by Medicare.
Joo Ann Lee, York University
Michael Friendly, York University
Cathy Labrish, York University
In the collections department of a well-recognized Colombian financial institution, there was no methodology that provided an optimal number of collection agents to improve the collection task and make it possible for more customers to commit to the minimum monthly payment of their debt. The objective of this paper is to apply the Data Envelopment Analysis (DEA) optimization methodology to determine the optimal number of agents to maximize the monthly collection in the bank. We show that the results can have a positive impact on the credit portfolio behavior and reduce the collection management cost. The DEA optimization methodology has been successfully used in various fields to solve multi-criteria optimization problems, but it is not commonly used in the financial sector, mostly because this methodology requires specialized software, such as SAS® Enterprise Guide®. In this paper, we present PROC OPTMODEL, show how to formulate the optimization problem, program the SAS® code, and adequately process the available data.
Miguel Díaz, Scotiabank - Colpatria
Oscar Javier Cortés Arrigui, Scotiabank - Colpatria
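As a hedged sketch of an input-oriented CCR DEA model in PROC OPTMODEL (the data values, the single input and output, and the unit names are all illustrative assumptions; the paper's formulation may differ):

proc optmodel;
   set DMUS = 1..3;                       /* three collection units          */
   num agents{DMUS}    = [14 20 11];      /* input: number of agents         */
   num recovered{DMUS} = [80 95 60];      /* output: amount collected        */
   num target init 1;                     /* unit currently being evaluated  */
   var lambda{DMUS} >= 0;
   var theta >= 0;
   min Efficiency = theta;
   con UseInput:   sum{j in DMUS} lambda[j]*agents[j]    <= theta*agents[target];
   con MeetOutput: sum{j in DMUS} lambda[j]*recovered[j] >= recovered[target];
   for {t in DMUS} do;
      target = t;
      solve;
      print target theta;
   end;
quit;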
Currently Colpatria, as a part of Scotiabank in Colombia, has several methodologies that enable us to have a vision of the customer from a risk perspective. However, the current trend in the financial sector is to have a global vision that involves aspects of risk as well as profitability and utility. As part of the business strategies to develop cross-selling and customer profitability under risk conditions, it is necessary to create a customer value index to score customers according to different groups of key business variables that describe the profitability and risk of each customer. In order to generate the customer value index, we propose to construct a synthetic index using principal component analysis and multiple factor analysis.
Ivan Atehortua, Colpatria
Diana Flórez, Colpatria
In the regulatory world of patient safety and pharmacovigilance, whether during clinical trials or post-market surveillance, serious adverse events (SAEs) that affect participants must be collected and, if certain criteria are met, reported to the FDA and other regulatory authorities. SAEs are often entered into multiple databases by various users, resulting in possible data discrepancies and quality loss. Efforts have been made to reconcile the SAE data between databases, but there is no industry standard regarding the methodology or tool employed for this task. Some organizations still reconcile the data manually, with visual inspections and vocal verification. Not only is this laborious and error-prone, it becomes prohibitive when the data reach hundreds of records. We devised an efficient algorithm using SAS® to compare two data sources automatically. Our algorithm identifies matched, discrepant, and unpaired SAE records. Additionally, it employs a user-supplied list of synonyms to find non-identical but relevant matches. First, the two data sources are combined and sorted by key fields such as Subject ID, Onset Date, Stop Date, and Event Term. Record counts and Levenshtein edit distances are calculated within certain groups to assist with sorting and matching. This combined record list is then fed into a DATA step to decide whether a record is paired or unpaired. For an unpaired record, a stub record with all fields set to '?' is generated as a matching placeholder. Each record is written to one of two data sets. Later, the data sets are tagged and pulled into a comparison logic using hash objects, which enables field-by-field comparison and displays discrepancies in a clean format for each field. Identical fields or columns are cleared or removed for clarity. The result is a streamlined and user-friendly process that allows for fast and easy SAE reconciliation.
Zhou (Tom) Hui, AstraZeneca
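As a hedged sketch of the fuzzy term-matching idea mentioned above (the data set and variable names and the edit-distance threshold are assumptions; the paper's algorithm also uses synonym lists, stub records, and hash-object comparison):

data matched;
   merge clindb(rename=(aeterm=term_clin)) safetydb(rename=(aeterm=term_safe));
   by subjid onsetdt;
   editdist = complev(upcase(term_clin), upcase(term_safe));   /* Levenshtein distance */
   if editdist <= 3 then matchflag = 'Likely match';
   else matchflag = 'Discrepant';
run;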
This paper presents supervised and unsupervised pattern recognition techniques that use Base SAS® and SAS® Enterprise Miner™ software. A simple preprocessing technique creates many small image patches from larger images. These patches encourage the learned patterns to have local scale, which follows well-known statistical properties of natural images. In addition, these patches reduce the number of features that are required to represent an image and can decrease the training time that algorithms need in order to learn from the images. If a training label is available, a classifier is trained to identify patches of interest. In the unsupervised case, a stacked autoencoder network is used to generate a dictionary of representative patches, which can be used to locate areas of interest in new images. This technique can be applied to pattern recognition problems in general, and this paper presents examples from the oil and gas industry and from a solar power forecasting application.
Patrick Hall, SAS
Alex Chien, SAS Institute
Ilknur Kabul, SAS Institute
Jorge Silva, SAS Institute
At Royal Bank of Scotland, business intelligence users require sophisticated security permissions both at object level and data (row) level in order to comply with data security, audit, and regulatory requirements. When we rolled out SAS® Visual Analytics to our two main stakeholder groups, this was identified as a key requirement as data is no longer restricted to the desktop but is increasingly available on mobile devices such as tablets and smart phones. Implementing row-level security (RLS) controls, in addition to standard security measures such as authentication, is a most effective final layer in your data authorization process. RLS procedures in leading relational database management systems (RDBMSs) and business intelligence (BI) software are fairly commonplace, but with the emergence of big data and in-memory visualization tools such as SAS® Visual Analytics, those RLS procedures now need to be extended to the memory interface. Identity-driven row-level security is a specific RLS technique that enables the same report query to retrieve different sets of data in accordance with the varying security privileges afforded to the respective users. This paper discusses an automated framework approach for applying identity-driven RLS controls on SAS® Visual Analytics and our plans to implement a generic end-to-end RLS framework extended to the Teradata data warehouse.
Paul Johnson, Sopra Steria
Ekaitz Goienola, SAS
Dileep Pournami, RBS
Impala is an open source SQL engine designed to bring real-time, concurrent, ad hoc query capability to Hadoop. SAS/ACCESS® Interface to Impala allows SAS® to take advantage of this exciting technology. This presentation uses examples to show you how to increase your program's performance and troubleshoot problems. We discuss how Impala fits into the Hadoop ecosystem and how it differs from Hive. Learn the differences between the Hadoop and Impala SAS/ACCESS engines.
Jeff Bailey, SAS
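As a hedged, minimal example (it assumes SAS/ACCESS® Interface to Impala is licensed; the server, schema, and table names are placeholders):

libname imp impala server='impala-host' schema='default';

proc sql;
   create table work.top_events as
   select event_type, count(*) as n
   from imp.weblogs
   group by event_type;   /* aggregation is passed down to Impala where possible */
quit;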
So you've heard about SAS® arrays, but you're not sure when or why you would use them. This presentation provides some background on SAS arrays, from explaining what occurs during compile time to explaining how to use them programmatically. It also includes a discussion about how DO loops and macro variables can enhance array usability. Specific examples, including Fahrenheit-to-Celsius temperature conversion, salary adjustments, and data transposition and counting, demonstrate how you can use SAS arrays effectively in your own work and also provide a few caveats about their use.
Andrew Kuligowski, HSN
Lisa Mendez, IMS Government Solutions
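As a hedged version of the Fahrenheit-to-Celsius example mentioned above (the data set and variable names are assumptions):

data temps_c;
   set temps_f;                    /* day1-day7 hold Fahrenheit readings */
   array t{7} day1-day7;
   do i = 1 to 7;
      t{i} = (t{i} - 32) * 5/9;    /* convert each reading to Celsius    */
   end;
   drop i;
run;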
The world's capacity to store and analyze data has increased in ways that would have been inconceivable just a couple of years ago. Due to this development, large-scale data are collected by governments. Until recently, this was for purely administrative purposes. This study used comprehensive data files on education. The purpose of this study was to examine compulsory courses for a bachelor's degree in the economics program at the University of Copenhagen. The difficulty and use of the grading scale were compared across the courses by using the new IRT procedure, which was introduced in SAS/STAT® 13.1. Further, the latent ability traits that were estimated for all students in the sample by PROC IRT are used as predictors in a logistic regression model. The hypothesis of interest is that students who have a lower ability trait will have a greater probability of dropping out of the university program compared to successful students. Administrative data from one cohort of students in the economics program at the University of Copenhagen was used (n=236). Three unidimensional Item Response Theory models, two dichotomous and one polytomous, were introduced. It turns out that the polytomous Graded Response model does the best job of fitting the data. The findings suggest that in order to receive the highest possible grade, the highest level of student ability is needed for the course exam in the first-year course Descriptive Economics A. In contrast, the third-year course Econometrics C is the easiest course in which to receive a top grade. In addition, this study found that as estimated student ability decreases, the probability of a student dropping out of the bachelor's degree program increases drastically. However, contrary to expectations, some students with high ability levels also end up dropping out.
Sara Armandi, University of Copenhagen
Movie reviews provide crucial information and act as an important factor when deciding whether to spend money on seeing a film in the theater. Each review reflects an individual's take on the movie, and there are often contrasting reviews for the same movie. Going through each review may create confusion in the mind of a reader. Analyzing all the movie reviews and generating a quick summary that describes the performance, direction, and screenplay, among other aspects, is helpful to the readers in understanding the sentiments of movie reviewers and in deciding if they would like to watch the movie in the theaters. In this paper, we demonstrate how the SAS® Enterprise Miner™ nodes enable us to generate a quick summary of the terms and their relationship with each other, which describes the various aspects of a movie. The Text Cluster and Text Topic nodes are used to generate groups with similar subjects such as genres of movies, acting, and music. We use SAS® Sentiment Analysis Studio to build models that help us classify 10,000 reviews where each review is a separate document and the reviews are equally split into good or bad. The Smoothed Relative Frequency and Chi-square model is found to be the best model with an overall precision of 78.37%. As soon as the latest reviews are out, such analysis can be performed to help viewers quickly grasp the sentiments of the reviews and decide if the movie is worth watching.
Ameya Jadhavar, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Prithvi Raj Sirolikar, Oklahoma State University
Organizations are collecting more structured and unstructured data than ever before, and it is presenting great opportunities and challenges to analyze all of that complex data. In this volatile and competitive economy, there has never been a bigger need for proactive and agile strategies to overcome these challenges by applying the analytics directly to the data rather than shuffling data around. Teradata and SAS® have joined forces to revolutionize your business by providing enterprise analytics in a harmonious data management platform to deliver critical strategic insights by applying advanced analytics inside the database or data warehouse where the vast volume of data is fed and resides. This paper highlights how Teradata and SAS strategically integrate in-memory, in-database, and the Teradata relational database management system (RDBMS) in a box, giving IT departments a simplified footprint and reporting infrastructure, as well as lower total cost of ownership (TCO), while offering SAS users increased analytic performance by eliminating the shuffling of data over external network connections.
Greg Otto, Teradata Corporation
Tho Nguyen, Teradata
Paul Segal, Teradata
This paper demonstrates techniques using SAS® software to combine data from devices and sensors for analytics. Base SAS®, SAS® Data Integration Studio, and SAS® Visual Analytics are used to query external services and to import, fuzzy match, analyze, and visualize data. Finally, the benefits of SAS® Event Stream Processing models and alerts are discussed. To bring the analytics of things to life, the following data are collected: GPS data from countryside walks, GPS and scorecard data from smartphones while playing golf, and meteorological data feeds. These data are combined to test the old adage that golf is a good walk spoiled. Further, the use of alerts and the potential for predictive analytics are discussed.
David Shannon, Amadeus Software
When dealing with non-normal categorical response variables, logistic regression is the robust method to use for modeling the relationship between categorical outcomes and different predictors without assuming a linear relationship between them. Within such models, the categorical outcome might be binary, multinomial, or ordinal, and predictors might be continuous or categorical. Another complexity that might be added to such studies is when data are longitudinal, such as when outcomes are collected at multiple follow-up times. Learning about modeling such data within any statistical method is beneficial because it enables researchers to look at changes over time. This study looks at several methods of modeling binary and categorical response variables within regression models by using real-world data. Starting with the simplest case of binary outcomes through ordinal outcomes, this study looks at different modeling options within SAS® and includes longitudinal cases for each model. To assess binary outcomes, the current study models binary data in the absence and presence of correlated observations under regular logistic regression and mixed logistic regression. To assess multinomial outcomes, the current study uses multinomial logistic regression. When responses are ordered, using ordinal logistic regression is required as it allows for interpretations based on inherent rankings. Different logit functions for this model include the cumulative logit, adjacent-category logit, and continuation ratio logit. Each of these models is also considered for longitudinal (panel) data using methods such as mixed models and Generalized Estimating Equations (GEE). The final consideration, which cannot be addressed by GEE, is the conditional logit to examine bias due to omitted explanatory variables at the cluster level. Different procedures for the aforementioned within SAS® 9.4 are explored, and their strengths and limitations are specified for applied researchers finding similar data characteristics. These procedures include PROC LOGISTIC, PROC GLIMMIX, PROC GENMOD, PROC NLMIXED, and PROC PHREG.
Niloofar Ramezani, University of Northern Colorado
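As a hedged sketch of one of the longitudinal options named above, a GEE model for a repeated binary outcome fit with PROC GENMOD (the data set and variable names are assumptions; one row per subject per visit):

proc genmod data=long descending;
   class id visit trt;
   model event = trt visit / dist=binomial link=logit;
   repeated subject=id / type=exch corrw;   /* exchangeable working correlation */
run;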
Hospital Episode Statistics (HES) is a data warehouse that contains records of all admissions, outpatient appointments, and accident and emergency (A&E) attendances at National Health Service (NHS) hospitals in England. Each year it processes over 125 million admitted patient, outpatient, and A&E records. Such a large data set gives endless research opportunities for researchers and health-care professionals. However, patient care data is complex and might be difficult to manage. This paper demonstrates the flexibility and power of SAS® programming tools such as the DATA step, the SQL procedure, and macros to help to analyze HES data.
Violeta Balinskaite, Imperial College London
For some users, having an annotation facility is an integral part of creating polished graphics for their work. To meet that need, we created a new annotation facility for the ODS Graphics procedures in SAS® 9.3. Now, with SAS® 9.4, the Graph Template Language (GTL) supports annotation as well! In fact, the GTL annotation facility has some unique features not available in the ODS Graphics procedures, such as using multiple sets of annotation in the same graph and the ability to bind annotation to a particular cell in the graph. This presentation covers some basic concepts of annotating that are common to both GTL and the ODS Graphics procedures. I apply those concepts to demonstrate some unique abilities of GTL annotation. Come see annotation in action!
Dan Heath, SAS
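As a hedged, minimal example of the proc-based side of the annotation facility, which shares its basic concepts (FUNCTION, X1/Y1, DRAWSPACE) with GTL annotation; the annotation text and placement are illustrative:

data anno;
   length function $ 8 label $ 40 drawspace anchor $ 12;
   function = 'text'; label = 'Example annotation';
   x1 = 25; y1 = 90; drawspace = 'graphpercent'; anchor = 'left'; width = 40;
   output;
run;

proc sgplot data=sashelp.stocks sganno=anno;
   where stock = 'IBM';
   series x=date y=close;
run;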
Breast cancer is the second leading cause of cancer deaths among women in the United States. Although mortality rates have been decreasing over the past decade, it is important to continue to make advances in diagnostic procedures as early detection vastly improves chances for survival. The goal of this study is to accurately predict the presence of a malignant tumor using data from fine needle aspiration (FNA) with visual interpretation. Compared with other methods of diagnosis, FNA displays the highest likelihood for improvement in sensitivity. Furthermore, this study aims to identify the variables most closely associated with accurate outcome prediction. The study utilizes the Wisconsin Breast Cancer data available within the UCI Machine Learning Repository. The data set contains 699 clinical case samples (65.52% benign and 34.48% malignant) assessing the nuclear features of fine needle aspirates taken from patients' breasts. The study analyzes a variety of traditional and modern models, including logistic regression, decision tree, neural network, support vector machine, gradient boosting, and random forest. Prior to model building, the weights of evidence (WOE) approach was used to account for the high dimensionality of the categorical variables, after which variable selection methods were employed. Ultimately, the gradient boosting model utilizing a principal component variable reduction method was selected as the best prediction model with a 2.4% misclassification rate, 100% sensitivity, 0.963 Kolmogorov-Smirnov statistic, 0.985 Gini coefficient, and 0.992 ROC index for the validation data. Additionally, the uniformity of cell shape and size, bare nuclei, and bland chromatin were consistently identified as the most important FNA characteristics across variable selection methods. These results suggest that future research should attempt to refine the techniques used to determine these specific model inputs. Greater accuracy in characterizing the FNA attributes will allow researchers to develop more promising models for early detection.
Josephine Akosa, Oklahoma State University
Using smart clothing with integrated wearable medical sensors to keep track of human health is now attracting many researchers. However, body movement caused by daily human activities inserts artificial noise into physiological data signals, which affects the output of a health monitoring/alert system. To overcome this problem, recognizing human activities, determining the relationship between activities and physiological signals, and removing noise from the collected signals are essential steps. This paper focuses on the first step, which is human activity recognition. Our research shows that no other study has used SAS® for classifying human activities. For this study, two data sets were collected from an open repository. Both data sets have 561 input variables and one nominal target variable with four levels. Principal component analysis, along with other variable reduction and selection techniques, was applied to reduce dimensionality in the input space. Several modeling techniques with different optimization parameters were used to classify human activity. The gradient boosting model was selected as the best model based on a test misclassification rate of 0.1233. That is, 87.67% of total events were classified correctly.
Minh Pham, Oklahoma State University
Mary Ruppert-Stroescu, Oklahoma State University
Mostakim Tanjil, Oklahoma State University
Massive Open Online Courses (MOOCs) have attracted increasing attention in educational data mining research areas. MOOC platforms provide free higher education courses to Internet users worldwide. However, MOOCs have high enrollment but notoriously low completion rates. The goal of this study is to apply frequentist and Bayesian logistic regression to investigate whether and how students' engagement, intentions, education levels, and other demographics are conducive to course completion on MOOC platforms. The original data used in this study came from an online eight-week course titled Big Data in Education, taught within the Coursera platform (MOOC) by Teachers College, Columbia University. The data sets for analysis were created from three different sources: clickstream data, a pre-course survey, and homework assignment files. The SAS system provides multiple procedures to perform logistic regression, with each procedure having different features and functions. In this study, we apply two approaches, frequentist and Bayesian logistic regression, to MOOC data. PROC LOGISTIC is used for the frequentist approach, and PROC GENMOD is used for Bayesian analysis. The results obtained from the two approaches are compared. All the statistical analyses are conducted in SAS® 9.3. Our preliminary results show that MOOC students with higher course engagement and higher motivation are more likely to complete the MOOC course.
Yan Zhang, Educational Testing Service
Ryan Baker, Columbia University
Yoav Bergner, Educational Testing Service
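As a hedged sketch of the two modeling approaches compared above (the data set and variable names are assumptions):

proc logistic data=mooc descending;              /* frequentist */
   model completed = engagement intent edu_level;
run;

proc genmod data=mooc descending;                /* Bayesian */
   model completed = engagement intent edu_level / dist=binomial link=logit;
   bayes seed=20160401 nbi=2000 nmc=20000;
run;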
Seven principles frequently manifest themselves in data management projects. These principles improve two aspects: the development process, which enables the programmer to deliver better code faster with fewer issues; and the actual performance of programs and jobs. Using examples from Base SAS®, SAS® Macro Language, DataFlux®, and SQL, this paper presents seven principles: environments, job control data, test data, improvement, data locality, minimal passes, and indexing. Readers with an intermediate knowledge of the SAS DataFlux Management Platform and/or Base SAS and SAS Macro Language as well as SQL will understand these principles and find ways to use them in their data management projects.
Bob Janka, Modern Analytics
With the popularity of network-attached storage for shared systems, IT shops are increasingly turning to them for SAS® Grid implementations. Given the high I/O demand of SAS large block processing, how do you ensure good performance between network-attached SAS Grid nodes and storage? This paper discusses different types of network-attached implementations, what works, and what does not. It provides advice for bandwidth planning, types of network technology to use, and what has been practically successful in the field.
Tony Brown, SAS
This paper demonstrates the deployment of SAS® Visual Analytics 7.2 with a distributed SAS® LASR™ Analytic Server and an internet-facing web tier. The key factor in this deployment is to establish the secure web server connection using a third-party certificate to perform the client and server authentication. The deployment process involves the following steps: 1) Establish the analytics cluster, which consists of SAS® High-Performance Deployment of Hadoop and the deployment of high-performance analytics environment master and data nodes. 2) Get the third-party signed certificate and the key files. 3) Deploy the SAS Visual Analytics server tier and middle tier. 4) Deploy the standalone web tier with HTTP protocol configured using secure sockets. 5) Deploy the SAS® Web Infrastructure Platform. 6) Perform post-installation validation and configuration to handle the certificate between the servers.
Vimal Raj Arockiasamy, Kavi Associates
Ratul Saha, Kavi Associates
Arrays and DO loops have been used for years by SAS® programmers who work with diagnosis fields in health-care data, and the opportunity to use these techniques has only grown since the launch of the Affordable Care Act (ACA) in the United States. Users new to SAS or to the health-care field may find an overview of existing (as well as new) applications helpful. Risk-adjustment software, including the publicly available Health and Human Services (HHS) risk software that uses SAS and was released as part of the ACA implementation, is one example of code that is significantly improved by the use of arrays. Similar projects might include evaluations of diagnostic persistence, comparisons of diagnostic frequency or intensity between providers, and checks for unusual clusters of diagnosed conditions. This session reviews examples suitable for intermediate SAS users, including the application of two-dimensional arrays to diagnosis fields.
Ryan Ferland, Blue Cross Blue Shield of Arizona
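As a hedged sketch of the kind of diagnosis-field scan the session reviews (the data set, field names, and diagnosis prefix are illustrative assumptions):

data flagged;
   set claims;                          /* dx1-dx25 hold diagnosis codes */
   array dx{25} $ dx1-dx25;
   diabetes = 0;
   do i = 1 to 25 until (diabetes);
      if substr(dx{i}, 1, 3) = 'E11' then diabetes = 1;
   end;
   drop i;
run;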
In light of past indiscretions by police departments leading to riots in major U.S. cities, it is important to assess the factors leading to the arrest of a citizen. The police department should understand the arrest situation before it can make any decision and defuse any possibility of a riot or similar incident. Many such incidents in the past are a result of the police department failing to make the right decisions in emergencies. The primary objective of this study is to understand the key factors that impact arrests of people in New York City and to understand the various crimes that have taken place in the region in the last two years. The study explores the different regions of New York City where the crimes have taken place and also the timing of incidents leading to them. The data set consists of 111 variables and 273,430 observations from the years 2013 to 2014, with a binary target variable, 'arrest made'. The percentages of Yes and No incidents of 'arrest made' are 6% and 94%, respectively. This study analyzes the reasons for arrest, which are suspicion, timing of frisk activities, identifying threats, basis of search, whether the arrest required a physical intervention, location of the arrest, and whether the frisk officer needs any support. Decision tree, regression, and neural network models were built, and the decision tree turned out to be the best model with a validation misclassification rate of 6.47% and model sensitivity of 73.17%. The results from the decision tree model showed that suspicion count, search basis count, summons leading to violence, ethnicity, and use of force are some of the important factors influencing the arrest of a person.
Karan Rudra, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Maitreya Kadiyala, Oklahoma State University
This presentation is an open-ended discussion about techniques for creating graphical reports using SAS® ODS Graphics components. After some introductory comments, this presentation does not have any set content. Instead, the topics discussed are dictated by attendee questions. Come prepared to ask and get answers to your questions.
This paper provides tools and guidelines to assess SAS® skill level during the interview process. Included are interview questions for Base SAS® developers, statisticians, and SAS administration candidates. Whether the interviewer has a deep understanding of or just some familiarity with these areas, the questions provided will open discussions that uncover the skills and usage experience your company requires.
Jenine Milum, Citi
Traditional approaches to assessing undergraduate assignments in the field of software-related courses, including Analytics and Data Science courses, involve the course tutors in reading the students' code and getting the students to physically demonstrate their artifacts. However, this approach tends to assess only the technical skills of solving the set task. It generally fails to assess the many soft skills that industry is looking for, as identified in the e-skills UK (Tech Partnership)/SAS® report of November 2014 and the associated infographic poster. This presentation describes and evaluates the effectiveness of a different approach to defining the assessment task and summatively assessing the work of the students in order to effectively evaluate and mark both the soft skills, including creativity, curiosity, storytelling, problem solving, and communication, and the technical skills. This approach works effectively at all levels of undergraduate and master's courses. The session is structured to provide adequate time for audience participation to discuss the approach and its applicability.
Richard Self, University of Derby
The SAS® Visual Analytics autoload feature loads any files placed in the autoload location into the public SAS® LASR™ Analytic Server. But what if we want to autoload the tables into a private SAS LASR Analytic Server, we want to load database tables into SAS Visual Analytics, and we also want to reload them every day so that the tables reflect any changes in the database? One option is to create a query in SAS® Visual Data Builder and schedule the table for load into the SAS LASR Analytic Server, but in this case, a query has to be created for each table that has to be loaded. It would be far more efficient if all tables could be loaded in one step; that would make the job a lot easier. This paper provides a custom solution for autoloading the tables in one step, so that tables in SAS Visual Analytics don't have to be reloaded manually and multiple queries don't have to be rerun whenever there is a change in the source data. This solution automatically unloads and reloads all the tables listed in a text file every day so that the data available in SAS Visual Analytics is always up to date. This paper focuses on autoloading the database tables into the private SAS LASR Analytic Server in SAS Visual Analytics hosted in a UNIX environment and uses the SASIOLA engine to perform the load. The process involves four steps: initial load, unload, reload, and scheduling of the unload and reload. This paper provides the scripts associated with each step. The initial load needs to be performed only once. Unload and reload have to be performed periodically to refresh the data source, so these steps have to be scheduled. Once the tables are loaded, a description is also added to the tables detailing the name of the table, the time of the load, and the user ID performing the load, similar to the description added when loading the tables manually into the SAS LASR Analytic Server using SAS Visual Analytics Administrator.
Swetha Vuppalanchi, SAS Institute Inc.
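A minimal sketch of the load and unload steps described above, using the SASIOLA engine (host name, port, tag, and table names are hypothetical; adjust to your environment):
libname srcdb oracle user=&dbuser password=&dbpass path=proddb schema=sales;   /* hypothetical source database */
libname privlasr sasiola host="lasr.example.com" port=10011 tag="vaprivate";   /* private LASR server          */

/* Unload the previous copy, if any, then reload from the database. */
proc datasets lib=privlasr nolist nowarn;
   delete orders;
quit;

data privlasr.orders (label="orders loaded &sysdate9 &systime by &sysuserid");
   set srcdb.orders;
run;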
Developers working on a production process need to think carefully about ways to avoid future changes that require change control, so it's always important to make the code dynamic rather than hardcoding items into the code. Even if you are a seasoned programmer, the hardcoded items might not always be apparent. This paper assists in identifying the harder-to-reach hardcoded items and addresses ways to effectively use control tables within the SAS® software tools to deal with sticky areas of coding such as formats, parameters, grouping/hierarchies, and standardization. The paper presents examples of several ways to use the control tables and demonstrates why this usage prevents the need for coding changes. Practical applications are used to illustrate these examples.
Frank Ferriola, Financial Risk Group
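One pattern from the discussion above is reading run parameters from a control table instead of hardcoding them; a minimal sketch, assuming a control data set ctrl.run_params with columns param_name and param_value (all names hypothetical), with the same table-driven idea extending to formats through the CNTLIN= option of PROC FORMAT:
data _null_;
   set ctrl.run_params;
   call symputx(param_name, param_value);    /* each row becomes a macro variable */
run;

/* Downstream code references the parameters rather than literal values,   */
/* assuming start_date is stored as a date literal such as 01JAN2016.      */
proc print data=sales.orders;
   where order_date >= "&start_date"d and region = "&report_region";
run;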
Bayesian inference for complex hierarchical models with smoothing splines is typically intractable, requiring approximate inference methods for use in practice. Markov Chain Monte Carlo (MCMC) is the standard method for generating samples from the posterior distribution. However, for large or complex models, MCMC can be computationally intensive, or even infeasible. Mean Field Variational Bayes (MFVB) is a fast deterministic alternative to MCMC. It provides an approximating distribution that has minimum Kullback-Leibler distance to the posterior. Unlike MCMC, MFVB efficiently scales to arbitrarily large and complex models. We derive MFVB algorithms for Gaussian semiparametric multilevel models and implement them in SAS/IML® software. To improve speed and memory efficiency, we use block decomposition to streamline the estimation of the large sparse covariance matrix. Through a series of simulations and real data examples, we demonstrate that the inference obtained from MFVB is comparable to that of PROC MCMC. We also provide practical demonstrations of how to estimate additional posterior quantities of interest from MFVB either directly or via Monte Carlo simulation.
Jason Bentley, The University of Sydney
Cathy Lee, University of Technology Sydney
For me, it's all about avoiding manual effort and repetition. Whether your work involves data exploration, reporting, or analytics, you probably find yourself repeating steps and tasks with each new program, project, or analysis. That repetition adds time to the delivery of results and also contributes to a lack of standardization. This presentation focuses on productivity tips and tricks to help you create a standard and efficient environment for your SAS® work so you can focus on the results and not the processes. Included are the following: setting up your programming environment (comment blocks, environment cleanup, easy movement between test and production, and modularization); sharing easily with your team (format libraries, macro libraries, and common code modules); and managing files and results (date and time stamps for logs and output, project IDs, and titles and footnotes).
Marje Fecht, Prowerk Consulting
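As one example of the log-management tips mentioned above, a date/time stamp can be built into the log file name for every run; a minimal sketch, assuming a writable /project/logs directory (path and program name are hypothetical):
%let stamp = %sysfunc(datetime(), B8601DT15.);   /* e.g., 20160418T093015 */

proc printto log="/project/logs/monthly_report_&stamp..log" new;   /* route this run's log to its own file */
run;

/* ... program steps ... */

proc printto;   /* restore the default log destination */
run;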
When you were a kid, did you dress up as a unicorn for Halloween? Did you ever imagine you were a fairy living at the bottom of your garden? Maybe your dreams can come true! Modern-day marketers need a variety of complementary (and sometimes conflicting) skill sets. Forbes and others have started talking about unicorns, a rare breed of marketer--marketing technologists who understand both marketing and marketing technology. A good marketing analyst is part of this new breed; a unicorn isn't a horse with a horn but truly a new breed. It is no longer enough to be good at coding in SAS®--or a whiz with Excel--and to know a few marketing buzzwords. In this session, we explore the skills an analyst needs to turn himself or herself into one of these mythical, highly sought-after creatures.
Emma Warrillow, Data Insight Group Inc. (DiG)
The power of SAS®9 applications allows information and knowledge creation from very large amounts of data. Analysis that used to consist of 10s to 100s of gigabytes (GBs) of supporting data has rapidly grown into the 10s to 100s of terabytes (TBs). This data expansion has resulted in more and larger SAS® data stores. Setting up file systems to support these large volumes of data with adequate performance, as well as ensuring adequate storage space for the SAS temporary files, can be very challenging. Technology advancements in storage and system virtualization, flash storage, and hybrid storage management require continual updating of best practices to configure IO subsystems. This paper presents updated best practices for configuring the IO subsystem for your SAS®9 applications, ensuring adequate capacity, bandwidth, and performance for your SAS®9 workloads. We have found that very few storage systems work ideally with SAS with their out-of-the-box settings, so it is important to convey these general guidelines.
Tony Brown, SAS
Financial institutions rely heavily on quantitative models for risk management, balance-sheet stress testing, and various business analysis and decision-support functions. Investment decisions and business strategies are largely driven by estimates from models. Recent financial crises and model failures at high-profile banks have emphasized the need for better modeling practices. Regulators have stepped in to assist banks with enhanced guidance and regulations for effective model risk management. Effective model risk management is more than developing a good model. SAS® Model Risk Management provides a robust framework to capture and track model inventory. In this paper, we present best practices in model risk management learned from implementation projects and interactions with industry experts. These best practices help firms that are setting up a model risk management framework or enhancing their existing practices.
Satish Garla, SAS
Sukhbir Dhillon, SAS
Building representative machine learning models that generalize well on future data requires careful consideration both of the data at hand and of assumptions about the various available training algorithms. Data are rarely in an ideal form that enables algorithms to train effectively. Some algorithms are designed to account for important considerations such as variable selection and handling of missing values, whereas other algorithms require additional preprocessing of the data or appropriate tweaking of the algorithm options. Ultimate evaluation of a model's quality requires appropriate selection and interpretation of an assessment criterion that is meaningful for the given problem. This paper discusses the most common issues faced by machine learning practitioners and provides guidance for using these powerful algorithms to build effective models.
Brett Wujek, SAS
Funda Gunes, SAS Institute
Patrick Hall, SAS
SAS® solutions that run in Hadoop provide you with the best tools to transform data in Hadoop. They also provide insights to help you make the right decisions for your business. It is possible to incorporate SAS products and solutions into your shared Hadoop cluster in a cooperative fashion with YARN to manage the resources. Best practices and customer examples are provided to show how to build and manage a shared cluster with SAS applications and products.
James Kochuba, SAS
Connecting database schemas to libraries in the SAS® metadata is a very important part of setting up a functional and useful environment for business users. This task can be quite difficult for the untrained administrator. This paper addresses the key configuration items that often go unnoticed but can make a big difference. Using the wrong options can lead to poor database performance or even to a total lockdown, depending on the number of connections to the database.
Mathieu Gaouette, Videotron
Architects do not see a single architectural solution, such as SAS® Grid Manager, satisfying the varied needs of users across the enterprise. Multi-tenant environments need to support data movement (such as ETL), analytic processing (such as forecasting and predictive modeling), and reporting, which can include everything from visual data discovery to standardized reporting. SAS® users have a myriad of choices that might seem at odds with one another, such as in-database versus in-memory, or data warehouses (tightly structured schemes) versus data lakes and event stream processing. Whether fit for purpose or for potential, these choices force us as architects to modernize our thinking about the appropriateness of architecture, configuration, monitoring, and management. This paper discusses how SAS® Grid Manager can accommodate the myriad use cases and the best practices used in large-scale, multi-tenant SAS environments.
Jan Bigalke, Allianz Managed Operations & Services SE
Greg Nelson, ThotWave
Higher education is taking on big data initiatives to help improve student outcomes and operational efficiencies. This paper shows how one institution went from white paper to action. Challenges such as developing a common definition of big data, gaining access to certain types of data, and bringing together institutional groups that had little previous contact are highlighted, as is the stepwise, collaborative approach taken. How the data was eventually brought together and analyzed to generate actionable results is also covered. Learn how to make big data a part of your analytics toolbox.
Stephanie Thompson, Datamum
The surge of data and data sources in marketing has created an analytical bottleneck in most organizations. Analytics departments have been pushed into a difficult decision: either purchase black-box analytical tools to generate efficiencies or hire more analysts, modelers, and data scientists. Knowledge gaps stemming from restrictions in black-box tools or from backlogs in the work of analytical teams have resulted in lost business opportunities. Existing big data analytics tools respond well when dealing with large record counts and small variable counts, but they fall short in bringing efficiencies when dealing with wide data. This paper discusses the importance of an agile modeling engine designed to deliver productivity, irrespective of the size of the data or the complexity of the modeling approach.
Mariam Seirafi, Cornerstone Group of Companies
The electrical grid has become more complex; utilities are revisiting their approaches, methods, and technology to accurately predict energy demands across all time horizons in a timely manner. With the advanced analytics of SAS® Energy Forecasting, Old Dominion Electric Cooperative (ODEC) provides data-driven load predictions from next hour to next year and beyond. Accurate intraday forecasts mean meeting daily peak demands, saving millions of dollars during critical seasons and events. Mid-term forecasts provide a baseline to the cooperative and its members to accurately anticipate regional growth and customer needs, in addition to signaling power marketers where, when, and how much to hedge future energy purchases to meet weather-driven demands. Long-term forecasts create defensible numbers for large capital expenditures such as generation and transmission projects. Much of the data for determining load comes from disparate systems such as supervisory control and data acquisition (SCADA) and internal billing systems combined with external market data (PJM Energy Market), weather, and economic data. This data needs to be analyzed, validated, and shaped to fully leverage predictive methods. Business insights and planning metrics are achieved when flexible data integration capabilities are combined with advanced analytics and visualization. These increased computing demands at ODEC are being met by leveraging Amazon Web Services (AWS) for expanded business discovery and operational capacity. Flexible and scalable data and discovery environments allow ODEC analysts to efficiently develop and test models that are I/O intensive. SAS® visualization for the analyst is a graphic compute environment for information-sharing that is memory intensive. Also, ODEC IT operations require deployment options tuned for process optimization to meet service level agreements that can be quickly evaluated, tested, and promoted into production. What was once very difficult for most utilities to embrace is now achievable with new approaches, methods, and technology like never before.
David Hamilton, ODEC
Steve Becker, SAS
Emily Forney, SAS
The Health Indicators Warehouse (HIW) is part of the US Department of Health and Human Services' (DHHS) response to make federal data more accessible. Through it, users can access data and metadata for over 1,200 indicators from approximately 180 federal and nonfederal sources. The HIW also supports data access by applications such as SAS® Visual Analytics through the use of an application programming interface (API). An API serves as a communication interface for integration. As a result of the API, HIW data consumers using SAS Visual Analytics can avoid difficult manual data processing. This paper provides detailed information about how to access HIW data with SAS Visual Analytics in order to produce easily understood visualizations with minimal effort through a methodology that automates HIW data processing. This paper also shows how to run SAS® macros inside a stored process to make HIW data available in SAS Visual Analytics for exploration and reporting via API calls; the SAS macros are provided. Use cases involving dashboards are also examined in order to demonstrate the value of streaming data directly from the HIW. Both IT professionals and population health analysts will benefit from understanding how to import HIW data into SAS Visual Analytics using SAS® Stored Processes, macros, and APIs. This can be very helpful to organizations that want to lower maintenance costs associated with data management while gaining insights into health data with visualizations. This paper provides a starting point for any organization interested in deriving full value from SAS Visual Analytics while augmenting their work with HIW data.
Li Hui Chen, US Consumer Product Safety Commission
Manuel Figallo, SAS
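A minimal sketch of calling a REST API from SAS and reading the JSON response, in the spirit of the approach described above (the URL is a placeholder rather than the actual HIW endpoint, and the JSON libname engine assumes SAS 9.4M4 or later):
filename resp temp;

proc http
   url="https://api.example.gov/indicators/1234/data"   /* hypothetical endpoint */
   method="GET"
   out=resp;
run;

libname hiw json fileref=resp;    /* expose the JSON response as SAS data sets */

data work.indicator_data;
   set hiw.data;                  /* member name depends on the JSON structure */
run;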
Your marketing team would like to pull data from its different marketing activities into one report. What happens in Vegas might stay in Vegas, but what happens in your data does not have to stay there, locked in different tools or static spreadsheets. Learn how to easily bring data from Google Analytics, Facebook, and Twitter into SAS® Visual Analytics to create interactive explorations and reports on this data along with your other data for better overall understanding of your marketing activity.
I-Kong Fu, SAS
Mark Chaves, SAS
Andrew Fagan, SAS
A United States Department of Defense agency with over USD 40 billion in sales and revenue, 25,000 employees, and 5.3 million parts to source partnered with SAS® to turn its disparate PC-based analytic environment into a modern SAS® Grid Computing server-based architecture. This presentation discusses the challenges of underpowered desktops, data sprawl, outdated software, difficult upgrades, and inefficient compute processing, and the solution crafted to enable the agency to run as the Fortune 50 company that its balance sheet (and our nation's security) demands. In the modern architecture, rolling upgrades, high availability, centralized data set storage, and improved performance enable improved forecasting, getting our troops the supplies they need, when and where they need them.
Erin Stevens, SAS
Douglas Liming, SAS
Microsoft Office has over 1 billion users worldwide, making it one of the most successful pieces of software on the market today. Imagine combining the familiarity and functionality of Microsoft Office with the power of SAS® to include SAS content in a Microsoft Office document. By using SAS® Office Analytics, you can create Microsoft Excel worksheets that are not just static reports, but interactive documents. This paper looks at opening, filtering, and editing data in an Excel worksheet. It shows how to create an interactive experience in Excel by leveraging Visual Basic for Applications using SAS data and stored processes. Finally this paper shows how to open SAS® Visual Analytics reports into Excel, so the interactive benefits of SAS Visual Analytics are combined with the familiar interface of an Excel worksheet. All of these interactions with SAS content are possible without leaving Microsoft Excel.
Tim Beese, SAS
Surveys are a critical research component when trying to gather information about a population of people and their knowledge, attitudes, and experiences. Researchers often implement surveys after a population has participated in a class or educational program. Survey developers often create sets of items, which when analyzed as a group, can measure a construct that describes the underlying behavior, attribute, or characteristic of a study participant. After testing for the reliability and validity of individual items, the group of items can be combined to create a scale score that measures a predefined construct. For example, in education research, a construct can measure students' use of global reading strategies or teachers' confidence in literacy instruction. The number of items that compose a construct can vary widely. Construct scores can be used as outcome measures to assess the impact of a program on its participants. Our example is taken from a project evaluating a teacher's professional development program aimed at improving students' literacy in target subject classes. The programmer is tasked with creating such scores quickly. With an unlimited amount of time to spend, a programmer could operationalize the creation of these constructs by manually entering predefined formulas in the DATA step. More often, time and resources are limited. Therefore, the programmer is more efficient by automating this process by using macro programs. In this paper, we present a technique that uses an externally created key data set that contains the construct specifications and corresponding items of each construct. By iterating through macro variables created from this key data set, we can create construct score variables for varying numbers of items. This technique can be generalized to other processing tasks that involve any number of mathematical combinations of variables to create one single score.
Vincent Chan, IMPAQ International
Lorena Ortiz, IMPAQ International
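A minimal sketch of the key-data-set technique described above, assuming a key data set with one row per construct that stores the construct name and a space-delimited list of its items (all names hypothetical):
%macro make_constructs(keyds=, indata=, outdata=);
   %local i n_constructs;
   data _null_;
      set &keyds end=last;                           /* columns: construct, items */
      call symputx(cats('construct', _n_), construct);
      call symputx(cats('items', _n_), items);
      if last then call symputx('n_constructs', _n_);
   run;

   data &outdata;
      set &indata;
      %do i = 1 %to &n_constructs;
         &&construct&i = mean(of &&items&i);         /* construct score = mean of its items */
      %end;
   run;
%mend make_constructs;

%make_constructs(keyds=work.key, indata=work.survey, outdata=work.scored);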
Nowadays, recommender systems are popular tools for online retail businesses to predict a customer's next-product-to-buy (NPTB). Based on statistical techniques and the information collected by the retailer, an efficient recommender system can suggest a meaningful NPTB to customers. A useful suggestion reduces the customer's search time for a wanted product and improves the buying experience, thus increasing the chance of cross-selling for online retailers and helping them build customer loyalty. Within a recommender system, the combination of advanced statistical techniques with available information (such as customer profiles, product attributes, and popular products) is the key element in using the retailer's database to produce a useful suggestion of an NPTB for customers. This paper illustrates how to create a recommender system with the SAS® RECOMMEND procedure for an online business. Using the recommender system, we can produce predictions, compare the performance of different predictive models (such as decision trees or multinomial discrete-choice models), and make business-oriented recommendations from the analysis.
Miao Nie, ABN AMRO Bank
Shanshan Cong, SAS Institute
Formats are powerful tools within the SAS® System. They can be used to change how information is brought into SAS, to modify how it is displayed, and even to reshape the data itself. Base SAS® comes with a great many predefined formats, and it is even possible for you to create your own specialized formats. This paper briefly reviews the use of formats in general and then covers a number of aspects of user-generated formats. Since formats themselves have a number of uses that are not at first apparent to the new user, we also look at some of the broader applications of formats. Topics include building formats from data sets, using picture formats, transformations using formats, value translations, and using formats to perform table lookups.
Art Carpenter, California Occidental Consultants
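A minimal sketch of two of the topics listed above, building a format from a data set and then using it as a table lookup (library, data set, and variable names are hypothetical):
data prodfmt;
   set ref.products(rename=(product_code=start product_name=label));
   retain fmtname '$PRODNM' type 'C';     /* control data set for PROC FORMAT */
run;

proc format cntlin=prodfmt;
run;

data sales_named;
   set work.sales;
   product_name = put(product_code, $prodnm.);   /* lookup without a merge or join */
run;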
Have you ever wondered about the best way to secure SAS® software? Many pieces need to be secured, from passwords and authentication to encryption of data at rest and in transit. During this presentation, we discuss several security tasks that are important when setting up SAS, and we provide some tips for finding the information you need in the mountains of SAS documentation. The tasks include 1) enabling basic network security (TLS) using the automation options in the latest release of SAS® 9.4; 2) configuring HTTPS for the SAS middle tier (once the TLS connection is established); and 3) setting up user accounts and IDs with an eye toward auditing user activity. Whether you are an IT expert who is concerned about corporate security policies, a SAS administrator who needs to be able to describe SAS capabilities and configure SAS software to work with corporate IT standards, or an auditor who needs to review specific questions about security vulnerabilities, this presentation is for you.
Robin Crumpton, SAS
Donna Bennett, SAS
Qiana Eaglin, SAS
You would think that training as a code breaker, similar to those who were employed during the Second World War, wouldn't be necessary to perform the routine SAS® programming duties of your job, such as debugging code. However, if the author of the code doesn't incorporate good elements of style in his or her program, the task of reviewing code becomes arduous and tedious for others. Style touches upon very specific aspects of writing code--indentation for code and comments, casing of keywords, variables, and data set names, and spacing between PROC and DATA steps. Paying attention to these specific issues enhances the reusability and lifespan of your code. By using style to make your code readable, you'll impress your superiors and grow as a programmer.
Jay Iyengar, Data Systems Consultants
With all the talk of big data and visual analytics, we sometimes forget how important, and often difficult, it is to get external data into SAS®. In this paper, we review some common data sources such as delimited sources (for example, comma-separated values format [CSV]) as well as structured flat files and the programming steps needed to successfully load these files into SAS. In addition to examining the INFILE and INPUT statements, we look at some methods for dealing with bad data. This paper assumes only basic SAS skills, although the topic can be of interest to anyone who needs to read external files.
Peter Eberhardt, Fernwood Consulting Group Inc.
Audrey Yeo, Athene
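A minimal sketch of reading a delimited file and trapping bad numeric values, along the lines discussed above (the file path and layout are hypothetical):
data work.claims;
   infile "/data/claims.csv" dsd truncover firstobs=2;    /* skip the header row */
   input claim_id :$10. service_date :mmddyy10. amount_raw :$20.;
   amount = input(amount_raw, ?? comma12.);   /* ?? suppresses notes for unreadable values */
   if missing(amount) and not missing(amount_raw) then bad_amount = 1;
   format service_date date9. amount dollar12.2;
   drop amount_raw;
run;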
Even though marketing is inevitable in every business, every year the marketing budget is limited and prudent fund allocations are required to optimize marketing investment. In many businesses, the marketing fund is allocated based on the marketing manager's experience, departmental budget allocation rules, and sometimes 'gut feelings' of business leaders. Those traditional ways of budget allocation yield suboptimal results and in many cases lead to money being wasted on irrelevant marketing efforts. Marketing mixed models can be used to understand the effects of marketing activities and identify the key marketing efforts that drive the most sales among a group of competing marketing activities. The results can be used in marketing budget allocation to take out the guesswork that typically goes into the budget allocation. In this paper, we illustrate practical methods for developing and implementing marketing mixed modeling using SAS® procedures. Real-life challenges of marketing mixed model development and execution are discussed, and several recommendations are provided to overcome some of those challenges.
Delali Agbenyegah, Alliance Data Systems
In the biopharmaceutical industry, biostatistics plays an important and essential role in the research and development of drugs, diagnostics, and medical devices. Familiarity with biostatistics combined with knowledge of SAS® software can lead to a challenging and rewarding career that also improves patients' lives. This paper provides a broad overview of the different types of jobs and career paths available, discusses the education and skill sets needed for each, and presents some ideas for overcoming entry barriers to careers in biostatistics and clinical SAS programming.
Justina Flavin, Independent Consultant
Packing a carry-on suitcase for air travel and designing a report for mobile devices have a lot in common. Your carry-on suitcase contains indispensable items for your journey, and the contents are limited by tight space. Your reports for mobile devices face similar challenges--data display is governed by tight real estate space, and other factors such as users' shorter attention spans and information density come into play. How do you overcome these challenges while displaying data effectively for your mobile users? This paper demonstrates how smaller real estate on mobile devices, as well as device orientation in portrait or landscape mode, influences best practices for designing reports. The use of containers, layouts, filters, information windows, and carefully selected objects enables you to design and guide user interaction effectively. Appropriate selection of font styles, font sizes, and colors reduces distraction and enhances quick user comprehension. By incorporating these recommendations into your report design, you can produce reports that display seamlessly on mobile devices and browsers.
Lavanya Mandavilli, SAS
Anand Chitale, SAS
We use SAS® Forecast Studio to develop time series models for the daily number of taxi trips and the daily taxi fare revenue of Yellow Cabs in New York City, using publicly available data from 1/1/2011 to 6/30/2015. Interest centers on trying to assess the effects (if any) of the Uber ride-hailing service on Yellow Cabs in New York City.
Simon Sheather, Texas A&M University
This paper is a case study to explore the analytical and reporting capabilities of SAS® Visual Analytics to perform data exploration, determine order patterns and trends, and create data visualizations to generate extensive dashboard reports using the open source pollution data available from the United States Environmental Protection Agency (EPA). Data collection agencies report their data to EPA via the Air Quality System (AQS). The EPA makes available several types of aggregate (summary) data sets, such as daily and annual pollutant summaries, in CSV format for public use. We intend to demonstrate SAS Visual Analytics capabilities by using pollution data to create visualizations that compare Air Quality Index (AQI) values for multiple pollutants by location and time period, generate a time series plot by location and time period, compare 8-hour ozone 'exceedances' from this year with previous years, and perform other such analysis. The easy-to-use SAS Visual Analytics web-based interface is leveraged to explore patterns in the pollutant data to obtain insightful information. SAS® Visual Data Builder is used to summarize data, join data sets, and enhance the predictive power within the data. SAS® Visual Analytics Explorer is used to explore data, to create calculated data items and aggregated measures, and to define geography items. Visualizations such as charts, bar graphs, geo maps, tree maps, correlation matrices, and other graphs are created to depict the pollutants contaminating the environment; hierarchies are derived from date and time items and across geographies to allow rolling up the data. Various reports are designed for pollution analysis. Viewing on a mobile device such as an iPad is also explored. In conclusion, this paper attempts to demonstrate the use of SAS Visual Analytics to determine the impact of pollution on the environment over time using various visualizations and graphs.
Viraj Kumbhakarna, MUFG Union Bank
In healthcare and other fields, the importance of cell suppression as a means to avoid unintended disclosure or identification of Protected Health Information (PHI) or any other sensitive data has grown as we move toward dynamic query systems and reports. Organizations such as Centers for Medicare & Medicaid Services (CMS), the National Center for Health Statistics (NCHS), and the Privacy Technical Assistance Center (PTAC) have outlined best practices to help researchers, analysts, report writers, and others avoid unintended disclosure for privacy reasons and to maintain statistical validity. Cell suppression is a crucial consideration during report design and can be a substantial hurdle in the dissemination of information. Often, the goal is to display as much data as possible without enabling the identification of individuals and while maintaining statistical validity. When designing reports using SAS® Visual Analytics, achieving suppression can be handled multiple ways. One way is to suppress the data before loading it into the SAS® LASR™ Analytic Server. This has the drawback that a user cannot take full advantage of the dynamic filtering and aggregation available with SAS Visual Analytics. Another method is to create formulas that govern how SAS Visual Analytics displays cells within a table (crosstab) or bars within a chart. The logic can be complex and can meet a variety of needs. This presentation walks through examples of the latter methodology, namely, the creation of suppression formulas and how to apply them to report objects.
Marc Flore, University of New Hampshire
Whether you are deploying a new capability with SAS® or modernizing the tool set that people already use in your organization, change management is a valuable practice. Sharing the news of a change with employees can be a daunting task and is often put off until the last possible second. Organizations frequently underestimate the impact of the change, and the results of that miscalculation can be disastrous. Too often, employees find out about a change just before mandatory training and are expected to embrace it. But change management is far more than training. It is early and frequent communication; an inclusive discussion; encouraging and enabling the development of an individual; and facilitating learning before, during, and long after the change. This paper not only showcases the importance of change management but also identifies key objectives for a purposeful strategy. We outline our experiences with both successful and not so successful organizational changes. We present best practices for implementing change management strategies and highlighting common gaps. For example, developing and engaging Change Champions from the beginning alleviates many headaches and avoids disruptions. Finally, we discuss how the overall company culture can either support or hinder the positive experience change management should be and how to engender support for formal change management in your organization.
Greg Nelson, ThotWave
We set out to model two of the leading causes of fatal automobile accidents in America: drunk driving and speeding. We created a decision tree for each of these causes that uses situational conditions such as time of day and type of road to classify historical accidents. Using this model, a law enforcement or town official can decide whether a speed trap or a DUI checkpoint would be most effective in a particular situation. This proof of concept can easily be applied to specific jurisdictions for customized solutions.
Catherine LaChapelle, NC State University
Daniel Brannock, North Carolina State University
Aric LaBarr, Institute for Advanced Analytics
Craig Shaver, North Carolina State University
As a SAS® programmer, you probably spend some of your time reading and possibly creating specifications. Your job also includes writing and testing SAS code to produce the final product, whether it is Study Data Tabulation Model (SDTM) data sets, Analysis Data Model (ADaM) data sets, or statistical outputs such as tables, listings, or figures. You reach the point where you have completed the initial programming, removed all obvious errors and warnings from your SAS log, and checked your outputs for accuracy. You are almost done with your programming task, but one important step remains. It is considered best practice to check your SAS log for any questionable messages generated by the SAS system. In addition to messages that begin with the words WARNING or ERROR, there are also messages that begin with the words NOTE or INFO. This paper will focus on five different types of NOTE messages that commonly appear in the SAS log and will present ways to remove these messages from your log.
Jennifer Srivastava, Quintiles
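As an illustration of the kind of NOTE discussed above, automatic numeric-to-character conversion can be silenced with an explicit conversion; a minimal before/after sketch with hypothetical variable names:
/* Before: produces NOTE: Numeric values have been converted to character values ... */
data before;
   set demo;
   length subjid_c $8;
   subjid_c = subjid;                          /* implicit conversion */
run;

/* After: explicit conversion removes the note and documents the intent. */
data after;
   set demo;
   length subjid_c $8;
   subjid_c = strip(put(subjid, best8.));
run;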
Graphs are essential for many clinical and health care domains, including analysis of clinical trials safety data and analysis of the efficacy of the treatment, such as change in tumor size. Creating such graphs is a breeze with procedures from SAS® 9.4 ODS Graphics. This paper shows how to create many industry-standard graphs such as Lipid Profile, Swimmer Plot, Survival Plot, Forest Plot with Subgroups, Waterfall Plot, and Patient Profile using Study Data Tabulation Model (SDTM) data with just a few lines of code.
Sanjay Matange, SAS
Thanks to advances in technologies that make data more readily available, sports analytics is an increasingly popular topic. A majority of sports analyses use advanced statistics and metrics to achieve their goal, whether it be prediction or explanation. Few studies include public opinion data. Last year's highly anticipated NCAA College Football Championship game between Ohio State and Oregon broke ESPN and cable television records with an astounding 33.4 million viewers. Given the popularity of college football, especially now with the inclusion of the new playoff system, people seem to be paying more attention than ever to the game. ESPN provides fans with College Pick'em, which gives them a way to compete with their friends and colleagues on a weekly basis, for free, to see who can correctly pick the winners of college football games. Each week, 10 close matchups are selected, and users must select which team they think will win the game and rank those picks on a scale of 1 (lowest) to 10 (highest), according to their confidence level. For each team, the percentage of users who picked that team and the national average confidence are shown. Ideally, one could use these variables in conjunction with other information to enhance one's own predictions. The analysis described in this session explores the relationship between public opinion data from College Pick'em and the corresponding game outcomes by using visualizations and statistical models implemented by various SAS® products.
Taylor Larkin, The University of Alabama
Matt Collins, University of Alabama
Managing and coordinating various aspects of a report can be challenging. This is especially true when the structure and composition of the report are data driven. For complex reports, the inclusion of attributes such as color, labeling, and the ordering of items complicates the coding process. Fortunately we have some powerful reporting tools in SAS® that allow the process to be automated to a great extent. In the example presented in this paper, we are tasked with generating an Excel spreadsheet that ranks types of injuries within age groups. A given injury type is to be assigned a constant color regardless of its rank, and the labeling is to include not only the injury label but the actual count as well. Of course the user needs to be able to control such things as the age groups, color selection and order, and number of desired ranks.
Art Carpenter, California Occidental Consultants
Fire department facilities in Belgium are in desperate need of renovation due to relentless budget cuts during the last decade. But times have changed, and the current locations are troubled by city-center traffic jams or narrow streets. In other words, their current locations are not ideal. The approach for our project was to find the best possible locations for new facilities, based on historical interventions and live traffic information for fire trucks. The goal was not to give just one answer, but instead to provide SAS® Visual Analytics dashboards that allow policy makers to play with some of the deciding factors and view the impact of removing or adding facilities on the estimated intervention time. For each option, the optimal locations are given. SAS® software offers many great products, but the true power of SAS as a data mining tool is in using the full SAS suite. SAS® Data Integration Studio was used for its parallel processing capabilities in extracting and parsing the travel times via an online API to MapQuest. SAS/OR® was used for the optimization of the model using a Lagrangian multiplier. SAS® Enterprise Guide® was used to combine and integrate the data preparation. SAS Visual Analytics was used to visualize the results and to serve as an interface to execute the model via a stored process.
Wouter Travers, Deloitte
One of the major diseases that records a high number of readmissions is bacterial pneumonia in Medicare patients. This study aims at comparing and contrasting the Northeast and South regions of the United States based on the factors causing 30-day readmissions. The study also identifies some of the ICD-9 medical procedures associated with those readmissions. Using SAS® Enterprise Guide® 7.1 and SAS® Enterprise Miner™ 14.1, this study analyzes patient and hospital demographics of the Northeast and South regions of the United States using the Cerner Health Facts Database. Further, the study suggests some preventive measures to reduce readmissions. The 30-day readmissions are computed based on admission and discharge dates from 2003 until 2013. Using clustering, various hospitals, along with discharge disposition levels (where a patient is sent after discharge), are grouped. In both regions, the patients who are discharged to home have shown significantly lower chances of readmission. The odds ratio is 0.562 for the Northeast region and for the South region; all other disposition levels have significantly higher odds (> 1) compared to discharge to home. For the South region, females have around 53 percent higher odds of readmission compared to males (odds ratio = 1.535). Also, some of the hospital groups have higher readmission cases. The association analysis using the Market Basket node finds catheterization procedures to be highly significant (lift = 1.25 for the Northeast and 3.84 for the South; confidence = 52.94% for the Northeast and 85.71% for the South) in bacterial pneumonia readmissions. Research has found that during these procedures, patients are highly susceptible to acquiring Methicillin-resistant Staphylococcus aureus (MRSA) bacteria, which causes Methicillin-susceptible pneumonia. Providing timely follow-up for the patients who undergo these procedures might limit readmissions. These patients might also be discharged to home under medical supervision, as such patients have shown significantly lower chances of readmission.
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Dr. William Paiva, Center for Health Systems Innovation, OSU, Tulsa
Aditya Sharma, Oklahoma State University
Suppose that you have a very large data set with some specific values in one of its columns, and you want to split the data set into separate comma-separated values (CSV) files based on the values present in that column. You might write code that uses IF/THEN and ELSE conditions in SAS® along with some OUTPUT statements, but if you split the data set that way, converting each of the separated data sets into a CSV file through the conventional, manual process is frustrating. This paper presents a comparative study of two automated approaches: using the SAS macro language with the EXPORT procedure, and using the Output Delivery System (ODS) with the TABULATE procedure. In both approaches, the whole tedious process is done automatically by SAS code.
Saurabh Nandy, Oklahoma State University
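A minimal sketch of the macro approach described above, assuming the splitting column is character and its values are safe to use in file names (data set, column, and directory names are hypothetical):
%macro split_csv(data=, byvar=, outdir=);
   %local i n vals val;
   proc sql noprint;
      select distinct &byvar into :vals separated by '|' from &data;
   quit;
   %let n = &sqlobs;
   %do i = 1 %to &n;
      %let val = %scan(&vals, &i, |);
      proc export data=&data(where=(&byvar="&val"))
                  outfile="&outdir/&val..csv"
                  dbms=csv replace;
      run;
   %end;
%mend split_csv;

%split_csv(data=work.sales, byvar=region, outdir=/project/out);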
Competing-risks analyses are methods for analyzing the time to a terminal event (such as death or failure) and its cause or type. The cumulative incidence function CIF(j, t) is the probability of death by time t from cause j. New options in the LIFETEST procedure provide for nonparametric estimation of the CIF from event times and their associated causes, allowing for right-censoring when the event and its cause are not observed. Cause-specific hazard functions that are derived from the CIFs are the analogs of the hazard function when only a single cause is present. Death by one cause precludes occurrence of death by any other cause, because an individual can die only once. Incorporating explanatory variables in hazard functions provides an approach to assessing their impact on overall survival and on the CIF. This semiparametric approach can be analyzed in the PHREG procedure. The Fine-Gray model defines a subdistribution hazard function that has an expanded risk set, which consists of individuals at risk of the event by any cause at time t, together with those who experienced the event before t from any cause other than the cause of interest j. Finally, with additional assumptions a full parametric analysis is also feasible. We illustrate the application of these methods with empirical data sets.
Joseph Gardiner, Michigan State University
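A minimal sketch of the two approaches described above, with cause 1 as the event of interest and 0 as the censoring code (data set and variable names are hypothetical):
/* Nonparametric CIF estimates by group. */
proc lifetest data=work.transplant plots=cif;
   time disfree_time * status(0) / eventcode=1;
   strata treatment;
run;

/* Fine-Gray subdistribution hazard model for cause 1. */
proc phreg data=work.transplant;
   class treatment / param=ref;
   model disfree_time * status(0) = treatment age / eventcode=1;
run;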
SAS® macros offer a very flexible way of developing, organizing, and running SAS code. In many systems, programming components are stored as macros in files and called as macros via a variety of means so that they can perform the task at hand. Consideration must be given to the organization of these files in a manner consistent with proper use and retention. An example might be the development of some macros that are company-wide in potential application, others that are departmental, and still others that are for personal or individual use. Superimposed on this structure in a factorial fashion is the need to have areas that are considered (1) production (validated), (2) archived (retired), and (3) developmental. Several proposals for accomplishing this are discussed, as well as how this structure might interrelate with stored processes available in some SAS systems. The use of these macros in systems ranging from simple background batch processing to highly interactive SAS and BI processing is discussed; specifically, the drivers or driver files are addressed. Pros and cons of all approaches are considered.
Roger Muller, Data-To-Events, Inc.
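A minimal sketch of one way to layer such areas with the SASAUTOS search path (directory names are hypothetical; the trailing SASAUTOS keyword keeps the macros supplied with SAS available):
options mautosource
        sasautos=("/sasmacros/dev/&sysuserid"     /* developmental, personal  */
                  "/sasmacros/departmental"       /* departmental             */
                  "/sasmacros/production"         /* validated, company-wide  */
                  sasautos);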
Within SAS® literally millions of colors are available for use in our charts, graphs, and reports. We can name these colors using techniques that include color wheels, RGB (Red, Green, Blue) HEX codes, and HLS (Hue, Lightness, Saturation) HEX codes. But sometimes I just want to use a color by name. When I want purple, I want to be able to ask for purple, not CX703070 or H03C5066. But am I limiting myself to just one purple? What about light purple or pinkish purple? Do those colors have names or must I use the codes? It turns out that they do have names. Names that we can use. Names that we can select, names that we can order, names that we can use to build our graphs and reports. This paper shows you how to gather color names and manipulate them so that you can take advantage of your favorite purple, be it purple, grayish purple, vivid purple, or pale purplish blue.
Art Carpenter, California Occidental Consultants
Quasi-complete separation is a commonly detected issue in logit/probit models. Quasi-complete separation occurs when a dependent variable separates an independent variable or a combination of several independent variables to a certain degree. In other words, levels in a categorical variable or values in a numeric variable are separated by groups of discrete outcome variables. Most of the time, this happens with categorical independent variables. Quasi-complete separation can cause convergence failures in logistic regression, which consequently result in inflated coefficient estimates and standard errors. Therefore, it can potentially yield biased results. This paper provides a comprehensive review of various approaches for correcting quasi-complete separation in a binary logistic regression model and presents hands-on examples from the health-care insurance industry. First, it introduces the concept of quasi-complete separation and how to diagnose the issue. Then, it provides step-by-step guidelines for fixing the problem, from a straightforward data configuration approach to complicated statistical modeling approaches such as the EXACT method, the FIRTH method, and so on.
Xinghe Lu, Amerihealth Caritas
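A minimal sketch of two of the corrective approaches mentioned above (data set and variable names are hypothetical):
/* Firth's penalized likelihood reduces the bias caused by separation. */
proc logistic data=work.claims;
   class plan_type / param=ref;
   model readmit(event='1') = plan_type age num_er_visits / firth clparm=pl;
run;

/* Exact conditional analysis of the problematic effect. */
proc logistic data=work.claims;
   class plan_type / param=ref;
   model readmit(event='1') = plan_type age;
   exact plan_type / estimate=both;
run;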
You might already know that SAS® Studio tasks provide you with prompts to fill in the blanks in SAS® code and help you navigate SAS syntax. But did you know that SAS Studio tasks are not only designed to allow you to modify them, but there is an entire common task model (CTM) provided for you to build your own? You can build basic utilities or complex reports. You can just run SAS code or you can have elaborate prompts to dynamically generate code. And since SAS Studio runs in a web browser, your tasks are browser-based and can easily be shared with others. In this paper, you learn how to take a typical SAS program and create a SAS Studio task from it. When the task is run, you see web-based prompts for the dynamic portions of the program and HTML output delivered to your browser.
Kris Kiser, SAS
Christie Corcoran, SAS
Marie Dexter, SAS
Amy Peters, SAS
This workshop shows you how to create powerful interactive visualizations using SAS® Stored Processes to deliver data to JavaScript objects. We construct some simple HTML to make a simple dashboard layout with a range of connected graphs and some static data. Then, we replace the static data with SAS Stored Processes that we build, which use the STREAM and JSON procedures in SAS® 9.4 to deliver the data to the objects. You will see how easy it is to build a bespoke dashboard that resembles that in SAS® Visual Analytics with only SAS Stored Processes and some basic HTML, JavaScript, and CSS 3.
Phil Mason, Wood Street Consultants Ltd.
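A minimal sketch of the data-delivery side described above (the output path is hypothetical; inside a stored process, OUT= could instead point to the _WEBOUT fileref, and PROC STREAM would emit the surrounding HTML and JavaScript):
proc json out="/webroot/dashboard/class.json" pretty;
   export sashelp.class(keep=name age height);   /* data for a JavaScript chart object */
run;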
Discover how to answer the 'Where?' question of data visualization by leveraging SAS® Visual Analytics. Geographical data elements within SAS Visual Analytics provide users with the capability to quickly map data by countries and regions, by states or provinces, and by the centroid of US ZIP codes. This paper demonstrates how easy it is to map by these elements. Of course, once your manager sees your new maps, they will ask for more granular shapes (such as US counties or US ZIP codes). Respond with 'Here it is!' Follow the steps provided to add custom boundaries by parsing the shape files into consumable data objects and loading these custom boundaries into SAS Visual Analytics.
Angela Hall, SAS
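A minimal sketch of the shapefile-parsing step described above (the shapefile and ID variable are hypothetical); the resulting boundary data set can then be thinned and loaded for use as a custom region layer:
proc mapimport datafile="/gis/us_counties.shp" out=work.county_shapes;
run;

proc greduce data=work.county_shapes out=work.county_shapes_reduced;
   id county_fips;    /* identifies each polygon; adds a DENSITY variable for thinning */
run;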
The HIPAA Privacy Rule can restrict geographic and demographic data used in health-care analytics. After reviewing the HIPAA requirements for de-identification of health-care data used in research, this poster guides the beginning SAS® Visual Analytics user through different options to create a better user experience. This poster presents a variety of data visualizations the analyst will encounter when describing a health-care population. We explore the different options SAS Visual Analytics offers and also offer tips on data preparation prior to using SAS® Visual Analytics Designer. Among the topics we cover are SAS Visual Analytics Designer object options (including the geo bubble map, geo region map, crosstab, and treemap), tips for preparing your data for use in SAS Visual Analytics, tips on filtering data after it has been loaded into SAS Visual Analytics, and more.
Jessica Wegner, Optum
Margaret Burgess, Optum
Catherine Olson, Optum
This paper shows how to program a powerful Q-gram algorithm for measuring the similarity of two character strings. Along with built-in SAS® functions--such as SOUNDEX, SPEDIS, COMPGED, and COMPLEV--Q-gram can be a valuable tool in your arsenal of string comparators. Q-gram is especially useful when measuring the similarity of strings that are intuitively identical, but which have a different word order, such as John Smith and Smith, John.
Joe DeShon, Boehringer-Ingelheim
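A minimal sketch of a q-gram (q=2) comparison in a DATA step, computed here as the Dice coefficient of the two strings' bigram multisets (an illustrative definition; the paper's exact formulation may differ):
data qgram_demo;
   length a b $40 used $200;
   infile datalines dlm='|';
   input a b;
   sa = upcase(strip(a));  sb = upcase(strip(b));
   la = lengthn(sa);       lb = lengthn(sb);
   qgram_sim = 0;
   if la >= 2 and lb >= 2 then do;
      used = repeat('0', lb - 2);              /* one flag per bigram of b             */
      common = 0;
      do i = 1 to la - 1;
         qa = substr(sa, i, 2);
         do j = 1 to lb - 1;
            if substr(used, j, 1) = '0' and substr(sb, j, 2) = qa then do;
               common = common + 1;
               substr(used, j, 1) = '1';       /* each bigram of b can match only once */
               j = lb;                         /* leave the inner loop                 */
            end;
         end;
      end;
      qgram_sim = 2 * common / ((la - 1) + (lb - 1));
   end;
   keep a b qgram_sim;
   datalines;
John Smith|Smith, John
Acme Inc|Acme Incorporated
;
run;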
SAS® Grid Manager, like other grid computing technologies, has a set of great capabilities that we, IT professionals, love to have in our systems. This technology increases high availability, allows parallel processing, accommodates increasing demand through scale-out, and offers other features that make life better for those managing and using these environments. However, even when business users take advantage of these features, they are more concerned about the business part of the problem. Most of the time, business groups hold the budgets and are key stakeholders for any SAS Grid Manager project. Therefore, it is crucial to demonstrate to business users how they will benefit from the new technologies, how the features will improve their daily operations, help them be more efficient and productive, and help them achieve better results. This paper guides you through a process to create a strong and persuasive business plan that translates the technology features of SAS Grid Manager into business benefits.
Marlos Bosso, SAS
You've heard that SAS® Output Delivery System (ODS) Graphics provides a powerful and detailed syntax for creating custom graphs, but for whatever reason you still haven't added them to your bag of SAS® tricks. Let's change that! We will also present a code playground based on Microsoft Office that will enable you to quickly try out a variety of prepared SAS ODS Graphics examples, tweak the code, and see the results--all directly from Microsoft Excel. More experienced users will also find the code playground (which is similar in spirit to Google Code Playground or JSFiddle) useful for compiling SAS ODS Graphics code snippets for themselves and for sharing with colleagues, as well as for creating dashboards hosted by Microsoft Excel or Microsoft PowerPoint that contain precisely sized and placed SAS graphics.
Ted Conway, Discover Financial Services
Ezequiel Torres, MPA Healthcare Solutions, Inc.
Credit card profitability prediction is a complex problem because of the variety of card holders' behavior patterns and the different sources of interest and transactional income. Each consumer account can move to a number of states such as inactive, transactor, revolver, delinquent, or defaulted. This paper i) describes an approach to credit card account-level profitability estimation based on the multistate and multistage conditional probabilities models and different types of income estimation, and ii) compares methods for the most efficient and accurate estimation. We use application, behavioral, card state dynamics, and macroeconomic characteristics, and their combinations as predictors. We use different types of logistic regression such as multinomial logistic regression, ordered logistic regression, and multistage conditional binary logistic regression with the LOGISTIC procedure for state transition probability estimation. The state transition probabilities are used as weights for the interest rate and non-interest income models (which one is applied depends on the account state). Thus, the scoring model is split according to the customer behavior segment and the source of generated income. The total income consists of interest and non-interest income. Interest income is estimated with the credit limit utilization rate models. We test and compare five proportion models with the NLMIXED, LOGISTIC, and REG procedures in SAS/STAT® software. Non-interest income depends on the probability of being in a particular state, the two-stage model of conditional probability to make a point-of-sale transaction (POS) or cash withdrawal (ATM), and the amount of income generated by this transaction. We use the LOGISTIC procedure for conditional probability prediction and the GLIMMIX and PANEL procedures for direct amount estimation with pooled and random-effect panel data. The validation results confirm that traditional techniques can be effectively applied to complex tasks with many parameters and multilevel business logic. The model is used in credit limit management, risk prediction, and client behavior analytics.
Denys Osipenko, the University of Edinburgh Business School
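A minimal sketch of the state transition step described above, using a multinomial (generalized) logit in PROC LOGISTIC. The data set ACCOUNTS, the response NEXT_STATE, and the predictors are hypothetical names, not taken from the paper.

/* Multinomial logit for account state transitions.
   ACCOUNTS, NEXT_STATE, and the predictors are illustrative names. */
proc logistic data=accounts;
   class current_state (ref='transactor') / param=ref;
   model next_state = current_state utilization_rate months_on_book
                      unemployment_rate / link=glogit;
   output out=trans_probs predprobs=individual;  /* one predicted probability per state */
run;

The predicted state probabilities in TRANS_PROBS could then serve as the weights for the state-specific income models the abstract describes.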
The recent advances in regulatory stress testing, including stress testing regulated by Comprehensive Capital Analysis and Review (CCAR) in the US, the Prudential Regulation Authority (PRA) in the UK, and the European Banking Authority in the EU, as well as the new international accounting requirement known as IFRS 9 (International Financial Reporting Standard), all pose new challenges to credit risk modeling. The increasing sophistication of the models that are supposed to cover all the material risks in the underlying assets in various economic scenarios makes models harder to implement. Banks are spending a lot of resources on the model implementation but are still facing issues due to long time to deployment and disconnection between the model development and implementation teams. Models are also required at a more granular level, in many cases, down to the trade and account levels. Efficient model execution becomes valuable for banks to get timely response to the analysis requests. At the same time, models are subject to more stringent internal and external scrutiny. This paper introduces a suite of risk modeling solutions from credit risk modeling leader SAS® to help banks overcome these new challenges and be competent to meet the regulatory requirements.
Wei Chen, SAS
Martim Rocha, SAS
Jimmy Skoglund, SAS
There are standard risk metrics financial institutions use to assess the risk of a portfolio. These include well known measures like value at risk and expected shortfall and related measures like contribution value at risk. While there are industry-standard approaches for calculating these measures, it is often the case that financial institutions have their own methodologies. Further, financial institutions write their own measures, in addition to the common risk measures. SAS® High-Performance Risk comes equipped with over 20 risk measures that use standard methodology, but the product also allows customers to define their own risk measures. These user-defined statistics are treated the same way as the built-in measures, but the logic is specified by the customer. This paper leads the user through the creation of custom risk metrics using the HPRISK procedure.
Katherine Taylor, SAS
Steven Miles, SAS
Customer retention is a primary concern for businesses that rely on a subscription revenue model. It is common for marketers of subscription-based offerings to develop predictive models that are aimed at identifying subscribers who have the highest risk of attrition. With these likely unsubscribes identified, marketers then attempt to forestall customer termination by using a variety of retention enhancement tactics, which might include free offers, customer training, satisfaction surveys, or other measures. Although customer retention is always a worthy pursuit, it is often expensive to retain subscribers. In many cases, associated retention programs simply prove unprofitable over time because the overall cost of such programs frequently exceeds the lifetime value of the cohort of unsubscribed customers. Generally, it is more profitable to focus resources on identifying and marketing to a targeted prospective customer. When the target marketing strategy focuses on identifying prospects who are most likely to subscribe over the long term, the need for special retention marketing efforts decreases sharply. This paper describes results of an analytically driven targeting approach that is aimed at inviting new customers to a milk and grocery home-delivery service, with the promise of attracting only those prospects who are expected to exhibit high long-term retention rates.
Customer lifetime value (LTV) estimation involves two parts: the survival probabilities and profit margins. This article describes the estimation of those probabilities using discrete-time logistic hazard models. The estimation of profit margins is based on linear regression. In the scenario, when outliers are present among margins, we suggest applying robust regression with PROC ROBUSTREG.
Vadim Pliner, Verizon Wireless
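A minimal sketch of the two-part LTV estimation described above, assuming hypothetical data sets CUST_PERIODS (one record per customer per period, with a churn indicator) and CUST_MARGINS; all names and predictors are illustrative rather than taken from the paper.

/* Part 1: discrete-time hazard of churn via logistic regression on
   customer-period records (CUST_PERIODS, CHURN, PERIOD are hypothetical). */
proc logistic data=cust_periods;
   class period plan_type / param=ref;
   model churn(event='1') = period plan_type tenure monthly_usage;
   output out=hazard p=p_churn;        /* per-period churn probability */
run;

/* Part 2: profit margins via robust regression to damp the effect of outliers. */
proc robustreg data=cust_margins method=m;
   class plan_type;
   model margin = plan_type tenure monthly_usage;
   output out=margin_pred predicted=pred_margin;
run;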
Introduction: Cycling is on the rise in many urban areas across the United States. The number of cyclist fatalities is also increasing, by 19% in the last 3 years. With the broad-ranging personal and public health benefits of cycling, it is important to understand the factors that are associated with these traffic-related deaths. There are more distracted drivers on the road than ever before, but the question remains to what extent these drivers affect cycling fatality rates. Methods: This paper uses Fatality Analysis Reporting System (FARS) data to examine factors related to cyclist deaths when drivers are distracted. We use a novel machine learning approach, adaptive LASSO, to determine the relevant features and estimate their effects. Results: If a cyclist makes an improper action at or just before the time of the crash, the likelihood that the driver of the vehicle was distracted decreases. At the same time, if the driver is speeding or has failed to obey a traffic sign and fatally hits a cyclist, the likelihood that the driver was also distracted increases. Distraction is related to other risky driving practices when cyclists are fatally injured. Environmental factors such as weather and road condition did not affect the likelihood that a driver was distracted when a cyclist fatality occurred.
Lysbeth Floden, University of Arizona
Dr Melanie Bell, Dept of Epidemiology & Biostatistics, University of Arizona
Patrick O'Connor, University of Arizona
The DATA step and DS2 both offer the user a built-in general purpose hash object that has become the go-to tool for many data analysis problems. However, there are occasions where the best solution would require a custom object specifically tailored to the problem space. The DS2 Package syntax allows the user to create custom objects that can form linked structures in memory. With DS2 Packages it is possible to create lists or tree structures that are custom tailored to the problem space. For data that can describe a great many more states than actually exist, dynamic structures can provide an extremely compact way to manipulate and analyze the data. The SAS® In-Database Code Accelerator allows these custom packages to be deployed in parallel on massive data grids.
Robert Ray, SAS
During the course of a clinical trial study, large numbers of new and modified data records are received on an ongoing basis. Providing end users with an approach to continuously review and monitor study data, while enabling them to focus reviews on new or modified (incremental) data records, allows for greater efficiency in identifying potential data issues. In addition, supplying data reviewers with a familiar machine-readable output format (for example, Microsoft Excel) allows for greater flexibility in filtering, highlighting, and retention of data reviewers' comments. In this paper, we outline an approach using SAS® in a Windows server environment and a shared folder structure to automatically refresh data review listings. Upon each execution, the listings are compared against previously reviewed data to flag new and modified records, as well as carry forward any data reviewers' comments made during the previous review. In addition, we highlight the use and capabilities of the SAS® ExcelXP tagset, which enables greater control over data formatting, including management of Microsoft Excel's sometimes undesired automatic formatting. Overall, this approach provides a significantly improved end-user experience above and beyond the more traditional approach of performing cumulative or incremental data reviews using PDF listings.
Victor Lopez, Samumed, LLC
Heli Ghandehari, Samumed, LLC
Bill Knowlton, Samumed, LLC
Christopher Swearingen, Samumed, LLC
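A minimal sketch of the kind of ExcelXP-based review listing described above; the data set name, flag variable, column names, and file path are hypothetical, and the real workflow also compares each refresh against the previously reviewed snapshot to carry comments forward.

/* Write an incremental data review listing to Excel with the ExcelXP tagset.
   WORK.REVIEW_LISTING and RECORD_STATUS are illustrative names. */
ods tagsets.excelxp file="/shared/review/ae_listing.xml"
    options(sheet_name='AE Review' autofilter='all' frozen_headers='1');

proc print data=work.review_listing noobs;
   var subject_id ae_term onset_date record_status reviewer_comment;
   where record_status in ('NEW', 'MODIFIED');   /* focus on incremental records */
run;

ods tagsets.excelxp close;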
Transforming data into intelligence for effective decision-making support depends critically on the role and capacity of the Office of Institutional Research (OIR) in managing the institution's data. The presenters share their journey from providing spreadsheet data to developing SAS® programs and dashboards using SAS® Visual Analytics. They demonstrate two dashboards the OIR office developed: one for classroom utilization and one for the university's diversity initiatives. They describe the steps taken to create the dashboards and the process the office followed to get stakeholders involved in determining the key performance indicators (KPIs) and in evaluating and providing feedback on the dashboards. They also share the experience gained and lessons learned in building the dashboards.
Shweta Doshi, University of Georgia
Julie Davis, University of Georgia
This presentation describes a technique for dealing with the precision problems inherent in datetime values containing nanosecond data. Floating-point values cannot store sufficient precision for this, and this limitation can be a problem for transactional data where nanoseconds are pertinent. Methods discussed include separation of variables and using the special GROUPFORMAT feature of the BY statement with MERGE in the DATA step.
Rick Langston, SAS
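A minimal sketch of the separation-of-variables idea described above, assuming a hypothetical character timestamp layout; the paper's GROUPFORMAT-based BY and MERGE handling is not reproduced here.

/* Split a nanosecond timestamp into a datetime of whole seconds plus an
   integer nanosecond remainder, so no precision is lost in floating point.
   WORK.TICKS and the RAW_TS layout are illustrative. */
data work.ticks_split;
   set work.ticks;            /* raw_ts like 2016-04-18T09:30:01.123456789 */
   dt_sec  = input(substr(raw_ts, 1, 19), e8601dt19.);   /* whole seconds */
   nanosec = input(substr(raw_ts, 21, 9), 9.);           /* 0-999999999 remainder */
   format dt_sec datetime19.;
run;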
Debt collection! The two words can trigger multiple images in one's mind--mostly harsh. However, let's try to think positively for a moment. In 2013, over $55 billion of debt was past due in the United States. What if all of these debts were left as is and the fate of credit issuers rested in the hands of goodwill payments made by defaulters? Well, this is not the most sustainable model, to say the least. In this situation, debt collection comes in as a tool that is employed at multiple levels of recovery to keep the credit flowing. Ranging from in-house to third-party to individual collection efforts, this industry is huge and plays an important role in keeping the engine of commerce running. In the recent past, with financial markets recovering and banks selling fewer charged-off accounts and at higher prices, debt collection has increasingly become a game of efficient operations backed by solid analytics. This paper takes you into the back alleys of all the data that is involved and gives an overview of some ways modeling can be used to influence the collection strategy and outcome. SAS® tools such as SAS® Enterprise Miner™ and SAS® Enterprise Guide® are used extensively for both data manipulation and modeling. Decision trees receive particular focus in order to understand which factors make the most impact. Along the way, this paper also gives an idea of how analytics teams today are slowly trying to get buy-in from other stakeholders in a company, which, surprisingly, is one of the most challenging aspects of our jobs.
Karush Jaggi, AFS Acceptance
Harold Dickerson, SquareTwo Financial
Thomas Waldschmidt, SquareTwo Financial
Do you know how many different ways SAS® Studio can run your programs with SAS® Grid Manager? SAS Studio is the latest and coolest interface to the SAS® software. As such, we want to use it in most situations, including sites that leverage SAS Grid Manager. Are you new to SAS and want to be guided by a modern GUI? SAS Studio is here to help you. Are you a code fanatic, who wants total control of how your program runs to harness the full power of SAS Grid Manager? Sure, SAS Studio is for you, too. This paper covers all the different SAS Studio editions. You can learn how to connect each of them to SAS Grid Manager and discover best practices for harnessing a high-performance SAS analytics environment, while avoiding potential pitfalls.
Edoardo Riva, SAS
Metadata is an integral and critical part of any environment. Metadata facilitates resource discovery and provides unique identification of every single digital component of a system, simple to complex. SAS® Visual Analytics, one of the most powerful analytics visualization platforms, leverages the power of metadata to provide a plethora of functionalities for all types of users. The possibilities range from real-time advanced analytics and power-user reporting to advanced deployment features for a robust and scalable distributed platform to internal and external users. This paper explains the best practices and advanced approaches for designing and managing metadata for a distributed global SAS Visual Analytics environment. Designing and building the architecture of such an environment requires attention to important factors like user groups and roles, access management, data protection, data volume control, performance requirements, and so on. This paper covers how to build a sustainable and scalable metadata architecture through a top-down hierarchical approach. It helps SAS Visual Analytics Data Administrators to improve the platform benchmark through memory mapping, perform administrative data load (AUTOLOAD, Unload, Reload-on-Start, and so on), monitor artifacts of distributed SAS® LASR™ Analytic Servers on co-located Hadoop Distributed File System (HDFS), optimize high-volume access via FullCopies, build customized FLEX themes, and so on. It showcases practical approaches to managing distributed SAS LASR Analytic Servers, offering guest access for global users, managing host accounts, enabling Mobile BI, using power-user reporting features, customizing formats, enabling home page customization, using best practices for environment migration, and much more.
Ratul Saha, Kavi Associates
Vimal Raj Arockiasamy, Kavi Associates
Vignesh Balasubramanian, Kavi Global
Electrolux is one of the largest appliance manufacturers in the world. Electrolux North America sells more than 2,000 products to end consumers through 9,000 business customers. To grow and increase profitability under challenging market conditions, Electrolux partnered with SAS® to implement an integrated platform, SAS® for Demand-Driven Planning and Optimization, and improve service levels for its customers. The process uses historical order data to create a statistical monthly forecast. The Electrolux team then reviews the statistical forecast in SAS® Collaborative Planning Workbench, where they can add value based on their business insights and promotional information. This improved monthly forecast is broken down to the weekly level, where it flows into SAS® Inventory Optimization Workbench. SAS Inventory Optimization Workbench then computes weekly inventory targets to satisfy the forecasted demand at the desired service level. This presentation also covers how Electrolux implemented this project. Prior to the commencement of the project, Electrolux and the SAS team jointly worked to quantify the value of the project and set the right expectations with the executive team. A detailed timeline with regular updates helped provide visibility to all stakeholders. Finally, a clear change management strategy was also developed to define the roles and responsibilities after the implementation of SAS for Demand-Driven Planning and Optimization.
Pratapsinh Patil, ELECTROLUX
Aaron Raymond, Electrolux
Sachin Verma, Electrolux
When you create reports in SAS® Visual Analytics, you automatically have reports that work on mobile devices. How do you ensure that the reports are easy to use and understand on all of your desktops, tablets, and phones? This paper describes how you can design powerful reports that your users can easily view on all their devices. You also learn how to deliver reports to users effortlessly, ensuring that they always have the latest reports. Examples show you tips and techniques to use that create the best possible reports for all devices. The paper provides sample reports that you can download and interactively view on your own devices. These reports include before and after examples that illustrate why the recommended best practices are important. By using these tips and techniques you learn how to design a report once and have confidence that it can be viewed anywhere.
Karen Mobley, SAS
Rich Hogan, SAS
Pratik Phadke, SAS
Phishing is the attempt of a malicious entity to acquire personal, financial, or otherwise sensitive information such as user names and passwords from recipients through the transmission of seemingly legitimate emails. By quickly alerting recipients of known phishing attacks, an organization can reduce the likelihood that a user will succumb to the request and unknowingly provide sensitive information to attackers. Methods to detect phishing attacks typically require the body of each email to be analyzed. However, most academic institutions do not have the resources to scan individual emails as they are received, nor do they wish to retain and analyze message body data. Many institutions simply rely on the education and participation of recipients within their network. Recipients are encouraged to alert information security (IS) personnel of potential attacks as they are delivered to their mailboxes. This paper explores a novel and more automated approach that uses SAS® to examine email header and transmission data to determine likely phishing attempts that can be further analyzed by IS personnel. Previously, a collection of 2,703 emails from an external filtering appliance was examined with moderate success. This paper focuses on the gains from analyzing an additional 50,000 emails, with the inclusion of an additional 30 known attacks. Real-time email traffic is exported from Splunk Enterprise into SAS for analysis. The resulting model helps alert IS personnel to potential phishing attempts faster than a user simply forwarding a suspicious email to IS personnel.
Taylor Anderson, University of Alabama
Denise McManus, University of Alabama
This paper covers developing SAS® Studio repositories. SAS Studio introduced a new way of developing custom tasks using an XML markup specification and the Apache Velocity templating language. SAS Studio repositories build on this framework and provide a flexible way to package custom tasks and snippets. After tasks and snippets are packaged into a repository, they can be shared with users inside your organization or outside your organization. This paper uses several examples to help you create your first repository.
Swapnil Ghan, SAS
Michael Monaco, SAS
Amy Peters, SAS
As SAS® programmers, we often develop listings, graphs, and reports that need to be delivered frequently to our customers. We might decide to manually run the program every time we get a request, or we might easily schedule an automatic task to send a report at a specific date and time. Both scenarios have some disadvantages. If the report is manual, we have to find and run the program every time someone requests an updated version of the output. It takes time and is not the most interesting part of the job. If we schedule an automatic task in Windows, we still sometimes get an email from the customers because they need the report immediately. That means that we have to find and run the program for them. This paper explains how we developed an on-demand report platform using SAS® Enterprise Guide®, SAS® Web Application Server, and stored processes. We had developed many reports for different customer groups, and we were getting more and more emails from them asking for updated versions of their reports. We felt we were not using our time wisely and decided to create an infrastructure where users could easily run their programs through a web interface. The tool that we created enables SAS programmers to easily release on-demand web reports with minimal programming. It has web interfaces developed using stored processes for the administrative tasks, and it also automatically customizes the front end based on the user who connects to the website. One of the challenges of the project was that certain reports had to be available to a specific group of users only.
Romain Miralles, Genomic Health
This paper presents an application based on predictive analytics and feature-extraction techniques to develop an alternative method for diagnosing obstructive sleep apnea (OSA). Our method reduces the time and cost associated with the gold standard, polysomnography (PSG), which is operated manually, by automatically determining a patient's OSA severity via classification models using the time series from a one-lead electrocardiogram (ECG). The data is from Dr. Thomas Penzel of Philipps-University, Germany, and can be downloaded at www.physionet.org. The selected data consists of 10 recordings (7 OSAs and 3 controls) of ECG collected overnight, and non-overlapping minute-by-minute OSA episode annotations (apnea and non-apnea states). This accounts for a total of 4,998 events (2,532 non-apnea and 2,466 apnea minutes). This paper highlights the nonlinear decomposition technique, wavelet analysis (WA), in SAS/IML® software to maximize the information about OSA symptoms extracted from the ECG, resulting in useful predictor signals. Then, spectral and cross-spectral analyses via PROC SPECTRA are used to quantify important patterns of those signals as numbers (features), namely power spectral density (PSD), cross power spectral density (CPSD), and coherency, such that the machine learning techniques in SAS® Enterprise Miner™ can differentiate OSA states. To eliminate variations such as body build, age, gender, and health condition, we normalize each feature by the corresponding feature of its original signal (that is, the ratio of the PSD of the ECG's WA to the PSD of the ECG). Moreover, because different OSA symptoms occur at different times, we account for this by taking features from adjacent minutes into the analysis and selecting only the important ones using a decision tree model. The best classification result in the validation data (70:30) obtained from the Random Forest model is 96.83% accuracy, 96.39% sensitivity, and 97.26% specificity. The results suggest that our method compares well to the gold standard.
Woranat Wongdhamma, Oklahoma State University
Today's employment and business marketplace is highly competitive. As a result, it is necessary for SAS® professionals to differentiate themselves from the competition. Success depends on a number of factors, including positioning yourself with the necessary technical skills in relation to the competition. This presentation illustrates how SAS professionals can acquire a wealth of knowledge and enhance their skills by accessing valuable and free web content related to SAS. With the aid of a web browser and the Internet, anyone can access published PDF papers, Microsoft Word documents, Microsoft PowerPoint presentations, comprehensive student notes, instructor lesson plans, hands-on exercises, webinars, audios, videos, a comprehensive technical support website maintained by SAS, and more to acquire the essential expertise that is needed to cut through all the marketplace noise and begin differentiating yourself to secure desirable opportunities with employers and clients.
Kirk Paul Lafler, Software Intelligence Corporation
While cardiac revascularization procedures like cardiac catheterization, percutaneous transluminal angioplasty, and cardiac artery bypass surgery have become standard practices in restorative cardiology, the practice is not evenly prescribed or subscribed to. We analyzed Florida hospital discharge records for the period 1992 to 2010 to determine the odds of receipt of any of these procedures by Hispanics and non-Hispanic Whites. Covariates (potential confounders) were age, insurance type, gender, and year of discharge. Additional covariates considered included comorbidities such as hypertension, diabetes, obesity, and depression. The results indicated that even after adjusting for covariates, Hispanics in Florida during the time period 1992 to 2010 were consistently less likely to receive these procedures than their White counterparts. Reasons for this phenomenon are discussed.
C. Perry Brown, Florida A&M University
Jontae Sanders, Florida Department of Health
At a community college, there was a need for college employees to quickly and easily find available classroom time slots for the purposes of course scheduling. The existing method was time-consuming and inefficient, and there were no available IT resources to implement a solution. The Office of Institutional Research, which had already been delivering reports using SAS® Enterprise BI Server, created a report called Find an Open Room to fill the need. By combining SAS® programming techniques, a scheduled SAS® Enterprise Guide® project, and a SAS® Web Report Studio report delivered within the SAS® Information Delivery Portal, a report was created that allowed college users to search for available time slots.
Nicole Jagusztyn, Hillsborough Community College
The new SAS® High-Performance Statistics procedures were developed to respond to the growth of big data and computing capabilities. Although there is some documentation regarding how to use these new high-performance (HP) procedures, relatively little has been disseminated regarding under what specific conditions users can expect performance improvements. This paper serves as a practical guide to getting started with HP procedures in SAS®. The paper describes the differences between key HP procedures (HPGENSELECT, HPLMIXED, HPLOGISTIC, HPNLMOD, HPREG, HPCORR, HPIMPUTE, and HPSUMMARY) and their legacy counterparts both in terms of capability and performance, with a particular focus on discrepancies in real time required to execute. Simulations were conducted to generate data sets that varied on the number of observations (10,000, 50,000, 100,000, 500,000, 1,000,000, and 10,000,000) and the number of variables (50, 100, 500, and 1,000) to create these comparisons.
Diep Nguyen, University of South Florida
Sean Joo, University of South Florida
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
Jessica Montgomery, University of South Florida
Patricia Rodríguez de Gil, University of South Florida
Yan Wang, University of South Florida
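A minimal sketch contrasting a legacy procedure with its high-performance counterpart on a hypothetical simulated data set; the data set name, predictors, and PERFORMANCE settings are illustrative, not the paper's simulation code.

/* Legacy procedure */
proc logistic data=work.sim_large;
   model y(event='1') = x1-x50;
run;

/* High-performance counterpart: multithreads on a single machine by default
   and can run distributed when a grid is configured. */
proc hplogistic data=work.sim_large;
   model y(event='1') = x1-x50;
   performance details;      /* report timing and the execution environment */
run;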
Discover how to document your SAS® programs, data sets, and catalogs with a few lines of code that include SAS functions, macro code, and SAS metadata. Do you start every project with the best of intentions to document all of your work, and then fall short of that aspiration when deadlines loom? Learn how your programs can automatically update your processing log. If you have ever wondered who ran a program that overwrote your data, SAS has the answer! And if you don't want to trace back through a year's worth of code to produce a codebook for your client at the end of a contract, SAS has the answer!
Roberta Glass, Abt Associates
Louise Hadden, Abt Associates
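A minimal sketch of the kind of self-documenting step described above, assuming a hypothetical permanent data set MYLIB.ANALYSIS and a processing-log data set MYLIB.RUN_LOG; the paper's own macros are not reproduced here.

/* Append a row to a processing log using automatic macro variables
   and the DICTIONARY metadata tables. MYLIB.RUN_LOG is hypothetical. */
proc sql;
   create table work.this_run as
   select "&sysuserid"      as run_by      length=32,
          "&sysprocessname" as sas_process length=80,
          datetime()        as run_dt      format=datetime20.,
          memname           as data_set    length=32,
          nobs, nvar, modate
   from dictionary.tables
   where libname = 'MYLIB' and memname = 'ANALYSIS';
quit;

proc append base=mylib.run_log data=work.this_run force;
run;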
In any organization where people work with data, it is extremely unlikely that there will be only one way of doing things. Things like divisional silos and differences in education, training, and subject matter often result in a diversity of tools and levels of expertise. One situation that frequently arises is that of 'code versus click.' Let's call it the difference between code-based, 'power user' data tools and simpler, purely graphic point-and-click tools such as Microsoft Excel. Even though the work itself might be quite similar, differences in analysis tools often mean differences in vocabulary and experience, and it can be difficult to convert users of one tool to another. This discussion will highlight the potential challenges of SAS® adoption in an Excel-based workplace and propose strategies to gain new SAS advocates in your organization.
Andrew Clapson, MD Financial Management
Dynamic interactive visual displays known as dashboards are most effective when they show essential graphs, tables, statistics, and other information where data is the star. The first rule for creating an effective SAS® dashboard is to keep it simple. Striking a balance between content and style, a dashboard should be void of clutter so as not to distract from or obscure the information displayed. The second rule of effective dashboard design is to display data that meets one or more business or organizational objectives. To accomplish this, the elements in a dashboard should convey a format easily understood by its intended audience. Attendees learn how to create dynamic interactive user- and data-driven dashboards, graphical and table-driven dashboards, statistical dashboards, and drill-down dashboards with a purpose by using Base SAS® programming techniques including the DATA step, PROC FORMAT, PROC PRINT, PROC MEANS, PROC SQL, ODS, statistical graphics, and HTML.
Kirk Paul Lafler, Software Intelligence Corporation
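A minimal sketch of the trafficlighting idea behind such dashboards, assuming a hypothetical summary data set WORK.KPI with a RATE column; the 90/95 thresholds, colors, and names are illustrative only.

/* Color-code a KPI table for an HTML dashboard using a user-defined format. */
proc format;
   value ratebg  low -< 90  = 'salmon'
                 90  -< 95  = 'lightyellow'
                 95  - high = 'lightgreen';
run;

ods html file='kpi_dashboard.html' style=htmlblue;

proc print data=work.kpi noobs;
   var region n_cases;
   var rate / style(data)={backgroundcolor=ratebg.};   /* trafficlighting via format */
run;

ods html close;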
Many organizations need to report forecasts of large numbers of time series at various levels of aggregation. Numerous model-based forecasts that are statistically generated at the lowest level of aggregation need to be combined to form an aggregate forecast that is not required to follow a fixed hierarchy. The forecasts need to be dynamically aggregated according to any subset of the time series, such as from a query. This paper proposes a technique for large-scale automatic forecast aggregation and uses SAS® Forecast Server and SAS/ETS® software to demonstrate this technique.
Michael Leonard, SAS
For any SAS® Platform Administrator, the challenge is to ensure the right jobs run for the right users with the right resources at the right time. Running on a grid-enabled SAS platform greatly assists this task, allowing prioritization of processes to ensure an optimal mix of critical batch processing and interactive user sessions. A key feature of SAS® Grid Computing 9.4 is grid options sets, which make it even easier to manage options, resources, and queues. Grid options sets allow different user groups to run SAS applications on a common shared SAS Application Server Context (such as SASApp), but the applications are still tailored to suit the requirements of each application and user group. Offering much more than just queue selection, grid options sets in SAS Grid Computing 9.4 now allow SAS Platform Administrators to effectively have 'conditional queues,' 'conditional configuration options,' and 'conditional resource requirements,' all centrally located and managed within a single SAS Application Server Context. This paper examines the benefits of SAS Grid Computing processing, looks at some of the issues encountered in previous versions of SAS Grid Computing, and explains how these issues are managed more effectively on a SAS 9.4 platform, thanks to SAS grid options sets.
Andrew Howell, ANJ Solutions P/L
Whether you have been programming in SAS® for years, are new to it, or have dabbled with SAS® Enterprise Guide® before, this hands-on workshop sheds some light on the depth, breadth, and power of the Enterprise Guide environment. With all the demands on your time, you need powerful tools that are easy to learn and deliver end-to-end support for your data exploration, reporting, and analytics needs. Included are the following: data exploration tools; formatting code (cleaning up after your coworkers); the enhanced programming environment (and how to calm it down); easily creating reports and graphics; producing the output formats you need (XLS, PDF, RTF, HTML); workspace layout; start-up processing; and notes to help your coworkers use your processes. This workshop uses SAS Enterprise Guide 7.1, but most of the content is applicable to earlier versions.
Marje Fecht, Prowerk Consulting
Data-driven decision making is critical for any organization to thrive in this fiercely competitive world. The decision-making process has to be accurate and fast in order to stay a step ahead of the competition. One major problem organizations face is huge data load times in loading or processing the data. Reducing the data loading time can help organizations perform faster analysis and thereby respond quickly. In this paper, we compared the methods that can import data of a particular file type in the shortest possible time and thereby increase the efficiency of decision making. SAS® takes input from various file types (such as XLS, CSV, XLSX, ACCESS, and TXT) and converts that input into SAS data sets. To perform this task, SAS provides multiple solutions (such as the IMPORT procedure, the INFILE statement, and the LIBNAME engine) to import the data. We observed the processing times taken by each method for different file types with a data set containing 65,535 observations and 11 variables. We executed the procedure multiple times to check for variation in processing time. From these tests, we recorded the minimum processing time for the combination of procedure and file type. From our analysis of processing times taken by each importing technique, we observed that the shortest processing times for CSV and TXT files, XLS and XLSX files, and ACCESS files are the INFILE statement, the LIBNAME engine, and PROC IMPORT, respectively.
Divya Dadi, Oklahoma State University
Rahul Jhaver, Oklahoma State University
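A minimal sketch contrasting the three import approaches compared above, assuming hypothetical Windows file paths and table names; the timing harness used in the paper is not reproduced.

/* 1) PROC IMPORT for a Microsoft Access table (requires SAS/ACCESS to PC Files). */
proc import table='sales' out=work.sales_acc dbms=access replace;
   database='C:\data\sales.accdb';
run;

/* 2) INFILE statement for a CSV file. */
data work.sales_csv;
   infile 'C:\data\sales.csv' dsd firstobs=2 truncover;
   input region :$20. product :$20. qty price;
run;

/* 3) LIBNAME engine for an XLSX workbook. */
libname xl xlsx 'C:\data\sales.xlsx';
data work.sales_xlsx;
   set xl.sheet1;
run;
libname xl clear;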
This session illustrates how to quickly create rates over a specified period of time, using the MEANS and EXPAND procedures. For example, do you want to know how to use the power of SAS® to create a year-to-date, rolling 12-month, or monthly rate? At Kaiser Permanente, we use this technique to develop Emergency Department (ED) use rates, ED admit rates, patient day rates, readmission rates, and more. A powerful function of PROC MEANS, given a database table with several dimensions and one or more facts, is to perform a mathematical calculation on fact columns across several different combinations of dimensions. For example, if a membership database table exists with the dimensions member ID, year-month, line of business, medical office building, and age grouping, PROC MEANS can easily determine and output the count of members by every possible dimension combination into a SAS data set. Likewise, if a hospital visit database table exists with the same dimensions and facts, PROC MEANS can output the number of hospital visits by the dimension combinations into a second SAS data set. With the power of PROC EXPAND, each of the data sets above, once sorted properly, can have columns added, which calculate total members and total hospital visits by a time dimension of the analyst's choice. Common time dimensions used for Kaiser Permanente's utilization rates are monthly, rolling 12-months, and year-to-date. The resulting membership and hospital visit data sets can be joined with a MERGE statement, and simple division produces a rate for the given dimensions.
Thomas Gant, Kaiser Permanente
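A minimal sketch of the MEANS-then-EXPAND pattern described above, assuming hypothetical member tables keyed by a year-month date variable; data set and variable names are illustrative, not Kaiser Permanente's.

/* Step 1: counts by dimension combination (MEMBERS is hypothetical). */
proc means data=members nway noprint;
   class yearmonth line_of_business;
   var member_flag;
   output out=member_counts (drop=_type_ _freq_) sum=members;
run;

/* Step 2: rolling 12-month totals within each line of business. */
proc sort data=member_counts;
   by line_of_business yearmonth;
run;

proc expand data=member_counts out=member_roll method=none;
   by line_of_business;
   id yearmonth;
   convert members = members_12m / transformout=(movsum 12);
run;

The same two steps applied to a hospital-visit table yield the numerator, and a MERGE by the shared dimensions followed by a simple division produces the rate.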
You have SAS® Enterprise Guide® installed. You use SAS Enterprise Guide in your day-to-day work. You see how Enterprise Guide can be an aid to accessing data and insightful analytics. You have people you work with or support who are new to SAS® and want to learn. You have people you work with or support who don't particularly want to code but use the GUI and wizard within Enterprise Guide. And then you have the spreadsheet addict, the person or group who refuse to even sign on to SAS. These people need to consume the data sitting in SAS, and they need to do analysis, but they want to do it all in a spreadsheet. But you need to retain an audit trail of the data, and you have to reduce the operational risk of using spreadsheets for reporting. What do you do? This paper shares some of the challenges and triumphs in empowering these very different groups of people using SAS.
Anita Measey, Bank of Montreal
What will your customer do next? Customers behave differently; they are not all average. Segmenting your customers into different groups enables you to provide different communications and interactions for the different segments, resulting in greater customer satisfaction as well as increased profits. Using SAS® Visual Analytics and SAS® Visual Statistics to visualize your segments with respect to customer attributes enables you to create more useful segments for customer relationship management and to understand the value and relative importance of different customer attributes. You can segment your customers by using the following methods: 1) business rules; 2) supervised clustering--decision trees and so on; 3) unsupervised clustering; 4) creating segments based on quantile membership. Whatever way you choose, SAS Visual Analytics enables you to graphically represent your customer data with respect to demographic, geographic, and customer behavioral dimensions. This paper covers the four segmentation techniques and demonstrates how SAS Visual Analytics and SAS Visual Statistics can be used for easy and comprehensive understanding of your customers.
Darius Baer, SAS
Suneel Grover, SAS
Ensemble models are a popular class of methods for combining the posterior probabilities of two or more predictive models in order to create a potentially more accurate model. This paper summarizes the theoretical background of recent ensemble techniques and presents examples of real-world applications. Examples of these novel ensemble techniques include weighted combinations (such as stacking or blending) of predicted probabilities in addition to averaging or voting approaches that combine the posterior probabilities by adding one model at a time. Fit statistics across several data sets are compared to highlight the advantages and disadvantages of each method, and process flow diagrams that can be used as ensemble templates for SAS® Enterprise Miner™ are presented.
Wendy Czika, SAS
Ye Liu, SAS Institute
Now that you have deployed SAS®, what should you do to ensure it continues to meet your SAS users' performance expectations? This paper discusses how to proactively monitor your SAS infrastructure, with tools that should be used on a regular basis to keep tabs on infrastructure performance and housekeeping. Basic SAS administration concepts are discussed, with references for blogs, communities, and the new visual learning sites.
Margaret Crevar, SAS
As Data Management professionals, you have to comply with new regulations and controls. One such regulation is Basel Committee on Banking Supervision (BCBS) 239. To respond to these new demands, you have to put processes and methods in place to automate metadata collection and analysis, and to provide rigorous documentation around your data flows. You also have to deal with many aspects of data management including data access, data manipulation (ETL and other), data quality, data usage, and data consumption, often from a variety of toolsets that are not necessarily from a single vendor. This paper shows you how to use SAS® technologies to support data governance requirements, including third party metadata collection and data monitoring. It highlights best practices such as implementing a business glossary and establishing controls for monitoring data. Attend this session to become familiar with the SAS tools used to meet the new requirements and to implement a more managed environment.
Jeff Stander, SAS
Motivated by the frequent need for equivalence tests in clinical trials, this presentation provides insights into tests for equivalence. We summarize and compare equivalence tests for different study designs, including one-parameter problems, designs with paired observations, and designs with multiple treatment arms. Power and sample size estimations are discussed. We also provide examples to implement the methods by using the TTEST, ANOVA, GLM, and MIXED procedures.
Fei Wang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
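A minimal sketch of an equivalence test of the kind surveyed above, using the TOST option in PROC TTEST on a hypothetical paired design; the data set, variable names, and the +/-0.2 equivalence bounds are illustrative, not from the paper.

/* Two one-sided tests (TOST) for equivalence of paired measurements. */
proc ttest data=work.pk dist=normal tost(-0.2, 0.2) alpha=0.05;
   paired test*reference;
run;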
The experimental item response theory procedure (PROC IRT), included in the recently released SAS/STAT® 13.1 and 13.2, enables item response modeling and trait estimation in SAS®. PROC IRT lets you perform item parameter calibration and latent trait estimation across a wide spectrum of educational and psychological research. This paper evaluates the performance of PROC IRT in item parameter recovery under various testing conditions. The pros and cons of PROC IRT versus BILOG-MG 3.0 are presented. For practitioners of IRT models, the development of IRT-related analysis in SAS is inspiring, and it offers a great choice to the growing population of IRT users. A shift to SAS can be beneficial because of several features of SAS: its flexibility in data management, its power in data analysis, its convenient output delivery, and its increasing richness in graphical presentation. It is critical to ensure the quality of item parameter calibration and trait estimation before you can continue with other components, such as test scoring, test form construction, IRT equating, and so on.
Yi-Fang Wu, ACT, Inc.
Scotiabank's Colombian division, Colpatria, is the national leader in credit cards, with more than 1,600,000 active cards--equivalent to a portfolio of approximately 700 million dollars. The behavior score is used to offer credit cards through a cross-sell process, which happens only after customers have completed six months on books with their first product at the bank. This is the minimum period required by the behavior Artificial Neural Network (ANN) model. The six-months-on-books internal policy assumes that the maturation of the client over this period is adequate, but this has never been proven. This research evaluates that hypothesis and calculates the appropriate time to offer cross-sales to new customers using Logistic Regression (Logit), while also segmenting these sales targets by their level of seniority using Discrete-Time Markov Chains (DTMC).
Oscar Javier Cortés Arrigui, Scotiabank - Colpatria
Miguel Angel Diaz Rodriguez, Scotiabank - Colpatria
With the big data throughputs generated by event streams, organizations can opportunistically respond with low-latency effectiveness. Having the ability to permeate identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the innovative ability to consolidate real-time data ingestion with controlled and disciplined universal data access from SAS® and Teradata.
Tho Nguyen, Teradata
Fiona McNeill, SAS
SAS® In-Memory Statistics uses a powerful interactive programming interface for analytics, aimed squarely at the data scientist. We show how the custom tasks that you can create in SAS® Studio (a web-based programming interface) can make everyone a data scientist! We explain the Common Task Model of SAS Studio, and we build a simple task in steps that carries out the basic functionality of the IMSTAT procedure. This task can then be shared amongst all users, empowering everyone on their journey to becoming a data scientist. During the presentation, it will become clear that not only can shareable tasks be created but the developer does not have to understand coding in Java, JavaScript, or ActionScript. We also use the task we created in the Visual Programming perspective in SAS Studio.
Stephen Ludlow, SAS
Every day, we are bombarded by pundits pushing big data as the cure for all research woes and heralding the death of traditional quantitative surveys. We are told how big data, social media, and text analytics will make the Likert scale go the way of the dinosaur. This presentation makes the case for evolving our surveys and data sets to embrace new technologies and modes while still welcoming new advances in big data and social listening. Examples from the global automotive industry are discussed to demonstrate the pros and cons of different types of data in an automotive environment.
Will Neafsey, Ford Motor Company
Does the rapidly changing fraud and compliance landscape make it difficult to adapt your applications to meet your current needs? Fraudsters are constantly evolving and your applications need to be able to keep up. No two businesses are ever the same. They all have different business drivers, customer needs, market demands, and strategic objectives. Using SAS® Visual Investigator, business units can quickly adapt to their ever-changing needs and deliver what end users and customers need to be effective investigators. Using the administrative tools provided, users can quickly and easily adapt the pages and data structures that underpin applications to fit their needs. This presentation walks through the process of updating an insurance fraud detection system using SAS Visual Investigator, including changing the way alerts are routed to users, pulling in additional information, and building out application pages. It includes an example of how an end user can customize a solution due to changing fraud threats.
Gordon Robinson, SAS
Generalized linear models (GLMs) are commonly used to model rating factors in insurance pricing. The integration of territory rating and geospatial variables poses a unique challenge to the traditional GLM approach. Generalized additive models (GAMs) offer a flexible alternative based on GLM principles with a relaxation of the linear assumption. We explore two approaches for incorporating geospatial data in a pricing model using a GAM-based framework. The ability to incorporate new geospatial data and improve traditional approaches to territory ratemaking results in further market segmentation and a better match of price to risk. Our discussion highlights the use of the high-performance GAMPL procedure, which is new in SAS/STAT® 14.1 software. With PROC GAMPL, we can incorporate the geographic effects of geospatial variables on target loss outcomes. We illustrate two approaches. In our first approach, we begin by modeling the predictors as regressors in a GLM, and subsequently model the residuals as part of a GAM based on location coordinates. In our second approach, we model all inputs as covariates within a single GAM. Our discussion compares the two approaches and demonstrates visualization of model outputs.
Carol Frigo, State Farm
Kelsey Osterloo, State Farm Insurance
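A minimal sketch of the second approach described above (a single GAM with parametric and spatial spline terms), assuming a hypothetical policy-level data set with latitude and longitude; the distribution choice and variable names are illustrative, not State Farm's model.

/* Generalized additive model with a spatial spline over location.
   WORK.POLICIES and all variable names are hypothetical. */
proc gampl data=work.policies;
   model loss_cost = param(driver_age vehicle_value)
                     spline(latitude longitude) / dist=gamma;
run;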
The use of logistic models for independent binary data has relied first on asymptotic theory and later on exact distributions for small samples, as discussed by Troxler, Lalonde, and Wilson (2011). While the use of logistic models for dependent analysis based on exact analyses is not common, it is usually presented in the case of one-stage clustering. We present a SAS® macro that allows the testing of hypotheses using exact methods in the case of one-stage and two-stage clustering for small samples. The accuracy of the method and the results are compared to results obtained using an R program.
Kyle Irimata, Arizona State University
Jeffrey Wilson, Arizona State University
SAS® Embedded Process offers a flexible, efficient way to leverage increasing amounts of data by injecting the processing power of SAS® directly where the data lives. SAS Embedded Process can tap into the massively parallel processing (MPP) architecture of Hadoop for scalable performance. Using SAS® In-Database Technologies for Hadoop, you can run scoring models generated by SAS® Enterprise Miner™ or, with SAS® In-Database Code Accelerator for Hadoop, user-written DS2 programs in parallel. With SAS Embedded Process on Hadoop you can also perform data quality operations, and extract and transform data using SAS® Data Loader. This paper explores key SAS technologies that run inside the Hadoop parallel processing framework and prepares you to get started with them.
David Ghazaleh, SAS
Injury severity describes the severity of the injury to the person involved in the crash. Understanding the factors that influence injury severity can be helpful in designing mechanisms to reduce accident fatalities. In this research, we model and analyze the data as a hierarchy with three levels to answer the question of which road-, vehicle-, and driver-related factors influence injury severity. We used hierarchical linear modeling (HLM) to analyze nested data from the Fatality Analysis Reporting System (FARS). The results show that driver-related factors are directly related to injury severity. On the other hand, road conditions and vehicle characteristics have a significant moderating effect on injury severity. We believe that our study has important policy implications for designing customized mechanisms, specific to each hierarchical level, to reduce the occurrence of fatal accidents.
The Armed Conflict Location & Event Data Project (ACLED) is a comprehensive public collection of conflict data for developing states. As such, the data is instrumental in informing humanitarian and development work in crisis and conflict-affected regions. The ACLED project currently manually codes for eight types of events from a summarized event description, which is a time-consuming process. In addition, when determining the root causes of the conflict across developing states, these fixed event types are limiting. How can researchers get to a deeper level of analysis? This paper showcases a repeatable combination of exploratory and classification-based text analytics provided by SAS® Contextual Analysis, applied to the publicly available ACLED for African states. We depict the steps necessary to determine (using semi-automated processes) the next deeper level of themes associated with each of the coded event types; for example, while there is a broad protests and riots event type, how many of those are associated individually with rallies, demonstrations, marches, strikes, and gatherings? We also explore how to use predictive models to highlight the areas that require aid next. We prepare the data for use in SAS® Visual Analytics and SAS® Visual Statistics using SAS® DS2 code and enable dynamic explorations of the new event sub-types. This ultimately provides the end-user analyst with additional layers of data that can be used to detect trends and deploy humanitarian aid where limited resources are needed the most.
Tom Sabo, SAS
Recent years have seen the birth of a powerful tool for companies and scientists: the Google Ngram data set, built from millions of digitized books. It can be, and has been, used to learn about past and present trends in the use of words over the years. This is an invaluable asset from a business perspective, mostly because of its potential application in marketing. The choice of words has a major impact on the success of a marketing campaign and an analysis of the Google Ngram data set can validate or even suggest the choice of certain words. It can also be used to predict the next buzzwords in order to improve marketing on social media or to help measure the success of previous campaigns. The Google Ngram data set is a gift for scientists and companies, but it has to be used with a lot of care. False conclusions can easily be drawn from straightforward analysis of the data. It contains only a limited number of variables, which makes it difficult to extract valuable information from it. Through a detailed example, this paper shows that it is essential to account for the disparity in the genre of the books used to construct the data set. This paper argues that for the years after 1950, the data set has been constructed using a much higher proportion of scientific books than for the years before. An ingenious method is developed to approximate, for each year, this unknown proportion of books coming from the scientific literature. A statistical model accounting for that change in proportion is then presented. This model is used to analyze the trend in the use of common words of the scientific literature in the 20th century. Results suggest that a naive analysis of the trends in the data can be misleading.
Aurélien Nicosia, Université Laval
Thierry Duchesne, Universite Laval
Samuel Perreault, Université Laval
Do you create complex reports using PROC REPORT? Are you confused by the COMPUTE BLOCK feature of PROC REPORT? Are you even aware of it? Maybe you already produce reports using PROC REPORT, but suddenly your boss needs you to modify some of the values in one or more of the columns. Maybe your boss needs to see the values of some rows in boldface and others highlighted in a stylish yellow. Perhaps one of the columns in the report needs to display a variety of fashionable formats (some with varying decimal places and some without any decimals). Maybe the customer needs to see a footnote in specific cells of the report. Well, if this sounds familiar, then come take a look at the COMPUTE BLOCK of PROC REPORT. This paper shows a few tips and tricks for using the COMPUTE block with conditional IF/THEN logic to make your reports stylish and fashionable. The COMPUTE BLOCK allows you to use DATA step code within PROC REPORT to provide customization and style to your reports. We'll see how the Census Bureau produces a stylish demographic profile for customers of its Special Census program using PROC REPORT with the COMPUTE BLOCK. The paper focuses on how to use the COMPUTE BLOCK to create this stylish Special Census profile. The paper shows quick tips and simple code for handling multiple formats within the same column, making the values in the Total rows boldface, applying trafficlighting, and adding footnotes to any cell based on the column or row. The Special Census profile report is an Excel table created with ODS tagsets.ExcelXP that is stylish and fashionable, thanks in part to the COMPUTE BLOCK.
Chris Boniface, Census Bureau
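A minimal sketch of the COMPUTE-block techniques described above, using a hypothetical summary data set; column names, styles, and thresholds are illustrative and are not the Census Bureau's actual profile code.

/* PROC REPORT with COMPUTE blocks: trafficlighting and boldface totals.
   WORK.PROFILE and its columns are hypothetical. */
proc report data=work.profile nowd;
   column characteristic category count pct;
   define characteristic / group 'Characteristic';
   define category       / display 'Category';
   define count          / analysis sum format=comma12. 'Count';
   define pct            / analysis mean format=percent8.1 'Percent';

   compute pct;
      /* highlight low percentages in yellow */
      if pct.mean < 0.05 then
         call define(_col_, 'style', 'style={backgroundcolor=yellow}');
   endcomp;

   compute category;
      /* boldface the Total rows */
      if category = 'Total' then
         call define(_row_, 'style', 'style={fontweight=bold}');
   endcomp;
run;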
Road safety is a major concern for all United States of America citizens. According to the National Highway Traffic Safety Administration, 30,000 deaths are caused by automobile accidents annually. Oftentimes fatalities occur due to a number of factors such as driver carelessness, speed of operation, impairment due to alcohol or drugs, and road environment. Some studies suggest that car crashes are solely due to driver factors, while other studies suggest car crashes are due to a combination of roadway and driver factors. However, other factors not mentioned in previous studies may be contributing to automobile accident fatalities. The objective of this project was to identify the significant factors that lead to multiple fatalities in the event of a car crash.
Bill Bentley, Value-Train
Gina Colaianni, Kennesaw State University
Cheryl Joneckis, Kennesaw State University
Sherry Ni, Kennesaw State University
Kennedy Onzere, Kennesaw State University
With millions of users and peak traffic of thousands of requests a second for complex user-specific data, fantasy football offers many data design challenges. Not only is there a high volume of data transfers, but the data is also dynamic and of diverse types. We need to process data originating on the stadium playing field and user devices and make it available to a variety of different services. The system must be nimble and must produce accurate and timely responses. This talk discusses the strategies employed by and lessons learned from one of the primary architects of the National Football League's fantasy football system. We explore general data design considerations with specific examples of high availability, data integrity, system performance, and some other random buzzwords. We review some of the common pitfalls facing large-scale databases and the systems using them. And we cover some of the tips and best practices to take your data-driven applications from fantasy to reality.
Clint Carpenter, Carpenter Programming
SAS® for Windows is extremely powerful software, not only for analyzing data, but also for organizing and maintaining output and permanent data sets. By using pipes and operating system (x) commands within a SAS session, you can easily and effectively manage files of all types stored on your local network.
Emily Sisson, Boston University School of Public Health Data Coordinating Center
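A minimal sketch of the pipe technique described above, assuming a Windows SAS session with operating system commands enabled (NOXCMD off); the directory path is hypothetical.

/* Read a directory listing through an unnamed pipe and then archive old logs. */
filename dirlist pipe 'dir /b "C:\projects\logs\*.log"';

data work.log_files;
   infile dirlist truncover;
   input filename $256.;
run;

/* Operating system command issued directly from the SAS session. */
x 'move "C:\projects\logs\*.log" "C:\projects\logs\archive\"';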
Logistic regression models are commonly used in direct marketing and consumer finance applications. In this context the paper discusses two topics about the fitting and evaluation of logistic regression models. Topic 1 is a comparison of two methods for finding multiple candidate models. The first method is the familiar best subsets approach. Best subsets is then compared to a proposed new method based on combining models produced by backward and forward selection plus the predictors considered by backward and forward. This second method uses the SAS® procedure HPLOGISTIC with selection of models by Schwarz Bayes criterion (SBC). Topic 2 is a discussion of model evaluation statistics to measure predictive accuracy and goodness-of-fit in support of the choice of a final model. Base SAS® and SAS/STAT® software are used.
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
U.S. stock exchanges (currently there are 12) are tracked in real time via the Consolidated Tape System (CTS) and the Consolidated Quotation System (CQS). CQS contains every updated quote (buyer's bid price and seller's offer price) from each exchange, covering some 8,500 stock tickers. This is the basis by which brokers can honor their obligation to investors, mandated by the U.S. Securities and Exchange Commission, to execute transactions at the best price, that is, at the National Best Bid and Offer (NBBO). With the advent of electronic exchanges and high-frequency trading (timestamps are published to the microsecond), data set size has become a major operational consideration for market researchers re-creating NBBO values (over 1 billion quotes requiring 80 gigabytes of storage for a normal trading day). This presentation demonstrates a straightforward use of hash tables for tracking constantly changing quotes for each ticker/exchange combination, in tandem with an efficient means of determining changes in NBBO with every new quote.
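A simplified sketch of the hash-table idea (the quote-stream data set and variable names are hypothetical): each incoming quote replaces the stored bid and offer for its ticker/exchange pair, so the table always holds the latest quote from every exchange.

   data _null_;
      if _n_ = 1 then do;
         declare hash quotes();                   /* latest quote per ticker/exchange        */
         quotes.defineKey('ticker', 'exchange');
         quotes.defineData('bid', 'offer');
         quotes.defineDone();
      end;
      set work.cqs_feed;                          /* hypothetical quote stream in time order */
      rc = quotes.replace();                      /* newest quote overwrites the prior one   */
      /* after each replace, the NBBO for this ticker can be recomputed by iterating
         over the stored exchanges with a hash iterator (not shown)                   */
   run;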
Mark Keintz, Wharton Research Data Services
In this era of data analytics, you are often faced with a challenge of joining data from multiple legacy systems. When the data systems share a consistent merge key, such as ID or SSN, the solution is straightforward. However, what do you do when there is no common merge key? If one data system has a character value ID field, another has an alphanumeric field, and the only common fields are the names or addresses or dates of birth, a standard merge query does not work. This paper demonstrates fuzzy matching methods that can overcome this obstacle and build your master record through Base SAS® coding. The paper also describes how to leverage the SAS® Data Quality Server in SAS® code.
Elena Shtern, SAS
Kim Hare, SAS
Hierarchical nonlinear mixed models are complex models that occur naturally in many fields. The NLMIXED procedure's ability to fit linear or nonlinear models with standard or general distributions enables you to fit a wide range of such models. SAS/STAT® 13.2 enhanced PROC NLMIXED to support multiple RANDOM statements, enabling you to fit nested multilevel mixed models. This paper uses an example to illustrate the new functionality.
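A hedged sketch of the multiple RANDOM statement syntax introduced in SAS/STAT 13.2, using a hypothetical data set of student scores nested within classes within schools (the paper's own example may differ):

   proc nlmixed data=work.scores;                         /* hypothetical data set        */
      parms b0=50 b1=2 s2school=1 s2class=1 s2e=10;
      mu = b0 + b1*hours + u_school + u_class;
      model score ~ normal(mu, s2e);
      random u_school ~ normal(0, s2school) subject=school;          /* level 3           */
      random u_class  ~ normal(0, s2class)  subject=class(school);   /* level 2, nested   */
   run;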
Raghavendra Kurada, SAS
The popular MIXED, GLIMMIX, and NLMIXED procedures in SAS/STAT® software fit linear, generalized linear, and nonlinear mixed models, respectively. These procedures take the classical approach of maximizing the likelihood function to estimate model parameters. The flexible MCMC procedure in SAS/STAT can fit these same models by taking a Bayesian approach. Instead of maximizing the likelihood function, PROC MCMC draws samples (using a variety of sampling algorithms) to approximate the posterior distributions of model parameters. Similar to the mixed modeling procedures, PROC MCMC provides estimation, inference, and prediction. This paper describes how to use the MCMC procedure to fit Bayesian mixed models and compares the Bayesian approach to how the classical models would be fit with the familiar mixed modeling procedures. Several examples illustrate the approach in practice.
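As a rough sketch of the idea (the data set, variables, and priors are hypothetical, not the paper's examples), a random-intercept linear mixed model can be fit in PROC MCMC with a RANDOM statement in place of PROC MIXED's likelihood maximization:

   proc mcmc data=work.trial nmc=20000 seed=27513 outpost=post;   /* hypothetical data     */
      parms beta0 0 beta1 0;
      parms s2u 1 s2e 1;
      prior beta: ~ normal(0, var=1e6);                /* diffuse priors on fixed effects   */
      prior s2u s2e ~ igamma(0.01, scale=0.01);        /* priors on variance components     */
      random u ~ normal(0, var=s2u) subject=center;    /* random center intercepts          */
      mu = beta0 + beta1*dose + u;
      model y ~ normal(mu, var=s2e);
   run;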
Gordon Brown, SAS
Fang K. Chen, SAS
Maura Stokes, SAS
Analytics is now pervasive in our society, informing millions of decisions every year, and even making decisions without human intervention in automated systems. The implicit assumption is that all completed analytical projects are accurate and successful, but the reality is that some are not. Bad decisions that result from faulty analytics can be costly, embarrassing and even dangerous. By understanding the root causes of analytical failures, we can minimize the chances of repeating the mistakes of the past.
We introduce age-period-cohort (APC) models, which analyze data in which performance is measured by age of an account, account open date, and performance date. We demonstrate this flexible technique with an example from a recent study that seeks to explain the root causes of the US mortgage crisis. In addition, we show how APC models can predict website usage, retail store sales, salesperson performance, and employee attrition. We even present an example in which APC was applied to a database of tree rings to reveal climate variation in the southwestern United States.
Joseph Breeden, Prescient Models
SAS® Forecast Server provides an excellent tool for forecasting time series across a variety of scenarios and industries. Given sufficient data, SAS Forecast Server can provide insights into seasonality, trend, and effects of input variables and events. In some business cases, there might not be sufficient data within individual series to produce these complex models. This paper presents an approach to forecasting these short time series. Using sufficiently long time series as a training data set, forecasted shape and volumes are created through clustering and attribute-based regressions, allowing for new series to have forecasts based on their characteristics. These forecasts are passed into SAS Forecast Server through preprocessing. Production of artificial history and input variables, matched with a custom model addition to the standard model repository, allows for future values of the artificial variable to be realized as the forecast. This process not only relieves the need for manual entry of overrides, but also allows SAS Forecast Server to choose alternative models as additional observations are provided through incremental loads.
Cobey Abramowski, SAS
Christian Haxholdt, SAS Institute
Chris Houck, SAS Institute
Benjamin Perryman, SAS Institute
The importance of understanding terrorism in the United States assumed heightened prominence in the wake of the coordinated attacks of September 11, 2001. Yet, surprisingly little is known about the frequency of attacks that might happen in the future and the factors that lead to a successful terrorist attack. This research is aimed at forecasting the frequency of attacks per annum in the United States and determining the factors that contribute to an attack's success. Using the data acquired from the Global Terrorism Database (GTD), we tested our hypothesis by examining the frequency of attacks per annum in the United States from 1972 to 2014 using SAS® Enterprise Miner™ and SAS® Forecast Studio. The data set has 2,683 observations. Our forecasting model predicts that there could be as many as 16 attacks every year for the next 4 years. From our study of factors that contribute to the success of a terrorist attack, we discovered that attack type, weapons used in the attack, place of attack, and type of target play pivotal roles in determining the success of a terrorist attack. Results reveal that the government might be successful in averting assassination attempts but might fail to prevent armed assaults and facilities or infrastructure attacks. Therefore, additional security might be required for important facilities in order to prevent further attacks from being successful. Results further reveal that it is possible to reduce the forecasted number of attacks by raising defense spending and by putting an end to the raging war in the Middle East. We are currently working on gathering data to do in-depth analysis at the monthly and quarterly level.
SriHarsha Ramineni, Oklahoma State University
Koteswara Rao Sriram, Oklahoma State University
Collection of customer data is often done in status tables or snapshots, where, for example, for each month, the values for a handful of variables are recorded in a new status table whose name is marked with the value of the month. In this QuickTip, we present how to construct a table of last occurrence times for customers using a DATA step merge of such status tables and the colon (':') wildcard. If the status tables are sorted, this can be accomplished in four lines of code (where RUN; is the fourth). Also, we look at how to construct delta tables (for example, from one time period to another, or which customers have arrived or left) using a similar method of merge and colons.
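A minimal sketch of the four-line construction, assuming the monthly status tables share a common STATUS_ name prefix, a common layout, and are sorted by the customer key:

   data last_occurrence;
      merge status_: ;      /* every data set whose name begins with STATUS_                       */
      by customer_id;       /* later tables overwrite earlier values, leaving the last occurrence  */
   run;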
Jingyu She, Danica Pension
In the fast-changing world of data management, new technologies to handle increasing volumes of (big) data are emerging every day. Many companies are struggling in dealing with these technologies and more importantly in integrating them into their existing data management processes. Moreover, they also want to rely on the knowledge built by their teams in existing products. Implementing change to learn new technologies can therefore be a costly procedure. SAS® is a perfect fit in this situation by offering a suite of software that can be set up to work with any third-party database through using the corresponding SAS/ACCESS® interface. Indeed, for new database technologies, SAS is releasing a specific SAS/ACCESS interface, allowing users to develop and migrate SAS solutions almost transparently. Your users need to know only a few techniques in order to combine the power of SAS with a third-party (big) database. This paper will help companies with the integration of rising technologies in their current SAS® Data Management platform using SAS/ACCESS. More specifically, the paper focuses on best practices for success in such an implementation. Integrating SAS Data Management with the IBM PureData for Analytics Appliance is provided as an example.
Magali Thesias, Deloitte
Oil and gas wells encounter many types of adverse events, including unexpected shut-ins, scaling, rod pump failures, breakthrough, and fluid influx, to name just a few common ones. This paper presents a real-time event detection system that presents visual reports and alerts petroleum engineers when an event is detected in a well. This system uses advanced time series analysis on both high- and low-frequency surface and downhole measurements. This system gives the petroleum engineer the ability to switch from passive surveillance of events, which then require remediation, to an active surveillance solution, which enables the engineer to optimize well intervention strategies. Events can occur simultaneously or rapidly, or can be masked over long periods of time. We tested the proposed method on data received from both simulated and publicly available data to demonstrate how a multitude of well events can be detected by using multiple models deployed into the data stream. In our demonstration, we show how our real-time system can detect a mud motor pressure failure resulting from a latent fluid overpressure event. Giving the driller advanced warning of the motor state and the potential to fail can save operators millions of dollars a year by reducing nonproductive time caused by these failures. Reducing nonproductive time is critical to reducing the operating costs of constructing a well and to improving cash flow by shortening the time to first oil.
Milad Falahi, SAS
Gilbert Hernandez, SAS Institute
When attempting to match names and addresses from different files, we often run into a situation where the names are similar, but not exactly the same. Sometimes there are additional words in the names, sometimes there are different spellings, and sometimes the businesses have the same name but are located thousands of miles apart. The files that contain the names might have numeric keys that cannot be matched. Therefore, we need to use a process called fuzzy matching to match the names from different files. The SAS® function COMPGED, combined with SAS character-handling functions, provides a straightforward method of applying business rules and testing for similarity.
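A hedged sketch of the COMPGED approach (the input data set, name variables, and threshold are hypothetical): standardize the strings with character functions, compute the generalized edit distance, and apply a business-rule cutoff.

   data name_matches;
      set work.candidate_pairs;                      /* hypothetical pairs of names to compare            */
      clean_a = upcase(compress(name_a, ',.&'));     /* simple standardization with character functions   */
      clean_b = upcase(compress(name_b, ',.&'));
      distance = compged(clean_a, clean_b);          /* generalized edit distance                         */
      match_flag = (distance <= 100);                /* business-rule threshold, adjust to your data      */
   run;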
Stephen Sloan, Accenture
Daniel Hoicowitz, Accenture
SAS® Customer Intelligence 360 is the new cloud-based customer data gathering application that uses the Amazon Web Services Cloud as a hosting platform. Learn how you can instrument your mobile application to gather usage data from your users as well as provide targeted application content based on either customer behavior or marketing campaigns.
Jim Adams, SAS
Analysts find the standard linear regression and analysis-of-variance models to be extremely convenient and useful tools. The standard linear model equation form is observations = (sum of explanatory variables) + residual with the assumptions of normality and homogeneity of variance. However, these tools are unsuitable for non-normal response variables in general. Various transformations can be used to stabilize the variance, but these transformations are often ineffective because they fail to address the skewness problem. It can be complicated to transform the estimates back to their original scale and interpret the results of the analysis. At this point, we reach the limits of the standard linear model. This paper introduces generalized linear models (GzLM) using a systematic approach to adapting linear model methods to non-normal data. Why GzLM? Generalized linear models have greater power to identify model effects as statistically significant when the data are not normally distributed (Stroup, xvii). Unlike the standard linear model, the generalized linear model contains the distribution of the observations, the linear predictor or predictors, the variance function, and the link function. A few examples show how to build a GzLM for a variety of response variables that follow a Poisson, Negative Binomial, Exponential, or Gamma distribution. The SAS/STAT® GENMOD procedure is used to compute basic analyses.
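A minimal sketch of fitting one such GzLM with PROC GENMOD, here a Poisson regression with a log link on a hypothetical claim-count data set:

   proc genmod data=work.claims;                 /* hypothetical data set */
      class region;
      model claim_count = region age / dist=poisson link=log;
   run;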
Theresa Ngo, Warner Bros. Entertainment
Color is an important aspect of data visualization and provides an analyst with another tool for identifying data trends. But is the default option the best for every case? Default color scales may be familiar to us; however, they can have inherent flaws that skew our perception of the data. The impact of data visualizations can be considerably improved with just a little thought toward choosing the correct colors. Selecting an appropriate color map can be difficult, and you might decide that it is easier to generate your own custom color scale. After a brief introduction to the red, green, and blue (RGB) color space, we discuss the strengths and weaknesses of some widely used color scales, as well as what to keep in mind when designing your own. A simple technique is presented, detailing how you can use SAS® to generate a number of color scales and apply them to your data. Using just a few graphics procedures, you can transform almost any complex data into an easily digestible set of visuals. The techniques used in this discussion were developed and tested using SAS® Enterprise Guide® 5.1.
Jeff Grant, Bank of Montreal
Mahmoud Mamlouk, BMO Harris Bank
There are many methods to randomize participants in randomized control trials. If it is important to have approximately balanced groups throughout the course of the trial, simple randomization is not a suitable method. Perhaps the most common alternative method that provides balance is the blocked randomization method. A less well-known method called the treatment adaptive randomized design also achieves balance. This paper shows you how to generate an entire randomization sequence to randomize participants in a two-group clinical trial using the adaptive biased coin randomization design (ABCD), prior to recruiting any patients. Such a sequence could be used in a central randomization server. A unique feature of this method allows the user to determine the extent to which imbalance is permitted to occur throughout the sequence while retaining the probabilistic nature that is essential to randomization. Properties of sequences generated by the ABCD approach are compared to those generated by simple randomization, a variant of simple randomization that ensures balance at the end of the sequence, and by blocked randomization.
Gary Foster, St Joseph's Healthcare
SAS® programmers spend hours developing DATA step code to clean up their data. Save time and money and enhance your data quality results by leveraging the power of the SAS Quality Knowledge Base (QKB). Knowledge Engineers at SAS have developed advanced context-specific data quality operations and packaged them in callable objects in the QKB. In this session, a Senior Software Development Manager at SAS explains the QKB and shows how to invoke QKB operations in a DATA step and in various SAS® Data Management offerings.
Brian Rineer, SAS
SAS® Output Delivery System (ODS) Graphics started appearing in SAS® 9.2. When first starting to use these tools, the traditional SAS/GRAPH® software user might come upon some very significant challenges in learning the new way to do things. This is further complicated by the lack of simple demonstrations of capabilities. Most graphs in training materials and publications are rather complicated graphs that, while useful, are not good teaching examples. This paper contains many examples of very simple ways to get very simple things accomplished. Over 20 different graphs are developed using only a few lines of code each, using data from the SASHELP data sets. The use of the SGPLOT, SGPANEL, and SGSCATTER procedures is shown. In addition, the paper addresses those situations in which the user must alternatively use a combination of the TEMPLATE and SGRENDER procedures to accomplish the task at hand. Most importantly, the use of ODS Graphics Designer as a teaching tool and a generator of sample graphs and code is covered. The emphasis in this paper is the simplicity of the learning process. Users will be able to take the included code and run it immediately on their personal machines to achieve an instant sense of gratification.
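In the spirit of the paper's simple teaching examples, a complete ODS Graphics scatter plot from a SASHELP data set takes only a few lines:

   proc sgplot data=sashelp.class;
      scatter x=height y=weight / group=sex;   /* grouped scatter plot, one statement */
   run;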
Roger Muller, Data-To-Events, Inc.
As more of your work is performed in an off-premises cloud environment, understanding how to get the data you want to analyze and report on becomes important. In addition, working in a cloud environment like Amazon Web Services might be something that is new to you or to your organization when you use a product like SAS® Visual Analytics. This presentation discusses how to get the best use out of cloud resources, how to efficiently transport information between systems, and how to continue to leverage data in an on-premises database management system (DBMS) in your future cloud system.
Gary Mehler, SAS
Since its inception, SAS® Text Miner has used the singular value decomposition (SVD) to convert a term-document matrix to a representation that is crucial for building successful supervised and unsupervised models. In this presentation, using SAS® code and SAS Text Miner, we compare these models with those that are based on SVD representations of subcomponents of documents. These more granular SVD representations are also used to provide further insights into the collection. Examples featuring visualizations, discovery of term collocations, and near-duplicate subdocument detection are shown.
Russell Albright, SAS
James Cox, SAS Institute
Ning Jin, SAS Institute
If your organization already deploys one or more software solutions via Amazon Web Services (AWS), you know the value of the public cloud. AWS provides a scalable public cloud with a global footprint, allowing users access to enterprise software solutions anywhere at any time. Although SAS® began long before AWS was even imagined, many loyal organizations driven by SAS are moving their local SAS analytics into the public AWS cloud, alongside other software hosted by AWS. SAS® Solutions OnDemand has assisted organizations in this transition. In this paper, we describe how we extended our enterprise hosting business to AWS. We describe the open-source automation framework on which SAS Solutions OnDemand built our automation stack, which simplified the process of migrating a SAS implementation. We'll provide the technical details of our automation and network footprint, a discussion of the technologies we chose along the way, and a list of lessons learned.
Ethan Merrill, SAS
Bryan Harkola, SAS
In today's world of torrential data, plotting large amounts of data has its own challenges. The SGPLOT procedure in ODS Graphics offers a simple yet powerful tool for your graphing needs. In this paper we present some graphs that work better for large data sets. We create heat maps, box plots, and parallel coordinate plots that visualize large data. You can make your graphs resilient to your growing data with ODS Graphics!
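As small sketches of the kinds of plots discussed (binned heat maps and box plots scale better than scatter plots as row counts grow), using a SASHELP data set as a stand-in for genuinely large data; the HEATMAP statement assumes a recent SAS 9.4 release of PROC SGPLOT:

   proc sgplot data=sashelp.heart;
      heatmap x=height y=weight;               /* binned heat map instead of an overplotted scatter */
   run;

   proc sgplot data=sashelp.heart;
      vbox cholesterol / category=bp_status;   /* box plots summarize large groups compactly        */
   run;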
Prashant Hebbar, SAS Institute Inc
Lingxiao Li, SAS
Sanjay Matange, SAS
There are many ways to customize data presentation and graphics for an area as specialized as oncology. This paper touches on various approaches to presenting oncological data and on ways that each of the graphics can be customized and individualized based on the needs of a study. The topics range from using SGPLOTs and SGPANEL, which are a simple but visually effective way to present stratified information, to using a complex integration of a series of HIGHLOW plots to depict duration of patient responses on a clinical trial. We also take a detailed look at how to generate specialized and customized waterfall plots used to visualize tumor growth patterns, as well as linear graphics that display a different approach to viewing the same endpoint. In addition to specifying the syntax to graph the data, instruction will be given on how to construct data sets in order to achieve the desired graphics.
Nora Ruel, City of Hope
Project management is a hot topic across many industries, and multiple commercial software applications for managing projects are available. The reality, however, is that the majority of project management software is not practical for daily use. SAS® has a solution for this issue that can be used for managing projects graphically in real time. This paper introduces a new paradigm for project management using the SAS® Graph Template Language (GTL). SAS clients, in real time, can use GTL to visualize resource assignments, task plans, delivery tracking, and project status across multiple project levels for more efficient project management.
Zhouming(Victor) Sun, Medimmune
SAS® programs can be very I/O intensive. SAS data sets with inappropriate variable attributes can degrade the performance of SAS programs. Using SAS compression offers some relief but does not eliminate the issues caused by inappropriately defined SAS variables. This paper examines the problems that inappropriate SAS variable attributes can cause and introduces a macro to tackle the problem of minimizing the footprint of a SAS data set.
Brian Varney, Experis BI & Analytics Practice
Would you like to be more confident in producing graphs and figures? Do you understand the differences between the OVERLAY, GRIDDED, LATTICE, DATAPANEL, and DATALATTICE layouts? Would you like to know how to easily create life sciences industry standard graphs such as adverse event timelines, Kaplan-Meier plots, and waterfall plots? Finally, would you like to learn all these methods in a relaxed environment that fosters questions? Great--this topic is for you! In this hands-on workshop, you will be guided through the Graph Template Language (GTL). You will also complete fun and challenging SAS graphics exercises to enable you to more easily retain what you have learned. This session is structured so that you will learn how to create the standard plots that your manager requests, how to easily create simple ad hoc plots for your customers, and also how to create complex graphics. You will be shown different methods to annotate your plots, including how to add Unicode characters to your plots. You will find out how to create reusable templates, which can be used by your team. Years of information have been carefully condensed into this 90-minute hands-on, highly interactive session. Feel free to bring some of your challenging graphical questions along!
Kriss Harris, SAS Specialists Ltd
Annotating a blank Case Report Form (blankcrf.pdf, which is required for a new drug application submission to the US Food and Drug Administration) by hand is an arduous process that can take many hours of precious time. Once again SAS® comes to the rescue! You can have SAS use the Study Data Tabulation Model (SDTM) specifications data to create all the necessary annotations and place them on the appropriate pages of the blankcrf.pdf. In addition you can dynamically color and adjust the font size of these annotations. This approach uses SAS, Adobe Acrobat's forms definition format (FDF) language, and Adobe Reader to complete the process. In this paper I go through each of the steps needed and explain in detail exactly how to accomplish each task.
Steven Black, Agility Clinical
SAS® provides some powerful, flexible tools for creating reports, like the REPORT and TABULATE procedures. With the advent of the Output Delivery System (ODS), you have almost total control over how the output from those procedures looks. But there are still times where you need (or want) just a little more and that's where the Report Writing Interface (RWI) can help. The Report Writing Interface is just a fancy way of saying you're using the ODSOUT object in a DATA step. This object allows you to lay out the page, create tables, embed images, add titles and footnotes, and more--all from within a DATA step, using whatever DATA step logic you need. Also, all the style capabilities of ODS are available to you so that the output created by your DATA step can have fonts, sizes, colors, backgrounds, and borders to make your report look just like you want. This presentation quickly covers some of the basics of using the ODSOUT object and then walks through some of the techniques to create four real-world examples. Who knows, you might even go home and replace some of your PROC REPORT code--I know I have!
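A minimal sketch of the ODSOUT object in a DATA step (the table content is made up, and an open ODS destination is assumed); the method names follow the Report Writing Interface documentation:

   ods html file='rwi_demo.html';

   data _null_;
      declare odsout ob();                      /* the Report Writing Interface object */
      ob.table_start();
      ob.row_start();
      ob.format_cell(data: 'Region');
      ob.format_cell(data: 'Sales');
      ob.row_end();
      ob.row_start();
      ob.format_cell(data: 'East');
      ob.format_cell(data: 15000, format: 'dollar12.');
      ob.row_end();
      ob.table_end();
   run;

   ods html close;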
Pete Lund, Looking Glass Analytics
Since Atul Gawande popularized the term in describing the work of Dr. Jeffrey Brenner in a New Yorker article, hot-spotting has been used in health care to describe the process of identifying super-utilizers of health care services, then defining intervention programs to coordinate and improve their care. According to Brenner's data from Camden, New Jersey, 1% of patients generate 30% of payments to hospitals, while 5% of patients generate 50% of payments. Analyzing administrative health care claims data, which contains information about diagnoses, treatments, costs, charges, and patient sociodemographic data, can be a useful way to identify super-users, as well as those who may be receiving inappropriate care. Both groups can be targeted for care management interventions. In this paper, techniques for patient outlier identification and prioritization are discussed using examples from private commercial and public health insurance claims data. The paper also describes techniques used with health care claims data to identify high-risk, high-cost patients and to generate analyses that can be used to prioritize patients for various interventions to improve their health.
Paul LaBrec, 3M Health Information Systems
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you are getting oriented and will provide short, straightforward answers. We also share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS® Customer Experience Testing shares insiders' practical tips to ensure that you are ready to begin your transition to SAS 9.4 and guidance for after you are at SAS 9.4. The target audience for this paper is primarily system administrators who will be installing, configuring, or administering the SAS 9.4 environment. This paper was first published in 2012; it has been revised each year since with new information.
Lisa Cook, SAS
Edith Jeffries, SAS
Cindy Taylor, SAS
SAS® Federated Query Language (FedSQL) is a SAS proprietary implementation of the ANSI SQL:1999 core standard capable of providing a scalable, threaded, high-performance way to access relational data in multiple data sources. This paper explores the FedSQL language in a three-step approach. First, we introduce the key features of the language and discuss the value each feature provides. Next, we present the FEDSQL procedure (the Base SAS® implementation of the language) and compare it to PROC SQL in terms of both syntax and performance. Finally, we examine how FedSQL queries can be embedded in DS2 programs to merge the power of these two high-performance languages.
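A small sketch of the Base SAS implementation, querying a SASHELP table much as PROC SQL would; because FedSQL follows the ANSI standard, some PROC SQL conveniences differ:

   proc fedsql;
      create table work.efficient as
      select make, model, mpg_city
      from sashelp.cars
      where mpg_city > 30;
   quit;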
Shaun Kaufmann, Farm Credit Canada
You can use annotation, modify templates, and change dynamic variables to customize graphs in SAS®. Standard graph customization methods include template modification (which most people use to modify graphs that analytical procedures produce) and SG annotation (which most people use to modify graphs that procedures such as PROC SGPLOT produce). However, you can also use SG annotation to modify graphs that analytical procedures produce. You begin by using an analytical procedure, ODS Graphics, and the ODS OUTPUT statement to capture the data that go into the graph. You use the ODS document to capture the values that the procedure sets for the dynamic variables, which control many of the details of how the graph is created. You can modify the values of the dynamic variables, and you can modify graph and style templates. Then you can use PROC SGRENDER along with the ODS output data set, the captured or modified dynamic variables, the modified templates, and SG annotation to create highly customized graphs. This paper shows you how and provides examples.
Warren Kuhfeld, SAS
Electric load forecasting is a complex problem that is linked with social activity considerations and variations in weather and climate. Furthermore, electric load is one of only a few goods that require the demand and supply to be balanced at almost real time (that is, there is almost no inventory). As a result, the utility industry could be much more sensitive to forecast error than many other industries. This electric load forecasting problem is even more challenging for holidays, which have limited historical data. Because of the limited holiday data, the forecast error for holidays is higher, on average, than it is for regular days. Identifying and using days in the history that are similar to holidays to help model the demand during holidays is not new for holiday demand forecasting in many industries. However, the electric demand in the utility industry is strongly affected by the interaction of weather conditions and social activities, making the problem even more dynamic. This paper describes an investigation into the various technologies that are used to identify days that are similar to holidays and to use those days for holiday demand forecasting in the utility industry.
Rain Xie, SAS
Alex Chien, SAS Institute
Contemporary data-collection processes usually involve recording information about the geographic location of each observation. This geospatial information provides modelers with opportunities to examine how the interaction of observations affects the outcome of interest. For example, it is likely that car sales from one auto dealership might depend on sales from a nearby dealership either because the two dealerships compete for the same customers or because of some form of unobserved heterogeneity common to both dealerships. Knowledge of the size and magnitude of the positive or negative spillover effect is important for creating pricing or promotional policies. This paper describes how geospatial methods are implemented in SAS/ETS® and illustrates some ways you can incorporate spatial data into your modeling toolkit.
Guohui Wu, SAS
Jan Chvosta, SAS
SAS® Enterprise Guide® is an extremely valuable tool for programmers, but it should also be leveraged by managers and executives to do data exploration, get information on the fly, and take advantage of the powerful analytics and reporting that SAS® has to offer. This can all be done without learning to program. This paper gives an overview of how SAS Enterprise Guide can improve the process of turning real-time data into real-time business decisions by managers.
Steven First, Systems Seminar Consultants, Inc.
As SAS® programmers we often want our code or program logic to be driven by the data at hand, rather than be directed by us. Such dynamic code enables the data to drive the logic or sequence of execution. This type of programming can be very useful when creating lists of observations, variables, or data sets from ever-changing data. Whether these lists are used to verify the data at hand or are to be used in later steps of the program, dynamic code can write our lists once and ensure that the values change in tandem with our data. This Quick Tip paper will present the concepts of creating data-driven lists for observations, variables, and data sets, the code needed to execute these tasks, and examples to better explain the process and results of the programs we will create.
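One common way to build such a data-driven list, sketched here with DICTIONARY.COLUMNS and a SASHELP table, is PROC SQL's INTO clause, which writes the list to a macro variable that later steps can reference:

   proc sql noprint;
      select name
         into :numvars separated by ' '
         from dictionary.columns
         where libname = 'SASHELP'
           and memname = 'CLASS'
           and type    = 'num';          /* all numeric variables in SASHELP.CLASS */
   quit;

   %put NOTE: Numeric variables found: &numvars;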
Kate Burnett-Isaacs, Statistics Canada
SAS® Event Stream Processing is designed to analyze and process large volumes of streaming data in motion. SAS Event Stream Processing provides a browser-based user interface that enables you to create and test event stream processing models in a visual drag-and-drop environment. This environment delivers a highly interactive and intuitive user experience. This paper describes the visual, interactive interface for building models and monitoring event stream activity. It also provides examples to demonstrate how you can easily build a model using the graphical user interface of SAS Event Stream Processing. In these examples, SAS Event Stream Processing serves as the front end to process high-velocity streams. On the back end, SAS® Real-Time Decision Manager consumes events and makes the final decision to push the suited offer to the customer. This paper explains the concepts of windows, retention, edges, and connectors. It also explains how SAS Event Stream Processing integrates with SAS Real-Time Decision Manager.
Lei Xiao, SAS
Fang Meng, SAS
SAS® Data Management is not a dating application. However, as a data analyst, you do strive to find the best matches for your data. Similar to a dating application, when trying to find matches for your data, you need to specify the criteria that constitutes a suitable match. You want to strike a balance between being too stringent with your criteria and under-matching your data and being too loose with your criteria and over-matching your data. This paper highlights various SAS Data Management matching techniques that you can use to strike the right balance and help your data find its perfect match, and as a result improve your data for reporting and analytics purposes.
Mary Kathryn Queen, SAS
The SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine are data formats from SAS® that support the creation of analytical base tables with tens of thousands of columns. These analytical base tables are used to support daily predictive analytical routines. Traditionally Storage Area Network (SAN) storage has been and continues to be the primary storage platform for the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine formats. Due to cost constraints associated with SAN storage, companies have added Hadoop to their environments to help minimize storage costs. In this paper we explore how the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine leverage the Hadoop Distributed File System.
Steven Sober, SAS
Today's SAS® environment has large numbers of concurrent SAS processes that have to process ever-growing data volumes. To help SAS users remain productive, SAS administrators must ensure that SAS applications have sufficient computer resources that are properly configured and monitored often. Understanding how all of the components of SAS work and how they are used by your users is the first step. The guidance offered in this paper helps SAS administrators evaluate hardware, operating system, and infrastructure options for a SAS environment that will keep their SAS applications running at optimal performance and keep their user community happy.
Margaret Crevar, SAS
Most SAS® products are offered as web-based applications these days. Even though a web-based interface provides unmatched accessibility, it comes with known security vulnerabilities. This paper examines the most common exploitation scenarios and suggests the ways to make web applications more secure. The top ten focus areas include protection of user credentials, use of stronger authentication methods, implementation of SSL for all communications between client and server, understanding of attacking mechanism, penetration testing, adoption and integration with third-party security packages, encryption of any sensitive data, security logging and auditing, mobile device access management, and prevention of threats from inside.
Heesun Park, SAS
In SAS® LASR™ Analytic Server, data can reside in three types of environments: client hard disk (for example, a laptop), the Hadoop Distributed File System (HDFS), and the memory of the SAS LASR Analytic Server. Moving the data efficiently among these environments is critical for getting insights from the data on time. In this paper, we illustrate all the possible ways to move the data, including 1) moving data from client hard disk to HDFS; 2) moving data from HDFS to client hard disk; 3) moving data from HDFS to SAS LASR Analytic Server; 4) moving data from SAS LASR Analytic Server to HDFS; 5) moving data from client hard disk to SAS LASR Analytic Server; and 6) moving data from SAS LASR Analytic Server to client hard disk.
Yue Qi, SAS
Managing large data sets comes with the task of providing a certain level of quality assurance, no matter what the data is used for. We present here the fundamental SAS® procedures to perform when determining the completeness of a data set. Even though each data set is unique and has its own variables that need more examination in detail, it is important to first examine the size, time, range, interactions, and purity (STRIP) of a data set to determine its readiness for any use. This paper covers first steps you should always take, regardless of whether you're dealing with health, financial, demographic, or environmental data.
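A hedged sketch of the kind of first-pass checks the STRIP idea suggests (the data set and variable names are hypothetical):

   proc contents data=mylib.visits;                 /* size: observations, variables, attributes           */
   run;

   proc means data=mylib.visits n nmiss min max;    /* time and range of numeric fields, plus missingness  */
   run;

   proc freq data=mylib.visits;
      tables enroll_status / missing;               /* purity of a key categorical field                   */
   run;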
Michael Santema, Kaiser Permanente
Fagen Xie, Kaiser Permanente
This paper provides tips and techniques to speed up the validation process without and with automation. For validation without automation, it introduces both standard use and clever use of options and statements to be implemented in the COMPARE procedure that can speed up the validation process. For validation with automation, a macro named %QCDATA is introduced for individual data set validation, and a macro named %QCDIR is introduced for comparison of data sets in two different directories. Also introduced in this section is a definition of &SYSINFO and an explanation of how it can be of use to interpret the result of the comparison.
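A minimal sketch of the &SYSINFO idea, not the paper's %QCDATA or %QCDIR macros (the data set names are hypothetical): PROC COMPARE sets &SYSINFO to 0 only when the two data sets match exactly, so a wrapper macro can report Pass/Fail without anyone reading the listing.

   %macro qc_compare(base=, comp=);
      proc compare base=&base compare=&comp listall;
      run;
      %if &sysinfo = 0 %then %put NOTE: &base and &comp match exactly (Pass).;
      %else %put WARNING: &base and &comp differ, PROC COMPARE return code is &sysinfo (Fail).;
   %mend qc_compare;

   %qc_compare(base=prod.ae, comp=qc.ae)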
Alice Cheng, Portola Pharmaceuticals
Justina Flavin, Independent Consultant
Michael Wise, Experis BI & Analytics Practice
Perl regular expressions are well suited to working with text, which in SAS® means character variables. Perl was widely used in website programming in the early days of the World Wide Web, and SAS provides a set of Perl regular expression (PRX) functions that can be used in DATA step statements. They are useful for finding text patterns, or simply strings of text, in a character variable and returning either the matching string or its position. I often use them to locate a string of text in a character variable; knowing that location tells me where to start a SUBSTR function. The poster covers a number of the PRX functions available in SAS® 9.2 and later. Each function is explained in written text, followed by a brief code demonstration showing how I use it. The examples are explained with reference to both the SAS documentation and my own best practices.
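A short sketch of the pattern-location use described above, with a hypothetical input data set and pattern: PRXPARSE compiles the expression once, and PRXMATCH returns the starting position that SUBSTR needs.

   data found_ids;
      set work.raw_notes;                              /* hypothetical free-text data        */
      if _n_ = 1 then rx = prxparse('/ID-\d{4}/');     /* compile the pattern once           */
      retain rx;
      pos = prxmatch(rx, note_text);                   /* 0 if no match, else start position */
      if pos > 0 then id_code = substr(note_text, pos, 7);
   run;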
Peter Timusk, Independent
Have you ever wondered how to get the most from Web 2.0 technologies in order to visualize SAS® data? How to make those graphs dynamic, so that users can explore the data in a controlled way, without needing prior knowledge of SAS products or data science? Wonder no more! In this session, you learn how to turn basic sashelp.stocks data into a snazzy HighCharts stock chart in which a user can review any time period, zoom in and out, and export the graph as an image. All of these features with only two DATA steps and one SORT procedure, for 57 lines of SAS code.
Vasilij Nevlev, Analytium Ltd
Your organization already uses SAS® Visual Analytics and you have designed reports for internal use. Now you want to publish a report on your external website. How do you design a report for the general public considering the wide range of education and abilities? This paper defines best practices for designing reports that are universally accessible to the broadest audience. You learn tips and techniques for designing reports that the general public can easily understand and use to gain insight. You also learn how to leverage features that help you comply with your legal obligations regarding users with disabilities. The paper includes recommendations and examples that you can apply to your own reports.
Jesse Sookne, SAS
Julianna Langston, SAS Institute
Karen Mobley, SAS
Ed Summers, SAS
Although hashing methods such as SHA256 and MD5 are already available in SAS®, other methods may be needed by SAS users. This presentation shows how such hashing methods can be implemented using SAS DATA steps and macros, with judicious use of the bitwise functions (BAND, BOR, and so on) and by referencing public domain sources.
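As a hedged illustration of the bitwise building blocks involved (not a complete hash algorithm), a 32-bit left rotation, which many hash functions use internally, can be written with BAND, BOR, BLSHIFT, and BRSHIFT; masking before the left shift keeps the intermediate value within 32 bits:

   data _null_;
      x = 305419896;                                  /* 0x12345678                          */
      n = 8;                                          /* rotate amount in bits               */
      lo = blshift(band(x, 2**(32-n) - 1), n);        /* low bits, masked first, shifted up  */
      hi = brshift(x, 32 - n);                        /* high bits wrap to the bottom        */
      rotl = bor(lo, hi);
      put rotl= hex8.;                                /* expect 34567812                     */
   run;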
Rick Langston, SAS
In the aftermath of the 2008 global financial crisis, banks had to improve their data risk aggregation in order to effectively identify and manage their credit exposures and credit risk, create early warning signs, and improve the ability of risk managers to challenge the business and independently assess and address evolving changes in credit risk. My presentation focuses on using SAS® Credit Risk Dashboard to achieve all of the above. Clearly, you can use my method and principles of building a credit risk dashboard to build other dashboards for other types of risks as well (market, operational, liquidity, compliance, reputation, etc.). In addition, because every bank must integrate the various risks with a holistic view, each of the risk dashboards can be the foundation for building an effective enterprise risk management (ERM) dashboard that takes into account correlation of risks, risk tolerance, risk appetite, breaches of limits, capital allocation, risk-adjusted return on capital (RAROC), and so on. This will support the actions of top management so that the bank can meet shareholder expectations in the long term.
Boaz Galinson, leumi
In the pharmaceutical industry, generating summary tables or listings in PDF format is becoming increasingly popular. You can create various PDF output files with different styles by combining the ODS PDF statement and the REPORT procedure in SAS® software. In order to validate those outputs, you need to import the PDF summary tables or listings into SAS data sets. This paper presents an overview of a utility that reads in an uncompressed PDF file that is generated by SAS, extracts the data, and converts it to SAS data sets.
William Wu, Herodata LLC
Yun Guan, Theravance Biopharm
Importing metadata can be a time-consuming, painstaking, and fraught task when you have multiple SAS packages to deal with. Fortunately SAS® software provides a batch facility to import (and export) metadata by using the command line, thereby saving you from many a mouse click and potential repetitive strain injury when you use the Metadata Import Wizard. This paper aims to demystify the SAS batch import tool. It also introduces a template-driven process for programmatically building a batch file that contains the syntax needed to import metadata and then automatically executing the batch import scripts.
David Moors, Whitehound Limited
Looking for new ways to improve your business? Try mining your own data! Event log data is a side product of information systems generated for audit and security purposes and is seldom analyzed, especially in combination with business data. Along with the cloud computing era, more event log data has been accumulated and analysts are searching for innovative ways to take advantage of all data resources in order to get valuable insights. Process mining, a new field for discovering business patterns from event log data, has recently proved useful for business applications. Process mining shares some algorithms with data mining but it is more focused on interpretation of the detected patterns rather than prediction. Analysis of these patterns can lead to improvements in the efficiency of common existing and planned business processes. Through process mining, analysts can uncover hidden relationships between resources and activities and make changes to improve organizational structure. This paper shows you how to use SAS® Analytics to gain insights from real event log data.
Emily (Yan) Gao, SAS
Robert Chu, SAS
Xudong Sun, SAS
Statistical quality improvement is based on understanding process variation, which falls into two categories: variation that is natural and inherent to a process, and unusual variation due to specific causes that can be addressed. If you can distinguish between natural and unusual variation, you can take action to fix a broken process and avoid disrupting a stable process. A control chart is a tool that enables you to distinguish between the two types of variation. In many health care activities, carefully designed processes are in place to reduce variation and limit adverse events. The types of traditional control charts that are designed to monitor defect counts are not applicable to monitoring these rare events, because these charts tend to be overly sensitive, signaling unusual variation each time an event occurs. In contrast, specialized rare events charts are well suited to monitoring low-probability events. These charts have gained acceptance in health care quality improvement applications because of their ease of use and their suitability for processes that have low defect rates. The RAREEVENTS procedure, which is new in SAS/QC® 14.1, produces rare events charts. This paper presents an overview of PROC RAREEVENTS and illustrates how you can use rare events charts to improve health care quality.
Bucky Ransdell, SAS
Memory-based reasoning (MBR) is an empirical classification method that works by comparing cases in hand with similar examples from the past and then applying that information to the new case. MBR modeling is based on the assumptions that the input variables are numeric, orthogonal to each other, and standardized. The latter two assumptions are taken care of by principal components transformation of raw variables and using the components instead of the raw variables as inputs to MBR. To satisfy the first assumption, the categorical variables are often dummy coded. This raises issues such as increasing dimensionality and overfitting in the training data by introducing discontinuity in the response surface relating inputs and target variables. The Weight of Evidence (WOE) method overcomes this challenge. This method measures the relative response of the target for each group level of a categorical variable. Then the levels are replaced by the pattern of response of the target variable within that category. The SAS® Enterprise Miner™ Interactive Grouping Node is used to achieve this. With this method, the categorical variables are converted into numeric variables. This paper demonstrates the improvement in performance of an MBR model when categorical variables are WOE coded. A credit screening data set obtained from SAS® Education that comprises 25 attributes for 3000 applicants is used for this study. Three different types of MBR models were built using the SAS Enterprise Miner MBR node to check the improvement in performance. The results show the MBR model with WOE coded categorical variables is the best, based on misclassification rate. Using this data, when WOE coding is adopted, the model misclassification rate decreases from 0.382 to 0.344 while the sensitivity of the model increases from 0.552 to 0.572.
Vinoth Kumar Raja, West Corporation
Goutam Chakraborty, Oklahoma State University
Vignesh Dhanabal, Oklahoma State University
Many organizations want to innovate with the analytics solutions from SAS®. However, many companies are facing constraints in time and money to build an innovation strategy. In this session, you learn how SaasNow can offer you a flexible solution to become innovative again. During this session, you experience how easy it is to deploy a SAS® Visual Analytics and SAS® Visual Statistics environment in just 30 minutes in a pay-per-month model. Visitors to this session receive a free voucher to test-drive SaasNow!
Working with big data is often time consuming and challenging. The primary goal in programming is to maximize throughputs while minimizing the use of computer processing time, real time, and programmers' time. By using the Multiprocessing (MP) CONNECT method on a symmetric multiprocessing (SMP) computer, a programmer can divide a job into independent tasks and execute the tasks as threads in parallel on several processors. This paper demonstrates the development and application of a parallel processing program on a large amount of health-care data.
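A minimal sketch of the MP CONNECT pattern on an SMP machine (SAS/CONNECT is required, and the library and data set names are hypothetical): each RSUBMIT block becomes an independent task, and WAITFOR synchronizes them.

   options autosignon sascmd='!sascmd';      /* spawn local SAS sessions on demand */

   rsubmit task2015 wait=no inheritlib=(claims);
      proc means data=claims.y2015 noprint out=claims.stats2015;   /* hypothetical yearly extract */
      run;
   endrsubmit;

   rsubmit task2016 wait=no inheritlib=(claims);
      proc means data=claims.y2016 noprint out=claims.stats2016;
      run;
   endrsubmit;

   waitfor _all_ task2015 task2016;          /* block until both tasks finish */
   signoff _all_;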
Shuhua Liang, Kaiser Permanente
A picture is worth a thousand words, but what if there are a billion words? This is where the picture becomes even more important, and this is where infographics step in. Infographics are a representation of information in a graphic format designed to make the data easily understandable, at a glance, without having to have a deep knowledge of the data. Due to the amount of data available today, more infographics are being created to communicate information and insight from all available data, both in the boardroom and on social media. This session shows you how to create information graphics that can be printed, shared, and dynamically explored with objects and data from SAS® Visual Analytics. Connect your infographics to the high-performance analytical engine from SAS® for repeatability, scale, and performance on big data, and for ease of use. You will see how to leverage elements of your corporate dashboards and self-service analytics while communicating subjective information and adding the context that business teams require, in a highly visual format. This session looks at how SAS® Office Analytics enables a Microsoft Office user to create infographics for all occasions. You will learn the workflow that lets you get the most from your SAS Visual Analytics system without having to code anything. You will leave this session with the perfect blend of creative freedom and data governance that comes from leveraging the power of SAS Visual Analytics and the familiarity of Microsoft Office.
Travis Murphy, SAS
One of the tedious but necessary things that SAS® programmers must do is to trace and monitor SAS runs: counting observations and columns, checking performance, drawing flowcharts, and diagramming data flows. This is required in order to audit results, find errors, find long-running steps, identify large data files, and to simply understand what a particular job does. SAS® Enterprise Guide® includes functionality to help with some of this, on some SAS jobs. This paper presents an innovative custom tool that automatically produces flowcharts--showing all SAS steps in a job, file counts in and out, variable counts, recommendations and generated code for performance improvements, and more. The system was written completely in SAS using SAS source programs, SAS logs, PROC SCAPROC output, and more as input. This tool can eliminate much of the manual work needed in job optimization, space issues, debugging, change management, documentation, and job check-out. As has been said many times before: We have computers, let's use them.
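The raw material the tool consumes can be produced with PROC SCAPROC, sketched here in its simplest form (the output file name is illustrative): start recording at the top of the job and write the analysis file at the end.

   /* at the top of the job: start recording step, file, and variable usage */
   proc scaproc;
      record 'job_analysis.txt' attr opentimes;
   run;

   /* ... the DATA and PROC steps of the job run here ... */

   /* at the end of the job: write the accumulated analysis */
   proc scaproc;
      write;
   run;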
Steven First, Systems Seminar Consultants, Inc.
Jennifer First-Kluge, Systems Seminar Consultants, Inc.
An interactive SAS® environment is preferred for developing programs because it gives the flexibility of instantly viewing the log in the Log window. The programmer must review the log to ensure that every line of the program ran successfully without producing any of the messages defined by SAS that are potential errors. Reviewing the Log window every time is not only time consuming but also prone to manual error for programmers at any level. Currently, the interactive SAS environment provides no instant notification of whether the generated log contains any of these potentially erroneous messages. This paper introduces an instant approach to analyzing the Log window using the SAS macro %ICHECK, which displays its report instantly in the same SAS environment. The report summarizes all the SAS-defined messages found in the Log window. The programmer does not need to add %ICHECK at the end of the program: whether a single DATA step, a single PROC step, or the whole program is submitted, the %ICHECK macro executes automatically at the end of every submission. It might surprise you that a compiled macro can be executed without being called in the Editor window, but it is possible with %ICHECK, and you can develop it using only SAS products. It can be used in a Windows or UNIX interactive SAS environment without requiring any user input. The proposed approach delivers a significant benefit in the log review process and saves time for programmers at every level, because the log is confirmed to be free from error. Similar functionality could be introduced in the SAS product itself.
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
PALANISAMY MOHAN, ICON CLINICAL RESEARCH INDIA PVT LTD
Real-time, integrated marketing solutions are a necessity for maintaining your competitive advantage. This presentation provides a brief overview of three SAS products (SAS® Marketing Automation, SAS® Real-Time Decision Manager, and SAS® Event Stream Processing) that form a basis for building modern, real-time, interactive marketing solutions. It presents typical (and possible) customer use cases that you can implement with a comprehensive real-time interactive marketing solution, in major industries like finance (banking), telco, and retail. It demonstrates typical functional architectures that need to be implemented to support business cases (how solution components collaborate with the customer's IT landscape and with each other). And it provides examples of our experience in implementing these solutions--dos and don'ts, best practices, and what to expect from an implementation project.
Dmitriy Alergant, Tier One Analytics
Marje Fecht, Prowerk Consulting
Microsoft Visual Basic Scripting Edition (VBScript) and SAS® software are each powerful tools in their own right. These two technologies can be combined so that SAS code can call a VBScript program or vice versa. This gives a programmer the ability to automate SAS tasks; traverse the file system; send emails programmatically via Microsoft Outlook or SMTP; manipulate Microsoft Word, Microsoft Excel, and Microsoft PowerPoint files; get web data; and more. This paper presents example code to demonstrate each of these capabilities.
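A tiny sketch of one direction of the integration, calling a VBScript file from a SAS session on Windows (the script path is hypothetical):

   options noxwait xsync;                                   /* run the command silently and wait for it */
   x 'cscript //nologo "C:\scripts\refresh_workbook.vbs"';  /* hypothetical VBScript program            */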
Christopher Johnson, BrickStreet Insurance
In studies where randomization is not possible, imbalance in baseline covariates (confounding by indication) is a fundamental concern. Propensity score matching (PSM) is a popular method to minimize this potential bias, matching individuals who received treatment to those who did not, to reduce the imbalance in pre-treatment covariate distributions. PSM methods continue to advance, as computing resources expand. Optimal matching, which selects the set of matches that minimizes the average difference in propensity scores between mates, has been shown to outperform less computationally intensive methods. However, many find the implementation daunting. SAS/IML® software allows the integration of optimal matching routines that execute in R, e.g. the R optmatch package. This presentation walks through performing optimal PSM in SAS® through implementing R functions, assessing whether covariate trimming is necessary prior to PSM. It covers the propensity score analysis in SAS, the matching procedure, and the post-matching assessment of covariate balance using SAS/STAT® 13.2 and SAS/IML procedures.
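A hedged sketch of the SAS/IML-to-R handoff (SAS must be started with the RLANG option, the optmatch package must be installed, and the data set and variable names are hypothetical):

   proc iml;
      call ExportDataSetToR('work.ps_data', 'psdata');     /* send propensity scores to R      */
      submit / R;
         library(optmatch)
         pm <- pairmatch(treat ~ ps, data = psdata)        # optimal 1:1 matching on the score
         psdata$match_id <- as.character(pm)
      endsubmit;
      call ImportDataSetFromR('work.matched', 'psdata');   /* bring the match IDs back to SAS  */
   quit;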
Lucy D'Agostino McGowan, Vanderbilt University
Robert Greevy, Department of Biostatistics, Vanderbilt University
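A minimal sketch of the SAS/IML-to-R round trip described above. It assumes SAS was started with the RLANG system option, that R and the optmatch package are installed, and that a data set work.cohort already holds a treatment flag (treat) and an estimated propensity score (pscore); the pairmatch call shown is an assumption about the optmatch interface, so consult its documentation before relying on it.

proc iml;
   /* Send the SAS data set to an R data frame */
   call ExportDataSetToR("work.cohort", "cohort");

   submit / R;
      library(optmatch)
      # Pair match on the previously estimated propensity score
      cohort$match_id <- pairmatch(treat ~ pscore, data = cohort)
      matched <- cohort[!is.na(cohort$match_id), ]
   endsubmit;

   /* Bring the matched sample back into SAS for balance assessment */
   call ImportDataSetFromR("work.matched", "matched");
quit;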
Companies looking for an optimal solution to run their SAS® Analytics need a seamless way to manage their data between many different systems, including commodity Hadoop storage and the more traditional data warehouse. This presentation outlines a simple path for building a single platform that integrates SAS®, Hadoop, and the data warehouse into a single, pre-configured solution, as well as strategies for querying data within multiple existing systems and combining the results to produce even more powerful decision-making possibilities.
This paper builds on the knowledge gained in the paper 'Introduction to ODS Graphics.' The capabilities in ODS Graphics grow with every release as both new paradigms and smaller tweaks are introduced. After talking with the ODS developers, I have chosen a selection of the many wonderful capabilities to highlight here. This paper provides the reader with more tools for his or her tool belt. Visualization of data is an important part of telling the story seen in the data. And while the standards and defaults in ODS Graphics are very well done, sometimes the user has specific nuances for characters in the story or additional plot lines they want to incorporate. Almost any possibility, from drama to comedy to mystery, is available in ODS Graphics if you know how. We explore tables, annotation and changing attributes, as well as the block plot. Any user of Base SAS® on any platform will find great value from the SAS® ODS Graphics procedures. Some experience with these procedures is assumed, but not required.
Chuck Kincaid, Experis BI & Analytics Practice
Organizations view Hadoop as a lower cost means to assemble their data in one location. Increasingly, the Hadoop environment also provides the processing horsepower to analyze the patterns contained in this data to create business value. SAS® Grid Manager for Hadoop allows you to co-locate your SAS® Analytics workload directly on your Hadoop cluster via an integration with the YARN and Oozie components of the Hadoop ecosystem. YARN allows Hadoop to become the operating system for your data: it is a tool that manages and mediates access to the shared pool of data, and manages and mediates the resources that are used to manipulate the pool. Learn the capabilities and benefits of SAS Grid Manager for Hadoop as well as some configuration tips. In addition, sample SAS Grid jobs are provided to illustrate different ways to access and analyze your Hadoop data with your SAS Grid jobs.
Cheryl Doninger, SAS
Doug Haigh, SAS
This presentation teaches the audience how to use ODS Graphics. Now part of Base SAS®, ODS Graphics are a great way to easily create clear graphics that enable users to tell their stories well. SGPLOT and SGPANEL are two of the procedures that can be used to produce powerful graphics that used to require a lot of work. The core of the procedures is explained, as well as some of the many options available. Furthermore, we explore the ways to combine the individual statements to make more complex graphics that tell the story better. Any user of Base SAS on any platform will find great value in the SAS® ODS Graphics procedures.
Chuck Kincaid, Experis BI & Analytics Practice
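For readers new to these procedures, a small self-contained example of the kind of plot the paper introduces, using the SASHELP.CLASS sample data that ships with SAS:

title "Weight by Height, Grouped by Sex";
proc sgplot data=sashelp.class;
   scatter x=height y=weight / group=sex;
   reg x=height y=weight;        /* overlay a fitted regression line */
   xaxis label="Height (in)";
   yaxis label="Weight (lb)";
run;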
The SAS® Output Delivery System (ODS) ExcelXP tagset offers users the ability to export a SAS data set, with all of the accompanying functions and calculations, to a Microsoft Excel spreadsheet. This is particularly useful in industry because not everyone is a SAS programmer, yet many people need to access and manipulate data that comes from SAS. The ExcelXP tagset is one of several built-in templates for exporting data from SAS. The tagset gives programmers the ability to export functions and calculations into the cells of an Excel spreadsheet. Several options within the tagset enable the programmer to customize the Excel file. Some of these options enable the programmer to name the worksheet, style each table, embed titles and footnotes, and export multiple data tables to the same Excel worksheet.
Veronica Renauldo, Grand Valley State University
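A minimal sketch of the tagset in use (the file path is illustrative; SHEET_NAME and EMBEDDED_TITLES are two of the many documented tagset options):

ods tagsets.excelxp file="c:\temp\class_report.xml"
    options(sheet_name="Class Listing" embedded_titles="yes");

title "Student Listing from SASHELP.CLASS";
proc print data=sashelp.class noobs label;
run;

ods tagsets.excelxp close;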
While there is no shortage of MS and MA programs in data science, PhD programs have been slow to develop. Is there a role for a PhD in data science? How would this differ from an advanced degree in statistics? Computer science? This session explores the issues related to the curriculum content for a PhD in data science and what those students would be expected to do after graduation. Come join the conversation.
Jennifer Priestley, Kennesaw State University
Every day, companies all over the world are moving their data into the cloud. While there are many options available, much of this data will wind up in Amazon Redshift. As a SAS® user, you are probably wondering, 'What is the best way to access this data using SAS?' This paper discusses the many ways that you can use SAS/ACCESS® software to get to Amazon Redshift. We compare and contrast the various approaches and help you decide which is best for you. Topics include building a connection, moving data into Amazon Redshift, and applying performance best practices.
Chris DeHart, SAS
Jeff Bailey, SAS
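One hedged sketch of the LIBNAME route the paper compares (the server, database, credentials, and table are placeholders, and the exact option names should be verified against the SAS/ACCESS® Interface to Amazon Redshift documentation for your release):

libname rs redshift
   server="myredshift.example.us-east-1.redshift.amazonaws.com"
   port=5439
   database=sales
   user=myuser
   password="XXXXXXXX"
   schema=public;

/* Let the database do the aggregation where possible, then bring back the summary */
proc sql;
   create table work.rev_by_region as
   select region, sum(revenue) as total_revenue
   from rs.orders
   group by region;
quit;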
Gone are the days when the only method of receiving a loan was by visiting your local branch and working with a loan officer. In today's economy, financial institutions increasingly rely on online channels to interact with their customers. The anonymity that is inherent in this channel makes it a prime target for fraudsters. The solution is to profile the behavior of internet banking in real time and assess each transaction for risk as it is processed in order to prevent financial loss before it occurs. SAS® Visual Scenario Designer enables you to create rules, scenarios, and models, test their impact, and inject them into real-time transaction processing using SAS® Event Stream Processing.
Sam Atassi, SAS
Jamie Hutton, SAS
Do you want to see and experience how to configure SAS® Enterprise Miner™ single sign-on? Are you looking to explore setting up Integrated Windows Authentication with SAS® Visual Analytics? This hands-on workshop demonstrates how you can configure Kerberos delegation with SAS® 9.4. You see how to validate the prerequisites, make the configuration changes, and use the applications. By the end of this workshop you will be empowered to start your own configuration.
Stuart Rogers, SAS
High-quality effective graphs not only enhance understanding of the data but also facilitate regulators in the review and approval process. In recent SAS® releases, SAS has made significant progress toward more efficient graphing in ODS Statistical Graphics (SG) procedures and Graph Template Language (GTL). A variety of graphs can be quickly produced using convenient built-in options in SG procedures. With graphical examples and comparison between SG procedures and traditional SAS/GRAPH® procedures in reporting clinical trial data, this paper highlights several key features in ODS Graphics to efficiently produce sophisticated statistical graphs with more flexible and dynamic control of graphical presentation including: 1) Better control of axes in different scales and intervals; 2) Flexible ways to control graph appearance; 3) Plots overlay in single-cell or multi-cell graphs; 4) Enhanced annotation; 5) Classification panel of multiple plots with individualized labeling.
Yuxin (Ellen) Jiang, Biogen
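As a hedged illustration of the plot overlay and axis control highlighted above (the data set adsl_means and its variables are hypothetical):

proc sgplot data=adsl_means;
   series  x=visitnum y=mean / group=trt markers;
   scatter x=visitnum y=mean / group=trt
           yerrorlower=lcl yerrorupper=ucl;   /* overlay error bars on the means */
   xaxis label="Visit" values=(0 to 24 by 4);
   yaxis label="Mean Change from Baseline";
   keylegend / location=outside position=bottom;
run;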
Considering that SAS® Grid Manager is becoming more and more popular, it is important to fulfill users' needs for a successful migration to a SAS® Grid environment. This paper focuses on key requirements and common issues for new SAS Grid users, especially those coming from a traditional environment. It describes a few common requirements, such as the need for a current working directory, changing file system navigation in SAS® Enterprise Guide® to a user-specified location, and receiving a job execution summary email. The GRIDWORK directory introduced in SAS Grid Manager is a bit different from the traditional SAS WORK location; this paper explains how you can use the GRIDWORK location in a more user-friendly way. Users sometimes experience data set size differences during grid migration, and a few important reasons for these differences are demonstrated. We also demonstrate how to create new custom scripts to meet business needs and how to incorporate them with the SAS Grid Manager engine.
Piyush Singh, TATA Consultancy Services Ltd
Tanuj Gupta, TATA Consultancy Services
Prasoon Sangwan, Tata Consultancy Services Limited
You are a skilled SAS® programmer. You can code circles around those newbies who point and click in SAS® Enterprise Guide®. And yet there are tasks you struggle with on a regular basis, such as "Is the name of that data set DRUG or DRUGS?" and "What intern wrote this code? It's not formatted well at all and is hard to read." In this seminar you learn how to program, yes program, more efficiently. You learn the benefits of autocomplete and inline help, as well as how to easily format that hard-to-read code you inherited from the intern. In addition, you learn how to create a process flow of a program to identify any dead ends, that is, data sets that get created but are not used in that program.
Michelle Buchecker, ThotWave Technologies
This paper presents the use of latent class analysis (LCA) to identify a set of mutually exclusive latent classes of individuals based on their responses to a set of categorical observed variables. The LCA procedure, a user-defined SAS® procedure for conducting LCA and LCA with covariates, is demonstrated using follow-up data on substance use from the Monitoring the Future panel study, a nationally representative sample of high school seniors who are followed at selected time points during adulthood. The demonstration includes guidance on data management prior to analysis, PROC LCA syntax requirements and options, and interpretation of output.
Patricia Berglund, University of Michigan
The current study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Several analyses for latent variable discovery are briefly reviewed and explored using PROC LCA, PROC LTA, PROC TRAJ, and PROC CALIS. The latent variables are then included in separate regression models, and the effect of the latent variables on the fit and use of the regression model, compared to a similar model using observed data, is briefly reviewed. The data used for this study were obtained from the National Longitudinal Study of Adolescent Health (Add Health). Data were analyzed using SAS® 9.4. This paper is intended for any level of SAS® user and is written for an audience with a background in behavioral science and/or statistics.
Deanna Schreiber-Gregory, National University
From stock price histories to hospital stay records, analysis of time series data often requires use of lagged (and occasionally lead) values of one or more analysis variables. For the SAS® user, the central operational task is typically getting lagged (lead) values for each time point in the data set. While SAS has long provided a LAG function, it has no analogous lead function--an especially significant problem in the case of large data series. This paper reviews the LAG function, in particular the powerful, but non-intuitive implications of its queue-oriented basis. The paper demonstrates efficient ways to generate leads with the same flexibility as the LAG function, but without the common and expensive recourse to data re-sorting. It also shows how to dynamically generate leads and lags through use of the hash object.
Mark Keintz, Wharton Research Data Services
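One widely used way to build a lead without re-sorting, in the spirit of the paper (the data set and variable names are illustrative): merge the data set with itself, offset by one observation using FIRSTOBS=.

/* price_lead holds the next observation's price; the final row gets a missing value.
   Note: this simple one-to-one merge ignores BY groups, which the paper also handles. */
data with_lead;
   merge prices
         prices(firstobs=2 keep=price rename=(price=price_lead));
run;

/* For comparison, a one-period lag comes straight from the LAG function */
data with_lag;
   set prices;
   price_lag = lag(price);
run;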
Quality issues can easily go undetected in the field for months, even years. Why? Customers are more likely to complain through their social networks than directly through customer service channels. The Internet is the closest friend for most users, and focusing purely on internal data isn't going to help. Customer complaints are often handled by marketing or customer service and aren't shared with engineering. This lack of integrated vision is due to the silo mentality in organizations and proves costly in the long run. Information systems are designed with business rules to find out-of-spec situations or to trigger when a certain threshold of reported cases is reached before engineering is involved. Complex business problems demand sophisticated technology to deliver advanced analytical capabilities. Organizations can react only to what they know about, but unstructured data is an invaluable asset that we must latch on to. So what if you could infuse the voice of the customer directly into engineering and detect emerging issues months earlier than your existing process allows? Lenovo just did that. Issues were detected in time to remove problematic parts, eventually leading to massive reductions in cost and time. Lenovo observed a significant decline in call volume to its corporate call center. It became evident that customers' perception of quality relative to the market is a voice we can't afford to ignore. The Lenovo Corporate Analytics unit partnered with SAS® to build an end-to-end process that comprises data acquisition and preparation, analysis using linguistic text analytics models, and quality control metrics for visualization dashboards. Lenovo executives consume these reports and make informed decisions. In this paper, we talk in detail about perceptual quality and why we should care about it. We also tell you where to look for perceptual quality data and share best practices for starting your own perceptual quality program through voice-of-the-customer analytics.
Murali Pagolu, SAS
Rob Carscadden, SAS
In this paper, a SAS® macro is introduced that can help users find and access their folders and files very easily. By providing a path to the macro and letting the macro know which folders and files you are looking for under this path, the macro creates an HTML report that lists the matched folders and files. The best part of this HTML report is that it also creates a hyperlink for each folder and file so that when a user clicks the hyperlink, it directly opens the folder or file. Users can also ask the macro to find certain folders or files by providing part of the folder or file name as the search criterion. The results shown in the report can be sorted in different ways so that it can further help users quickly find and access their folders and files.
Ting Sa, Cincinnati Children's Hospital Medical Center
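A minimal sketch of the directory-reading core such a macro relies on (this is not the author's macro, and the path is illustrative):

data folder_contents;
   length member $256;
   rc  = filename("d", "c:\projects\study01");   /* point a fileref at the folder */
   did = dopen("d");
   if did > 0 then do i = 1 to dnum(did);
      member = dread(did, i);                    /* each file or subfolder name */
      output;
   end;
   rc = dclose(did);
   keep member;
run;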
Are you still using TRIM, LEFT, and vertical bar operators to concatenate strings? It's time to modernize and streamline that clumsy code by using the string concatenation functions introduced in SAS®9. This paper is an overview of the CAT, CATS, CATT, and CATX functions introduced in SAS®9, and the new CATQ function added in SAS® 9.2. In addition to making your code more compact and readable, this family of functions offers some new tricks for accomplishing previously cumbersome tasks.
Josh Horstman, Nested Loop Consulting
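For readers who have not yet made the switch, a before-and-after sketch (the variable names are illustrative):

data names;
   length full_old full_new $60;
   first = '  Ada  ';
   last  = 'Lovelace';

   /* The old way: trim, left-align, and glue the pieces with vertical bars */
   full_old = trim(left(last)) || ', ' || trim(left(first));

   /* The SAS 9 way: CATX strips blanks and inserts the delimiter for you */
   full_new = catx(', ', last, first);
run;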
Report automation and scheduling are very hot topics in many corporate industries. Automating reports has many advantages, including reducing workload, eliminating repetitive tasks, generating accurate results, and offering better performance. In recent years SAS® launched more powerful tools to present and share business analytics data. This paper illustrates the stepwise process of deploying and scheduling reports on a server using SAS® Management Console 9.4. Jobs can be scheduled either through Windows or at the server level, and it is important to note that server-side scheduling has more advantages than scheduling jobs on Windows. The Windows scheduler invokes SAS programs on a local PC, which is more often subject to system crashes. The main advantage of scheduling on the server is that most scheduled jobs run overnight, when record retrieval is faster and the load on database servers is lighter. Other advantages of server-side scheduling are that all scheduled jobs reside in one location and that it is easy to maintain and track log files if any scheduled job fails to run. This paper includes an overview of the schedule manager in SAS Management Console 9.4, a description of system tools and their options, instructions for converting SAS® Enterprise Guide® point-and-click programs into a consolidated SAS program for deployment, and several other related topics.
Anjan Matlapudi, AmeriHealth Caritas Family of Companies
Is uniqueness essential for your reports? SAS® Visual Analytics provides the ability to customize your reports to make them unique by using the SAS® Theme Designer. The SAS Theme Designer can be accessed from the SAS® Visual Analytics Hub to create custom themes to meet your branding needs and to ensure a unified look across your company. The report themes affect the colors, fonts, and other elements that are used in tables and graphs. The paper explores how to access SAS Theme Designer from the SAS Visual Analytics home page, how to create and modify report themes that are used in SAS Visual Analytics, how to create report themes from imported custom themes, and how to import and export custom report themes.
Meenu Jaiswal, SAS
Ipsita Samantarai, SAS Research & Development (India) Pvt Ltd
As a result of globalization, the durable goods market has become increasingly competitive, with market conditions that challenge profitability for manufacturers. Moreover, high material costs and the capital-intensive nature of the industry make it essential that companies understand demand signals and utilize supply chain capacity as effectively as possible. To grow and increase profitability under these challenging market conditions, a major durable goods company has partnered with SAS to streamline analysis of pricing and profitability, optimize inventory, and improve service levels to its customers. The price of a product is determined by a number of factors, such as the strategic importance of customers, supply chain costs, market conditions, and competitive prices. Offering promotions is an important part of a marketing strategy; it impacts purchasing behaviors of business customers and end consumers. This paper describes how this company developed a system to analyze product profitability and the impact of promotion on purchasing behaviors of both their business customers and end consumers. This paper also discusses how this company uses integrated demand planning and inventory optimization to manage its complex multi-echelon supply chain. The process uses historical order data to create a statistical forecast of demand, and then optimizes inventory across the supply chain to satisfy the forecast at desired service levels.
Varunraj Valsaraj, SAS
Bahadir Aral, SAS Institute Inc
Baris Kacar, SAS Institute Inc
Jinxin Yi, SAS Institute Inc
We spend so much time talking about GRID environments, distributed jobs, and huge data volumes that we ignore the thousands of relatively tiny programs scheduled to run every night, which produce only a few small data sets, but make all the difference to the users who depend on them. Individually, these jobs might place a negligible load on the system, but by their sheer number they can often account for a substantial share of overall resources, sometimes even impacting the performance of the bigger, more important jobs. SAS® programs, by their varied nature, use available resources in a varied way. Depending on whether a SAS® procedure is CPU-, disk-, or memory-intensive, chunks of memory or CPU can sometimes remain unused for quite a while. Bigger jobs leave bigger chunks, and this is where being small and able to effectively exploit those leftovers can be a great advantage. We call our main optimization technique the Fast Lane: a queue configuration dedicated to jobs with consistently small SASWORK directories that, when resources are available, lets them use unused RAM in place of their SASWORK disk. The approach considerably improves overall CPU saturation, takes load off the I/O subsystem, and consistently delivers improved runtimes for big jobs and small jobs alike, without requiring any changes to deployed code. This paper explores the practical aspects of implementing the Fast Lane in your environment: time-series profiling of the overnight workload to find available resource gaps, identifying candidate jobs to fill those gaps, scheduling and queue triggers, application server configuration changes, and the monitoring and control techniques that keep everything running smoothly. And, of course, it presents the very real, tangible performance gains that the Fast Lane has delivered in real-world scenarios.
Nikola Markovic, Boemska Ltd.
Greg Nelson, ThotWave
The Limit of Detection (LoD) is defined as the lowest concentration or amount of material, target, or analyte that is consistently detectable (for polymerase chain reaction [PCR] quantitative studies, in at least 95% of the samples tested). In practice, the estimation of the LoD uses a parametric curve fit to a set of panel member (PM1, PM2, PM3, and so on) data where the responses are binary. Typically, the parametric curve fit to the percent detection levels takes on the form of a probit or logistic distribution. The SAS® PROBIT procedure can be used to fit a variety of distributions, including both the probit and logistic. We introduce the LOD_EST SAS macro that takes advantage of the SAS PROBIT procedure's strengths and returns an information-rich graphic as well as a percent detection table with associated 95% exact (Clopper-Pearson) confidence intervals for the hit rates at each level.
Jesse Canchola, Roche Molecular Systems, Inc.
Pari Hemyari, Roche Molecular Systems, Inc.
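A hedged sketch of the kind of model underlying such a macro (this is not the LOD_EST macro itself; the data set panel, with hits out of n samples tested at each concentration, is hypothetical):

/* Logistic fit to hit rates, with inverse confidence limits for the detection levels */
proc probit data=panel log10;
   model hits/n = conc / d=logistic inversecl;
   output out=pred p=phat;      /* predicted detection probability at each level */
run;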
Retailers need critical information about the expected inventory pattern over the life of a product to make pricing, replenishment, and staffing decisions. Hotels rely on booking curves to set rates and staffing levels for future dates. This paper explores a linear state space approach to understanding these industry challenges, applying the SAS/ETS® SSM procedure. We also use the SAS/ETS SIMILARITY procedure to provide additional insight. These advanced techniques help us quantify the relationship between the current inventory level and all previous inventory levels (in the retail case). In the hospitality example, we can evaluate how current total bookings relate to historical booking levels. Applying these procedures can produce valuable new insights about the nature of the retail inventory cycle and the hotel booking curve.
Beth Cubbage, SAS
Business Intelligence users analyze business data in a variety of ways. Seventy percent of business data contains location information. For in-depth analysis, it is essential to combine location information with mapping. New analytical capabilities are added to SAS® Visual Analytics, leveraging the new partnership with Esri, a leader in location intelligence and mapping. The new capabilities enable users to enhance the analytical insights from SAS Visual Analytics. This paper demonstrates and discusses the new partnership with Esri and the new capabilities added to SAS Visual Analytics.
Murali Nori, SAS
Himesh Patel, SAS
Does lyric complexity impact song popularity, and can analysis of the Billboard Top 100 from 1955-2015 be used to evaluate this hypothesis? The music industry has undergone a dramatic change. New technologies enable everyone's voice to be heard and have created new avenues for musicians to share their music. These technologies are destroying entry barriers, resulting in an exponential increase in competition in the music industry. For this reason, optimization that enables musicians and record labels to recognize opportunities to release songs with a likelihood of high popularity is critical. The purpose of this study is to determine whether the complexity of song lyrics impacts popularity and to provide guidance and useful information toward business opportunities for musicians and record labels. One such opportunity is the optimization of advertisement budgets for songs that have the greatest chance for success, at the appropriate time and for the appropriate audience. Our data has been extracted from open-source hosts, loaded into a comprehensive, consolidated data set, and cleaned and transformed using SAS® as the primary tool for analysis. Currently our data set consists of 334,784 Billboard Top 100 observations, with 26,869 unique songs. The types of analyses used include: complexity-popularity relationship, popularity churn (how often the Billboard Top 100 resets), top five complexity-popularity relationships, longevity-complexity relationship, popularity prediction, and sentiment analysis on trends over time. We have defined complexity as the number of words in a song, unique word inclusion (compared to other songs), and repetition of each word in a song. The Billboard Top 100 represents an optimal source of data as all songs on the list have some measure of objective popularity, which enables lyric comparison analysis among chart positions.
John Harden, Oklahoma State University
Yang Gao, Oklahoma State University
Vojtech Hrdinka, Oklahoma State University
Chris Linn, Oklahoma State University
Markov chain Monte Carlo (MCMC) algorithms are an essential tool in Bayesian statistics for sampling from various probability distributions. Many users prefer to use an existing procedure to code these algorithms, while others prefer to write an algorithm from scratch. We demonstrate the various capabilities in SAS® software to satisfy both of these approaches. In particular, we first illustrate the ease of using the MCMC procedure to define a structure. Then we step through the process of using SAS/IML® to write an algorithm from scratch, with examples of a Gibbs sampler and a Metropolis-Hastings random walk.
Chelsea Lofland, University of California, Santa Cruz
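To make the contrast concrete, here is a small PROC MCMC example for a normal mean with unknown variance (the data set and prior choices are illustrative); the SAS/IML samplers in the paper hand-code the equivalent Gibbs and Metropolis-Hastings updates from scratch.

proc mcmc data=mydata nmc=20000 seed=1234 outpost=posterior;
   parms mu 0 sigma2 1;
   prior mu     ~ normal(0, var=1e6);        /* vague prior on the mean */
   prior sigma2 ~ igamma(0.01, scale=0.01);  /* vague prior on the variance */
   model y ~ normal(mu, var=sigma2);
run;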
For SAS® Enterprise Guide® users, sometimes macro variables and their values need to be brought over to the local workspace from the server, especially when multiple data sets or outputs need to be written to separate files in a local drive. Manually retyping the macro variables and their values in the local workspace after they have been created on the server workspace would be time-consuming and error-prone, especially when we have quite a number of macro variables and values to bring over. Instead, this task can be achieved in an efficient manner by using dictionary tables and the CALL SYMPUT routine, as illustrated in more detail below. The same approach can also be used to bring macro variables and their values from the local to the server workspace.
Khoi To, Office of Planning and Decision Support, Virginia Commonwealth University
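A minimal sketch of the dictionary-table step (the WHERE clause and scope are illustrative): capture the macro variables in one workspace as a data set, then re-create them with CALL SYMPUTX wherever that data set is available.

/* In the source workspace: snapshot the global macro variables */
proc sql;
   create table work.mvars as
   select name, value
   from dictionary.macros
   where scope = 'GLOBAL';
quit;

/* In the target workspace: re-create each macro variable from the snapshot.
   Note: values longer than 200 characters span multiple rows (OFFSET column). */
data _null_;
   set work.mvars;
   call symputx(name, value, 'G');
run;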
Macroeconomic simulation analysis provides in-depth insights into a portfolio's performance spectrum. Conventionally, portfolio and risk managers obtain macroeconomic scenarios from third parties such as the Federal Reserve and determine portfolio performance under the provided scenarios. In this paper, we propose a technique to extend scenario analysis to an unconditional simulation capturing the distribution of possible macroeconomic climates and hence the true multivariate distribution of returns. We propose a methodology that adds to the existing scenario analysis tools and can be used to determine which types of macroeconomic climates have the most adverse outcomes for the portfolio. This provides a broader perspective on value-at-risk measures, thereby allowing more robust investment decisions. We explain the use of SAS® procedures such as VARMAX and COPULA, as well as SAS/IML®, in this analysis.
Srikant Jayaraman, SAS
Joe Burdis, SAS Research and Development
Lokesh Nagar, SAS Research and Development
Propensity scores produced by logistic models have been used intensively to assist name selection in direct marketing. As a result, only customers with a higher likelihood to respond are mailed offers, which reduces cost and increases event ROI. There is a fly in the ointment, however. Compared to the model-building performance window, usually 6 to 12 months, a marketing event covers a much shorter period. As such, this approach lacks the ability to deselect customers who have a high propensity score but are unlikely to respond to the upcoming campaign. Dynamically building a complete propensity model for every upcoming campaign is nearly impossible, but incorporating time to respond has been of great interest to marketers as another dimension for enhancing response prediction. Hence, this paper presents an inventive modeling technique that combines logistic regression and the Cox proportional hazards model. The objective of the fusion approach is to allow a customer's shorter expected time to repurchase to compensate for a somewhat lower propensity score when selection opportunities are awarded. The method is accomplished using PROC LOGISTIC, PROC LIFETEST, PROC LIFEREG, and PROC PHREG on the fusion model built in a SAS® environment. This paper also shows how to use the results to predict repurchase response by demonstrating a case of repurchase time-shift prediction on the 12-month inactive customers of a big-box retailer. The paper also compares results between the fusion approach and logit alone. Comprehensive SAS macros are provided in the appendix.
Hsin-Yi Wang, Alliance Data Systems
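As a hedged sketch of the survival half of the fusion (the data set and variables are hypothetical; the paper's macros pair this with a PROC LOGISTIC propensity model):

/* Time-to-repurchase model: days_to_buy is the gap, repurchase=0 flags censoring */
proc phreg data=custhist;
   model days_to_buy*repurchase(0) = recency frequency monetary;
   baseline out=surv survival=s_hat;   /* baseline survival curve at reference covariate values */
run;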
SAS® Visual Analytics Explorer puts the robust power of decision trees at your fingertips, enabling you to visualize and explore how data is structured. Decision trees help analysts better understand discrete relationships within data by visually showing how combinations of variables lead to a target indicator. This paper explores the practical use of decision trees in SAS Visual Analytics Explorer through an example of risk classification in the financial services industry. It explains various parameters and implications, explores ways the decision tree provides value, and offers alternative methods to help you deal with the reality of imperfect data.
Stephen Overton, Zencos Consulting LLC
Ben Murphy, Zencos Consulting LLC
As part of promoting a data-driven culture and data analytics modernization for its federal sector clientele, Northrop Grumman developed a framework for designing and implementing an in-house Data Analytics and Research Center (DAARC) using a set of SAS® tools. This DAARC provides a complete set of SAS® Enterprise BI (Business Intelligence) and SAS® Data Management tools. The platform can be used for data research, evaluations, analyses, and reviews by federal agencies such as the Social Security Administration (SSA), the Centers for Medicare and Medicaid Services (CMS), and others. The DAARC architecture is based on a SAS data analytics platform with newer capabilities of data mining, forecasting, visual analytics, and data integration using SAS® Business Intelligence. These capabilities enable developers, researchers, and analysts to explore big data sets with varied data sources, create predictive models, and perform advanced analytics including forecasting, anomaly detection, use of dashboards, and creating online reports. The DAARC framework that Northrop Grumman developed enables agencies to implement a self-sufficient 'analytics as a service' approach to meet their business goals by making informed and proactive data-driven decisions. This paper provides a detailed approach to how the DAARC framework was established in strong partnership with federal customers of Northrop Grumman. This paper also discusses the best practices that were adopted for implementing specific business use cases in order to save taxpayer dollars through the many research-related analytical and statistical initiatives that continue to use this platform.
Vivek Sethunatesan, Northrop Grumman
When I help users design or debug their SAS® programs, they are sometimes unable to provide relevant SAS data sets because they contain confidential information. Sometimes, confidential data values are intrinsic to their problem, but often the problem could still be identified or resolved with innocuous data values that preserve some of the structure of the confidential data. Or the confidential values are in variables that are unrelated to the problem. While techniques for masking or disguising data exist, they are often complex or proprietary. In this paper, I describe a very simple macro, REVALUE, that can change the values in a SAS data set. REVALUE preserves some of the structure of the original data by ensuring that for a given variable, observations with the same real value have the same replacement value, and if possible, observations with a different real value have a different replacement value. REVALUE enables the user to specify which variables to change and whether to order the replacement values for each variable by the sort order of the real values or by observation order. I discuss the REVALUE macro in detail and provide a copy of the macro.
Bruce Gilsen, Federal Reserve Board
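A minimal sketch of the idea for a single character variable (this is not the REVALUE macro, and the names are illustrative): map each distinct real value to a generated replacement, then merge the mapping back onto the data.

/* Build a lookup of distinct values and assign a surrogate to each */
proc sort data=confid(keep=acct_name) out=lookup nodupkey;
   by acct_name;
run;

data lookup;
   set lookup;
   new_name = cats("NAME_", put(_n_, z5.));   /* same real value -> same surrogate */
run;

/* Replace the real values in the analysis copy */
proc sql;
   create table masked as
   select l.new_name as acct_name, c.balance, c.region
   from confid as c
        left join lookup as l
        on c.acct_name = l.acct_name;
quit;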
Business problems have become more stratified and micro-segmentation is driving the need for mass-scale, automated machine learning solutions. Additionally, deployment environments include diverse ecosystems, requiring hundreds of models to be built and deployed quickly via web services to operational systems. The new SAS® automated modeling tool allows you to build and test hundreds of models across all of the segments in your data, testing a wide variety of machine learning techniques. The tool is completely customizable, allowing you transparent access to all modeling results. This paper shows you how to identify hundreds of champion models using SAS® Factory Miner, while generating scoring web services using SAS® Decision Manager. Immediate benefits include efficient model deployments, which allow you to spend more time generating insights that might reveal new opportunities, expose hidden risks, and fuel smarter, well-timed decisions.
Jonathan Wexler, SAS
Steve Sparano, SAS
The SQL procedure is extremely powerful when it comes to summarizing and aggregating data, but it can be a little daunting for programmers who are new to SAS® or for more experienced programmers who are more familiar with using the SUMMARY or MEANS procedure for aggregating data. This hands-on workshop demonstrates how to use PROC SQL for a variety of summarization and aggregation tasks. These tasks include summarizing multiple measures for groupings of interest, combining summary and detail information (via several techniques), nesting summary functions by using inline views, and generating summary statistics for derived or calculated variables. Attendees will learn how to use a variety of PROC SQL summary functions, how to effectively use WHERE and HAVING clauses in constructing queries, and how to exploit the PROC SQL remerge. The examples become progressively more complex over the course of the workshop as you gain mastery of using PROC SQL for summarizing data.
Christianna Williams, Self-Employed
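A small taste of the workshop material, using SASHELP.CLASS: a grouped summary with a HAVING clause, followed by the PROC SQL remerge that attaches a group statistic back onto the detail rows.

proc sql;
   /* Grouped summary, keeping only groups with at least 5 students */
   select sex,
          count(*)     as n_students,
          mean(weight) as avg_weight format=6.1
   from sashelp.class
   group by sex
   having count(*) >= 5;

   /* Remerge: each detail row gets its group mean alongside it */
   select name, sex, weight,
          mean(weight) as sex_avg_weight format=6.1
   from sashelp.class
   group by sex;
quit;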
It is of paramount importance for brand managers to measure and understand consumer brand associations and the mindspace their brand captures. Brands are encoded in memory on a cognitive and emotional basis. Traditionally, brand tracking has been done by surveys and feedback, resulting in a direct communication that covers the cognitive segment and misses the emotional segment. Respondents generally behave differently under observation and in solitude. In this paper, a new brand-tracking technique is proposed that involves capturing public data from social media that focuses more on the emotional aspects. For conceptualizing and testing this approach, we downloaded nearly one million tweets for three major brands--Nike, Adidas, and Reebok--posted by users. We proposed a methodology and calculated metrics (benefits and attributes) using this data for each brand. We noticed that generally emoticons are not used in sentiment mining. To incorporate them, we created a macro that automatically cleans the tweets and replaces emoticons with an equivalent text. We then built supervised and unsupervised models on those texts. The results show that using emoticons improves the efficiency of predicting the polarity of sentiments as the misclassification rate was reduced from 0.31 to 0.24. Using this methodology, we tracked the reactions that are triggered in the minds of customers when they think about a brand and thus analyzed their mind share.
Sharat Dwibhasi, Oklahoma State University
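A minimal sketch of the emoticon-to-text cleaning idea (the token names and the two emoticons handled here are illustrative; the authors' macro covers many more):

data tweets_clean;
   set tweets;                                   /* expects a character variable named text */
   length text_clean $400;
   text_clean = tranwrd(text, ':)', ' EMO_POSITIVE ');
   text_clean = tranwrd(text_clean, ':(', ' EMO_NEGATIVE ');
   text_clean = compbl(text_clean);              /* collapse the extra blanks */
run;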
Although fraud is limited to a small fraction of health care providers, its existence and magnitude in health insurance programs require the use of fraud prevention and detection procedures. Data mining methods are used to uncover odd billing patterns in large databases of health claims history. Efficient fraud discovery can involve the preliminary step of deploying automated outlier detection techniques in order to classify identified outliers as potential fraud before an in-depth investigation. An essential component of the outlier detection procedure is the identification of proper peer comparison groups to classify providers as within-the-norm or outliers. This study refines the concept of peer comparison group within the provider category and considers the possibility of distinct billing patterns associated with medical or surgical procedure codes identifiable by the Berenson-Eggers Type of Service (BETOS) system. The BETOS system covers all HCPCS codes (Healthcare Common Procedure Coding System); assigns a HCPCS code to only one BETOS code; consists of readily understood clinical categories; and consists of categories that permit objective assignment (Centers for Medicare and Medicaid Services, CMS). The study focuses on the specialty General Practice and involves two steps: first, the identification of clusters of similar BETOS-based billing patterns; and second, the assessment of the effectiveness of these peer comparison groups in identifying outliers. The working data set is a sample of summarized 2012 data on physicians active in government health care programs, made publicly available by CMS through its website. The analysis uses PROC FASTCLUS and the SAS® cubic clustering criterion approach to find the optimal number of clusters in the data. It also uses PROC ROBUSTREG to implement a multivariate adaptive threshold outlier detection method.
Paulo Macedo, Integrity Management Services
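A hedged sketch of the clustering step (the data set and billing-share variables are hypothetical; the number of clusters would come from the cubic clustering criterion comparison described above):

proc fastclus data=provider_betos maxclusters=4 out=betos_clusters;
   var betos_share1-betos_share10;   /* share of each provider's billing in each BETOS category */
run;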
In 1965, nearly half of all Americans 65 and older had no health insurance. Now, 50 years later, only 2% lack health insurance. The difference, of course, is Medicare. Medicare now covers 55 million people, about 17% of the US population, and is the single largest purchaser of personal health care. Despite this success, the rising costs of health care in general and Medicare in particular have become a growing concern. Medicare policies are important not only because they directly affect large numbers of beneficiaries, payers, and providers, but also because they affect private-sector policies as well. Analyses of Medicare policies and their consequences are complicated both by the effects of an aging population that has changing cost drivers (such as less smoking and more obesity) and by different Medicare payment models. For example, the average age of the Medicare population will initially decrease as the baby-boom generation reaches eligibility, but then increase as that generation grows older. Because younger beneficiaries have lower costs, these changes will affect cost trends and patterns that need to be interpreted within the larger context of demographic shifts. This presentation examines three Medicare payment models: fee-for-service (FFS), Medicare Advantage (MA), and Accountable Care Organizations (ACOs). FFS, originally based on payment methods used by Blue Cross and Blue Shield in the mid-1960s, pays providers for individual services (for example, physicians are paid based on the fees they charge). MA is a capitated payment model in which private plans receive a risk-adjusted rate. ACOs are groups of providers who are given financial incentives for reducing cost and maintaining quality of care for specified beneficiaries. Each model has strengths and weaknesses in specific markets. We examine each model, in addition to new data sources and more recent, innovative payment models that are likely to affect future trends.
Paul Gorrell, IMPAQ International
Code review is an important tool for ensuring the quality and maintainability of an organization's ETL processes. This paper introduces a method that analyzes the metadata to assign an alert score to each job. For the jobs of each flow, the metadata is collected, giving information about the number of transformations in a job, the amount of user-written code, the number of loops, the number of empty description fields, whether naming conventions are followed, and so on. This is aggregated into complexity indicators per job. Together with other information about the jobs (e.g. the CPU-time used from logging applications) this information can be made available in SAS® Visual Analytics, creating a dashboard that highlights areas for closer inspection.
Frank Poppe, PW Consulting
Joris Huijser, De Nederlandsche Bank N.V.
Microsoft Office products play an important role in most enterprises. SAS® is combined with Microsoft Office to assist in decision making in everyday life. Most SAS users have moved from SAS® 9.3, or 32-bit SAS, to SAS® 9.4 for the exciting new features. This paper describes a few things that do not work quite the same in SAS 9.4 as they did in the 32-bit version. It discusses the reasons for the differences and the new approaches that SAS 9.4 provides. The paper focuses on how SAS 9.4 works with Excel and Access. Furthermore, this presentation summarizes methods by LIBNAME engines and by import or export procedures. It also demonstrates how to interactively import or export PC files with SAS 9.4. The issue of incompatible format catalogs is addressed as well.
Hong Zhang, Mathematica Policy Research
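For example, where 32-bit SAS could read an .xlsx workbook through the 32-bit Microsoft data providers, 64-bit SAS 9.4 can sidestep the bitness mismatch with the XLSX engine (the file path and sheet name are illustrative; both approaches require SAS/ACCESS® Interface to PC Files):

/* Read one worksheet directly, with no dependency on 32-bit Office drivers */
proc import datafile="c:\data\enrollment.xlsx"
            out=work.enrollment
            dbms=xlsx
            replace;
   sheet="2016";
run;

/* Or treat the workbook as a library of worksheets */
libname wb xlsx "c:\data\enrollment.xlsx";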
Every day, businesses have to remain vigilant of fraudulent activity, which threatens customers, partners, employees, and financials. Normally, networks of people or groups perpetrate deviant activity. Finding these connections is now made easier for analysts with SAS® Visual Investigator, an upcoming SAS® solution that ultimately minimizes the loss of money and preserves mutual trust among its shareholders. SAS Visual Investigator takes advantage of the capabilities of the new SAS® In-Memory Server. Investigators can efficiently investigate suspicious cases across business lines, which has traditionally been difficult. However, the time required to collect, process and identify emerging fraud and compliance issues has been costly. Making proactive analysis accessible to analysts is now more important than ever. SAS Visual Investigator was designed with this goal in mind and a key component is the visual social network view. This paper discusses how the network analysis view of SAS Visual Investigator, with all its dynamic visual capabilities, can make the investigative process more informative and efficient.
Danielle Davis, SAS
Stephen Boyd, SAS Institute
Ray Ong, SAS Institute
When analyzing data with SAS®, we often encounter missing or null values in data. Missing values can arise from the availability, collectibility, or other issues with the data. They represent the imperfect nature of real data. Under most circumstances, we need to clean, filter, separate, impute, or investigate the missing values in data. These processes can take up a lot of time, and they are annoying. For these reasons, missing values are usually unwelcome and need to be avoided in data analysis. There are two sides to every coin, however. If we can think outside the box, we can take advantage of the negative features of missing values for positive uses. Sometimes, we can create and use missing values to achieve our particular goals in data manipulation and analysis. These approaches can make data analyses convenient and improve work efficiency for SAS programming. This kind of creative and critical thinking is the most valuable quality for data analysts. This paper exploits real-world examples to demonstrate the creative uses of missing values in data analysis and SAS programming, and discusses the advantages and disadvantages of these methods and approaches. The illustrated methods and advanced programming skills can be used in a wide variety of data analysis and business analytics fields.
Justin Jia, Trans Union Canada
Shan Shan Lin, CIBC
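One hedged example of putting missing values to work rather than merely cleaning them (the variable names and codes are illustrative): special missing values (.A through .Z) can encode why a value is absent while still behaving as missing in any analysis.

data survey_coded;
   set survey_raw;
   /* Distinguish 'refused' from 'not applicable' while both remain missing */
   if response_code = 98 then income = .R;        /* refused         */
   else if response_code = 99 then income = .N;   /* not applicable  */
run;

proc freq data=survey_coded;
   tables income / missing;    /* the MISSING option keeps the missing rows in the table */
run;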
Specialized access requirements to tap into event streams vary depending on the source of the events. Open-source approaches from Spark, Kafka, Storm, and others can connect event streams to big data lakes, like Hadoop and other common data management repositories. But a different approach is needed to ensure latency and throughput are not adversely affected when processing streaming data with different open-source frameworks; that is, these integrations need to scale. This talk distinguishes the advantages of adapters and connectors and shows how SAS® can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.
Evan Guarnaccia, SAS
Fiona McNeill, SAS
Steve Sparano, SAS
Wouldn't it be fantastic to develop and tune scenarios in SAS® Visual Scenario Designer and then smoothly incorporate them into your SAS® Anti-Money Laundering solution with just a few clicks of your mouse? Well, now there is a way. SAS Visual Scenario Designer is the first data-driven solution for interactive rule and scenario authoring, testing, and validation. It facilitates exploration, visualization, detection, rule writing, auditing, and parameter tuning to reduce false positives; and all of these tasks are performed using point and click. No SAS® coding skills required! Using the approach detailed in this paper, we demonstrate how you can seamlessly port these SAS Visual Scenario Designer scenarios into your SAS Anti-Money Laundering solution. Rewriting the SAS Visual Scenario Designer scenarios in Base SAS® is no longer required! Furthermore, the SAS Visual Scenario Designer scenarios are executed on the lightning-speed SAS® LASR™ Analytic Server, reducing the time of the SAS Anti-Money Laundering scenario nightly batch run. The results of both the traditional SAS Anti-Money Laundering alerts and SAS Visual Scenario Designer alerts are combined and available for display on the SAS® Enterprise Case Management interface. This paper describes the different ways that the data can be explored to detect anomalous patterns and the three mechanisms for translating these patterns into rules. It also documents how to create the scenarios in SAS Visual Scenario Designer; how to test and tune the scenarios and parameters; and how alerts are ported seamlessly into the SAS Anti-Money Laundering alert generation process and the SAS Enterprise Case Management system.
Renee Palmer, SAS
Yue Chai, SAS Institute
Across the languages of SAS® are many golden nuggets--functions, formats, and programming features just waiting to impress your friends and colleagues. While learning SAS for over 30 years, I have collected a few of these nuggets, and I offer a dozen more of them to you in this presentation. I presented the first dozen in a similar paper at SAS Global Forum 2015.
Peter Crawford, Crawford Software Consultancy limited
Today's financial institutions employ tens of thousands of employees globally in the execution of manually intensive processes, from transaction processing to fraud investigation. While heuristics-based solutions have widely penetrated the industry, these solutions often work in the realm of broad generalities, lacking the nuance and experience of a human decision-maker. In today's regulatory environment, that translates to operational inefficiency and overhead as employees labor to rationalize human decisions that run counter to computed predictions. This session explores options that financial services institutions have to augment and automate day-to-day decision-making, leading to improvements in consistency, accuracy, and business efficiency. The session focuses on financial services case studies, including anti-money laundering, fraud, and transaction processing, to demonstrate real-world examples of how organizations can make the transition from predictions to decisions.
At Royal Bank of Scotland, one of our key organizational design principles is to 'share everything we can share.' In essence, this promotes the cross-departmental sharing of platform services. Historically, this was never enforced on our Business Intelligence platforms like SAS®, resulting in a diverse technology estate, which presents challenges to our platform team for maintaining software currency, software versions, and overall quality of service. Currently, we have SAS® 8.2 and SAS® 9.1.3 on the mainframe, SAS® 9.2, SAS® 9.3, and SAS® 9.4 across our Windows and Linux servers, and SAS® 9.1.3 and SAS® 9.4 on PC across the bank. One of the benefits to running a multi-tenant SAS environment is removing the need to procure, install, and configure a new environment when a new department wants to use SAS. However, the process of configuring a secure multi-tenant environment, using the default tools and procedures, can still be very labor intensive. This paper explains how we analyzed the benefits of creating a shared Enterprise Business Intelligence platform in SAS alongside the risks and organizational barriers to the approach. Several considerations are presented as well as some insight into how we managed to convince our key stakeholders with the approach. We also look at the 'custom' processes and tools that RBS has implemented. Through this paper, we encourage other organizations to think about the various considerations we present to decide if sharing is right for their context to maximize the return on investment in SAS.
Dileep Pournami, RBS
Christopher Blake, RBS
Ekaitz Goienola, SAS
Sergey Iglov, RBS
The impact of missing data is well documented in the literature: data loss (Kim & Curry, 1977), the consequent loss of power (Raaijmakers, 1999), and biased estimates (Roth, Switzer, & Switzer, 1999). If left untreated, missing data is a threat to the validity of inferences made from research results. However, the application of a missing data treatment does not by itself prevent researchers from reaching spurious conclusions. In addition to considering the research situation at hand and the type of missing data present, another important factor that researchers should consider when selecting and implementing a missing data treatment is the structure of the data. In the context of educational research, multiple-group structured data is not uncommon. Assessment data gathered from distinct subgroups of students defined by relevant variables (e.g., socio-economic status and demographics) might not be independent. Thus, when comparing the test performance of subgroups of students in the presence of missing data, it is important to preserve any underlying group effect by applying separate multiple imputation within groups before any data analysis. Using attitudinal (Likert-type) data from the Civics Education Study (1999), this simulation study evaluates the performance of multiple-group imputation and total-group multiple imputation in terms of item parameter invariance within structural equation modeling using the SAS® procedure CALIS and item response theory using the SAS procedure MCMC.
Patricia Rodriguez de Gil, University of South Florida
Jeff Kromrey, University of South Florida
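A hedged sketch of the separate-group imputation step the study evaluates (the data set, grouping variable, and item names are hypothetical):

proc sort data=cived;
   by ses_group;
run;

/* Impute within each group so that group-specific structure is preserved */
proc mi data=cived nimpute=5 seed=20199 out=cived_mi;
   by ses_group;
   var item1-item10;
run;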
We asked for it, and once again, SAS delivers! Now production in SAS® 9.4M3, the ODS EXCEL statement provides users with an extremely easy way to export output to Microsoft Excel. Using existing ODS techniques that you may already be familiar with (like predefined styles, traffic-lighting, and custom formatting) in tandem with common Excel features (like formulas, frozen headers and footers, and page setup options), you can create a publication-ready document in a snap. As a matter of fact, the new ODS EXCEL statement is such a handy tool, you may find yourself using it for more than large-scope production programs. This statement is so straightforward that it may be a good choice for outputting quick ad hoc requests as well. Join me on my adventure as I explore how to use the ODS EXCEL statement to create institutional research reports. I offer tips and techniques you can use to make your output excel-lent, too.
Gina Huff, Western Kentucky University
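A minimal sketch of the statement in action (the file path and the data set enroll are illustrative; SHEET_NAME and EMBEDDED_TITLES are two of the destination's documented options):

ods excel file="c:\reports\ir_summary.xlsx"
    options(sheet_name="Enrollment" embedded_titles="yes");

title "Fall Enrollment Summary";
proc means data=enroll mean sum maxdec=1;
   class college;
   var headcount;
run;

ods excel close;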
Writing music with SAS® is a simple process. Including a snippet of music in a program is a great way to signal completion of processing. It's also fun! This paper illustrates a method for translating music into SAS code using the CALL SOUND routine.
Dan Bretheim, Towers Watson
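A tiny example of the idea (the frequencies approximate a short ascending arpeggio; the duration argument's units can vary by host, so treat the values as illustrative):

data _null_;
   /* play C-E-G-C to signal that the job finished */
   call sound(262, 200);   /* C4 */
   call sound(330, 200);   /* E4 */
   call sound(392, 200);   /* G4 */
   call sound(523, 400);   /* C5 */
run;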
A new ODS destination for creating Microsoft Excel workbooks is available starting in the third maintenance release of SAS® 9.4. This destination creates native Microsoft Excel XLSX files, supports graphic images, and offers other advantages over the older ExcelXP tagset. In this presentation you learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS® output. The techniques can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on-demand and in real time using SAS server technology is discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.
Vince DelGobbo, SAS Institute Inc.
Customers are under increasing pressure to determine how data, reports, and other assets are interconnected, and to determine the implications of making changes to these objects. Of special interest is performing lineage and impact analysis on data from disparate systems, for example, SAS®, IBM InfoSphere DataStage, Informatica PowerCenter, SAP BusinessObjects, and Cognos. SAS provides utilities to identify relationships between assets that exist within the SAS system, and beginning with the third maintenance release of SAS® 9.4, these utilities support third-party products. This paper provides step-by-step instructions, tips, and techniques for using these utilities to collect metadata from these disparate systems, and to perform lineage and impact analysis.
Liz McIntosh, SAS
You've heard all the talk about SAS® Visual Analytics--but maybe you are still confused about how the product would work in your SAS® environment. Many customers have the same points of confusion about what they need to do with their data, how to get data into the product, how SAS Visual Analytics would benefit them, and even whether they should consider Hadoop or the cloud. In this paper, we cover the questions we are asked most often about implementation, administration, and usage of SAS Visual Analytics.
Tricia Aanderud, Zencos Consulting LLC
Ryan Kumpfmiller, Zencos Consulting
Nick Welke, Zencos Consulting
In the consumer credit industry, privacy is key and the scrutiny increases every day. When returning files to a client, we must ensure that they are depersonalized so that the client cannot match back to any personally identifiable information (PII). This means that we must locate any values for a variable that occur on a limited number of records and null them out (i.e., replace them with missing values). When you are working with large files that have more than one million observations and thousands of variables, locating variables with few unique values is a difficult task. While the FREQ procedure and the DATA step merging can accomplish the task, using first./last. BY variable processing to locate the suspect values and hash objects to merge the data set back together might offer increased efficiency.
Renee Canfield, Experian
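A minimal sketch of the counting step for one variable (the names and the threshold are illustrative; the paper generalizes this across thousands of variables and pairs it with a hash-object merge):

proc sort data=big out=sorted;
   by zipcode;
run;

/* Count records per value with first./last. processing */
data rare_values;
   set sorted;
   by zipcode;
   if first.zipcode then n_recs = 0;
   n_recs + 1;
   if last.zipcode and n_recs < 5 then output;   /* values held by fewer than 5 records */
   keep zipcode n_recs;
run;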
The SAS® macro language is an efficient and effective way to handle repetitive processing tasks. One such task is conducting the same DATA steps or procedures for different time periods, such as every month, quarter, or year. SAS® dates are based on the number of days between January 1st, 1960, and the target date value, which, while very simple for a computer, is not a human-friendly representation of dates. SAS dates and macro processing are not simple concepts themselves, so mixing the two can be particularly challenging, and it can be very easy to end up working with bad dates! Understanding how macros and SAS dates work individually and together can greatly improve the efficiency of your coding and data processing tasks. This paper covers the basics of SAS macro processing, SAS dates, and the benefits and difficulties of using SAS dates in macros, to ensure that you never have a bad date again.
Kate Burnett-Isaacs, Statistics Canada
Andrew Clapson, MD Financial Management
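As a small illustration of mixing the two, the hypothetical macro below converts a YYYYMM parameter into first- and last-day SAS date values with %SYSFUNC and INTNX, then uses them in a WHERE clause; the macro name, data set, and variable names are assumptions made for this sketch.

   %macro monthly_report(yyyymm);
      %let first_day = %sysfunc(inputn(&yyyymm.01, yymmdd8.));
      %let last_day  = %sysfunc(intnx(month, &first_day, 0, end));

      title "Sales for %sysfunc(putn(&last_day, monyy7.))";
      proc print data=work.sales;
         where sale_date between &first_day and &last_day;
      run;
   %mend monthly_report;

   %monthly_report(201605)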
Many databases include default values that are set inappropriately. For example, a top-level Yes/No question might be followed by a group of check boxes, to which a respondent might indicate multiple answers. When creating the database, a programmer might choose to set a default value of 0 for each box that is not checked. One should interpret these default values with caution, however, depending on whether the top-level question is answered Yes or No or is missing. A similar scenario occurs with a Yes/No question where No is the default value if the question is not answered (but actually the value should be missing). These default values might be scattered throughout the database; there might be no pattern to their occurrence. Records without valid information should be omitted from statistical analysis. Programmers should understand the difference between missing values and invalid values (that is, incorrectly set defaults) because it is important to handle these records differently. Choosing the best method to omit records with invalid values can be difficult. Manual corrections are often prohibitively time-consuming. SAS® offers a useful approach: a combined DATA step and SQL procedure. This paper provides a step-by-step method to accomplish the task.
Lily Yu, Statistics Collaborative, Inc.
In today's fast-paced work environment, time management is crucial to the success of a project. Sending requests to SAS® programmers to run reports every time you need the most current data can stretch an already strained schedule. Why bother to contact the programmer? Why not build the execution of the SAS program into the report itself, so that when the report is launched, real-time data is retrieved and the report shows the most recent data? This paper demonstrates how the data in an existing SAS report can be refreshed automatically when the report is opened in Microsoft Word or Microsoft Excel. Simple Visual Basic for Applications (VBA) code is written in Word or Excel. When an existing SAS report is opened, this VBA code calls the SAS programs that create the report from within a Microsoft Office product and overwrites the existing report data with the most current data.
Ron Palanca, Mathematica Policy Research
An ad publisher like eBay has multiple ad sources to serve ads from. Specifically, for eBay's text ads, there are two ad sources, viz. eBay Commerce Network (ECN) and Google. The problem and solution we have formulated considers two ad sources. However, the solution can be extended for any number of ad sources. A study was done on performance of ECN and Google ad sources with respect to the revenue per mille (RPM, or revenue per thousand impressions) ad queries they bring while serving ads. It was found that ECN performs better for some ad queries and Google performs better for others. Thus, our problem is to optimally allocate ad traffic between the two ad sources to maximize RPM.
Huma Zaidi, eBay
Chitta Ranjan, GTU
As any airline traveler knows, connection time is a key element of the travel experience. A tight connection time can cause angst and concern, while a lengthy connection time can introduce boredom and a longer than desired travel time. The same elements apply when constructing schedules for airline pilots. Like passengers, pilot schedules are built with connections. Delta Air Lines operates a hub and spoke system that feeds both passengers and pilots from the spoke stations and connects them through the hub stations. Pilot connection times that are tight can result in operational disruptions, whereas extended pilot connection times are inefficient and unnecessarily costly. This paper demonstrates how Delta Air Lines used SAS® PROC REG and PROC LOGISTIC to analyze historical data in order to build operationally robust and financially responsible pilot connections.
Andy Hummel, Delta Air Lines
Today, companies are increasingly using analytics to discover new revenue-increasing and cost-saving opportunities. Many business professionals turn to SAS, a leader in business analytics software and service, to help them improve performance and make better decisions faster. Analytics are also being used in risk management, fraud detection, life sciences, sports, and many more emerging markets. To maximize their value to a business, analytics solutions need to be deployed quickly and cost-effectively, while also providing the ability to scale readily without degrading performance. Of course, in today's demanding environments, where budgets are shrinking and the number of mandates to reduce carbon footprints is growing, the solution must deliver excellent hardware utilization, power efficiency, and return on investment. To address some of these challenges, Red Hat and SAS have collaborated to recommend the best practices for configuring SAS® 9 running on Red Hat Enterprise Linux. The scope of this document includes Red Hat Enterprise Linux 6 and 7. Researched areas include the I/O subsystem, file system selection, and kernel tuning, in both bare-metal and kernel-based virtual machine (KVM) environments. In addition, we include grid configurations that run with the Red Hat Resilient Storage Add-On, which includes Global File System 2 (GFS2) clusters.
Barry Marson, Red Hat, Inc
In cooperation with the Joint Research Centre - European Commission (JRC), we have developed a number of innovative techniques to detect outliers on a large scale. In this session, we show the power of SAS/IML® Studio as an interactive tool for exploring and detecting outliers using customized algorithms that were built from scratch. The JRC uses this for detecting abnormal trade transactions on a large scale. The outliers are detected using the Forward Search, which starts from a central subset in the data and subsequently adds observations that are close to the current subset based on regression (R-student) or multivariate (Mahalanobis distance) output statistics. The implementation of this algorithm and its applications were done in SAS/IML Studio and converted to a macro for use in the IML procedure in Base SAS®.
Jos Polfliet, SAS
This paper highlights many of the major capabilities of the DATASETS procedure. It discusses how it can be used as a tool to update variable information in a SAS® data set; to provide information on data set and catalog contents; to delete data sets, catalogs, and indexes; to repair damaged SAS data sets; to rename files; to create and manage audit trails; to add, delete, and modify passwords; to add and delete integrity constraints; and more. The paper contains examples of the various uses of PROC DATASETS that programmers can cut and paste into their own programs as a starting point. After reading this paper, a SAS programmer will have practical knowledge of the many different facets of this important SAS procedure.
Michael Raithel, Westat
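A short sketch of several of the tasks listed above, run against a hypothetical WORK library; the data set, variable, and member names are illustrative and not taken from the paper.

   proc datasets library=work nolist;
      contents data=claims;                        /* data set and variable information */
      modify claims;
         label  claim_amt = "Claim Amount (USD)";  /* update variable metadata          */
         format claim_dt date9.;
         rename old_id = member_id;
      change claims = claims_2016;                 /* rename the data set               */
      delete scratch1 scratch2;                    /* remove temporary data sets        */
   quit;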
The DATA step has served SAS® programmers well over the years. Now there is a new, exciting, and powerful alternative: the DS2 procedure. By introducing an object-oriented programming environment, PROC DS2 enables users to effectively manipulate complex data and efficiently manage programming through additional data types, programming structure elements, user-defined methods, and shareable packages, as well as threaded execution. This hands-on workshop was developed based on our experiences with getting started with PROC DS2 and learning to use it to access, manage, and share data in a scalable and standards-based way. It will help SAS users of all levels easily get started with PROC DS2 and understand its basic functionality by practicing with its features.
Peter Eberhardt, Fernwood Consulting Group Inc.
Xue Yao, Winnipeg Regional Health Authority
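A minimal DS2 data program of the kind practiced in the workshop; the WORK.ORDERS table and the discount rule are assumptions made purely for illustration.

   proc ds2;
      data work.orders_scored / overwrite=yes;
         dcl double discount;

         method run();
            set work.orders;
            if amount > 500 then discount = amount * 0.10;   /* illustrative business rule */
            else discount = 0;
         end;
      enddata;
      run;
   quit;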
In our book SAS for Mixed Models, we state that 'the majority of modeling problems are really design problems.' Graduate students and even relatively experienced statistical consultants can find translating a study design into a useful model to be a challenge. Generalized linear mixed models (GLMMs) exacerbate the challenge, because they accommodate complex designs, complex error structures, and non-Gaussian data. This talk covers strategies that have proven successful in design of experiment courses and in consulting sessions. In addition, GLMM methods can be extremely useful in planning experiments. This talk discusses methods to implement precision and power analysis to help choose between competing plausible designs and to assess the adequacy of proposed designs.
Walter Stroup, University of Nebraska, Lincoln
In recent years, big data has been in the limelight as a solution for business issues. Implementation of big data mining has begun in a variety of industries. The variety of data types and the velocity of increasing data have been astonishing, whether structured data stored in a relational database or unstructured data (for example, text data, GPS data, image data, and so on). In the pharmaceutical industry, big data means real-world data such as Electronic Health Records, genomics data, medical imaging data, social network data, and so on. Handling these types of big data often requires special environment infrastructure for statistical computing. Our presentation covers case study 1: IMSTAT implementation as a large-scale parallel computation environment; conversion from a business issue to a data science issue in pharma; case study 2: data handling and machine learning for vertical and horizontal big data by using PROC IMSTAT; the importance of integrating analysis results; and points of caution in big data mining.
Yoshitake Kitanishi, Shionogi & Co., Ltd.
Ryo Kiguchi, Shionogi & Co., Ltd.
Akio Tsuji, Shionogi & Co., Ltd.
Hideaki Watanabe, Shionogi & Co., Ltd.
Inspired by Christianna Williams's paper on transitioning to PROC SQL from the DATA step, this paper aims to help SQL programmers transition to SAS® by using PROC SQL. SAS adapted the Structured Query Language (SQL) by means of PROC SQL back in SAS® 6. PROC SQL syntax closely resembles SQL. However, there are some SQL features that are not available in SAS. Throughout this paper, we outline common SQL tasks and how they might differ in PROC SQL. We also introduce useful SAS features that are not available in SQL. Topics covered are appropriate for novice SAS users.
Barbara Ross, NA
Jessica Bennett, Snap Finance
Often SAS® users within business units find themselves running wholly within a single production environment, where non-production environments are reserved for the technology support teams to roll out patches, maintenance releases, and so on. So what is available for business units to set up and manage their intra-platform business systems development life cycle (SDLC)? Another scenario is when a new platform is provisioned, but the business units are responsible for promoting their own content from the old to the new platform. How can this be done without platform administrator roles? This paper examines some typical issues facing business users trying to manage their SAS content, and it explores the capabilities of various SAS tools available to promote and manage metadata, mid-tier content, and SAS® Enterprise Guide® projects.
Andrew Howell, ANJ Solutions P/L
Today, there are 28 million small businesses, which account for 54% of all sales in the United States. The challenge is that small businesses struggle every day to accurately forecast future sales. These forecasts not only drive investment decisions in the business, but also are used in setting daily par, determining labor hours, and scheduling operating hours. In general, owners use their gut instinct. Using SAS® provides the opportunity to develop accurate and robust models that can unlock costs for small business owners in a short amount of time. This research examines over 5,000 records from the first year of daily sales data for a start-up small business, while comparing the four basic forecasting models within SAS® Enterprise Guide®. The objective of this model comparison is to demonstrate how quick and easy it is to forecast small business sales using SAS Enterprise Guide. What does that mean for small businesses? More profit. SAS provides cost-effective models for small businesses to better forecast sales, resulting in better business decisions.
Cameron Jagoe, The University of Alabama
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
Pedal-to-the-metal analytics is the notion that analytics can be quickly and easily achieved using web technologies. SAS® web technologies seamlessly integrate with each other through a web browser and with data via web APIs, enabling organizations to leapfrog traditional, manual analytic and data processes. Because of this integration (and the operational efficiencies obtained as a result), pedal-to-the-metal analytics dramatically accelerates the analytics lifecycle, which consists of these steps: 1) Data Preparation; 2) Exploration; 3) Modeling; 4) Scoring; and 5) Evaluating results. In this paper, data preparation is accomplished with SAS® Studio custom tasks (reusable drag-and-drop visual components or interfaces for underlying SAS code). This paper shows users how to create and implement these for public health surveillance. With data preparation complete, explorations of the data can be performed using SAS® Visual Analytics. Explorations provide insights for creating, testing, and comparing models in SAS® Visual Statistics to predict or estimate risk. The model score code produced by SAS Visual Statistics can then be deployed from within SAS Visual Analytics for scoring. Furthermore, SAS Visual Analytics provides the necessary dashboard and reporting capabilities to evaluate modeling results. In conclusion, the approach presented in this paper provides both new and long-time SAS users with easy-to-follow guidance and a repeatable workflow to maximize the return on their SAS investments while gaining actionable insights on their data. So, fasten your seat belts and get ready for the ride!
Manuel Figallo, SAS
Today many analysts in information technology are facing challenges working with large amounts of data. Most analysts are smart enough to write queries using correct table join conditions, text mining, index keys, and hash objects for quick data retrieval. The SAS® system is now able to work with Hadoop data storage to provide efficient processing power to SAS analytics. In this paper we demonstrate differences in data retrieval between the SQL procedure, the DATA step, and the Hadoop environment. In order to test data retrieval time, we used code that randomly generates ten million observations with character and numeric variables using the RANUNI function. A loop of the form DO LOOP=1 TO 1e7 generates ten million records; the code can generate any number of records by changing the exponent. We use the most commonly used functions and procedures to retrieve records on a test data set and we illustrate real- and CPU-time processing. All PROC SQL and Hadoop queries are run in SAS® Enterprise Guide® 6.1 to record processing time. This paper includes an overview of Hadoop data architecture and describes how to connect to a Hadoop environment through SAS. It provides sample queries for data retrieval, explains differences between using Hadoop pass-through and the Hadoop LIBNAME engine, and covers the big challenges for real- and CPU-time comparison using the most commonly used SAS functions and procedures.
Anjan Matlapudi, AmeriHealth Caritas Family of Companies
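A sketch of the kind of test-data generator the paper describes (1e7 iterations yield ten million records); the variable names are illustrative.

   data work.testdata;
      length char_var $8;
      do loop = 1 to 1e7;                              /* 1e7 = ten million records */
         num_var  = ranuni(12345);                     /* uniform random numeric    */
         char_var = byte(65 + int(26*ranuni(12345)));  /* random letter A-Z         */
         output;
      end;
      drop loop;
   run;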
The North Carolina Community College System office can quickly and easily enable colleges to compare their programs' success to that of programs at other colleges. Institutional researchers can now spend their days examining trends, abnormalities, and other colleges, rather than digging for data to load into a Microsoft Excel spreadsheet. We look at performance measures and how programs are being graded using SAS® Visual Analytics.
Bill Schneider, North Carolina Community College System
SAS® provides in-database processing technology in the SQL procedure, which allows the SQL explicit pass-through method to push some or all of the work to a database management system (DBMS). This paper focuses on using the SAS SQL explicit pass-through method to transform Teradata table columns into rows. There are two common approaches for transforming table columns into rows. The first approach is to create narrow tables, one for each column that requires transposition, and then use UNION or UNION ALL to append all the tables together. This approach is straightforward but can be quite cumbersome, especially when there is a large number of columns that need to be transposed. The second approach is using the Teradata TD_UNPIVOT function, which makes the wide-to-long table transposition an easy job. However, TD_UNPIVOT allows you to transpose only columns with the same data type from wide to long. This paper presents a SAS macro solution to the wide-to-long table transposition involving different column data types. Several examples are provided to illustrate the usage of the macro solution. This paper complements the author's SAS paper Performing Efficient Transposes on Large Teradata Tables Using SQL Explicit Pass-Through, in which the long-to-wide table transposition is discussed. SAS programmers who are working with data stored in an external DBMS and would like to efficiently transpose their data will benefit from this paper.
Tao Cheng, Accenture
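For orientation, here is a skeleton of the UNION ALL variant of the wide-to-long transposition pushed to Teradata with explicit pass-through (the simple case in which the transposed columns share a data type); the connection options, macro variables, and object names are illustrative assumptions, not the paper's macro.

   proc sql;
      connect to teradata (user=&td_user password=&td_pwd server=tdprod);
      create table work.long_table as
      select * from connection to teradata
         ( select id, 'col1' as var_name, col1 as var_value from mydb.wide_table
           union all
           select id, 'col2', col2 from mydb.wide_table
           union all
           select id, 'col3', col3 from mydb.wide_table );
      disconnect from teradata;
   quit;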
SAS® software provides many DATA step functions that search and extract patterns from a character string, such as SUBSTR, SCAN, INDEX, TRANWRD, etc. Using these functions to perform pattern matching often requires you to use many function calls to match a character position. However, using the Perl regular expression (PRX) functions or routines in the DATA step improves pattern-matching tasks by reducing the number of function calls and making the program easier to maintain. This talk, in addition to discussing the syntax of Perl regular expressions, demonstrates many real-world applications.
Arthur Li, City of Hope
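A small example of the kind of pattern matching the talk covers: pulling a US-style phone number from a free-text field with a single PRX match. The input data set and variable names are assumptions made for this sketch.

   data phones;
      set work.contacts;
      if _n_ = 1 then prx = prxparse('/\(?\d{3}\)?[- ]?\d{3}-\d{4}/');
      retain prx;

      if prxmatch(prx, note_text) then do;
         call prxsubstr(prx, note_text, start, len);   /* position and length of the match */
         phone = substrn(note_text, start, len);
      end;
      drop prx start len;
   run;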
Financial institutions are working hard to optimize credit and pricing strategies at adjudication and for ongoing account and customer management. For cards and other personal lending products, there is intense competitive pressure, together with relentless revenue challenges, that creates a huge requirement for sophisticated credit and pricing optimization tools. Numerous credit and pricing optimization applications are available on the market to satisfy these needs. We present a relatively new approach that relies heavily on the effect modeling (uplift or net lift) technique for continuous target metrics--revenue, cost, losses, and profit. Examples of effect modeling to optimize the impact of marketing campaigns are known. We discuss five essential steps on the credit and pricing optimization path: (1) setting up critical credit and pricing champion/challenger tests, (2) performance measurement of specific test campaigns, (3) effect modeling, (4) defining the best effect model, and (5) moving from the effect model to the optimal solution. These steps require specific applications that are not easily available in SAS®. Therefore, necessary tools have been developed in SAS/STAT® software. We go through numerous examples to illustrate credit and pricing optimization solutions.
Yuri Medvedev, Bank of Montreal
Optimization models require continuous constraints to converge. However, some real-life problems are better described by models that incorporate discontinuous constraints. A common type of such discontinuous constraints becomes apparent when a regulation-mandated diversification requirement is implemented in an investment portfolio model. Generally stated, the requirement postulates that the aggregate of investments with individual weights exceeding certain threshold in the portfolio should not exceed some predefined total within the portfolio. This format of the diversification requirement can be defined by the rules of any specific portfolio construction methodology and is commonly imposed by the regulators. The paper discusses the impact of this type of discontinuous portfolio diversification constraint on the portfolio optimization model solution process, and develops a convergent approach. The latter includes a sequence of definite series of convergent non-linear optimization problems and is presented in the framework of the OPTMODEL procedure modeling environment. The approach discussed has been used in constructing investable equity indexes.
Taras Zlupko, University of Chicago
Robert Spatz, University of Chicago
Fast detection of forest fires is a great concern among environmental experts and national park managers because forest fires create economic and ecological damage and endanger human lives. For effective fire control and resource preparation, it is necessary to predict fire occurrences in advance and estimate the possible losses caused by fires. For this purpose, using real-time sensor data of weather conditions and fire occurrence is highly recommended in order to support the predicting mechanism. The objective of this presentation is to use SAS® 9.4 and SAS® Enterprise Miner™ 14.1 to predict the probability of fires and to identify special weather conditions resulting in incremental burned areas in Montesinho Park forest (Portugal). The data set was obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine and contains 517 observations and 13 variables from January 2000 to December 2003. Support Vector Machine analyses with variable selection were performed on this data set for fire occurrence prediction with a validation accuracy of approximately 60%. The study also incorporates the Incremental Response technique and hypothesis testing to estimate the increased probability of fire, as well as the extra burned area under various conditions. For example, when there is no rain, a 27% higher chance of fires and 4.8 hectares of extra burned area are recorded, compared to when there is rain.
Quyen Nguyen, Oklahoma State University
Due to advances in medical care and rising living standards, life expectancy in the US has increased to an average of 79 years. This has resulted in an aging population and increased demand for the development of technologies that help elderly people live independently and safely. This challenge can be met through ambient-assisted living (AAL) technologies that help elderly people live more independently. Much research has been done on human activity recognition (HAR) in the last decade. This research work can be used in the development of assistive technologies, and HAR is expected to be the future technology for e-health systems. In this research, I discuss the need to predict human activity accurately by building various models in SAS® Enterprise Miner™ 14.1 on a free public data set that contains 165,633 observations and 19 attributes. Variables used in this research represent the metrics of accelerometers mounted on the waist, left thigh, right arm, and right ankle of four individuals performing five different activities, recorded over a period of eight hours. The target variable predicts human activity such as sitting, sitting down, standing, standing up, and walking. Upon comparing different models, the winner is an auto neural model whose input is taken from stepwise logistic regression, which is used for variable selection. The model has an accuracy of 98.73% and sensitivity of 98.42%.
Venkata Vennelakanti, Oklahoma State University
There is a growing recognition that mental health is a vital public health and development issue worldwide. Numerous studies have reported close interactions between poverty and mental illness. The association between mental illness and poverty is cyclic and negative. Impoverished people are generally more prone to mental illness and less able to afford treatment. Likewise, people with mental health problems are more likely to be in poverty. The availability of free mental health treatment to people in poverty is critical to breaking this vicious cycle. Based on this hypothesis, a model was developed using the responses provided by mental health facilities to a federally supported survey. We examined whether we can predict which mental health facilities will offer free treatment to patients who cannot afford treatment costs in the United States. About a third of the 9,076 mental health facilities that responded to the survey stated that they offer free treatment to patients incapable of paying. Different machine learning algorithms and regression models were assessed to predict correctly which facilities would offer treatment at no cost. Using a neural network model in conjunction with a decision tree for input variable selection, we found that the best performing model can predict the mental health facilities with an overall accuracy of 71.49%. Sensitivity and specificity of the selected neural network model were 82.96% and 56.57%, respectively. The top five most important covariates that explain the model's predictive power are: ownership of mental health facilities, whether facilities provide a sliding fee scale, the type of facility, whether facilities provide mental health treatment services in Spanish, and whether facilities offer smoking cessation services. More detailed spatial and descriptive analysis of those key variables will be conducted.
Ram Poudel, Oklahoma State University
Lynn Xiang, Oklahoma State University
Years ago, doctors advised women with autoimmune diseases such as systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) not to become pregnant out of concern for maternal health. Now, it is known that healthy pregnancy is possible for women with lupus, but at the expense of a higher pregnancy complication rate. The main objective of this research is to identify key factors contributing to these diseases and to predict the occurrence rate of SLE and RA in pregnant women. Based on the approach used in this study, the prediction of adverse pregnancy outcomes for women with SLE, RA, and other diseases such as diabetes mellitus (DM) and antiphospholipid antibody syndrome (APS) can be carried out. These results will help pregnant women undergo a healthy pregnancy by receiving proper medication at an earlier stage. The data set was obtained from the Cerner Health Facts data warehouse. The raw data set contains 883,473 records and 85 variables such as diagnosis code, age, race, procedure code, admission date, discharge date, total charges, and so on. Analyses were carried out with two different data sets--one for SLE patients and the other for RA patients. The final data sets had 398,742 and 397,898 records each for modeling SLE and RA patients, respectively. To provide an honest assessment of the models, the data was split into training and validation using the data partition node. Variable selection techniques such as LASSO, LARS, stepwise regression, and forward regression were used. Using a decision tree, prominent factors that determine the SLE and RA occurrence rate were identified separately. Of all the predictive models run using SAS® Enterprise Miner™ 12.3, the model comparison node identified the decision tree (Gini) as the best model, with the lowest misclassification rate of 0.308 to predict SLE patients and 0.288 to predict RA patients.
Ravikrishna Vijaya Kumar, Oklahoma State University
Shrie Raam Sathyanarayanan, Oklahoma State University - Center for Health Sciences
In recent years, many companies are trying to understand the rare events that are very critical in the current business environment. But a data set with rare events is always imbalanced and the models developed using this data set cannot predict the rare events precisely. Therefore, to overcome this issue, a data set needs to be sampled using specialized sampling techniques like over-sampling, under-sampling, or the synthetic minority over-sampling technique (SMOTE). The over-sampling technique deals with randomly duplicating minority class observations, but this technique might bias the results. The under-sampling technique deals with randomly deleting majority class observations, but this technique might lose information. SMOTE sampling deals with creating new synthetic minority observations instead of duplicating minority class observations or deleting the majority class observations. Therefore, this technique can overcome the problems, like biased results and lost information, found in other sampling techniques. In our research, we used an imbalanced data set containing results from a thyroid test with 3,163 observations, out of which only 4.7 percent of the observations had positive test results. Using SAS® procedures like PROC SURVEYSELECT and PROC MODECLUS, we created over-sampled, under-sampled, and SMOTE sampled data sets in SAS® Enterprise Guide®. Then we built decision tree, gradient boosting, and rule induction models using four different data sets (non-sampled, majority under-sampled, minority over-sampled with majority under-sampled, and minority SMOTE sampled with majority under-sampled) in SAS® Enterprise Miner™. Finally, based on the receiver operating characteristic (ROC) index, Kolmogorov-Smirnov statistics, and the misclassification rate, we found that the models built using minority SMOTE sampled with the majority under-sampled data yield better output for this data set.
Rhupesh Damodaran Ganesh Kumar, Oklahoma State University (SAS and OSU data mining Certificate)
Kiren Raj Mohan Jagan Mohan, Zions Bancorporation
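A sketch of the simpler resampling steps using PROC SURVEYSELECT (the SMOTE step built on PROC MODECLUS is more involved and is not shown); the data set, variable, and sample-size values are illustrative assumptions.

   /* Under-sample the majority class (negative test results) */
   proc surveyselect data=thyroid(where=(result='negative'))
                     out=neg_under method=srs sampsize=500 seed=20160501;
   run;

   /* Over-sample the minority class by drawing with replacement */
   proc surveyselect data=thyroid(where=(result='positive'))
                     out=pos_over method=urs outhits sampsize=500 seed=20160501;
   run;

   data work.balanced;      /* recombine into a balanced modeling data set */
      set neg_under pos_over;
   run;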
Many inquisitive minds are filled with excitement and anticipation of a response every time one posts a question on a forum. This paper explores the factors that impact the time to first response for questions posted in the SAS® Community forum. The factors are contributors' availability, the nature of the topic, and the number of contributors knowledgeable about that particular topic. The results from this project help SAS® users receive an estimated response time, and the SAS Community forum can use this information to answer several business questions such as the following: What time of the year is likely to have an overflow of questions? Do specific topics receive delayed responses? On which days of the week is the community most active? To answer such questions, we built a web crawler using Python and Selenium to fetch data from the SAS Community forum, one of the largest analytics groups. We scraped over 13,443 queries and solutions from January 2014 to the present. We also captured several query-related attributes such as the number of replies, likes, views, bookmarks, and the number of people conversing on the query. Using different tools, we analyzed this data set after clustering the queries into 22 subtopics and found interesting patterns that can help the SAS Community forum in several ways, as presented in this paper.
Praveen Kumar Kotekal, Oklahoma State University
In early 2006, the United States experienced a housing bubble that affected over half of the American states. It was one of the leading causes of the 2007-2008 financial recession. Primarily, the overvaluation of housing units resulted in foreclosures and prolonged unemployment during and after the recession period. The main objective of this study is to predict the current market value of a housing unit with respect to fair market rent, census region, metropolitan statistical area, area median income, household income, poverty income, number of units in the building, number of bedrooms in the unit, utility costs, other costs of the unit, and so on, to determine which factors affect the market value of the housing unit. For the purpose of this study, data was collected from the Housing Affordability Data System of the US Department of Housing and Urban Development. The data set contains 20 variables and 36,675 observations. To select the best possible input variables, several variable selection techniques were used. For example, LARS (least angle regression), LASSO (least absolute shrinkage and selection operator), adaptive LASSO, variable selection, variable clustering, stepwise regression, principal component analysis (PCA) with only numeric variables, and PCA with all variables were all tested. After selecting input variables, numerous modeling techniques were applied to predict the current market value of a housing unit. An in-depth analysis of the findings revealed that the current market value of a housing unit is significantly affected by the fair market value, insurance and other costs, structure type, household income, and more. Furthermore, a higher household income and median income of an area are associated with a higher market value of a housing unit.
Mostakim Tanjil, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
The Oklahoma State Department of Health (OSDH) conducts home visiting programs with families that need parental support. Domestic violence is one of the many screenings performed on these visits. The home visiting personnel are trained to do initial screenings; however, they do not have the extensive information required to treat or serve the participants in this arena. Understanding how demographics such as age, level of education, and household income among others, are related to domestic violence might help home visiting personnel better serve their clients by modifying their questions based on these demographics. The objective of this study is to better understand the demographic characteristics of those in the home visiting programs who are identified with domestic violence. We also developed predictive models such as logistic regression and decision trees based on understanding the influence of demographics on domestic violence. The study population consists of all the women who participated in the Children First Program of the OSDH from 2012 to 2014. The data set contains 1,750 observations collected during screening by the home visiting personnel over the two-year period. In addition, they must have completed the Demographic form as well as the Relationship Assessment form at the time of intake. Univariate and multivariate analysis has been performed to discover the influence that age, education, and household income have on domestic violence. From the initial analysis, we can see that women who are younger than 25 years old, who haven't completed high school, and who are somewhat dependent on their husbands or partners for money are most vulnerable. We have even segmented the clients based on the likelihood of domestic violence.
Soumil Mukherjee, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Miriam McGaugh, Oklahoma State Department of Health
This session discusses challenges and considerations typically faced in credit risk scoring as well as options and practical ways to address them. No two problems are ever the same even if the general approach is clear. Every model has its own unique characteristics and creative ways to address them. Successful credit scoring modeling projects are always based on a combination of both advanced analytical techniques and data, and a deep understanding of the business and how the model will be applied. Different aspects of the process are discussed, including feature selection, reject inferencing, sample selection and validation, and model design questions and considerations.
Regina Malina, Equifax
SAS® 9.4 allows for extensive customization of configuration settings. These settings are changed as new products are added into a deployment and upgrades to existing products are deployed into the SAS® infrastructure. The ability to track which configuration settings change during the addition of a specific product or the installation of a particular platform maintenance release can be very useful. Often, customers run a SAS deployment step and wonder what exactly changed and in which files. The use of version control systems is becoming increasingly popular for tracking configuration settings. This paper demonstrates how to use Git, a hugely popular, open-source version control system to manage, track, audit, and revert changes to your SAS configuration infrastructure. Using Git, you can quickly list which files were changed by the addition of a product or maintenance package and inspect the differences. You can then revert to the previous settings if that becomes desirable.
Alec Fernandez, SAS
Automation of everyday activities holds the promise of consistency, accuracy, and relevancy. When applied to business operations, the additional benefits of governance, adaptability, and risk avoidance are realized. Prescriptive analytics empowers both systems and front-line workers to take the desired company action each and every time. And with data streaming from transactional systems, from the Internet of Things (IoT), and any other source, doing the right thing with exceptional processing speed produces the responsiveness that customers depend on. This talk describes how SAS® and Teradata are enabling prescriptive analytics in current business environments and in the emerging IoT.
Tho Nguyen, Teradata
Fiona McNeill, SAS
In customer relationship management (CRM) or consumer finance it is important to predict the time of repeated events. These repeated events might be a purchase, service visit, or late payment. Specifically, the goal is to find the probability density for the time to first event, the probability density for the time to second event, and so on. Two approaches are presented and contrasted. One approach uses discrete time hazard modeling (DTHM). The second, a distinctly different approach, uses a multinomial logistic model (MLM). Both DTHM and MLM satisfy an important consistency criterion, which is to provide probability densities whose average across all customers equals the empirical population average. DTHM requires fewer models if the number of repeated events is small. MLM requires fewer models if the time horizon is small. SAS® code is provided through a simulation in order to illustrate the two approaches.
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
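As a minimal illustration of the DTHM side, the sketch below fits a discrete-time hazard with PROC LOGISTIC on a person-period data set (one record per customer per period until the event or censoring); the data set and variable names are assumptions, not the paper's simulation.

   proc logistic data=work.person_period;
      class period / param=ref;
      model event(event='1') = period tenure balance;   /* hazard for each period */
      output out=work.hazards predicted=hazard;         /* per-record hazard estimates */
   run;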
In a data warehousing system, change data capture (CDC) plays an important part, not just in making the data warehouse (DWH) aware of a change but also in providing a means of flowing the change to the DWH marts and reporting tables so that we see the current and latest version of the truth. Together with slowly changing dimensions (SCD), this creates a cycle that runs the DWH and provides valuable insights into history and for future decision-making. What if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth. If these two processes can be combined into a single process, where one transform does both jobs of identifying the change and applying it to the DWH, then we can save significant processing time and valuable system resources. Hence, I came up with a hybrid SCD-with-CDC approach. My paper focuses on sources that DO NOT have CDC and on performing SCD Type 2 on such records without worrying about data duplication or increased processing times.
Vishant Bhat, University of Newcastle
Tony Blanch, SAS Consultant
Horizontal data sorting is a very useful SAS® technique in advanced data analysis when you are using SAS programming. Two years ago (SAS® Global Forum Paper 376-2013), we presented and illustrated various methods and approaches to perform horizontal data sorting, and we demonstrated its valuable application in strategic data reporting. However, this technique can also be used as a creative analytic method in advanced business analytics. This paper presents and discusses its innovative and insightful applications in product purchase sequence analyses such as product opening sequence analysis, product affinity analysis, next best offer analysis, time-span analysis, and so on. Compared to other analytic approaches, the horizontal data sorting technique has the distinct advantages of being straightforward, simple, and convenient to use. This technique also produces easy-to-interpret analytic results. Therefore, the technique can have a wide variety of applications in customer data analysis and business analytics fields.
Justin Jia, Trans Union Canada
Shan Shan Lin, CIBC
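A minimal example of horizontal (within-row) sorting using the CALL SORTN routine, ordering a customer's product-opening dates across variables in one step; the data set and variable names are illustrative assumptions.

   data work.open_sequence;
      set work.cust_products;
      array open_dt[5] open_dt1-open_dt5;   /* one opening date per product slot */
      call sortn(of open_dt[*]);            /* earliest date moves to open_dt1   */
   run;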
DonorsChoose.org is a nonprofit organization that allows individuals to donate directly to public school classroom projects. Teachers from public schools post a request for funding a project with a short essay describing it. Donors all around the world can look at these projects when they log in to DonorsChoose.org and donate to projects of their choice. The idea is to have a personalized recommendation webpage for all donors, which shows them the projects they prefer, like, and are most likely to donate to. Implementing a recommender system for the DonorsChoose.org website will improve user experience and help more projects meet their funding goals. It will also help us understand donors' preferences and deliver to them what they want or value. One type of recommendation system can be designed by predicting projects that will be less likely to meet funding goals, segmenting and profiling the donors, and using that information to recommend the right projects when donors log in to DonorsChoose.org.
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Sandeep Chittoor, Student
Vignesh Dhanabal, Oklahoma State University
Sharat Dwibhasi, Oklahoma State University
A bubble map is a useful tool for identifying trends and visualizing the geographic proximity and intensity of events. This session describes how to use readily available map data sets in SAS® along with PROC GEOCODE and PROC GMAP to turn a data set of addresses and events into a bubble map of the United States with scaled bubbles depicting the location and intensity of events.
Caroline Walker, Warren Rogers Associates
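A compact sketch of the two pieces described above; it assumes an input data set with a ZIP variable for geocoding and, separately, a state-level summary carrying the numeric state code used by MAPS.US. All names are illustrative.

   /* Add X/Y coordinates from ZIP codes (default lookup is SASHELP.ZIPCODE) */
   proc geocode method=zip data=work.events out=work.events_geo;
   run;

   /* Scaled bubbles by state on a US map */
   proc gmap data=work.state_counts map=maps.us all;
      id state;                      /* numeric state code matching MAPS.US */
      bubble events / levels=6;      /* bubble size reflects event count    */
   run;
   quit;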
Regression is used to examine the relationship between one or more explanatory (independent) variables and an outcome (dependent) variable. Ordinary least squares regression models the effect of explanatory variables on the average value of the outcome. Sometimes, we are more interested in modeling the median value or some other quantile (for example, the 10th or 90th percentile). Or, we might wonder if a relationship between explanatory variables and outcome variable is the same for the entire range of the outcome: Perhaps the relationship at small values of the outcome is different from the relationship at large values of the outcome. This presentation illustrates quantile regression, comparing it to ordinary least squares regression. Topics covered will include: a graphical explanation of quantile regression, its assumptions and advantages, using the SAS® QUANTREG procedure, and interpretation of the procedure's output.
Ruth Croxford, Institute for Clinical Evaluative Sciences
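A minimal PROC QUANTREG call of the kind the presentation covers, fitting the 10th, 50th, and 90th percentiles of the outcome where ordinary least squares would model only the mean; SASHELP.CLASS is used purely for illustration.

   proc quantreg data=sashelp.class ci=sparsity;
      model weight = height / quantile=0.1 0.5 0.9;   /* three conditional quantiles */
   run;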
Representational State Transfer (REST) is being used across the industry for designing networked applications to provide lightweight and powerful alternatives to web services such as SOAP and Web Services Description Language (WSDL). Since REST is based entirely on HTTP, SAS® provides everything you need to make REST calls and to process structured and unstructured data alike. This paper takes a look at how some enhancements in the third maintenance release of SAS® 9.4 can benefit you in this area. Learn how the HTTP procedure and other SAS language features provide everything you need to simply and securely use REST.
Joseph Henry, SAS
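A minimal REST call with PROC HTTP; the endpoint URL is an illustrative placeholder, not a real service.

   filename resp temp;

   proc http
      url="https://api.example.com/v1/studies?status=open"
      method="GET"
      out=resp;
      headers "Accept"="application/json";
   run;

   data _null_;            /* echo the raw response to the log */
      infile resp;
      input;
      put _infile_;
   run;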
One of the most important factors driving the success of requirements-gathering can be easily overlooked. Your user community needs to have a clear understanding of what is possible: from different ways to represent a hierarchy to how visualizations can drive an analysis to newer, but less common, visualizations that are quickly becoming standard. Discussions about desktop access versus mobile deployment and/or which users might need more advanced statistical reporting can lead to a serious case of option overload. One of the best cures for option overload is to provide your user community with access to template reports they can explore themselves. In this paper, we describe how you can take a single rich data set and build a set of template reports that demonstrate the full functionality of SAS® Visual Analytics, a suite of the most common, most useful SAS Visual Analytics report structures, from high-level dashboards to statistically deep dynamic visualizations. We show exactly how to build a dozen template reports from a single data source, simultaneously representing options for color schemes, themes, and other choices to consider. Although this template suite approach can apply to any industry, our example data set will be publicly available data from the Home Mortgage Disclosure Act, de-identified data on mortgage loan determinations. Instead of beginning requirements-gathering with a blank slate, your users can begin the conversation with "I would like something like Template #4," greatly reducing the time and effort required to meet their needs.
Elliot Inman, SAS
Michael Drutar, SAS
Creating web applications and web services can sometimes be a daunting task. With the ever changing facets of web programming, it can be even more challenging. For example, client-side web programming using applets was very popular almost 20 years ago, only to be replaced with server-side programming techniques. With all of the advancements in JavaScript libraries, it seems client-side programming is again making a comeback. Amidst all of the changing web technologies, surprisingly one of the most powerful tools I have found that has provided exceptional capabilities across all types of web development techniques has been SAS/IntrNet®. Traditionally seen as a server-side programming tool for generating complete web pages based on SAS® content, SAS/IntrNet, coupled with jQuery AJAX or AJAX alone, also has the ability to provide client-side web programming techniques, as well as provide RESTful web service implementations. I hope to show that with the combination of these tools including the JSON procedure from SAS® 9.4, simple yet powerful web services or dynamic content rich web pages can be created easily and rapidly.
Jeremy Palbicki, Mayo Clinic
JavaScript Object Notation (JSON) has quickly become the de-facto standard for data transfer on the Internet, due to an increase in both web data and the usage of full-stack JavaScript. JSON has become dominant in the emerging technologies of the web today, such as the Internet of Things and the mobile cloud. JSON offers a light and flexible format for data transfer, and can be processed directly from JavaScript without the need for an external parser. The SAS® JSON procedure lacks the ability to read in JSON. However, the addition of the GROOVY procedure in SAS® 9.3 allows for execution of Java code from within SAS, allowing for JSON data to be read into a SAS data set through XML conversion. This paper demonstrates the method for parsing JSON into data sets with Groovy and the XML LIBNAME Engine, all within Base SAS®.
John Kennedy, Mesa Digital
This paper describes the merging of designed experiments and regularized regression to find the significant factors to use for a real-time predictive model. The real-time predictive model is used in a Weyerhaeuser modified fiber mill to estimate the value of several key quality characteristics. The modified fiber product is accepted or rejected based on the model prediction or the predicted values' deviation from lab tests. To develop the model, a designed experiment was needed at the mill. The experiment was planned by a team of engineers, managers, operators, lab techs, and a statistician. The data analysis used the actual values of the process variables manipulated in the designed experiment. There were instances of the set point not being achieved or maintained for the duration of the run. The lab tests of the key quality characteristics were used as the responses. The experiment was designed with JMP® 64-bit Edition 11.2.0 and estimated the two-way interactions of the process variables. It was thought that one of the process variables had a curvilinear effect, and this effect was also made estimable. The LASSO method, as implemented in the SAS® GLMSELECT procedure, was used to analyze the data after cleaning and validating the results. Cross validation was used as the variable selection operator. The resulting prediction model passed all the predetermined success criteria. The prediction model is currently being used in real time in the mill for product quality acceptance.
Jon Lindenauer, Weyerhaeuser
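A sketch of a LASSO fit with cross validation as the selection criterion, including two-way interactions and a quadratic term; the response and factor names are illustrative, not the mill's actual process variables.

   proc glmselect data=work.doe_runs seed=2016;
      model strength = temp press flow
                       temp*press temp*flow press*flow temp*temp
            / selection=lasso(choose=cv stop=none) cvmethod=random(5);
   run;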
The success of any marketing promotion is measured by the incremental response and revenue generated by the targeted population known as Test in comparison with the holdout sample known as Control. An unbiased random Test and Control sampling ensures that the incremental revenue is in fact driven by the marketing intervention. However, isolating the true incremental effect of any particular marketing intervention becomes increasingly challenging in the face of overlapping marketing solicitations. This paper demonstrates how a look-alike model can be applied using the GMATCH algorithm on a SAS® platform to design a truly comparable control group to accurately measure and isolate the impact of a specific marketing intervention.
Mou Dutta, Genpact LLC
Arjun Natarajan, Genpact LLC
As credit unions market themselves to increase their market share against the big banks, they understandably focus on gaining new members. However, they must also retain their existing members. Otherwise, the new members they gain can easily be offset by existing members who leave. Happily, by using predictive analytics as described in this paper, keeping (and further engaging) existing members can actually be much easier and less expensive than enlisting new members. This paper provides a step-by-step overview of a relatively simple but comprehensive approach to reduce member attrition. We first prepare the data for a statistical analysis. With some basic predictive analytics techniques, we can then identify those members who have the highest chance of leaving and the highest value. For each of these members, we can also identify why they would leave, thus suggesting the best way to intervene to retain them. We then make suggestions to improve the model for better accuracy. Finally, we provide suggestions to extend this approach to further engaging existing members and thus increasing their lifetime value. This approach can also be applied to many other organizations and industries. Code snippets are shown for any version of SAS® software; they also require SAS/STAT® software.
Nate Derby, Stakana Analytics
Mark Keintz, Wharton Research Data Services
Logistic regression is one of the most widely used regression models in statistics. Its dependent variable is a discrete variable with at least two levels. It measures the relationship between categorical dependent variables and independent variables and predicts the likelihood of having the event associated with an outcome variable. Variable reduction and screening are techniques that remove redundant independent variables and filter out independent variables with less predictive power. For several reasons, variable reduction and screening are important and critical, especially when there are hundreds of covariates available that could possibly fit into the model. These techniques decrease model estimation time, can avoid the occurrence of non-convergence, and can generate an unbiased inference of the outcome. This paper reviews various approaches for variable reduction, including traditional methods starting from univariate single logistic regressions and methods based on variable clustering techniques. It also presents hands-on examples using SAS/STAT® software and illustrates three selection nodes available in SAS® Enterprise Miner™: the variable selection node, the variable clustering node, and the decision tree node.
Xinghe Lu, Amerihealth Caritas
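A minimal variable-clustering step of the kind reviewed in the paper; from each resulting cluster, the variable with the lowest 1-R**2 ratio is typically retained. The data set and variable list are assumptions made for this sketch.

   proc varclus data=work.modeling_abt maxeigen=0.7 short hi;
      var x1-x150;       /* candidate covariates */
   run;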
SAS® Visual Analytics offers many new and exciting ways to look at your data. Users are able to load data at will to explore and ask questions of their data. But what happens if you need to prevent users from viewing all data? What can you do to prevent users from loading too much data? What happens if a user loads data that exceeds the hardware capacity of your environment? This session covers practical ways to limit available resources and secure SAS LASR data. Real-world scenarios are covered and attendees can participate in an open discussion.
David Franklin, SAS
SharePoint is a web application framework and platform developed by Microsoft, mostly used for content and document management by mid-size businesses and large departments. Linking SAS® with SharePoint combines the power of these two into one. This paper shows users how to send PDF reports and files in other formats (such as Microsoft Excel files, HTML files, JPEG files, zipped files, and so on) from SAS to a SharePoint Document Library. The paper demonstrates how to configure SharePoint Document Library settings to receive files from SAS. A couple of SAS code examples are included to show how to send files from SAS to SharePoint. The paper also introduces a framework for creating data visualization on SharePoint by feeding SAS data into HTML pages on SharePoint. An example of how to create an infographics SharePoint page with data from SAS is also provided.
Xiaogang Tang, Wyndham Worldwide
With advances in technology, the world of sports is now offering rich data sets that are of interest to statisticians. This talk concerns some research problems in various sports that are based on large data sets. In baseball, PITCHf/x data is used to help quantify the quality of pitches. From this, questions about pitcher evaluation and effectiveness are addressed. In cricket, match commentaries are parsed to yield ball-by-ball data in order to assist in building a match simulator. The simulator can then be used to investigate optimal lineups, player evaluation, and the assessment of fielding.
Our company, Veikkaus, is a state-owned gambling and lottery company in Finland that has a national legalized monopoly on gambling. All the profit we make goes back to Finnish society (for art, sports, science, and culture), and this distribution is handled by our government. In addition to the government's profit requirements, the state also requires us to handle the adverse social aspects of gaming, such as problem gambling. The challenge in our business is to balance these two factors. To address problem gambling, we have used SAS® tools to create a responsible gaming tool, called VasA, based on a logistic regression model. The name VasA is derived from the Finnish words for 'Responsible Customership.' The model identifies problem gamblers from our customer database using data from identified gaming, money transfers, web behavior, and customer records. The variables used in the model are based on the theory behind problem gambling. Our actions for problem gambling include, for example, tailored CRM and personalization of a customer's website in our web service. Several companies offered responsible gambling tools for us to buy, but we wanted to create our own for two reasons. First, we wanted it to cover our whole customer database, meaning all our customers and not just those who chose to take part; the purchased tools normally include only customers who opt in. Second, we saved a considerable amount of money by building it ourselves rather than buying one. Throughout this process, SAS played a big role, from gathering the data to constructing the tool, from modeling to creating the VasA variables, then on to the database, and finally to the analyses and reporting.
Tero Kallioniemi, Veikkaus
Sometimes, the relationship between an outcome (dependent) variable and the explanatory (independent) variable(s) is not linear. Restricted cubic splines are a way of testing the hypothesis that the relationship is not linear or summarizing a relationship that is too non-linear to be usefully summarized by a linear relationship. Restricted cubic splines are just a transformation of an independent variable. Thus, they can be used not only in ordinary least squares regression, but also in logistic regression, survival analysis, and so on. The range of values of the independent variable is split up, with knots defining the end of one segment and the start of the next. Separate curves are fit to each segment. Overall, the splines are defined so that the resulting fitted curve is smooth and continuous. This presentation describes when splines might be used, how the splines are defined, choice of knots, and interpretation of the regression results.
Ruth Croxford, Institute for Clinical Evaluative Sciences
Value-based reimbursement is the emerging strategy in the US healthcare system. The premise of value-based care is simple in concept--high quality and low cost provide the greatest value to patients and the various parties that fund their coverage. The basic equation for value is equally simple to compute: value=quality/cost. However, there are significant challenges to measuring it accurately. Error or bias in measuring value could result in the failure of this strategy to ultimately improve the healthcare system. This session discusses various methods and issues with risk adjustment in a value-based reimbursement model. Risk adjustment is an essential tool for ensuring that fair comparisons are made when deciding what health services and health providers have high value. The goal of this presentation is to give analysts an overview of risk adjustment and to provide guidance for when, why, and how to use risk adjustment when quantifying performance of health services and healthcare providers on both cost and quality. Statistical modeling approaches are reviewed and practical issues with developing and implementing the models are discussed. Real-world examples are also provided.
Daryl Wansink, Conifer Value Based Care
This paper explores some proven methods used to automate complex SAS® Enterprise Guide® projects so that the average Joe can run them with little or no prior experience. Programmers are often asked to extract data and dump it into Microsoft Excel for a user. These data extracts are frequently very similar and can be run with previously saved code. However, the user often has to wait for the programmer to find the time to simply run the code. By automating the code, the programmer regains control over their data requests. This paper discusses the benefits of establishing macro variables and creating stored procedures, among other tips.
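A minimal sketch of the idea, assuming a hypothetical source table and output folder: parameterizing an extract with macro variables lets a non-programmer rerun it by changing only the values at the top of the program.

   %let startdt = '01JAN2016'd;            /* requester-editable parameters */
   %let enddt   = '31MAR2016'd;
   %let outpath = C:\extracts;

   data work.extract;
      set corp.orders;                     /* hypothetical source table */
      where order_date between &startdt and &enddt;
   run;

   proc export data=work.extract
               outfile="&outpath\orders_&sysdate..xlsx"
               dbms=xlsx replace;
   run;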
Jennifer Davies, Department of Education
Create a model with SAS Contextual Analysis, deploy it to your Hadoop cluster, and run it using MapReduce driven by the SAS Code Accelerator, not alongside Hadoop but inside it. Hadoop took the world by storm as a cost-efficient software framework for storing data, but Hadoop is more than that: it is an analytics platform. And analytics is more than just crunching numbers; it is also about gaining knowledge from your organization's unstructured data. SAS has the tools to do both. In this paper, we demonstrate how to access your unstructured data, automatically create a text analytics model, deploy that model in Hadoop, and run the SAS Code Accelerator to give you insights into the big, unstructured data lying dormant in Hadoop. We also show how the insights gained can help you make better-informed decisions for your organization.
David Bultman, SAS
Adam Pilz, SAS
Analyzing massive amounts of big data quickly to get at answers that increase agility--an organization's ability to sense change and respond--and drive time-sensitive business decisions is a competitive differentiator in today's market. Customers will gain from the deep understanding of storage architecture provided by EMC combined with the deep expertise in analytics provided by SAS. This session provides an overview of how your mixed analytics SAS® workloads can be transformed on EMC XtremIO, DSSD, and Pivotal solutions. Whether you're working with one of the three primary SAS file systems or SAS® Grid, performance and capacity scale linearly, significantly eliminate application latency, and remove the complexity of storage tuning from the equation. SAS and EMC have partnered to meet customer challenges and deliver a modern analytic architecture. This unified approach encompasses big data management, analytics discovery, and deployment via end-to-end solutions that solve your big data problems. Learn about best practices from fellow customers and how they deliver new levels of SAS business value.
High-quality documentation of SAS® code is standard practice in multi-user environments for smoother group collaborations. One of the documentation items that facilitate program sharing and retrospective review is a header section at the beginning of a SAS program highlighting the main features of the program, such as the program's name, its creation date, the program's aims, the programmer's identification, and the project title. In this header section, it is helpful to keep a list of the inputs and outputs of the SAS program (for example, SAS data sets and files that the program used and created). This paper introduces SAS-IO, a browser-based HTML/JavaScript tool that can automate production of such an input/output list. This can save the programmers' time, especially when working with long SAS programs.
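For illustration, a header of the kind the tool parses might look like the following (the program, library, and file names are hypothetical); SAS-IO would scan the body of such a program and fill in the Inputs and Outputs lines automatically.

   /**********************************************************************
   * Program : ae_summary.sas
   * Author  : J. Smith
   * Created : 01MAR2016
   * Purpose : Summarize adverse events by treatment arm
   * Inputs  : adam.adae, adam.adsl
   * Outputs : output/ae_summary.rtf, work.ae_counts
   **********************************************************************/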
Mohammad Reza Rezai, Institute for Clinical Evaluative Sciences
This presentation describes the new SAS® Customer Intelligence 360 solution for multi-armed bandit analysis of controlled experiments. Multi-armed bandit analysis has been in existence since the 1950s and is sometimes referred to as the K- or N-armed bandit problem. In this problem, a gambler at a row of slot machines (sometimes known as 'one-armed bandits') has to decide which machines to play, as well as the frequency and order in which to play each machine, with the objective of maximizing returns. In practice, multi-armed bandits have been used to model the problem of managing research projects in a large organization. However, the same technology can be applied within the marketing space where, for example, variants of campaign creatives or variants of customer journey sub-paths are compared to better understand customer behavior by collecting their respective response rates. Distribution weights of the variants are adjusted as time passes and conversions are observed to reflect what customers are responding to, thus optimizing the performance of the campaign. The SAS Customer Intelligence 360 analytic engine collects data at regularly scheduled time intervals and processes it through a simulation engine to return an updated weighting schema. The process is automated in the digital space and led through the solution's content serving and execution engine for interactive communication channels. Both marketing gains and costs are studied during the process to illustrate the business value of multi-armed bandit analysis. Results from coupling traditional A/B testing with the multi-armed bandit engine are also presented.
Thomas Lehman, SAS
If, like the character from George Orwell's novel, you need to control what your users are doing and also need to report on it, then this is the paper for you. Using a combination of resources available in SAS® 9.4, any administrator can control what users are allowed to perform within SAS, and then can create comprehensive and customized reports detailing what was done. This paper discusses how metadata roles can be used to control users' capabilities. Particular attention is given to the user roles available to SAS® Enterprise Guide® and the SAS® Add-in for Microsoft Office, as well as administrator roles available to the SAS® Management Console. Best practices are discussed when it comes to the creation of these roles and how they should be applied to groups and users. The second part of this paper explores how to monitor SAS® utilization through SAS® Environment Manager. It investigates the data collected through its extended monitoring and how this can be harvested to create reports that can track sessions launched, procedures used, data accessed, and other purpose-built reports. This paper is for SAS administrators who are responsible for the maintenance of systems, system architects who need to design new deployments, and users interested in understanding how to use SAS in a secure organization.
Elena Muriel, Amadeus Software Limited
SAS® Customer Intelligence 360 enables marketers to create activity maps to delight customers with a consistent experience on digital channels such as web, mobile, and email. Activity maps enable the user to create a customer journey across digital channels and assign conversion measures as individual customers and prospects navigate the customer experience. This session details how those interactions are constructed in terms of what types of interactions tasks are supported and how those interaction tasks can be related to each other in a visual paradigm with built-in goal splits, analytics, wait periods, and A/B testing. Business examples are provided for common omni-channel problems such as cross-selling and customer segmentation. In addition, we cover the integration of SAS® Marketing Automation and other marketing products, which enables sites to leverage their current segments and treatments for the digital channels.
Mark Brown, SAS
Brian Chick, SAS
SAS® users are always surprised to discover their programs contain bugs (or errors). In fact, when asked, users will emphatically stand by their programs and logic by saying they are error free. But, the vast number of experiences, along with the realities of writing code, say otherwise. Errors in program code can appear anywhere, whether accidentally introduced by developers or programmers, when writing code. No matter where an error occurs, the overriding sentiment among most users is that debugging SAS programs can be a daunting and humbling task. This presentation explores the world of SAS errors, providing essential information about the various error types. Attendees learn how errors are created, their symptoms, identification techniques, and how to apply effective techniques to better understand, repair, and enable program code to work as intended.
Kirk Paul Lafler, Software Intelligence Corporation
Historically, administration of your SAS® Grid Manager environment has required interaction with a number of disparate applications including Platform RTM for SAS, SAS® Management Console, and command line utilities. With the third maintenance release of SAS® 9.4, you can now use SAS® Environment Manager for all monitoring and management of your SAS Grid. The new SAS Environment Manager interface gives you the ability to configure the Load Sharing Facility (LSF), manage and monitor high-availability applications, monitor overall SAS Grid health, define event-based alerts, and much, much more through a single, unified, web-based interface.
Scott Parrish, SAS
Paula Kavanagh, SAS Institute, Inc.
Linda Zeng, SAS Institute, Inc.
This session is an in-depth review of SAS® Grid performance on IBM Hardware. This review spans our environment's growth over the last four years and includes the latest upgrade to our environment from the first maintenance release of SAS® 9.3 to the third maintenance release of SAS® 9.4 (and doing a hardware refresh in the process).
Whayne Rouse, Humana
Andrew Scott, Humana
Longitudinal data with time-dependent covariates is not readily analyzed as there are inherent, complex correlations due to the repeated measurements on the sampling unit and the feedback process between the covariates in one time period and the response in another. A generalized method of moments (GMM) logistic regression model (Lalonde, Wilson, and Yin 2014) is one method for analyzing such correlated binary data. While GMM can account for the correlation due to both of these factors, it is imperative to identify the appropriate estimating equations in the model. Cai and Wilson (2015) developed a SAS® macro using SAS/IML® software to fit GMM logistic regression models with extended classifications. In this paper, we expand the use of this macro to allow for continuous responses and as many repeated time points and predictors as possible. We demonstrate the use of the macro through two examples, one with binary response and another with continuous response.
Katherine Cai, Arizona State University
Jeffrey Wilson, Arizona State University
It is not uncommon to hear SAS® administrators complain that their IT department and users just don't get it when it comes to metadata and security. For the administrator or user not familiar with SAS, understanding how SAS interacts with the operating system, the file system, external databases, and users can be confusing. This paper walks you through all the basic metadata relationships and how they are created on a SAS® Enterprise Office Analytics installation in a Windows environment. This guided tour unravels the mystery of how the host system, external databases, and SAS work together to give users what they need, while reliably enforcing the appropriate security.
Charyn Faenza, F.N.B. Corporation
The purpose of this paper is to provide an overview of SAS® metadata security for new or inexperienced SAS administrators. The focus of the discussion is on identifying the most common metadata security objects such as access control entries (ACEs), access control templates (ACTs), metadata folders, authentication domains, etc. and describing how these objects work together to secure the SAS environment. Based on a standard SAS® Enterprise Office Analytics for Midsize Business installation in a Windows environment, this paper walks through a simple example of securing a metadata environment, which demonstrates how security is prioritized, the impact of each security layer, and how conflicts are resolved.
Charyn Faenza, F.N.B. Corporation
Mobile devices are an integral part of a business professional's life. These mobile devices are getting increasingly powerful in terms of processor speeds and memory capabilities. Business users can benefit from a more analytical visualization of the data along with their business context. The new SAS® Mobile BI contains many enhancements that facilitate the use of SAS® Analytics in the newest version of SAS® Visual Analytics. This paper demonstrates how to use the new analytical visualization that has been added to SAS Mobile BI from SAS Visual Analytics, for a richer and more insightful experience for business professionals on the go.
Murali Nori, SAS
Spontaneous combustion describes combustion that occurs without an external ignition source. With the right combination of fire tetrahedron components--including fuel, oxidizer, heat, and chemical reaction--it can be a deadly yet awe-inspiring phenomenon, and differs from traditional combustion that requires a fire source, such as a match, flint, or spark plugs (in the case of combustion engines). SAS® code as well often requires a 'spark' the first time it is run or run within a new environment. Thus, SAS® programs might operate correctly in an organization's original development environment, but might fail in its production environment until folders are created, SAS® libraries are assigned, control tables are constructed, or configuration files are built or modified. And, if software portability is problematic within a single organization, imagine the complexities that exist when SAS code is imported from a blog, white paper, textbook, or other external source into one's own infrastructure. The lack of software portability and the complexities of initializing new code often compel development teams to build code from scratch rather than attempting to rehabilitate or customize existent code to run in their environment. A solution is to develop SAS code that flexibly builds and validates its environment and required components during execution. To that end, this text describes techniques that increase the portability, reusability, and maintainability of SAS code by demonstrating self-extracting, spontaneously combustible code that requires no spark.
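As one hedged example of the pattern (not the paper's code), a program can verify that a required folder exists, create it if it does not, and only then assign the library it needs; the path and libref below are hypothetical.

   %macro prep_env(parent, folder);
      /* create the folder only if it is missing, then assign the libref */
      %if %sysfunc(fileexist(&parent\&folder)) = 0 %then
         %let rc = %sysfunc(dcreate(&folder, &parent));
      libname proj "&parent\&folder";
   %mend prep_env;

   %prep_env(C:\projects\mystudy, data)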
Troy Hughes, Datmesis Analytics
Over the past two years, the Analytics group at 89 Degrees has completely overhauled the toolset we use in our day-to-day work. We implemented SAS® Visual Analytics, initially to reduce the time to create new reports, and then to increase access to the data so that users not familiar with SAS® can create their own explorations and reports. SAS Visual Analytics has become a collaboration tool between our analysts and their business partners, with proportionally more time spent on the interpretation. (We show an example of this in this presentation.) Flush with success, we decided to tackle another area where processing times were longer than we would like, namely weblog data. We were treating weblog data in one of two ways: (1) creating a structured view of unstructured data by saving a handful of predefined variables from each session (for example, session and customer identifiers, page views, time on site, and so on), or (2) storing the granular weblog data in a Hadoop environment and relying on our data management team to fulfill data requests. We created a business case for SAS/ACCESS® Interface for Hadoop, invested in extra hardware, and created a big data environment that could be accessed directly from SAS by the analysts. We show an example of how we created new variables and used them in a logistic regression analysis to predict combined online and offline shopping rates using online and offline historic behavior. Our next dream is to bring all of that together by linking SAS Visual Statistics to a Hadoop environment other than the SAS® LASR™ Analytic server. We share our progress, and hopefully our success, as part of the session.
Rosie Poultney, 89 Degrees
In today's Business Intelligence world, self-service, which allows an everyday knowledge worker to explore data and personalize business reports without being tech-savvy, is a prerequisite. The new release of SAS® Visual Statistics introduces an HTML5-based, easy-to-use user interface that combines statistical modeling, business reporting, and mobile sharing into a one-stop self-service shop. The backbone analytic server of SAS Visual Statistics is also updated, allowing an end user to analyze data of various sizes in the cloud. The paper illustrates this new self-service modeling experience in SAS Visual Statistics using telecom churn data, including the steps of identifying distinct user subgroups using decision tree, building and tuning regression models, designing business reports for customer churn, and sharing the final modeling outcome on a mobile device.
Xiangxiang Meng, SAS
Don Chapman, SAS
Cheryl LeSaint, SAS
The third maintenance release of SAS® 9.4 was a huge release with respect to the interoperability between SAS® and Hadoop, the industry standard for big data. This talk brings you up-to-date with where we are: more distributions, more data types, more options. Come and learn about the exciting new developments for blending your SAS processing with your shared Hadoop cluster. Grid processing. Check. SAS data sets on HDFS. Check.
Paul Kent, SAS
Revolution Analytics reports more than two million R users worldwide. SAS® has the capability to use R code, but users have discovered a slight learning curve when performing certain basic functions such as getting data from the web. R is a functional programming language while SAS is a procedural programming language. These differences create difficulties when first making the switch from programming in R to programming in SAS. However, SAS/IML® software enables integration between the two languages by enabling users to write R code directly into SAS/IML. This paper details the process of using the SAS/IML SUBMIT / R statement and the R package XML to get data from the web into SAS/IML. The project uses public basketball data for each of the 30 NBA teams over the past 35 years, taken directly from Basketball-Reference.com. The data was retrieved from 66 individual web pages, cleaned using R functions, and compiled into a final data set composed of 48 variables and 895 records. The seamless compatibility between SAS and R provides an opportunity to use R code in SAS for robust modeling. The resulting analysis provides a clear and concise approach for those interested in pursuing sports analytics.
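A condensed sketch of the mechanism described (assuming the RLANG system option is enabled at SAS startup and the XML package is installed in R); the URL and object names are placeholders rather than the paper's actual code.

   proc iml;
      submit / R;                              /* everything below runs in R */
         library(XML)
         tbl <- readHTMLTable("http://www.basketball-reference.com/teams/BOS/stats.html",
                              which = 1, stringsAsFactors = FALSE)
      endsubmit;
      call ImportDataSetFromR("work.team_stats", "tbl");   /* R data frame -> SAS data set */
   quit;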
Matt Collins, University of Alabama
Taylor Larkin, The University of Alabama
Coronal mass ejections (CMEs) are massive explosions of magnetic field and plasma from the Sun. While responsible for the northern lights, these eruptions can cause geomagnetic storms and cataclysmic damage to Earth's telecommunications systems and power grid infrastructures. Hence, it is imperative to construct highly accurate predictive processes to determine whether an incoming CME will produce devastating effects on Earth. One such process, called stacked generalization, trains a variety of models, or base-learners, on a data set. Then, using the predictions from the base-learners, another model is trained to learn from the metadata. The goal of this meta-learner is to deduce information about the biases from the base-learners to make more accurate predictions. Studies have shown success in using linear methods, especially within regularization frameworks, at the meta-level to combine the base-level predictions. Here, SAS® Enterprise Miner™ 13.1 is used to reinforce the advantages of regularization via the Least Absolute Shrinkage and Selection Operator (LASSO) on this type of metadata. This work compares the LASSO model selection method to other regression approaches when predicting the occurrence of strong geomagnetic storms caused by CMEs.
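Although the paper works in SAS Enterprise Miner, the same idea can be expressed in code form; the following is a hedged sketch of LASSO selection at the meta-level with PROC GLMSELECT, where the base-learner predictions p1-p5 and the data set name are hypothetical.

   proc glmselect data=meta_predictions plots=coefficients;
      model storm_intensity = p1-p5 / selection=lasso(choose=sbc stop=none);
   run;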
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
This paper examines the various sampling options that are available in SAS® through PROC SURVEYSELECT. We do not cover all of the possible sampling methods or options that PROC SURVEYSELECT features. Instead, we look at Simple Random Sampling, Stratified Random Sampling, Cluster Sampling, Systematic Sampling, and Sequential Random Sampling.
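Two of the designs discussed, sketched with the SASHELP.CARS sample data (seed values are arbitrary):

   /* simple random sample of 100 vehicles */
   proc surveyselect data=sashelp.cars out=srs_sample
                     method=srs sampsize=100 seed=1234;
   run;

   /* stratified random sample: 20 vehicles from each origin */
   proc sort data=sashelp.cars out=cars; by origin; run;

   proc surveyselect data=cars out=strat_sample
                     method=srs sampsize=20 seed=1234;
      strata origin;
   run;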
Rachael Becker, University of Central Florida
Drew Doyle, University of Central Florida
Just as there are many ways to solve any problem in any facet of life, most SAS® programming problems have more than one potential solution. Each solution has tradeoffs; a complex program might execute very quickly but prove challenging to maintain, while a program designed for ease of use might require more resources for development, execution, and maintenance. Too often, it seems like those tasked to produce the results are advised on delivery date and estimated development time in advance, but are given no guidelines for efficiency expectations. This paper provides ways for improving the efficiency of your SAS® programs. It suggests coding techniques, provides guidelines for their use, and shows the results of experimentation to compare various coding techniques, with examples of acceptable and improved ways to accomplish the same task.
Andrew Kuligowski, HSN
Swati Agarwal, Optum
How can you set up SAS® Visual Analytics to present reports to the public while still showing different data based on individual access rights? How can a system like that allow for frequent changes in the user base and in individuals' access rights? This session focuses on a recent Norwegian case where SAS® Visual Analytics 7.3 is used to present reports to a large number of users in the public domain. Report data is controlled on a row-level basis for each user and is frequently changed. This poses key questions about how to design a security architecture that allows for new users and changing access rights while keeping reports highly available and performing well.
With the ever-growing global tourism sector, the hotel industry is flourishing by leaps and bounds, and tourists' standards for quality service are rising as well. In today's cyber age, where so many people are part of online communities, electronic word of mouth has steadily grown in influence. According to a recent survey, approximately 46% of travelers look for online reviews before traveling. A one-star rating from users influences fellow consumers more than the five-star brand image a hotel markets for itself. In this paper, we segment customers and create clusters based on their reviews. We then create an online survey targeting the interests of the customers in each cluster, which helps hoteliers identify guests' interests and their hospitality expectations of a hotel. For this paper, we use a data set of 4,096 different hotel reviews, with a total of 60,239 user reviews.
Saurabh Nandy, Oklahoma State University
Neha Singh, Oklahoma State University
Whether it is a question of security or a question of centralizing the SAS® installation to a server, the need to phase out SAS in the PC environment has never been so common. On the surface, this type of migration seems very simple and smooth. However, migrating SAS from a PC environment to a SAS server environment (SAS® Enterprise Guide®) is really easy to underestimate. This paper presents a way to set up the winning conditions to achieve this goal without falling into the common traps. Based on a successful conversion with a previous employer, I have identified a high-level roadmap with specific objectives that will guide people in this important task.
Mathieu Gaouette, Videotron
Sampling, whether with or without replacement, is an important component of the hypothesis testing process. In this paper, we demonstrate the mechanics and outcome of sampling without replacement, where sample values are not independent. In other words, what we get in the first sample affects what we can get for the second sample, and so on. We use the popular variant of poker known as No Limit Texas Hold'em to illustrate.
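A tiny worked example of the dependence involved (not taken from the paper): the chance of being dealt two aces changes after the first card is seen, because sampling is without replacement.

   data _null_;
      p_first  = 4/52;                   /* ace on the first card               */
      p_second = 3/51;                   /* ace on the second, given the first  */
      p_pocket_aces = p_first * p_second;
      put p_pocket_aces= percent8.2;     /* about 0.45% */
   run;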
Dan Bretheim, Towers Watson
Similarity analysis is used to classify an entire time series by type. In this method, a distance measure is calculated to reflect the difference in form between two ordered sequences, such as are found in time series data. The mutual differences between many sequences or time series can be compiled into a similarity matrix, similar in form to a correlation matrix. When cluster methods are applied to a similarity matrix, time series with similar patterns are grouped together and placed into clusters distinct from others that contain time series with different patterns. In this example, similarity analysis is used to classify supernovae by the way the amount of light they produce changes over time.
David Corliss, Ford Motor Company
Missing data is something that we cannot always prevent. Data can be missing because subjects refuse to answer a sensitive question or for fear of embarrassment. Researchers often assume that their data are missing completely at random or missing at random. Unfortunately, we cannot test whether that mechanism assumption is satisfied, because the missing values themselves cannot be observed. Alternatively, we can run simulations in SAS® to observe the behavior of missing data under different assumptions: missing completely at random, missing at random, and ignorable missingness. We compare the effects of different imputation methods when we set a chosen variable of interest to missing. The idea of imputation is to see how well the substituted values in a data set support further analyses. This lets the audience decide which method(s) would be best for approaching a data set with missing data.
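For reference, one of the imputation approaches that could be compared in such a simulation is multiple imputation with PROC MI, followed by PROC MIANALYZE to combine the results (the data set and variable names are hypothetical):

   proc mi data=have nimpute=5 seed=54321 out=imputed;
      var age income score;
   run;

   /* analyze each imputation, then pool the estimates */
   proc reg data=imputed outest=est covout noprint;
      by _imputation_;
      model score = age income;
   run;

   proc mianalyze data=est;
      modeleffects intercept age income;
   run;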
Danny Rithy, California Polytechnic State University
Soma Roy, California Polytechnic State University
In forecasting, there are often situations where several time series are interrelated: components of one time series can transition into and from other time series. A Markov chain forecast model may readily capture such intricacies through the estimation of a transition probability matrix, which enables a forecaster to forecast all the interrelated time series simultaneously. A Markov chain forecast model is flexible in accommodating various forecast assumptions and structures. Implementation of a Markov chain forecast model is straightforward using SAS/IML® software. This paper demonstrates a real-world application in forecasting a community supervision caseload in Washington State. A Markov model was used to forecast five interrelated time series in the midst of turbulent caseload changes. This paper discusses the considerations and techniques in building a Markov chain forecast model at each step. Sample code using SAS/IML is provided. Anyone interested in adding another tool to their forecasting technique toolbox will find that the Markov approach is useful and has some unique advantages in certain settings.
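The core computation is compact in SAS/IML; the following is a minimal sketch with a hypothetical three-state transition matrix, where rows are the current state and columns the next state (the real model in the paper is larger and estimated from data).

   proc iml;
      P = {0.90 0.07 0.03,
           0.10 0.80 0.10,
           0.05 0.15 0.80};            /* row i, column j: P(state i -> state j) */
      x = {1000, 400, 200};            /* current caseload by state              */
      do t = 1 to 12;                  /* forecast the next 12 periods           */
         x = P` * x;
         print t x;
      end;
   quit;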
Gongwei Chen, Caseload Forecast Council
Retailers and wholesalers invest heavily in technology, people, processes, and data to create relevant assortments across channels. While technology and vast amounts of data help localize assortments based on consumer preferences, product attributes, and store performance, it's impossible to complete the assortment planning process down to the most granular level of size. The ability to manage millions of size and store combinations is burdensome, not scalable, and not precise. Valuable time and effort is spent creating detailed, insightful assortments only to marginalize those assortments by applying corporate averages of size selling for the purchasing and distribution of sizes to locations. The result is missed opportunity: disappointed customers, lost revenue, and lost profitability due to missing sizes and markdowns on abundant sizes. This paper shows how retailers and wholesalers can transform historical sales data into true size demand and determine the optimal size demand profile to use in the purchasing and allocation of products. You do not need to be a data scientist, statistician, or hold a PhD to augment the business process with approachable analytics and optimization to yield game-changing results!
Donna McGuckin, SAS
This paper describes a Kaiser Permanente Northwest business problem regarding tracking recent inpatient hospital utilization at external hospitals, and how it was solved with the flexibility of SAS® Enterprise Guide®. The Inpatient Indicator is an estimate of our regional inpatient hospital utilization as of yesterday. It tells us which of our members are in which hospitals. It measures inpatient admissions, which are health care interactions where a patient is admitted to a hospital for bed occupancy to receive hospital services. The Inpatient Indicator is used to produce data and create metrics and analysis essential to the decision making of Kaiser Permanente executives, care coordinators, patient navigators, utilization management physicians, and operations managers. Accurate, recent hospital inpatient information is vital for decisions regarding patient care, staffing, and member utilization. Due to a business policy change, Kaiser Permanente Northwest lost the ability to track urgent and emergent inpatient admits at external, non-plan hospitals through our referral system, which was our data source for all recent external inpatient admits. Without this information, we did not have complete knowledge of whether a member had an inpatient stay at an external hospital until a claim was received, which could be several weeks after the member was admitted. Other sources were needed to understand our inpatient utilization at external hospitals. A tool was needed with the flexibility to easily combine and compare multiple data sets with different field names, formats, and values representing the same metric. The tool needed to be able to import data from different sources and export data to different destinations. We also needed a tool that would allow this project to be scheduled. We chose to build the model with SAS Enterprise Guide.
Thomas Gant, Kaiser Permanente
Are you frustrated with manually setting options to control your SAS® Display Manager sessions but become daunted every time you look at all the places you can set options and window layouts? In this paper, we look at various files SAS accesses when starting, what can (and cannot) go into them, and what takes precedence after all are executed. We also look at the SAS Registry and how to programmatically change settings. By the end of the paper, you will be comfortable in knowing where to make the changes that best fit your needs.
Peter Eberhardt, Fernwood Consulting Group Inc.
The report looks simple enough--a bar chart and a table, like something created with GCHART and REPORT procedures. But there are some twists to the reporting requirements that make those procedures not quite flexible enough. It's often the case that the programming tools and techniques we envision using for a project or are most comfortable with aren't necessarily the best to use. Fortunately, SAS® can provide many ways to get results. Rather than procedure-based output, the solution here was to mix 'old' and 'new' DATA step-based techniques to solve the problem. Annotate data sets are used to create the bar chart and the Report Writing Interface (RWI) is used to create the table. Without a whole lot of additional code, you gain an extreme amount of flexibility.
Pete Lund, Looking Glass Analytics
Big data is often distinguished as encompassing high volume, high velocity, or high variability of data. While big data can signal big business intelligence and big business value, it can also wreak havoc on systems and software ill-prepared for its profundity. Scalability describes the ability of a system or software to adequately meet the needs of additional users or its ability to use additional processors or resources to fulfill those added requirements. Scalability also describes the adequate and efficient response of a system to increased data throughput. Because sorting data is one of the most common and most resource-intensive operations in any software language, inefficiencies or failures caused by big data often are first observed during sorting routines. Much SAS® literature has been dedicated to optimizing big data sorts for efficiency, including minimizing execution time and, to a lesser extent, minimizing resource usage (that is, memory and storage consumption). However, less attention has been paid to implementing big data sorting that is reliable and robust even when confronted with resource limitations. To that end, this text introduces the SAFESORT macro, which facilitates a priori exception-handling routines (which detect environmental and data set attributes that could cause process failure) and post hoc exception-handling routines (which detect actual failed sorting routines). If exception handling is triggered, SAFESORT automatically reroutes program flow from the default sort routine to a less resource-intensive routine, thus sacrificing execution speed for reliability. Moreover, macro modularity enables developers to select their favorite sort procedure and, for data-driven disciples, to build fuzzy logic routines that dynamically select a sort algorithm based on environmental and data set attributes.
Troy Hughes, Datmesis Analytics
Sixty percent of organizations will have Hadoop in production by 2016, per a recent TDWI survey, and it has become increasingly challenging to access, cleanse, and move these huge volumes of data. How can data scientists and business analysts clean up their data, manage it where it lives, and overcome the big data skills gap? It all comes down to accelerating the data preparation process. SAS® Data Loader leverages years of expertise in data quality and data integration, a simplified user interface, and the parallel processing power of Hadoop to overcome the skills gap and compress the time taken to prepare data for analytics or visualization. We cover some of the new capabilities of SAS Data Loader for Hadoop including in-memory Spark processing of data quality and master data management functions, faster profiling and unstructured data field extraction, and chaining multiple transforms together for improved productivity.
Matthew Magne, SAS
Scott Gidley, SAS
SAS/ETS® 14.1 delivers a substantial number of new features to researchers who want to examine causality with observational data in addition to forecasting the future. This release adds count data models with spatial effects, new linear and nonlinear models for panel data, the X13 procedure for seasonal adjustment, and many more new features. This paper highlights the many enhancements to SAS/ETS software and demonstrates how these features can help your organization increase revenue and enhance productivity.
Jan Chvosta, SAS
In Brazil, almost 70% of all loans are made based on pre-approved limits, which are established by the bank. Sicredi wanted to improve the number of loans granted through those limits. In addition, Sicredi wanted an application that focuses on the business user; one that enables business users to change system behavior with little or no IT involvement. The new system will be used in three major areas: - In the registration of a new client for whom Sicredi does not have a history. - At the initiative of business users, after the customer already has a relationship with Sicredi, even without a request from the customer. - In the loan approval process, when a limit has not yet been set for the customer. The limit system will try to measure a limit for the customer based on the loan request, before sending the loan to the human approval system. Due to the impact of these changes, we turned the project into a program, and then split that program into three projects. The first project, which we have already finished, aimed to select an application that meets our requirements, and then to develop the credit measurement for the registration phase. SAS Real-Time Decision Manager was selected because it fulfills our requirements, especially those that pertain to business user operation. A drag-and-drop interface makes all the technical rules more comprehensible to the business user. So far, four months after releasing the project for implementation by the bank's branches, we have granted more than USD 20 million in pre-approved loan limits. In addition, we have reduced the process time for limit measurement in the branches by 84%. The branches can follow their results and goals through reports developed in SAS Visual Analytics.
Felipe Lopes Boff
Disease prevalence is one of the most basic measures of the burden of disease in the field of epidemiology. As an estimate of the total number of cases of disease in a given population, prevalence is a standard in public health analysis. The prevalence of diseases in a given area is also frequently at the core of governmental policy decisions, charitable organization funding initiatives, and countless other aspects of everyday life. However, all too often, prevalence estimates are restricted to descriptive estimates of population characteristics when they could have a much wider application through the use of inferential statistics. As an estimate based on a sample from a population, disease prevalence can vary based on random fluctuations in that sample rather than true differences in the population characteristic. Statistical inference uses a known distribution of this sampling variation to perform hypothesis tests, calculate confidence intervals, and perform other advanced statistical methods. However, there is no agreed-upon sampling distribution of the prevalence estimate. In cases where the sampling distribution of an estimate is unknown, statisticians frequently rely on the bootstrap re-sampling procedure first given by Efron in 1979. This procedure relies on the computational power of software to generate repeated pseudo-samples similar in structure to an original, real data set. These multiple samples allow for the construction of confidence intervals and statistical tests to make statistical determinations and comparisons using the estimated prevalence. In this paper, we use the bootstrapping capabilities of SAS® 9.4 to statistically compare two given prevalence rates. We create a bootstrap analog to the two-sample t test to compare prevalence rates from two states, even though the sampling distribution of these estimates is unknown.
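The resampling step itself can be done with PROC SURVEYSELECT; the following is a hedged sketch that draws 1,000 with-replacement samples from one state's survey records and computes the prevalence in each replicate (the data set and variable names are hypothetical).

   proc surveyselect data=state_a out=boot_a seed=2718
                     method=urs samprate=1 reps=1000 outhits;
   run;

   /* prevalence estimate for each bootstrap replicate */
   proc means data=boot_a noprint nway;
      class replicate;
      var disease;                     /* 1 = case, 0 = non-case */
      output out=prev_a mean=prevalence;
   run;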
Matthew Dutton, Florida A&M University
Charlotte Baker, Florida A&M University
The increasing size and complexity of data in research and business applications require a more versatile set of tools for building explanatory and predictive statistical models. In response to this need, SAS/STAT® software continues to add new methods. This presentation takes you on a high-level tour of five recent enhancements: new effect selection methods for regression models with the GLMSELECT procedure, model selection for generalized linear models with the HPGENSELECT procedure, model selection for quantile regression with the HPQUANTSELECT procedure, construction of generalized additive models with the GAMPL procedure, and building classification and regression trees with the HPSPLIT procedure. For each of these approaches, the presentation reviews its key concepts, uses a basic example to illustrate its benefits, and guides you to information that will help you get started.
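As a flavor of one of the five, a minimal PROC HPSPLIT call that grows a classification tree on the SASHELP.HEART data set shipped with SAS (options left at their defaults):

   proc hpsplit data=sashelp.heart seed=123;
      class status sex bp_status;
      model status = sex bp_status cholesterol ageatstart;
   run;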
Bob Rodriguez, SAS
A stored process is a SAS® program that can be executed as required by different applications. Stored processes have been making SAS users' lives easier for decades. In SAS® Visual Analytics, stored processes can be used to enhance the user experience, create custom functionality and output, and expand the usefulness of reports. This paper discusses a technique for how data can be loaded on demand into memory for SAS Visual Analytics and accessed by reports as needed using stored processes. Loading tables into memory so that they can be used to create explorations or reports is a common task in SAS Visual Analytics. This task is usually done by an administrator, enabling the SAS Visual Analytics user to have a seamless transition from data to report. At times, however, users need for tables to be partially loaded or modified in memory on demand. The step-by-step instructions in this paper make it easy enough for any SAS Visual Analytics report builder to include these stored processes in their work. By using this technique, SAS Visual Analytics users have the freedom to access their data without having to work through an administrator for every little task, helping everyone be more effective and productive.
Renato Luppi, SAS
Varsha Chawla, SAS Institute Inc.
Teaching online courses is very different from teaching in the classroom setting. Developing and delivering an effective online class entails more than just transferring traditional course materials into written documents and posting them in a course shell. This paper discusses the author's experience in converting a traditional hands-on introductory SAS® programming class into an online course and presents some ideas for promoting successful learning and knowledge transfer when teaching online.
Justina Flavin, Independent Consultant
Sensors, devices, social conversation streams, web movement, and all things in the Internet of Things (IoT) are transmitting data at unprecedented volumes and rates. SAS® Event Stream Processing ingests thousands and even hundreds of millions of data events per second, assessing both the content and the value. The benefit to organizations comes from doing something with those results, and eliminating the latencies associated with storing data before analysis happens. This paper bridges the gap. It describes how to use streaming data as a portable, lightweight micro-analytics service for consumption by other applications and systems.
Fiona McNeill, SAS
David Duling, SAS
Steve Sparano, SAS
This paper compares different solutions to a data transpose puzzle presented to the SAS® User Group at the United States Census Bureau. The presented solutions range from a SAS 101 multi-step solution to an advanced solution using techniques that are not widely known, which yields run-time savings of 85 percent!
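The puzzle itself is not reproduced here, but the 'SAS 101' starting point for most transpose problems is PROC TRANSPOSE; a generic long-to-wide sketch with hypothetical data set and variable names:

   proc sort data=long out=long_sorted;
      by id;
   run;

   proc transpose data=long_sorted out=wide(drop=_name_) prefix=month;
      by id;
      id month;
      var amount;
   run;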
Ahmed Al-Attar, AnA Data Warehousing Consulting, LLC
Big data, small data--but what about when you have no data? Survey data commonly include missing values due to nonresponse. Adjusting for nonresponse in the analysis stage might lead different analysts to use different, and inconsistent, adjustment methods. To provide the same complete data to all the analysts, you can impute the missing values by replacing them with reasonable nonmissing values. Hot-deck imputation, the most commonly used survey imputation method, replaces the missing values of a nonrespondent unit by the observed values of a respondent unit. In addition to performing traditional cell-based hot-deck imputation, the SURVEYIMPUTE procedure, new in SAS/STAT® 14.1, also performs more modern fully efficient fractional imputation (FEFI). FEFI is a variation of hot-deck imputation in which all potential donors in a cell contribute their values. Filling in missing values is only a part of PROC SURVEYIMPUTE. The real trick is to perform analyses of the filled-in data that appropriately account for the imputation. PROC SURVEYIMPUTE also creates a set of replicate weights that are adjusted for FEFI. Thus, if you use the imputed data from PROC SURVEYIMPUTE along with the replicate methods in any of the survey analysis procedures--SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, or SURVEYPHREG--you can be confident that inferences account not only for the survey design, but also for the imputation. This paper discusses different approaches for handling nonresponse in surveys, introduces PROC SURVEYIMPUTE, and demonstrates its use with real-world applications. It also discusses connections with other ways of handling missing values in SAS/STAT.
Pushpal Mukhopadhyay, SAS
Cancer is the second-leading cause of death in the United States. About 10% to 15% of all lung cancers are non-small cell lung cancer, but they constitute about 26.8% of cancer deaths. An efficient cancer treatment plan is therefore of prime importance for increasing the survival chances of a patient. Cancer treatments are generally given to a patient in multiple sittings, and doctors often tend to make decisions based on the improvement over time. Calculating the survival chances of the patient with respect to time and determining the various factors that influence the survival time would help doctors make more informed decisions about the further course of treatment and also help patients develop a proactive approach in making choices for the treatment. The objective of this paper is to analyze the survival time of patients suffering from non-small cell lung cancer, identify the time interval crucial for their survival, and identify various factors, such as age, gender, and treatment type, that influence it. The performances of the models built are analyzed to understand the significance of cubic splines used in these models. The data set is from the Cerner database with 548 records and 12 variables from 2009 to 2013. The patient records with loss to follow-up are censored. The survival analysis is performed using parametric and nonparametric methods in SAS® Enterprise Miner™ 13.1. The analysis revealed that the survival probability of a patient is high within two weeks of hospitalization and the probability of survival goes down by 60% between weeks 5 and 9 of admission. Age, gender, and treatment type play prominent roles in influencing the survival time and risk. The probability of survival for female patients decreases by 70% and 80% during weeks 6 and 7, respectively, and for male patients the probability of survival decreases by 70% and 50% during weeks 8 and 13, respectively.
Raja Rajeswari Veggalam, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Akansha Gupta, Oklahoma State University
So, you've encrypted your data. But is it safe? What would happen if that anonymous data you've shared isn't as anonymous as you think? Senior SAS® Consultant Andy Smith from Amadeus Software discusses the approaches hackers take to break encryption, and he shares simple, practical techniques to foil their efforts.
Andy Smith, Amadeus Software Limited
When business rules are deployed and executed, if the rule-fire outcomes--whether a rule fires or not--are not monitored, validated, or challenged, unintended business impacts can occur over time because the profiles or characteristics of the rules' input data change. Comparing scenarios using modified rules and visually comparing how they might impact your business can aid you in meeting compliance regulations, knowing your customers, and staying relevant or accurate in your particular business context. Visual analysis of rule outcomes is a powerful way to validate what is expected or to compare potential impacts that could lead to further investigation and refinement of existing rules. This paper shows how to use SAS® Visual Analytics and other visualizations to perform various types of meaningful and useful rule-fire outcome analysis with rules created in SAS® Business Rules Manager. Using visual graphical capabilities can give organizations or businesses a straightforward way to validate, monitor, and keep rules from running wild.
Charlotte Crain, SAS
Chris Upton, SAS
Universities, government agencies, and non-profits all require various levels of data analysis, data manipulation, and data management. This requires a workforce that knows how to use statistical packages. The server-based SAS® OnDemand for Academics: Studio offerings are excellent tools for teaching SAS in both postsecondary education and in professional continuing education settings. SAS OnDemand for Academics: Studio can be used to share resources; teach users using Windows, UNIX, and Mac OS computers; and teach in various computing environments. This paper discusses why one might use SAS OnDemand for Academics: Studio instead of SAS® University Edition, SAS® Enterprise Guide®, or Base SAS® for education and provides examples of how SAS OnDemand for Academics: Studio has been used for in-person and virtual training.
Charlotte Baker, Florida A&M University
C. Perry Brown, Florida Agricultural and Mechanical University
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®. Topics include recommendations gleaned from actual deployments from a variety of implementations and distributions. Techniques cover tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Nancy Rausch, SAS
Wilbram Hazejager, SAS
A survey is often the best way to obtain information and feedback to help in program improvement. At a bare minimum, survey research draws on economics, statistics, and psychology to develop a theory of how surveys measure and predict important aspects of the human condition. Drawing on healthcare-related research papers written by industry experts, I hypothesize several factors that are important and necessary for patients during hospital visits. Through text mining using SAS® Enterprise Miner™ 12.1, I measure tangible aspects of hospital quality and patient care using online survey responses and comments found on the website of the National Health Service in England. I use these patient comments to determine the majority opinion and whether it correlates with expert research on hospital quality. The implications of this research are vital in comprehending our health-care system and the factors that are important in satisfying patient needs. Analyzing survey responses can help us to understand and mitigate disparities in health-care services provided to population subgroups in the United States. Starting with online survey responses can help us to understand the overall methodology and motivation needed to analyze the surveys that have been developed for departments like the Centers for Medicare and Medicaid Services (CMS).
Divya Sridhar, Deloitte
In this session, we examine the use of text analytics as a tool for strategic analysis of an organization, a leader, a brand, or a process. The software solutions SAS® Enterprise Miner™ and Base SAS® are used to extract topics from text and create visualizations that identify the relative placement of the topics next to various business entities. We review a number of case studies that identify and visualize brand-topic relationships in the context of branding, leadership, and service quality.
Nicholas Evangelopoulos, University of North Texas
The recent controversy regarding former Secretary Hillary Clinton's use of a non-government, privately maintained email server provides a great opportunity to analyze real-world data using a variety of analytic techniques. This email corpus is interesting because of the challenges in acquiring and preparing the data for analysis as well as the variety of analyses that can be performed, including techniques for searching, entity extraction and resolution, natural language processing for topic generation, and social network analysis. Given the potential for politically charged discussion, rest assured there will be no discussion of politics--just fact-based analysis.
Michael Ames, SAS
Are you living in Heartbreak Hotel because your boss wants different statistics in the SAME column on your report? Need a currency symbol in one cell of your pre-summarized data, but a percent sign in another cell? Large blocks of text on your report have you all shook up because they wrap badly on your report? Have you hit the wall with PROC PRINT? Well, rock out of your jailhouse with ODS, DATA step, and PROC REPORT. This paper is a sequel to the popular 2008 paper Creating Complex Reports. The paper presents a nuts-and-bolts look at more complex report examples gleaned from SAS® Community Forum questions and questions from students. Examples will include use of DATA step manipulation to produce PROC REPORT and PROC SGPLOT output as well as examples of ODS LAYOUT and the new report writing interface. And PROC TEMPLATE makes a special guest appearance. Even though the King of Rock 'n' Roll won't be there for the presentation, perhaps we'll hear his ghost say Thank you very much, I always wanted to know how to do that at the end of this presentation.
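One of the tricks alluded to, mixing a dollar value and a percentage in the same report column, can be handled by formatting the values into a character column in a DATA step before calling PROC REPORT; the following is a minimal sketch with hypothetical data set and variable names.

   data pre;
      set summary;                      /* one row per measure */
      length display $20;
      if      measure = 'Revenue' then display = putn(value, 'dollar14.2');
      else if measure = 'Growth'  then display = putn(value, 'percent8.1');
   run;

   proc report data=pre;
      column measure display;
      define measure / display 'Measure';
      define display / display 'Value' style(column)=[just=right];
   run;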
Cynthia Zender, SAS
All public schools in the United States require health and safety education for their students. Furthermore, almost all states require driver education before minors can obtain a driver's license. Through extensive analysis of the Fatality Analysis Reporting System data, we have concluded that from 2011-2013 an average of 12.1% of all individuals killed in a motor vehicle accident in the United States, District of Columbia, and Puerto Rico were minors (18 years or younger). Our goal is to offer insight from our analysis that can improve road safety education and help prevent future premature deaths involving motor vehicles.
Molly Funk, Bryant University
Max Karsok, Bryant University
Michelle Williams, Bryant University
It's impossible to know all of SAS® or all of statistics. There will always be some technique that you don't know. However, there are a few techniques that anyone in biostatistics should know. If you can calculate those with SAS, life is all the better. In this session you will learn how to compute and interpret a baker's dozen of these techniques, including several statistics that are frequently confused. The following statistics are covered: prevalence, incidence, sensitivity, specificity, attributable fraction, population attributable fraction, risk difference, relative risk, odds ratio, Fisher's exact test, number needed to treat, and McNemar's test. With these 13 tools in their tool chest, even nonstatisticians or statisticians who are not specialists will be able to answer many common questions in biostatistics. The fact that each of these can be computed with a few statements in SAS makes the situation all the sweeter. Bring your own doughnuts.
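Several of the thirteen fall out of a single PROC FREQ call; a hedged sketch with hypothetical 2x2 data sets:

   /* relative risk, risk difference, chi-square, and Fisher's exact test */
   proc freq data=cohort;
      tables exposure*disease / relrisk riskdiff chisq;
      exact fisher;
   run;

   /* McNemar's test for paired (before/after) data */
   proc freq data=paired;
      tables before*after / agree;
   run;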
AnnMaria De Mars, 7 Generation Games
VBA has been described as a glue language and has been widely used for exchanging data between Microsoft products such as Excel, Word, and PowerPoint. How to trigger a VBA macro from SAS® via DDE has been widely discussed in recent years. However, using SAS to send parameters to a VBA macro has seldom been reported. This paper provides a solution for this problem. Copying Excel tables to PowerPoint using the combination of SAS and VBA is illustrated as an example. The SAS program rapidly scans all Excel files contained in one folder, passes the file information to VBA as parameters, and triggers the VBA macro to write PowerPoint files in a loop. As a result, a batch of PowerPoint files can be generated with just one mouse click.
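A stripped-down sketch of the mechanism (Excel must already be open; the sheet, workbook, folder, and macro names are hypothetical): SAS first writes the parameter into a worksheet cell that the VBA macro reads, then triggers the macro through the Excel system topic.

   /* 1. pass a parameter by writing it into cell A1 of Sheet1 */
   filename param dde 'excel|Sheet1!r1c1:r1c1';
   data _null_;
      file param;
      put 'C:\reports\to_convert';      /* folder the VBA macro will process */
   run;

   /* 2. trigger the VBA macro */
   filename cmds dde 'excel|system';
   data _null_;
      file cmds;
      put '[RUN("PERSONAL.XLSB!CopyTablesToPPT")]';
   run;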
Zhu Yanrong, Medtronic
Like a good pitcher and catcher in baseball, ODS layout and the ODS destination for PowerPoint are a winning combination in SAS® 9.4. With this dynamic duo, you can go straight from performing data analysis to creating a quality presentation. The ODS destination for PowerPoint produces native PowerPoint files from your output. When you pair it with ODS layout, you are able to dynamically place your output on each slide. Through code examples this paper shows you how to create a custom title slide, as well as place the desired number of graphs and tables on each slide. Don't be relegated to the sidelines--increase your winning percentage by learning how ODS layout works with the ODS destination for PowerPoint.
Jane Eslinger, SAS
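A hedged sketch of the pairing the paper describes, placing two pieces of output side by side on one slide; the file name and the SASHELP example data are placeholders, not the author's examples.

ods powerpoint file='winning.pptx';
ods layout gridded columns=2 rows=1;

ods region;
proc sgplot data=sashelp.class;
   vbar age;
run;

ods region;
proc means data=sashelp.class n mean;
   var height weight;
run;

ods layout end;
ods powerpoint close;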
Like all skilled tradespeople, SAS® programmers have many tools at their disposal. Part of their expertise lies in knowing when to use each tool. In this paper, we use a simple example to compare several common approaches to generating the requested report: the TABULATE, TRANSPOSE, REPORT, and SQL procedures. We investigate the advantages and disadvantages of each method and consider when applying it might make sense. A variety of factors are examined, including the simplicity, reusability, and extensibility of the code in addition to the opportunities that each method provides for customizing and styling the output. The intended audience is beginning to intermediate SAS programmers.
Josh Horstman, Nested Loop Consulting
SAS® Visual Analytics can display maps with your location information. However, you might need to display locations that do not match the categories found in the SAS Visual Analytics interface, such as street address locations or non-US postal code locations. You might also wish to display custom locations that are specific to your business or industry, such as the locations of power grid substations or railway mile markers. Alternatively, you might want to validate your address data. This presentation shows how PROC GEOCODE can be used to simplify the geocoding by processing your location information before putting your data into SAS Visual Analytics.
Darrell Massengill, SAS
The HPSUMMARY procedure provides data summarization tools to compute basic descriptive statistics for variables in a SAS® data set. It is a high-performance version of the SUMMARY procedure in Base SAS®. Though PROC SUMMARY is popular with data analysts, PROC HPSUMMARY is still a new kid on the block. The purpose of this paper is to provide an introduction to PROC HPSUMMARY by comparing it with its well-known counterpart, PROC SUMMARY. The comparison focuses on differences in syntax, options, and performance in terms of execution time and memory usage. Sample code, outputs, and SAS log snippets are provided to illustrate the discussion. Simulated large data is used to observe the performance of the two procedures. Using SAS® 9.4 installed on a single-user machine with four cores available, preliminary experiments that examine performance of the two procedures show that HPSUMMARY is more memory-efficient than SUMMARY when the data set is large. (For example, SUMMARY failed due to insufficient memory, whereas HPSUMMARY finished successfully.) However, there is no evidence of a performance advantage for PROC HPSUMMARY over PROC SUMMARY on this single-user machine.
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
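A side-by-side sketch (not from the paper) showing how closely the two procedures mirror each other; the data set BIGDATA and its variables are hypothetical, and the PERFORMANCE statement option is an assumption to be checked against the documentation.

proc summary data=bigdata nway;
   class region;
   var sales;
   output out=sum_stats mean=avg_sales std=sd_sales;
run;

proc hpsummary data=bigdata nway;
   class region;
   var sales;
   output out=hp_stats mean=avg_sales std=sd_sales;
   performance nthreads=4;   /* assumed option; controls threading on the local machine */
run;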
Omnichannel, and the omniscient customer experience, is most commonly used as a buzzword to describe the seamless customer experience in a traditional multi-channel marketing and sales environment. With more channels and methods of communication, there is a growing need to establish a more customer-centric way of dealing with all customer interactions, not only 1:1. Telenor, based out of Norway, is one of the world's major mobile operators with in excess of 200 million mobile subscriptions throughout 13 markets across Europe and Asia. The Norwegian home-market is a highly saturated and mature market in which customer demands and expectations are constantly rising. To deal with this and with increased competition, two major initiatives were established together with SAS®. The initiatives aimed both to address the need for real-time analytics and decision management in our inbound channel and to create an omnichannel experience across inbound and outbound channels. The projects were aimed at both business-to-consumer (B2C) and business-to-business (B2B) markets. With a significant legacy of back-end systems and a complex value chain, it was important to both improve the customer experience and simplify customer treatment, all without impacting the back-end system at large. The presentation sheds light on how the projects worked to meet the technical challenges alongside the need for an optimal customer experience. With results far exceeding expectations, the outcome has established the basis for further Customer Lifecycle Management (CLM) initiatives to strengthen both Net Promoter Score/Customer loyalty and revenue.
Jørn Tronstad, Telenor
SAS customers have a growing need for accurate and quality SAS® maps. SAS has licensed the map data from a third party to satisfy this need. This presentation explores the new map data by discussing the problems and limitations with the old map data, and the features and examples for using the new data.
Darrell Massengill, SAS
For marketers who are responsible for identifying the best customer to target in a campaign, it is often daunting to determine which media channel, offer, or campaign program is the one the customer is more apt to respond to, and therefore, is more likely to increase revenue. This presentation examines the components of designing campaigns to identify promotable segments of customers and to target the optimal customers using SAS® Marketing Automation integrated with SAS® Marketing Optimization.
Pamela Dixon, SAS
The SAS® procedure PROC TREE sketches parent-child lineage--also known as trees--from hierarchical data. Hierarchical relationships can be difficult to flatten out into a data set, but with PROC TREE, its accompanying ODS table TREELISTING, and some creative yet simple handcrafting, a de-lineage of parent-children variables can be derived. Because the focus of PROC TREE is to provide the tree structure in graphical form, it does not explicitly output the depth of the tree, although the depth is visualized in the accompanying graph. Perhaps unknown to most, the path length variable, or simply the height of the tree, can be extracted from PROC TREE merely by capturing it from the ODS output, as demonstrated in this paper.
Can Tongur, Statistics Sweden
Time flies. Thirteen years have passed since the first introduction of the hash method by SAS®. Dozens of papers have been written offering new (sometimes weird) applications for hash. Even beyond look-ups, we can use the hash object for summation, splitting files, sorting files, fuzzy matching, and other data processing problems. In SAS, almost any task can be solved by more than one method. For a class of problems where hash can be applied, is it the best tool to use? We present a variety of problems and provide several solutions, using both hash tables and more traditional methods. Comparing the solutions, you must consider computing time as well as your time to code and validate your responses. Which solution is better? You decide!
David Izrael, Abt Associates
Elizabeth Axelrod, Abt Associates
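One of the classic comparisons the authors allude to, sketched here with hypothetical data sets MAIN and LOOKUP and a hypothetical key ID: a keyed lookup done with a hash object versus the traditional sort-and-merge.

/* Hash solution: no pre-sorting of either input is required */
data matched;
   length region $20;                         /* assumed type/length of the looked-up variable */
   if _n_ = 1 then do;
      declare hash h(dataset:'work.lookup');
      h.defineKey('id');
      h.defineData('region');
      h.defineDone();
      call missing(region);
   end;
   set main;
   if h.find() = 0;                           /* keep rows whose ID is found in the lookup table */
run;

/* Traditional solution: both inputs must be sorted by ID first */
proc sort data=main;   by id; run;
proc sort data=lookup; by id; run;
data matched2;
   merge main(in=a) lookup(in=b);
   by id;
   if a and b;
run;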
This SAS® test begins where the SAS® Advanced Certification test leaves off. It has 25 questions to challenge even the most experienced SAS programmers. Most questions are about the DATA step, but there are a few macro and SAS/ACCESS® software questions thrown in. The session presents each question on a slide, then the answer on the next. You WILL be challenged.
Glen Becker, USAA
The Scheduler is an innovative tool that maps linear data (that is, time stamps) to an intuitive three-dimensional representation. This transformation enables the user to identify relationships, conflicts, gaps, and so on, that were not readily apparent in the data's native form. This tool has applications in operations research and litigation-related analyses. This paper discusses why the Scheduler was created, the data that is available to analyze the issue, and how this code can be used in other types of applications. Specifically, this paper discusses how the Scheduler can be used by supervisors to maintain the presence of three or more employees at all times to ensure that all federal- and state-mandated work breaks are taken. Additional examples of the Scheduler include assisting construction foremen in creating schedules that visualize the presence of contractors in a logical sequence while eliminating overlap and ensuring cushions of time between contractors, and matching items to non-uniform information.
Ariel Kumpinsky, The Claro Group, LLC
SAS® helps people make better decisions. But what makes a decision better? How can we make sure we are not making worse decisions? There are six tenets to follow to ensure we are making better decisions. Decisions are better when they are: (1) Aligned with your mission; (2) Complete; (3) Faster; (4) Accurate; (5) Accessible; and (6) Recurring, ongoing, or productionalized. By combining all of these aspects of making a decision, you can have confidence that you are making a better decision. The breadth of SAS software is examined to understand how it can be applied toward these tenets. Scorecards are used to ensure that your business stays aligned with goals. Data Management is used to bring together all of the data you have, to provide complete information. SAS® Visual Analytics offerings are unparalleled in their speed to enable you to make faster decisions. Exhaustive testing verifies accuracy. Modern, easy-to-use user interfaces are adapted for multiple languages and designed for a variety of users to ensure accessibility. And the powerful SAS data flow architecture is built for ongoing support of decisions. Several examples from the SAS® Solutions OnDemand group are used as case studies in support of these tenets.
Dan Jahn, SAS
I like to think that SAS® error messages come in three flavors: IMPORTANT, INTERESTING, and IRRELEVANT. SAS calls its messages NOTES, WARNINGS, and ERRORS. I intend to show you that not all NOTES are IRRELEVANT nor are all ERRORS IMPORTANT. This paper walks through many different scenarios and explains in detail the meaning and impact of messages presented in the SAS log. I show you how to locate, classify, analyze, and resolve many different SAS message types. And for those brave enough, I go on to teach you how to both generate and suppress messages sent to the SAS log. This paper presents an overview of messages that can often be found in a SAS Log window or output file. The intent of this presentation is to familiarize you with common messages, the meaning of the messages, and how to determine the best way to react to the messages. Notice I said "react," not necessarily "correct." Code examples and log output will be presented to aid in the explanations.
William E Benjamin Jr, Owl Computer Consultancy LLC
This paper is a primer on the practice of designing, selecting, and making inferences on a statistical sample, where the goal is to estimate the magnitude of error in a book value total. Although the concepts and syntax are presented through the lens of an audit of health-care insurance claim payments, they generalize to other contexts. After presenting the fundamental measures of uncertainty that are associated with sample-based estimates, we outline a few methods to estimate the sample size necessary to achieve a targeted precision threshold. The benefits of stratification are also explained. Finally, we compare several viable estimators to quantify the book value discrepancy, making note of the scenarios where one might be preferred over the others.
Taylor Lewis, U.S. Office of Personnel Management
Julie Johnson, OPM - Office of the Inspector General
Christine Muha, U.S. Office of Personnel Management
Specifying colors based on group values is a popular practice in visualizing data, but it is not so easy to do, especially when there are multiple group values. This paper explores three different methods to dynamically assign colors to plots based on their group values: combining the EVAL and IFN functions in the plot statements; bringing the DISCRETEATTRMAP block into the plot statements; and using the macro from SAS® Sample 40255.
Amos Shu, MedImmune
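A hedged sketch of the DISCRETEATTRMAP approach mentioned above (not the author's example); the data set TRIAL, the variable TREATMENT, and its group values 'Active' and 'Placebo' are assumptions for illustration.

proc template;
   define statgraph colorbygroup;
      begingraph;
         discreteattrmap name='treatcolors';
            value 'Active'  / markerattrs=(color=darkblue);
            value 'Placebo' / markerattrs=(color=darkred);
         enddiscreteattrmap;
         /* bind the map to the grouping variable */
         discreteattrvar attrvar=trtcolor var=treatment attrmap='treatcolors';
         layout overlay;
            scatterplot x=week y=response / group=trtcolor;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=trial template=colorbygroup;
run;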
This paper shows how SAS® can be used to obtain a Time Series Analysis of data regarding World War II. This analysis tests whether Truman's justification for the use of atomic weapons was valid. Truman believed that by using the atomic weapons, he would prevent unacceptable levels of U.S. casualties that would be incurred in the course of a conventional invasion of the Japanese islands.
Rachael Becker, University of Central Florida
Since it makes login transparent and does not send passwords over the wire, Integrated Windows Authentication (IWA) is both extremely convenient for end users and highly secure. However, for administrators, it is not easy to set up and rarely successful on the first attempt, so being able to effectively troubleshoot is imperative. In this paper, we take a step-by-step approach to configuring IWA, explain how and where to get useful debugging output, and share our hard-earned knowledge base of known problems and deciphered error messages.
Mike Roda, SAS
Inherently, mixed modeling with SAS/STAT® procedures (such as GLIMMIX, MIXED, and NLMIXED) is computationally intensive. Therefore, considerable memory and CPU time can be required. The default algorithms in these procedures might fail to converge for some data sets and models. This encore presentation of a paper from SAS Global Forum 2012 provides recommendations for circumventing memory problems and reducing execution times for your mixed-modeling analyses. This paper also shows how the new HPMIXED procedure can be beneficial for certain situations, as with large sparse mixed models. Lastly, the discussion focuses on the best way to interpret and address common notes, warnings, and error messages that can occur with the estimation of mixed models in SAS® software.
Phil Gibbs, SAS
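Not from the paper, but a minimal sketch of the kind of large sparse mixed model for which PROC HPMIXED is suggested; the data set FIELDTRIAL and its variables are hypothetical.

proc hpmixed data=fieldtrial;
   class variety block;
   model yield = variety;
   random block;          /* random effect with many sparse levels */
   test variety;          /* request a test of the fixed effect */
run;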
SAS® DS2 is a powerful new object-oriented programming language that was introduced with SAS® 9.4. Having been designed for advanced data manipulation, it enables the programmer not only to use a much wider range of data types than the traditional Base SAS® language offers, but also to create custom methods and packages. These packages are analogous to classes in other object-oriented languages such as C# or Ruby and can be used to massively improve your programming effectiveness. This paper demonstrates a number of techniques that can be used to take full advantage of DS2's object-oriented user-defined package capabilities. Examples include creating new data types, storing and reusing packages, using simple packages as building blocks in the creation of more complex packages, a technique to overcome DS2's lack of support for inheritance, and building lightweight packages to facilitate method overloading and make parameter passing simpler.
Chris Brooks, Melrose Analytics Ltd
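A hedged, minimal sketch of a DS2 user-defined package with one method (not one of the author's examples); the package name TEMPERATURE and the conversion method are assumptions chosen only to show the syntax.

proc ds2;
   /* define a reusable package with a single method */
   package temperature / overwrite=yes;
      method f_to_c(double f) returns double;
         return (f - 32) * 5 / 9;
      end;
   endpackage;

   /* instantiate the package and call the method */
   data _null_;
      method init();
         dcl package temperature t();
         dcl double c;
         c = t.f_to_c(98.6);
         put c=;
      end;
   enddata;
run;
quit;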
Are you going to enable HTTPS for your SAS® environment? Looking to improve the security of your SAS deployment? Do you need more details about how to efficiently configure HTTPS? This paper guides you through the configuration of SAS® 9.4 with HTTPS for the SAS middle tier. We examine how best to implement site-signed Transport Layer Security (TLS) certificates and explore how far you can take the encryption. This paper presents tips and proven practices that can help you be successful.
Stuart Rogers, SAS
Do you need a macro for your program? This paper provides some guidelines, based on user experience, and explores whether it's worth the time to create a macro (for example, a parameter-driven macro or just a simple macro variable). This paper is geared toward new users and experienced users who do not use macros.
Claudine Lougee, Dualenic
Data transformations serve many functions in data analysis, including improving normality of distribution and equalizing variance to meet assumptions and improve effect sizes. Traditionally, the first step in the analysis is to preprocess and transform the data to derive different representations for further exploration. But now in this era of big data, it is not always feasible to have transformed data available beforehand. Analysts need to conduct exploratory data analysis and subsequently transform data on the fly according to their needs. SAS® Visual Analytics has an expression builder component integrated into SAS® Visual Data Builder, SAS® Visual Analytics Explorer, and SAS® Visual Analytics Designer that helps you transform data on the fly. The expression builder enables you to create expressions that you can use to aggregate columns, perform multiple operations on data, and perform conditional processing. It supports numeric, comparison, Boolean, text, and date time operators, and different functions like Log, Ln, Mod, Exp, Power, Root, and so on. This paper demonstrates how you can use the expression builder that is integrated into the data builder, the explorer, and the designer to create different types of expressions and transform data for analysis and reporting purposes.
Atul Kachare, SAS
Sometimes data are not arranged the way you need them to be for the purpose of reporting or combining with other data. This presentation examines just such a scenario. We focus on the TRANSPOSE procedure and its ability to transform data to meet our needs. We explore this method as an alternative to using longhand code involving arrays and OUTPUT statements in a SAS® DATA step.
Faron Kincheloe, Baylor University
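A minimal sketch of the kind of restructuring the presentation covers (not the presenter's code); the data set SCORES_LONG and the variables STUDENT_ID, TERM, and SCORE are hypothetical.

proc sort data=scores_long;
   by student_id;
run;

/* one output row per student, one column per term */
proc transpose data=scores_long out=scores_wide(drop=_name_) prefix=term;
   by student_id;
   id term;
   var score;
run;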
Educational systems at the district, state, and national levels all report possessing amazing student-level longitudinal data systems (LDS). Are the LDS systems improving educational outcomes for students? Are they guiding development of effective instructional practices? Are the standardized exams measuring student knowledge relative to the learning expectations? Many questions exist about the effective use of the LDS system and educational data, but data architecture and analytics (including the products developed by SAS®) are not designed to answer any of these questions. However, the ability to develop more effective educational interfaces, improve use of data to the classroom level, and improve student outcomes, might only be available through use of SAS. The purpose of this session and paper is to demonstrate an integrated use of SAS tools to guide the transformation of data to analytics that improve educational outcomes for all students.
Sean Mulvenon, University of Arkansas
At the University of Central Florida (UCF), we recently invested in SAS® Visual Analytics, along with the updated SAS® Business Intelligence platform (from 9.2 to 9.4), a project that took over a year to be completed. This project was undertaken to give our users the best and most updated tools available. This paper introduces the SAS Visual Analytics environment at UCF and includes projects created using this product. It answers why we selected SAS Visual Analytics for development over other SAS® applications. It explains the technical environment for our non-distributed SAS Visual Analytics: RAM, servers, benchmarking, sizing, and scaling. It discusses why we chose the non-distributed mode versus distributed mode. Challenges in the design, implementation, usage, and performance are also presented, including the reasons why Hadoop was not adopted.
Scott Milbuta, University of Central Florida
Ulf Borjesson, University of Central Florida
Carlos Piemonti, University of Central Florida
If data sets are being merged and a variable occurs in more than one of the data sets, the last value (from the variable in the right-most data set listed in the MERGE statement) is kept. It is critical, therefore, to know the name and sources of any common variables. This paper reviews simple techniques for noting overwriting in the log, summarizing the names and sources of all variables, and identifying common variables before merging files.
Christopher Bost, MDRC
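A hedged sketch of one way to identify common variables before merging, queried from the DICTIONARY.COLUMNS view; the WORK data set names DEMOG and LABS are placeholders for the data sets about to be merged.

proc sql;
   create table common_vars as
   select upcase(name) as variable,
          count(*)     as n_data_sets,
          min(memname) as first_source,
          max(memname) as last_source
   from dictionary.columns
   where libname = 'WORK'
     and memname in ('DEMOG', 'LABS')
   group by 1
   having count(*) > 1;       /* variables that appear in more than one data set */
quit;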
Machine learning is gaining popularity, fueled by the rapid advancement of computing technology. Machine learning offers the ability to continuously adjust predictive models based on real-time data, and to visualize changes and take action on the new information. Hear from two PhD's from SAS and Intel about how SAS® machine learning, working with SAS streaming analytics and Intel microprocessor technology, makes this possible.
This session describes the construction of a program that converts any set of relational tables into a single flat file, using Base SAS® in SAS® 9.3. The program gets the information it needs from the data tables themselves, with a minimum of user configuration. It automatically detects one-to-many relationships and creates sets of replicate variables in the flat file. The program illustrates the use of macro programming and the SQL, DATASETS, and TRANSPOSE procedures, among others. The output is sent to a Microsoft Excel spreadsheet using the Output Delivery System (ODS).
Dav Vandenbroucke, HUD
SAS® High-Performance Risk distributes financial risk data and big data portfolios with complex analyses across a networked Hadoop Distributed File System (HDFS) grid to support rapid in-memory queries for hundreds of simultaneous users. This data is extremely complex and must be stored in a proprietary format to guarantee data affinity for rapid access. However, customers still desire the ability to view and process this data directly. This paper demonstrates how to use the HPRISK custom file reader to directly access risk data in Hadoop MapReduce jobs, using the HPDS2 procedure and the LASR procedure.
Mike Whitcher, SAS
Stacey Christian, SAS
Phil Hanna, SAS Institute
Don McAlister, SAS
Over the years, complex medical data has been captured and retained in a variety of legacy platforms that are characterized by special formats, hierarchical/network relationships, and relational databases. Due to the complex nature of the data structures and the capacity constraints associated with outdated technologies, high-value healthcare data locked in legacy systems has been restricted in its use for analytics. With the emergence of highly scalable, big data technologies such as Hadoop, now it's possible to move, transform, and enable previously locked legacy data economically. In addition, sourcing such data once on an enterprise data hub enables the ability to efficiently store, process, and analyze the data from multiple perspectives. This presentation illustrates how legacy data is not only unlocked but enabled for rapid visual analytics by using Cloudera's enterprise data hub for data transformation and the SAS® ecosystem.
UNC General Administration's Division of Institutional Research (UNCGA IR) collects student-level data on enrollment, graduation, courses, grades, financial aid, facilities, and personnel from the 16 public universities. The university system is transitioning from data collection and processing of flat files submitted via SAS/IntrNet® to an Oracle Data Mart that retrieves the data from the campuses' source systems. After collecting the data, either via flat files or data mart, SAS® data sets are created and used for analysis and reporting for public and private universities, policy makers, elected officials, and the general public. After a complete turnover in programming staff, UNCGA IR decided to make some major upgrades and improvements to a server and system that were quickly becoming unsupported and outdated. This presentation covers the following topics: a background of our office functions, including our SAS environment; the decision-making process to upgrade to a SAS® BI and SAS® Visual Analytics environment, using SAS® 9.3 and eventually SAS® 9.4; the factors in determining which processes should be moved to a PHP interface, to SAS® Stored Processes, and to SAS Visual Analytics (including whether the user was internal or external, whether the user had access to secure data, and how much time was involved in making the changes); the process and challenges of implementing the changes; and the lessons learned. This presentation has a universal audience but is specifically intended for anyone who has gone through or plans to go through a SAS upgrade, has inherited spaghetti code and used it to do their job, or has used SAS/IntrNet, SAS Visual Analytics, SAS Stored Processes, or a PHP interface to initiate SAS programs.
Laura Simpson, University of North Carolina General Administration Division of Institutional Research
This paper proposes a technique to implement wavelet analysis (WA) for improving a forecasting accuracy of the autoregressive integrated moving average model (ARIMA) in nonlinear time-series. With the assumption of the linear correlation, and conventional seasonality adjustment methods used in ARIMA (that is, differencing, X11, and X12), the model might fail to capture any nonlinear pattern. Rather than directly model such a signal, we decompose it to less complex components such as trend, seasonality, process variations, and noises, using WA. Then, we use them as exogenous variables in the autoregressive integrated moving average with explanatory variable model (ARIMAX). We describe a background of WA. Then, the code and a detailed explanation of WA based on multi-resolution analysis (MRA) in SAS/IML® software are demonstrated. The idea and mathematical basis of ARIMA and ARIMAX are also given. Next, we demonstrate our technique in forecasting applications using SAS® Forecast Studio. The demonstrated time-series are nonlinear in nature from different fields. The results suggest that WA effects are good regressors in ARIMAX, which captures nonlinear patterns well.
Woranat Wongdhamma, Oklahoma State University
Administrative health databases, including hospital and physician records, are frequently used to estimate the prevalence of chronic diseases. Disease-surveillance information is used by policy makers and researchers to compare the health of populations and develop projections about disease burden. However, not all cases are captured by administrative health databases, which can result in biased estimates. Capture-recapture (CR) models, originally developed to estimate the sizes of animal populations, have been adapted for use by epidemiologists to estimate the total sizes of disease populations for such conditions as cancer, diabetes, and arthritis. Estimates of the number of cases are produced by assessing the degree of overlap among incomplete lists of disease cases captured in different sources. Two- and three-source CR models are most commonly used, often with covariates. Two important assumptions--independence of capture in each data source and homogeneity of capture probabilities, which underlie conventional CR models--are unlikely to hold in epidemiological studies. Failure to satisfy these assumptions bias the model results. Log-linear, multinomial logistic regression, and conditional logistic regression models, if used properly, can incorporate dependency among sources and covariates to model the effect of heterogeneity in capture probabilities. However, none of these models is optimal, and researchers might be unfamiliar with how to use them in practice. This paper demonstrates how to use SAS® to implement the log-linear, multinomial logistic regression, and conditional logistic regression CR models. Methods to address the assumptions of independence between sources and homogeneity of capture probabilities for a three-source CR model are provided. The paper uses a real numeric data set about Parkinson's disease involving physician claims, hospital abstract, and prescription drug records from one Canadian province. Advantages and disadvantages of each model are discussed.
Lisa Lix, University of Manitoba
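As a hedged sketch of just the log-linear piece of this approach (not the author's code, and omitting the multinomial and conditional logistic models), suppose the three sources are coded as 0/1 indicators S1-S3 and COUNT holds the number of cases observed with each capture history; the Poisson model below includes pairwise dependence terms, and the unobserved (0,0,0) cell is what the fitted model extrapolates.

proc genmod data=capture_histories;
   class s1 s2 s3 / param=ref;
   model count = s1 s2 s3 s1*s2 s1*s3 s2*s3 / dist=poisson link=log;
run;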
Health care and other programs collect large amounts of information in order to administer the program. A health insurance plan, for example, probably records every physician service, hospital and emergency department visit, and prescription medication--information that is collected in order to make payments to the various providers (hospitals, physicians, pharmacists). Although the data are collected for administrative purposes, these databases can also be used to address research questions, including questions that are unethical or too expensive to answer using randomized experiments. However, when subjects are not randomly assigned to treatment groups, we worry about assignment bias--the possibility that the people in one treatment group were healthier, smarter, more compliant, etc., than those in the other group, biasing the comparison of the two treatments. Propensity score methods are one way to adjust the comparisons, enabling research using administrative data to mimic research using randomized controlled trials. In this presentation, I explain what the propensity score is, how it is used to compare two treatments, and how to implement a propensity score analysis using SAS®. Two methods using propensity scores are presented: matching and inverse probability weighting. Examples are drawn from health services research using the administrative databases of the Ontario Health Insurance Plan, the single payer for medically necessary care for the 13 million residents of Ontario, Canada. However, propensity score methods are not limited to health care. They can be used to examine the impact of any nonrandomized program, as long as there is enough information to calculate a credible propensity score.
Ruth Croxford, Institute for Clinical Evaluative Sciences
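A minimal sketch of the inverse probability weighting workflow described above (matching is omitted here); this is not the presenter's code, and the data set COHORT and the covariates are hypothetical.

/* Step 1: model the probability of treatment (the propensity score) */
proc logistic data=cohort descending;
   class sex (ref='F') / param=ref;
   model treated = age sex comorbidity_count prior_visits;
   output out=ps p=pscore;
run;

/* Step 2: construct inverse probability of treatment weights */
data weighted;
   set ps;
   if treated = 1 then iptw = 1 / pscore;
   else iptw = 1 / (1 - pscore);
run;

/* Step 3: weighted comparison of the outcome between treatment groups */
proc genmod data=weighted;
   class treated (ref='0') / param=ref;
   model outcome = treated / dist=binomial link=logit;
   weight iptw;
run;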
Someone has aptly said, Las Vegas looks the way one would imagine heaven must look at night. What if you know the secret to run a plethora of various businesses in the entertainment capital of the world? Nothing better, right? Well, we have what you want, all the necessary ingredients for you to precisely know what business to target in a particular locality of Las Vegas. Yelp, a community portal, wants to help people finding great local businesses. They cover almost everything from dentists and hair stylists through mechanics and restaurants. Yelp's users, Yelpers, write reviews and give ratings for all types of businesses. Yelp then uses this data to make recommendations to the Yelpers about which institutions best fit their individual needs. We have the yelp academic data set comprising 1.6 million reviews and 500K tips by 366K in 61K businesses across several cities. We combine current Yelp data from all the various data sets for Las Vegas to create an interactive map that provides an overview of how a business runs in a locality and how the ratings and reviews tickers a business. We answer the following questions: Where is the most appropriate neighborhood to start a new business (such as cafes, bars, and so on)? Which category of business has the greatest total count of reviews that is the most talked about (trending) business in Las Vegas? How does a business' working hours affect the customer reviews and the corresponding rating of the business? Our findings present research for further understanding of perceptions of various users, while giving reviews and ratings for the growth of a business by encompassing a variety of topics in data mining and data visualization.
Anirban Chakraborty, Oklahoma State University
The Statistical Graphics (SG) procedures and the Graph Template Language (GTL) are capable of generating powerful individual data displays. What if one wanted to visualize how a distribution changes with different parameters, or to view multiple aspects of a three-dimensional plot, just as two examples? By using macros to generate a graph for each frame, combined with the ODS PRINTER destination, it is possible to create GIF files to create effective animated data displays. This paper outlines the syntax and strategy necessary to generate these displays and provides a handful of examples. Intermediate knowledge of PROC SGPLOT, PROC TEMPLATE, and the SAS® macro language is assumed.
Jesse Pratt, Cincinnati Children's Hospital Medical Center
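A hedged outline of the mechanics described (the animation-related system option names below are recalled from common published recipes and should be verified against the SAS 9.4 documentation; the macro, data sets, and file name are hypothetical).

options printerpath=gif animation=start animduration=0.5
        animloop=yes noanimoverlay nodate nonumber;
ods printer file='distribution_shift.gif';

%macro frames(n);
   %do i = 1 %to &n;
      proc sgplot data=frame&i;      /* one pre-built data set per animation frame */
         histogram x;
         density x / type=normal;
      run;
   %end;
%mend frames;
%frames(10)

options animation=stop;
ods printer close;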
Many SAS® procedures can be used to analyze large amounts of correlated data. This study was a secondary analysis of data obtained from the South Carolina Revenue and Fiscal Affairs Office (RFA). The data includes medical claims from all health care systems in South Carolina (SC). This study used the SAS procedure GENMOD to analyze a large amount of correlated data about Military Health Care (MHS) system beneficiaries who received behavioral health care in South Carolina Health Care Systems from 2005 to 2014. Behavioral health (BH) was defined by Major Diagnostic Code (MDC) 19 (mental disorders and diseases) and 20 (alcohol/drug use). MDCs are formed by dividing all possible principal diagnoses from the International Classification of Diseases (ICD-9) codes into 25 mutually exclusive diagnostic categories. The sample included a total of 6,783 BH visits and 4,827 unique adult and child patients, including military service members, veterans, and their adult and child dependents who have MHS insurance coverage. The PROC GENMOD analysis used a multivariate GEE model with type of BH visit (mental health or substance abuse) as the dependent variable, and with gender, race group, age group, and discharge year as predictors. Hospital ID was used in the REPEATED statement with different correlation structures. Gender was significant with both independent correlation (p = .0001) and exchangeable structure (p = .0003). However, age group was significant using the independent correlation (p = .0160), but non-significant using the exchangeable correlation structure (p = .0584). SAS is a powerful statistical program for analyzing large amounts of correlated data with categorical outcomes.
Abbas Tavakoli, University of South Carolina
Jordan Brittingham, USC/ Arnold School of Public Health
Nikki R. Wooten, USC/College of Social Work
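A hedged sketch of the GEE setup the abstract describes; the variable names are inferred from the abstract rather than taken from the authors' program, and the data set name BH_VISITS is an assumption.

proc genmod data=bh_visits descending;
   class hospital_id gender race_group age_group discharge_year;
   model visit_type = gender race_group age_group discharge_year
         / dist=binomial link=logit type3;
   /* cluster visits within hospital; TYPE= controls the working correlation structure */
   repeated subject=hospital_id / type=exch corrw;
run;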
Sensitive data requires elevated security requirements and the flexibility to apply logic that subsets data based on user privileges. Following the instructions in SAS® Visual Analytics: Administration Guide gives you the ability to apply row-level permission conditions. After you have set the permissions, you have to prove through audits who has access and what row-level security is applied. This paper provides you with the ability to easily apply, validate, report, and audit all tables that have row-level permissions, along with the groups, users, and conditions that will be applied. Take the hours of maintenance and lack of visibility out of row-level secure data and build confidence in the data and analytics that are provided to the enterprise.
Brandon Kirk, SAS
The HPSPLIT procedure, a powerful new addition to SAS/STAT®, fits classification and regression trees and uses the fitted trees for interpretation of data structures and for prediction. This presentation shows the use of PROC HPSPLIT to analyze a number of data sets from different subject areas, with a focus on prediction among subjects and across landscapes. A critical component of these analyses is the use of graphical tools to select a model, interpret the fitted trees, and evaluate the accuracy of the trees. The use of different kinds of cross validation for model selection and model assessment is also discussed.
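Not part of the presentation itself, but a minimal sketch of a PROC HPSPLIT call of the kind discussed; the data set PATIENTS and its variables are hypothetical.

proc hpsplit data=patients seed=20160401;
   class outcome region sex;
   model outcome = region sex age bmi num_visits;
   grow entropy;                 /* splitting criterion for growing the tree */
   prune costcomplexity;         /* cost-complexity pruning, selected by cross validation */
run;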
For SAS® users, PROC TABULATE and PROC REPORT (and its compute blocks) are probably among the most common procedures for calculating and displaying data. It is, however, pretty difficult to calculate and display changes from one column to another using data from other rows with just these two procedures. Compute blocks in PROC REPORT can calculate additional columns, but it would be challenging to pick up values from other rows as inputs. This presentation shows how PROC TABULATE can work with the lag(n) function to calculate rates of change from one period of time to another. This offers the flexibility of feeding into calculations the data retrieved from other rows of the report. PROC REPORT is then used to produce the desired output. The same approach can also be used in a variety of scenarios to produce customized reports.
Khoi To, Office of Planning and Decision Support, Virginia Commonwealth University
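A hedged sketch of the underlying idea, using the LAG function in a DATA step to pull a value from the prior row and PROC REPORT to display the result; this is not the author's code, and the data set ENROLLMENT and its variables are hypothetical.

proc sort data=enrollment;
   by term;
run;

data change;
   set enrollment;
   prev_count = lag(headcount);                          /* value from the previous row */
   if _n_ > 1 then pct_change = (headcount - prev_count) / prev_count;
run;

proc report data=change nowd;
   columns term headcount pct_change;
   define term       / display 'Term';
   define headcount  / display 'Headcount' format=comma8.;
   define pct_change / display 'Change vs. prior term' format=percent8.1;
run;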
Stephen Curry, James Harden, and LeBron James are considered to be three of the most gifted professional basketball players in the National Basketball Association (NBA). Each year the Kia Most Valuable Player (MVP) award is given to the best player in the league. Stephen Curry currently holds this title, followed by James Harden and LeBron James, the first two runners-up. The decision for MVP was made by a panel of judges comprised of 129 sportswriters and broadcasters, along with fans who were able to cast their votes through NBA.com. Did the judges make the correct decision? Is there statistical evidence that indicates that Stephen Curry is indeed deserving of this prestigious title over James Harden and LeBron James? Is there a significant difference between the two runners-up? These are some of the questions that are addressed through this project. Using data collected from NBA.com for the 2014-2015 season, a variety of parametric and nonparametric k-sample methods were used to test 20 quantitative variables. In an effort to determine which of the three players is the most deserving of the MVP title, post-hoc comparisons were also conducted on the variables that were shown to be significant. The time-dependent variables were standardized, because there was a significant difference in the number of minutes each athlete played. These variables were then tested and compared with those that had not been standardized. This led to significantly different outcomes, indicating that the results of the tests could be misleading if the time variable is not taken into consideration. Using the standardized variables, the results of the analyses indicate that there is a statistically significant difference in the overall performances of the three athletes, with Stephen Curry outplaying the other two players. However, the difference between James Harden and LeBron James is not so clear.
Sherrie Rodriguez, Kennesaw State University
Most businesses have benefited from using advanced analytics for marketing and other decision making. But applying analytical techniques to pharmaceutical marketing is challenging and still emerging, because it is critical to ensure that the analysis makes sense from the medical side. Which drug is ultimately consumed for a specific disease is directly or indirectly influenced by many factors, including the disease origins, health-care system policy, physicians' clinical decisions, and the patients' perceptions and behaviors. The key to pharmaceutical marketing is identifying the targeted populations for specific diseases and focusing on those populations. Because the health-care environment consistently changes, predictive models are important for predicting the change of the targeted population over time based on the patient journey and epidemiology. Time series analysis is used to forecast the number of cases of infectious diseases; correspondingly, over-the-counter and prescribed medicines for the specific disease can be predicted. The accurate prediction provides valuable information for the strategic planning of campaigns. For different diseases, different analytical techniques are applied. By taking the medical features of the disease and epidemiology into account, the prediction of the potential and total addressable markets can reveal more insightful marketing trends. And by simulating the important factors and quantifying how they impact the patient journey within the typical health-care system, the most accurate demand for specific medicines or treatments can be discovered. Through monitoring the parameters in the dynamic simulation, smart decisions can be made using what-if comparisons to optimize the marketing result.
Xue Yao, Winnipeg Regional Health Aurthority
An important strength of observational studies is the ability to estimate a key behavior's or treatment's effect on a specific health outcome. This is a crucial strength as most health outcomes research studies are unable to use experimental designs due to ethical and other constraints. Keeping this in mind, one drawback of observational studies (that experimental studies naturally control for) is that they lack the ability to randomize their participants into treatment groups. This can result in the unwanted inclusion of a selection bias. One way to adjust for a selection bias is through the use of a propensity score analysis. In this study, we provide an example of how to use these types of analyses. Our concern is whether recent substance abuse has an effect on an adolescent's identification of suicidal thoughts. In order to conduct this analysis, a selection bias was identified and adjustment was sought through three common forms of propensity scoring: stratification, matching, and regression adjustment. Each form is separately conducted, reviewed, and assessed as to its effectiveness in improving the model. Data for this study was gathered through the Youth Risk Behavior Surveillance System, an ongoing nation-wide project of the Centers for Disease Control and Prevention. This presentation is designed for any level of statistician, SAS® programmer, or data analyst with an interest in controlling for selection bias, as well as for anyone who has an interest in the effects of substance abuse on mental illness.
Deanna Schreiber-Gregory, National University
Have you questioned the Read throughput or Write throughput for your Windows system drive? What about the input/output (I/O) throughput of a non-system drive? One solution is use SASIOTEST.EXE to measure the Read or Write throughput for any drive connected to your system. Since SAS® 9.2, SASIOTEST.EXE has been included with each release of SAS for Windows. This paper explains the options supported by SASIOTEST and the various ways to use SASIOTEST. It also describes how the I/O relates to SAS I/O on Windows.
Mike Jones, SAS
The increasing popularity and affordability of wearable devices, together with their ability to provide granular physical activity data down to the minute, have enabled researchers to conduct advanced studies on the effects of physical activity on health and disease. This presents statistical programmers with the challenge of processing data and translating it into analyzable measures. One such measure is the number of time-specific bouts of moderate to vigorous physical activity (MVPA) (similar to exercise), which is needed to determine whether the participant meets current physical activity guidelines (for example, 150 minutes of MVPA per week performed in bouts of at least 20 minutes). In this paper, we illustrate how we used SAS® arrays to calculate the number of 20-minute bouts of MVPA per day. We provide working code on how we processed Fitbit Flex data from 63 healthy volunteers whose physical activities were monitored daily for a period of 12 months.
Faith Parsons, Columbia University Medical Center
Keith M Diaz, Columbia University Medical Center
Jacob E Julian, Columbia University Medical Center
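A hedged sketch of the array logic described (not the authors' published code): assuming one row per participant-day with minute-level indicators MIN1-MIN1440 where 1 marks a minute classified as MVPA, the step counts runs of at least 20 consecutive MVPA minutes.

data bouts;
   set minute_level;                       /* hypothetical: one row per participant-day */
   array mvpa{1440} min1-min1440;
   n_bouts = 0;
   run_len = 0;
   do i = 1 to 1440;
      if mvpa{i} = 1 then run_len + 1;
      else do;
         if run_len >= 20 then n_bouts + 1;
         run_len = 0;
      end;
   end;
   if run_len >= 20 then n_bouts + 1;      /* bout that runs to the end of the day */
   keep participant_id day n_bouts;
run;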
Our daily work in SAS® involves manipulation of many independent data sets, and a lot of time can be saved if independent data sets can be manipulated simultaneously. This paper presents our interface RunParallel, which opens multiple SAS sessions and controls which SAS procedures and DATA steps to run on which sessions by parsing comments such as /*EP.SINGLE*/ and /*EP.END*/. The user can easily parallelize any code by simply wrapping procedure steps and DATA steps in such comments and executing in RunParallel. The original structure of the SAS code is preserved so that it can be developed and run in serial regardless of the RunParallel comments. When applied in SAS programs with many external data sources and heavy computations, RunParallel can give major performance boosts. Among our examples we include a simulation that demonstrates how to run DATA steps in parallel, where the performance gain greatly outweighs the single minute it takes to add RunParallel comments to the code. In a world full of big data, a lot of time can be saved by running in parallel in a comprehensive way.
Jingyu She, Danica Pension
Tomislav Kajinic, Danica Pension
The Add Health Parent Study is using a new and innovative method to augment our other interview verification strategies. Typical verification strategies include calling respondents to ask questions about their interview, recording pieces of interaction (CARI - Computer Aided Recorded Interview), and analyzing timing data to see that each interview was within a reasonable length. Geocoding adds another tool to the toolbox for verifications. By applying street-level geocoding to the address where an interview is reported to be conducted and comparing that to a captured latitude/longitude reading from a GPS tracking device, we are able to compute the distance between two points. If that distance is very small and time stamps are close to each other, then the evidence points to the field interviewer being present at the respondent's address during the interview. For our project, the street-level geocoding to an address is done using SAS® PROC GEOCODE. Our paper describes how to obtain a US address database from the SAS website and how it can be used in PROC GEOCODE. We also briefly compare this technique to using the Google Map API and Python as an alternative.
Chris Carson, RTI International
Lilia Filippenko, RTI International
Mai Nguyen, RTI International
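A hedged sketch of the distance check described above, using the GEODIST function on the geocoded coordinates and the GPS reading; the data set names, the 0.5 km tolerance, and the variable names are assumptions, and the comment about PROC GEOCODE's X/Y convention should be verified against the documentation.

data verify;
   merge interviews_geocoded(keep=case_id x y)
         gps_readings(keep=case_id gps_lat gps_lon);
   by case_id;
   /* PROC GEOCODE output stores longitude in X and latitude in Y */
   distance_km = geodist(y, x, gps_lat, gps_lon, 'K');
   suspect = (distance_km > 0.5);          /* flag interviews farther than the assumed tolerance */
run;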
The Coordination for the Improvement of Higher Education Personnel (CAPES) is a foundation within the Ministry of Education in Brazil whose central purpose is to coordinate efforts to promote high standards for postgraduate programs inside the country. Structured in a SAS® data warehouse, vast amounts of information about the National Postgraduate System (SNPG) are collected and analyzed daily. This data must be accessed by different operational and managerial profiles, on desktops and mobile devices (in this case, using SAS® Mobile BI). Therefore, accurate and fresh data must be maintained so that it is possible to calculate statistics and indicators about programs, courses, teachers, students, and intellectual productions. By using SAS programming within SAS® Enterprise Guide®, all statistical calculations are performed and the results become available for exploration and presentation in SAS® Visual Analytics. Using the report designing tool, an excellent user experience is created by integrating the reports into Sucupira Platform, an online tool designed to provide greater data transparency for the academic community and the general public. This integration is made possible through the creation of public access reports with automatic authentication of guest users, presented within iframes inside the Foundation's platform. The content of the reports is grouped by scope, which makes it possible to view the indicators in different forms of presentation, to apply filters (including from URL GET parameters), and to execute stored processes.
Leonardo de Lima Aguirre, Coordination for the Improvement of Higher Education Personnel
Sergio da Costa Cortes, Coordination for the Improvement of Higher Education Personnel
Marcus Vinicius de Olivera Palheta, Capes
In many discrete-event simulation projects, the chief goal is to investigate the performance of a system. You can use output data to better understand the operation of the real or planned system and to conduct various what-if analyses. But you can also use simulation for validation--specifically, to validate a solution found by an optimization model. In optimization modeling, you almost always need to make some simplifying assumptions about the details of the system you are modeling. These assumptions are especially important when the system includes random variation--for example, in the arrivals of individuals, their distinguishing characteristics, or the time needed to complete certain tasks. A common approach holds each random element at some nominal value (such as the mean of its observed values) and proceeds with the optimization. You can do better. In order to test an optimization model and its underlying assumptions, you can build a simulation model of the system that uses the optimal solution as an input and simulates the system's detailed behavior. The simulation model helps determine how well the optimal solution holds up when randomness and perhaps other logical complexities (which the optimization model might have ignored, summarized, or modeled only approximately) are accounted for. Simulation might confirm the optimization results or highlight areas of concern in the optimization model. This paper describes cases in which you can use simulation and optimization together in this manner and discusses the positive implications of this complementary analytic approach. For the reader, prior experience with optimization and simulation is helpful but not required.
Ed Hughes, SAS
Emily Lada, SAS Institute Inc.
Leo Lopes, SAS Institute Inc.
Imre Polik, SAS
SAS® Studio includes tasks that can be used to generate SAS® programs to process SAS data sets. The graph tasks generate SAS programs that use ODS Graphics to produce a range of plots and charts. SAS® Enterprise Guide 7.1 and SAS® Add-In for Microsoft Office 7.1 can also make use of these SAS Studio tasks to generate graphs with ODS Graphics, even though their built-in tasks use SAS/GRAPH®. This paper describes these SAS Studio graph tasks.
Philip Holland, Holland Numerics Ltd
Cycling is one of the fastest growing recreational activities and sports in India. Many non-government organizations (NGOs) support this emission-free mode of transportation that helps protect the environment from air pollution hazards. Lots of cyclist groups in metropolitan areas organize numerous ride events and build social networks. Although India was a bit late in joining the Internet, social networking sites are getting popular in every Indian state and are expected to grow after the announcement of the Digital India Project. Many networking groups and cycling blogs share tons of information and resources across the globe. However, these blog posts are difficult to categorize according to their relevance, making it difficult to access required information quickly on the blogs. This paper provides ways to categorize the content of these blog posts and classify them in meaningful categories for easy identification. The initial data set is created by scraping online cycling blog posts (for example, Cyclists.in, velocrushindia.com, and so on) using Python. The initial data set consists of 1,446 blog posts with titles and blog content since 2008. Approximately 25% of the blog posts are manually categorized into six different categories, such as Ride stories, Events, Nutrition, Bicycle hacks, Bicycle Buy and Sell, and Others, by three independent raters to generate a training data set. The common blog-post categories are identified, and a text classification model is built to identify the blog-post categories and to generate text rules for classifying the blogs into those categories using SAS® Text Miner. To improve the look and feel of the social networking blog, a tool developed in JavaScript automatically classifies the blog posts into the six categories and provides appropriate suggestions to the blog user.
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Seeing business metrics in real time enables a company to understand and respond to ever-changing customer demands. In reality, though, obtaining such metrics in real time is not always easy. However, SAS Australia and New Zealand Technical Support solved that problem by using SAS® Visual Analytics to develop a 16-display command center in the Sydney office. Using this center to provide real-time data enables the Sydney office to respond to customer demands across the entire South Asia region. The success of this deployment makes reporting capabilities and data available for Technical Support hubs in Wellington, Mumbai, Kuala Lumpur, and Singapore--covering a total distance of 12,360 kilometers (approximately 7,680 miles). By sharing SAS Visual Analytics report metrics on displays spanning multiple time zones that cover a 7-hour time difference, SAS Australia and New Zealand Technical Support has started a new journey of breaking barriers, collaborating more closely, and providing fuel for innovation and change for an entire region. This paper is aimed at individuals or companies who want to learn how SAS Australia & New Zealand Technical Support developed its command center and who are inspired to do the same for their offices!
Chris Blake, SAS
Marriage laws have changed in 38 states, including the District of Columbia, permitting same-sex couples to marry since 2004. However, since 1996, 13 states have banned same-sex marriages, some of which are under legal challenge. A U.S. animated map, made using the SAS® GMAP procedure in SAS® 9.4, shows the changes in the acceptance, rejection, or undecided legal status of this issue, year-by-year. In this study, we also present other SAS data visualization techniques to show how concentrations (percentages) of married same-sex couples and their relationships have evolved over time, thereby instigating changes in marriage laws. Using a SAS GRAPH display, we attempt to show how both the total number of same-sex-couple households and the percent that are married have changed on a state-by-state basis since 2005. These changes are also compared to the year in which marriage laws permitting same-sex marriages were enacted, and are followed over time since that year. It is possible to examine trends on an annual basis from 2005 to 2013 using the American Community Survey one-year estimates. The SAS procedures and SAS code used to create the data visualization and data manipulation techniques are provided.
Bula Ghose, HHS/ASPE
Susan Queen, National Center for Health Statistics
Joan Turek, Health and Human Services
Multivariate statistical analysis plays an increasingly important role as the number of variables being measured increases in educational research. In both cognitive and noncognitive assessments, many instruments that researchers aim to study contain a large number of variables, with each measured variable assigned to a specific factor of the bigger construct. According to educational theory or prior empirical research, the factor structure of each instrument usually emerges in the same way. Two types of factor analysis are widely used to understand the latent relationships among these variables, depending on the scenario. (1) Exploratory factor analysis (EFA), which is performed by using the SAS® procedure PROC FACTOR, is an advanced statistical method used to probe deeply into the relationship among the variables and the larger construct and then develop a customized model for the specific assessment. (2) When a model is established, confirmatory factor analysis (CFA) is conducted by using the SAS procedure PROC CALIS to examine the model fit of specific data and then make adjustments for the model as needed. This paper presents the application of SAS to conduct these two types of factor analysis to fulfill various research purposes. Examples using real noncognitive assessment data are demonstrated, and the interpretation of the fit statistics is discussed.
Jun Xu, Educational Testing Service
Steven Holtzman, Educational Testing Service
Kevin Petway, Educational Testing Service
Lili Yao, Educational Testing Service
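A hedged sketch of the two analyses described above (not the authors' code); the data set SURVEY, the items, the three-factor structure, and the factor names are assumptions for illustration only.

/* Exploratory factor analysis */
proc factor data=survey method=ml rotate=promax nfactors=3 scree;
   var item1-item18;
run;

/* Confirmatory factor analysis of a hypothesized three-factor structure */
proc calis data=survey;
   factor
      Engagement  ===> item1-item6,
      Motivation  ===> item7-item12,
      Persistence ===> item13-item18;
run;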
The objective of this study is to use the GLM procedure in SAS® to solve a complex linkage problem with multiple test forms in educational research. Typically, the ABSORB option in the GLM procedure makes this task relatively easy to implement. Note that for educational assessments, to apply one-dimensional combinations of two-parameter logistic (2PL) models (Hambleton, Swaminathan, and Rogers 1991, ch. 1) and generalized partial credit models (Muraki 1997) to a large-scale high-stakes testing program with very frequent administrations requires a practical approach to link test forms. Haberman (2009) suggested a pragmatic solution of simultaneous linking to solve the challenging linking problem. In this solution, many separately calibrated test forms are linked by the use of least-squares methods. In SAS, the GLM procedure can be used to implement this algorithm by the use of the ABSORB option for the variable that specifies administrations, as long as the data are sorted by order of administration. This paper presents the use of SAS to examine the application of this proposed methodology to a simple case of real data.
Lili Yao, Educational Testing Service
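A skeletal illustration of the ABSORB approach described above, assuming a hypothetical calibration data set with an ADMIN variable identifying the administration, a FORM variable identifying the test form, and an ESTIMATE variable holding the separately calibrated parameter estimates (all names are illustrative only):

   /* The data must be sorted by the absorbed variable */
   proc sort data=work.calib;
      by admin;
   run;

   proc glm data=work.calib;
      absorb admin;                       /* absorbs administration effects      */
      class form;
      model estimate = form / solution;   /* least-squares linking of test forms */
   run;
   quit;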
The LACE readmission risk score is a methodology used by Kaiser Permanente Northwest (KPNW) to target and customize readmission prevention strategies for patients admitted to the hospital. This presentation shares how KPNW used SAS® in combination with Epic's Datalink to integrate the LACE score into its electronic health record (EHR) for use in real time. The LACE score is an objective measure composed of four components: L) length of stay; A) acuity of admission; C) pre-existing comorbidities; and E) emergency department (ED) visits in the prior six months. SAS was used to perform complex calculations and to combine data from multiple sources (which the EHR could not do alone), and then to calculate a score that was integrated back into the EHR. The technical approach includes a trigger macro to kick off the process once the database ETL completes, several explicit and implicit PROC SQL statements, a volatile temp table for filtering, and a series of SORT, MEANS, TRANSPOSE, and EXPORT procedures. We walk through the technical approach taken to generate and integrate the LACE score into Epic, describe the challenges we faced and how we overcame them, and share the benefits we have gained throughout the process.
Delilah Moore, Kaiser Permanente
This paper introduces an extremely fast and simple implementation of the survey cell collapsing process. Prior implementations had used either several SQL queries or numerous DATA step arrays, with multiple data reads. This new approach uses a single hash object with a maximum of two data reads. The hash object provides an efficient and convenient mechanism for quick data storage and retrieval (sub-second total run time).
Ahmed Al-Attar, AnA Data Warehousing Consulting, LLC
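The paper's collapsing algorithm is not reproduced here, but a minimal sketch of the single-pass, hash-based tallying that such an approach builds on might look like the following (the survey data set and its STRATUM and CELL variables are assumptions):

   data _null_;
      length n 8;
      dcl hash cells(ordered:'a');
      cells.definekey('stratum','cell');
      cells.definedata('stratum','cell','n');
      cells.definedone();

      /* single read of the survey file, counting respondents per cell */
      do until (eof);
         set work.survey end=eof;
         if cells.find() ne 0 then n = 0;
         n = n + 1;
         cells.replace();
      end;

      /* write the cell counts out for the collapsing step */
      cells.output(dataset:'work.cell_counts');
      stop;
   run;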
In health care and epidemiological research, there is a growing need for basic graphical output that is clear, easy to interpret, and easy to create. SAS® 9.3 has a very clear and customizable graphic called a Radar Graph, yet it can only display the unique responses of one variable and would not be useful for multiple binary variables. In this paper we describe a way to display multiple binary variables for a single population on a single radar graph. Then we convert our method into a macro with as few parameters as possible to make this procedure available for everyday users.
Kevin Sundquist, Columbia University Medical Center
Jacob E Julian, Columbia University Medical Center
Faith Parsons, Columbia University Medical Center
The DOCUMENT procedure is a little-known procedure that can save you vast amounts of time and effort when managing the output of your SAS® programming efforts. This procedure is deeply tied to the mechanism by which SAS controls output in the Output Delivery System (ODS). Have you ever wished you didn't have to modify and rerun the report-generating program every time there was some tweak in the desired report? PROC DOCUMENT enables you to store one version of the report as an ODS document object and then render it in many different output forms, such as PDF, HTML, listing, RTF, and so on, without rerunning the code. Have you ever wished you could extract just the pages of the output that apply to certain BY variables such as State, StudentName, or CarModel? PROC DOCUMENT gives you WHERE-clause capabilities to extract them. Do you want to customize the table of contents that assorted SAS procedures produce when you create HTML frames, or use the equivalent facilities available for PDF? PROC DOCUMENT lets you get at the inner workings of ODS and manipulate them. This paper addresses PROC DOCUMENT from the viewpoint of end results rather than providing a complete technical review of how to perform the task at hand. The emphasis is on the benefits of using the procedure, not on detailed mechanics.
Roger Muller, Data-To-Events, Inc.
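A compact illustration of the store-once, replay-many pattern the abstract describes (the document name, output file, and example procedure are arbitrary choices, not the author's):

   /* Run the report once and store it as an ODS document */
   ods document name=work.rpt(write);
   proc means data=sashelp.cars n mean;
      class origin;
      var msrp;
   run;
   ods document close;

   /* Replay the stored output to any destination without rerunning the code */
   ods pdf file='cars.pdf';
   proc document name=work.rpt;
      replay;
   run;
   quit;
   ods pdf close;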
Average yards per reception and the number of touchdowns are commonly used to rank National Football League (NFL) players. However, scoring a touchdown lowers a player's average because it stops the play and therefore prevents the player from gaining more yardage. If yardages are tabulated in a life table, yardage from a touchdown play can be treated as a right-censored observation. This paper discusses the application of the SAS/STAT® Kaplan-Meier product-limit estimator to adjust these averages. Using 15 seasons of NFL receiving data, the relationship between touchdown rates and average yards per reception is compared before and after adjustment. The modification of the adjustment when a player incurred a 2-point safety during the season is also discussed.
Keith Curtis, USAA
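A hedged sketch of the censoring idea above, assuming a hypothetical data set of individual receptions in which YARDS is the yardage gained and TD=1 marks a touchdown play (treated as right-censored):

   proc lifetest data=work.receptions method=km;
      time yards*td(1);     /* touchdown plays are right-censored observations */
      strata player;        /* one yardage "survival" curve per player         */
   run;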
Because optimization models often do not capture some important real-world complications, a collection of optimal or near-optimal solutions can be useful for decision makers. This paper uses various techniques for finding the k best solutions to the linear assignment problem in order to illustrate several features recently added to the OPTMODEL procedure in SAS/OR® software. These features include the network solver, the constraint programming solver (which can produce multiple solutions), and the COFOR statement (which allows parallel execution of independent solver calls).
Rob Pratt, SAS
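The paper's network solver, constraint programming, and COFOR examples are not reproduced here; as a baseline for comparison, a plain mixed integer formulation of the linear assignment problem in PROC OPTMODEL (with made-up costs) looks roughly like this:

   proc optmodel;
      set WORKERS = 1..4;
      set TASKS   = 1..4;
      num cost{WORKERS, TASKS} = [
          90  76  75  70
          35  85  55  65
         125  95  90 105
          45 110  95 115 ];
      var Assign{WORKERS, TASKS} binary;
      min TotalCost = sum{w in WORKERS, t in TASKS} cost[w,t]*Assign[w,t];
      con OneTaskPerWorker{w in WORKERS}: sum{t in TASKS} Assign[w,t] = 1;
      con OneWorkerPerTask{t in TASKS}: sum{w in WORKERS} Assign[w,t] = 1;
      solve with milp;
      print Assign;
   quit;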
A number of SAS® tools can be used to report data, such as the PRINT, MEANS, TABULATE, and REPORT procedures. The REPORT procedure alone can produce many of the same results as the other SAS tools. Not only can it create detailed reports as PROC PRINT can, but it can also summarize and calculate data as the MEANS and TABULATE procedures do. Unfortunately, despite its power, PROC REPORT seems to be used less often than the other tools, possibly because its coding appears complex. This paper uses PROC REPORT and the Output Delivery System (ODS) to export a large data set into a customized XML file that a user who is not familiar with SAS can easily read. Several options for the COLUMN, DEFINE, and COMPUTE statements are shown that enable you to present your data in a more colorful way. We show how to control the format of selected columns and rows, how to make column headings more meaningful, and how to color selected cells differently to draw attention to the most important data.
Guihong Chen, TCF Bank
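A hedged example of the kind of customization discussed above, using a SASHELP table and the ExcelXP tagset as stand-ins for the paper's actual data and XML destination:

   ods tagsets.excelxp file='cars.xml' style=HTMLBlue;
   proc report data=sashelp.cars nowd;
      column origin type n msrp;
      define origin / group 'Origin';
      define type   / group 'Vehicle Type';
      define n      / 'Count';
      define msrp   / analysis mean format=dollar10. 'Average MSRP';
      compute msrp;
         /* highlight cells with an average MSRP above $40,000 */
         if msrp.mean > 40000 then
            call define(_col_, 'style', 'style={background=yellow}');
      endcomp;
   run;
   ods tagsets.excelxp close;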
The SAS® Deployment Backup and Recovery tool in the third maintenance release of SAS® 9.4 helps SAS administrators collectively back up important data artifacts in SAS deployments, including SAS® Metadata Server, SAS® Content Server, SAS® Web Infrastructure Platform Data Server, and physical data in the SAS configuration directory. The tool supports all types of deployments, from single-tier to multi-tier clustered heterogeneous host deployments. The new configuration options in the tool give administrators more control over the data that is backed up, from the SAS tiers down to the individual directory level. The new options allow administrators to filter out old log directories and to choose which databases to back up. This paper discusses not only how to use the configuration options but also how to mitigate the effects that SAS deployment configuration changes have on the backups and how to optimize the backups in terms of size and time.
Bhaskar Kulkarni, SAS
By default, the SAS® hash object permits only entries whose keys, defined in its key portion, are unique. While in certain programming applications this is a useful feature, there are others for which being able to insert and manipulate entries with duplicate keys is imperative. This ability, available since SAS® 9.2, was a welcome development: it vastly expanded the functionality of the hash object and eliminated the need to work around the distinct-key limitation with custom code. However, nothing comes without a price, and the ability of the hash object to store duplicate-key entries is no exception. In particular, additional hash object methods had to be, and were, developed to handle specific entries sharing the same key. The extra price is that using these methods is not nearly as straightforward as the corresponding operations on distinct-key tables, and the documentation alone is rather poor help for making them work in practice. Fairly extensive experimentation and investigative coding are necessary to make that happen. This paper is the result of such an endeavor, and hopefully it will save those who delve into it a good deal of time and frustration.
Paul Dorfman, Dorfman Consulting
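A small sketch of the duplicate-key retrieval pattern the paper explores, assuming hypothetical PATIENTS and VISITS data sets keyed by PATID (names and variables are placeholders):

   data all_visits;
      if 0 then set work.visits;                 /* host variables for the data portion */
      if _n_ = 1 then do;
         dcl hash h(dataset:'work.visits', multidata:'y');
         h.definekey('patid');
         h.definedata('visit_date','visit_type');
         h.definedone();
      end;
      set work.patients;
      rc = h.find();                             /* first entry with this key, if any   */
      do while (rc = 0);
         output;
         rc = h.find_next();                     /* next entry sharing the same key     */
      end;
      drop rc;
   run;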
The SURVEYSELECT procedure is useful for sample selection in a wide variety of applications. This paper presents the application of PROC SURVEYSELECT to a complex scenario involving a statewide educational testing program, namely drawing interdependent stratified cluster samples of schools for the field-testing of test questions, which we call items. These stand-alone field tests are given to only small portions of the testing population, and as such, a stratified procedure is used to ensure the representativeness of the field-test samples. Because the field-test statistics for these items are evaluated for use in future operational tests, an efficient procedure is needed to sample schools while satisfying predefined sampling criteria and targets. This paper describes an adaptive sampling application and then generalizes the methodology, as much as possible, for potential use in other industries.
Chuanjie Liao, Pearson
Brian Patterson, Pearson
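A simplified sketch of a stratified school sample of the kind described above, assuming a hypothetical sampling frame WORK.SCHOOLS with REGION and GRADEBAND stratification variables (the paper's interdependent, adaptive selection logic is not shown):

   proc surveyselect data=work.schools out=work.ft_sample
                     method=srs samprate=0.10 seed=20160418;
      strata region gradeband;   /* equal-probability sample within each stratum */
   run;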
One of the truths about SAS® is that there are, at a minimum, three approaches to achieve any intended task, and each approach has its own pros and cons. Identifying and using efficient SAS programming techniques is recommended, and efficient programming becomes mandatory when larger data sets are accessed. This paper describes the efficiency of virtual access to data sets, and the situations in which to use it, via the OPEN, FETCH, and CLOSE functions with %SYSFUNC and %DO loops. There are several ways to get data set attributes such as the number of observations, the number of variables, and variable types; accessing larger data sets with the OPEN function through %SYSFUNC is one of the more efficient. FETCH, used with OPEN, provides access to the values of the variables in a SAS data set for conditional and iterative execution with %DO loops. Virtual access of a SAS data set is especially efficient on larger data sets in the following situations. Situation 1: It is often necessary to split a master data set into several independent pieces according to some criterion for batch submission. Situation 2: For many reports, dashboards, and array processing tasks, it is necessary to keep related variables together rather than maintain the original data set order. Situation 3: Much statistical analysis, particularly correlation and regression with widely used SAS procedures, requires a dynamic variable list to be passed. Creating a single macro variable list can run into macro variable length limits on larger transactional data sets; it is better to keep the list of variables in a data set and access it using the proposed approach instead of creating several macro variables.
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
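A brief sketch of the virtual-access pattern: the macro below opens a data set with %SYSFUNC, walks its variables with a %DO loop, and reads the first observation with FETCHOBS, all without a DATA step (the macro name and messages are illustrative, not from the paper):

   %macro inspect(ds);
      %local dsid nvars nobs i rc;
      %let dsid  = %sysfunc(open(&ds, i));
      %let nvars = %sysfunc(attrn(&dsid, nvars));
      %let nobs  = %sysfunc(attrn(&dsid, nlobs));
      %put &ds has &nobs observations and &nvars variables.;

      /* list the variable names */
      %do i = 1 %to &nvars;
         %put Variable &i: %sysfunc(varname(&dsid, &i));
      %end;

      /* fetch the first observation and show the first variable's value */
      %let rc = %sysfunc(fetchobs(&dsid, 1));
      %if %sysfunc(vartype(&dsid, 1)) = C %then
         %put First value: %sysfunc(getvarc(&dsid, 1));
      %else
         %put First value: %sysfunc(getvarn(&dsid, 1));

      %let rc = %sysfunc(close(&dsid));
   %mend inspect;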
Few data visualizations are as striking or as useful as heat maps, and fortunately there are many applications for them. A heat map displaying eye-tracking data is a particularly potent example: the intensity of a viewer's gaze is quantified and superimposed over the image being viewed, and the resulting data display is often stunning and informative. This paper shows how to use Base SAS® to prepare, smooth, and transform eye-tracking data and ultimately render it on top of a corresponding image. By customizing a graphical template and using specific Graph Template Language (GTL) options, a heat map can be drawn precisely so that the user maintains pixel-level control. In this talk, and in the related paper, eye-tracking data is used primarily, but the techniques provided are easily adapted to other fields such as sports, weather, and marketing.
Matthew Duchnowski, Educational Testing Service
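A bare-bones GTL template of the kind the paper customizes, assuming a prepared data set WORK.GAZE with pixel coordinates and a smoothed intensity value; overlaying the heat map on the viewed image, which the paper addresses, is omitted here:

   proc template;
      define statgraph eyeheat;
         begingraph;
            entrytitle 'Gaze Intensity';
            layout overlay;
               heatmapparm x=xpix y=ypix colorresponse=intensity / name='hm';
               continuouslegend 'hm';
            endlayout;
         endgraph;
      end;
   run;

   proc sgrender data=work.gaze template=eyeheat;
   run;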
When SAS® Web Application Server goes down, the user is presented with an error message. The error messages in the SAS® 9.4 middle tier are the default ones shipped with the underlying VMware vFabric Web Server and are seen by many users as too technical and uninformative. This paper describes an application called 'Errors', developed at the Royal Bank of Scotland and implemented across its SAS 9.4 estate, that provides a much better user experience when things go wrong. In addition, regardless of what has been communicated, users always try to access an application if it is available. This paper details a feature of the Errors application that RBS uses to prevent this: it controls access to the web applications during scheduled outage windows and provides IP-based, location-based, and other forms of access control. The paper also documents features and capabilities that RBS would like to add to the application.
Christopher Blake, RBS
Making a data delivery to a client is a complicated endeavor. Many aspects must be carefully considered and planned for: de-identification, public use versus restricted access, documentation, ancillary files such as programs and formats, and methods of data transfer, among others. This paper provides a blueprint for planning and executing your data delivery.
Louise Hadden, Abt Associates
The latest releases of SAS® Data Integration Studio, SAS® Data Management Studio and SAS® Data Integration Server, SAS® Data Governance, and SAS/ACCESS® software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud, RDBMS, files, unstructured data, and streaming, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
SAS® Federation Server is a scalable, threaded, multi-user data access server providing seamlessly integrated data from various data sources. When your data becomes too large to move or copy and too sensitive to allow direct access, the powerful set of data virtualization capabilities allows you to effectively and efficiently manage and manipulate data from many sources, without moving or copying the data. This agile data integration framework allows business users the ability to connect to more data, reduce risks, respond faster, and make better business decisions. For technical users, the framework provides central data control and security, reduces complexity, optimizes commodity hardware, promotes rapid prototyping, and increases staff productivity. This paper provides an overview of the latest features of the product and includes use cases and examples for leveraging product capabilities.
Tatyana Petrova, SAS
Each night on the news we hear the level of the Dow Jones Industrial Average along with the 'first difference,' which is today's price-weighted average minus yesterday's. It is that series of first differences that excites or depresses us each night, as it reflects whether stocks made or lost money that day. Furthermore, the differences form the data series with the most tractable statistical features. In particular, the differences satisfy the stationarity requirement that justifies standard distributional results, such as asymptotically normal distributions of parameter estimates. Differencing arises in many practical time series because they appear to have what are called 'unit roots,' which mathematically indicate the need to take differences. In 1976, Dickey and Fuller developed the first well-known tests for deciding whether differencing is needed. These tests are part of the ARIMA procedure in SAS/ETS® as well as many other time series analysis products. I'll review a little of what it was like to do the development and the required computing back then, say a little about why this is an important issue, and focus on examples.
David Dickey, NC State University
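For reference, the augmented Dickey-Fuller tests described above are available through the STATIONARITY= option of the IDENTIFY statement in PROC ARIMA; a minimal example against a hypothetical WORK.DJIA series (data set and variable names are assumptions):

   proc arima data=work.djia;
      /* test the level series for a unit root at several autoregressive lags */
      identify var=close stationarity=(adf=(0 1 2));
      /* examine the first-differenced series */
      identify var=close(1);
   run;
   quit;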
Software quality comprises a combination of functional and performance requirements that specify not only what software should accomplish but also how well it should accomplish it. Recoverability, a common performance objective, represents the timeliness and efficiency with which software or a system can resume functioning following a catastrophic failure. Thus, requirements for high-availability software often specify the recovery time objective (RTO), the maximum amount of time that software might be down following an unplanned failure or a planned outage. While systems demanding high or near-perfect availability require redundant hardware, network resources, and additional infrastructure, software must also facilitate rapid recovery. And in environments in which system or hardware redundancy is infeasible, recoverability can be improved only through effective software development practices. Because even the most robust code can fail under duress or due to unavoidable or unpredictable circumstances, software reliability must incorporate recoverability principles and methods. This text introduces the TEACH mnemonic, which holds that software recovery should be timely, efficient, autonomous, constant, and harmless. It also introduces the SPICIER mnemonic, which describes discrete phases in the recovery period, each of which can benefit from and be optimized with TEACH principles. Software failure is inevitable, but its negative impacts can be minimized through SAS® development best practices.
Troy Hughes, Datmesis Analytics
For many organizations, the answer to whether to manage their data and analytics in a public or private cloud is going to be both. Both can be the answer for many different reasons: common sense logic not to replace a system that already works just to incorporate something new; legal or corporate regulations that require some data, but not all data, to remain in place; and even a desire to provide local employees with a traditional data center experience while providing remote or international employees with cloud-based analytics easily managed through software deployed via Amazon Web Services (AWS). In this paper, we discuss some of the unique technical challenges of managing a hybrid environment, including how to monitor system performance simultaneously for two different systems that might not share the same infrastructure or even provide comparable system monitoring tools; how to manage authorization when access and permissions might be driven by two different security technologies that make implementation of a singular protocol problematic; and how to ensure overall automation of two platforms that might be independently automated, but not originally designed to work together. In this paper, we share lessons learned from a decade of experience implementing hybrid cloud environments.
Ethan Merrill, SAS
Bryan Harkola, SAS
Even if you're not a GIS mapping pro, it pays to have some geographic problem-solving techniques in your back pocket. In this paper we illustrate a general approach to finding the closest location to any given US zip code, with a specific, user-accessible example of how to do it, using only Base SAS®. We also suggest a method for implementing the solution in a production environment, as well as demonstrate how parallel processing can be used to cut down on computing time if there are hardware constraints.
Andrew Clapson, MD Financial Management
Annmarie Smith, HomeServe USA
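One simple Base SAS approach to the nearest-location problem, not necessarily the one the authors use, is a brute-force pass over a small table of locations using the ZIPCITYDISTANCE function; the data sets and variables below are hypothetical:

   data work.nearest;
      set work.customers;                            /* one row per customer zip code     */
      min_dist = .;
      do i = 1 to nloc;
         set work.locations point=i nobs=nloc;       /* random access to the location table */
         dist = zipcitydistance(cust_zip, loc_zip);  /* miles between zip code centroids    */
         if missing(min_dist) or dist < min_dist then do;
            min_dist = dist;
            nearest_loc = loc_id;
         end;
      end;
      drop i dist;
   run;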
In the world of big data, real-time processing and event stream processing are becoming the norm, yet few tools available today can do this type of processing. SAS® Event Stream Processing is designed to process such data. In this paper, we use SAS Event Stream Processing to read multiple data sets stored in big data platforms such as Hadoop and Cassandra in real time and to perform transformations on the data, such as joining data sets, filtering data based on preset business rules, and creating new variables as required. We then show how the data can be scored with a machine learning algorithm. The paper also shows how to use the provided Hadoop Distributed File System (HDFS) publisher and subscriber to read data from and push data to Hadoop; the HDFS adapter is discussed in detail. Finally, we use Streamviewer to see how data flows through SAS Event Stream Processing.
Krishna Sai Kishore Konudula, Kavi Associates
This paper shows how to create maps from the output of the SAS/STAT® SPP procedure by clipping that output to the same map layer as the data. The clipping can be done with PROC GINSIDE and the final plot with PROC GMAP. The result is a map that is nicer than the plot produced by PROC SPP itself.
Alan Silva, University of Brasilia
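A rough outline of the clipping and mapping steps above, assuming the SPP output has been written to a data set of X/Y points with an ESTIMATE variable; the map data set, its ID variable, and the per-area aggregation are placeholders that depend on the study:

   /* tag each predicted point with the ID of the map area that contains it */
   proc ginside data=work.spp_pred map=maps.brazil out=work.clipped;
      id id;
   run;

   /* summarize the clipped predictions by map area and draw the map */
   proc means data=work.clipped noprint nway;
      class id;
      var estimate;
      output out=work.area_means mean=mean_est;
   run;

   proc gmap data=work.area_means map=maps.brazil;
      id id;
      choro mean_est;
   run;
   quit;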
SAS/IML® 14.1 enables you to author, install, and call packages. A package consists of SAS/IML source code, documentation, data sets, and sample programs. Packages provide a simple way to share SAS/IML functions. An expert who writes a statistical analysis in SAS/IML can create a package and upload it to the SAS/IML File Exchange. A nonexpert can download the package, install it, and immediately start using it. Packages provide a standard and uniform mechanism for sharing programs, which benefits both experts and nonexperts. Packages are very popular with users of other statistical software, such as R. This paper describes how SAS/IML programmers can construct, upload, download, and install packages. They're not wrapped in brown paper or tied up with strings, but they'll soon be a few of your favorite things!
Rick Wicklin, SAS
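Under the assumption that a package archive has already been downloaded from the SAS/IML File Exchange, installation and loading look roughly like this (the path and package name are hypothetical):

   proc iml;
      package install "/home/user/downloads/MyStatPkg.zip";   /* one-time installation   */
      package load MyStatPkg;                                 /* load the package modules */
      package help MyStatPkg;                                 /* view its documentation   */
   quit;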
I get annoyed when macros tread all over my SAS® environment, macro variables, and data sets. So how do you write macros that play nicely and then clean up after themselves? This paper describes techniques for writing macros that do just that.
Philip Holland, Holland Numerics Ltd
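A tiny illustration of the 'clean up after yourself' idea: local macro variables, a recognizable prefix for temporary data sets, and a final PROC DATASETS to delete them (the macro itself is a made-up example, not from the paper):

   %macro tidy(ds=);
      %local dsid nobs rc;                        /* keep macro variables out of the global table */

      proc sort data=&ds out=work._tidy_sorted;   /* prefix all temporary data sets */
         by _all_;
      run;

      /* ... real work using work._tidy_sorted would go here ... */

      proc datasets lib=work nolist;
         delete _tidy_:;                          /* remove every data set this macro created */
      quit;
   %mend tidy;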
Do you write reports that sometimes have missing categories across all class variables? Some programmers write all sorts of additional DATA step code in order to show the zeros for the missing rows or columns. Did you ever wonder whether there is an easier way to accomplish this? PROC MEANS and PROC TABULATE, in conjunction with PROC FORMAT, can handle this situation with a couple of powerful options. With PROC TABULATE, we can use the PRELOADFMT and PRINTMISS options in conjunction with a user-defined format in PROC FORMAT to accomplish this task. With PROC SUMMARY, we can use the COMPLETETYPES option to get all the rows with zeros. This paper uses examples from Census Bureau tabulations to illustrate the use of these procedures and options to preserve missing rows or columns.
Chris Boniface, Census Bureau
Janet Wysocki, U.S. Census Bureau
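A short sketch of both techniques mentioned above, using a made-up REGION classification rather than the Census Bureau examples:

   proc format;
      value $regfmt 'NE'='Northeast' 'MW'='Midwest' 'SO'='South' 'WE'='West';
   run;

   /* PROC TABULATE: PRELOADFMT plus PRINTMISS show regions with zero observations */
   proc tabulate data=work.responses;
      class region / preloadfmt;
      format region $regfmt.;
      table region, n / printmiss misstext='0';
   run;

   /* PROC SUMMARY: COMPLETETYPES produces a row for every format level */
   proc summary data=work.responses completetypes nway;
      class region / preloadfmt;
      format region $regfmt.;
      output out=work.region_counts;
   run;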