SAS Global Forum 2017 Proceedings

Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a big data conference, data scientists from several companies were invited to participate in tackling this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach using SAS^® analytical methods were superior to other big data approaches in investigating these trends.

Read the paper (PDF)

Migration to a SAS^® Grid Computing environment provides many advantages. However, such migration might not be free from challenges especially considering users' pre-migration routines and programming practices. While SAS^® provides good graphical user interface solutions (for example, SAS^® Enterprise Guide^®) to develop and submit the SAS code to SAS Grid Computing, some situations might need command-line batch submission of a group of related SAS programs in a particular sequence. Saving individual log and output files for each program might also be a favorite routine in many organizations. SAS has provided the SAS Grid Manager Client Utility and SASGSUB commands to enable command-line submission of SAS programs to the grid. However, submitting a sequence of SAS programs in a conventional batch program style and getting individual logs and outputs for them needs a customized approach. This paper presents such an approach. In addition, an HTML and JavaScript tool developed in-house is introduced. This tool automates the generation of a SAS program that almost emulates a conventional scenario of submitting a batch program in a command-line shell using SASGSUB commands.

Read the paper (PDF)

This poster shows how to predict a past-due amount using traditional and machine learning techniques: logistic analysis, k-nearest neighbors, and random forest. The data set that was analyzed is about real-world commerce. It contains 305 categories of financial information from more than 11,787,287 unique businesses, from 2006 to 2014. The big challenge is how to handle the big and noisy real-world data sets. The first step of any model-building exercise is to define the outcome. A common prediction method in the financial services industry is to use binary outcomes, such as Good and Bad. For our research problem, we reduced past-due amounts into two cases, Good and Bad. Next, we built a two-stage model using the logistic regression method; that is, the first stage predicts the likelihood of a Bad outcome, and the second predicts a past-due amount, given a Bad outcome. Logistic analysis as a traditional statistical technique is commonly used for prediction and classification in the financial services industry. However, for analyzing big, noisy, or complex data sets, machine learning techniques are typically preferred to detect hard-to-discern patterns. To compare with both techniques, we use predictive accuracy, ROC index, sensitivity, and specificity as criteria.

The Clinical Data Interchange Standards Consortium (CDISC) encompasses a variety of standards for medical research. Amongst the several standards developed by the CDISC organization are standards for data collection (Clinical Data Acquisition Standards Harmonization CDASH), data submission (Study Data Tabulation Model SDTM), and data analysis (Analysis Data Model ADaM). These standards were originally developed with drug development in mind. Therapeutic Area User Guides (TAUGs) have been a recent focus to provide advice, examples, and explanations for collecting and submitting data for a specific disease. Non-subjects even have a way to collect data using the Associated Persons Implementation Guide (APIG). SDTM domains for medical devices were published in 2012. Interestingly, the use of device domains in the TAUGs occurs in 14 out of 18 TAUGs, providing examples of the use of various device domains. Drug-device studies also provide a contrast on adoption of CDISC standards for drug submissions versus device submissions. Adoption of SDTM in general and the seven device SDTM domains by the medical device industry has been slow. Reasons for the slow adoption are discussed in this paper.

Read the paper (PDF)

Automatic loading, tracking, and visualization of data readiness in SAS^® Visual Analytics is easy when you combine SAS^® Data Integration Studio with the DATASET and LASR procedures. This paper illustrates the simple method that the University of North Carolina at Chapel Hill (Enterprise Reporting and Departmental Systems) uses to automatically load tables into the SAS^® LASR Analytic Servers, and then store reportable data about the HDFS tables created, the LASR tables loaded, and the ETL job execution times. This methodology gives the department the ability to longitudinally visualize system loading performance and identify changes in system behavior, as well as providing a means of measuring how well we are serving our customers over time.

Read the paper (PDF)

With the increasing amount of educational data, educational data mining has become more and more important for uncovering the hidden patterns within institutional data so as to support institutional decision making (Luan 2012). However, only very limited studies have been done on educational data mining for institutional decision support. At the University of Connecticut (UCONN), organic chemistry is a required course for undergraduate students in a STEM discipline. It has a very high DFW rate (D=Drop, F=Failure, W=Withdraw). Take Fall 2014 as an example: the average DFW% for the Organic Chemistry lectures was 24% at UCONN, and there were over 1200 students enrolled in this class. In this study, undergraduate students enrolled during School Year 2010 2011 were used to build up the model. The purpose of this study was to predict student success in the future so as to improve the education quality in our institution. The Sample, Explore, Modify, Model, and Assess (SEMMA) method introduced by SAS was applied to develop the predictive model. The freshmen SAT scores, campus, semester GPA, financial aid, and other factors were used to predict students' performance in this course. In the predictive modeling process, several modeling techniques (decision tree, neural network, ensemble models, and logistic regression) were compared with each other in order to find an optimal one for our institution.

Read the paper (PDF)

A propensity score is the probability that an individual will be assigned to a condition or group, given a set of baseline covariates when the assignment is made. For example, the type of drug treatment given to a patient in a real-world setting might be non-randomly based on the patient's age, gender, geographic location, and socioeconomic status when the drug is prescribed. Propensity scores are used in many different types of observational studies to reduce selection bias. Subjects assigned to different groups are matched based on these propensity score probabilities, rather than matched based on the values of individual covariates. Although the underlying statistical theory behind the use of propensity scores is complex, implementing propensity score matching with SAS^® is relatively straightforward. An output data set of each subject's propensity score can be generated with SAS using PROC LOGISTIC. And, a generalized SAS macro can generate optimized N:1 propensity score matching of subjects assigned to different groups using the radius method. Matching can be optimized either for the number of matches within the maximum allowable radius or by the closeness of the matches within the radius. This presentation provides the general PROC LOGISTIC syntax to generate propensity scores, provides an overview of different propensity score matching techniques, and discusses how to use the SAS macro for optimized propensity score matching using the radius method.

Read the paper (PDF)

Creating sophisticated, visually stunning reports is imperative in today s business environment, but is your fancy report really accessible to all? Let s explore some simple enhancements that the fourth maintenance release of SAS^® 9.4 made to Output Delivery System (ODS) layout and the Report Writing Interface that will truly empower you to accommodate people who use assistive technology. ODS now provides the tools for you to meet Section 508 compliance and to create an engaging experience for all who consume your reports.

Read the paper (PDF)

As the open-source community has been taking the technology world by storm, especially in the big data space, large corporations such as SAS, IBM, and Oracle have been working to embrace this new, quickly evolving ecosystem to continue to foster innovation and to remain competitive. For example, SAS, IBM, and others have aligned with the Open Data Platform initiative and are continuing to build out Hadoop and Spark solutions. And, Oracle has partnered with Cloudera to create the Big Data Appliance. This movement challenges companies that are consuming these products to select the right products and support partners. The hybrid approach using all tools available seems to be the methodology chosen by most successful companies. West Corporation, an Omaha-based provider of technology-enabled communication solutions, is no exception. West has been working with SAS for 10 years in the ETL, BI, and advanced analytics space, and West began its Hadoop journey a year ago. This paper focuses on how West data teams use both technologies to improve customer experience in the interactive voice response (IVR) system by storing massive semi-structure call logs in HDFS and by building models that predict a caller s intent to route the caller more efficiently and to reduce customer effort using familiar SAS code and the very user friendly SAS^® Enterprise Miner .

Read the paper (PDF)

When a large and important project with a strict deadline hits your desk, it's easy to revert to those tried-and-true SAS^® programming techniques that have been successful for you in the past. In fact, trying to learn new techniques at such a time can prove to be distracting and a waste of precious time. However, the lull after a project's completion is the perfect time to reassess your approach and see whether there are any new features added to the SAS arsenal since the last time you looked that could be of great use the next time around. Such a post-project post-mortem has provided me with the opportunity to learn about several new features that will prove to be hugely valuable in the next release of my project. For example: 1) The PRESENV option and procedure 2) Fuzzy matching with the COMPGED function 3) The ODS POWERPOINT statement 4) SAS^® Enterprise Guide^® enhancements, including copying and pasting process flows and the SAS Macro Variable Viewer

Read the paper (PDF)

In this paper, a SAS^® macro is introduced that can search and replace any string in a SAS program. To use the macro, the user needs only to pass the search string to a folder. If the user wants to use the replacement function, the user also needs to pass the replacement string. The macro checks all of the SAS programs in the folder and subfolders to find out which files contain the search string. The macro generates new SAS files for replacements so that the old files are not affected. An HTML report is generated by the macro to include the original file locations, the line numbers of the SAS code that contain the search string, and the SAS code with search strings highlighted in yellow. If you use the replacement function, the HTML report also includes the location information for the new SAS files. The location information in the HTML report is created with hyperlinks so that the user can directly open the files from the report.

Read the paper (PDF) | View the e-poster or slides (PDF)

This paper introduces a macro that can generate the keyhole markup language (KML) files for U.S. states and counties. The generated KML files can be used directly by Google Maps to add customized state and county layers with user-defined colors and transparencies. When someone clicks on the state and county layers in Google Maps, customized information is shown. To use the macro, the user needs to prepare only a simple SAS^® input data set. The paper includes all the SAS codes for the macro and provides examples that show you how to use the macro as well as how to display the KML files in Google Maps.

Read the paper (PDF)

We have a lot of chances to use time-to-event (survival) analysis, especially in the biomedical and pharmaceutical fields. SAS^® provides the LIFETEST procedure to calculate Kaplan-Meier estimates for survival function and to delineate a survival plot. The PHREG procedure is used in Cox regression models to estimate the effect of predictors in hazard rates. Programs with ODS tables that are defined by PROC LIFETEST and PROC PHREG can provide more statistical information from the generated data sets. This paper provides a macro that uses PROC LIFETEST and PROC PHREG with ODS. It helps users to have a survival plot with estimates that include the subject at risk, events and total subject number, survival rate with median and 95% confidence interval, and hazard ratio estimates with 95% confidence interval. Some of these estimates are optional in the macro, so users can select what they need to display in the output. (Subject at risk and event and subject number are not optional.) Users can also specify the tick marks in the X-axis and subject at risk table, for example, every 10 or 20 units. The macro dynamic calculates the maximum for the X-axis and uses the interval that the user specified. Finally, the macro uses ODS and can be output in any document files, including JPG, PDF, and RTF formats.

Read the paper (PDF) | View the e-poster or slides (PDF)

A man with one watch always knows what time it is...but a man with two watches is never sure. Contrary to this adage, load forecasters at electric utilities would gladly wear an armful of watches. With only one model to choose from, it is certain that some forecasts will be wrong. But with multiple models, forecasters can have confidence about periods when the forecasts agree and can focus their attention on periods when the predictions diverge. Having a second opinion is preferred, and that's one of the six classic rules for forecasters as per Dr. Tao Hong of the University of North Carolina at Charlotte. Dr. Hong is the premiere thought leader and practitioner in the field of energy forecasting. This presentation discusses Dr. Hong's six rules, how they relate to the increasingly complex problem of forecasting electricity consumption, and the role that predictive analytics plays.

Read the paper (PDF)

Disseminating data to potential collaborators can be essential in the development of models, algorithms, and innovative research opportunities. However, it is often time-consuming to get approval to access sensitive data such as health data. An alternative to sharing the real data is to use synthetic data, which has similar properties to the original data but does not disclose sensitive information. The collaborators can use the synthetic data to make preliminary models or to work out bugs in their code while waiting to get approval to access the original data. A data owner can also use the synthetic data to crowdsource solutions from the public through competitions like Kaggle and then test those solutions on the original data. This paper implements a method that generates fully synthetic data in a way that matches the statistical moments of the true data up to a specified moment order as a SAS^® macro. Variables in the synthetic data set are of the same data type as the true data (for example, integer, binary, continuous). The implementation uses the linear programming solver within a column generation algorithm and the mixed integer linear programming solver from the OPTMODEL procedure in SAS/OR^® software. The COFOR statement in PROC OPTMODEL automatically parallelizes a portion of the algorithm. This paper demonstrates the method by using the Sashelp.Heart data set to generate fully synthetic data copies.

Read the paper (PDF)

Microsoft Windows 10 is a new operating system that is increasingly being adopted by enterprises around the world. SAS has planned to expand SAS^® Mobile BI, which is currently available on Apple iOS and Google Android, to the Microsoft Windows 10 platform. With this new application, customers can download business reports from SAS^® Visual Analytics to their desktop, laptop, or Microsoft Surface device, and use these reports both online and offline in their day-to-day business life. With Windows 10, users have the option of pinning a report to the desktop for quick access. This paper demonstrates this new SAS mobile application. We also demonstrate the cool new functionality on iOS and Android platforms, and compare them with the Windows 10 application.

Read the paper (PDF)

The hospital Medicare readmission rate has become a key indicator for measuring the quality of health care in the US. This rate is currently used by major health-care stakeholders including the Centers for Medicare and Medicaid Services (CMS), the Agency for Healthcare Research and Quality (AHRQ), and the National Committee for Quality Assurance (NCQA) (Fan and Sarfarazi, 2014). Although many papers have been written about how to calculate readmissions, this paper provides updated code that includes ICD-10 (International Classification of Diseases) code and offers a novel and comprehensive approach using SAS^® DATA step options and PROC SQL. We discuss: 1) De-identifying patient data 2) Calculating sequential admissions 3) Subsetting criteria required to report for CMS 30-day readmissions. In addition, this papers demonstrates: 1) Using the output delivery system (ODS) to create a labeled and de-identified data set 2) Macro variables to examine data quality 3) Summary statistics for further reporting and analysis.

Read the paper (PDF) | View the e-poster or slides (PDF)

This presentation gives you the tools to begin using propensity scoring in SAS^® to answer research questions involving observational data. It is for both those attendees who have never used propensity scores and those who have a basic understanding of propensity scores but are unsure how to begin using them in SAS. It provides a brief introduction to the concept of propensity scores, and then turns its attention to giving you the tips and resources you need to get started. The presentation walks you through how the code in the book 'Analysis of Observational Health Care Data Using SAS^®', which was published by SAS Institute, is used to answer how a particular health care intervention impacted a health care outcome. It details how propensity scores are created and how propensity score matching is used to balance covariates between treated and untreated observations. With this case study in hand, you will feel confident that you have the tools necessary to begin answering some of your own research questions using propensity scores.

Read the paper (PDF)

Healthcare is weird. Healthcare data is even more so. The digitization of healthcare data that describes the patient experience is a modern phenomenon, with most healthcare organizations still in their infancy. While the business of healthcare is already a century old, most organizations have focused their efforts on the financial aspects of healthcare and not on stakeholder experience or clinical outcomes. Think of the workflow that you might have experienced such as scheduling an appointment through doctor visits, obtaining lab tests, or obtaining prescriptions for interventions such as surgery or physical therapy. The modern healthcare system creates a digital footprint of administrative, process, quality, epidemiological, financial, clinical, and outcome measures, which range in size, cleanliness, and usefulness. Whether you are new to healthcare data or are looking to advance your knowledge of healthcare data and the techniques used to analyze it, this paper serves as a practical guide to understanding and using healthcare data. We explore common methods for how we structure and access data, discuss common challenges such as aggregating data into episodes of care, describe reverse engineering real world events, and talk about dealing with the myriad of unstructured data found in nursing notes. Finally, we discuss the ethical uses of healthcare data and the limits of informed consent, which are critically important for those of us in analytics.

Read the paper (PDF)

Specifying the functional form of a covariate is a fundamental part of developing a regression model. The choice to include a variable as continuous, categorical, or as a spline can be determined by model fit. This paper offers an efficient and user-friendly SAS^® macro (%SPECI) to help analysts determine how best to specify the appropriate functional form of a covariate in a linear, logistic, and survival analysis model. For each model, our macro provides a graphical and statistical single-page comparison report of the covariate as a continuous, categorical, and restricted cubic spline variable so that users can easily compare and contrast results. The report includes the residual plot and distribution of the covariate. You can also include other covariates in the model for multivariable adjustment. The output displays the likelihood ratio statistic, the Akaike Information Criterion (AIC), as well as other model-specific statistics. The %SPECI macro is demonstrated using an example data set. The macro includes the PROC REG, PROC LOGISTIC, PROC PHREG, PROC REPORT, PROC SGPLOT, and more procedures in SAS^® 9.4.

Read the paper (PDF) | Download the data file (ZIP)

Duplicates in a clinical trial or survey database could jeopardize data quality and integrity, and they can induce biased analysis results. These complications often happen in clinical trials, meta analyses, and registry and observational studies. Common practice to identify possible duplicates involves sensitive personal information, such as name, Social Security number (SSN), date of birth, address, telephone number, etc. However, access to this sensitive information is limited. Sometimes, it is even restricted. As a measure of data quality control, a SAS^® program was developed to identify duplicated individuals using non-sensitive information, such as age, gender, race, medical history, vital signs, and laboratory measurements. A probabilistic approach was used by calculating weights for data elements used to identify duplicates based on two probabilities (probability of agreement for an element among matched pairs and probability of agreement purely by chance among non-matched pairs). For elements with categorical values, agreement was defined as matching pairs sharing the same value. For elements with interval values, agreement was defined as matching values within 1% of measurement precision range. Probabilities used to compute matching element weights were estimated using an expectation-maximization (EM) algorithm. The method was then tested on a survey and clinical trial data from hypertension studies.

View the e-poster or slides (PDF)

Visual+D2:D18ization is a critical part of turning data into knowledge. A customized graph is essential to make data visualization meaningful, powerful, and interpretable. Furthermore, customizing grouped data into a desired layout with specific requirements such as clusters, colors, symbols, and patterns for each group can be challenging. This paper provides a start-from-scratch, step-by-step solution to create a customized graph for grouped data using SAS^® Graph Template Language (GTL). By analyzing the data and target graph with the available tools and options that GTL provided, this paper demonstrates GTL is a powerful and flexible tool to create a customized, complex graph.

Read the paper (PDF)

Visualization is a critical part to turn data into knowledge. A customized graph is essential to make data visualization meaningful, powerful, and interpretable. Furthermore, customizing grouped data into a desired layout with specific requirements such as clusters, colors, symbols, and patterns for each group can be challenging. This paper provides a start-from-scratch, step-by-step solution to create a customized graph for grouped data using the Graph Template Language (GTL). From analyzing the data to creating the target graph with the tools and options that are available with GTL, this paper demonstrates GTL is a powerful and flexible tool for creating a customized, complex graph.

Read the paper (PDF)

SAS^® functions provide amazing power to your DATA step programming. Some of these functions are essential some of them help you avoid writing volumes of unnecessary code. This talk covers some of the most useful SAS functions. Some of these functions might be new to you, and they will change the way you program and approach common programming tasks. The majority of the functions described in this talk work with character data. There are functions that search for strings, and others that can find and replace strings or join strings together. Still others can measure the spelling distance between two strings (useful for 'fuzzy' matching). Some of the newest and most amazing functions are not functions at all, but call routines. Did you know that you can sort values within an observation? Did you know that not only can you identify the largest or smallest value in a list of variables, but you can identify the second- or third- or nth-largest or smallest value? A knowledge of the functions described here will make you a much better SAS programmer.

Read the paper (PDF)

Accelerate your data preparation by having your DS2 execute without translation inside the Teradata database or on the Hadoop platform with SAS^® Code Accelerator. This presentation shows how easy it is to use SAS Code Accelerator via a live demonstration.

Read the paper (PDF)

In this presentation, we describe a research project that got started with the goal of measuring the performance of SAS^® programs on graphics processing units (GPUs). Some programming, either in Base SAS or C, is required to replicate the results that we describe in this paper.

Read the paper (PDF)

Accessibility has become a hot topic on campus due to a flurry of recent investigations of discrimination against students with disabilities by the U.S. Department of Justice and the U.S. Department of Education. This paper provides an update on the latest improvements in SAS^® University Edition that are specifically targeted to enable students with disabilities to excel in the classroom and beyond. This paper covers the entire SAS University Edition user experience including installation, documentation, training, support, using SAS^® Studio, and the new accessibility features in the fourth maintenance release of SAS^® 9.4.

Read the paper (PDF)

Many organizations that use SAS^® Visual Analytics must conform with accessibility requirements such as Section 508, the Americans with Disabilities Act, and the Accessibility for Ontarians with Disabilities Act. SAS Visual Analytics provides a number of different ways to view reports, including the SAS^® Report Viewer and SAS^® Mobile BI native applications for Apple iOS and Google Android. Each of these options has its own strengths and weaknesses when it comes to accessibility a one-size-fits-all approach is unlikely to work well for the people in your audience who have disabilities. This paper provides a comprehensive assessment of the latest versions of all SAS Visual Analytics report viewers, using Web Content Accessibility Guidelines (WCAG) 2.0 as a benchmark to evaluate accessibility. You can use this paper to direct the end users of your reports to the viewer that best meets their individual needs.

Read the paper (PDF) | Download the data file (ZIP)

SAS/ACCESS^® software grants access to data in third-party database management systems (DBMS), but how do you access data in DBMS not supported by SAS/ACCESS products? The introduction of the GROOVY procedure in SAS^® 9.3 lets you retrieve this formerly inaccessible data through a JDBC connection. Groovy is an object-oriented, dynamic programming language executed on the Java Virtual Machine (JVM). Using Microsoft Azure HDInsight as an example, this paper demonstrates how to access and read data into a SAS data set using PROC GROOVY and a JDBC connection.

Read the paper (PDF)

Monitoring server events to proactively identify future outages. Looking at financial transactions to check for money laundering. Analyzing insurance claims to detect fraud. These are all examples of the many applications that can use the power of SAS^® analytics to identify threats to a business. Using SAS^® Visual Investigator, users can now add a workflow to control how these threats are managed. Using the administrative tools provided, users can visually design the workflow that the threat would be routed through. In this way, the administrator can control the tasks within the workflow, as well as which users or groups those tasks are assigned to. This presentation walks through an example of using the administrative tools of SAS Visual Investigator to create a ticketing system in response to threats to a business. It shows how SAS Visual Investigator can easily be adapted to meet the changing nature of the threats the business faces.

Read the paper (PDF)

Hierarchical models, also known as random-effects models, are widely used for data that consist of collections of units and are hierarchically structured. Bayesian methods offer flexibility in modeling assumptions that enable you to develop models that capture the complex nature of real-world data. These flexible modeling techniques include choice of likelihood functions or prior distributions, regression structure, multiple levels of observational units, and so on. This paper shows how you can fit these complex, multilevel hierarchical models by using the MCMC procedure in SAS/STAT^® software. PROC MCMC easily handles models that go beyond the single-level random-effects model, which typically assumes the normal distribution for the random effects and estimates regression coefficients. This paper shows how you can use PROC MCMC to fit hierarchical models that have varying degrees of complexity, from frequently encountered conditional independent models to more involved cases of modeling intricate interdependence. Examples include multilevel models for single and multiple outcomes, nested and non-nested models, autoregressive models, and Cox regression models with frailty. Also discussed are repeated measurement models, latent class models, spatial models, and models with nonnormal random-effects prior distributions.

Read the paper (PDF)

Location information plays a big role in business data. Everything that happens in a business happens somewhere, whether it s sales of products in different regions or crimes that happened in a city. Business analysts typically use the historic data that they have gathered for years for analysis. One of the most important pieces of data that can help answer more questions qualitatively, is the demographic data along with the business data. An analyst can match the sales or the crimes with the population metrics like gender, age groups, family income, race, and other pieces of information, which are part of the demographic data, for better insight. This paper demonstrates how a business analyst can bring the demographic and lifestyle data from Esri into SAS^® Visual Analytics and join the data with business data. The integration of SAS Visual Analytics with Esri allows this to happen. We demonstrate different methods of accessing Esri demographic data from SAS Visual Analytics. We also demonstrate how you can use custom shape files and integrate with Esri Portal for ArcGIS.

Read the paper (PDF)

The SQL procedure has a number of powerful and elegant language features for SQL users. This hands-on workshop emphasizes highly valuable and widely usable advanced programming techniques that will help users of Base SAS^® harness the power of PROC SQL. Topics include using PROC SQL to identify FIRST.row, LAST.row, and Between.rows in BY-group processing; constructing and searching the contents of a value-list macro variable for a specific value; data validation operations using various integrity constraints; data summary operations to process down rows and across columns; and using the MSGLEVEL= system option and _METHOD SQL option to capture vital processing and the algorithm selected and used by the optimizer when processing a query.

Read the paper (PDF) | Download the data file (ZIP)

SAS^® Visual Analytics provides a robust platform to perform business intelligence through a high-end and advanced dashboarding style. In today's technology era, dashboards not only help in gaining insight into an organization's operations, but they also are a key performance indicator. In this paper, I discuss five important and frequently used objects in SAS Visual Analytics. These objects are used to get the most out of dashboards in an effective and efficient way. This paper covers the use of dates (as a format) in the date slider and gauges, cascading filters, custom graphs, linking reports within sections of the same report or with other reports, and associating buttons with graphs for dynamic functionality.

Read the paper (PDF)

Would you agree that the value of SAS^® for your organization comes from transforming data into actionable information, using well-prepared human resources? This paper presents seven areas where this potential SAS value can be lost by inefficient data access, limited reporting and visualization, poor data cleansing, obsolete predictive analytics, incomplete SAS solutions, limited hardware use, and lack of governance. This paper also suggests what to do to overcome these issues.

Read the paper (PDF)

Technology plays an integral role in every aspect of daily life. As a result, educators should leverage technology-based learning to ensure that students are provided with authentic, engaging, and meaningful learning experiences (Pringle, Dawson, and Ritzhaupt, 2015).The significance and value of computer science understanding continue to increase. A major resource that can be credited with spreading support for computer science is the site Code.org. Its mission is to enable every student in every school to have the opportunity to learn computer science (https://code.org/about). Two years ago, our mentor partnered with Code.org to conduct workshops within the Charlotte, NC area to educate teachers on how to teach computer science activities and concepts in their classrooms. We had the opportunity to assist during the workshops to provide student perspectives and opinions. As we look back on the workshops, we wondered, How are the teachers who attended the workshops implementing the concepts they were taught? After each workshop, a survey was distributed to the attendees to receive workshop feedback and to follow up. We collected the data from the surveys sent to participants and analyzed it using SAS^® University Edition. The results of the survey concluded that the workshops were beneficial and that the educators had implemented a concept that they learned. We believe that computer science activity implementations will assist students across the curriculum.

View the e-poster or slides (PDF)

To determine whether there is a correlation between the repetitiveness of a song s lyrics and its popularity, the top 10 songs from the Billboard Hot 100 songs chart from 2006 to 2015 were collected. Song lyrics were assessed to determine the count of the top 10 words used. Word counts were used to predict the number of weeks the song was on the chart. The prediction model was analyzed to determine the quality of the model and whether word count was a significant predictor of a song s popularity. To investigate whether song lyrics are becoming more simplistic over time, several tests were performed to see whether the average word count has been changing over the years. All analysis was completed in SAS^® using various procedures.

View the e-poster or slides (PDF)

Are you tired of copying PROC FREQ or PROC MEANS output and pasting it into your tables? Do you need to produce summary tables repeatedly? Are you spending a lot of your time generating the same summary tables for different subpopulations? This paper introduces an easy-to-use macro to generate a descriptive statistics table. The table reports counts and percentages for categorical variables, and means, standard deviations, medians, and quantiles for continuous variables. For variables with missing values, the table also includes the count and percentage missing. Customization options allow for the analysis of stratified data, specification of variable output order, and user-defined formats. In addition, this macro incorporates the SAS^® Output Delivery System (ODS) to automatically produce a Rich Text Format (RTF) file, which can be further edited by a word processor for the purpose of publication.

Read the paper (PDF) | View the e-poster or slides (PDF)

Significant growth of the Internet has created an enormous volume of unstructured text data. In recent years, the amount of this type of data that is available for analysis has exploded. While the amount of textual data is increasing rapidly, an ability to obtain key pieces of information from such data in a fast, flexible, and efficient way is still posing challenges. This paper introduces SAS^® Contextual Analysis In-Database Scoring for Hadoop, which integrates SAS^® Contextual Analysis with the SAS^® Embedded Process. SAS^® Contextual Analysis enables users to customize their text analytics models in order to realize the value of their text-based data. The SAS^® Embedded Process enables users to take advantage of SAS^® Scoring Accelerator for Hadoop to run scoring models. By using these key SAS^® technologies, the overall experience of analyzing unstructured text data can be greatly improved. The paper also provides guidelines and examples on how to publish and run category, concept, and sentiment models for text analytics in Hadoop.

Read the paper (PDF)

The University of Memphis has been a SAS^® customer since the mainframe computing era. Our deployments have included various SAS products involving web-based applications, client/server implementations, desktop installations, and virtualized services. This paper uses an information technology (IT) perspective to discuss how the University has leveraged SAS, as well as the latest benefits and challenges for our most recent deployment involving SAS^® Visual Analytics.

Read the paper (PDF)

The SAS^® code looks perfect. You submit it and to your amazement, there is a problem with the CREATE TABLE statement. You need to change the table definition, ever so slightly, but how? Explicit pass-through? That's not an option. Fortunately, there are a handful of SAS options that can save the day. This presentation covers everything you need to know in order to adjust the SAS CREATE TABLE statements using SAS options. This presentation covers the following SAS options: DBCREATE_TABLE_OPTS=, POST_STMT_OPTS=, POST_TABLE_OPTS=, PRE_STMT_OPTS=, and PRE_TABLE_OPTS=. We use Hadoop and Oracle examples to show why these options can make your life easier. From there, we use real code to show you how to use them.

Read the paper (PDF)

Whether you are an existing SAS^® Visual Analytics user or you are exploring SAS Visual Analytics for the first time, the first release of SAS^® Visual Analytics 8.1 on SAS^® Viya has something exciting for everyone. The latest version is a clean, modern HTML5 interface. SAS^® Visual Analytics Designer, SAS^® Visual Analytics Explorer, and SAS^® Visual Statistics are merged into a single web application. Whether you are designing reports, exploring data, or running interactive, predictive models, everything is integrated into one seamless experience. The application delivers on the same basic promise: get pertinent answers from any-size data. The paper walks you through key features that you have come to count on, from auto charting, to display rules, and more. It acclimates you to the new interface and highlights a few exciting new features like web content and donut pie charts. Finally, the paper touches upon the ability to promote your existing reports to the new environment.

Read the paper (PDF)

Interactively redeploying SAS^® Data Integration Studio jobs can be a slow and tedious process. The updated batch deployment utility gives the ETL Tech Lead a more efficient and repeatable method for administering batch jobs. This improved tool became available in SAS^® Data Integration Studio 4.901.

Read the paper (PDF)

Data is generated every second. The term big data refers to the volume, variety, and velocity of data that is being produced. Now woven into every sector, its size and complexity has left organizations faced with difficulties in being able to create, manipulate, and manage big data. This research identifies and reviews a range of big data techniques within SAS^®, highlighting the fundamental opportunities that SAS provides for overcoming a variety of business challenges. Insurance is a data-dependent industry. This research focuses on understanding what SAS can offer to insurance companies and how it could interact with existing customer databases and online, user-generated content. A range of data sources have been identified for this purpose. The research demonstrates how models can be built based on existing relationships found in past data and then used to identify prospective customers. Principal component analysis, cluster analysis, and neural networks are all considered. You will learn how these techniques can be used to help capture valuable insight, create firm relationships, and support customer feedback. Whether it is prescriptive, predictive, descriptive, or diagnostic analytics, harnessing big data can add background and depth, providing insurance companies with a more complete story. You will see that you can reduce the complexity and dimensionality of data, provide actionable intelligence, and essentially make more informed business decisions.

Read the paper (PDF)

UNIX and Linux SAS^® administrators, have you ever been greeted by one of these statements as you walk into the office before you have gotten your first cup of coffee? Power outage! SAS servers are down. I cannot access my reports. Have you frantically tried to restart the SAS servers to avoid loss of productivity and missed one of the steps in the process, causing further delays while other work continues to pile up? If you have had this experience, you understand the benefit to be gained from a utility that automates the management of these multi-tiered deployments. Until recently, there was no method for automatically starting and stopping multi-tiered services in an orchestrated fashion. Instead, you had to use time-consuming manual procedures to manage SAS services. These procedures were also prone to human error, which could result in corrupted services and additional time lost, debugging and resolving issues injected by this process. To address this challenge, SAS Technical Support created the SAS Local Services Management (SAS_lsm) utility, which provides automated, orderly management of your SAS^® multi-tiered deployments. The intent of this paper is to demonstrate the deployment and usage of the SAS_lsm utility. Now, go grab a coffee, and let's see how SAS_lsm can make life less chaotic.

Read the paper (PDF)

Machine learning is in high demand. Whether you are a citizen data scientist who wants to work interactively or you are a hands-on data scientist who wants to code, you have access to the latest analytic techniques with SAS^® Visual Data Mining and Machine Learning on SAS^® Viya . This offering surfaces in-memory machine learning techniques such as gradient boosting, factorization machines, neural networks, and much more through its interactive visual interface, SAS^® Studio tasks, procedures, and a Python client. Learn about this multi-faceted new product and see it in action.

Read the paper (PDF)

A major issue in America today is the growing gap between the rich and the poor. Even though the basic concept has entered the public consciousness, the effects of highly concentrated wealth are hotly debated and poorly understood by the general public. The goal of this paper is to get a fair picture of the wealth gap and its ill effects on American society. Before visualizing the financial gap, an exploration and descriptive analysis is carried out. By considering the data (gross annual income, taxable income, and taxes paid), which is available on the website of United States Census Bureau, we try to find out the actual spending capacity of the people in America. We visualize the financial gap on the basis of the spending capacity. With the help of this analysis we try to answer the following questions. Why is it important to have a fair idea of this gap? At what rate is the average wealth of the American population increasing? How does it affect the tax system? Insights generated from answering these questions will be used for further analysis.

View the e-poster or slides (PDF)

Manufacturers of any product from toys to medicine to automobiles must create items that are, above all else, safe to use. Not only is this essential to long-term brand value and corporate success, but it's also required by law. Although perfection is the goal, defects are bound to occur, especially in advanced products such as automobiles. Automobiles are the largest purchase most people make, next to a house. When something that costs tens of thousands of dollars runs into problems, you tend to remember. Recalls in part reflect growing pains after decades of consolidation in the auto industry. Many believe that recalls are the culmination of years of neglect by manufacturers and the agencies that regulate them. For several reasons, automakers are acting earlier and more often in issuing recalls. In the past 20 years, the number of voluntarily recalled vehicles has steadily grown. The automotive-recall landscape changed dramatically in 2000 with the passage of the federal TREAD Act. Before that, federal law required that automakers issue a recall only when a consumer reported a problem. TREAD requires that companies identify potential problems and promptly notify the NHTSA. This is largely due to stricter laws, heavier fines, and more cautious car makers. This study helps automobile manufacturers understand customers who are talking about defects in their cars and to be proactive in recalling the product at the right time before the Government acts.

Read the paper (PDF) | View the e-poster or slides (PDF)

This presentation illustrates new technology that has been added to the SAS^® Customer Intelligence 360 analytics testing suite for digital campaign marketing. SAS Customer Intelligence 360 now has a multivariate testing tool (MVT). In digital marketing, MVT has become an increasingly popular process by which multiple components of a campaign can be tested in a live environment with the goal of finding an optimal mix, which drives a defined response metric. In simple terms, MVT is equivalent to running numerous A/B tests performed simultaneously. In theory, MVT can test the effectiveness of limitless combinations of factors. The number of factor-level combinations determines the test duration and the number of samples required to make statistically sound predictions for all permutations of a full factorial design. SAS^® applies experimental design analytics, driven by an interactive process that considers constraints and control cell definition, to guide the user toward an optimal reduced design of the test that can still adequately predict all factor-level combinations, given available resources. The results of the analysis enable the marketer to compare responses for both observed factor-level combinations to the predicted responses to the untested combinations.

Read the paper (PDF)

As you know, real world data (RWD) provides highly valuable and practical insights. But as valuable as RWD is, it still has limitations. It is encounter-based, and we are largely blind to what happens between encounters in the health-care system. The encounters generally occur in a clinical setting that might not reflect actual patient experience. Many of the encounters are subjective interviews, observations, or self-reports rather than objective data. Information flow can be slow (even real time is not fast enough in health care anymore). And some data that could be transformative cannot be captured currently. Select Internet of Things (IoT) data can fill the gaps in our current RWD for certain key conditions and provide missing components that are key to conducting Analytics of Healthcare Things (AoHT), such as direct, objective measurements; data collected in usual patient settings rather than artificial clinical settings; data collected continuously in a patient s setting; insights that carry greater weight in Regulatory and Payer decision-making; and insights that lead to greater commercial value. Teradata has partnered with an IoT company whose technology generates unique data for conditions impacted by mobility or activity. This data can fill important gaps and provide new insights that can help distinguish your value in your marketplace. Join us to hear details of successful pilots that have been conducted as well as ongoing case studies.

Read the paper (PDF)

Correlated data is extensively used across disciplines when modeling data with any type of correlation that might exist among observations due to clustering or repeated measurements. When modeling clustered data, hierarchical linear modeling (HLM) is a popular multilevel modeling technique that is widely used in different fields such as education and health studies (Gibson and Olejnik, 2003). A typical example of multilevel data involves students nested within classrooms that behave similarly due to shared situational factors. Ignoring their correlation might result in underestimated standard errors and inflated type-I error (Raudenbush and Bryk, 2002). When modeling longitudinal data, many studies have been conducted on continuous outcomes. However, fewer studies on discrete responses over time have been completed. These studies require models within conditional, transitional, and marginal models (Fitzmaurice et al., 2009). Examples of such models that enable researchers to account for the autocorrelation among repeated observations include generalized linear mixed model (GLMM), generalized estimating equations (GEE), alternating logistic regression (ALR), and fixed effects with conditional logit analysis. This study explores the aforementioned methods as well as several other correlated modeling options for longitudinal and hierarchical data within SAS^® 9.4 using real data sets. These procedures include PROC GLIMMIX, PROC GENMOD, PROC NLMIXED, PROC GEE, PROC PHREG, and PROC MIXED.

Read the paper (PDF)

Data from an extensive survey conducted by the National Center for Education Statistics (NCES) is used for predicting qualified secondary school teachers across public schools in the U.S. The sample data includes socioeconomic data at the county level, which is used as a predictor for hiring a qualified teacher. The resultant model is used to score other regions and is presented on a heat map of the U.S. The survey family of procedures that SAS^® offers, such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC, are used in the analyses since the data involves replicate weights. In looking at residuals from a logistic regression, since all the outcomes (observed values) are either 0 or 1, the residuals do not necessarily follow the normal distribution that is so often assumed in residual analysis. Furthermore, in dealing with survey data, the weights of the observations must be accounted for, as these affect the variance of the observations. To adjust for this, rather than looking at the difference in the observed and predicted values, the difference between the expected and actual counts is calculated by using the weights on each observation, and the predicted probability from the logistic model for the observation. Three types of residuals are analyzed: Pearson, Deviance, and Studentized residuals. The purpose is to identify which type of residuals best satisfy the assumption of normality when investigating residuals from a logistic regression.

Read the paper (PDF)

Uber has changed the face of taxi ridership, making it more convenient and comfortable for riders. But, there are times when customers are dissatisfied because of a shortage of Uber vehicles, which ultimately leads to Uber surge pricing. It's a very difficult task to forecast the number of riders at different locations in a city at different points in time. This gets more complicated with changes in weather. In this paper, we attempt to estimate the number of trips per borough on a daily basis in New York City. We add an exogenous factor weather to this analysis to see how it impacts the changes in the number of trips. We fetched six months worth of data (approximately 9.7 million records) of Uber rides in New York City ranging from January 2015 to June 2015 from GitHub. We gathered weather data (about 3.5 million records) for New York City for the same period from the National Climatic Data Center. We analyzed Uber data and weather data together to estimate the change in the number of trips per borough due to changing weather conditions. We built a model to predict the number of trips per day for a one-week-ahead forecast for each borough of New York City. As part of a further analysis, we got the number of trips on a particular day for each borough. Using time series analysis, we forecast the number of trips that might be required in the near future (probably one week).

Read the paper (PDF) | View the e-poster or slides (PDF)

Chronic obstructive pulmonary disease (COPD) is the third leading cause of death in the US. An estimated 24 million people suffer from COPD, and the medical cost associated with it stands at a whopping $36 billion. Besides the emotional and physical impact, a patient with COPD has to undergo severe economic burden to pay for the medication. Hospitals are subjected to heavy penalties for high re-admissions. Identifying the best medicine combinations to treat COPD benefits patients and hospitals. This paper deals with analyzing the effectiveness of three popular drugs prescribed for COPD patients in terms of mortality rates and re-admission within 30 days of discharge. The data from Cerner Health Facts consists of over 1 million real-world, anonymized patient records collected in a real-world health environment. Base SAS^® is used to perform statistical analysis and data processing; re-admission of patients is analyzed using a lag function. The preliminary results show a re-admission rate of 5.96% and a mortality rate of 3.3% among all patients. The odds ratios computed using logistic regression show an increased mortality rate 2.4 times more for patients using Symbicort compared to Spiriva and Advair. This paper also uses text mining of social media, drug portals, and blogs to gauge the sentiments of patients using these drugs. The results obtained through sentiment analysis are then compared with the statistical analysis to determine the effectiveness of drugs prescribed to the COPD patients.

Read the paper (PDF)

Sovereign risk rating and country risk rating are conceptually distinct in that the former captures the risk of a country defaulting on its commercial debt obligations using economic variables while the latter covers the downside of a country's business environment including political and social variables alongside economic variables. Through this paper we would like to understand the differences between these risk approaches in assessing a country's credit worthiness by statistically examining the predictive power of political and social variables in determining country risk. To do this, we wish to build two models, first model with economic variables as regressors (sovereign risk model) and the second model with economic, political and social variables as regressors (country risk model) to compare the predictive power of regressors and model performance metrics between both the models. This will be an OLS regression model with country risk rating obtained from S&P as the target variable. With a general assumption that economic variables are driven by political processes and social factors, we would like to see if the second model has better predictive power. The economic, political and social indicators data that will be used as independent variables in the model will be obtained from world bank open data and target variable (country risk rating) will be obtained from S&P country risk ratings data.

View the e-poster or slides (PDF)

Customer churn is an important area of concern that affects not just the growth of your company, but also the profit. Conventional survival analysis can provide a customer's likelihood to churn in the near term, but it does not take into account the lifetime value of the higher-risk churn customers you are trying to retain. Not all customers are equally important to your company. Recency, frequency, and monetary (RFM) analysis can help companies identify customers that are most important and most likely to respond to a retention offer. In this paper, we use the IML and PHREG procedures to combine the RFM analysis and survival analysis in order to determine the optimal number of higher-risk and higher-value customers to retain.

Read the paper (PDF)

The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable model of text analytics techniques to the publicly available CFPB data. Specifically, we use SAS^® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS^® Visual Analytics and SAS^® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.

Read the paper (PDF)

Research frequently shows that exposure to sunlight contributes to non-melanoma skin cancer. But, it also shows that sunlight might protect you against multiple sclerosis and breast, ovarian, prostate, and colon cancer. In my study, I explored whether mortality from skin cancer, myocardial infarction, atrial fibrillation, and stroke is associated with exposure to sunlight. I used SAS^® 9.4 and RStudio to conduct the entire study. I collected mortality data including cause of death in Los Angeles from 2000 to 2003. In addition, I collected sunlight data for Los Angeles for the same period. There are three types of sunlight in my data global sunlight, diffuse sunlight, and direct sunlight. Data was collected at three different times morning, middle of day, and afternoon. I used two models the Poisson time series regression model and a logistic regression model to investigate the association. I considered a one-year and two-year lag of sunlight association with the types of diseases. I adjusted for age, sex, race, education, temperature, and day of week. Results show that stroke is statistically and significantly associated with a one-year lag of sunlight (p<0.001). Previous epidemiological studies have found that sunlight exposure can ameliorate osteoporosis in stroke patients, and my study provides the protective effects of sunlight on stroke patients.

View the e-poster or slides (PDF)

Many organizations are using SAS^® Visual Analytics for their daily reporting. But as more users gain access to the visual tool, it is easy to lose track of what data is being used, what reports are being accessed, and what elements of the system are classified as critical. With SAS Visual Analytics comes a governance exercise that all organizations should provision for, as otherwise it jeopardizes its maintenance and performance. This paper explores the three different auditing areas that can be configured with SAS Visual Analytics and the different metrics that are associated with them. It presents how to configure the auditing, the data sources that are being populated on the background, and how to exploit them to expand your reports beyond the pre-created audit reports. Consideration is also given to the IT and infrastructure side of enabling auditing mechanisms, with data volumes and archiving practices being at the heart of the discussion.

Read the paper (PDF)

The use of telematics data within the insurance industry is becoming prevalent as insurers use this data to give discounts, categorize drivers, and provide feedback to improve customers' driving. The data captured through in-vehicle or mobile devices includes acceleration, braking, speed, mileage, and many other events. Data elements are analyzed to determine high-risk events such as rapid acceleration, hard braking, quick turning, and so on. The time between these successive high-risk events is a function of the mileage driven and time in the telematics program. Our discussion highlights how we treated these high-risk events as recurrent events and analyzed them using the RELIABILITY procedure within SAS/QC^® software. The RELIABILITY procedure is used to determine a nonparametric mean cumulative function (MCF) of high-risk events. We illustrate the use of the MCF for identifying and categorizing average driver behavior versus individual driver behavior. We also discuss the use of the MCF to evaluate how a loss event or driver feedback can affect future driving behavior.

Read the paper (PDF)

There are many good validation tools for Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) such as Pinnacle 21. However, the power and customizability of SAS^® provide an effective tool for validating SDTM data sets used in clinical trials FDA submissions. This paper presents three distinct methods of using SAS to validate the transformation from Electronic Data Capture (EDC) data into CDISC SDTM format. This includes: duplicate programming, an independent SAS program used to transform EDC data with PROC COMPARE; rules checker, a SAS program to verify a specific SDTM or regulatory rules applied to SDTM SAS data sets; and transformation validation, a SAS macro used to compare EDC data and SDTM using PROC FREQ to identify outliers. The three examples illustrate the diverse approaches to applying SAS programs to catch errors in data standard compliance or identify inconsistencies that would otherwise be missed by other general purpose utilities. The stakes are high when preparing for an FDA submission. Catching errors in SDTM during validation prior to a submission can mean the difference between success or failure for a drug or medical device.

Read the paper (PDF)

Machine learning predictive modeling algorithms are governed by hyperparameters that have no clear defaults agreeable to a wide range of applications. A few examples of quantities that must be prescribed for these algorithms are the depth of a decision tree, number of trees in a random forest, number of hidden layers and neurons in each layer in a neural network, and degree of regularization to prevent overfitting. Not only do ideal settings for the hyperparameters dictate the performance of the training process, but more importantly they govern the quality of the resulting predictive models. Recent efforts to move from a manual or random adjustment of these parameters have included rough grid search and intelligent numerical optimization strategies. This paper presents an automatic tuning implementation that uses SAS/OR^® local search optimization for tuning hyperparameters of modeling algorithms in SAS^® Visual Data Mining and Machine Learning. The AUTOTUNE statement in the NNET, TREESPLIT, FOREST, and GRADBOOST procedures defines tunable parameters, default ranges, user overrides, and validation schemes to avoid overfitting. Given the inherent expense of training numerous candidate models, the paper addresses efficient distributed and parallel paradigms for training and tuning in SAS^® Viya . It also presents sample tuning results that demonstrate improved model accuracy over default configurations and offers recommendations for efficient and effective model tuning.

Read the paper (PDF)

The singular spectrum analysis (SSA) method of time series analysis applies nonparametric techniques to decompose time series into principal components. SSA is particularly valuable for long time series, in which patterns (such as trends and cycles) are difficult to visualize and analyze. An important step in SSA is determining the spectral groupings; this step can be automated by analyzing the w-correlations of the spectral components. This paper provides an introduction to singular spectrum analysis and demonstrates how to use SAS/ETS^® software to perform it. To illustrate, monthly data on temperatures in the United States over the last century are analyzed to discover significant patterns.

Read the paper (PDF)

I have come up with a way to use the output of the SCAPROC procedure to produce DOT directives, which are then put through the Graphviz engine to produce a diagram. This allows the production of flowcharts of SAS^® code automatically. I have enhanced the charts to also show the longest steps by run time, so even if you look at thousands of steps in a complex program, you can easily see the structure and flow of it, and where most of the time is spent, just by having a look for a few seconds. Great for documentation, benchmarking, tuning, understanding, and more.

Read the paper (PDF)

In the pharmaceutical industry, the Clinical Data Interchange Standards Consortium s (CDISC) Study Data Tabulation Model (SDTM) is required by the US Food and Drug Administration (FDA) as the standard data structure for regulatory submission of clinical data. Manually mapping raw data to SDTM domains can be time consuming and error prone, considering the increasing complexity of clinical data. However, this process can be much more efficient if the raw data is collected using the Clinical Data Acquisition Standards Harmonization (CDASH) standard, allowing for the automatic conversion to the SDTM data structure. This paper introduces a macro that can automatically create a SAS^® program for each SDTM domain (for example, dm.sas for Demography [DM]), that maps CDASH data to SDTM data. The macro compares the attributes of CDASH raw data sets with SDTM domains to generate SAS code that performs the mapping. Each SAS program does the following: 1) sets up variables and assigns their proper order; 2) converts date and time to ISO8601 standard; 3) converts numeric variables to character variables; and 4) transposes the data sets from wide to long for the Findings and Events domains. This macro, which sets up the basic frame of SDTM mapping, can minimize the manual work for SAS programmers or, in some cases, completely handle some simple domains without any further modifications. This greatly increases the efficiency and speed of the SDTM conversion process.

Read the paper (PDF)

A lot of time and effort goes into creating presentations or dashboards for the purposes of management business reviews. Data for the presentation is produced from a variety of tools, and the output is cut and pasted into Microsoft PowerPoint or Microsoft Excel. Time is spent not only on the data preparation and reporting, but also on the finishing and touching up of these presentations. In previous years, SAS^® Global Forum authors have described the automation capabilities of SAS^® and Microsoft Office. The default look and feel of SAS output in Microsoft PowerPoint and Microsoft Excel is not always adequate for the more polished requirement of an executive presentation. This paper focuses on how to combine the capabilities of SAS^® Enterprise Guide^®, SAS^® Visual Analytics, and Microsoft PowerPoint into a finished, professional presentation. We will build and automate a beautiful finished end product that can be refreshed by anyone with the click of a mouse.

Read the paper (PDF)

Let's walk through an example of communicating from the SAS^® client to SAS^® Viya . The demonstration focuses on how to use SAS^® language to establish a session, transport and persist data, and receive results. Learn how to establish communication with SAS Viya. Explore topics such as: What is a session? How do I make requests? What does my SAS log tell me? Get a deeper understanding of data location on the client and the server side. Learn about applying existing user formats, how to get listings or reports, and how to query sessions, data, and properties.

Read the paper (PDF)

You have SAS^® software. You have databases or data platforms like Hadoop, possibly with some large distributed data. If you already know how to make SAS code talk to your data platforms, you have already taken a solid step toward a successful integration. But you might also want to know how to take this communication to a different level. If your data platform is the one that is built for massively parallel processing, chances are that SAS code has already created the SAS^® Embedded Process framework that allows SAS tasks to be embedded next to your data sources for execution. SAS^® In-Database Technologies is a family of products that use this framework and provide an accelerated level of integration. This paper explains core principles behind these technologies and presents application scenarios for each of these products. We use a variety of examples to highlight the specifics of individual SAS accelerators (SAS^® Scoring Accelerator, SAS^® In-Database Code Accelerator, and others) across the data platforms.

Read the paper (PDF)

Understanding customer behavior profiles is of great value to companies. Customer behavior is influenced by a multitude of elements-some are capricious, presumably resulting from environmental, economic, and other factors, while others are more fundamentally aligned with value and belief systems. In this paper, we use unstructured textual cheque card data to model and estimate latent spending behavioral profiles of banking customers. These models give insight into unobserved spending habits and patterns. SAS^® Text Miner is used in an atypical manner to determine the buying segments of customers and the latent buying profile using a clustering approach. Businesses benefit in the way the behavioral spend model is used. The model can be used for market segmentation, where each cluster is seen as a target marketing segment, leads optimization, or product offering where products are specifically compiled to align to each customer's requirements. It can also be used to predict future spend or to align customer needs with business offerings, supported by signing customers onto loyalty programs. This unique method of determining the spend behavior of customers makes it ideal for companies driving retention and loyalty in their customers.

Read the paper (PDF) | View the e-poster or slides (PDF)

Connecting database schemas to libraries in the SAS^® metadata is a very important part of setting up a functional and useful environment for business users. This task can be quite difficult for the untrained administrator. This paper addresses the key configuration items that often go unnoticed but that can make a big difference. Using the wrong options can lead to poor database performance or even to a total lockdown, depending on the number of connections to the database.

Read the paper (PDF)

It's essential that SAS^® users enhance their skills to implement best-practice programming techniques when using Base SAS^® software. This presentation illustrates core concepts with examples to ensure that code is readable, clearly written, understandable, structured, portable, and maintainable. Attendees learn how to apply good programming techniques including implementing naming conventions for data sets, variables, programs, and libraries; code appearance and structure using modular design, logic scenarios, controlled loops, subroutines and embedded control flow; code compatibility and portability across applications and operating platforms; developing readable code and program documentation; applying statements, options, and definitions to achieve the greatest advantage in the program environment; and implementing program generality into code to enable its continued operation with little or no modifications.

Read the paper (PDF)

Nearly every SAS^® program includes logic that causes certain code to be executed only when specific conditions are met. This is commonly done using the IF-THEN/ELSE syntax. This paper explores various ways to construct conditional SAS logic, some of which might provide advantages over the IF statement in certain situations. Topics include the SELECT statement, the IFC and IFN functions, and the CHOOSE and WHICH families of functions, as well as some more esoteric methods. We also discuss the intricacies of the subsetting IF statement and explain the difference between a regular IF and the %IF macro statement.

Read the paper (PDF)

Soon after the advent of the SAS^® hash object in SAS^®9, its early adopters realized that its potential functionality is much broader than merely using its fast table lookup capability for file matching. This is because in reality, the hash object is a versatile data storage structure with a roster of standard table operations such as create, drop, insert, delete, clear, search, retrieve, update, order, and enumerate. Since it is memory-resident and its key-access operations execute in O(1) time, it runs them as fast as or faster than other canned SAS techniques, with the added bonus of not having to code around their inherent limitations. Another advantage of the hash object as compared to the methods that had existed before its implementation is its dynamic, run-time nature and the ability to handle I/O all by itself, independently of the intrinsic statements of a DATA step or DS2 program calling its methods. The hash object operations, or their combination thereof, lend themselves to diverse SAS programming functionalities well beyond the original focus on data search and retrieval. In this paper, which can be thought of as a preview of a SAS book being written by the authors, we aim to present this logical connection using the power of example.

Read the paper (PDF)

Data that are gathered in modern data collection processes are often large and contain geographic information that enables you to examine how spatial proximity affects the outcome of interest. For example, in real estate economics, the price of a housing unit is likely to depend on the prices of housing units in the same neighborhood or nearby neighborhoods, either because of their locations or because of some unobserved characteristics that these neighborhoods share. Understanding spatial relationships and being able to represent them in a compact form are vital to extracting value from big data. This paper describes how to glean analytical insights from big data and discover their big value by using spatial econometric methods in SAS/ETS^® software.

Read the paper (PDF)

As in any analytical process, data sampling is a definitive step for unstructured data analysis. Sampling is of paramount importance if your data is fed from social media reservoirs such as Twitter, Facebook, Amazon, and Reddit, where information proliferation happens minute by minute. Unless you have a sophisticated analytical engine and robust physical servers to handle billions of pieces of data, you can't use all your data for analysis without sampling. So, how do you sample textual data? The standard method is to generate either a simple random sample, or a stratified random sample if a stratification variable exists in the data. Neither of these two methods can reliably produce a representative sample of documents from the population data simply because the process does not encompass a step to ensure that the distribution of terms between the population and sample sets remains similar. This shortcoming can cause the supervised or unsupervised learning to yield inaccurate results. If the generated sample is not representative of the population data, it is difficult to train and validate categories or sub-categories for those rare events during taxonomy development. In this paper, we show you new methods for sampling text data. We rely on a term-by-document matrix and SAS^® macros to generate representative samples. Using these methods helps generate sufficient samples to train and validate every category in rule-based modeling approaches using SAS^® Contextual Analysis.

Read the paper (PDF)

Often the burden of productionisation of analytical models falls upon the analyst, so every Monday morning the analyst comes in and presses the Run button. Now this is obviously fraught with danger (for example, the source data isn't available, the analyst goes on holidays, or the analyst resigns), and might lead to invalid results being consumed by downstream systems. There are many reasons that this might occur, but the most common one is that it takes IT too long to put a model into full production (especially if that model contains new data sources). In this presentation, I show a tested architecture that allows for the typical rapid development of models (and in fact it actually significantly speeds up the discovery phase), as well as allows for an orderly handover to IT for them to productionise without disrupting the regular run of the models. This allows for notification of downstream users if there is a delay in the arrival of data, as well as rapid IT Operations response if there is a problem during the loading and creation.

Whether you are calculating a credit risk, a health risk, or something entirely different, you need instant, on-the-fly risk score calculation across multiple industries. This paper demonstrates how you can produce individualized risk scores through interactive dashboards. Your risk scores are backed by powerful SAS^® analytics because they leverage score code that you produce in SAS^® Visual Statistics. Advanced topics, including the use of calculated items and parameters in your dashboards, as well as how to develop SAS^® Stored Processes capable of accepting parameters that are passed through your SAS^® Visual Analytics Dashboard are covered in detail.

Read the paper (PDF)

SAS^® is perfect for building enterprise apps. Think about it: SAS speaks to almost any database you can think of and is probably already hooked in to most of the data sources in your organization. A full-fledged metadata security layer happens to already be integrated with your single sign-on authentication provider, and every time a user interacts with the system, their permissions are checked and the data their app asks for is automatically encrypted. SAS ticks all the boxes required by IT, and the skills required to start developing apps already sit within your department. Your team most likely already knows what your app needs to do, so instead of writing lists of requirements, give them an HTML5 resource, and together they can write and deploy the back-end code themselves. The apps run in the browser, the server-side logic is deployed using SAS .spk packages, and permissions are managed via SAS^® Management Console. Best of all, the infrastructure that would normally take months to integrate is already there, eliminating barriers to entry and letting you demonstrate the value of your solution to internal customers with zero up-front investment. This paper shows how SAS integrates with open-source tools like H54S, AngularJS, and PostGIS, together with next-generation developer-centric analytical platforms like SAS^® Viya , to build secure, enterprise-class apps that can support thousands of users. This presentation includes lots of app demos. This presentation was included at SAS^® Forum UK 2016.

Read the paper (PDF)

Cascading Style Sheets (CSS) frameworks like Bootstrap, and JavaScript libraries such as jQuery and h54s, have made it faster than ever before to develop enterprise-grade web apps on top of the SAS^® platform. Hailing the benefits of using SAS as a back end (authentication, security, ease of data access), this paper navigates the configuration issues to consider for maximum responsiveness to client web requests (pooled sessions, load balancing, multibridge connections). Cherry picking from the whirlwind of front end technologies and approaches, the author presents a framework that enables the novice programmer to build a simple web app in minutes. The exact steps necessary to achieve this are described, alongside a hurricane of practical tips like the following: dealing with CORS; logging in SAS; debugging AJAX calls; and SAS http responses. Beware this approach is likely to cause a storm of demand in your area! Server requirements: SAS^® Business Intelligence Platform (SAS^® 9.2 or later); SAS^® Stored Process Web Application (SAS^® Integration Technologies). Client requirements: HTML5 browser (Microsoft Internet Explorer 8 or later); access to open-source libraries (which can be hosted on-premises if Internet access is an issue).

Read the paper (PDF)

A Bayesian network is a directed acyclic graphical model that represents probability relationships and conditional independence structure between random variables. SAS^® Enterprise Miner implements a Bayesian network primarily as a classification tool; it includes na ve Bayes, tree-augmented na ve Bayes, Bayesian-network-augmented na ve Bayes, parent-child Bayesian network, and Markov blanket Bayesian network classifiers. The HPBNET procedure uses a score-based approach and a constraint-based approach to model network structures. This paper compares the performance of Bayesian network classifiers to other popular classification methods, such as classification tree, neural network, logistic regression, and support vector machines. The paper also shows some real-world applications of the implemented Bayesian network classifiers and a useful visualization of the results.

Read the paper (PDF)

The SAS^® Macro Language gives you the power to create tools that, to a large extent, think for themselves. How often have you used a macro that required your input, and you thought to yourself, Why do I need to provide this information when SAS^® already knows it? SAS might already know most of this information, but how does SAS direct your macro programs to self-discern the information that they need? Fortunately, there are a number of functions and tools in SAS that can intelligently enable your programs to find and use the information that they require. If you provide a variable name, SAS should know its type and length. If you provide a data set name, SAS should know its list of variables. If you provide a library or libref, SAS should know the full list of data sets that it contains. In each of these situations, functions can be used by the macro language to determine and return information. By providing a libref, functions can determine the library's physical location and the list of data sets it contains. By providing a data set, they can return the names and attributes of any of the variables that it contains. These functions can read and write data, create directories, build lists of files in a folder, and build lists of folders. Maximize your macro's intelligence; learn and use these functions.

Read the paper (PDF)

Historically, the risk and finance functions within a bank have operated within different rule sets and structures. Within its function, risk enjoys the freedom needed to properly estimate various types of risk. Finance, on the other hand, operates within the well-defined and structured rules of accounting, which are required for standardized reporting. However, the International Financial Reporting Standards (IFRS) newest standard, IFRS 9, brings these two worlds together: risk, to estimate credit losses, and finance, to determine their impact on the balance sheet. To help achieve this integration, SAS^® has introduced SAS^® Expected Credit Loss. SAS Expected Credit Loss enables customers to perform risk calculations in a controlled environment, and to use those results for financial reporting within the same managed environment. The result is an integrated and scalable risk and finance platform, providing the end-to-end control, auditability, and flexibility needed to meet the IFRS 9 challenge.

Read the paper (PDF)

Health insurers have terabytes of transactional data. However, transactional data does not tell a member-level story. Humana Inc. is often faced with requirements for tagging (identifying) members with various clinical conditions such as diabetes, depression, hypertension, hyperlipidemia, and various member-level utilization metrics. For example, Consumer Health Tags are built to identify the condition (that is, diabetes, hypertension, and so on) and to estimate the intensity of the disease using medical and pharmacy administrative claims data. This case study takes you on an analytics journey from the initial problem diagnosis and analytics solution using SAS^®.

Read the paper (PDF)

Coming off a recent smart grid implementation, OGE Energy Corp. was collecting more data than at any time in its history. This data held the potential to help the organization uncover new insights and chart new paths. Find out how OGE Energy is building a culture of data analytics by using SAS^® tools, a distributed analytics model, and an analytics center of excellence.

Digital transformation and analytics for incumbents isn't a question of choice or strategy. It's a question of business survival. Go analytics!

When new technologies, workflows, or processes are implemented, an organization and its employees must embrace changes in order to ensure long-term success. This paper provides guidelines and best practices in change management that the SAS Advanced Analytics Division uses with customers when it implements prescriptive analytics solutions (provided by SAS/OR^® software). Highlights include engaging technical leaders in defining project scope and providing functional design documents. The paper also highlights SAS' approach in engaging business leaders on business scope, garnering executive-level project involvement, establishing steering committees, defining use cases, developing an effective communication strategy, training, and implementing of SAS/OR solutions.

Read the paper (PDF)

Rapid advances in technology have empowered musicians all across the globe to share their music easily, resulting in intensified competition in the music industry. For this reason, musicians and record labels need to be aware of factors that can influence the popularity of their songs. The focus of our study is to determine how themes, topics, and terms within song lyrics have changed over time and how these changes might have influenced the popularity of songs. Moreover, we plan to run time series analysis on the numeric attributes of Billboard Top 100 songs in order to determine the appropriate combination of relevant attributes that influences a song's popularity. The findings of our study can potentially benefit musicians and record labels in understanding the necessary lyrical construction, overall themes, and topics that might enable a song to reach the highest chart position on the Billboard Top 100. The Billboard Top 100 is an optimal source of data, as it is an objective measure of popularity. Our data has been collected from open sources. Our data set consists of all 334,784 Billboard Top 100 observations for the years 1955-2015, with metadata covering all 26,869 unique songs that have appeared on the chart for that period. Our expanding lyric data set currently contains 18,002 of those songs, which were used to conduct our analysis. SAS^® Enterprise Miner and SAS^® Sentiment Analysis Studio were the primary tools of our analysis.

View the e-poster or slides (PDF)

SAS^® Output Delivery System (ODS) Graphics started appearing in SAS^® 9.2. Collectively these new tools were referred to as 'ODS Graphics,' 'SG Graphics' and 'Statistical Graphics'. When first starting to use these tools, the traditional SAS/GRAPH^® software user might come upon some very significant challenges in learning the new way to do things. This is further complicated by the lack of simple demonstrations of capabilities. Most graphs in training materials and publications are rather complicated graphs that, while useful, are not good teaching examples for starting purposes. This paper contains many examples of very simple ways to get very simple things accomplished. Many different graphs are developed using only a few lines of code each, using data from the SASHELP data sets. The use of the SGPLOT, SGPANEL, and SGSCATTER procedures are shown. In addition, the paper addresses those situations in which the user must alternatively use a combination of the TEMPLATE and SGRENDER procedures to accomplish the task at hand. Most importantly, the use of the 'ODS Graphics Designer' as a teaching tool and a generator of sample graphs and code are covered. This tool makes use of the TEMPLATE and SGRENDER Procedures, generating Graphics Template Language (GTL) code. Users get extremely productive fast. The emphasis in this paper is the simplicity of the learning process. Users will be able to take the generated code and run it immediately on their personal machines.

Read the paper (PDF) | View the e-poster or slides (PDF)

In the pharmaceutical industry, we find ourselves having to re-run our programs repeatedly for each deliverable. These programs can be run individually in an interactive SAS^® session, which enables us to review the logs as we execute the programs. We could run the individual programs in batch and open each individual log to review for unwanted log messages, such as ERROR, WARNING, uninitialized, have been converted to, and so on. Both of these approaches are fine if there are only a handful of programs to execute. But what do you do if you have hundreds of programs that need to be re-run? Do you want to open every single one of the programs and search for unwanted messages? This manual approach could take hours and is prone to accidental oversight. This paper discusses a macro that searches a specified directory and checks either all the logs in the directory, only logs with a specific naming convention, or only the files listed. The macro then produces a report that lists all the files checked and indicates whether issues were found.

Read the paper (PDF)

Designing a survey is a meticulous process involving a number of steps and many complex choices. For most survey researchers, the choice of a probability or non-probability sample is somewhat simple. However, throughout the sample design process, there are more complex choices each of which can introduce bias into the survey estimates. For example, the sampling statistician must decide whether to stratify the frame. And, if so, he has to decide how many strata, whether to explicitly stratify, and how should a stratum be defined. He also has to decide whether to use clusters. And, if so, how to define a cluster and what should be the ideal cluster size. The factors affecting these choices, along with the impact of different sample designs on survey estimates, are explored in this paper. The SURVEYSELECT procedure in SAS/STAT^® 14.1 is used to select a number of samples based on different designs using data from Jamaica's 2011 Population and Housing Census. Census results are assumed to be equal to the true population parameter. The estimates from each selected sample are evaluated against this parameter to assess the impact of different sample designs on point estimates. Design-adjusted survey estimates are computed using the SURVEYMEANS and SURVEYFREQ procedures in SAS/STAT 14.1. The resultant variances are evaluated to determine the sample design that yields the most precise estimates.

Read the paper (PDF)

SAS^® is often deployed in a client/server architecture in which SAS^® Foundation is installed on a server and is accessed from each user's workstation. Many system administrators prefer that users not log on directly to the server to run SAS, nor do they want to set up a complex Citrix environment. SAS client applications are an attractive alternative for this type of architecture. But with the advent of multiple SAS^® Studio editions and ongoing enhancements to SAS^® Enterprise Guide^®, choosing the most suitable client application presents a challenge for many system administrators. To help guide you in this choice, this paper compares the administration of three SAS Foundation client applications that can be used in a client/server architecture: SAS Enterprise Guide, SAS^® Studio Basic, and SAS^® Studio Mid-Tier. The usage differences between SAS Studio and SAS Enterprise Guide have been addressed elsewhere. In this paper, we focus on differences that pertain specifically to system administration, including deployment, maintenance, and authentication. The information presented here will help system administrators determine which application best fits the needs of their users and their environment.

Read the paper (PDF)

It takes months to find a customer and only seconds to lose one Unknown. Though the Business-to-Business (B2B) churn problem might not be as common as Business-to-Consumer (B2C) churn, it has become crucial for companies to address this effectively as well. Using statistical methods to predict churn is the first step in the process of retaining customers, which also includes model evaluation, prescriptive analytics (including outreach optimization), and performance reporting. Providing visibility into model and treatment performance enables the Data and Ops teams to tune models and adjust treatment strategy. West Corporation's Center for Data Science (CDS) has partnered with one of the lines of businesses in order to measure and prevent B2B customer churn. CDS has coupled firmographic and demographic data with internal CRM and past outreach data to build a Propensity to Churn model using SAS^®. CDS has provided the churn model output to an internal Client Success Team (CST), who focuses on high-risk/high-value customers in order to understand and provide resolution to any potential concerns that might be expressed by such customers. Furthermore, CDS automated weekly performance reporting using SAS and Microsoft Excel that not only focuses on model statistics, but also on CST actions and impact. This paper focuses on all of the steps involved in the churn-prevention process, including building and reviewing the model, treatment design and implementation, as well as performance reporting.

Today it is vital for an organization to manage, distribute, and secure content for its employees. In most cases, different groups of employees are interested in different content, and some content should not be available to everyone. It is the SAS^® administrator's job to design a metadata group structure that makes managing content easier. SAS enables you to create any metadata group organizational structure imaginable, and it is common to define a metadata group structure that mimics the organization's hierarchy. Circular group memberships are frequently the cause of unexpected issues with SAS web applications. A circular group relationship can be as simple as two groups being members of one another. You might not be aware that you have defined this type of recursive association between groups. The paper identifies some problems that are caused by recursive group memberships and provides tools to investigate your metadata group structure that help identify recursive metadata group relationships. We explain the process of extracting group associations from the SAS^® Metadata Server, and we show how to organize this data to investigate group relationships. We use a stored process to generate a report and SAS^® Visual Analytics to generate a network diagram that provides a graphical representation of an organization's group relationship structure, to easily identify circular group structures.

Read the paper (PDF)

In this paper, we introduce a SAS/IML^® program of Classification Accuracy and Classification Consistency (CA/CC) that provides useful resources to test analysts or psychometricians. Our program optimizes functions of SAS^® by offering the CA/CC statistics not only with dichotomous items, but also with polytomous items. Classification Decision (CD) is a method to categorize examinees into achievement groups based on cut scores (Quinn and Cheng, 2013). CD has been predominantly used in educational and vocational situations such as admissions, selection, placement, or certification. This method needs to be accurate because its use has been important to examinees' professional and academic futures. Classification Accuracy and Classification Consistency (CA/CC) statistics are indices representing the precision of CD, and they need to be reported in order to affirm the validity of the CD. Classification Accuracy is referred to as the degree to which the classification of observed scores matches with the classification of true scores, and Classification Consistency is defined as the degree to which examinees are classified in the same category when taking two parallel test forms (Lee, 2010). Under item response theory (IRT), there are two methods to calculate CA/CC: Rudner (2001) and Lee (2010) approaches. This research deals with these two approaches for CA/CC with the examinee level.

View the e-poster or slides (PDF)

The Institute for Advanced Analytics struggled to provide student computing environments capable of analyzing increasingly larger data sets for its Master of Science in Analytics program. For the fast-paced practicum, the centerpiece of the curriculum, waiting 24 hours for a FREQ procedure to complete was unacceptable. Practicum proposals from industry were pared down (or turned down) because the data sets were too large, depriving students of exciting and relevant learning experiences. By augmenting the practicum architecture with an 18-node computing cluster running SAS^® Grid Manager, SAS^® Visual Analytics, and the latest high-performance SAS^® procedures, we were able to dramatically increase performance and begin accepting terabyte-scale practicum proposals from industry. In this paper, we discuss the benefits and lessons learned through adding these SAS products to our analytics degree program including capability versus complexity tradeoffs, and the state of our current capabilities and limitations with this architecture.

Read the paper (PDF)

A/B testing is a form of statistical hypothesis testing on two business options (A and B) to determine which is more effective in the modern Internet age. The challenge for startups or new product businesses leveraging A/B testing are two-fold: a small number of customers and poor understanding of their responses. This paper shows you how to use the IML and POWER procedures to deal with the reassessment of sample size for adaptive multiple business stage designs based on conditional power arguments, using the data observed at the previous business stage.

Read the paper (PDF)

The LUA procedure is a relatively new SAS^® procedure, having been available since SAS^® 9.4. It allows for the Lua language to be used as an interface to SAS, as an alternative scripting language to the SAS macro facility. This paper compares and contrasts PROC LUA with the SAS macro facility, showing examples of approaches and highlighting the advantages and disadvantages of each.

Read the paper (PDF)

Epidemiologists and other health scientists are often tasked with solving health problems but find collecting original data prohibitive for a multitude of reasons. For this reason, it is common to instead use secondary data such as that from emergency departments (ED) or inpatient hospital stays. In order to use some of these secondary data sets to study problems over time, it is necessary to link them together using common identifiers and still keep all the unique information about each ED visit or hospitalization. This paper discusses a method that was used to combine five years worth of individual ED visits and five years worth of individual hospitalizations to create a single and (much) larger data set for longitudinal analysis.

Read the paper (PDF)

Regarding a human disease network, most studies have estimated the associations of disorders primarily with gene or protein information. Those studies, however, have some difficulties in the data because of the massive volume of data and the huge computational cost. Instead, we constructed a human disease network that can describe the associations between diseases, using the claim data of Korean health insurance. Through several statistical analyses, we show the applicability and suitability of the disease network. Furthermore, we develop a statistical model that can predict a prevalence rate for dementia by using significant associations of the network in a statistical perspective.

Read the paper (PDF)

This presentation discusses the options for including continuous covariates in regression models. In his book, 'Clinical Prediction Models,' Ewout Steyerberg presents a hierarchy of procedures for continuous predictors, starting with dichotomizing the variable and moving to modeling the variable using restricted cubic splines or using a fractional polynomial model. This presentation discusses all of the choices, with a focus on the last two. Restricted cubic splines express the relationship between the continuous covariate and the outcome using a set of cubic polynomials, which are constrained to meet at pre-specified points, called knots. Between the knots, each curve can take on the shape that best describes the data. A fractional polynomial model is another flexible method for modeling a relationship that is possibly nonlinear. In this model, polynomials with noninteger and negative powers are considered, along with the more conventional square and cubic polynomials, and the small subset of powers that best fits the data is selected. The presentation describes and illustrates these methods at an introductory level intended to be useful to anyone who is familiar with regression analyses.

Read the paper (PDF)

Learn how SAS^® works with analytics, big data, and the cloud all in one product: SAS^® Analytics for Containers. This session describes the architecture of containers running in the public, private, or hybrid cloud. The reference architecture also shows how SAS leverages the distributed compute of Hadoop. Topics include how SAS products such as Base SAS^®, SAS/STAT^® software, and SAS/GRAPH^® software can all run in a container in the cloud. This paper discusses how to work with a SAS container running in a variety of Infrastructure as a Service (IaaS) models, including Amazon Web Services and OpenStack cloud. Additional topics include provisioning web-browser-based clients via Jupyter Notebooks and SAS^® Studio to provide data scientists with the tool of their choice. A customer use case is discussed that describes how SAS Analytics for Containers enables an IT department to meet the ad hoc, compute-intensive, and scaling demands of the organization. An exciting differentiator for the data scientist is the ability to send some or all of the analytic workload to run inside their Hadoop cluster by using the SAS accelerators for Hadoop. Doing so enables data scientists to dive inside the data lake and harness the power of all the data.

Read the paper (PDF)

An interactive voice response (IVR) system is a powerful tool that automates routine inbound call tasks. Companies leverage this system and make substantial savings by cutting down call center costs while customers from do-it-yourselfers to non-tech-savvies take advantage of this technology rather than wait in line to speak to a Customer Care Representative (CSR). The flip side of the coin is that customers often see IVR as a barrier to overcome in order to talk to a real person. So it is important that IVR is managed in such a way that it is mutually beneficial for both a business and their customers. If managing IVR is critical, then measuring Customer Satisfaction (CSAT) scores is paramount as it helps in understanding customers better. The first section of this paper discusses analysis of different use cases on how CSAT scores correlate with customers' journeys inside IVR. West Corporation's leading financial services client offers a survey to their customers, and customers rate questions on a scale of 1 to 10 based on their IVR experience (10 being extremely satisfied). Analysis of survey ratings using SAS^® helped Operations understand challenges faced by customers traversing different sections of the IVR. The second section of the paper discusses how the research helped Operations to identify a population specification error that occurred while surveying customers. The error was rectified, and IVR CSAT scores improved by 3%.

Read the paper (PDF)

This end-to-end capability demonstration illustrates how SAS^® Viya can aid intelligence, homeland security, and law enforcement agencies in counter radicalization. There are countless examples of agency failure to apportion significance to isolated pieces of information which, in context, are indicative of an escalating threat and require intervention. Recent terrorist acts have been carried out by radicalized individuals who should have been firmly on the organizational radar. Although SAS^® products enable analysis and interpretation of data that enables the law enforcement and homeland security community to recognize and triage threats, intelligence information must be viewed in full context. SAS Viya can rationalize previously disconnected capabilities in a single platform, empowering intelligence, security, and law enforcement agencies. SAS^® Visual Investigator provides a hub for SAS^® Event Stream Processing, SAS^® Visual Scenario Designer, and SAS^® Visual Analytics, combining network analysis, triage, and, by leveraging the mobile capability of SAS, operational case management to drive insights, leads, and investigation. This hub provides the capability to ingest relevant external data sources, and to cross reference both internally held data and, crucially, operational intelligence gained from normal policing activities. This presentation chronicles the exposure and substantiation of a radical network and informs tactical and strategic disruption.

Read the paper (PDF)

This paper shows how to use Base SAS^® to create unique datetime stamps that can be used for naming external files. These filenames provide automatic versioning for systems and are intuitive and completely sortable. In addition, they provide enhanced flexibility compared to generation data sets, which can be created by SAS^® or by the operating system.

Read the paper (PDF)

Clinical research study enrollment data consists of subject identifiers and enrollment dates that are used by investigators to monitor enrollment progress. Meeting study enrollment targets is critical to ensuring there will be enough data and end points to enable the statistical power of the study. For clinical trials that do not experience heavy, nearly daily enrollment, there will be a number of dates on which no subjects were enrolled. Therefore, plots of cumulative enrollment represented by a smoothed line can give a false impression, or imprecise reading, of study enrollment. A more accurate display would be a step function plot that would include dates where no subjects were enrolled. Rolling average plots often start with summing the data by month and creating a rolling average from the monthly sums. This session shows how to use the EXPAND procedure, along with the SQL and GPLOT procedures and the INTNX function, to create plots that display cumulative enrollment and rolling 6-month averages for each day. This includes filling in the dates with no subject enrollment and creating a rolling 6-month average for each date. This allows analysis of day-to-day variation as well as the short- and long-term impacts of changes, such as adding an enrollment center or initiatives to increase enrollment. This technique can be applied to any data that has gaps in dates. Examples include service history data and installation rates for a newly launched product.

Read the paper (PDF)

Computer and video games are complex these days. Events in video games are in some cases recorded automatically in text files, creating a history or story of game play. There are countable items in these event records that can be used as data for statistics and other types of modeling. This E-Poster shows you how to statistically analyze text files for video game events using SAS^®. Two games are analyzed. EVE Online, a massive multi-user online role-playing spaceship game, is one. The other game is Ingress, a cell phone game that combines exercise with a GPS and real-world environments. In both examples, the techniques involve parsing large amounts of text data to examine recurring patterns in text that describe events in the game play.

View the e-poster or slides (PDF)

This presentation describes an ongoing effort to standardize and simplify SAS^® coding across a rapidly growing analytics team in the health care industry. The number of SAS analysts in Kaiser Permanente's Data and Information Management Enhancement (DIME) department has nearly doubled in the past two years, going from approximately 20 to 40 analysts. The level of experience and technical skill varies greatly within the department. Analysts are required to provide quick turn-around on a large volume of analytical requests in this dynamic and high-demand environment. An effort was initiated in 2016 to create a SAS^® Enterprise Guide^® Template to standardize and simplify SAS coding across the department. The SAS Enterprise Guide^® template is designed to be a standard project file containing predefined code shells and examples that can be used as a basis for all new SAS Enterprise Guide^® projects. The primary goals of the template are to: 1) Effectively onboard new analysts to department standards; 2) Increase the efficiency of SAS development; 3) Bring consistency to how SAS is used; and 4) Simplify the transitioning of SAS jobs to the department's Production Support team. This presentation focuses on the process in which the template was initiated, drafted, and socialized across a large and diverse team of SAS analysts. It also highlights plans for ongoing maintenance of and improvements to the original template.

Read the paper (PDF)

We've learned a great deal about how to develop great reports and about business intelligence (BI) tools and how to use them to create reports, but have we figured out how to create true BI reports? Not every report that comes out of a BI tool provides business intelligence! In pursuit of the perfect BI report, this paper explores how we can combine the best of lessons learned about developing and running traditional reports and about applying business analytics in order to create true BI reports that deliver integrated analytics and intelligence.

Read the paper (PDF)

The DATA step is the familiar and powerful data processing language in SAS^® and now SAS Viya . The DATA step's simple syntax provides row-at-a-time operations to edit, restructure, and combine data. New to the DATA step in SAS Viya are a varying-size character data type and parallel execution. Varying-size character data enables intuitive string operations that go beyond the 32KB limit of current DATA step operations. Parallel execution speeds the processing of big data by starting the DATA step on multiple machines and dividing data processing among threads on these machines. To avoid multi-threaded programming errors, the run-time environment for the DATA step is presented along with potential programming pitfalls. Come see how the DATA step in SAS Viya makes your data processing simpler and faster.

Read the paper (PDF)

The DATA Step has served SAS^® programmers well over the years. Although the DATA step is handy, the new, exciting, and powerful DS2 provides a significant alternative to the DATA step by introducing an object-oriented programming environment. It enables users to effectively manipulate complex data and efficiently manage the programming through additional data types, programming structure elements, user-defined methods, and shareable packages, as well as providing threaded execution. This tutorial was developed based on our experiences with getting started with DS2 and learning to use it to access, manage, and share data in a scalable and standards-based way. It facilitates SAS users of all levels to easily get started with DS2 and understand its basic functionality by practicing how to use the features of DS2.

Read the paper (PDF) | Download the data file (ZIP)

For all business analytics projects big or small, the results are used to support business or managerial decision-making processes, and many of them eventually lead to business actions. However, executives or decision makers are often confused and feel uninformed about contents when presented with complicated analytics steps, especially when multi-processes or environments are involved. After many years of research and experiment, a web reporting framework based on SAS^® Stored Processes was developed to smooth the communication between data analysts, researches, and business decision makers. This web reporting framework uses a storytelling style to present essential analytical steps to audiences, with dynamic HTML5 content and drill-down and drill-through functions in text, graph, table, and dashboard formats. No special skills other than SAS^® programming are needed for implementing a new report. The model-view-controller (MVC) structure in this framework significantly reduced the time needed for developing high-end web reports for audiences not familiar with SAS. Additionally, the report contents can be used to feed to tablet or smartphone users. A business analytical example is demonstrated during this session. By using this web reporting framework based on SAS Stored Processes, many existing SAS results can be delivered more effectively and persuasively on a SAS^® Enterprise BI platform.

Read the paper (PDF)

Do your reports effectively communicate the message you intended? Are your reports aesthetically pleasing? An attractive report does not ensure the accurate delivery of a data story, nor does a logical data story guarantee visual appeal. This paper provides guidance for SAS^® Visual Analytics Designer users to facilitate the creation of compelling data stories. The primary goal of a report is to enable readers to quickly and easily get answers to their questions. Achieving this goal is strongly influenced by the choice of visualizations for the data, the quantity and arrangement of the information that is included, and the use or misuse of color. This paper describes how to guide readers' movement through a report to support comprehension of the data story; provides tips on how to express quantitative data using the most appropriate graphs; suggests ways to organize content through the use of visual and interactive design techniques; and instructs report designers about the meaning of colors, presenting the notion that even subtle changes in color can evoke feelings that are different from those intended. A thoughtfully designed report can educate the viewer without compromising visual appeal. Included in this paper are recommendations and examples which, when applied to your own work, will help you create reports that are both informative and beautiful.

Read the paper (PDF)

Users want more power. SAS^® delivers. Data grids are a new data type available to users of SAS^® Business Rules Manager and SAS^® Decision Manager. These data grids can be deployed to both batch and web service scoring for data mining models and business decisions. Users will learn how to construct data with grid data types, create business rules using high-level expressions, and deploy decisions to both batch and web services for scoring.

Read the paper (PDF)

SAS^® Visual Analytics is a very powerful tool for users to visually explore data, but in some organizations not all data should be available for everybody. And although it is relatively easy to scale up a SAS Visual Analytics environment when the need for data increases, it still would be beneficial to set up a structure where the organization can keep control over who actually has the right to load data versus providing everybody the right to load data into a SAS Visual Analytics environment. Within this breakout session a potential solution is shown by providing a high-level overview of the SAS Visual Analytics data access management solution at ING bank in the Netherlands for the Risk Services Organization.

Read the paper (PDF)

As an information security or data professional, you have seen and heard about how advanced analytics has impacted nearly every business domain. You recognize the potential of insights derived from advanced analytics to improve the information security of your organization. You want to realize these benefits, and to understand their pitfalls. To successfully apply advanced analytics to the information security business problem, proper application of data management processes and techniques is of paramount importance. Based on professional services experience in implementing SAS^® Cybersecurity, this session teaches you about the data sources used, the activities involved in properly managing this data, and the means to which these processes address information security business problems. You will come to appreciate how using advanced analytics in the information security domain requires more than just the application of tools or modeling techniques. Using a data management regime for information security concerns can benefit your organization by providing insights into IT infrastructure, enabling successful data science activities, and providing greater resilience by way of improved information security investigations.

Read the paper (PDF)

The discipline of data science has seen an unprecedented evolution from primordial darkness to becoming the academic equivalent of an apex predator on university campuses across the country. But, survival of the discipline is not guaranteed. This session explores the genetic makeup of programs that are likely to survive, the genetic makeup of those that are likely to become extinct, and the role that the business community plays in that evolutionary process.

Read the paper (PDF)

Data validation plays a key role as an organization engages in a data governance initiative. Better data leads to better decisions. This applies to public schools as well as business entities. Each Local Educational Agency (LEA) in Pennsylvania reports children with disabilities to the Pennsylvania Department of Education (PDE) in compliance with IDEA (Individuals with Disabilities Education Act). PDE provides a Comparison Report to each LEA to assist in their data validation process. This Comparison Report provides counts of various categories for the most recent and previous year. LEAs use the Comparison Report to validate data submitted to PDE. This paper discusses how the Base SAS^® SORT procedure and MERGE statement extract hidden information behind the counts to assist LEAs in their data validation process.

Read the paper (PDF)

Microsoft SharePoint is a popular web application framework and platform that is widely used for content and document management by companies and organizations. Connecting SAS^® with SharePoint combines the power of these two into one. As a continuation of my SAS^® Global Forum Paper 11520-2016 titled Releasing the Power of SAS^® into Microsoft SharePoint, this paper expands on how to implement data visualization from SAS to SharePoint. This paper shows users how to use SAS/GRAPH^® software procedures, Output Delivery System (ODS), and emails to create and send visualization output files from SAS to SharePoint Document Library. Several SAS code examples are included to show how to create tables, bar charts (with PROC GCHART), line plots (with PROC SGPLOT) and maps (with PROC GMAP) from SAS to SharePoint. The paper also demonstrates how to create data visualization based on JavaScript by feeding SAS data into HTML pages on SharePoint. A couple of examples on how to export SAS data to JSON formats and create data visualization in SharePoint based on JavaScript are provided.

Read the paper (PDF)

We modeled an eight-level ordinal life insurance risk response on a pre-cleansed and pre-normalized Prudential data set. The data set consists of 59,381 observations and 128 predictors, of which 13 were continuous, 5 discrete, and the remainder categorical. The overall objective of the project was to develop a scoring formula to simplify the life insurance application process in order to encourage more customers to apply for, and therefore purchase, life insurance. Comparison of average square errors (ASEs), misclassification rates, lift, and relative parsimony led us to choose a 13-predictor logistic regression model from a pool of nine candidates. Although the model, in which Body Mass Index (BMI) figures prominently, is globally better than chance at classifying applicants, its misclassification error rates for response levels lower than the highest level (representing lowest insurance risk) are higher than 50 percent. The high error rates call for additional data, subject-matter expertise, and further work to refine the model.

Read the paper (PDF)

The SAS^® 9.4 SGPLOT procedure is a great tool for creating all types of graphs, from business graphs to complex clinical graphs. The goal for such graphs is to convey the data in a simple and direct manner with minimal distractions. But often, you need to grab the attention of a reader in the midst of a sea of data and graphs. For such cases, you need a visual that can stand out above the rest of the noise. Such visuals insert a decorative flavor into the graph to attract the eye of the reader and to encourage them to spend more time studying the visual. This presentation discusses how you can create such attention-grabbing visuals using the SGPLOT procedure.

Read the paper (PDF)

This paper presents considerations for deploying SAS^® Foundation across software-defined storage (SDS) infrastructures, and within virtualized storage environments. There are many new offerings on the market that offer easy, point-and-click creation of storage entities, with simplified management. Internal storage area network (SAN) virtualization also removes much of the hands-on management for defining storage device pools. Automated tier software further attempts to optimize data placement across performance tiers without manual intervention. Virtual storage provisioning and automated tier placement have many time-saving and management benefits. In some cases, they have also caused serious unintended performance issues with heavy large-block workloads, such as those found in SAS Foundation. You must follow best practices to get the benefit of these new technologies while still maintaining performance. For SDS infrastructures, this paper offers specific considerations for the performance of applications in SAS Foundation, workload management and segregation, replication, high availability, and disaster recovery. Architecture and performance ramifications and advice are offered for virtualized and tiered storage systems. General virtual storage pros and cons are also discussed in detail.

Read the paper (PDF)

The Analysis Data Model (ADaM) Basic Data Structure (BDS) can be used for many analysis needs. We all know that the SAS^® DATA step is a very flexible and powerful tool for data processing. In fact, the DATA step is very useful in the creation of a non-trivial BDS data set. This paper walks through a series of examples showing use of the SAS DATA step when deriving rows in BDS. These examples include creating new parameters, new time points, and changes from multiple baselines.

Read the paper (PDF)

As a report designer using SAS^® Visual Analytics, your goal is to create effective data visualizations that quickly communicate key information to report readers. But what makes a dashboard or report effective? How do you ensure that key points are understood quickly? One of the most common questions asked about SAS Visual Analytics is: what are the best practices for designing a report? Experts like Stephen Few and Edward Tufte have written extensively about successful visual design and data visualization. This paper focuses mainly on a different aspect of visual reports-the speed with which online reports render. In today's world, instant results are almost always expected. And the faster your report renders, the sooner decisions can be made and actions taken. Based on proven best practices and existing customer implementations, this paper focuses on server-side performance, client-side performance, and design performance. The end result is a set of design techniques that you can put into practice immediately and optimize your report performance.

Read the paper (PDF)

Detection and adjustment of structural breaks are an important step in modeling time series and panel data. In some cases, such as studying the impact of a new policy or an advertising campaign, structural break analysis might even be the main goal of a data analysis project. In other cases, the adjustment of structural breaks is a necessary step to achieve other analysis objectives, such as obtaining accurate forecasts and effective seasonal adjustment. Structural breaks can occur in a variety of ways during the course of a time series. For example, a series can have an abrupt change in its trend, its seasonal pattern, or its response to a regressor. The SSM procedure in SAS/ETS^® software provides a comprehensive set of tools for modeling different types of sequential data, including univariate and multivariate time series data and panel data. These tools include options for easy detection and adjustment of a wide variety of structural breaks. This paper shows how you can use the SSM procedure to detect and adjust structural breaks in many different modeling scenarios. Several real-world data sets are used in the examples. The paper also includes a brief review of the structural break detection facilities of other SAS/ETS procedures, such as the ARIMA, AUTOREG, and UCM procedures.

Read the paper (PDF)

This paper describes specific actions to be taken to increase the usability, data consistency, and performance of an advanced SAS^® Customer Intelligence solution for marketing and analytic purposes. In addition, the paper focuses on the establishment of a data governance program to support the processes that take place within this environment. This paper presents our experiences developing a data governance light program for the enterprise data warehouse and its sources as well as for the data marts created downstream to address analytic and campaign management purposes. The challenge was to design a data governance program for this system in 90 days.

Read the paper (PDF)

The ever growing volume of data challenges us to keep pace in ensuring that we use it to its full advantage. Unfortunately, often our response to new data sources, data types, and applications is somewhat reactionary. There exists a misperception that organizations have precious little time to consider a purposeful strategy without disrupting business continuity. Strategy is a phrase that is often misused and ill-defined. However, it is nothing more than a set of integrated choices that help position an initiative for future success. This presentation covers the key elements defining data strategy. The following key topics are included: What data should we keep or toss? How should we structure data (warehouse versus data lake versus real-time event streaming)? How do we store data (cloud, virtualization, federation, cloud, Hadoop)? What is the approach we use to integrate and cleanse data (ETL versus cognitive/ automated profiling)? How do we protect and share data? These topics ensure that the organization gets the most value from our data. They explore how we prioritize and adapt our strategy to meet unanticipated needs in the future. As with any strategy, we need to make sure that we have a roadmap or plan for execution, so we talk specifically about the tools, technologies, methods, and processes that are useful as we design a data strategy that is both relevant and actionable to your organization.

Read the paper (PDF)

Standard SAS^® Studio tasks already include many advanced analytic procedures for data mining and other high-performance models, enabling point-and-click generation and execution of SAS^® code. However, you can extend the power of tasks by creating tasks of your own to enable point-and-click access to the latest SAS statistical procedures, to your own default model definitions, or to your previously developed SAS/STAT^® or SAS macro code. Best of all, these point-and-click tasks can be developed directly in SAS Studio without the need to compile binaries or build DLL files using third-party software. In this paper, we demonstrate three approaches to developing custom tasks. First, we build a custom task to provide point-and-click access to PROC IRT, including recently added functionality to PROC IRT used to analyze educational test and opinion survey data. Second, we build a custom task that calls a macro for previously developed SAS code, and we show how point-and-click options can be set up to allow users to guide the execution of complex macro code. Third, we demonstrate just enough of the underlying Apache Velocity Template Language code to enable developers to take advantage of the benefits of that language to support their SAS process. Finally, we show how these tasks can easily be shared with a user community, increasing the efficiency of analytic modeling across the organization.

Read the paper (PDF)

For all healthcare systems, considerable attention and resources are directed at gauging and improving patient satisfaction. Dignity Health has made considerable efforts in improving most areas of patient satisfaction. However, improving metrics around physician interaction with patients has been challenging. Failure to improve these publicly reported scores can result in reimbursement penalties, damage to Dignity's brand and an increased risk of patient harm. One possible way to improve these scores is to better identify the physicians that present the best opportunity for positive change. Currently, the survey tool mandated by the Centers for Medicare and Medicaid Services (CMS), the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS), has three questions centered on patient experience with providers, specifically concerning listening, respect, and clarity of conversation. For purposes of relating patient satisfaction scores to physicians, Dignity Health has assigned scores based on the attending physician at discharge. By conducting a manual record review, it was determined that this method rarely corresponds to the manual review (PPV = 20.7%, 95% CI: 9.9% -38.4%). Using a variety of SAS^® tools and predictive modeling programs, we developed a logistic regression model that had better agreement with chart abstractors (PPV = 75.9%, 95% CI: 57.9% - 87.8%). By attributing providers based on this predictive model, opportunities for improvement can be more accurately targeted, resulting in improved patient satisfaction and outcomes while protecting fiscal health.

Read the paper (PDF)

Applying solutions for recommending products to final customers in e-commerce is already a known practice. Crossing consumer profile information with their behavior tends to generate results that are more than satisfactory for the business. Natura's challenge was to create the same type of solution for their sales representatives in the platform used for ordering. The sales representatives are not buying for their own consumption, but rather are ordering according to the demands of their customers. That is the difference, because in this case the analysts does not have information about the behavior or preferences of the final client. By creating a basket product concept for their sales representatives, Natura developed a new solution. Natura developed an algorithm using association analysis (Market Basket) and implemented this directly in the sales platform using SAS^® Real-Time Decision Manager. Measuring the results in indications conversion (products added in the requests), the amount brought in by the new solution was 53% higher than indications that used random suggestions, and 38% higher than those that used business rules.

Read the paper (PDF)

Hash objects have been supported in the DATA step and in the FCMP procedure for a while, but have you ever felt that hash objects could do a little more? For example, what if you needed to store more than doubles and character strings? Introducing PROC FCMP dictionaries. Dictionaries allow you to create references not only to numeric and character data, but they also give you fast in-memory hashing to arrays, other dictionaries, and even PROC FCMP hash objects. This paper gets you started using PROC FCMP dictionaries, describes usage syntax, and explores new programming patterns that are now available to your PROC FCMP programs, functions, and subroutines in the new SAS^® Viya platform environment.

Read the paper (PDF)

Until recently, psychometric analyses of test data within the Item Response Theory (IRT) framework were conducted using specialized, commercial software. However, with the inclusion of the IRT procedure in the suite of SAS^® statistical tools, SAS users can explore the psychometric properties of test items using modern test theory or IRT. Considering the item as the unit of analysis, the relationship between test items and the constructs they measure can be modeled as a function of an unobservable or latent variable. This latent variable or trait (for example, ability or proficiency), vary in the population. However, when examinees having the same trait level do not have the same probability to answering correctly or endorsing an item, we said that such an item might be functioning differently or exhibiting differential item functioning or DIF (Thissen, Steinberg, and Wainer, 2012). This study introduces the implementation of PROC IRT for conducting a DIF analysis for graded responses, using Samejima's graded response model (GRM; Samejima, 1969, 2010). The effectiveness of PROC IRT for evaluation of DIF items is assessed in terms of the Type I error and statistical power of the likelihood ratio test for testing DIF in graded responses.

In highly competitive markets, the response rates to economically reasonable marketing campaigns are as low as a few percentage points or less. In that case, the direct measure of the delta between the average key performance indicators (KPIs) of the treated and control groups is heavily 'contaminated' by non-responders. This paper focuses on measuring promotional marketing campaigns with two properties: (1) price discounts or other benefits, which are changing profitability of the targeted group for at least the promotion periods, and (2) impact of self-responders. The paper addresses the decomposition of the KPI measurement between responders and non-responders for both groups. Assuming that customers who rejected promotional offers will not change their behavior and that non-responders of both treated and control groups are not biased, the delta of the average KPIs for non-responders should be equal to zero. In practice, this component might be significantly deviated from zero. It might be caused by an initial nonzero delta of KPI values despite a random split between groups or by existence of outliers, especially for non-balanced campaigns. In order to address the deviation of the delta from zero, it might require running additional statistical tests comparing not just the means but the distributions of KPIs as well. The decomposition of the measurement between responders and non-responders for both groups can then be used in differential modeling.

Read the paper (PDF)

SAS^® has a very efficient and powerful way to get distances between an event and a customer. Using the tables and code located at http://support.sas.com/rnd/datavisualization/mapsonline/html/geocode.html#street, you can load latitude and longitude to addresses that you have for your events and customers. Once you have the tables downloaded from SAS, and you have run the code to get them into SAS data sets, this paper helps guide you through the rest using PROC GEOCODE and the GEODIST function. This can help you determine to whom to market an event. And, you can see how far a client is from one of your facilities.

Read the paper (PDF) | View the e-poster or slides (PDF)

Creating an effective style for your graphics can make the difference between clearly conveying your message to your audience and hiding your message in a sea of lines, markers, and text. A number of books explain the concepts of effective graphics, but you need an understanding of how styles work in your environment to correctly apply those principles. The goal of this paper is to give you an in-depth discussion of how styles are applied to Output Delivery System (ODS) graphics, from the ODS style level all the way down to the graph syntax. This discussion includes information about differences in grouped versus non-grouped plots, precedence order of style application, using style references, and much more. Don't forget your scuba gear!

Read the paper (PDF)

Are you prepared if a disaster happens? If your company relies on SAS^® applications to stay in business, you should have a Disaster Recovery Plan (DRP) in place. By a DRP, we mean documentation of the process to recover and protect your SAS infrastructure (SAS binaries, the operating system that is tuned to run your SAS applications, and all the pertinent data that the SAS applications require) in the event of a disaster. This paper discusses what needs to be in this plan to ensure that your SAS infrastructure not only works after it is recovered, but is able to be maintained on the recovery hardware infrastructure.

Read the paper (PDF)

Discover how to document your SAS^® programs, data sets, and catalogs with a few lines of code that include SAS functions, macro code, and SAS metadata. Do you start every project with the best of intentions to document all of your work, and then fall short of that aspiration when deadlines loom? Learn how SAS system macro variables can provide valuable information embedded in your programs, logs, lists, catalogs, data sets and ODS output; how your programs can automatically update a processing log; and how to generate two different types of codebooks.

Read the paper (PDF) | View the e-poster or slides (PDF)

This paper illustrates proper applications of multidimensional item response theory (MIRT), which is available in SAS^® PROC IRT. MIRT combines item response theory (IRT) modeling and factor analysis when the instrument carries two or more latent traits. Although it might seem convenient to accomplish two tasks simultaneously by using one procedure, users should be cautious of misinterpretations. This illustration uses the 2012 Program for International Student Assessment (PISA) data set collected by Organisation for Economic Co-operation and Development (OECD). Because there are two known sub-domains in the PISA test (reading and math), PROC IRT was programmed to adopt a two-factor solution. In additional, the loading plot, dual plot, item difficulty/discrimination plot, and test information function plot in JMP^® were used to examine the psychometric properties of the PISA test. When reading and math items were analyzed in SAS MIRT, seven to 10 latent factors are suggested. At first glance, these results are puzzling because ideally all items should be loaded into two factors. However, when the psychometric attributes yielded from a two-parameter IRT analysis are examined, it is evident that both the reading and math test items are well written. It is concluded that even if factor indeterminacy is present, it is advisable to evaluate its psychometric soundness based on IRT because content validity can supersede construct validity.

Read the paper (PDF) | View the e-poster or slides (PDF)

Learn neat new (and not so new) methods for joining text, graphs, and tables in a single document. This paper shows how you can create a report that includes all three with a single solution: SAS^®. The text portion is taken from a Microsoft Word document and joined with output from the GPLOT and REPORT procedures.

Read the paper (PDF)

Whether you have been programming in SAS^® for years, or you are new to it, or you have dabbled with SAS^® Enterprise Guide^® before, this hands-on workshop sheds some light on the depth, breadth, and power of the SAS Enterprise Guide environment. With all the demands on your time, you need powerful tools that are easy to learn and that deliver end-to-end support for your data exploration, reporting, and analytics needs. Included in this workshop are data exploration tools; formatting code (cleaning up after your coworkers); enhanced programming environment (and how to calm it down); easily creating reports and graphics; producing the output formats you need (XLS, PDF, RTF, HTML); workspace layout; and productivity tips. This workshop uses SAS Enterprise Guide 7.1, but most of the content is applicable to earlier versions.

Read the paper (PDF) | Download the data file (ZIP)

Some data is best visualized in a polar orientation, particularly when the data is directional or cyclical. Although the SG procedures and Graph Template Language (GTL) do not directly support polar coordinates, they are quite capable of drawing such graphs with a little bit of data processing. We demonstrate how to convert your data from polar coordinates to Cartesian coordinates and use the power of SG procedures to create graphs that retain the polar nature of your data. Stop going around in circles: let us show you the way out with SG procedures!

Read the paper (PDF)

Join this breakout session hosted by the Customer Value Management team from Saudi Telecommunications Company to understand the journey we took with SAS^® to evolve from simple below-the-line campaign communication to advanced customer value management (CVM). Learn how the team leveraged SAS tools ranging from SAS^® Enterprise Miner to SAS^® Customer Intelligence Suite in order to gain a deeper understanding of customers and move toward targeting customers with the right offer at the right time through the right channel.

Read the paper (PDF)

Customer feedback is a critical aspect of businesses in today's world as it is invaluable in determining what customers like and dislike about the business' service. This loop of regularly listening to customers' voice through survey comments and improving services based on it leads to better business, and more importantly, to an enhancement in customer experience. The challenge is to classify and analyze these unstructured text comments to gain insights and to focus on areas of improvement. The purpose of this paper is to illustrate how text mining in SAS^® Enterprise Miner 14.1 helped one of our clients a leading financial services company convert their customers problems into opportunities. The customers' feedback pertaining to their experience with an Interactive Voice Response (IVR) system is collected by an enterprise feedback management (EFM) company. The comments are then split into two groups, which helps us differentiate customer opinions. This grouping is based on customers who have given a rating of 0 6 and a rating of 9 10 on a Likert scale of 0 10 (10 being extremely satisfied) in the survey questionnaire. Text mining is performed on both these groups, and an algorithm creates clusters that are consequentially used to segment customers based on opinions they are interested in voicing. Furthermore, sentiment scores are calculated for each one of the segments. The scores classify the polarity of customer feedback and prioritizes the problems the client needs to focus on.

Read the paper (PDF)

Sometimes it might be beneficial to share a BI environment with multiple tenants within an enterprise, but at the same time this might also introduce additional complexity with regard to the administration of data access. In this breakout session, one possible setup is shown by sharing a high-level overview of such an environment within the ING bank in the Netherlands for the Risk Services organization.

Read the paper (PDF)

The Base SAS^® 9.4 Output Delivery System (ODS) EPUB destination enables users to deliver SAS^® reports as e-books on Apple mobile devices. ODS EPUB e-books are truly mobile you don't need an Internet connection to read them. Just install Apple's free iBooks app, and you're good to go. This paper shows you how to create an e-book with ODS EPUB and sideload it onto your Apple device. You will learn new SAS^® 9.4 techniques for including text, images, audio, and video in your ODS EPUB e-books. You will understand how to customize your e-book's table of contents (TOC) so that readers can easily navigate the e-book. And you will learn how to modify the ODS EPUB style to create specialized presentation effects. This paper provides beginning to intermediate instruction for writing e-books with ODS EPUB. Please bring your iPad, iPhone, or iPod to the presentation so that you can download and read the examples.

Read the paper (PDF)

Creating an environment that enables and empowers self-service and agile analytic capabilities requires a tremendous amount of working together and extensive agreements between IT and the business. Business and IT users are struggling to know what version of the data is valid, where they should get the data from, and how to combine and aggregate all the data sources to apply analytics and deliver results in a timely manner. All the while, IT is struggling to supply the business with more and more data that is becoming available through many different data sources such as the Internet, sensors, the Internet of Things, and others. In addition, once they start trying to join and aggregate all the different types of data, the manual coding can be very complicated and tedious, can demand extraneous resources and processing, and can negatively impact the overhead on the system. If IT enables agile analytics in a data lab, it can alleviate many of these issues, increase productivity, and deliver an effective self-service environment for all users. This self-service environment using SAS^® analytics in Teradata has decreased the time required to prepare the data and develop the statistical data model, and delivered faster results in minutes compared to days or even weeks. This session discusses how you can enable agile analytics in a data lab, leverage SAS analytics in Teradata to increase performance, and learn how hundreds of organizations have adopted this concept to deliver self-service capabilities in a streamlined process.

Randomized control trials have long been considered the gold standard for establishing causal treatment effects. Can causal effects be reasonably estimated from observational data too? In observational studies, you observe treatment T and outcome Y without controlling confounding variables that might explain the observed associations between T and Y. Estimating the causal effect of treatment T therefore requires adjustments that remove the effects of the confounding variables. The new CAUSALTRT (causal-treat) procedure in SAS/STAT^® 14.2 enables you to estimate the causal effect of a treatment decision by modeling either the treatment assignment T or the outcome Y, or both. Specifically, modeling the treatment leads to the inverse probability weighting methods, and modeling the outcome leads to the regression methods. Combined modeling of the treatment and outcome leads to doubly robust methods that can provide unbiased estimates for the treatment effect even if one of the models is misspecified. This paper reviews the statistical methods that are implemented in the CAUSALTRT procedure and includes examples of how you can use this procedure to estimate causal effects from observational data. This paper also illustrates some other important features of the CAUSALTRT procedure, including bootstrap resampling, covariate balance diagnostics, and statistical graphics.

Read the paper (PDF)

Pooling two or more cross-sectional survey data sets (such as stacking the data sets on top of one another) is a strategy often used by researchers for one of two purposes: (1) to more efficiently conduct significance tests on point estimate changes observed over time or (2) to increase the sample size in hopes of improving the precision of a point estimate. The latter purpose is especially common when making inferences on a subgroup, or domain, of the target population insufficiently represented by a single survey data set. Using data from the National Survey of Family Growth (NSFG), the aim of this paper is to walk through a series of practical estimation objectives that can be tackled by analyzing data from two or more pooled survey data sets. Where applicable, we comment on the resulting interpretive nuances.

Read the paper (PDF)

Student growth percentile (SGP) is one of the most widely used score metrics for measuring a student's academic growth. Using longitudinal data, SGP describes a student's growth as the relative standing among students who had a similar level of academic achievement in previous years. Although several models for SGP estimation have been introduced, and some models have been implemented with R, no studies have yet described using SAS^®. As a result, this research describes various types of SGP models and demonstrates how practitioners can use SAS procedures to fit these models. Specifically, this study covers three types of statistical models for SGP: 1) quantile regression-based model 2) conditional cumulative density function-based model 3) multidimensional item response theory-based model. Each of the three models partly uses procedures in SAS, such as PROC QUANTREG, PROC LOGISTIC, PROC TRANSREG, PROC IRT, or PROC MCMC, for its computation. The program code is illustrated using a simulated longitudinal data set over two consecutive years, which is generated by SAS/IML^®. In addition, the interpretation of the estimation results and the advantages and disadvantages of implementing these three approaches in SAS are discussed.

View the e-poster or slides (PDF)

Model validation is an important step in the model building process because it provides opportunities to assess the reliability of models before their deployment. Predictive accuracy measures the ability of the models to predict future risks, and significant developments have been made in recent years in the evaluation of survival models. SAS/STAT^® 14.2 includes updates to the PHREG procedure with a variety of techniques to calculate overall concordance statistics and time-dependent receiver operator characteristic (ROC) curves for right-censored data. This paper describes how to use these criteria to validate and compare fitted survival models and presents examples to illustrate these applications.

Read the paper (PDF)

Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time bachelor's degree seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS^® Visual Analytics, where interactive reports allowed users to easily identify which students were at a high risk for attrition or at risk of not graduating on time.

Read the paper (PDF)

The British Airways (BA) revenue management team is responsible for surfacing prices made available in the market with the objective of maximizing revenue from our 40,000,000 passenger journeys. BA is currently working to understand how competitor data can be exploited to help facilitate better decision making. Due to the low level of aggregation, competitor data is too large (and consequently too expensive) to store on conventional relational databases. Therefore, it has been stored on a small Hadoop installation at BA. Thanks to SAS/ACCESS^® Interface to Hadoop, we have been able to run our complex algorithms on these large data sets without changing the way we work and whilst exploiting the full capabilities of SAS^®.

Read the paper (PDF)

With an aim to improve rural healthcare, Oklahoma State University (OSU) Center for Health Systems Innovation (CHSI) conducted a study with primary care clinics (n=35) in rural Oklahoma to identify possible impediments to clinic workflows. The study entailed semi-structured personal interviews (n=241) and administered an online survey using an iPad (n=190). Respondents encompassed all consenting clinic constituents (physicians, nurses, practice managers, schedulers). Quantitative data from surveys revealed that electronic medical records (EMRs) are well accepted and contributed to increasing workflow efficiency. However, the qualitative data from interviews reveals that there are IT-related barriers like Internet connectivity, hardware problems, and inefficiencies in information systems. Interview responses identified six IT-related response categories (computer, connectivity, EMR-related, fax, paperwork, and phone calls) that routinely affect clinic workflow. These categories together account for more than 50% of all the routine workflow-related problems faced by the clinics. Text mining was performed on transcribed Interviews using SAS^® Text Miner to validate these six categories and to further identify concept linking for a quantifiable insight. Two variables (Redundancy Reduction and Idle Time Generation) were derived from survey questions with low scores of -129 and -64 respectively out of 384. Finally, ANOVA was run using SAS^® Enterprise Guide^® 6.1 to determine whether the six qualitative categories affect the two quantitative variables differently.

Read the paper (PDF)

Traditional analytical modeling, with roots in statistical techniques, works best on structured data. Structured data enables you to impose certain standards and formats in which to store the data values. For example, a variable indicating gas mileage in miles per gallon should always be a number (for example, 25). However, with unstructured data analysis, the free-form text no longer limits you to expressing this information in only one way (25 mpg, twenty-five mpg, and 25M/G). The nuances of language, context, and subjectivity of text make it more complex to fit generalized models. Although statistical methods using supervised learning prove efficient and effective in some cases, sometimes you need a different approach. These situations are when rule-based models with Natural Language Processing capabilities can add significant value. In what context would you choose a rule-based modeling versus a statistical approach? How do you assess the tradeoffs of choosing a rule-based modeling approach with higher interpretability versus a statistical model that is black-box in nature? How can we develop rule-based models that optimize model performance without compromising accuracy? How can we design, construct, and maintain a complex rule-based model? What is a data-driven approach to rule writing? What are the common pitfalls to avoid? In this paper, we discuss all these questions based on our experiences working with SAS^® Contextual Analysis and SAS^® Sentiment Analysis.

Read the paper (PDF)

The DOCUMENT procedure is a little known procedure that can save you vast amounts of time and effort when managing the output of your SAS^® programming efforts. This procedure is deeply associated with the mechanism by which SAS controls output in the Output Delivery System (ODS). Have you ever wished you didn't have to modify and rerun the report-generating program every time there was some tweak in the desired report? PROC DOCUMENT enables you to store one version of the report as an ODS Document Object and then call it out in many different output forms, such as PDF, HTML, listing, RTF, and so on, without rerunning the code. Have you ever wished you could extract those pages of the output that apply to certain BY variables such as State, StudentName, or CarModel? With PROC DOCUMENT, you have where capabilities to extract these. Do you want to customize the table of contents that assorted SAS procedures produce when you make frames for the table of contents with HTML, or use the facilities available for PDF? PROC DOCUMENT enables you to get to the inner workings of ODS and manipulate them. This paper addresses PROC DOCUMENT from the viewpoint of end results, rather than provide a complete technical review of how to do the task at hand. The emphasis is on the benefits of using the procedure, not on detailed mechanics.

Read the paper (PDF) | View the e-poster or slides (PDF)

Factorization machines are a new type of model that is well suited to very high-cardinality, sparsely observed transactional data. This paper presents the new FACTMAC procedure, which implements factorization machines in SAS^® Visual Data Mining and Machine Learning. This powerful and flexible model can be thought of as a low-rank approximation of a matrix or a tensor, and it can be efficiently estimated when most of the elements of that matrix or tensor are unknown. Thanks to a highly parallel stochastic gradient descent optimization solver, PROC FACTMAC can quickly handle data sets that contain tens of millions of rows. The paper includes examples that show you how to use PROC FACTMAC to recommend movies to users based on tens of millions of past ratings, predict whether fine food will be highly rated by connoisseurs, restore heavily damaged high-resolution images, and discover shot styles that best fit individual basketball players. ^®

Read the paper (PDF)

Implementation of state transition models for loan-level portfolio evaluation was an arduous task until now. Several features have been added to the SAS^® High-Performance Risk engine that greatly enhance the ability of users to implement and execute these complex, loan-level models. These new features include model methods, model groups, and transition matrix functions. These features eliminate unnecessary and redundant calculations; enable the user to seamlessly interconnect systems of models; and automatically handle the bulk of the process logic in model implementation that users would otherwise need to code themselves. These added features reduce both the time and effort needed to set up model implementation processes, as well as significantly reduce model run time. This paper describes these new features in detail. In addition, we show how these powerful models can be easily implemented by using SAS^® Model Implementation Platform with SAS^® 9.4. This implementation can help many financial institutions take a huge leap forward in their modeling capabilities.

Read the paper (PDF)

Credit card fraud. Loan fraud. Online banking fraud. Money laundering. Terrorism financing. Identity theft. The strains that modern criminals are placing on financial and government institutions demands new approaches to detecting and fighting crime. Traditional methods of analyzing large data sets on a periodic, batch basis are no longer sufficient. SAS^® Event Stream Processing provides a framework and run-time architecture for building and deploying analytical models that run continuously on streams of incoming data, which can come from virtually any source: message queues, databases, files, TCP\IP sockets, and so on. SAS^® Visual Scenario Designer is a powerful tool for developing, testing, and deploying aggregations, models, and rule sets that run in the SAS^® Event Stream Processing Engine. This session explores the technology architecture, data flow, tools, and methodologies that are required to build a solution based on SAS Visual Scenario Designer that enables organizations to fight crime in real time.

Read the paper (PDF)

Finding daylight saving time (DST) is a common task for manipulating time series data. The date of daylight saving time changes every year. If SAS^® programmers depend on manually entering the value of daylight saving time in their programs, the maintenance of the program becomes tedious. Using a SAS function can make finding the value easy. This paper discusses several ways to capture and use daylight saving time.

Read the paper (PDF)

U.S. stock exchanges (currently there are 12) are tracked in real time via the Consolidated Trade System (CTS) and the Consolidated Quote System (CQS). CQS contains every updated quote from each of these exchanges, covering some 8,500 stock tickers. It provides the basis by which brokers can honor their fiduciary obligation to investors to execute transactions at the best price, that is, at the National Best Bid or Best Offer (NBBO). With the advent of electronic exchanges and high-frequency trading (timestamps are published to the nanosecond), data set size (approaching 1 billion quotes requiring 80 gigabytes of storage for a normal trading day) has become a major operational consideration for market behavior researchers re-creating NBBO values from quotes. This presentation demonstrates a straightforward use of hash tables for tracking constantly changing quotes for each ticker/exchange combination to provide the NBBO for each ticker at each time point in the trading day.

This paper discusses format enumeration (via the DICTIONARY.FORMATS view) and the new FMTINFO function that gives information about a format, such as whether it is a date or currency format.

Read the paper (PDF)

SAS/STAT^® software has several procedures that estimate parameters from generalized linear models designed for both continuous and discrete response data (including proportions and counts). Procedures such as LOGISTIC, GENMOD, GLIMMIX, and FMM, among others, offer a flexible range of analysis options to work with data from a variety of distributions and also with correlated or clustered data. SAS^® procedures can also model zero-inflated and truncated distributions. This paper demonstrates how statements from PROC NLMIXED can be written to match the output results from these procedures, including the LS-means. Situations arise where the flexible programming statements of PROC NLMIXED are needed for other situations such as zero-inflated or hurdle models, truncated counts, or proportions (including legitimate zeros) that have random effects, and also for probability distributions not available elsewhere. A useful application of these coding techniques is that programming statements from NLMIXED can often be directly transferred into PROC MCMC with little or no modification to perform analyses from a Bayesian perspective with these various types of complex models.

Read the paper (PDF)

Cumulative logistic regression models are used to predict an ordinal response. They have the assumption of proportional odds. Proportional odds means that the coefficients for each predictor category must be consistent or have parallel slopes across all levels of the response. This paper uses a sample data set to demonstrate how to test the proportional odds assumption. It shows how to use the UNEQUALSLOPES option when the assumption is violated. A cumulative logistic regression model is built, and then the performance of the model on a test set is compared to the performance of a generalized multinomial model. This shows the utility and necessity of the UNEQUALSLOPES option when building a cumulative logistic regression model. The procedures shown are produced using SAS^® Enterprise Guide^® 7.1.

Read the paper (PDF)

Longitudinal count data arise when a subject's outcomes are measured repeatedly over time. Repeated measures count data have an inherent within subject correlation that is commonly modeled with random effects in the standard Poisson regression. A Poisson regression model with random effects is easily fit in SAS^® using existing options in the NLMIXED procedure. This model allows for overdispersion via the nature of the repeated measures; however, departures from equidispersion can also exist due to the underlying count process mechanism. We present an extension of the cross-sectional COM-Poisson (CMP) regression model established by Sellers and Shmueli (2010) (a generalized regression model for count data in light of inherent data dispersion) to incorporate random effects for analysis of longitudinal count data. We detail how to fit the CMP longitudinal model via a user-defined log-likelihood function in PROC NLMIXED. We demonstrate the model flexibility of the CMP longitudinal model via simulated and real data examples.

Read the paper (PDF)

Do you ever wonder how to create a report with weighted averages, or one that displays the last day of the month by default? Do you want to take advantage of the one-click relative-time calculations available in SAS^® Visual Analytics, or learn a few other creative ways to enhance your report? If your answer is yes, then this paper is for you. We not only teach you some new tricks, but the techniques covered here will also help you expand the way you think about SAS Visual Analytics the next time you are challenged to create a report.

Read the paper (PDF)

The increasing complexity of data in research and business analytics requires versatile, robust, and scalable methods of building explanatory and predictive statistical models. Quantile regression meets these requirements by fitting conditional quantiles of the response with a general linear model that assumes no parametric form for the conditional distribution of the response; it gives you information that you would not obtain directly from standard regression methods. Quantile regression yields valuable insights in applications such as risk management, where answers to important questions lie in modeling the tails of the conditional distribution. Furthermore, quantile regression is capable of modeling the entire conditional distribution; this is essential for applications such as ranking the performance of students on standardized exams. This expository paper explains the concepts and benefits of quantile regression, and it introduces you to the appropriate procedures in SAS/STAT^® software.

Read the paper (PDF)

The macro language is both powerful and flexible. With this power, however, comes complexity, and this complexity often makes the language more difficult to learn and use. Fortunately, one of the key elements of the macro language is its use of macro variables, and these are easy to learn and easy to use. You can create macro variables using a number of different techniques and statements. However, the five most commonly methods are not only the most useful, but also among the easiest to master. Since macro variables are used in so many ways within the macro language, learning how they are created can also serve as an excellent introduction to the language itself. These methods include: 1) the %LET statement; 2) macro parameters (named and positional); 3) the iterative %DO statement; 4) using the INTO clause in PROC SQL; and 5) using the CALL SYMPUTX routine.

Read the paper (PDF) | Download the data file (ZIP)

In fantasy football, it is a relatively common strategy to rotate which team's defense a player uses based on some combination of favorable/unfavorable player matchups, recent performance, and projection of expected points. However, there is danger in this strategy because defensive scoring volatility is high, and any team has the possibility of turning in a statistically bad performance in a given week. This paper uses data mining techniques to identify which National Football League (NFL) teams give up high numbers of defensive penalties on a week-to-week basis, and to what degree those high-penalty games correlate with poor team defensive fantasy scores. Examining penalty count and penalty yards allowed totals, we can narrow down which teams are consistently hurt by poor technique and find correlation between games with high penalty totals to their respective fantasy football score. By doing so, we seek to find which teams should be avoided in fantasy football due to their likelihood of poor performance.

Objective: To assess the risk of dying in a severe traffic accident (where at least one death occurred) among slow drivers, defined as those driving _ 15% below traffic flow. Methods: Records of severe traffic accidents were acquired from Fatality Analysis Reporting System (FARS); interstate and US-highway traffic flow speeds in California were acquired from the California Department of Transportation (Caltrans). Each accident involving at least two vehicles was matched to the nearest available speed monitoring station to assess how slow or fast the vehicles were relative to traffic flow. The outcome was whether the driver died in the accident. To control for external confounders such as weather and road conditions, a conditional logistic regression model was used to stratify the vehicles by accidents. Covariates of interests included those describing the drivers, the vehicles, and the accidents. Results: In the final multivariate model, slow drivers were a significant predictor of death in severe accidents when compared to drivers traveling at traffic flow (OR = 2.41, 95% CI: 1.39, 4.18), after adjusting for vehicle type, extent of vehicle damage, and alcohol use. Conclusion: Slow driving speed puts the driver at higher risk of dying in a severe traffic accident than those driving at traffic flow.

Read the paper (PDF)

Formats can be used for more than just making your data look nice. They can be used in memory lookup tables and to help you create data-driven code. This paper shows you how to build a format from a data set, how to write a format out as a data set, and how to use formats to make programs data driven. Examples are provided.

Read the paper (PDF)

Higher education institutions have a plethora of analytical needs. However, the irregular and inconsistent practices in connecting those needs with appropriate analytical delivery systems have resulted in a patchwork this patchwork sometimes overlaps unnecessarily and sometimes exposes unaddressed gaps. The purpose of this paper is to examine a framework of components for addressing institutional analytical needs, while leveraging existing institutional strengths to maximize analytical goal attainment most effectively and efficiently. The core of this paper is a focused review of components for attaining greater analytical strength and goal attainment in the institution.

Read the paper (PDF)

Innovation in teaching and assessment has become critical for many reasons. This is especially true in the fields of data science and big data analytics. Reasons range from the need to significantly improve the development of soft skills (as reported in an e-skills UK and SAS^® joint report from November 2014), to the rapidly changing software standards of products used by students, to the rapidly increasing range of functionality and product set, to the need to develop lifelong learning skills to learn new software and functionality. And, this is just a few of the reasons. In some educational institutions, it is easy to be extremely innovative. However, in many institutions and countries, there are numerous constraints on the levels of innovation that can be implemented. This presentation captures the author's developing pedagogic practice at the University of Derby. He suggests fundamental changes to the classic approaches to teaching and assessing data science and big data analytics. These changes have resulted in significant improvement in student engagement and achievements and students soft skills. Improvements are illustrated by innovations in teaching SAS to first-year students and teaching IBM Bluemix and Watson Analytics to final-year students. Students have successfully developed both technical and soft skills and experienced excellent levels of achievement.

Read the paper (PDF)

SAS^® Environment Manager is the predominant tool for managing your SAS^® environment. Its popularity is increasing quickly as evidenced by the increased technical support requests from our customers. This paper identifies the most frequently asked questions from customers by reviewing the support work completed by the development and technical support teams over the last few years. The questions range across topics such as web interface usage; alerts, controls, and resource discovery; Agent issues; and security issues. Questions discussed in the paper include: What resources need to be configured after we install SAS Environment Manager? What Control Actions are available, what is their purpose, and when do I use them? Why does SAS Environment Manager show all resources as (!) (Down)? What is the best way to enable an alert for a resource? How do I configure HTTPs? Can we configure the Agents with certificates other than the default? What is the combination of roles needed to see the Resources Tab? This paper presents detailed answers to the questions and also points out where you can find more information. We believe that by understanding these answers, SAS^® administrators will be more knowledgeable about SAS Environment Manager, and can better implement and manage their SAS environment.

Read the paper (PDF)

Are you a marketing analyst who speaks SAS^®? Congratulations, you are in high demand! Or are you? Marketing analysts with programming skills are critical today. The ability to extract large volumes of data, massage it into a manageable format, and display it simply are necessary skills in the world of big data. However, programming skills are not nearly enough. In fact, some marketing managers are putting less and less weight on them and are focusing more on the softer skills that they require. This session will help ensure that you are not left out. In this session, Emma Warrillow shares why being a good programmer is only the beginning. She provide practical tips on moving from being a someone who is good at coding to becoming a true collaborator with marketing taking your marketing analytics to the next level. In 2016, Emma Warrillow's presentation at SAS^® Global Forum was very well received (http://blogs.sas.com/content/sgf/2016/04/21/always-be-yourself-unless-you-can-be-a-unicorn/). In this follow-up, she revisits some of the highlights from 2016 and shares some new ideas. You can be sure of an engaging code-free session!

Read the paper (PDF)

In the quest for valuable analytics, access to business data through message queues provides near real-time access to the entire data life cycle. This in turn enables our analytical models to perform accurately. What does the item a user temporarily put in the shopping basket indicate, and what can be done to motivate the user? How do you recover the user who has now unsubscribed, given that the user had previously unsubscribed and re-subscribed quickly? User behavior can be captured completely and efficiently using a message queue, which causes minimal load on production systems and allows for distributed environments. There are some technical issues encountered when attempting to populate a data warehouse using events from a message queue. The presentation outlines a solution to the following issues: the message queue connection, how to ensure that messages aren't lost in transit, and how to efficiently process messages with SAS^®; message definition and metadata, and how to react to changes in message structure; data architecture and which data architecture is appropriate for storing message data and other business data; late arrival of messages and how late arriving data can be loaded into slowly changing dimensions; and analytical processing and how transactional message data can be reformatted for analytical modeling. Ultimately, populating a data warehouse with message queue data can require less development than accessing source databases; however a robust architecture

Read the paper (PDF)

Having crossed the spectrum from an epidemiologist and researcher (where ad hoc is a way of life and where research is the main focus) to a SAS^® programmer (writing reusable code for automation and batch jobs, which require no manual interventions), I have learned a few things that I wish I had known as a researcher. These things would not only have helped me to be a better SAS programmer, but they also would have saved me time and effort as a researcher by enabling me to have well-organized, accurate code (that I didn't accidentally remove) and code that would work when I ran it again on another date. This poster presents five SAS tips that are common practice among SAS programmers. I provide researchers who use SAS with tips that are handy and useful, and I provide code (where applicable) that they can try out at home. Using the tips provided will make any SAS programmer smile when they are presented with your code (not guaranteed, but your results should not vary by using these tips).

View the e-poster or slides (PDF)

This paper demonstrates how to use the capabilities of SAS^® Data Integration Studio to extract, load, and transform your data within a Hadoop environment. Which transformations can be used in each layer of the ELT process is illustrated using a sample use case, and the functionality of each is described. The use case steps through the process from source to target.

Read the paper (PDF)

Tracking gains or losses from the purchase and sale of diverse equity holdings depends in part on whether stocks sold are assumed to be from the earliest lots acquired (a first-in, first-out queue, or FIFO queue) or the latest lots acquired (a last-in, first-out queue, or LIFO queue). Other inventory tracking applications have a similar need for application of either FIFO or LIFO rules. This presentation shows how a collection of simple ordered hash objects, in combination with a hash-of-hashes, is a made-to-order technique for easy data-step implementation of FIFO, LIFO, and other less likely rules (for example, HIFO [highest-in, first-out] and LOFO [lowest-in, first-out]).

Read the paper (PDF)

The acquisition of new customers is fundamental to the success of every business. Data science methods can greatly improve the effectiveness of acquiring prospective customers and can contribute to the profitability of business operations. In a business-to-business (B2B) setting, a predictive model might target business prospects as individual firms listed by, for example, Dunn & Bradstreet. A typical acquisition model can be defined using a binary response with the values categorizing a firm's customers and non-customers, for which it is then necessary to identify which of the prospects are actually customers of the firm. The methods of fuzzy logic, for example, based on the distance between strings, might help in matching customers' names and addresses with the overall universe of prospects. However, two errors can occur: false positives (when the prospect is incorrectly classified as a firm's customer), and false negatives (when the prospect is incorrectly classified as a non-customer). In the current practice of building acquisition models, these errors are typically ignored. In this presentation, we assess how these errors affect the performance of the predictive model as measured by a lift. In order to improve the model's performance, we suggest using a pre-determined sample of correct matches and to calibrate its predicted probabilities based on actual take-up rates. The presentation is illustrated with real B2B data and includes elements of SAS^® code that was used in the research.

Read the paper (PDF)

The analysis of longitudinal data requires a model that correctly accounts for both the inherent correlation amongst the responses as a result of the repeated measurements, as well as the feedback between the responses and predictors at different time points. Lalonde, Wilson, and Yin (2013) developed an approach based on generalized method of moments (GMM) for identifying and using valid moment conditions to account for time-dependent covariates in longitudinal data with binary outcomes. However, the model developed using this approach does not provide information about the specific relationships that exist across time points. We present a SAS^® macro that extends the work of Lalonde, Wilson, and Yin by using valid moment conditions to estimate and evaluate the relationships between the response and predictors at different time periods. The performance of this method is compared to previously established results.

Read the paper (PDF)

An important component of insurance pricing is the insured location and the associated riskiness of that location. Recently, we have experienced a large increase in the availability of external risk classification variables and associated risk factors by geospatial location. As additional geospatial data becomes available, it is prudent for insurers to take advantage of the new information to better match price to risk. Generalized additive models using penalized likelihood (GAMPL) have been explored as a way to incorporate new location-based information. This type of model can leverage the new geospatial information and incorporate it with traditional insurance rating variables in a regression-based model for rating. In our method, we propose a local regression model in conjunction with our GAMPL model. Our discussion demonstrates the use of the LOESS procedure as well as the GAMPL procedure in a combined solution. Both procedures are in SAS/STAT^® software. We discuss in detail how we built a local regression model and used the predictions from this model as an offset into a generalized additive model. We compare the results of the combined approach to results of each model individually.

Read the paper (PDF)

The mean-variance model might be the most famous model in the financial field. It can determine the optimal portfolio if you know every asset's expected return and its covariance matrix. The tangency portfolio is a type of optimal portfolio, which means that it has the maximum expected return (mean) and the minimial risk (variance) among all portfolios. This paper uses sample data to get the tangency portfolio using SAS/IML^® code.

Read the paper (PDF) | View the e-poster or slides (PDF)

When creating statistical models that include multiple covariates (for example, Cox proportional hazards models or multiple linear regression), it is important to address which variables are categorical and continuous for proper analysis and interpretation in SAS^®. Categorical variables, regardless of SAS data type, should be added in the MODEL statement with an additional CLASS statement. In larger models containing many continuous or categorical variables, it is easy to overlook variables that should be added to the CLASS statement. To solve this problem, we have created a macro that uses simple input from the model variables, with PROC CONTENTS and additional logic checks, to create the necessary CLASS statement and to run the desired model. With this macro, variables are evaluated on multiple conditions to see whether they should be considered class variables. Then, they are added automatically to the CLASS statement.

Read the paper (PDF) | View the e-poster or slides (PDF)

The presentation will give a brief introduction to Bayesian Analysis within SAS. Participants will learn the difference between Bayesian and Classical Statistics and be introduced to PROC MCMC.

SAS^® has been installed at your organization now what? How do you approach configuring groups, roles, folders, and permissions in your environment? This presentation is built on best practices used within the U.S. SAS^® Professional Services and Delivery division and aims to equip new and seasoned SAS administrators with the knowledge and tools necessary to design and implement a SAS metadata and file system security model. We start by covering the basic building blocks of the SAS^® Intelligence Platform metadata and security framework. We discuss the SAS metadata architecture, and highlight the differences between groups and roles, permissions and capabilities, access control entries and access control templates, and what content can be stored within metadata folders versus in file system folders. We review the various authorization layers in a SAS deployment that must work together to create a secure environment, including the metadata layer, the file system, and the data layer. Then, we present a 10-step best practice approach for how to design your SAS metadata security model. We provide an introduction to basic metadata security design and file system security design templates that have been used extensively by SAS Professional Services and Delivery in helping customers secure their SAS environments.

Read the paper (PDF)

Machine Learning algorithms have been available in SAS software since 1979. This session provides practical examples of machine learning applications. The evolution of machine learning at SAS is illustrated with examples of nearest-neighbor discriminant analysis in SAS/STAT PROC DISCRIM to advanced predictive modeling in SAS Enterprise Miner. Machine learning techniques addressed include memory based reasoning, decision trees, neural networks, and gradient boosting algorithms.

In this presentation you will learn the basics of working with nested data, such as students within classes, customers within households, or patients within clinics through the use of multilevel models. Multilevel models can accommodate correlation among nested units through random intercepts and slopes, and generalize easily to 2, 3, or more levels of nesting. These models represent a statistically efficient and powerful way to test your key hypotheses while accounting for the hierarchical nesting of the design. The GLIMMIX procedure is used to demonstrate analyses in SAS.

Allowing SAS^® users to leverage SAS prompts when running programs is a very powerful tool. Using SAS prompts makes it easier for SAS users to submit parameter-driven programs and for developers to create robust, data-driven programs. This presentation demonstrates how to create SAS prompts from SAS^® Enterprise Guide^® and shows how to roll them out to users so that they can take advantage of them from SAS Enterprise Guide, the SAS^® Add-In for Microsoft Office, and the SAS^® Stored Process Web Application.

Read the paper (PDF)

Getting Started with ARIMA Models will introduce the basic features of time series variation, and the model components used to accommodate them; stationary (ARMA), trend and seasonal (the 'I' in ARIMA) and exogenous (input variable related). The Identify, Estimate and Forecast framework for building ARIMA models is illustrated with two demonstrations.

SAS^® 9.4 provides three ways to upgrade: upgrade in place, automated migration with the SAS^® Migration Utility, and partial promotion. This session focuses primarily on the different techniques and best practices for each. We also discuss the pros and cons of using the SAS Migration Utility and what is required for migrating users' content like projects, data, and code.

Read the paper (PDF)

When you look at examples of the REPORT procedure, you see code that tests _BREAK_ and _RBREAK_, but you wonder what s the breakdown of the COMPUTE block? And, sometimes, you need more than one break line on a report, or you need a customized or adjusted number at the break. Everything in PROC REPORT that is advanced seems to involve a COMPUTE block. This paper provides examples of advanced PROC REPORT output that uses _BREAK_ and _RBREAK_ to customize the extra break lines that you can request with PROC REPORT. Examples include how to get custom percentages with PROC REPORT, how to get multiple break lines at the bottom of the report, how to customize break lines, and how to customize LINE statement output. This presentation is aimed at the intermediate to advanced report writer who knows some about PROC REPORT, but wants to get the breakdown of how to do more with PROC REPORT and the COMPUTE block.

Read the paper (PDF) | Download the data file (ZIP)

This paper shows how you can reduce the computing footprint of your SAS^® applications without compromising your end products. The paper presents the 15 axioms of going green with your SAS applications. The axioms are proven, real-world techniques for reducing the computer resources used by your SAS programs. When you follow these axioms, your programs run faster, use less network bandwidth, use fewer desktop or shared server computing resources, and create more compact SAS data sets.

Read the paper (PDF)

Many SAS^® users are working across multiple platforms, commonly combining Microsoft Windows and UNIX environments. Often, SAS code developed on one platform (for example, on a PC) might not work on another platform (for example, on UNIX). Portability is not just working across multi-platform environments; it is also about making programs easier to use across projects, across companies, or across clients and vendors. This paper examines some good programming practices to address common issues that occur when you work across SAS on a PC and SAS on UNIX. They include: 1) avoid explicitly defining file paths in LIBNAME, filename, and %include statements that require platform-specific syntax such as forward slash (in UNIX) or backslash (in PC SAS); 2) avoid using X commands in SAS code to execute statements on the operating system, which works only on Windows but not on UNIX; 3) use the appropriate SAS rounding function for numeric variables to avoid different results when dealing with 64-bit operating systems and 32-bit systems. The difference between rounding before or after calculations and derivations is discussed; 4) develop portable SAS code to import or export Microsoft Excel spreadsheets across PC SAS and UNIX SAS, especially when dealing with multiple worksheets within one Excel file; and 5) use SAS^® Enterprise Guide^® to access and run PC SAS programs in UNIX effectively.

Read the paper (PDF)

Because many SAS^® users either work for or own companies that house big data, the threat that malicious software poses becomes even more extreme. Malicious software, often abbreviated as malware, includes many different classifications, ways of infection, and methods of attack. This E-Poster highlights the types of malware, detection strategies, and removal methods. It provides guidelines to secure essential assets and prevent future malware breaches.

Read the paper (PDF) | View the e-poster or slides (PDF)

Would you like to be more confident in producing graphs and figures? Do you understand the differences between the OVERLAY, GRIDDED, LATTICE, DATAPANEL, and DATALATTICE layouts? Finally, would you like to learn the fundamental Graph Template Language methods in a relaxed environment that fosters questions? Great this topic is for you! In this hands-on workshop, you are guided through the fundamental aspects of the GTL procedure, and you can try fun and challenging SAS^® graphics exercises to enable you to more easily retain what you have learned.

Read the paper (PDF) | Download the data file (ZIP)

In this course you will learn how to access and manage SAS and Excel data in SAS^® Viya .

This workshop provides hands-on experience with using SAS Enterprise Miner. Workshop participants will learn to do the following: open a project; create and explore a data source; build and compare models; and produce and examine score code that can be used for deployment.

This hands-on-workshop explores the power of SAS Macro, a text substitution facility for extending and customizing SAS programs. Examples will range from simple macro variables to advanced macro programs. As a participant, you will add macro syntax to existing programs to dynamically enhance your programming experience.

This workshop provides hands-on experience with some basic functionality of SAS Data Loader for Hadoop. You will learn how to: Copy Data to Hadoop Profile Data in Hadoop Cleanse Data in Hadoop

This workshop provides hands-on experience with SAS^® Studio. Workshop participants will use SAS's new web-based interface to access data, write SAS programs, and generate SAS code through predefined tasks. This workshop is intended for SAS programmers from all experience levels.

This workshop provides hands-on experience with SAS Viya Data Mining and Machine Learning through the programming interface to SAS Viya. Workshop participants will learn how to start and stop a CAS session; move data into CAS; prepare data for machine learning; use SAS Studio tasks for supervised learning; and evaluate the results of analyses.

This workshop provides hands-on experience performing statistical analysis with the Statistics tasks in SAS Studio. Workshop participants will learn to perform statistical analyses using tasks, evaluate which tasks are ideal for different kinds of analyses, edit the generated code, and customize a task.

This workshop provides hands-on experience using SAS^® Text Miner. For a collection of documents, workshop participants will learn how to: read and convert documents for use by SAS Text Miner; retrieve information from the collection using query features of the software; identify the dominant themes and concepts in the collection; and classify documents having pre-assigned categories.

Do you need to add annotations to your graphs? Do you need to specify your own colors on the graph? Would you like to add Unicode characters to your graph, or would you like to create templates that can also be used by non-programmers to produce the required figures? Great, then this topic is for you! In this hands-on workshop, you are guided through the more advanced features of the GTL procedure. There are also fun and challenging SAS^® graphics exercises to enable you to more easily retain what you have learned.

Read the paper (PDF) | Download the data file (ZIP)

When modeling time series data, we often use a LAG of the dependent variable. The LAG function works great for this, until you try to predict out into the future and need the model's predicted value from one record as an independent value for a future record. This paper examines exactly how the LAG function works, and explains why it doesn't in this case. It also explains how to create a hash object that will accomplish a LAG of any value, how to load the initial data, how to add predicted values to the hash object, and how to extract those values when needed as an independent variable for future observations.

Read the paper (PDF)

Data processing can sometimes require complex logic to match and rank record associations across events. This paper presents an efficient solution to generating these complex associations using the DATA step and data hash objects. The solution applies to multiple business needs including subsequent purchases, repayment of loan advance, or hospital readmits. The logic demonstrates how to construct a hash process to identify a qualifying initial event and append linking information with various rank and analysis factors, through the example of a specific use case of the process.

Read the paper (PDF) | Download the data file (ZIP)

Secondary use of administrative claims data, EHRs and EMRs, registry data, and other data sources within the health data ecosystem provide rich opportunity and potential to study topics ranging from public health surveillance to comparative effectiveness research. Data sourced from individual sites can be limited in their scope, coverage, and statistical power. Sharing and pooling data from multiple sites and sources, however, present administrative, governance, analytic, and patient-privacy challenges. Distributed data networks represent a paradigm shift in health-care data sharing. They have evolved at a critical time when big data and patient privacy are often competing priorities. A distributed data network is one that has no central repository of data. Data reside behind the firewall of each data-contributing partner in a network. Each partner transforms its source data in accordance with a common data model and allows indirect access to data through a standard query approach using flexibly designed informatics tools. This presentation discusses how distributed data networks have matured to make important contributions to the health-care data ecosystem and the evolving Learning Healthcare System. The presentation focuses on 1) the distributed data network and its purpose, concept, guiding principles, and benefits. 2) Common data models and their concepts, designs, and benefits. 3) Analytic tool development and its design and implementation considerations. 4) Analytic chal

Read the paper (PDF)

Heat maps use colors to communicate numeric data by varying the underlying values that represent red, green, and blue (RGB) as a linear function of the data. You can use heat maps to display spatial data, plot big data sets, and enhance tables. You can use colors on the spectrum from blue to red to show population density in a US map. In fields such as epidemiology and sociology, colors and maps are used to show spatial data, such as how rates of disease or crime vary with location. With big data sets, patterns that you would hope to see in scatter plots are hidden in dense clouds of points. In contrast, patterns in heat maps are clear, because colors are used to display the frequency of observations in each cell of the graph. Heat maps also make tables easier to interpret. For example, when displaying a correlation matrix, you can vary the background color from white to red to correspond to the absolute correlation range from 0 to 1. You can shade the cell behind a value, or you can replace the table with a shaded grid. This paper shows you how to make a variety of heat maps by using PROC SGPLOT, the Graph Template Language, and SG annotation.

Read the paper (PDF)

How would you answer this question? Most of us struggle to articulate the value of the tools, techniques, and teams we use when using analytics. How do you help the new director understand the value of SAS^® to you, your job, and the company? In this interactive session, you will discover the components that make up total cost of ownership (TCO) as they apply to the analytics lifecycle. What should you consider when you evaluate total cost of ownership and why should you measure it? How can you help your management team understand the value that SAS provides?

Read the paper (PDF)

This panel discusses a wide range of topics related to analytics in higher education. Panelists are from diverse institutions and represent academic research, information technology, and institutional research. Challenges related to data acquisition and quality, system support, and meeting customer needs are covered. Topics such as effective dashboards and reporting, big data, predictive analytics, and more are on the agenda.

Immediately after a new credit card product is launched and in the wallets of cardholders, sentiment begins to build. Positive and negative experiences of current customers posted online generate impressions among prospective cardholders in the form of technological word of mouth. Companies that issue credit cards can use sentiment analysis to understand how their product is being received by consumers, and, by taking suitable measure, can propel the card's market success. With the help of text mining and sentiment analysis using SAS^® Enterprise Miner and SAS^® Sentiment Analysis Studio, we are trying to answer which aspects of a credit card garnered the most favor, and conversely, which generated negative impressions among consumers. Credit Karma is a free credit and financial management platform for US consumers available on the web and on major mobile platforms. It provides free weekly updated credit scores and credit reports from the national credit bureaus TransUnion and Equifax. The implications of this project are as follows: 1) all companies that issue credit cards can use this technique to determine how their product is fairing in the market, and they can make business decisions to improve the flaws, based on public opinion; and 2) sentiment analysis can simulate the word-of-mouth influence of millions of existing users about a credit card.

Read the paper (PDF)

Another year implementing, validating, securing, optimizing, migrating, and adopting the Hadoop platform. What have been the top 10 accomplishments with Hadoop seen over the last year? We also review issues, concerns, and resolutions from the past year as well. We discuss where implementations are and some best practices for moving forward with Hadoop and SAS^® releases.

Read the paper (PDF)

Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one based on a methodology from so many candidates is usually a big challenge. This paper presents the methodology based on the CANSLIM1 theorem and momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS^®. This paper presents the methods on how to apply the logic and graphically present them though SAS. All SAS users, especially those who work directly on capital market business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.

Read the paper (PDF)

In today's instant information society, we want to know the most up-to-date information about everything, including what is happening with our favorite sports teams. In this paper, we explore some of the readily available sources of live sports data, and look at how SAS^® technologies, including SAS^® Event Stream Processing and SAS^® Visual Analytics, can be used to collect, store, process, and analyze the streamed data. A bibliography of sports data websites that were used in this paper is included, with emphasis on the free sources.

Read the paper (PDF)

The openness of SAS^® Viya , the new cloud analytic platform that uses SAS^® Cloud Analytic Services (CAS), emphasizes a unified experience for data scientists. You can now execute the analytics capabilities of SAS^® in different programming languages including Python, Java, and Lua, as well as use a RESTful endpoint to execute CAS actions directly. This paper provides an introduction to these programming languages. For each language, we illustrate how the API is surfaced from the CAS server, the types of data that you can upload to a CAS server, and the result tables that are returned. This paper also provides a comprehensive comparison of using these programming languages to build a common analytical process, including loading data to a CAS server; exploring, manipulating, and visualizing data; and building statistical and machine learning models.

Read the paper (PDF)

The clock is ticking! Is your company ready for May 25, 2018 when the General Data Protection Regulation that affects data privacy laws across Europe comes into force? If companies fail to comply, they incur very large fines and might lose customer trust if sensitive information is compromised. With data streaming in from multiple channels in different formats, sizes, and wavering quality, it is increasingly difficult to keep track of personal data so that you can protect it. SAS^® Data Management helps companies on their journey toward governance and compliance involving tasks such as detection, quality assurance, and protection of personal data. This paper focuses on using SAS^® Federation Server and SAS^® Data Management Studio in the SAS^® data management suite of products to surface and manage that hard-to find-personal data. SAS Federation Server provides you with a universal way to access data in Hadoop, Teradata, SQL Server, Oracle, SAP HANA, and other types of data without data movement during processing. The advanced data masking and encryption capabilities of SAS Federation Server can be use when virtualizing data for users. Purpose-built data quality functions are used to perform identification analysis, parsing, and matching and extraction of personal data in real time. We also provide insight to how the exploratory data analysis capability of SAS^® Data Management Studio enables you to scan through your investigation hub to identify and categorize personal data.

Read the paper (PDF)

The traditional view is that a utility's long-term forecast must have a standard against which it is judged. Weather normalization is one of the industry-standard practices that utilities use to assess the efficacy of a forecasting solution. While recent advances in probabilistic load forecasting techniques are proving to be a methodology that brings many benefits to a forecast, many utilities still require the benchmarking process to determine the accuracy of their long-term forecasts. Due to climatological volatility and the potentially large annual variances in temperature, humidity, and other relevant weather variables, most utilities create normalized weather profiles through various processes in order to estimate what is traditionally called a weather normalized load profile. However, new research shows that due to the nonlinear response of electric demand to weather variations, a simple normal weather profile in many cases might not equate to a normal load. In this paper, we introduce a probabilistic approach to deriving normalized load profiles and monthly peak and energy in through a process we label load normalization against the effects of weather . We compare it with the traditional weather normalization process to quantify the costs and benefits of using such a process. The proposed method has been successfully deployed at utilities for their long-term operation and planning purposes, and risk management.

Read the paper (PDF)

What if you had analytics near the edge for your Internet of Things (IoT) devices that would tell you whether a piece of equipment is operating within its normal range? And what if those same analytics could help you intelligently determine what data you should keep and what data should be filtered at the edge? This session focuses on classifying streaming data near the edge by showcasing a demo that implements a single-class classification model within a gateway device. The model identifies observations that are normal and abnormal to help determine possible machine issues and preventative maintenance opportunities. The classification also helps to provide a method for filtering data at the edge by capturing all abnormal data but taking only a sample of the normal operating data. The model is developed using SAS^® Viya and implemented on a gateway device using SAS^® Event Stream Processing. By using a single-class classification technique, the demo also illustrates how to avoid issues with binary classification that would require failure observations in order to build an accurate model. Problems that this demo addresses include: identifying potential and future equipment failures in near real time; filtering sensor data near the edge to prevent unnecessary transport and storage of less valuable data; and building a classification model for failure that doesn't require observations relating to failures.

Read the paper (PDF)

Many SAS^® environments are set up for single-byte character sets (SBCS). But many organizations now have to process names of people and companies with characters outside that set. You can solve this problem by changing the configuration to the UTF-8 encoding, which is a multi-byte character set (MBCS). But the commonly used text manipulating functions like SUBSTR, INDEX, FIND, and so on, act on bytes, and should not be used anymore. SAS has provided new functions to replace these (K-functions). Also, character fields have to be enlarged to make room for multi-byte characters. This paper describes the problems and gives guidelines for a strategy to change. It also presents code to analyze existing code for functions that might cause problems. For those interested, a short historic background and a description of UTF-8 encoding is also provided. Conclusions focus on the positioning of SAS environments configured with UTF-8 versus single-byte encodings, the strategy of organizations faced with a necessary change, and the documentation.

Read the paper (PDF)

In this technology-driven era, multi-channel communication has become a pivotal part of an effective customer care strategy for companies. Old ways of delivering customer service are no longer adequate. To survive a tough competitive market and retain current customer base, companies are spending heavily to serve customers in the manner in which they wish to be served. West Corporation helps their clients in designing a strategy that would provide their customers with a connected inbound and outbound communication experience. This paper illustrates how the Data Science team at West Corporation has measured the effect of outbound short message service (SMS) notification in reducing inbound interactive voice response (IVR) call volume and improving customer satisfaction for a leading telecom services company. As part of a seamless experience, customers have the option of receiving outbound SMS notifications at several stages while traversing inside IVR. Notifications can involve successful payment and appointment confirmations, outage updates in the area, and an option of receiving text with details to reset Wi-Fi password and activate new devices. This study was performed on two groups of customers one whose members opted to receive notifications and one whose members did not opt in. Also, analysis was performed using SAS^® to understand repeat caller behaviors within both groups. The group that opted to receive SMS notifications were less likely to call back than those who did not opt in.

Read the paper (PDF)

Capacity management is concerned with managing, controlling, and optimizing the hardware resources on a technology platform. Its primary goal is to ensure that IT resources are right-sized to meet current and future business requirements in a cost-effective manner. In other words, keeping those hardware vendors at bay! A SAS^® LASR Analytic Server, with its dependence on in-memory resources, necessitate a revisit to the traditional IT server capacity management practices. A major UK-based financial services institution operates a multi-tenanted Enterprise SAS^® platform. The tenants share platform resources and as such, require quotas enforced with system limits and costs for their resource utilization, aligned to the business outcomes and agreed-upon service level agreements (SLAs). This paper discusses the implementation of system, operational, and development polices applicable in a multi-tenanted SAS platform, in order to optimize an investment in the analytic platform provided by SAS LASR Analytic Server and to be in control as to when capacity uplifts are required.

Read the paper (PDF)

Traditionally, role-based access control is implemented as group memberships. Access to SAS^® data sets or metadata libraries requires membership in the group that 'owns' the resources. From the point of view of a SAS process, these authorizations are additive. If a user is a member in two distinct groups, her SAS processes have access to the data resources of both groups simultaneously. This happens every time the user runs a SAS process; even when the code in question is meant to be used with only one group's resources. As a consequence, having a master data source defining data flows between groups becomes futile, as any SAS process of the user can bypass said definitions. In addition, as it is not possible to reduce the user's authorizations to match those of only the relevant group, it becomes challenging to determine whether other members of the group have sufficient authorization. Furthermore, it becomes difficult to audit statistics production, as it cannot be automatically determined which of the groups owns a certain log file. All these problems can be avoided by using role-based access control with dynamic separation of duties (RBAC DSoD). In DSoD, the user is able to activate only one group membership at a time. This paper describes one way to implement an RBAC with DSoD schema in a UNIX server environment.

Read the paper (PDF)

XML documents are becoming increasingly popular for transporting data from different operating systems. In the pharmaceutical industry, the Food and Drug Administration (FDA) requires pharmaceutical companies to submit certain types of data in XML format. This paper provides insights into XML documents and summarizes different methods of importing and exporting XML documents with SAS^®, including: using the XML LIBNAME engine to translate between the XML markup and the SAS format; creating an XML Map and using the XML92 LIBNAME engine to read in XML documents and create SAS data sets; and using Clinical Data Interchange Standards Consortium (CDISC) procedures to import and export XML documents. An example of importing OpenClinica data into SAS by implementing these methods is provided.

Read the paper (PDF)

In the past 10 years, SAS^® Enterprise Guide^® has developed into the go-to application to access the power of SAS^®. With each new release, SAS continues to add functionality that makes the SAS user's life easier. We take a closer look at some of the built-in features within SAS Enterprise Guide and how they can make your life easier. One of the most exciting and powerful features we explore is allowing parallel execution on the same server. This gives you the ability to run multiple SAS processes at the same time regardless of whether you have a SAS^® Grid Computing environment. Some other topics we cover include conditional processing within SAS Enterprise Guide, how to securely store database login and password information, setting up autoexec files in SAS Enterprise Guide, exploiting process flows, and much more.

Read the paper (PDF)

A growing need in higher education is the more effective use of analytics when evaluating the success of a postsecondary institution. The metrics currently used are simplistic measures of graduation rates, publications, and external funding. These measures offer a limited view of the effectiveness of postsecondary institutions. This paper provides a global perspective of the academic progress of students and the business of higher education. It presents innovative metrics that are more effective in evaluating postsecondary-institutional effectiveness.

Read the paper (PDF)

This paper describes an effective real-time contextual marketing system based on a successful case implemented in a private communication company in Chile. Implementing real-time cases is becoming a major challenge due to stronger competition, which generates an increase of churn and higher operational costs, among other issues. All of these can have an enormous effect on revenue and profit. A set of predictive machine learning models can help to improve response rates of outbound campaigns, but it s not enough to be more proactive in this business. Our real-time system for contextual marketing uses the two SAS^® solutions: SAS^® Event Stream Processing and SAS^® Real-Time Decision Manager, which are connected in cascade. In this configuration, SAS Event Stream Processing can read massive amounts of data from call detail records (CDRs) and antennas, and SAS Real-Time Decision Manager receives the resulting golden events, which trigger the right responses. Time elapsed from the detection of a golden event until a response is processed is approximately 5 seconds. Since implementing seven use cases of this real-time system, the results show an average augmentation in revenue of two million dollars in a testing period of four months, thus returning the investment in a short-term period. The implementation of this system has changed the way Telef nica Chile generates value from big data. Moreover, an outstanding, long-term working relationship between Telef nica Chile and SAS has been started.

Read the paper (PDF)

SAS^® Enterprise Guide^® continues to add easy-to-use features that enable you to work more efficiently. For example, you can now debug your DATA step code with a DATA step debugger tool; upload data to SAS^® Viya with a point-and-click task; control process flow execution behavior when an error occurs; export results to Microsoft Excel and Microsoft PowerPoint destinations with the click of a button; zoom views; filter the data grid with your own WHERE clause; easily define case-insensitive filters; and automatically get the latest product updates. Come see these and more new features and enhancements in SAS Enterprise Guide 7.11, 7.12, and 7.13.

Read the paper (PDF)

This presentation demonstrates that applying Fast Fourier Transformation (FFT) on smart meter data can provide enhanced customer segmentation and discovery. The FFT is a mathematical method for transforming a function of time into a function of frequency. It's vastly used in analyzing sound but is also relevant for utilities. Advanced Metering Infrastructure (AMI) refers to the full measurement and collection system that includes meters at the customer site and communication networks between the customer and the utility. With the inception of AMI, utilities experienced an explosion of data that provides vast analytical opportunities to improve reliability, customer satisfaction, and safety. However, the data explosion comes with its own challenges. The first challenge is the volume. Consider that just 20,000 customers with AMI data can reach over 300 GB of data per year. Simply aggregating the data from minutes to hours or even days can skew results and not provide accurate segmentations. The second challenge is the bad data that is being collected. Outliers caused by missing or incorrect reads, outages, or other factors must be addressed. FFT can eliminate this noise. The proposed framework is expected to identify various customer segments that could be used for demand response programs. The framework also has the potential to investigate diversion or fraud or failing meters (revenue protection), which is a big problem for many utilities.

Do you need to create a format instantly? Does the format have a lot of labels, and it would take a long time to type in all the codes and labels by hand? Sometimes, a SAS^® programmer needs to create a user-defined format for hundreds or thousands of codes, and he needs an easy way to accomplish this without having to type in all of the codes. SAS provides a way to create a user-defined format without having to type in any codes. If the codes and labels are in a text file, SAS data set, Excel file, or in any file that can be converted to a SAS data set, then a SAS user-defined format can be created on the fly. The CNTLIN=option of PROC FORMAT allows a user to create a user-defined format or informat from raw data or from a SAS file. This paper demonstrates how to create two user-defined formats instantly from a raw text file on our Census Bureau website. It explains how to use these user-defined formats for the final report and final output data set from PROC TABULATE. The paper focuses on the CNTLIN= option of PROC FORMAT, not the CNTLOUT= option.

Read the paper (PDF)

Exploring, analyzing, and presenting information are strengths of SAS^® Visual Analytics. However, when we need to expand the viewing of this information to an unlimited public outside the boundaries of the organization, we must aggregate geographic information to facilitate interaction and the use of information. This application uses JavaScript and CSS programming language, integrated with SAS^® programming, to present information about 4,239 programs of postgraduate study in Brazil. This information was evaluated by the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES, Brazil) with cartographic precision, enabling the visualization of the data generated in SAS Visual Analytics integrated with Google Maps and Google Street View. Users can select from Brazilian postgraduate programs, know information about the program, learn about theses and dissertations, and see the location of the institution and the campus. The application that is presented can be accessed at http://goo.gl/uAjvGw.

Read the paper (PDF)

SAS^® Visual Analytics has two offerings, SAS^® Visual Statistics and SAS^® Visual Data Mining and Machine Learning, that provide knowledge workers and data scientists an interactive interface for data partition, data exploration, feature engineering, and rapid modeling. These offerings are powered by the SAS^® Viya platform, thus enabling big data and big analytic problems to be solved. This paper focuses on the steps a user would perform during an interactive modeling session.

Read the paper (PDF)

This paper will build on the knowledge gained in the Intro to SAS^® ODS Graphics. The capabilities in ODS Graphics grow with every release as both new paradigms and smaller tweaks are introduced. After talking with the ODS developers, a selection of the many wonderful capabilities was selected. This paper will look at that selection of both types of capabilities and provide the reader with more tools for their belt. Visualization of data is an important part of telling the story seen in the data. And while the standards and defaults in ODS Graphics are very well done, sometimes the user has specific nuances for characters in the story or additional plot lines they want to incorporate. Almost any possibility, from drama to comedy to mystery, is available in ODS Graphics if you know how. We will explore tables, annotation and changing attributes, as well as the BLOCK plot. Any user of Base SAS on any platform will find great value from the SAS ODS Graphics procedures. Some experience with these procedures is assumed, but not required.

Read the paper (PDF) | Download the data file (ZIP)

Interrupted time series analysis (ITS) is a tool that can be used to help Learning Healthcare Systems evaluate programs in settings where randomization is not feasible. Interrupted time series is a statistical method to assess repeated snap shots over regular intervals of time before and after a system-level intervention or program is implemented. This method can be used by Learning Healthcare Systems to evaluate programs aimed at improving patient outcomes in real-world, clinical settings. In practice, the number of patients and the timing of observations are restricted. This presentation describes a program that helps statisticians identify optimal segments of time within a fixed population size for an interrupted time series analysis. A macro creates simulations based on DO loops to calculate power to detect changes over time due to system-level interventions. Parameters used in the macro are sample size, number of subjects in each time frame in each year, number of intervals in a year, and the probability of the event before and after the intervention. The macro gives the user the ability to specify different assumptions that result in design options that yield varying power based on the number of patients in each time intervals given the fixed parameters. The output from the macro can help stakeholders understand necessary parameters to help determine the optimal evaluation design.

Read the paper (PDF)

How can we run traditional SAS^® jobs, including SAS^® Workspace Servers, on Hadoop worker nodes? The answer is SAS^® Grid Manager for Hadoop, which is integrated with the Hadoop ecosystem to provide resource management, high availability and enterprise scheduling for SAS customers. This paper provides an introduction to the architecture, configuration, and management of SAS Grid Manager for Hadoop. Anyone involved with SAS and Apache Hadoop should find the information in this paper useful. The first area covered is a breakdown of each required SAS and Hadoop component. From the Hadoop ecosystem, we define the role of Hadoop YARN, Hadoop Distributed File System (HDFS) storage, and Hadoop client services. We review SAS metadata definitions for SAS Grid Manager, SAS^® Object Spawner, and SAS^® Workspace Servers. We cover required Kerberos security, as well as SAS^® Enterprise Guide^® and the SAS^® Grid Manager Client Utility. YARN queues and the SAS Grid Policy file for optimizing job scheduling are also reviewed. And finally, we discuss traditional SAS math running on a Hadoop worker node, and how it can take advantage of high-performance math to accelerate job execution. By leveraging SAS Grid Manager for Hadoop, sites are moving SAS jobs inside a Hadoop cluster. This will ultimately cut down on data movement and provide more consistent job execution. Although this paper is written for SAS and Hadoop administrators, SAS users can also benefit from this session.

Read the paper (PDF)

This presentation teaches the audience how to use ODS Graphics. Now part of Base SAS^®, ODS Graphics are a great way to easily create clear graphics that enable any user to tell their story well. SGPLOT and SGPANEL are two of the procedures that can be used to produce powerful graphics that used to require a lot of work. The core of the procedures is explained, as well as some of the many options available. Furthermore, we explore the ways to combine the individual statements to make more complex graphics that tell the story better. Any user of Base SAS on any platform will find great value in the SAS ODS Graphics procedures.

Read the paper (PDF) | Download the data file (ZIP)

For many years now you have learned the ins and outs of using SAS/ACCESS^® software to move data into SAS^® to do your analytics. With the new open, cloud-ready SAS^® Viya platform comes a new set of data access technologies known as SAS data connectors and SAS data connect accelerators. This paper describes what these new data access products are and how they integrate with the SAS Viya platform. After reading this paper, you will have the foundation needed to load data from third-party data sources into SAS Viya.

Read the paper (PDF)

Statistical analysis is like detective work, and a data set is like the crime scene. The data set contains unorganized clues and patterns that can, with proper analysis, ultimately lead to meaningful conclusions. Using SAS^® tools, a statistical analyst (like any good crime scene investigator) performs a preliminary analysis of the data set through visualization and descriptive statistics. Based on the preliminary analysis, followed by a detailed analysis, both the crime scene investigator (CSI) and the statistical analyst (SA) can use scientific or analytical tools to answer the key questions: What happened? What were the causes and effects? Why did this happen? Will it happen again? Applying the CSI analogy, this paper presents an example case study using a two-step process to investigate a big-data crime scene. Part I shows the general procedures that are used to identify clues and patterns and to obtain preliminary insights from those clues. Part II narrows the focus on the specific statistical analyses that provide answers to different questions.

Read the paper (PDF)

In 1993, Erin Brockovich, a legal clerk to Edward L. Masry, began a lengthy manual investigation after discovering a link between elevated clusters of cancer cases in Hinkley, CA, and contaminated water in the same area due to the disposal of chemicals from a utility company. In this session, we combine disparate data sources - cancer cases and chemical spillages - to identify connections between the two data sets using SAS^® Visual Investigator. Using the map and network functionalities, we visualize the contaminated areas and their link to cancer clusters. What took Erin Brockovich months and months to investigate, we can do in minutes with SAS Visual Investigator.

Read the paper (PDF)

Walt Disney once said, 'Of all of our inventions for mass communication, pictures still speak the most universally understood language.' Using data visualization to tell our stories makes analytics accessible to a wider audience than we can reach through words and numbers alone. Through SAS^® Visual Analytics, we can provide insight to a wide variety of audiences, each of whom see the data through a unique lens. Determining the best data and visualizations to provide for users takes concentrated effort and thoughtful planning. This session discusses how Western Kentucky University uses SAS Visual Analytics to provide a wide variety of users across campus with the information they need to visually identify trends, even some they never expected to see, and to answer questions they might not have thought to ask.

Read the paper (PDF)

As technology expands, we have the need to create programs that can be handed off to clients, to regulatory agencies, to parent companies, or to other projects, and handed off with little or no modification by the recipient. Minimizing modification by the recipient often requires the program itself to self-modify. To some extent the program must be aware of its own operating environment and what it needs to do to adapt to it. There are a great many tools available to the SAS^® programmer that will allow the program to self-adjust to its own surroundings. These include location-detection routines, batch files based on folder contents, the ability to detect the version and location of SAS, programs that discern and adjust to the current operating system and the corresponding folder structure, the use of automatic and user defined environmental variables, and macro functions that use and modify system information. Need to create a portable program? We can hand you the tools.

Read the paper (PDF)

This paper details analysis that was conducted on the World Development Indicators data set, obtained from the World Bank Information Repository. The aim of this analysis was to provide useful insight into how countries, particularly developing countries such as those in South America and Asia, can use social investment programs to grow their GDP. The analysis identified that useful models can be obtained to predict per capita GDP based on social factors. Further study and investigation is recommended to explore further the relationships discovered in performing this analysis.

Read the paper (PDF)

Data with a location component is naturally displayed on a map. Base SAS^® 9.4 provides libraries of map data sets to assist in creating these images. Sometimes, a particular sub-region is all that needs to be displayed. SAS/GRAPH^® software can create a new subset of the map using the GPROJECT procedure minimum and maximum latitude and longitude options. However, this method is capable only of cutting out a rectangular area. This paper presents a polygon clipping algorithm that can be used to create arbitrarily shaped custom map regions. Maps are nothing more than sets of polygons, defined by sets of border points. Here, a custom polygon shape overlays the map polygons and saves the intersection of the two. The DATA step hash object is used for easier bookkeeping of the added and deleted points needed to maintain the correct shape of the clipped polygons.

Read the paper (PDF)

In doing this project, we hoped to find patterns in crime occurrences that can help law enforcement officials decrease the number of crimes that the City of Philadelphia experiences. More specifically, we wanted to determine when and where most crimes occur, and what types of crimes were most prevalent. Our data came from the open data repository for Philadelphia, which has additional data from other sources in the region. After merging four data sets into one, doing a lot of data cleaning, and creating new variables, we found some interesting trends. First, we found that crimes generally occurred in more isolated areas where there was less traffic, and that certain locations had higher crime counts than others. We also discovered there was less crime in the morning and more in the afternoon, as well as less crime on Sunday and more on Tuesday. During the summer months, total crime occurrences as well as the most prevalent types of crime occurrences (thefts, vandalism/criminal mischief, miscellaneous crimes, and other assault) peaked. Theft occurrences, the most prevalent crime occurrence, showed many of the same trends as overall crime occurrences. We found that thefts and vehicle thefts as well as overall crime occurrences were most prevalent in ZIP code 19102. The models we built to try to classify a crime as violent or nonviolent were not very fruitful, but the tree model was the best in terms of validation misclassification error rate.

Read the paper (PDF)

How do you enable strong authentication across different parts of your organization in a safe and secure way? We know that Kerberos provides us with a safe and secure strong authentication mechanism, but how does it work across different domains or realms? In this paper, we examine how Kerberos cross-realm authentication works and the different parts that you need ready in order to use Kerberos effectively. Understanding the principals and applying the ideas we present will make you successful at improving the security of your authentication system.

Read the paper (PDF)

A leading global information and communications technology solution company provides a broad range of telecom products across the world. Their finished products share commonality in key components, and, in most cases, are assembled after the customer orders are realized. Each finished product typically consists of a large number of key components, and the stockout of any components causes a delay of customer orders. For these reasons, the optimal inventory policy of one component should be determined in conjunction with those of other components. Currently the company uses business experience to manage inventory across their supply chain network for all of the components and finished products. However, the increasing variety of products and business expansion raise difficulties in inventory management. The company wants to explore a systematic approach to optimizing inventory policies, assuring customer service level and minimizing total inventory cost. This paper describes using SAS/OR^® software and SAS^® inventory optimization technologies to model such a multi-echelon assembly system and optimize inventory policies for key components and finished products.

Read the paper (PDF)

Throughout history, the phrase know thyself has been the aspiration of many. The trend of wearable technologies has certainly provided the opportunity to collect personal data. These technologies enable individuals to know thyself on a more sophisticated level. Specifically, wearable technologies that can track a patient's medical profile in a web-based environment, such as continuous blood glucose monitors, are saving lives. The main goal for diabetics is to replicate the functions of the pancreas in a manner that allows them to live a normal, functioning lifestyle. Many diabetics have access to a visual analytics website to track their blood glucose readings. However, they often are unreadable and overloaded with information. Analyzing these readings from the glucose monitor and insulin pump with SAS^®, diabetics can parse their own information into more simplified and readable graphs. This presentation demonstrates the ease in creating these visualizations. Not only is this beneficial for diabetics, but also for the doctors that prescribe the necessary basal and bolus levels of insulin for a patient s insulin pump.

View the e-poster or slides (PDF)

When analyzing data with SAS^®, we often use the SAS DATA step and the SQL procedure to explore and manipulate data. Though they both are useful tools in SAS, many SAS users do not fully understand their differences, advantages, and disadvantages and thus have numerous unnecessary biased debates on them. Therefore, this paper illustrates and discusses these aspects with real work examples, which give SAS users deep insights into using them. Using the right tool for a given circumstance not only provides an easier and more convenient solution, it also saves time and work in programming, thus improving work efficiency. Furthermore, the illustrated methods and advanced programming skills can be used in a wide variety of data analysis and business analytics fields.

Read the paper (PDF)

From stock price histories to hospital stay records, analysis of time series data often requires use of lagged (and occasionally lead) values of one or more analysis variable. For the SAS^® user, the central operational task is typically getting lagged (lead) values for each time point in the data set. Although SAS has long provided a LAG function, it has no analogous lead function, which is an especially significant problem in the case of large data series. This paper 1) reviews the lag function, in particular, the powerful but non-intuitive implications of its queue-oriented basis; 2) demonstrates efficient ways to generate leads with the same flexibility as the LAG function, but without the common and expensive recourse to data re-sorting; and 3) shows how to dynamically generate leads and lags through use of the hash object.

Read the paper (PDF)

Managing your career future involves learning outside the box at all stages. The next step is not always on the path we planned as opportunities develop and must be taken when we are ready. Prepare with this paper, which explains important features of Base SAS^® that support teams. In this presentation, you learn about the following: concatenating team shared folders with personal development areas; creating consistent code; guidelines for a team (not standards); knowing where the documentation will provide the basics; thinking of those who follow (a different interface); creating code for use by others; and how code can learn about the SAS environment.

Read the paper (PDF)

Programming for others involves new disciplines not called for when we write to provide results. There are many additional facilities in the languages of SAS^® to ensure the processes and programs you provide for others will please your customers. Not all are obvious and some seem hidden. The never-ending search to please your friends, colleagues, and customers could start in this presentation.

Read the paper (PDF)

Data is your friend. This presentation discusses the use of data for quality improvement (QI). Measurement over time is integral to quality improvement, and statistical process control charts (also known as Shewhart or SPC charts) are a good way to learn from the way measures change over time, in response to our improvement efforts. The presentation explains what an SPC chart is, how to chose the correct type of chart, how to create and update a chart using SAS^®, and how to learn from the chart. The examples come from QI projects in health care, and the material is based on the Institute for Healthcare Improvement's Model for Improvement. However, the material is applicable to other fields, including manufacturing and business. The presentation is intended for people newly considering a QI project, people who want to graph their data and need help with getting started, and anyone interested in interpreting SPC charts created by someone else.

Read the paper (PDF)

Making sure that you have saved all the necessary information to replicate a deliverable can be a cumbersome task. You want to make sure that all the raw data sets and all the derived data sets, whether they are Study Data Tabulation Model (SDTM) data sets or Analysis Data Model (ADaM) data sets, are saved. You prefer that the date/time stamps are preserved. Not only do you need the data sets, you also need to keep a copy of all programs that were used to produce the deliverable, as well as the corresponding logs from when the programs were executed. Any other information that was needed to produce the necessary outputs also needs to be saved. You must do all of this for each deliverable, and it can be easy to overlook a step or some key information. Most people do this process manually. It can be a time-consuming process, so why not let SAS^® do the work for you?

Read the paper (PDF)

Developing software using agile methodologies has become the common practice in many organizations. We use the SCRUM methodology to prepare, plan, and implement changes in our analytics environment. Preparing for the deployment of a new release usually took two days of creating packages, promoting them, deploying jobs, creating migration scripts, and correcting errors made in the first attempt. A sprint that originally took 10 working days (two weeks) was effectively reduced to barely seven. By automating this process, we were able to reduce the time needed to prepare our deployment to less than half a day, increasing the time we can spend developing by 25%. In this paper, we present the process and system prerequisites for automating the deployment process. We also describe the process, code, and scripts required for automating metadata promotion and physical table comparison and update.

Read the paper (PDF)

Linear regression, which is widely used, can be improved by the inclusion of the penalizing parameter. This helps reduce variance (at the cost of a slight increase in bias) and improves prediction accuracy and model interpretability. The regularization model is implemented on the sample data set, and recommendations for the practice are included.

View the e-poster or slides (PDF)

String externalization is the key to making your SAS^® applications speak multiple languages, even if you can't. Using the new features in SAS^® 9.3 for internationalization, your SAS applications can be written to adapt to whatever environment they are found in. String externalization is the process of identifying and separating translatable strings from your SAS program. This paper outlines the four steps of string externalization: create a Microsoft Excel spreadsheet for messages (optional), create SMD files, convert SMD files, and create the final SAS data set. Furthermore, it briefly shows you a real-world project on applying the concept. Using the Excel spreadsheet message text approach, professional translators can work more efficiently translating text in a friendlier and more comfortable environment. Subsequently, a programmer can also fully concentrate on developing and maintaining SAS code when your application is traveling to a new country.

View the e-poster or slides (PDF)

Geofencing is one of the most promising and exciting concepts that has developed with the advent of the internet of things. Like John Anderton in the 2002 movie Minority Report, you can now enter a mall and immediately receive commercial ads and offers based on your personal taste and past purchases. Authorities can track vessels positions and detect when a ship is not in the area it should be, or they can forecast and optimize harbor arrivals. When a truck driver breaks from the route, the dispatcher can be alerted and can act immediately. And there are countless examples from manufacturing, industry, security, or even households. All of these applications are based on the core concept of geofencing, which consists of detecting whether a device s position is within a defined geographical boundary. Geofencing requires real-time processing in order to react appropriately. In this session, we explain how to implement real-time geofencing on streaming data with SAS^® Event Stream Processing and achieve high-performance processing, in terms of millions of events per second, over hundreds of millions of geofences.

Read the paper (PDF)

In this panel session, professors from three geographically diverse universities explain what makes for an effective partnership with private sector companies. Specific examples are discussed from health care, insurance, financial services, insurance, and retail. The panelists discuss what works, what doesn t, and what both parties need to be prepared to bring to the table for a long-term, mutually beneficial partnership.

The days of comparing paper copies of graphs on light boxes are long gone, but the problems associated with validating graphical reports still remain. Many recent graphs created using SAS/GRAPH^® software include annotations, which complicate an already complex problem. In ODS Graphics, only a single input data set should be used. Because annotation can be more easily added by overlaying an additional graph layer, it is now more practical to use that single input data set for validation, which removes all of the scaling, platform, and font issues that got in the way before. This paper guides you through the techniques to simplify validation while you are creating your perfect graph.

Read the paper (PDF)

SAS^® education is a mainstay across disciplines and educational levels in the United States. Along with other courses that are relevant to the jobs students want, independent SAS courses or SAS education integrated into additional courses can help a student be more interesting to a potential employer. The multitude of SAS offerings (SAS^® University Edition, Base SAS^®, SAS^® Enterprise Guide^®, SAS^® Studio, and the SAS^® OnDemand offerings) provide the tools for education, but reaching students where they are is the greatest key for making the education count. This presentation discusses several roadblocks to learning SAS^® syntax or point-and-click from the student perspective and several solutions developed jointly by students and educators in one graduate educational program.

Read the paper (PDF)

Every organization, from the most mature to a day-one start-up, needs to grow organically. A deep understanding of internal customer and operational data is the single biggest catalyst to develop and sustain the data. Advanced analytics and big data directly feed into this, and there are best practices that any organization (across the entire growth curve) can adopt to drive success. Analytics teams can be drivers of growth. But to be truly effective, key best practices need to be implemented. These practices include in-the-weeds details, like the approach to data hygiene, as well as strategic practices, like team structure and model governance. When executed poorly, business leadership and the analytics team are unable to communicate with each other they talk past each other and do not work together toward a common goal. When executed well, the analytics team is part of the business solution, aligned with the needs of business decision-makers, and drives the organization forward. Through our engagements, we have discovered best practices in three key areas. All three are critical to analytics team effectiveness. 1) Data Hygiene 2) Complex Statistical Modeling 3) Team Collaboration

Read the paper (PDF)

You re in the business of performing complex analyses on large amounts of data. This data changes quickly and often, so you ve invested in a powerful high-performance analytics engine with the speed to respond to a real-time data stream. However, you realize an immediate problem upon the implementation of your software solution: your analytics engine wants to process many records of data at once, but your streaming engine wants to send individual records. How do you store this streaming data? How do you tell the analytics engine about the updates? This paper explains how to manage real-time streaming data in a batch-processing analytics engine. The problem of managing streaming data in analytics engines comes up in many industries: energy, finance, health care, and marketing to name a few. The solution described in this paper can be applied in any industry, using features common to most analytics engines. You learn how to store and manage streaming data in such a way as to guarantee that the analytics engine has only current information, limit interruptions to data access, avoid duplication of data, and maintain a historical record of events.

Read the paper (PDF)

How many environments does your organization have-three (Dev/Test/Prod), five (Dev/SIT/UAT/Pre-Prod/Prod), or maybe only one? Once you've built your SAS^® process-an ETL job, a model, an exploration, or a report-how should you promote it across these environments? If you have only one environment, is a development life cycle still possible? (Yes, it is.) Historically, the traditional systems development life cycle (SDLC) spans multiple environments (for example, Dev/Test/Prod). This approach has benefits-primarily to ensure that change in one environment does not adversely impact others, but costs and release time-frames mean this is not always practicable. Some sites now adopt a two-platform approach: Non-Production and Production. Non-Prod exists for technology change, such as new software, hot fixes, database connections, and so on. At these sites, the business runs wholly within the Production environment, yet still requires a business-specific life-cycle management within the Production environment. And, of course, all promotion must include thorough testing. Other questions to consider are: 1) Can this promotion process be automated? 2) Can this process extend beyond business content to include configuration settings? This presentation investigates the SAS tools available to promote content between environments or between functional areas of a single environment, and how to automate and test the promotion process. Just imagine: a weekly automated and tested promotion process? Let's see

Read the paper (PDF)

One of the first maps of the present United States was John White's 1585 map of the Albemarle Sound and Roanoke Island, the site of the Lost Colony and the site of my present home. This presentation looks at advances in mapping through the ages, from the early surveys and hand-painted maps, through lithographic and photochemical processes, to digitization and computerization. Inherent difficulties in including small pieces of coastal land (often removed from map boundary files and data sets to smooth a boundary) are also discussed. The paper concludes with several current maps of Roanoke Island created with SAS^®.

Read the paper (PDF)

In the increasingly competitive environment for banks and credit unions, every potential advantage should be pursued. One of these advantages is to market additional products to your existing customers rather than to new customers, since your existing customers already know (and hopefully trust) you, and you have so much data on them. But how can this best be done? How can you market the right products to the right customers at the right time? Predictive analytics can do this by forecasting which customers have the highest chance of purchasing a given financial product. This paper provides a step-by-step overview of a relatively simple but comprehensive approach to maximize cross-sell opportunities among your customers. We first prepare the data for a statistical analysis. With some basic predictive analytics techniques, we can then identify those members who have the highest chance of buying a financial product. For each of these members, we can also gain insight into why they would purchase, thus suggesting the best way to market to them. We then make suggestions to improve the model for better accuracy.

Read the paper (PDF)

As a retailer, have you ever found yourself reviewing your last season's assortment and wondering, What should I have carried in my assortment ? You are constantly faced with the challenge of product selection, placement, and ensuring your assortment will drive profitable sales. With millions of consumers, thousands of products, and hundreds of locations, this question can often times be challenging and overwhelming. With the rise in omnichannel, traditional approaches just won't cut it to gain the insights needed to maximize and manage localized assortments as well as increase customer satisfaction. This presentation explores applications of analytics within marketing and merchandising to drive assortment curation as well as relevancy for customers. The use of analytics can not only increase efficiencies but can also give insights into what you should be buying, how best to create a profitable assortment, and how to engage with customers in-season to drive their path to purchase. Leveraging an analytical infrastructure to infuse analytics into the assortment management process can help retailers achieve customer-centric insights, in a way that is easy to understand, so that retailers can quickly take insights to actions and gain the competitive edge.

Read the paper (PDF)

Meta-analysis is a method for combining multiple independent studies on the same subject or question, producing a single large study with increased accuracy and enhanced ability to detect overall trends and smaller effects. This is done by treating the results of each study as a single observation and performing analysis on the set, while controlling for differences between individual studies. These differences can be treated as either fixed or random effects, depending on context. This paper demonstrates the process and techniques used in meta-analysis using human trafficking studies. This problem has seen increasing interest in the past few years, and there are now a number of localized studies for one state or a metropolitan area. This meta-analysis combines these to begin development of a comprehensive analytic understanding of human trafficking across the United States. Both fixed and random effects are described. All elements of this analysis were performed using SAS^® University Edition.

Read the paper (PDF)

Many practitioners of machine learning are familiar with support vector machines (SVMs) for solving binary classification problems. Two established methods of using SVMs in multinomial classification are the one-versus-all approach and the one-versus-one approach. This paper describes how to use SAS^® software to implement these two methods of multinomial classification, with emphasis on both training the model and scoring new data. A variety of data sets are used to illustrate the pros and cons of each method.

Read the paper (PDF)

A microservice architecture prescribes the design of your software application as suites of independently deployable services. In this paper, we detail how you can design your SAS^® 9.4 programs so that they adhere to a microservice architecture. We also describe how you can leverage Many-Task Computing (MTC) in your SAS^® programs to gain a high level of parallelism. Under these paradigms, your SAS code will gain encapsulation, robustness, reusability, and performance. The design principles discussed in this paper are implemented in the SAS^® Infrastructure for Risk Management (IRM) solution. Readers with an intermediate knowledge of Base SAS^® and the SAS macro language will understand how to design their SAS code so that it follows these principles and reaps the benefits of a microservice architecture.

Read the paper (PDF)

SAS^® BI Dashboard is an important business intelligence and data visualization product used by many customers worldwide. They still rely on SAS BI Dashboard for performance monitoring and decision support. SAS^® Visual Analytics is a new-generation product, which empowers customers to explore huge volumes of data very quickly and view visualized results with web browsers and mobile devices. Since SAS Visual Analytics is used by more and more regular customers, some SAS BI Dashboard customers might want to migrate existing dashboards to SAS Visual Analytics to take advantage of new technologies. In addition, some customers might hope to deploy the two products in parallel and keep everyone on the same page. Because the two products use different data models and formats, a special conversion tool is developed to convert SAS BI Dashboard dashboards into SAS Visual Analytics dashboards and reports. This paper comprehensively describes the guidelines, methods, and detailed steps to migrate dashboards from SAS BI Dashboard to SAS Visual Analytics. Then the converted dashboards can be shown in supported viewers of SAS Visual Analytics including mobile devices and modern browsers.

Read the paper (PDF)

SAS^® migrations are the number one reason why SAS architects and administrators are fired. Even though this bold statement is not universally true, it has been at the epicenter of many management and technical discussions at UnitedHealth Group. The competing business forces between the desire to innovate and to provide platform stability drive difficult discussions between business leaders and IT partners that tend to result in a frustrated user-base, flustered IT professionals, and a stale SAS environment. Migrations are the antagonist of any IT professional because of the disruption, long hours, and stress that typically ensues. This paper addresses the lessons learned from a SAS migration from the first maintenance release of SAS^® 9.4 to the third maintenance release of SAS^® 9.4 on a technically sophisticated enterprise SAS platform including clustered metadata servers, clustered middle-tier, SSL, an IBM Platform Load Sharing Facility (LSF) grid, and SAS^® Visual Analytics.

Read the paper (PDF)

This paper presents a case study in which social media posts by individuals related to public transport companies in the United Kingdom were collected from social media sites such as Twitter and Facebook and also from forums using SAS^® and Python. The posts were then further processed by SAS^® Text Miner and SAS^® Visual Analytics to retrieve brand names, means of public transport (underground, trains, buses), and any mentioned attributes. Relevant concepts and topics are identified using text mining techniques and visualized using concept maps and word clouds. Later, we aim to identify and categorize sentiments against public transport in the corpus of the posts. Finally, we create an association map/mind-map of the different service dimensions/topics and the brands of public transport, using correspondence analysis.

Read the paper (PDF)

Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss Given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level that are built on these models and that are used to help in achieving business targets, setting risk-sensitive pricing, capital planning, optimizing Return on Equity/Risk Adjusted Return on Capital (ROE/RAROC), managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate, and their predictive power can drop dramatically. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results-without human intervention) can result in a huge error, which might have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS^® tools: SAS^® Model Monitoring Microservice, SAS^® Model Manager, dashboards, and SAS^® Visual Analytics.

Read the paper (PDF)

This presentation has the objective to present a methodology for interest rates, life tables, and actuarial calculations using generational mortality tables and the forward structure of interest rates for pension funds, analyzing long-term actuarial projections and their impacts on the actuarial liability. It was developed as a computational algorithm in SAS^® Enterprise Guide^® and Base SAS^® for structuring the actuarial projections and it analyzes the impacts of this new methodology. There is heavy use of the IML and SQL procedures.

Read the paper (PDF)

A successful conversion to the International Financial Reporting Standards (IFRS) standard known as IFRS 9 can present many challenges for a financial institution. We discuss how leveraging best practices in project management, accounting standards, and platform implementation can overcome these challenges. Effective project management ensures that the scope of the implementation and success criteria are well defined. It captures all major decision points and ensures thorough documentation of the platform and how its unique configuration ties back directly to specific business requirements. Understanding the nuances of the IFRS 9 standard, specifically the impact of bucketing all financial assets according to their cash flow characteristics and business models, is crucial to ensuring the design of an efficient and robust reporting platform. Credit impairment is calculated at the instrument level, and can both improve or deteriorate. Changes in the level of credit impairment of individual financial assets enters the balance sheet as either an amortized cost, other comprehensive income, or fair value through profit and loss. Introducing more volatility to these balances increases the volatility in key financial ratios used by regulators. A robust and highly efficient platform is essential to process these calculations, especially under tight reporting deadlines and the possibility of encountering challenges. Understanding how the system is built through the project documentatio

Read the paper (PDF)

Prince Niccolo Machiavelli said things on the order of, The promise given was a necessity of the past: the word broken is a necessity of the present. His utilitarian philosophy can be summed up by the phrase, The ends justify the means. As a personality trait, Machiavelianism is characterized by the drive to pursue one's own goals at the cost of others. In 1970, Richard Christie and Florence L. Geis created the MACH-IV test to assign a MACH score to an individual, using 20 Likert-scaled questions. The purpose of this study was to build a regression model that can be used to predict the MACH score of an individual using fewer factors. Such a model could be useful in screening processes where personality is considered, such as in job screening, offender profiling, or online dating. The research was conducted on a data set from an online personality test similar to the MACH-IV test. It was hypothesized that a statistically significant model exists that can predict an average MACH score for individuals with similar factors. This hypothesis was accepted.

View the e-poster or slides (PDF)

This paper establishes the conceptualization of the dimension of the shopping cart (or market basket) on apparel retail websites. It analyzes how the cart dimension (describing anonymous shoppers) and the customer dimension (describing non-anonymous shoppers) impact merchandise return behavior. Five data-mining techniques-namely logistic regression, decision tree, neural network, gradient boosting, and support vector machine-are used for predicting the likelihood of merchandise return. The target variable is a dichotomous response variable: return vs not return. The primary input variables are conceptualized as constituents of the cart dimension, derived from engineering merchandise-related variables such as item style, item size, and item color, as well as free-shipping-related thresholds. By further incorporating the constituents of the customer dimension such as tenure, loyalty membership, and purchase histories, the predictive accuracy of the model built using each of the five data-mining techniques was found to improve substantially. This research also highlights the relative importance of the constituents of the cart and customer dimensions governing the likelihood of merchandise return. Recommendations for possible applications and research areas are provided.

Read the paper (PDF)

Dynamic social networks can be used to monitor the constantly changing nature of interactions and relationships between people and groups. The size and complexity of modern dynamic networks can make this task extremely challenging. Using the combination of SAS/IML^®, SAS/QC^®, and R, we propose a fast approach to monitor dynamic social networks. A discrepancy score at edge level was developed to measure the unusualness of the observed social network. Then, multivariate and univariate change-point detection methods were applied on the aggregated discrepancy score to identify the edges and vertices that have experienced changes. Stochastic block model (SBM) networks were simulated to demonstrate this method using SAS/IML and R. PROC SHEWHART and PROC CUSUM in SAS/QC and PROC SGRENDER heat maps were applied on the aggregated discrepancy score to monitor the dynamic social network. The combination of SAS/IML, SAS/QC, and R make it an ideal tool to monitor dynamic social networks.

View the e-poster or slides (PDF)

Microsoft Excel worksheets enable you to explore data that answers the difficult questions that you face daily in your work. When you combine the SAS^® Output Deliver System (ODS) with the capabilities of Excel, you have a powerful toolset that you can use to manipulate data in various ways, including highlighting data, using formulas to answer questions, and adding a pivot table or graph. In addition, ODS and Excel give you many methods for enhancing the appearance of your tables and graphs. This paper, written for the beginning analyst to the most advanced programmer, illustrates first how to manipulate styles and presentation elements in your worksheets by controlling text wrapping, highlighting and exploring data, and specifying Excel templates for data. Then, the paper explains how to use the TableEditor tagset and other tools to build and manipulate both basic and complex pivot tables that can help you answer all of the questions about your data. You will also learn techniques for sorting, filtering, and summarizing pivot-table data. ^®

Read the paper (PDF)

The SAS/IML^® language excels in handling matrices and performing matrix computations. A new feature in SAS/IML 14.2 is support for nonmatrix data structures such as tables and lists. In a matrix, all elements are of the same type: numeric or character. Furthermore, all rows have the same length. In contrast, SAS/IML 14.2 enables you to create a structure that contains many objects of different types and sizes. For example, you can create an array of matrices in which each matrix has a different dimension. You can create a table, which is an in-memory version of a data set. You can create a list that contains matrices, tables, and other lists. This paper describes the new data structures and shows how you can use them to emulate other structures such as stacks, associative arrays, and trees. It also presents examples of how you can use collections of objects as data structures in statistical algorithms.

Read the paper (PDF)

The TABULATE procedure has long been a central workhorse of our organization's reporting processes, given that it offers a uniquely concise syntax for obtaining descriptive statistics on deeply grouped and nested categories within a data set. Given the diverse output capabilities of SAS^®, it often then suffices to simply ship the procedure's completed output elsewhere via the Output Delivery System (ODS). Yet there remain cases in which we want to not only obtain a formatted result, but also to acquire the full nesting tree and logic by which the computations were made. In these cases, we want to treat the details of the Tabulate statements as data, not merely as presentation. I demonstrate how we have solved this problem by parsing our Tabulate statements into a nested tree structure in JSON that can be transferred and easily queried for deep values elsewhere beyond the SAS program. Along the way, this provides an excellent opportunity to walk through the nesting logic of the procedure's statements and explain how to think about the axes, groupings, and set computations that make it tick. The source code for our syntax parser are also available on GitHub for further use.

Read the paper (PDF)

The EXPAND procedure is very useful when handling time series data and is commonly used in fields such as finance or economics, but it can also be applied to medical encounter data within a health research setting. Medical encounter data consists of detailed information about healthcare services provided to a patient by a managed care entity and is a rich resource for epidemiologic research. Specific data items include, but are not limited to, dates of service, procedures performed, diagnoses, and costs associated with services provided. Drug prescription information is also available. Because epidemiologic studies generally focus on a particular health condition, a researcher using encounter data might wish to distinguish individuals with the health condition of interest by identifying encounters with a defining diagnosis and/or procedure. In this presentation, I provide two examples of how cases can be identified from a medical encounter database. The first uses a relatively simple case definition, and then I EXPAND the example to a more complex case definition.

View the e-poster or slides (PDF)

In item response theory (IRT), the distribution of examinees' abilities is needed to estimate item parameters. However, specifying the ability distribution is difficult, if not impossible, because examinees' abilities are latent variables. Therefore, IRT estimation programs typically assume that abilities follow a standard normal distribution. When estimating item parameters using two separate computer runs, one problem with this approach is that it causes item parameter estimates obtained from two groups that differ in ability level to be on different scales. There are several methods that can be used to place the item parameter estimates on a common scale, one of which is multi-group calibration. This method is also called concurrent calibration because all items are calibrated concurrently with a single computer run. There are two ways to implement multi-group calibration in SAS^®: 1) Using PROC IRT. 2) Writing an algorithm from scratch using SAS/IML^®. The purpose of this study is threefold. First, the accuracy of the item parameter estimates are evaluated using a simulation study. Second, the item parameter estimates are compared to those produced by an item calibration program flexMIRT. Finally, the advantages and disadvantages of using these two approaches to conduct multi-group calibration are discussed.

View the e-poster or slides (PDF)

Multicollinearity can be briefly described as the phenomenon in which two or more identified predictor variables in a multiple regression model are highly correlated. The presence of this phenomenon can have a negative impact on the analysis as a whole and can severely limit the conclusions of the research study. This paper reviews and provides examples of the different ways in which multicollinearity can affect a research project, and tells how to detect multicollinearity and how to reduce it once it is found. In order to demonstrate the effects of multicollinearity and how to combat it, this paper explores the proposed techniques by using the Behavioral Risk Factor Surveillance System data set. This paper is intended for any level of SAS^® user. This paper is also written to an audience with a background in behavioral science or statistics.

Read the paper (PDF)

Recent advances in computing technology, monitoring systems, and data collection mechanisms have prompted renewed interest in multivariate time series analysis. In contrast to univariate time series models, which focus on temporal dependencies of individual variables, multivariate time series models also exploit the interrelationships between different series, thus often yielding improved forecasts. This paper focuses on cointegration and long memory, two phenomena that require careful consideration and are observed in time series data sets from several application areas, such as finance, economics, and computer networks. Cointegration of time series implies a long-run equilibrium between the underlying variables, and long memory is a special type of dependence in which the impact of a series' past values on its future values dies out slowly with the increasing lag. Two examples illustrate how you can use the new features of the VARMAX procedure in SAS/ETS^® 14.1 and 14.2 to glean important insights and obtain improved forecasts for multivariate time series. One example examines cointegration by using the Granger causality tests and the vector error correction models, which are the techniques frequently applied in the Federal Reserve Board's Comprehensive Capital Analysis and Review (CCAR), and the other example analyzes the long-memory behavior of US inflation rates.

Read the paper (PDF) | Download the data file (ZIP)

No Batch Scheduler? No problem! This paper describes the use of a SAS^® Data Integration Studio job that can be started by a time-dependent scheduler like Windows Scheduler (or crontab in UNIX) to mimic a batch scheduler using SAS^® Grid Manager.

Read the paper (PDF)

Data has to be a moderate size when you estimate parameters with machine learning. For data with huge numbers of records like healthcare big data (for example, receipt data) or super multi-dimensional data like genome big data, it is important to follow a procedure in which data is cleaned first and then selection of data or variables for modeling is performed. Big data often consists of macroscopic and microscopic groups. With these groups, it is possible to increase the accuracy of estimation by following the above procedure in which data is cleaned from a macro perspective and the selection of data or variables for modeling is performed from a micro perspective. This kind of step-wise procedure can be expected to help reduce bias. We also propose a new analysis algorithm with N-stage machine learning. For simplicity, we assume N =2. Note that different machine learning approaches should be applied; that is, a random forest method is used at the first stage for data cleaning, and an elastic net method is used for the selection of data or variables. For programming N-stage machine learning, we use the LUA procedure that is not only efficient, but also enables an easily readable iteration algorithm to be developed. Note that we use well-known machine learning methods that are implementable with SAS^® 9.4, SAS^® In-Memory Statistics, and so on.

Read the paper (PDF)

The SAS^® DATA step is one of the best (if not the best) data manipulators in the programming world. One of the areas that gives the DATA step its power is the wealth of functions that are available to it. This paper takes a PEEK at some of the functions whose names have more than one MEANing. While the subject matter is very serious, the material is presented in a humorous way that is guaranteed not to BOR the audience. With so many functions available, we have to TRIM our list so that the presentation can be made within the TIME allotted. This paper also discusses syntax and shows several examples of how these functions can be used to manipulate data.

Read the paper (PDF)

Graphics are an excellent way to display results from multiple statistical analyses and get a visual message across to the correct audience. Scientific journals often have very precise requirements for graphs that are submitted with manuscripts. While authors often find themselves using tools other than SAS^® to create these graphs, the combination of the SGPLOT procedure and the Output Delivery System enables authors to create what they need in the same place as they conducted their analysis. This presentation focuses on two methods for creating a publication quality graphic in SAS^® 9.4 and provides solutions for some issues encountered when doing so.

Read the paper (PDF)

A new ODS destination for creating Microsoft Excel workbooks is available starting in the third maintenance release for SAS^® 9.4. This destination creates native Microsoft Excel XLSX files, supports graphic images, and offers other advantages over the older ExcelXP tagset. In this presentation, you learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS^® output. The techniques can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on demand and in real time using SAS server technology is discussed. Using earlier versions of SAS to create multi-sheet workbooks is also discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.

Read the paper (PDF) | Download the data file (ZIP)

Creating your first suite of reports using SAS^® Visual Analytics is like being a kid in a candy store with so many options for data visualization, it is difficult to know where to start. Having a plan for implementation can save you a lot of time in development and beyond, especially when you are wrangling big data. This paper helps you make sure that you are parallelizing work (where possible), maximizing your data insights, and creating a polished end product. We provide guidelines to common questions, such as How many objects are too many ? or When should I use multiple tabs versus report linking? to start any data visualizer off on the right foot.

Read the paper (PDF)

Confidence intervals are critical to understanding your survey data. If your intervals are too narrow, you might inadvertently judge a result to be statistically significant when it is not. While many familiar SAS^® procedures, such as PROC MEANS and PROC REG, provide statistical tests, they rely on the assumption that the data comes from a simple random sample. However, almost no real-world survey uses such sampling. Learn how to use the SURVEYMEANS procedure and its SURVEY cousins to estimate confidence intervals and perform significance tests that account for the structure of the underlying survey, including the replicate weights now supplied by some statistical agencies. Learn how to extract the results you need from the flood of output that these procedures deliver.

Read the paper (PDF)

Do you create Excel files from SAS^®? Do you use the ODS EXCELXP tagset or the ODS EXCEL destination? In this presentation, the EXCELXP tagset and the ODS EXCEL destination are compared face to face. There's gonna be a showdown! We give quick tips for each and show how to create Excel files for our Special Census program. Pros of each method are explored. We show the added benefits of the ODS EXCEL destination. We display how to create XML files with the EXCELXP tagset. We present how to use TAGATTR formats with the EXCELXP tagset to ensure that leading and trailing zeros in Excel are preserved. We demonstrate how to create the same Excel file with the ODS EXCEL destination with SAS formats instead of with TAGATTR formats. We show how the ODS EXCEL destination creates native Excel files. One of the drawbacks of an XML file created with the EXCELXP tagset is that a pop-up message is displayed in Excel each time you open it. We present differences using the ABSOLUTE_COLUMN_WIDTH= option in both methods.

Read the paper (PDF)

In order to display data visually, our audience preferred charts and graphs generated by Microsoft Excel over those generated by SAS^®. However, to make the necessary 30 graphs in Excel took 2 3 hours of manual work, even though the chart templates had already been created, and led to mistakes due to human error. SAS graphs took much less time to create, but lacked key functionality that the audience preferred and that was available in Excel graphs. Thanks to SAS, the answer came in Excel 4 Macro Language (X4ML) programming. SAS can actually submit coding to Excel in order to create customized data reporting, to create graphs or to update templates' data series, and even to populate Microsoft Word documents for finalized reports. This paper explores how SAS can be used to create presentation-ready graphs in a proven process that takes less than one minute, compared to the earlier process that took hours. The following code is used and discussed: %macro(macro_var), filename, rc commands, Output Delivery System (ODS), X4ML, and Microsoft Visual Basic for Applications (VBA).

Read the paper (PDF)

When first learning SAS^®, programmers often see the proprietary DATA step as a foreign and nonstandard concept. The introduction of the SAS^® 9.4 DS2 language eases the transition for traditional programmers delving into SAS for the first time. Object Oriented Programming (OOP) has been an industry mainstay for many years, and the DS2 procedure provides an object-oriented environment for the DATA step. In this poster, we go through a business case to show how DS2 can be used to define a reusable package following object-oriented principles.

View the e-poster or slides (PDF)

As a data scientist, you need analytical tools and algorithms, whether commercial or open source, and you have some favorites. But how do you decide when to use what? And how can you integrate their use to your maximum advantage? This presentation provides several best practices for deploying both SAS^® and open-source analytical tools to increase productivity and efficiency in your enterprise ecosystem. See an example of a marketing analysis using SAS and R algorithms in SAS^® Enterprise Miner to develop a predictive model, and then operationalize that model for performance monitoring and in-database scoring. Also learn about using Python and SAS integration for developing predictive models from a Jupyter Notebook environment. Seeing these cases will help you decide how to improve your analytics with similar integration of SAS and open source.

Read the paper (PDF)

Many communication channels exist for customers to engage with businesses, yet an interactive voice response (IVR) system remains the most critical of them. The reason is is because IVR acts as the front end to consumer interaction and is the most effective method for customers to do business with companies in order to resolve their issues before talking to an agent. If the IVR interface is not designed properly, customers can be stuck in an endless loop of pressing buttons that can lead to consumer annoyance. The bottom line is: An IVR system should be set up to quickly resolve as many routine inbound inquires as possible and to allow customers to speak to an agent when necessary. In order to accomplish this, the IVR interface has to be optimized so that it is fully effective and provides a great customer experience. This paper demonstrates how SAS^® tools helped optimize the IVR system of a book publishing company. The data set used in this study was obtained from a telecom services company and contained IVR logs of more than 300,000 calls with 1.4 million observations. To gain insights into customer behaviors, path analysis was performed on this data using SAS^® Enterprise Miner and obstacles faced by customers were identified. This helped in determining underperforming prompts, and analysis using SAS procedures was conducted on such prompts. Prompts tuning was recommended and new self-service areas were identified that avoid transfers and can save clients thousands of dollars in investments in call centers.

Read the paper (PDF)

People typically invest in more than one stock to help diversify their risk. These stock portfolios are a collection of assets that each have their own inherit risk. If you know the future risk of each of the assets, you can optimize how much of each asset to keep in the portfolio. The real challenge is trying to evaluate the potential future risk of these assets. Different techniques provide different forecasts, which can drastically change the optimal allocation of assets. This talk presents a case study of portfolio optimization in three different scenarios historical standard deviation estimation, capital asset pricing model (CAPM), and GARCH-based volatility modeling. The structure and results of these three approaches are discussed.

Read the paper (PDF)

Financial institutions are faced with a common challenge to meet the ever-increasing demand from regulators to monitor and mitigate money laundering risk. Anti-Money Laundering (AML) Transaction Monitoring systems produce large volumes of work items, most of which do not result in quality investigations or actionable results. Backlogs of work items have forced some financial institutions to contract staffing firms to triage alerts spanning back months. Moreover, business analysts struggle to define interactions between AML models and to explain what attributes make a model productive. There is no one approach to solve this issue. Analysts need several analytical tools to explore model relationships, improve existing model performance, and add coverage for uncovered risk. This paper demonstrates an approach to improve existing AML models and focus money laundering investigations on cases that are more likely to be productive using analytical SAS^® tools including SAS^® Visual Analytics, SAS^® Enterprise Miner , SAS^® Studio, SAS/STAT^® software, and SAS^® Enterprise Guide^®.

Read the paper (PDF)

Optimizing delivery routes and efficiently using delivery drivers are examples of classic problems in Operations Research, such as the Traveling Salesman Problem. In this paper, Oberweis and Zencos collaborate to describe how to leverage SAS/OR^® procedures to solve these problems and optimize delivery routes for a retail delivery service. Oberweis Dairy specializes in home delivery service that delivers premium dairy products directly to customers homes. Because freshness is critical to delivering an excellent customer experience, Oberweis is especially motivated to optimize their delivery logistics. As Oberweis works to develop an expanding footprint and a growing business, Zencos is helping to ensure that delivery routes are optimized and delivery drivers are used efficiently.

Read the paper (PDF)

Making optimal use of SAS^® Grid Computing relies on the ability to spread the workload effectively across all of the available nodes. With SAS^® Scalable Performance Data Server (SPD Server), it is possible to partition your data and spread the processing across the SAS Grid Computing environment. In an ideal world it would be possible to adjust the size and number of partitions according to the data volumes being processed on any given day. This paper discusses a technique that enables the processing performed in the SAS Grid Computing environment to be dynamically reconfigured, automatically at run time, to optimize the use of SAS Grid Computing, and to provide significant performance benefits.

Read the paper (PDF)

Today, companies are increasingly using analytics to discover new revenue and cost-saving opportunities. Many business professionals turn to SAS, a leader in business analytics software and service, to help them improve performance and make better decisions faster. Analytics is also being used in risk management, fraud detection, life sciences, sports, and many more emerging markets. However, to maximize the value to the business, analytics solutions need to be deployed quickly and cost-effectively, while also providing the ability to readily scale without degrading performance. Of course, in today's demanding environments, where budgets are still shrinking and mandates to reduce carbon footprints are growing, the solution must deliver excellent hardware utilization, power efficiency, and return on investment. To help solve some of these challenges, Red Hat and SAS have collaborated to recommend the best practices for configuring SAS^®9 running on Red Hat Enterprise Linux. The scope of this document covers Red Hat Enterprise Linux 6 and 7. Areas researched include the I/O subsystem, file system selection, and kernel tuning, both in bare metal and virtualized (KVM) environments. Additionally, we now include grid-based configurations running with Red Hat Resilient Storage Add-On (Global File System 2 [GFS2] clusters).

Read the paper (PDF)

Whether you are a current SAS^® Marketing Optimization user who wants to fine tune your scenarios, a SAS^® Marketing Automation user who wants to understand more about how SAS Marketing Optimization might improve your campaigns, or completely new to the world of marketing optimizations, this session covers ideas and insights for getting the highest strategic impact out of SAS Marketing Optimization. SAS Marketing Optimization is powerful analytical software, but like all software, what you get out is largely predicated by what you put in. Building scenarios is as much an art as it is a science, and how you build those scenarios directly impacts your results. What questions should you be asking to establish the best objectives? What suppressions should you consider? We develop and compare multiple what-if scenarios and discuss how to leverage SAS Marketing Optimization as a business decisioning tool in order to determine the best scenarios to deploy for your campaigns. We include examples from various industries including retail, financial services, telco, and utilities. The following topics are discussed in depth: establishing high-impact objectives, with an emphasis on setting objectives that impact organizational key performance indicators (KPIs); performing and interpreting sensitivity analysis; return on investment (ROI); evaluating opportunity costs; and comparing what-if scenarios.

Read the paper (PDF)

The analytical data life cycle consists of 4 stages: data exploration, preparation, model development, and model deployment. Traditionally, these stages can consume 80% of the time and resources within your organization. With innovative techniques such as in-database and in-memory processing, managing data and analytics can be streamlined, with an increase in performance, economics, and governance. This session explores how you can optimize the analytical data life cycle with some best practices and tips using SAS^® and Teradata.

Outliers, such as unusual, violated, unexpected or rare events, have been focused on intensively by researchers and practitioners, providing their impacts on estimated statistics and developed models. Today, some business disciplines are focusing primarily on outliers such as defaults of credit, operational risks, quality nonconformities, fraud, or even the results of marketing initiatives in highly competitive environments with low response rates of a couple percent or even less. This paper discusses the importance of detecting, isolating, and categorizing business outliers to discover their root causes and to monitor them dynamically. Addressing not only extreme values or multivariable densities detecting outliers, but also addressing distributions, patterns, clusters, combinations of items, and sequences of events will allow for opportunities to be established for business improvement. SAS^® Enterprise Miner can be used to perform such detections. Thus, creating special business segments or running specialized outlier oriented data mining processes, such as decision trees, allows for isolation of business important outliers, which are normally masked in traditional statistical techniques. This process combined with 'What-If' scenario generation prepares businesses for future possible surges even when having no current specific type outliers. Furthermore, analyzing some specific outliers may play a role in assessing business stability to corresponding stress tests.

Read the paper (PDF)

The DATASETS procedure provides the most diverse selection of capabilities and features of any of the SAS^® procedures. It is the prime tool that programmers can use to manage SAS data sets, indexes, catalogs, and so on. Many SAS programmers are only familiar with a few of PROC DATASETS's many capabilities. Most often, they only use the data set updating, deleting, and renaming capabilities. However, there are many more features and uses that should be in a SAS programmer's toolkit. This paper highlights many of the major capabilities of PROC DATASETS. It discusses how it can be used as a tool to update variable information in a SAS data set; provide information about data set and catalog contents; delete data sets, catalogs, and indexes; repair damaged SAS data sets; rename files; create and manage audit trails; add, delete, and modify passwords; add and delete integrity constraints; and more. The paper contains examples of the various uses of PROC DATASETS that programmers can cut and paste into their own programs as a starting point. After reading this paper, a SAS programmer will have practical knowledge of the many different facets of this important SAS procedure.

Read the paper (PDF)

In this paper, we explore advantages of the DS2 procedure over the DATA step programming in SAS^®. DS2 is a new SAS proprietary programming language appropriate for advanced data manipulation. We explore the use of PROC DS2 to execute queries in databases using SAS FedSQL. Several DS2 language elements accept embedded FedSQL syntax, and the run-time generated queries can exchange data interactively between DS2 and the supported database. This action enables SQL preprocessing of input tables, which effectively allows processing data from multiple tables in different databases within the same query, thereby drastically reducing processing times and improving performance. We explore use of DS2 for creating tables, bulk loading tables, manipulating tables, and querying data in an efficient manner. We explore advantages of using PROC DS2 over DATA step programming such as support for different data types, ANSI SQL types, programming structure elements, and benefits of using new expressions or writing one's own methods or packages available in the DS2 system. The DS2 procedure enables requests to be processed by the DS2 data access technology that supports a scalable, threaded, high-performance, and standards-based way to access, manage, and share relational data. In the end, we empirically measure performance benefits of using PROC DS2 over PROC SQL for processing queries in-database by taking advantage of threaded processing in supported databases such as Oracle.

Read the paper (PDF)

Multicategory logit models extend the techniques of logistic regression to response variables with three or more categories. For ordinal response variables, a cumulative logit model assumes that the effect of an explanatory variable is identical for all modeled logits (known as the assumption of proportional odds). Past research supports the finding that as the sample size and number of predictors increase, it is unlikely that proportional odds can be assumed across all predictors. An emerging method to effectively model this relationship uses a partial proportional odds model, fit with unique parameter estimates at each level of the modeled relationship only for the predictors in which proportionality cannot be assumed. First used in SAS/STAT^® 12.1, PROC LOGISTIC in SAS^® 9.4 now extends this functionality for variable selection methods in a manner in which all equal and unequal slope parameters are available for effect selection. Previously, the statistician was required to assess predictor non-proportionality a priori through likelihood tests or subjectively through graphical diagnostics. Following a review of statistical methods and limitations of other commercially available software to model data exhibiting non-proportional odds, a public-use data set is used to examine the new functionality in PROC LOGISTIC using stepwise variable selection methods. Model diagnostics and the improvement in prediction compared to a general cumulative model are noted.

Read the paper (PDF) | Download the data file (ZIP) | View the e-poster or slides (PDF)

Real workflow dependencies exist when the completion or output of one data process is a prerequisite for subsequent data processes. For example, in extract, transform, load (ETL) systems, the extract must precede the transform and the transform must precede the load. This serialization is common in SAS^® data analytic development but should be implemented only when actual dependencies exist. A false dependency, by contrast, exists when the workflow itself does not require serialization but is coded in a manner that forces a process to wait unnecessarily for some unrelated process to complete. For example, an ETL system might extract, transform, and load one data set, and then extract, transform, and load a second data set, causing processing of the second data set to wait unnecessarily for the first to complete. This hands-on session demonstrates three common patterns of false dependencies, teaching SAS practitioners how to recognize and remedy false dependencies through parallel processing paradigms. Groups of participants are pitted against each other, as the class simultaneously runs both serialized software and distributed software that runs in parallel. Participants execute exercises in unison, and then watch their machines race to the finish as the tremendous performance advantages of parallel processing are demonstrated in one exercise after another--ideal for anyone seeking to walk away with proven techniques that can measurably increase your performance and bonus.

Read the paper (PDF)

JavaScript Object Notation (JSON) has quickly become the de facto standard for data transfer on the Internet due to an increase in web data and the usage of full-stack JavaScript. JSON has become dominant in the emerging technologies of the web today, such as in the Internet of Things and in the mobile cloud. JSON offers a light and flexible format for data transfer. It can be processed directly from JavaScript without the need for an external parser. This paper discusses several abilities within SAS^® to process JSON files, the new JSON LIBNAME, and several procedures. This paper compares all of these in detail.

Read the paper (PDF)

As a programmer specializing in tracking systems for research projects, I was recently given the task of implementing a newly developed, web-based tracking system for a complex field study. This tracking system uses an SQL database on the back end to hold a large set of related tables. As I learned about the new system, I found that there were deficiencies to overcome to make the system work on the project. Fortunately, I was able to develop a set of utilities in SAS^® to bridge the gaps in the system and to integrate the system with other systems used for field survey administration on the project. The utilities helped to do the following: 1) connect schemas and compare cases across subsystems; 2) compare the statuses of cases across multiple tracked processes; 3) generate merge input files to be used to initiate follow-up activities; 4) prepare and launch SQL stored procedures from a running SAS job; and 5) develop complex queries in Microsoft SQL Server Management Studio and making them run in SAS. This paper puts each of these needs into a larger context by describing the need that is not addressed by the tracking system. Then, each program is explained and documented, with comments.

Read the paper (PDF)

Organizations that create and store personally identifiable information (PII) are often required to de-identify sensitive data to protect an individual s privacy. There are multiple methods in SAS^® that can be used to de-identify PII depending on data types and encryption needs. The first method is to apply crosswalk mapping by linking a data set with PII to a secured data set that contains the PII and its corresponding surrogate. Then, the surrogate replaces the PII in the original data set. A second method is SAS encryption, which involves translating PII into an encrypted string using SAS functions. This could be a one-byte-to-one-byte swap or a one-byte-to-two-byte swap. The third method is in-database encryption, which encrypts the PII in a data warehouse, such as Oracle and Teradata, using SAS tools before any information is imported into SAS for users to see. This paper discusses the advantages and disadvantages of these three methods, provides sample SAS code, and describes the corresponding methods to decrypt the encrypted data.

Read the paper (PDF) | View the e-poster or slides (PDF)

Moving a workforce in a new direction takes a lot of energy. Your planning should include four pillars: culture, technology, process, and people. These pillars assist small and large SAS^® rollouts with a successful implementation and an eye toward future proofing. Boston Scientific is a large multi-national corporation that recently grew SAS from a couple of desktops to a global implementation. Boston Scientific's real world experiences reflect on each pillar, both in what worked and in lessons learned.

Read the paper (PDF)

Installation and configuration of a SAS^® Enterprise BI platform in the requirements of the today's world requires knowledge on a wide variety of subjects. Security requirements are growing, the number of involved components is growing, time to delivery should be shorter, and the quality must be increased. The expectations of the customers are based on a cloud experience where automated deployments with ready-to-use applications are state of the art. This paper describes an approach to address the challenges to deploy SAS^® 9.4 on Linux to meet today's customer expectations.

Read the paper (PDF)

For customers providing SAS^® reporting to the public, the ability to use a Social login opens up a number of possibilities to provide richer services. Instead of everybody using generic Guest access and being limited to a common subset of reports or other functionality, previously unknown users can seamlessly log in and access SAS web content while SAS administrators can continue to apply best-practice security. This paper focuses on integrating Google Sign-In, Microsoft Account Sign-In, and Facebook Sign-In as alternative methods to log in from the SAS Logon Manager, as well as registering any new users SAS metadata automatically.

Read the paper (PDF)

SAS^® Decision Manager includes a hidden gem: a web service for high-speed online scoring of business events. The fourth maintenance release of SAS^® 9.4 represents the third release of the SAS^® Micro Analytics Service for scoring SAS^® DS2 code decisions in a standard JSON web service. Users will learn how to create decisions, deploy modules to the web service, test the service, and record business events.

Read the paper (PDF)

Are secondary schools in the United States hiring enough qualified math teachers? In which regions is there a disparity of qualified teachers? Data from an extensive survey conducted by the National Center for Education Statistics (NCES) was used for predicting qualified secondary school teachers across public schools in the US. The three criteria examined to determine whether a teacher is qualified to teach a given subject are: 1) Whether the teacher has a degree in the subject he or she is teaching 2) Whether he or she has a teaching certification in the subject 3) Whether he or she has five years of experience in the subject. A qualified teacher is defined as one who has all three of the previous qualifications. The sample data included socioeconomic data at the county level, which was used as predictors for hiring a qualified teacher. Data such as the number of students on free or reduced lunch at the school was used to assign schools as high-needs or low-needs schools. Other socioeconomic factors included were the income and education levels of working adults within a given school district. Some of the results show that schools with higher-needs students (a school that has more than 40% of the students on some form of reduced lunch program) have less-qualified teachers. The resultant model is used to score other regions and is presented on a heat map of the US. SAS^® procedures such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC are used.

View the e-poster or slides (PDF)

Research using electronic health records (EHR) is emerging, but questions remain about its completeness, due in part to physicians' time to enter data in all fields. This presentation demonstrates the use of SAS^® Enterprise Miner to predict completeness of clinical data using claims data as the standard 'source of truth' against which to compare it. A method for assessing and predicting the completeness of clinical data is presented using the tools and techniques from SAS Enterprise Miner. Some of the topics covered include: tips for preparing your sample data set for use in SAS Enterprise Miner; tips for preparing your sample data set for modeling, including effective use of the Input Data, Data Partition, Filter, and Replacement nodes; and building predictive models using Stat Explore, Decision Tree, Regression, and Model Compare nodes.

View the e-poster or slides (PDF)

The most commonly reported model evaluation metric is the accuracy. This metric can be misleading when the data are imbalanced. In such cases, other evaluation metrics should be considered in addition to the accuracy. This study reviews alternative evaluation metrics for assessing the effectiveness of a model in highly imbalanced data. We used credit card clients in Taiwan as a case study. The data set contains 30,000 instances (22.12% risky and 77.88% non-risky) assessing the likeliness of a customer defaulting on a payment. Three different techniques were used during the model building process. The first technique involved down-sampling the majority class in the training subset. The second used the original imbalanced data whereas prior probabilities were set to account for oversampling in the third technique. The same sets of predictive models were then built for each technique after which the evaluation metrics were computed. The results suggest that model evaluation metrics might reveal more about distribution of classes than they do about the actual performance of models when the data are imbalanced. Moreover, some of the predictive models were identified to be very sensitive to imbalance. The final decision in model selection should consider a combination of different measures instead of relying on one measure. To minimize imbalance-biased estimates of performance, we recommend reporting both the obtained metric values and the degree of imbalance in the data.

Read the paper (PDF)

Predictive modeling might just be the single most thrilling aspect of data science. Who among us can deny the allure: to observe a naturally occurring phenomenon, conjure a mathematical model to explain it, and then use that model to make predictions about the future? Though many SAS^® users are familiar with using a data set to generate a model, they might not use the awesome power of SAS to store the model and score other data sets. In this paper, we distinguish between parametric and nonparametric models and discuss the tools that SAS provides for storing and scoring each. Along the way, you come to know the STORE statement and the SCORE procedure. We conclude with a brief overview of the PLM procedure and demonstrate how to effectively load and evaluate models that have been stored during the model building process.

Read the paper (PDF)

This paper compiles information from documents produced by the U.S. Food and Drug Administration (FDA), the Clinical Data Interchange Standards Consortium (CDISC), and Computational Sciences Symposium (CSS) workgroups to identify what analysis data and other documentation is to be included in submissions and where it all needs to go. It not only describes requirements, but also includes recommendations for things that aren't so cut-and-dried. It focuses on the New Drug Application (NDA) submissions and a subset of Biologic License Application (BLA) submissions that are covered by the FDA binding guidance documents. Where applicable, SAS^® tools are described and examples given.

Read the paper (PDF)

Airbnb is the world's largest home-sharing company and has over 800,000 listings in more than 34,000 cities and 190 countries. Therefore, the pricing of their property, done by the Airbnb hosts, is crucial to the business. Setting low prices during a high-demand period might hinder profits, while setting high prices during a low-demand period might result in no bookings at all. In this paper, we suggest a price recommendation methodology for Airbnb hosts that helps in overcoming the problems of overpricing and underpricing. Through this methodology, we try to identify key factors related to Airbnb pricing: factors influential in determining a price for a property; the relation between the price of a property and the frequency of its booking; and similarities among successful and profitable properties. The constraints outlined in the analysis were entered into SAS^® optimization procedures to achieve a best possible price. As a part of this methodology, we built a scraping tool to get details of New York City host user data along with their metrics. Using this data, we build a pricing model to predict the optimal price of an Airbnb home.

Read the paper (PDF)

This paper explores the utilization of medical services, which has a characteristic exponential distribution. Because of this characteristic, a variable generalized linear model can be applied to it to obtain self-managed health plan rates. This approach is different from what is generally used to set the rates of health plans. This new methodology is characterized by capturing qualitative elements of exposed participants that old rate-making methods are not able to capture. Moreover, this paper also uses generalized linear models to estimate the number of days that individuals remain hospitalized. The method is expanded in a project in SAS^® Enterprise Guide^®, in which the utilization of medical services by the base during the years 2012, 2013, 2014, and 2015 (the last year of the base) is compared with the Hospital Cost Index of Variation. The results show that, among the variables chosen for the model, the income variable has an inverse relationship with the risk of health care expenses. Individuals with higher earnings tend to use fewer services offered by the health plan. Male individuals have a higher expenditure than female individuals, and this is reflected in the rate statistically determined. Finally, the model is able to generate tables with rates that can be charged to plan participants for health plans that cover all average risks.

Read the paper (PDF)

Production forecasts that are based on data analytics are able to capture the character of the patterns that are created by past behavior of wells and reservoirs. Future trends are a reflection of past trends unless operating principles have changed. Therefore, the forecasts are more accurate than the monotonous, straight line that is provided by decline curve analysis (DCA). The patterns provide some distinct advantages: they provide a range instead of an absolute number, and the periods of high and low performance can be used for better planning. When used together with DCA, the current method of using data driven production forecasting can certainly enhance the value tremendously for the oil and gas industry, especially in times of volatility in the global oil and gas industry.

View the e-poster or slides (PDF)

The report brings a simple and intuitive overview on behavior of technical provision and rentability of health insurance segments, based on historical data of a major insurance company. The profitability analysis displays indicators consisting of claims, prices, and quantity of insureds and their performance separated by gender, region, and different products. The report's user can simulate more accurate premiums by inputting information about medical costs increasing and target claims rate. The technical provision view identifies the greatest impacts on the provision, such as claims payments, legal expense estimates, and future claims payments and reports. Also, it compares the real health insurance costs with the provision estimated on a previous period. Therefore, the report enables the user to get a unique panorama of health insurance underwriting and evaluate its results in order to make strategic decision for the future.

Read the paper (PDF)

Bayesian inference has become ubiquitous in applied science because of its flexibility in modeling data and advances in computation that allow special methods of simulation to obtain sound estimates when more mathematical approaches are intractable. However, when the sample size is small, the choice of a prior distribution becomes difficult. Computationally convenient choices for prior distributions can overstate prior beliefs and bias the estimates. We propose a simple form of prior distribution, a mixture of two uniform distributions, that is weakly informative, in that the prior distribution has a relatively large standard deviation. This choice leads to closed-form expressions for the posterior distribution if the observed data follow a normal, binomial, or Poisson distribution. The explicit formulas are easily encoded in SAS^®. For a small sample size of 10, we illustrate how to elicit the mixture prior and indicate that the resulting posterior distribution is insensitive to minor misspecification of input values. Weakly informative prior distributions suitable for small sample sizes are easy to specify and appear to provide robust inference.

View the e-poster or slides (PDF)

In a randomized study, subjects are randomly assigned to either a treated group or a control group. Random assignment ensures that the distribution of the covariates is the same in both groups and that the treatment effect can be estimated by directly comparing the outcomes for the subjects in the two groups. In contrast, subjects in an observational study are not randomly assigned. In order to establish causal interpretations of the treatment effects in observational studies, special statistical approaches that adjust for the covariate confounding are required to obtain unbiased estimation of causal treatment effects. One strategy for correctly estimating the treatment effect is based on the propensity score, which is the conditional probability of the treatment assignment given the observed covariates. Prior to the analysis, you use propensity scores to adjust the data by weighting observations, stratifying subjects that have similar propensity scores, or matching treated subjects to control subjects. This paper reviews propensity score methods for causal inference and introduces the PSMATCH procedure, which is new in SAS/STAT^® 14.2. The procedure provides methods of weighting, stratification, and matching. Matching methods include greedy matching, matching with replacement, and optimal matching. The procedure assesses covariate balance by comparing distributions between the adjusted treated and control groups.

Read the paper (PDF)

Predictive analytics has been evolving in property and casualty insurance for the past two decades. This paper first provides a high-level overview of predictive analytics in each of the following core business operations in the property and casualty (P&C) insurance industry: marketing, underwriting, actuarial pricing, actuarial reserving, and claims. Then, a common P&C insurance predictive modeling technical process in SAS^® dealing with large data sets is introduced. The steps of this process include data acquisition, data preparation, variable creation, variable selection, model building (also known as model fitting), model validation, model testing, and so on. Finally, some successful models are introduced. Base SAS^®, SAS/STAT^® software, SAS^® Enterprise Guide^®, and SAS^® Enterprise Miner are presented as the main tools for this process. This predictive modeling process could be tweaked or directly used in many other industries as the statistical foundations of predictive analytics have large overlaps across P&C insurance, health care, life insurance, banking, pharmaceutical, genetics industries, and so on. This paper is intended for any level of SAS^® user or business people from different industries who are interested in learning about general predictive analytics.

Read the paper (PDF)

Face it your data can occasionally contain characters that wreak havoc on your macro code. Characters such as the ampersand in at&t, or the apostrophe in McDonald's, for example. This paper is designed for programmers who know most of the ins and outs of SAS^® macro code already. Now let's take your macro skills a step farther by adding to your skill set, specifically, %BQUOTE, %STR, %NRSTR, and %SUPERQ. What is up with all these quoting functions? When do you use one over the other? And why would you need %UNQUOTE? The macro language is full of subtleties and nuances, and the quoting functions represent the epitome of all of this. This paper shows you in which instances you would use the different quoting functions. Specifically, we show you the difference between the compile-time and the execution-time functions. In addition to looking at the traditional quoting functions, you learn how to use %QSCAN and %QSYSFUNC among other functions that apply the regular function and quote the result.

Read the paper (PDF)

A recurring problem with large research databases containing sensitive information about an individual's health, financial, and personal information is how to make meaningful extracts available to qualified researchers without compromising the privacy of the individuals whose data is in the database. This problem is exacerbated when a large number of extracts need to be made from the database. In addition to using statistical disclosure control methods, this paper recommends limiting the variables included in each extract to the minimum needed and implementing a method of assigning request-specific randomized IDs to each extract that is both secure and self-documenting.

Read the paper (PDF)

Conferences for SAS^® programming are replete with the newest software capabilities and clever programming techniques. However, discussion about quality control (QC) is lacking. QC is fundamental to ensuring both correct results and sound interpretation of data. It is not industry specific, and it simply makes sense. Most QC procedures are a function of regulatory requirements, industry standards, and corporate philosophies. Good QC goes well beyond just reviewing results, and should also consider the underlying data. It should be driven by a thoughtful consideration of relevance and impact. While programmers strive to produce correct results, it is no wonder that programming mistakes are common despite rigid QC processes in an industry where expedited deliverables and a lean workforce are the norm. This leads to a lack of trust in team members and an overall increase in resource requirements as these errors are corrected, particularly when SAS programming is outsourced. Is it possible to produce results with a high degree of accuracy, even when time and budget are limited? Thorough QC is easy to overlook in a high-pressure environment with increased expectations of workload and expedited deliverables. Does this suggest that QC programming is becoming a lost art, or does it simply suggest that we need to evolve with technology? The focus of the presentation is to review the who, what, when, how, why, and where of QC programming implementation.

Read the paper (PDF)

SQL is a universal language that allows you to access data stored in relational databases or tables. This hands-on workshop presents core concepts and features of using PROC SQL to access data stored in relational database tables. Attendees learn how to define, access, and manipulate data from one or more tables using PROC SQL quickly and easily. Numerous code examples are presented on how to construct simple queries, subset data, produce simple and effective output, join two tables, summarize data with summary functions, construct BY-groups, identify FIRST. and LAST. rows, and create and use virtual tables.

Read the paper (PDF) | Download the data file (ZIP)

SAS^® Enterprise Guide^® empowers organizations, programmers, business analysts, statisticians, and end users with all the capabilities that SAS has to offer. This hands-on workshop presents the SAS Enterprise Guide graphical user interface (GUI). It covers access to multi-platform enterprise data sources, various data manipulation techniques that do not require you to learn complex coding constructs, built-in wizards for performing reporting and analytical tasks, the delivery of data and results to a variety of mediums and outlets, and support for data management and documentation requirements. Attendees learn how to use the graphical user interface to access SAS^® data sets and tab-delimited and Microsoft Excel input files; to subset and summarize data; to join (or merge) two tables together; to flexibly export results to HTML, PDF, and Excel; and to visually manage projects using flow diagrams.

Read the paper (PDF)

SAS^® Enterprise Guide^® empowers organizations, programmers, business analysts, statisticians, and users with all the capabilities that SAS^® has to offer. This hands-on workshop presents the SAS Enterprise Guide graphical user interface (GUI), access to multi-platform enterprise data sources, various data manipulation techniques without the need to learn complex coding constructs, built-in wizards for performing reporting and analytical tasks, the delivery of data and results to a variety of mediums and outlets, and support for data management and documentation requirements. Attendees learn how to use the GUI to access SAS data sets and tab-delimited and Excel input files; how to subset and summarize data; how to join (or merge) two tables together; how to flexibly export results to HTML, PDF, and Excel; and how to visually manage projects using flow diagrams.

Read the paper (PDF) | Download the data file (ZIP)

The announcement of SAS Institute's free SAS^® University Edition is an exciting development for SAS users and learners around the world! The software bundle includes Base SAS^®, SAS/STAT^® software, SAS/IML^® software, SAS^® Studio (user interface), and SAS/ACCESS^® for Windows, with all the popular features found in the licensed SAS versions. This is an incredible opportunity for users, statisticians, data analysts, scientists, programmers, students, and academics everywhere to use (and learn) for career opportunities and advancement. Capabilities include data manipulation, data management, comprehensive programming language, powerful analytics, high-quality graphics, world-renowned statistical analysis capabilities, and many other exciting features. This paper illustrates a variety of powerful features found in the SAS University Edition. Attendees will be shown a number of tips and techniques on how to use the SAS^® Studio user interface, and they will see demonstrations of powerful data management and programming features found in this exciting software bundle.

Read the paper (PDF)

Getting speedy results from your SAS^® programs when you re working with bulky data sets is more than elegant coding techniques. There are several approaches to improving performance when working with biggish data. Although you can upgrade your hardware, this just helps you to run inefficient code and bloated tables quicker. So, you should also consider the results that tuning your database and adjusting your SAS platform can bring. In this paper, we review the various options available to give you some ideas about things you can do better.

Read the paper (PDF)

This session introduces how Equifax uses SAS^® to develop a streamlined model automation process, including development and performance monitoring. The tool increases modeling efficiency and accuracy of the model, reduces error, and generates insights. You can integrate more advanced analytics tools in a later phase. The process can apply in any given business problem in risk and marketing, which helps leaders to make precise and accurate business decisions.

The United States Access Board will soon refresh the Section 508 accessibility standards. The new requirements are based on Web Content Accessibility Guidelines (WCAG) 2.0 and include a total of 38 testable success criteria-16 more than the current requirements. Is your organization ready? Don't worry, the fourth maintenance release for SAS^® 9.4 Output Delivery System (ODS) HTML5 destination has you covered. This paper describes the new accessibility features in the ODS HTML5 destination, explains how to use them, and shows you how to test your output for compliance with the new Section 508 standards.

Read the paper (PDF)

A random forest is an ensemble of decision trees that often produce more accurate results than a single decision tree. The predictions of the individual trees in the forest are averaged to produce a final prediction. The question now arises whether a better or more accurate final prediction cannot be obtained by a more intelligent use of the trees in the forest. In particular, in the way random forests are currently defined, every tree contributes the same fraction to the final result (for example, if there are 50 trees, each tree contributes 1/50th to the final result). This ignores model uncertainty as less accurate trees are treated exactly like more accurate trees. Replacing averaging with Bayesian Model Averaging will give better trees the opportunity to contribute more to the final result, which might lead to more accurate predictions. However, there are several complications to this approach that have to be resolved, such as the computation of an SBC value for a decision tree. Two novel approaches to solving this problem are presented and the results compared to that obtained with the standard random forest approach.

Read the paper (PDF)

SAS^® Management Console has been a key tool to interact with SAS^® Metadata Server. But sometimes users need much more than what SAS Management Console can do. This paper contains a couple of SAS^® macros that can be used in SAS^® Enterprise Guide^® and PC SAS to read SAS metadata. These macros read users, roles, and groups registered in metadata. This paper explains how these macros can be executed in SAS Enterprise Guide and how to change these macros to meet other business requirements. There might be some tools available in the market that can be used to read SAS metadata, but this paper helps in achieving most of them within a SAS client like PC SAS and SAS Enterprise Guide, without requiring any additional plug-ins.

Read the paper (PDF) | View the e-poster or slides (PDF)

AdaBoost (or Adaptive Boosting) is a machine learning method that builds a series of decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. I discuss the AdaBoost methodology, introduce the extension called Real AdaBoost, which is so similar to stepwise weight of evidence logistic regression (SWOELR) that it might offer a framework with which we can understand the power of the SWOELR approach. I discuss the advantages of Real AdaBoost, including variable interaction and adaptive, stage-wise binning, and demonstrate a SAS^® macro that uses Real AdaBoost to generate predictive models.

Read the paper (PDF)

The intrepid Mars Rovers have inspired awe and curiosity and dreams of mapping Mars using SAS/GRAPH^® software. This presentation demonstrates how to import Esri shapefile (SHP) data (using the MAPIMPORT procedure) from sources other than SAS^® and GfK GeoMarketing map data to produce useful (and sometimes creative) maps. Examples include mapping neighborhoods, ZCTA5 areas, postal codes, and of course, Mars. Products used are Base SAS^® and SAS/GRAPH^®. SAS programmers of any skill level will benefit from this presentation.

Read the paper (PDF)

We live in a world of data; small data, big data, and data in every conceivable size between small and big. In today's world, data finds its way into our lives wherever we are. We talk about data, create data, read data, transmit data, receive data, and save data constantly during any given hour in a day, and we still want and need more. So, we collect even more data at work, in meetings, at home, on our smartphones, in emails, in voice messages, sifting through financial reports, analyzing profits and losses, watching streaming videos, playing computer games, comparing sports teams and favorite players, and countless other ways. Data is growing and being collected at such astounding rates, all in the hope of being able to better understand the world around us. As SAS^® professionals, the world of data offers many new and exciting opportunities, but it also presents a frightening realization that data sources might very well contain a host of integrity issues that need to be resolved first. This presentation describes the available methods to remove duplicate observations (or rows) from data sets (or tables) based on the row's values and keys using SAS.

Read the paper (PDF)

At the end of a project, many institutional review boards (IRBs) require project directors to certify that no personally identifiable information (PII) is retained by a project. This paper briefly reviews what information is considered PII and explores how to identify variables containing PII in a given project. It then shows a comprehensive way to ensure that all SAS^® variables containing PII have their values set to NULL and how to use SAS to document that this has been done.

Read the paper (PDF)

Do you ever feel like you email the same reports to the same people over and over and over again? If your customers are anything like mine, you create reports, and lots of them. Our office is using macros, SAS^® email capabilities, and other programming techniques, in conjunction with our trusty contact list, to automate report distribution. Customers now receive the data they need, and only the data they need, on the schedule they have requested. In addition, not having to send these emails out manually saves our office valuable time and resources that can be used for other initiatives. In this session, we walk through a few of the SAS techniques we are using to provide better service to our internal and external partners and, hopefully, make us look a little more like rock stars.

Read the paper (PDF)

If you've got an iPhone, you might have noticed that the Health app is hard at work collecting data on every step you take. And, of course, the data scientist inside you is itching to analyze that data with SAS^®. This paper and an accompanying E-Poster show you how to get step data out of your iPhone Health app and into SAS. Once it's there, you can have at it with all things SAS. In this presentation, we show you how a (what else?) step plot can be used to visualize the 73,000+ steps the author took at SAS^® Global Forum 2016.

Read the paper (PDF) | View the e-poster or slides (PDF)

Using zero inflated beta regression and gradient boosting, a solution to forecast the gross revenue of credit card products was developed. This solution was based on 1) A set of attributes from invoice information. 2) Zero inflated beta regression for forecasts of interchange and revolving revenue (by using PROC NLMIXED and by building data processing routines (with attributes and a target variable)). 3) Gradient boosting models for different product forecasts (annuity, insurance, etc.) using PROC TREEBOOST, exploring its parameters, and creating a routine for selecting and adjusting models. 4) Construction of ranges of revenue for policies and monitoring. This presentation introduces this credit card revenue forecasting solution.

Read the paper (PDF)

From state-of-the-art research to routine analytics, the Jupyter Notebook offers an unprecedented reporting medium. Historically, tables, graphics, and other types of output had to be created separately, and then integrated into a report piece by piece, amidst the drafting of text. The Jupyter Notebook interface enables you to create code cells and markdown cells in any arrangement. Markdown cells allow all typical formatting. Code cells can run code in the document. As a result, report creation happens naturally and in a completely reproducible way. Handing a colleague a Jupyter Notebook file to be re-run or revised is much easier and simpler for them than passing along, at a minimum, two files: one for the code and one for the text. Traditional reports become dynamic documents that include both text and living SAS^® code that is run during document creation. With the new SAS kernel for Jupyter, all of this is possible and more!

Read the paper (PDF)

SAS^® job flows created by Windows services have a problem. Currently, they can execute only jobs in a series (one at a time). This can slow down job processing, and it limits the utility of the flows. This paper shows how you can alter the flow of Windows services after they have been generated to enable jobs to run in parallel (side by side). A high-level overview of PROC GROOVY, which automates these changes, is provided, as well as a summary of the positives and negatives of running jobs in parallel.

Read the paper (PDF) | Download the data file (ZIP)

There are so many ways for SAS/ACCESS^® users to read and write data from and to Microsoft Excel files: SAS^® PC Files Server, XLS and XLSX engines, the SAS IMPORT and EXPORT procedures, various Excel file formats (.xls, .xlsx, .xlsb, .xlsm), and more. Many users ask, 'Which is best for me?' This paper explores the requirements and limitations of each engine, along with performance considerations and some of the not-so-obvious things to consider. It also includes a brief analogous discussion on Microsoft Access databases, which share some of the same mechanisms.

Read the paper (PDF)

SAS^® has an amazing arsenal of tools for using and displaying geographic information that are relatively unknown and underused. High-quality GfK GeoMarketing maps have been provided by SAS since the second maintenance release for SAS^® 9.3, as sources for inexpensive map data dried up. SAS has been including both GfK and traditional SAS map data sets with licenses for SAS/GRAPH^® software for some time, recognizing there will need to be an extended transitional period. However, for those of us who have been putting off converting our SAS/GRAPH mapping programs to use the new GfK maps, the time has come, as the traditional SAS map data sets are no longer being updated. If you visit SAS^® Maps Online, you can find only GfK maps in current maps. The GfK maps are updated once a year. This presentation walk through the conversion of a long-standing SAS program to produce multiple US maps for a data compendium to take advantage of GfK maps. Products used are Base SAS^® and SAS/GRAPH^®. SAS programmers of any skill level will benefit from this presentation.

Read the paper (PDF)

One of the many difficulties for a SAS^® programmer is remembering how to accurately use SAS syntax, especially syntax that includes many parameters. Not mastering basic syntax parameters definitely makes coding inefficient because the programmer has to check reference manuals constantly to ensure that syntax is correct. One of the more useful but somewhat unknown tools in SAS is the use of SAS abbreviations. This feature enables users to store text strings (such as the syntax of a DATA step function, a SAS procedure, or a complete DATA step) in a user-defined and easy-to-remember abbreviation. When this abbreviation is entered in the enhanced editor, SAS automatically brings up the corresponding stored syntax. Knowing how to use SAS abbreviations is beneficial to programmers with varying levels of SAS expertise. In this paper, various examples of using SAS abbreviations are demonstrated.

Have you heard of SAS^® Customer Intelligence 360, the program for creating a digital marketing SasS offering on a multi-tenant SAS cloud? Were you mesmerized by it but found it overwhelming? Did you tell yourself, I wish someone would show me how to do this ? This paper is for you. This paper provides you with an easy, step-by-step procedure on how to create a successful digital web, mobile, and email marketing campaign. In addition to these basics, the paper points to resources that allow you to get deeper into the application and customize each object to satisfy your marketing needs.

Read the paper (PDF)

SAS^® Data Integration Studio jobs are not always linear. While Loop transformations have been part of SAS Data Integration Studio for ages, only more recently has SAS Data Integration Studio included the Conditional Control transformations to control logic flow within a job. This paper demonstrates the use of both the Loop and Conditional transformations in a real world example.

Read the paper (PDF)

A common issue in data integration is that often the documentation and the SAS^® data integration job source code start to diverge and eventually become out of sync. At Capgemini, working for a specific client, we developed a solution to rectify this challenge. We proposed moving all necessary documentation into the SAS^® Data Integration Studio job itself. In this way, all documentation then becomes part of the metadata we have created, with the possibility of automatically generating Job and Release documentation from the metadata. This presentation therefore focuses on the metadata documentation generator. Specifically, this presentation: 1) looks at how to use programming and documentation standards in SAS data integration jobs to enable the generation of documentation from the metadata; and 2) shows how the documentation is generated from the metadata, and the challenges that were encountered creating the code. I draw on our hands-on experience; Capgemini has implemented this for a customer in the Netherlands, and we are rolling this out as an accelerator in other SAS data integration projects worldwide. I share examples of the generated documentation, which contains functional and technical designs, including a list with all source tables, a list with the target tables, all transformations with their own documentation, job dependencies, and more.

Read the paper (PDF)

The hash object provides an efficient method for quick data storage and data retrieval. Using a common set of lookup keys, hash objects can be used to retrieve data, store data, merge or join tables of data, and split a single table into multiple tables. This paper explains what a hash object is and why you should use hash objects, and provides basic programming instructions associated with the construction and use of hash objects in a DATA step.

Read the paper (PDF)

SAS^® In-Memory Analytics for Hadoop is an analytical programming environment that enables a user to use many components of an analytics project in a single environment, rather than switching between different applications. Users can easily prepare raw data for different types of analytics procedures. These techniques explore the data to enhance the information extractions. They can apply a large variety of statistical and machine learning techniques to the data to compare different analytical approaches. The model comparison capabilities let them quickly find the best model, which they can deploy and score in the Hadoop environment. All of these different components of the analytics project are supported in a distributed in-memory environment for lightning-fast processing. This paper highlights tips for working with the interaction between Hadoop data and for dealing with SAS^® LASR Analytic Server. It contains multiple scenarios with elementary but pragmatic approaches that enable SAS^® programmers to work efficiently within the SAS^® In-Memory Analytics environment.

Read the paper (PDF) | View the e-poster or slides (PDF)

Binary logistic regression models are widely used in CRM (customer relationship management) or credit risk modeling. In these models, it is common to use nominal, ordinal, or discrete (NOD) predictors. NOD predictors typically are binned (reducing the number of their levels) before usage in a logistic model. The primary purpose of binning is to obtain parsimony without greatly reducing the strength of association of the predictor X to the binary target Y. In this paper, two SAS^® macros are discussed. The %NOD_BIN macro bins predictors with nominal values (and ordinal and discrete values) by collapsing levels to maximize information value (IV). The %ORDINAL_BIN macro is applied to predictors that are ordered and in which collapsing can occur only for levels that are adjacent in the ordering of X. The %ORDINAL_BIN macro finds all possible binning solutions by complete enumeration. Solutions are ranked by IV, and monotonic solutions are identified.

Read the paper (PDF)

Mediation analysis is a statistical technique for investigating the extent to which a mediating variable transmits the relation of an independent variable to a dependent variable. Because it is useful in many fields, there have been rapid developments in statistical mediation methods. The most cutting-edge statistical mediation analysis focuses on the causal interpretation of mediated effect estimates. Cause-and-effect inferences are particularly challenging in mediation analysis because of the difficulty of randomizing subjects to levels of the mediator (MacKinnon, 2008). The focus of this paper is how incorporating longitudinal measures of the mediating and outcome variables aides in the causal interpretation of mediated effects. This paper provides useful SAS^® tools for designing adequately powered studies to detect the mediated effect. Three SAS macros were developed using the powerful but easy-to-use REG, CALIS, and SURVEYSELECT procedures to do the following: (1) implement popular statistical models for estimating the mediated effect in the pretest-posttest control group design; (2) conduct a prospective power analysis for determining the required sample size for detecting the mediated effect; and (3) conduct a retrospective power analysis for studies that have already been conducted and a required sample to detect an observed effect is desired. We demonstrate the use of these three macros with an example.

Read the paper (PDF)

The SAS^® macro language provides a powerful tool to write a program once and reuse it many times in multiple places. A repeatedly executed section of a program can be wrapped into a macro, which can then be shared among many users. A practical example of a macro can be a utility that takes in a set of input parameters, performs some calculations, and sends back a result (such as an interest calculator). In general, a macro modularizes a program into smaller and more manageable sections, and encapsulates repetitive tasks into re-usable code. Modularization can help the code to be tested independently. This paper provides an introduction to writing macros. It introduces the user to the basic macro constructs and statements. This paper covers the following advanced macro subjects: 1) using multiple &s to retrieve/resolve the value of a macro variable; 2) creating a macro variable from the value of another macro variable; 3) handling special characters; 4) the EXECUTE statement to pass a DATA step variable to a macro; 5) using the Execute statement to invoke a macro; and 6) using %RETURN to return a variable from a macro.

Read the paper (PDF)

Every year, a tragically high number of Americans are killed in a gun-related accident, suicide, or homicide. With the idea that many of these deaths could have easily been prevented or are the result of complex social issues, the topic of gun mortality has recently become more prevalent in our society. Through our analysis, we focus on key demographic variables such as race, age, marital status, education, and sex to see how gun mortality trends vary among different groups of people. Statistical procedures used include logistic regression, random forests procedure, chi-square tests, and multiple graphs to present the primarily categorical data in a meaningful way. This analysis can provide useful foundational knowledge for policy leaders, gun owners, and public policy leaders, so that gun and firearm reform can be approached in the most efficient, impactful way. We hope to inspire others to look deeper into the issue of gun mortality that plagues our nation today.

Read the paper (PDF)

The purpose of this paper is to provide an overview of SAS^® metadata security for new or inexperienced SAS administrators. The focus of the discussion is on identifying the most common metadata security objects such as access control entries (ACEs), access control templates (ACTs), metadata folders, authentication domains, and so on, and on describing how these objects work together to secure the SAS environment. Based on a standard SAS^® Enterprise Office Analytics for Midsize Business installation in a Windows environment, this paper walks through a simple example of securing a metadata environment, which demonstrates how security is prioritized, the impact of each security layer, and how conflicts are resolved.

Read the paper (PDF)

You have got your SAS^® environments installed, configured, and running smoothly. Time to relax and put your feet up, right? Not so fast! There is still one more leg to go on your security journey. After the deployment of your initial security plan, the security audit process provides active and regular monitoring and ensures that your environment remains secure. There are many reasons to carry out security audits: to ensure regulatory compliance, to maintain business confidence, and to keep your SAS platform as per the design specifications. This paper looks at some of the available ways to regularly review your environment to ensure that protected resources are not at risk, to comply with security auditing requirements, and to quickly and easily answer the question 'Who has access to what?' through efficient SAS metadata security management using Metacoda software.

Read the paper (PDF)

After you know the basics of SAS^® Visual Analytics, you realize that there are some situations that require unique strategies. Sometimes tables are not structured correctly or become too large for the environment. Maybe creating the right custom calculation for a dashboard can be confusing. Geospatial data is hard to work with if you haven't ever used it before. We studied hundreds of SAS^® Communities posts for the most common questions. These solutions (and a few extras) were extracted from the newly released book titled 'An Introduction to SAS^® Visual Analytics: How to Explore Numbers, Design Reports, and Gain Insight into Your Data'.

Read the paper (PDF)

Web Intelligence is a business objects web-based application used by the FDA for accessing and querying data files, and ultimately creating reports from multiple databases. The system allows querying of different databases using common business terms, and in the case of the FDA's Center for Food Safety and Applied Nutrition (CFSAN), careful review of dietary supplement information. However, in order to create timely and efficient reports for detection of safety signals leading to adverse events, a more efficient system is needed to obtain and visually display the data. Using SAS^® Visual Analytics and SAS^® Enterprise Guide^® can assist with timely extraction of data from multiple databases commonly used by CFSAN and create a more user friendly interface for management's review to help make key decisions in prevention of adverse events for the public.

Read the paper (PDF)

Not only does the new SAS^® Viya platform bring exciting advancements in high-performance analytics, it also takes a revolutionary step forward in the area of administration. The new SAS^® Cloud Analytic Services is accompanied by new platform management tools and techniques that are designed to ease the administrative burden while leveraging the open programming and visual interfaces that are standard among SAS Viya applications. Learn about the completely rewritten SAS^® Environment Manager 3.2, which supports the SAS Viya platform. It includes a cleaner HTML5-based user interface, more flexible and intuitive authorization windows, and user and group management that is integrated with your corporate Lightweight Directory Access Protocol (LDAP). Understand how authentication works in SAS Viya without metadata identities. Discover the key differences between SAS^®9 and SAS Viya deployments, including installation and automated update-in-place strategies orchestrated by Ansible for hot fixes, maintenance, and new product versions alike. See how the new microservices and stateful servers are managed and monitored. In general, gain a better understanding of the components of the SAS Viya architecture, and how they can be collectively managed to keep your environment available, secure, and performant for the users and processes you support.

Read the paper (PDF)

The fourth maintenance release for SAS^® 9.4 and the new SAS^® Viya platform bring even more progress with respect to the interoperability between SAS^® and Hadoop the industry standard for big data. This talk brings you up-to-date with where we are: more distributions, more data types, more options and then there is the cloud. Come and learn about the exciting new developments for blending your SAS processing with your shared Hadoop cluster.

Read the paper (PDF)

The SAS^® platform with Unicode's UTF-8 encoding is ready to help you tackle the challenges of dealing with data in multiple languages. In today's global economy, software needs are changing. Companies are globalizing and consolidating systems from various parts of the world. Software must be ready to handle data from social media, international web pages, and databases that have characters in many different languages. SAS makes migrating your data to Unicode a snap! This paper helps you move smoothly from your legacy SAS environment to the powerful SAS Unicode environment with UTF-8 support. Along the way, you will uncover secrets to successfully manipulate your characters, so that all of your data remains intact.

Read the paper (PDF)

Hospital Information Technologists are faced with a dilemma: how to get the many pharmacy databases, dynamic data sets, and software systems to communicate with each other and generate useful, automated, real-time output. SAS^® serves as a unifying tool for our hospital pharmacy. It brings together data from multiple sources, generates output in multiple formats, analyzes trends, and generates summary reports to meet workload, quality, and regulatory requirements. Data sets originate from multiple sources, including drug and device wholesalers, web-based drug information systems, dumb machine output, pharmacy drug-dispensing platforms, hospital administration systems, and others. SAS output includes CSV files that can be read by dispensing machines, report output for Pharmacy and Therapeutics committees, graphs to summarize year-to-year dispensing and quality trends, emails to customers with inventory and expiry date notifications, investigational drug information summaries for hospital staff, inventory trending with restock alerts, and quality assurance summary reports. For clinical trial support, additional output includes randomization codes, data collection forms, blinded enrollment summaries, study subject assignment lists, and others. For business operations, output includes invoices, shipping documents, and customer metrics. SAS brings our pharmacy information systems together and supports an efficient, cost-effective, flexible, and reliable workflow.

Read the paper (PDF) | View the e-poster or slides (PDF)

The purpose of this paper is to show a SAS^® macro named %SURVEYGENMOD developed in a SAS/IML^® procedure as an upgrade of macro %SURVEYGLM developed by Silva and Silva (2014) to deal with complex survey design in generalized linear models (GLMs). The new capabilities are the inclusion of negative binomial distribution, zero-inflated Poisson (ZIP) model, zero-inflated negative binomial (ZINB) model, and the possibility to get estimates for domains. The R function svyglm (Lumley, 2004) and Stata software were used as background, and the results showed that estimates generated by the %SURVEYGENMOD macro are close to the R function and Stata software.

Read the paper (PDF)

SAS^® Visual Analytics provides a complete platform for analytics visualization and exploration of the data. There are several interactive visualizations such as charts, histograms, heat maps, decision tree, and Sankey diagrams. A Sankey diagram helps in performing path analytics and offers a better understanding of complex data. It is a graphic illustration of flows from one set of values to another as a series of paths, where the width of each flow represents the quantity. It is a better and more efficient way to illustrate which flows represent advantages and which flows are responsible for the disadvantages or losses. Sankey diagrams are named after Matthew Henry Phineas Riall Sankey, who first used this in a publication on energy efficiency of a steam engine in 1898. This paper begins with information regarding the essentials or parts of Sankey: nodes, links, drop-off links, and path. Later, the paper explains the method for creating a meaningful visualization (with the help of examples) with a Sankey diagram by looking into the data roles and properties, describing ways to manage the path selection, exploring the transaction identifier values for a path selection, and using the spotlight tool to view multiple data tips in SAS Visual Analytics. Finally, the paper provides recommendation and tips to work effectively and efficiently with the Sankey diagram.

Read the paper (PDF)

With the proliferation of analytics expanding across every function of the enterprise, the need for broader access to data by experienced data scientists and non-technical users to produce reports and do discovery is growing exponentially. The unintended consequence of this trend is a bottleneck within IT to deliver the necessary data while still maintaining the necessary governance and data security standards required to safeguard this critical corporate asset. This presentation illustrates how organizations are solving this challenge and enabling users to both access larger quantities of existing data and add new data to their own models without negatively impacting the quality, security, or cost to store that data. It also highlights some of the cost and performance benefits achieved by enabling self-service data management.

Self-driving cars are no longer a futuristic dream. In the recent past, Google has launched a prototype of the self-driving car, while Apple is also developing its own self-driving car. Companies like Tesla have just introduced an Auto Pilot version in their newer version of electric cars which have created quite a buzz in the car market. This technology is said to enable aging or disabled people to remain mobile, while also increasing overall traffic saftery. But many people are still skeptical about the idea of self-driving cars, and that's our area of interest. In this project, we plan to do sentiment analysis on thoughts voiced by people on the Internet about self-driving cars. We have obtained the data from http://www.crowdflower.com/data-for-everyone which contain these reviews about the self-driving cars. Our dataset contains 7,156 observations and 9 variables. We plan to do descriptive analysis of the reviews to identify key topics and then use supervised sentiment analysis. We also plan to track and report how the topics and the sentiments change over time.

View the e-poster or slides (PDF)

Imagine if you will a program, a program that loves its data, a program that loves its data to be in the same directory as the program itself. Together, in the same directory. True love. The program loves its data so much, it just refers to it by filename. No need to say what directory the data is in; it is the same directory. Now imagine that program being thrust into the world of the server. The server knows not what directory this program resides in. The server is an uncaring, but powerful, soul. Yet, when the program is executing, and the program refers to the data just by filename, the server bellows nay, no path, no data. A knight in shining armor emerges, in the form of a SAS^® macro, who says lo, with the help of the SAS^® Enterprise Guide^® macro variable minions, I can gift you with the location of the program directory and send that with you to yon mighty server. And there was much rejoicing. Yay. This paper shows you a SAS macro that you can include in your SAS Enterprise Guide pre-code to automatically set your present working directory to the same directory where your program is saved on your UNIX or Linux operating system. This is applicable to submitting to any type of server, including a SAS Grid Server. It gives you the flexibility of moving your code and data to different locations without having to worry about modifying the code. It also helps save time by not specifying complete pathnames in your programs. And can't we all use a little more time?

Read the paper (PDF)

As a SAS^® administrator, have you ever wanted to look at the data in SAS^® Environment Manager spanning a longer length of time? Has your manager asked for access to the data so that they can use it to spot trends and make predictions? This paper shows you how to share that wealth of information found in the SAS Environment Manager log data. It explains how to save and store the data for use in SAS^® Visual Analytics. You will find tips on structuring the data for easy analysis and examples of using the data to make business decisions.

Read the paper (PDF)

If you are planning to deploy SAS^® Grid Manager or SAS^® Enterprise BI (or other distributed SAS^® Foundation applications) with load-balanced servers on multiple operating systems instances, a shared file system is required. In order to determine the best shared file system choice for a given deployment, it is important to understand how the file system is used, the SAS^® I/O workload characteristics performed on it, and the stressors that SAS Foundation applications produce on the file system. For the purposes of this paper, we use the term shared file system to mean both a clustered file system and shared file system, even though shared can denote a network file system and a distributed file system not clustered. This paper examines the shared file systems that are most commonly used with SAS and reviews their strengths and weaknesses.

Read the paper (PDF)

In 2012, US Customs scanned nearly 4% and physically inspected less than 1% of the 11.5 million cargo containers that entered the United States. Laundering money through trade is one of the three primary methods used by criminals and terrorists. The other two methods used to launder money are using financial institutions and physically moving money via cash couriers. The Financial Action Task Force (FATF) roughly defines trade-based money laundering (TBML) as disguising proceeds from criminal activity by moving value through the use of trade transactions in an attempt to legitimize their illicit origins. As compared to other methods, this method of money laundering receives far less attention than those that use financial institutions and couriers. As countries have budget shortfalls and realize the potential loss of revenue through fraudulent trade, they are becoming more interested in TBML. Like many problems, applying detection methods against relevant data can result in meaningful insights, and can result in the ability to investigate and bring to justice those perpetuating fraud. In this paper, we apply TBML red flag indicators, as defined by John A. Cassara, against shipping and trade data to detect and explore potentially suspicious transactions. (John A. Cassara is an expert in anti-money laundering and counter-terrorism, and author of the book Trade-Based Money Laundering. ) We use the latest detection tool in SAS^® Viya , along with SAS^® Visual Investigator.

View the e-poster or slides (PDF)

Web services are becoming more and more relied upon for serving up vast amounts of data. With such a heavy reliance on the web, and security threats increasing every day, security is a big concern. OAuth 2.0 has become a go-to way for websites to allow secure access to the services they provide. But with increased security, comes increased complexity. Accessing web services that use OAuth 2.0 is not entirely straightforward, and can cause a lot of users plenty of trouble. This paper helps clarify the basic uses of OAuth and shows how you can easily use Base SAS^® to access a few of the most popular web services out there.

Read the paper (PDF)

The University of Central Florida (UCF) Institutional Knowledge Management (IKM) office provides data analysis and reporting for all UCF divisions. These projects are logged and tracked through the Oracle PeopleSoft content management system (CMS). In the past, projects were monitored via a weekly query pulled using SAS^® Enterprise Guide^®. The output would be filtered and prioritized based on project importance and due dates. A project list would be sent to individual staff members to make updates in the CMS. As data requests were increasing, UCF IKM needed a tool to get a broad overview of the entire project list and more efficiently identify projects in need of immediate attention. A project management dashboard that all IKM staff members can access was created in SAS^® Visual Analytics. This dashboard is currently being used in weekly project management meetings and has eliminated the need to send weekly staff reports.

View the e-poster or slides (PDF)

One often uses an iterative %DO loop to execute a section of a macro repetitively. An alternative method is to use the implicit loop in the DATA step with the EXECUTE routine to generate a series of macro calls. One of the advantages in the latter approach is eliminating the need of using indirect referencing. To better understand the use of the CALL EXECUTE routine, it is essential for programmers to understand the mechanism and the timing of macro processing to avoid programming errors. These technical issues are discussed in detail in this paper.

Read the paper (PDF)

Are you tired of constantly creating new emails each and every time you run a report, frantically searching for the reports, attaching said reports, and writing emails, all the while thinking there has to be a better way? Then, have I got some code to share with you! This session provides you with code to flee from your old ways of emailing data and reports. Instead, you set up your SAS^® code to send an email to your recipients. The email attaches the most current files each and every time the code is run. You do not have to do anything manually after you run your SAS code. This session provides SAS programmers with instructions about how to create their own email in a macro that is based on their current reports. We demonstrate different options to customize the code to add the email body (and to change the body) and to add attachments (such as PDF and Excel). We show you an additional macro that checks whether a file exists and adds a note in the SAS log if it is missing so that you won't get a warning message. Using SAS code, you will become more efficient and effective by automating a tedious process and reducing errors in email attachments, wording, and recipient lists.

Read the paper (PDF)

The syntax to combine SAS^® data sets is simple: use the SET statement to concatenate, and use the MERGE and BY statements to merge. The data sets themselves, however, might be complex. Combining the data sets might not result in what you need. This paper reviews techniques to perform before you combine data sets, including checking for the following: common variables; common variables with different attributes; duplicate identifiers; duplicate observations; and acceptable match rates.

Read the paper (PDF)

SAS^® 9.4 Graph Template Language: Reference has more than 1300 pages and hundreds of options and statements. It is no surprise that programmers sometimes experience unexpected twists and turns when using the graph template language (GTL) to draw figures. Understandably, it is easy to become frustrated when your program fails to produce the desired graphs despite your best effort. Although SAS needs to continue improving GTL, this paper offers several tricks that help overcome some of the roadblocks in graphing.

Read the paper (PDF)

Can you actually get something for nothing? With PROC SQL's subquery and remerging features, then yes, you can. When working with categorical variables, there is often a need to add flag variables based on group descriptive statistics, such as group counts and minimum and maximum values. Instead of first creating the group count or minimum or maximum values, and then merging the summarized data set with the original data set with conditional statements creating a flag variable, why not take advantage of PROC SQL to complete three steps in one? With PROC SQL's subquery, CASE-WHEN clause, and summary functions by the group variable, you can easily remerge the new flag variable back with the original data set.

Read the paper (PDF)

With the 2014 launch of SAS^® University Edition, the reach of SAS^® was greatly expanded to educators, students, researchers, non-profits, and the curious, who for the first time could use a full version of Base SAS^® software for free. Because SAS University Edition allows a maximum of two CPUs, however, performance is curtailed sharply from more substantial SAS environments that can benefit from parallel and distributed processing, such as environments that implement SAS^® Grid Manager, Teradata, or Hadoop solutions. Even when comparing performance of SAS University Edition against the most straightforward implementation of the SAS windowing environment, the SAS windowing environment demonstrates greater performance when run on the same computer. With parallel processing and distributed computing becoming the status quo in SAS production environments, SAS University Edition will unfortunately fall behind counterpart SAS solutions if it cannot harness parallel processing best practices and performance. To curb this disparity, this session introduces groundbreaking programmatic methods that enable commodity hardware to be networked so that multiple instances of SAS University Edition can communicate and work collectively to divide and conquer complex tasks. With parallel processing facilitated, a SAS practitioner can now harness an endless number of computers to produce blitzkrieg solutions with the SAS University Edition that rival the performance of more costly, complex infrastructure.

Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second level-learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS^® 9.4 and SAS^® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and na ve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.

Read the paper (PDF)

This presentation brings together experiences from SAS^® professionals working as volunteers for organizations, charities, and in academic research. Pro bono work, much like that done by physicians, attorneys, and professionals in other areas, is rapidly growing in statistical practice as an important part of a statistical career, offering the opportunity to use your skills in a places where they are so needed but cannot be supported in a for-pay position. Statistical volunteers also gain important learning experiences, mentoring, networking, and other opportunities for professional development. The presenter shares experiences from volunteering for local charities, non-governmental organizations (NGOs) and other organizations and causes, both in the US and around the world. The mission, methods, and focus of some organizations are presented, including DataKind, Statistics Without Borders, Peacework, and others.

Read the paper (PDF)

Have you ever run SAS^® code with a DATA step and the results were not what you expected? Tracking down the problem can be a time-consuming task. To assist you in this common scenario, SAS^® Enterprise Guide^® 7.13 and beyond has a DATA step debugger tool. The simple and interactive DATA step debugger enables you to visually walk through the execution of your DATA step program. You can control the DATA step execution, view the variables, and set breakpoints to quickly identify data and logic errors. Come see the full capabilities of the new SAS Enterprise Guide DATA step debugger. You'll be squashing bugs in no time!

Read the paper (PDF)

Has the rapid pace of SAS/STAT^® releases left you unaware of powerful enhancements that could make a difference in your work? Are you still using PROC REG rather than PROC GLMSELECT to build regression models? Do you understand how the GENMOD procedure compares with the newer GEE and HPGENSELECT procedures? Have you grasped the distinction between PROC PHREG and PROC ICPHREG? This paper will increase your awareness of modern alternatives to well-established tools in SAS/STAT by using succinct, high-level comparisons rather than detailed descriptions to explain the relative benefits of procedures and methods. The paper focuses on alternatives in the areas of regression modeling, mixed models, generalized linear models, and survival analysis. When you see the advantages of these newer tools, you will want to put them into practice. This paper points you to helpful resources for getting started.

Read the paper (PDF)

Dealing with analysts and managers who do not know how to or want to use SAS^® can be quite tricky if everything you are doing uses SAS. This is where stored processes using SAS^® Enterprise Guide^® comes in handy. Once you know what they want to get out of the code, prompts can be defined in a smart and flexible way to give all users (whether they are SAS or not) full control over the output of the code. The key is having code that requires minimal maintenance and for you to be very flexible so that you can accommodate anything that the user comes up with. This session provides examples of credit risk stress testing where loss forecasting results were presented using different levels. Results were driven by a stored process prompt using a simple DATA step, PROC SQL, and PROC REPORT. This functionality can be used in other industries where data is shown using different levels of granularity.

Read the paper (PDF)

There is an industry-wide push toward making workflows seamless and reproducible. Incorporating reproducibility into the workflow has many benefits; among them are increased transparency, time savings, and accuracy. We walk through how to seamlessly integrate SAS^®, LaTeX, and R into a single reproducible document. We also discuss best practices for general principles such as literate programming and version control.

Read the paper (PDF)

The SAS^® LOCK statement was introduced in SAS^®7 with great pomp and circumstance, as it enabled SAS^® software to lock data sets exclusively. In a multiuser or networked environment, an exclusive file lock prevents other users and processes from accessing and accidentally corrupting a data set while it is in use. Moreover, because file lock status can be tested programmatically with the LOCK statement return code (&SYSLCKRC), data set accessibility can be validated before attempted access, thus preventing file access collisions and facilitating more reliable, robust software. Notwithstanding the intent of the LOCK statement, stress testing demonstrated in this session illustrates vulnerabilities in the LOCK statement that render its use inadvisable due to its inability to lock data sets reliably outside of the SAS/SHARE^® environment. To overcome this limitation and enable reliable data set locking, a methodology is demonstrated that uses semaphores (flags) that indicate whether a data set is available or is in use, and mutually exclusive (mutex) semaphores that restrict data set access to a single process at one time. With Base SAS^® file locking capabilities now restored, this session further demonstrates control table locking to support process synchronization and parallel processing. The LOCKSAFE macro demonstrates a busy-waiting (or spinlock) design that tests data set availability repeatedly until file access is achieved or the process times out.

Read the paper (PDF)

In SAS^® Visual Analytics, we demonstrate a search functionality that enables users to filter a LASR table for records containing a search string. The search is performed on selected character fields that are defined for the table. The search string can be portions of words. Each additional string to search for narrows the search results.

Read the paper (PDF)

At the University of Central Florida (UCF), Student Development and Enrollment Services (SDES) combined efforts with Institutional Knowledge Management (IKM), which is the official source of data at UCF, to venture in a partnership to bring to life an electronic version of the SDES Dashboard at UCF. Previously, SDES invested over two months in a manual process to create a booklet with graphs and data that was not vetted by IKM; upon review, IKM detected many data errors plus inconsistencies in the figures that had been manually collected by multiple staff members over the years. The objective was to redesign this booklet using SAS^® Web Report Studio. The result was a collection of five major reports. IKM reports use SAS^® Business Intelligence (BI) tools to surface the official UCF data, which is provided to the State of Florida. Now it just takes less than an hour to refresh these reports for the next academic year cycle. Challenges in the design, implementation, usage, and performance are presented.

Read the paper (PDF)

As a SAS^® programmer, how often does it happen where you would like to submit some code but not wait around for it to finish? SAS^® Studio has a way to achieve this and much more! This paper covers how to submit and execute SAS code in the background using SAS Studio. Background submit in the SAS Studio interface allows you to submit code and continue with your work. You receive a notification when it is finished, or you can even disconnect from your browser session and check the status of the submitted code later. Or you can choose to use SAS Studio to submit your code without bringing up the SAS Studio interface at all. This paper also covers the ability to use a command-line executable program that uses SAS Studio to execute SAS code in the background and generate log and result files without having to create a new SAS Studio session. These techniques make it much easier to spin up long-running jobs, while still being able to get your other work done in the meantime.

Read the paper (PDF)

Every sourcing and procurement department has limited resources to use for realizing productivity (cost savings). In practice, many organizations simply schedule yearly pricing negotiations with their main suppliers. They do not deviate from that approach unless there is a very large swing in the underlying commodity. Using cost data gleaned from previous quotes and SAS^® Enterprise Guide^®, we can put in place a program and methodology that move the practice from gut instinct to quantifiable and justifiable models that can easily be updated on a monthly basis. From these updated models, we can print a report of suppliers or categories that we should consider for cost downs, and suppliers or categories that we should work on to hold current pricing. By having all cost models, commodity data, and reporting functions within SAS Enterprise Guide, we are able to not only increase the precision and effectiveness of our negotiations, but also to vastly decrease the load of repetitive work that has been traditionally placed on supporting analysts. Now the analyst can execute the program, send the initial reports to the management team, and be leveraged for other projects and tasks. Moreover, the management team can have confidence in the analysis and the recommended plan of action.

View the e-poster or slides (PDF)

Survival analysis differs from other types of statistical analysis, including graphical summaries and regression modeling procedures, because data is almost always censored. The purpose of this project is to apply survival analysis techniques in SAS^® to practical survival data, aiming to understand the effects of gender and age on lung cancer patient survival at different cancer sites. Results show that both gender and age are significant variables in predicting lung cancer patient survival using the Cox proportional hazards model. Females have better survival than males when other variables in the model are fixed (p-value 0.0254). Moreover, the hazard of patients who are over 65 is 1.385 times that of patients who are under 65 (p-value 0.0145).

View the e-poster or slides (PDF)

Did you know that you could leverage the statistical power of the FREQ procedure and still be able to control the appearance of your output? Many people think they have to use procedures such as REPORT and TABULATE to be able to apply style options and control formats and headings for their output. However, if you pair PROC FREQ with a TEMPLATE procedure step, you can customize the appearance of your output and make enhancements to tables, such as adding colors and controlling headings. If you are a statistician, you know the many PROC FREQ options that produce high-level statistics. But did you also know that PROC FREQ can generate a graphical representation of those statistics? PROC FREQ can generate the graphs, and then you can use ODS Graphics and the Graph Template Language (GTL) to improve the appearance of the graphs. Written for intermediate users, this paper demonstrates how you can enhance the default output for PROC FREQ one-way and multi-way tables by modifying colors, formats, and labels. This paper also describes the syntax for creating graphs for multiple statistics, and it uses examples to show how you can customize these graphs.

Read the paper (PDF)

In the game of tag, being it is bad, but where accessibility compliance is concerned, being tagged is good! Tagging is required for PDF files to comply with accessibility standards such as Section 508 and the Web Content Accessibility Guidelines (WCAG). In the fourth maintenance release for SAS^® 9.4, the preproduction option in the ODS PDF statement, ACCESSIBLE, creates a tagged PDF file. We look at how this option changes the file that is created and focus on the SAS^® programming techniques that work best with the new option. You ll then have the opportunity to try it yourself in your own code and provide feedback to SAS.

Read the paper (PDF)

In 32 years as a SAS^® consultant at the Federal Reserve Board, I have seen some questions about common SAS tasks surface again and again. This paper collects the most common questions related to basic DATA step processing from my previous 'Tales from the Help Desk' papers, and provides code to explain and resolve them. The following tasks are reviewed: using the LAG function with conditional statements; avoiding character variable truncation; surrounding a macro variable with quotes in SAS code; handling missing values (arithmetic calculations versus functions); incrementing a SAS date value with the INTNX function; converting a variable from character to numeric or vice versa and keeping the same name; converting character or numeric values to SAS date values; using an array definition in multiple DATA steps; using values of a variable in a data set throughout a DATA step by copying the values into a temporary array; and writing data to multiple external files in a DATA step, determining file names dynamically from data values. In the context of discussing these tasks, the paper provides details about SAS processing that can help users employ SAS more effectively. See the references for seven previous papers that contain additional common questions.

Read the paper (PDF)

Have you ever used a control chart to assess the variation in a process? Did you wonder how you could modify the chart to tell a more complete story about the process? This paper explains how you can use the SHEWHART procedure in SAS/QC^® software to make the following enhancements: display multiple sets of control limits that visualize the evolution of the process, visualize stratified variation, explore within-subgroup variation with box-and-whisker plots, and add information that improves the interpretability of the chart. The paper begins by reviewing the basics of control charts and then illustrates the enhancements with examples drawn from real-world quality improvement efforts.

Read the paper (PDF)

SAS^® Macro Language can be used to enhance many report-generating processes. This presentation showcases the potential that macros have in populating predesigned RTF templates. If you have multiple report templates saved, SAS^® can choose and populate the correct ones based on macro programming and DATA _NULL_ using the TRANSTRN function. The autocall macro %TRIM, combined with a macro (for example, &TEMPLATE), can be attached to the output RTF template name. You can design and save as many templates as you like or need. When SAS assigns the macro variable TEMPLATE a value, the %TRIM(&TEMPLATE) statement in the output pathway correctly populates the appropriate template. This can make life easy if you create multiple different reports based on one data set. All that's required are stored templates on accessible pathways.

View the e-poster or slides (PDF)

Temporal text mining (TTM) is the discovery of temporal patterns in documents that are collected over time. It involves discovery of latent themes, construction of a thematic evolution graph, and analysis of thematic patterns. This paper uses text mining and time series analysis techniques to explore Don Quixote de la Mancha, a two-volume master work of Western literature. First, it uses singular value decomposition in SAS^® Text Miner to discover 25 key themes that characterize the two volumes. Then it treats the chapters of the two books as time-ordered documents and creates a semiautomated visual summary of the two volumes. It also explores the trajectory of individual themes over the course of the chapters and identifies episodes, recurring themes, and climaxes. Finally, it uses time series clustering in SAS^® Enterprise Miner to group chapters that have similar themes and to group themes that have similar trajectories. The TTM methods demonstrated in this paper lend themselves to business applications such as monitoring changes in customer sentiment and summarizing research and legislative trends.

Read the paper (PDF)

This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS^®. Topics include recommendations gleaned from actual deployments from a variety of implementations and distributions. Techniques cover tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.

Read the paper (PDF)

Testing is a weak spot in many data warehouse environments. A lot of the testing is focused on the correct implementation of requirements. But due to the complex nature of analytics environments, a change in a data integration process can lead to unexpected results in totally different and untouched areas. We developed a method to identify unexpected changes often and early by doing a nightly regression test. The test does a full ETL run, compares all output from the test to a baseline, and reports all the changes. This paper describes the process and the SAS^® code needed to back up existing data, trigger ETL flows, compare results, and restore situations after a nightly regression test. We also discuss the challenges we experienced while implementing the nightly regression test framework.

Read the paper (PDF)

SAS offers generation data set structure as part of the language feature that many users are familiar with. They use it in their organizations and manage it using keywords such as GENMAX and GENNUM. While SAS operates in a mainframe environment, users also have the ability to tap into the GDG (generation data group) feature available on z/OS, OS/390, OS/370, IBM 3070, or IBM 3090 machines. With cost-saving initiatives across businesses and due to some scaling factors, many organizations are in the process of migrating to mid-tier platforms to cheaper operating platforms such as UNIX and AIX. Because Linux is open source and is a cheaper alternative, several organizations have opted for the UNIX distribution of SAS that can work in UNIX and AIX environments. While this might be a viable alternative, there are certain nuances that the migration effort brings to the technical conversion teams. On UNIX, the concept of GDGs does not exist. While SAS offers generation data sets, they are good only for SAS data sets. If the business organization needs to house and operate with a GDG-like structure for text data sets, there isn't one available. While my organization had a similar initiative to migrate programs used to run the subprime mortgage analytic, incentive, and regulatory reporting, we identified the paucity of literature and research on this topic. Hence, I ended up developing the utility that addresses this need. This is a simple macro that helps us closely simulate a GDG/GDS.

Read the paper (PDF) | View the e-poster or slides (PDF)

This project described the method to classify movie genres based on synopses text data by two approaches: term frequency, and inverse document frequency (tf-idf) and C4.5 decision tree. Using the performance comparison of the classifiers by manipulating the different parameters, the strength and improvement of this method in substantial text analysis were also interpreted. As the result, these two approaches are powerful to identify movie genres.

Read the paper (PDF) | View the e-poster or slides (PDF)

SAS^® Cloud Analytic Services (CAS) is the cloud-based run-time environment for data management and analytics in SAS^®. By run-time environment, we refer to the combination of hardware and software where data management and analytics take place. In a sense, CAS is just another SAS platform to do things. CAS is a platform for high-performance analytics and distributed computing. The CAS server provides data management and an analytics framework that can run in the cloud, that can act as a cloud, and that provides the best-in-class analytics that SAS is known for. This new architecture functions as a public API, allowing access from many different clients such as Lua, Python, Java, REST, and yes, even SAS. The CAS server is designed to provide user-level sessions, to share data between sessions, and to provide fault tolerance, which allows a worker node to crash without losing data and allows the user action to continue running to completion. The isolation provided to each session allows one session to crash without affecting other sessions. The concept of 'always in memory' in CAS means that an action is not aware of what the server does to allow the action to access the data. The entire file might be in memory or just pieces of the file might be mapped into memory, just in time for the action to access the data. This allows CAS tables to be loaded that are larger than the memory available across the grid. Hadoop can be used to provide data redundancy. The server is elastic and can add or remove nodes as needed. Users can specify how many nodes they want their session to use, so that the session fits their needs.

Read the paper (PDF)

SAS^® provides an extensive set of graphs for different needs. But as a SAS programmer or someone who uses SAS^® Visual Analytics Designer to create reports, the number of possible scenarios you have to address outnumber the available graphs. This paper demonstrates how to create your own advanced graphs by intelligently combining existing graphs. This presentation explains how you can create the following types of graphs by combining existing graphs: a line-based graph that shows a line for each group such that each line is partly solid and partly dashed to show the actual and predicted values respectively; a control chart (which is currently not available as a standard graph) that lets you show how values change within multiple upper and lower limits; a line-based graph that gives you more control over attributes (color, symbol, and so on) of specific markers to depict special conditions; a visualization that shows the user only a part of the data at a given instant, and lets him move a window to see other parts of the data; a chart that lets the user compare the data values to a specific value in a visual way to make the comparison more intuitive; and a visualization that shows the overall data and at the same time shows the detailed data for the selected category. This paper demonstrates how to use the technique of combining graphs to create such advanced charts in SAS^® Visual Analytics and SAS^® Graph Builder as well as by using SAS procedures like the SGRENDER procedure.

Read the paper (PDF)

Propensity to Buy models comprise one of the most widely used techniques in supporting business strategy for customer segmentation and targeting. Some of the key challenges every data scientist faces in building predictive models are the utilization of all known predictor variables, uncovering any unknown signals, and adjusting for latent variable errors. Often, the business demands inclusion of certain variables based on a previous understanding of process dynamics. To meet such client requirements, these inputs are forced into the model, resulting in either a complex model with too many inputs or a fragile model that might decay faster than expected. West Corporation's Center for Data Science (CDS) has found a work around to strike a balance between meeting client requirements and building a robust model by using clustering techniques. A leading telecom services provider uses West's SMS Outbound Notification Platform to notify their customers about an upcoming Pay-Per-View event. As part of the modeling process, the client has identified a few variables as key business drivers and CDS used those variables to build clusters, which were then used as inputs to the predictive model. In doing so, not only all the effects of the client-mandated variables were captured successfully, but this also helped to reduce the number of inputs to the model, making it parsimonious. This paper illustrates how West has used clustering in the data preparation process and built a robust model.

Socioeconomic status (SES) is a major contributor to health disparities in the United States. Research suggests that those with a low SES versus a high SES are more likely to have lower life expectancy; participate in unhealthy behaviors such as smoking and alcohol consumption; experience higher rates of depression, childhood obesity, and ADHD; and experience problems accessing appropriate health care. Interpreting SES can be difficult due to the complexity of data, multiple data sources, and the large number of socioeconomic and demographic measures available. When SES is expanded to include additional social determinants of health (SDOH) such as language barriers and transportation barriers to care; access to employment and affordable housing; adequate nutrition, family support and social cohesion; health literacy; crime and violence; quality of housing; and other environmental conditions, the ability to measure and interpret the concept becomes even more difficult. This paper presents an approach to measuring SES and SDOH using publicly available data. Various statistical modeling techniques are used to define state-specific composite SES scores at local areas-ZIP Code and Census Tract. Once developed, the SES/SDOH models are applied to health care claims data to evaluate the relationship between health services utilization, cost, and social factors. The analysis includes a discussion of the potential impact of social factors on population risk adjustment.

Read the paper (PDF)

Every visualization tells a story. The effectiveness of showing data through visualization becomes clear as these visualizations will tell stories about differences in US mortality using the National Longitudinal Mortality Study (NLMS) data, using the Public-Use Microdata Samples (PUMS) of 1.2 million cases and 122 thousand records of mortality. SAS^® Visual Analytics is a versatile and flexible tool that easily displays the simple effects of differences in mortality rates between age groups, genders, races, places of birth (native or foreign), education and income levels, and so on. Sophisticated analyses including logistical regression (with interactions), decision trees, and neural networks that are displayed in a clear, concise manner help describe more interesting relationships among variables that influence mortality. Some of the most compelling examples are: Males who live alone have a higher mortality rate than females. White men have higher rates of suicide than black men.

Read the paper (PDF) | View the e-poster or slides (PDF)

You've all seen the job posting that looks more like an advertisement for the ever-elusive unicorn. It begins by outlining the required skills that include a mixture of tools, technologies, and masterful things that you should be able to do. Unfortunately, many such postings begin with restrictions to those with advanced degrees in math, science, statistics, or computer science and experience in your specific industry. They must be able to perform predictive modeling, natural language processing, and, for good measure, candidates should apply only if they know artificial intelligence, cognitive computing, and machine learning. The candidate should be proficient in SAS^®, R, Python, Hadoop, ETL, real-time, in-cloud, in-memory, in-database and must be a master storyteller. I know of no one who would be able to fit that description and still be able to hold a normal conversation with another human. In our work, we have developed a competency model for analytics, which describes nine performance domains that encompass the knowledge, skills, behaviors, and dispositions that today's analytic professional should possess in support of a learning, analytically driven organization. In this paper, we describe the model and provide specific examples of job families and career paths that can be followed based on the domains that best fit your skills and interests. We also share with participants a self-assessment tool so that they can see where the stack up!

Read the paper (PDF)

As statistics students striving to discover new impacts that can be made in a data-driven world, we applied our trade to a modern topic. Studying a sport that owns a day of the week and learning how variables can influence any given series or result in a game can lead to a much larger impact. Using Base SAS^®, we used predictive analysis methods to determine the chance any given team would win a game versus a given opponent. To take it a step further, we deciphered which decision should really be made by a coach on fourth down and how that stacked up to what they actually did. With information like this, the football world might soon see an impact on how people play the game.

Read the paper (PDF)

As computer technology advances, SAS^® continually pursues opportunities to implement state-of-the-art systems that solve problems in data preparation and analysis faster and more efficiently. In this pursuit, we have extended the TRANSPOSE procedure to operate in a distributed fashion within both Teradata and Hadoop, using dynamically generated DS2 executed by the SAS^® Embedded Process and within SAS^® Viya , using its native transpose action. With its new ability to work within these environments, PROC TRANSPOSE provides you with access to its parallel processing power and produces results that are compatible with your existing SAS programs.

Read the paper (PDF)

Have you ever had your macro code not work and you couldn't figure out why? Maybe even something as simple as %if &sysscp=WIN %then LIBNAME libref 'c:\temp'; ? This paper is designed for programmers who know %LET and can write basic macro definitions already. Now let's take your macro skills a step farther by adding to your skill set. The %IF statement can be a deceptively tricky statement due to how IF statements are processed in a DATA step and how that differs from how %IF statements are processed by the macro processor. Focus areas of this paper are: 1) emphasizing the importance of the macro facility as a code-generation facility; 2) how an IF statement in a DATA step differs from a macro %IF statement and when to use which; 3) why semicolons can be misinterpreted in an %IF statement.

Read the paper (PDF)

JSON is quickly becoming the industry standard for data interchanges, especially in supporting REST APIs. But until now, importing JSON content into SAS^® software and leveraging it in SAS has required significant custom code. Developing that code can be laborious, requiring transcoding, manual text parsing, and creating handlers for unexpected structure changes. Fortunately, the new JSON LIBNAME engine (in the fourth maintenance release for SAS^® 9.4 and later) delivers a robust, efficient method for importing JSON content into SAS data structures. This paper demonstrates several real-world examples of the JSON LIBNAME using open data APIs. The first example contrasts the traditional custom code and JSON LIBNAME approach using big data from the United Nations Comtrade Database. The two approaches are compared in terms of complexity of code, time to execute, and the resulting data structures. The same method is applied to data from Google and the US Census Bureau's APIs. Finally, to demonstrate the ability of the JSON LIBNAME to handle unexpected changes to a JSON data structure, we use the SAS JSON procedure to write a JSON file and then simulate changes to that structure to show how one JSON LIBNAME process can easily adjust the import to handle those changes.

Read the paper (PDF)

You might scream in pain or cry with joy that SAS^® software can directly produce output in Microsoft Excel as .xlsx workbooks. Excel is an excellent vehicle for delivering large amounts of summary information that needs to be partitioned for human review, exploratory filtering, and sorting. SAS supports ODS EXCEL as a production destination. This paper discusses using the ODS EXCEL statement and the TABULATE and REPORT procedures in the domain of summarizing cross-sectional data extracted from a medical claims database. The discussion covers data preparation, report preparation, and tabulation statements such as CLASS, CLASSLEV, and TABLE. The effects of STYLE options and the TAGATTR suboption for inserting features that are specific to Excel such as formulas, formats, and alignment are covered in detail. A short discussion of reusing these concepts in PROC REPORT statements such as DEFINE, COMPUTE, and CALL DEFINE are also covered.

Read the paper (PDF)

As a freshman at a large university, life can be fun as well as stressful. The choices a freshman makes while in college might impact his or her overall health. In order to examine the overall health and different behaviors of students at Oklahoma State University, a survey was conducted among the freshmen students. The survey focused on capturing the psychological, environmental, diet, exercise, and alcohol and drug use among students. A total of 795 out of 1,036 freshman students completed the survey, which included around 270 questions that covered the range of issues mentioned above. An exploratory factor analysis identified 26 factors. For example, two factors that relate to the behavior of students under stress are eating and relaxing. Further understanding the variables that contribute to alcohol and drug use might help the university in planning appropriate interventions and preventions. Factor analysis with Cronbach's alpha provided insight into a more defined set of variables to help address these types of issues. We used SAS^® to do factor analysis as well as to create different clusters of students with unique characteristics and profiled these clusters

Read the paper (PDF)

Does your job require you to create reports in Microsoft Excel on a quarterly, monthly, or even weekly basis? Are you creating all or part of these reports by hand, referencing another sheet containing rows and rows and rows of data? If so, stop! There is a better way! The new ODS destination for Excel enables you to create native Excel files directly from SAS^®. Now you can include just the data you need, create great-looking tabular output, and do it all in a fraction of the time! This paper shows you how to use the REPORT procedure to create polished tables that contain formulas, colored cells, and other customized formatting. Also presented in the paper are the destination options used to create various workbook structures, such as multiple tables per worksheet. Using these techniques to automate the creation of your Excel reports will save you hours of time and frustration, enabling you to pursue other endeavors.

Read the paper (PDF)

In the 2015-2016 season of the National Basketball Association (NBA), the Golden State Warriors achieved a record-breaking 73 regular-season wins. This accomplishment would not have been possible without their reigning Most Valuable Player (MVP) champion Stephen Curry and his historic shooting performance. Shattering his previous NBA record of 286 three-point shots made during the 2014-2015 regular season, he accrued an astounding 402 in the next season. With an increased emphasis on the advantages of the three-point shot and guard-heavy offenses in the NBA today, organizations are naturally eager to investigate player statistics related to shooting at long ranges, especially for the best of shooters. Furthermore, the addition of more advanced data-collecting entities such as SportVU creates an incredible opportunity for data analysis, moving beyond simply using aggregated box scores. This work uses quantile regression within SAS^® 9.4 to explore the relationships between the three-point shot and other relevant advanced statistics, including some SportVU player-tracking data, for the top percentile of three-point shooters from the 2015-2016 NBA regular season.

View the e-poster or slides (PDF)

In the next ten years, the Industrial Internet of Things (IIoT) will dramatically alter nearly all sectors of the industrial economy, which account for nearly 2/3 of the gross domestic product according to findings at the World Economic Forum in Davos, Switzerland. The IIoT will radically change how humans, machines, and our current infrastructure operate to achieve results and compete in the new digital world. Considered a disruptive technology, the IIoT will create new value streams for industries, ranging from automated decisions and reactions in real time to massively improved operational efficiencies, connected infrastructure platforms and much better interaction between machines and humans. Erik Brynjolfsson, Director, MIT Initiative on the Digital Economy, said Humans must adapt to collaborate with machines, and when that collaboration happens, the end result is stronger. This session outlines three ways analytics can help bridge the gap between humans and machines to achieve value in the IIoT world: 1) predictive analytics from a case study in the oil and gas world; 2) an automation case study; and 3) text analysis from a tax example. The session outlines the people, process, and technologies needed to enable this infrastructure. As a special bonus, this session covers key pitfalls to avoid regarding systems, silos, and the human barriers to understanding artificial intelligence.

Read the paper (PDF)

You might encounter people who used SAS^® long ago (perhaps in university), or people who had a very limited use of SAS in a job. Some of these people with limited knowledge and experience think that SAS is just a statistics package or just a GUI. Those that think of it as a GUI usually reference SAS^® Enterprise Guide^® or, if it was a really long time ago, SAS/AF^® or SAS/FSP^®. The reality is that the modern SAS system is a very large and complex ecosystem, with hundreds of software products and diversified tools for programmers and users. This poster provides diagrams and tables that illustrate the complexity of the SAS system from the perspective of a programmer. Diagrams and illustrations include the functional scope and operating systems in the ecosystem; different environments that program code can run in; cross-environment interactions and related tools; SAS^® Grid Computing: parallel processing; how SAS can run with files in memory (the legacy SAFILE statement and big data and Hadoop); and how some code can run in-database. We end with a tabulation of the many programming languages and SQL dialects that are directly or indirectly supported within SAS. This poster should enlighten those who think that SAS is an old, dated statistics package or just a GUI.

View the e-poster or slides (PDF)

As a SAS^® Visual Analytics administrator, how do you efficiently manage your SAS^® LASR environment? How do you ensure reliable data availability to your end users? How do you ensure that your users have the proper permissions to perform their tasks in SAS Visual Analytics? This paper covers some common management issues in SAS Visual Analytics, why and how they might arise, and how to resolve them. It discusses methods of programmatically managing your SAS^® LASR Analytic Server and tables, as well as using SAS^® Visual Analytics Administrator. Furthermore, it provides a better understanding of the roles in SAS Visual Analytics and demonstrates how to set up appropriate user permissions. Using the methods discussed in this paper can help you improve the end-user experience as well as system performance.

Read the paper (PDF)

This paper has two goals: 1) Determine which client factors have the highest influence on whether a client purchases a term deposit; 2) Determine the levels of those influential client factors that produce the most term deposit purchases. Achievement of these goals can aid a bank in gaining operating capital by targeting clients that are more likely to make term deposit purchases. Since the target response variable was binary in nature, a logistic regression model and binary decision tree model were used to analyze the marketing campaign data. The ROC curves and fit statistics of the logistic regression model and decision tree were compared to see which model fit the data best. The logistic regression model was the optimal model with a higher area under the ROC curve and a lower misclassification rate. Per the logistic regression model, the three factors that had the largest impact on term deposit purchases were: the type of job the client had, whether a client had credit in default, and whether the client had a personal loan. It was concluded that banks should focus on selling term deposits to clients that display levels of these three factors that lead to the most probable term deposit purchases.

Read the paper (PDF)

SAS^® environments are evolving in multiple directions. Modern web interfaces such as SAS^® Studio are replacing the traditional SAS^® Display Manager system. At the same time, distributed analytic computing, centrally managed by SAS^® Grid Manager, is becoming the standard topology for many enterprises. SAS administrators are faced with the task of providing business users properly configured, tuned, and monitored applications. The tips included in this paper provide SAS administrators with best practices to centrally manage SAS Studio options and repositories, proper grid tuning, effective monitoring of user sessions, high-availability considerations and more.

Read the paper (PDF)

The advent of robust and thorough data collection has resulted in the term big data. With Census data becoming richer, more nationally representative, and voluminous, we need methodologies that are designed to handle the manifold survey designs that Census data sets implement. The relatively nascent PROC SURVEYLOGISTIC, an experimental procedure in SAS^®9 and fully supported in SAS 9.1, addresses some of these methodologies, including clusters, strata, and replicate weights. PROC SURVEYLOGISTIC handles data that is not a straightforward random sample. Using Census data sets, this paper provides examples highlighting the appropriate use of survey weights to calculate various estimates, as well as the calculation and interpretation of odds ratios between categorical variable interactions when predicting a binary outcome.

Read the paper (PDF)

SAS^® programming skills are much in-demand, and numerous free tools are available for students who want to develop those skills. This paper introduces students to SAS^® Studio and the Jupyter Notebook interface within SAS^® University Edition. To make this introduction more tangible, the paper uses a large data set of baseball statistics as an example. In particular, statistical analysis using SAS^® Studio examines the relationship between salary and performance for major leaguers. From importing text files to creating basic statistics to doing a more advanced analysis, this paper shows multiple ways to carry out tasks so that you can choose whichever method works best for you. Additional statistics that use t tests and linear regression are simple with SAS University Edition. For completeness, the paper shows the same code that is used in SAS Studio examples in the context of Jupyter Notebook in SAS University Edition. The paper also provides additional information about SAS e-learning and SAS Certification to show students how to be fully equipped in order to apply themselves to analytics and data exploration.

Read the paper (PDF)

Time series analysis and forecasting have always been popular as businesses realize the power and impact they can have. Getting students to learn effective and correct ways to build their models is key to having successful analyses as more graduates move into the business world. Using SAS^® University Edition is a great way for students to learn analysis, and this talk focuses on the time series tasks. A brief introduction to time series is provided, as well as other important topics that are key to building strong models.

Read the paper (PDF)

Many organizations need to analyze large numbers of time series that have time-varying or frequency-varying properties (or both). The time-varying properties can include time-varying trends, and the frequency-varying properties can include time-varying periodic cycles. Time-frequency analysis simultaneously analyzes both time and frequency; it is particularly useful for monitoring time series that contain several signals of differing frequency. These signals are commonplace in data that are associated with the internet of things. This paper introduces techniques for large-scale time-frequency analysis and uses SAS^® Forecast Server and SAS/ETS^® software to demonstrate these techniques.

Read the paper (PDF)

Predictive analytics is powerful in its ability to predict likely future behavior, and is widely used in marketing. However, the old adage timing is everything continues to hold true the right customer and the right offer at the wrong time is less than optimal. In life, timing matters a great deal, but predictive analytics seldom takes timing into account explicitly. We should be able to do better. Financial service consumption changes often have precursor changes in behavior, and a behavior change can lead to multiple subsequent consumption changes. One way to improve our awareness of the customer situation is to detect significant events that have meaningful consequences that warrant proactive outreach. This session presents a simple time series approach to event detection that has proven to work successfully. Real case studies are discussed to illustrate the approach and implementation. Adoption of this practice can augment and enhance predictive analytics practice to elevate our game to the next level.

Read the paper (PDF)

Using SAS^® to query relational databases can be challenging, even for seasoned SAS programmers. SAS/ACCESS^® software makes it easy to directly access data on nearly any platform, but there is a lot of under-the-hood functionality that takes time to learn. Here are tips that will get you on your way fast, including understanding and mastering SQL pass-through; efficiently bulk-loading data from SAS into other databases; tuning your SQL queries; and when to use native database versus SAS functionality.

View the e-poster or slides (PDF)

Public water supplies contain disease-causing microorganisms in the water or distribution ducts. To kill off these pathogens, a disinfectant, such as chlorine, is added to the water. Chlorine is the most widely used disinfectant in all US water treatment facilities. Chlorine is known to be one of the most powerful disinfectants to restrict harmful pathogens from reaching the consumer. In the interest of obtaining a better understanding of what variables affect the levels of chlorine in the water, this presentation analyzed a particular set of water samples randomly collected from locations in Orange County, Florida. Thirty water samples were collected and their chlorine level, temperature, and pH were recorded. A linear regression analysis was performed on the data collected with several qualitative and quantitative variables. Water storage time, temperature, time of day, location, pH, and dissolved oxygen level were the independent variables collected from each water sample. All data collected was analyzed using various SAS^® procedures. Partial residual plots were used to determine possible relationships between the chlorine level and the independent variables. A stepwise selection was used to eliminate possible insignificant predictors. From there, several possible models for the data were selected. F-tests were conducted to determine which of the models appeared to be the most useful.

View the e-poster or slides (PDF)

Trends in predictive modeling, specifically in machine learning, are moving toward automated approaches where well-known predictive algorithms are applied and the best one is chosen according to some evaluation metric. This approach's efficacy relies on the underlying data preparation in general, and on data preprocessing in particular. Commonly used data preprocessing techniques include missing value imputation, outlier detection and treatment, functional transformation, discretization, and nominal grouping. Excluding toy problems, the composition of the best data preprocessing step depends on the modeling task at hand. This necessitates an iterative generation and evaluation of predictive pipelines, which consist of a mix of data preprocessing techniques and predictive algorithms. This is a combinatorial problem that can be a bottleneck in the analytics workflow. In this paper, we discuss the SAS^® Cloud Analytic Services (CAS) actions in SAS^® Viya that can be used to effect this end-to-end predictive pipeline in a scalable way, with special emphasis on CAS actions for data exploration, preprocessing, and feature transformation. In addition, we discuss how the whole process can be automated.

Knowing which SAS^® products are being used in your organization, by whom, and how often helps you decide whether you have the right mix and quantity licensed. These questions are not easy to answer. We present an innovative technique using three SAS utilities to answer these questions. This paper includes example code written for Linux that can easily be modified for Windows and other operating systems.

Read the paper (PDF) | View the e-poster or slides (PDF)

As the IT industry moves to further embrace cloud computing and the benefits it enables, many companies have been slow to adopt these changes due to concerns around data compliance. Compliance with state and federal law and the relevant regulations often leads decision makers to insist that systems dealing with protected health information or similarly sensitive data remain on-premises, as the risks for non-compliance are so high. In this session, we detail BNL Consulting s standard practices for transitioning solutions that are compliant with the Health Insurance Portability and Accountability Act (HIPAA) from on-premises to a cloud-based environment hosted by Amazon Web Services (AWS). We explain that by following best practices and doing plenty of research, HIPAA compliance in a cloud environment is no more challenging than compliance in an on-premises environment. We discuss the role of best-in-practice dev-ops tools like Docker, Consul, ELK Stack, and others, which improve the reliability and the repeat-ability of your HIPAA-compliant solutions. We tie these recommendations to the use of common SAS tools and show how they can work in concert to stabilize and improve the performance of the solution over the on-premises alternatives. Although this presentation is focused on health care and HIPAA-specific examples, many of the described practices and processes apply to any sensitive-data solutions that are being considered for the cloud.

Read the paper (PDF)

Transport Layer Security (TLS) configuration for SAS^® components is essential to protect data in motion. All necessary encryption arrangement is established through a TLS handshake between the client and the server side. Many SAS^® 9.4 and SAS^® Viya components can be a client side, a server side, or both. SAS documentation primarily provides how-to steps for the configuration. This paper examines the X.509 certificate and the TLS handshake protocol, which are the basic building blocks of the secure communication. The paper focuses on the logic behind the setup and how various types of certificates are used in the configuration. Many unique client and server combinations of SAS components are illustrated and explained with the best-practice suggestions.

Read the paper (PDF)

We are always looking for ways to improve the performance, efficiency, and availability of our investment in SAS^® solutions. To address those needs, SAS offers the ability to cluster many of its constituent software components. A cluster is a set of systems that work together with the goal of providing a single service. This session identifies 12 different technologies to create clusters of SAS software components and describes how they are designed to boost the capabilities of SAS to function in the enterprise.

Read the paper (PDF)

SAS^® Embedded Process enables user-written DS2 code and scoring models to run inside Hadoop. It taps into the massively parallel processing (MPP) architecture of Hadoop for scalable performance. SAS Embedded Process explores and complies with many Hadoop components. This paper explains how SAS Embedded Process interacts with existing Hadoop security technologies, such as Apache Sentry and RecordServices.

Read the paper (PDF)

Within a SOX-compliant environment, a batch job is run. During the process, an FTP server needs to be accessed. The batch user password is not known and the FTP credentials are not known either. How safely and securely can we achieve this? The approach is to have an authentication domain within the SAS metadata created that has the FTP credentials. Create an internal SAS user within the SAS metadata. This user exists only within the SAS metadata, so it does not pose any risk. Create an FTP server within the SAS metadata. Add and link everything together within the SAS metadata. Within the SAS batch job, the SAS internal user will be used (with the use of the hashed password) to connect to the metadata to get the FTP credentials stored within the authentication domain and retrieve or upload the data.

Read the paper (PDF)

Machine learning is not just for data scientists. Business analysts can use machine learning to discover rules from historical decision data or from historical performance data. Decision tree learning and logistic regression scorecard learning are available for standard data tables, and Associations Analysis is available for transactional event tables. These rules can be edited and optimized for changing business conditions and policies, and then deployed into automated decision-making systems. Users will see demonstrations using real data and will learn how to apply machine learning to business problems.

Read the paper (PDF)

This presentation explores the steps taken by a large public research institution to develop a five-year enrollment forecasting model to support the critical enrollment management process at an institution. A key component of the process is providing university stakeholders with a self-service, secure, and flexible tool that enables them to quickly generate different enrollment projections using the most up-to-date information as possible in Microsoft Excel. The presentation shows how we integrated both SAS^® Enterprise Guide^® and the SAS^® Add-In for Microsoft Office to support this critical process, which had very specific stakeholder requirements and expectations.

Read the paper (PDF)

The traditional model of SAS^® source-code production is for all code to be directly written by users or indirectly written (that is, generated by user-written macros, Lua code, or with DATA steps). This model was recently extended to enable SAS macro code to operate on arbitrary text (for example, on HTML) using the STREAM procedure. In contrast, SAS includes many products that operate in the client/server environment and function as follows: 1) the user interacts with the product via a GUI to specify the processing desired; 2) the product saves the user-specifications in metadata and generates SAS source code for the target processing; 3) the source code is then run (per user directions) to perform the processing. Many of these products give users the ability to modify the generated code and/or insert their own user-written code. Also, the target code (system-generated plus optional user-written) can be exported or deployed to be run as a stored process, in batch, or in another SAS environment. In this paper, we review the SAS ecosystem contexts where source code is produced, the pros and cons of each approach, discuss why some system-generated code is inelegant, and make some suggestions for determining when to write the code manually, and when and how to use system-generated code.

Read the paper (PDF)

As part of the Talanx Group, HDI Insurance has been one of the leading insurers in Brazil. Recently HDI Brazil implemented an innovative and integrated solution to prevent fraud in the Auto Claims process based on SAS^® Fraud Framework and SAS^® Real-time Decision Manager. A car fix or a refund is approved immediately after the claim registration for those customers who have no suspicious information. On the other hand, the high-scored claims are checked by the inspectors using SAS^® Social Network Analysis. In terms of analytics, the solution has a hybrid approach working with predictive models, business rules, anomalies, and network relationship. The main benefits are a reduction in the amount of fraud, more accuracy in determining the claims to be investigated, a decrease in the false-positive rate, and the use of a relationship network to investigate suspicious connections.

Read the paper (PDF)

Visualizing the movement of people over time in an animation can provide insights that tables and static graphs cannot. There are many options, but what if you want to base the visualization on large amounts of data from several sources? SAS^® is a great tool for this type of project. This paper summarizes how visualizing movement is accomplished using several data sets, large and small, and using various SAS procedures to pull it together. The use of a custom shape file is also highlighted. The end result is a GIF, which can be shared, that provides insights not available with other methods.

Read the paper (PDF)

This paper discusses a specific example of using graph analytics or social network analysis (SNA) in predictive modeling in the life insurance industry. The methods of social network analysis are applied to agents that share compensation, and the results are used to derive input variables for a model to predict the likelihood of certain behavior by insurance agents. Both SAS^® code and SAS^® Enterprise Miner are used to illustrate implementing different graph analytical methods. This paper assumes that the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression, and, in some sections, familiarity with SAS Enterprise Miner.

Read the paper (PDF)

Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between the SAS^® data sets. In projects that span multiple years (e.g., longitudinal studies), there are usually thousands of new variables introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables. However, they differ only in a digit or two that usually signifies the year number of the project. So, every year, there is this extensive task of comparing thousands of new variables to older variables for the sake of carrying forward key database elements corresponding to the previously defined variables. These elements can include the length of the variable, data type, format, discrete or continuous flag, and so on. In our SAS program, hash objects are efficiently used to cut down not only time, but also the number of DATA and PROC steps used to accomplish the task. Clean and lean code is much easier to understand. A macro is used to create the data set containing new and older variables. For a specific new variable, the FIND method in hash objects is used in a loop to find the match to the most recent older variable. What was taking around a dozen PROC SQL steps is now a single DATA step using hash tables.

Read the paper (PDF)

If you run SAS^® and Teradata software with default application and database client encodings, some operations with international character sets will appear to work because you are actually treating character strings as streams of bytes instead of streams of characters. As long as no one in the chain of custody tries to interpret the data as anything other than a stream of bytes, then data can sometimes flow in and out of a database without being altered, giving the appearance of correct operation. But when you need to compare a particular character to an international character, or when your data approaches the maximum size of a character field, then you will run into trouble. To correctly handle international character sets, every layer of software that touches the data needs to agree on the encoding of the data. UTF-8 encoding is a flexible way to handle many single-byte and multi-byte international character sets. To use UTF-8 encoding with international character sets, we need to configure the SAS session encoding, Teradata client encoding, and Teradata server encoding to all agree, so that they are handling UTF-8 encoded data. This paper shows you how to configure SAS and Teradata so that your applications run successfully with international characters.

Read the paper (PDF)

Do you have a complex report involving multiple tables, text items, and graphics that could best be displayed in a multi-tabbed spreadsheet format? The Output Delivery System (ODS) destination for Excel, introduced in SAS^® 9.4, enables you to create Microsoft Excel workbooks that easily integrate graphics, text, and tables, including column labels, filters, and formatted data values. In this paper, we examine the syntax used to generate a multi-tabbed Excel report that incorporates output from the REPORT, PRINT, SGPLOT, and SGPANEL procedures.

Read the paper (PDF)

This paper uses a simulation comparison to evaluate quantile approximation methods in terms of their practical usefulness and potential applicability in an operational risk context. A popular method in modeling the aggregate loss distribution in risk and insurance is the Loss Distribution Approach (LDA). Many banks currently use the LDA for estimating regulatory capital for operational risk. The aggregate loss distribution is a compound distribution resulting from a random sum of losses, where the losses are distributed according to some severity distribution and the number (of losses) distributed according to some frequency distribution. In order to estimate the regulatory capital, an extreme quantile of the aggregate loss distribution has to be estimated. A number of numerical approximation techniques have been proposed to approximate the extreme quantiles of the aggregate loss distribution. We use PROC SEVERITY to fit various severity distributions to simulated samples of individual losses from a preselected severity distribution. The accuracy of the approximations obtained is then evaluated against a Monte Carlo approximation of the extreme quantiles of the compound distribution resulting from the preselected severity distribution. We find that the second-order perturbative approximation, a closed-form approximation, performs very well at the extreme quantiles and over a wide range of distributions and is very easy to implement.

Read the paper (PDF)

With SAS^® Viya and SAS^® Cloud Analytic Services (CAS), SAS is moving into a new territory where SAS^® Analytics is accessible to popular scripting languages using open APIs. Python is one of those client languages. We demonstrate how to connect to CAS, run CAS actions, explore data, build analytical models, and then manipulate and visualize the results using standard Python packages such as Pandas and Matplotlib. We cover a wide variety of topics to give you a bird's eye view of what is possible when you combine the best of SAS with the best of open source.

Read the paper (PDF)

A Middle Eastern company is responsible for daily distribution of over 230 million liters of oil products. For this distribution network, a failure scenario is defined as occurring when oil transport is interrupted or slows down, and/or when product demands fluctuate outside the normal range. Under all failure scenarios, the company plans to provide additional transport capacity at minimum cost so as to meet all point-to-point product demands. Currently, the company uses a wait-and-see strategy, which carries a high operating cost and depends on the availability of third-party transportation. This paper describes the use of the OPTMODEL procedure to implement a mixed integer programming model to model and solve this problem. Experimental results are provided to demonstrate the utility of this approach. It was discovered that larger instances of the problem, with greater numbers of potential failure scenarios, can become computationally extensive. In order to efficiently handle such instances of the problem, we have also implemented a Benders decomposition algorithm in PROC OPTMODEL.

Read the paper (PDF)

Are you frustrated with manually setting options to control your SAS^® Display Manager sessions but become daunted every time you look at all the places you can set options and window layouts? In this paper, we look at various files SAS^® accesses when starting, what can (and cannot) go into them, and what takes precedence after all are executed. We also look at the SAS registry and how to programmatically change settings. By the end of the paper, you will be comfortable in knowing where to make the changes that best fit your needs.

Read the paper (PDF)

Data is a valuable corporate asset that, when managed improperly, can detract from a company's ability to achieve strategic goals. At 1-800-Flowers.com, Inc. (18F), we have embarked on a journey toward data governance through embracing Master Data Management (MDM). Along the path, we've recognized that in order to protect and increase the value of our data, we must take data quality into consideration at all aspects of data movement in the organization. This presentation discusses the ways that SAS^® Data Management is being leveraged by the team at 18F to create and enhance our data quality strategy to ensure data quality for MDM.

Read the paper (PDF)

This paper discusses the techniques I used at the US Census Bureau to overcome the issue of dealing with large amounts of data while modernizing some of their public-facing web applications by using service-oriented architecture (SOA) to deploy JavaScript web applications powered by SAS^®. The paper covers techniques that resulted in reducing 1,753,926 records (82 MB) down to 58 records (328 KB), a 99.6% size reduction in summarized data on the server side.

Read the paper (PDF)

Oi S.A. (Oi) is a pioneer in providing convergent services in Brazil. It currently has the greatest network capillarity and WiFi availability Brazil. The company offers fixed lines, mobile services, broadband, and cable TV. In order to improve service to over 70 million customers, The Customer Intelligence Department manages the data generated by 40,000 call center operators. The call center produces more than a hundred million records per month, and we use SAS^® Visual Analytics to collect, analyze, and distribute these results to the company. This new system changed the paradigm of data analysis in the company. SAS Visual Analytics is user-friendly and enabled the data analysis team to reduce IT time. Now it is possible to focus on business analysis. Oi started developing its SAS Visual Analytics project in June 2014. The test period lasted only 15 days and involved 10 people. The project became relevant to the company. It led us to the next step, in which 30 employees and 20 executives used the tool. During the last phase, we applied that to a larger scale with 300 users, including local managers, executives, and supervisors. The benefits brought by the fast implementation (two months) are many. We reduced the time it takes to produce reports by 80% and the time to complete business analysis by 40%.

Your SAS^® Visual Analytics users begin to create and share reports. As an administrator, you want to track performance of the reports over time, analyzing timing metrics for key tasks such as data query and rendering, relative to total user workload for the system. Logging levels can be set for the SAS Visual Analytics reporting services that provide timing metrics for each report execution. The log files can then be mined to create a data source for a time series plot in SAS Visual Analytics. You see report performance over time with peak workloads and how this impacts the user experience. Isolation on key metrics can identify performance bottlenecks for improvement. First we look at how logging levels are modified for the reporting services and focus on tracking a single user viewing a report. Next, we extract data from a long running log file to create a report performance data source. Using SAS Visual Analytics, we analyze the data with a time series plot, looking at times of peak work load and how the user experience changes.

Read the paper (PDF)

Access to care for Medicaid beneficiaries is a topic of frequent study and debate. Section 1202 of the Affordable Care Act (ACA) requires states to raise Medicaid primary care payment rates to Medicare levels in 2013 and 2014. The federal government paid 100% of the increase. This program was designed to encourage primary care providers to participate in Medicaid, since this has long been a challenge for Medicaid. Whether this fee increase has increased access to primary care providers is still debated. Using SAS^®, we evaluated whether Medicaid patients have a higher incidence of non-urgent visits to local emergency departments (ED) than do patients with other payment sources. The National Hospital Ambulatory Medical Care Survey (NHAMCS) data set, obtained from the Centers for Disease Control (CDC), was selected, since it contains data relating to hospital emergency departments. This emergency room data, for years 2003 2011, was analyzed by diagnosis, expected payment method, reason for the visit, region, and year. To evaluate whether the ED visits were considered urgent or non-urgent, we used the NYU Billings algorithm for classifying ED utilization (NYU Wagner 2015). Three models were used for the analyses: Binary Classification, Multi-Classification, and Regression. In addition to finding no regional differences, decision trees and SAS^® Visual Analytics revealed that Medicaid patients do not have a higher rate of non-emergent visits when compared to other payment types.

Read the paper (PDF)

One of the research goals in public health is to estimate the burden of diseases on the US population. We describe burden of disease by analyzing the statistical association of various diseases with hospitalizations, emergency department (ED) visits, ambulatory/outpatient (doctors' offices) visits, and deaths. In this short paper, we discuss the use of large, nationally representative databases, such as those offered by the National Center for Health Statistics (NCHS) or the Agency for Healthcare Research and Quality (AHRQ), to produce reliable estimates of diseases for studies. In this example, we use SAS^® and SUDAAN to analyze the Nationwide Emergency Department Sample (NEDS), offered by AHRQ, to estimate ED visits for hand, foot, and mouth disease (HFMD) in children less than five years old.

Read the paper (PDF) | View the e-poster or slides (PDF)

Chemical incidents involving irritant chemicals such as chlorine pose a significant threat to life and require rapid assessment. Data from the Validating Triage for Chemical Mass Casualty Incidents A First Step R01 grant was used to determine the most predictive signs and symptoms (S/S) for a chlorine mass casualty incident. SAS^® 9.4 was used to estimate sensitivity, specificity, positive and negative predictive values, and other statistics of irritant gas syndrome agent S/S for two exiting systems designed to assist emergency responders in hazardous material incidents (Wireless Information System for Emergency Responders (WISER) and CHEMM Intelligent Syndrome Tool (CHEMM-IST)). The results for WISER showed the sensitivity was .72 to 1.0; specificity .25 to .47; and the positive predictive value and negative predictive value were .04 to .87 and .33 to 1.0, respectively. The results for CHEMM-IST showed the sensitivity was .84 to .97; specificity .29 to .45; and the positive predictive value and negative predictive value were .18 to .42 and .86 to .97, respectively.

Read the paper (PDF) | View the e-poster or slides (PDF)

What will your customer do next? Customers behave differently; they are not all average. Segmenting your customers into different groups enables you to build more powerful and meaningful predictive models. You can use SAS^® Visual Analytics to instantaneously visualize and build your segments identified by a decision tree or cluster analysis with respect to customer attributes. Then, you can save the cluster/segment membership, and use that as a separate predictor or as a group variable for building stratified predictive models. Dividing your customer population into segments is useful because what drives one group of people to exhibit a behavior can be quite different from what drives another group. By analyzing the segments separately, you are able to reduce the overall error variance or noise in the models. As a result, you improve the overall performance of the predictive models. This paper covers the building and use of segmentation in predictive models and demonstrates how SAS Visual Analytics, with its point-and-click functionality and in-memory capability, can be used for an easy and comprehensive understanding of your customers, as well as predicting what they are likely to do next.

Read the paper (PDF)

Using shared accounts to access third-party database servers is a common architecture in SAS^® environments. SAS software can support seamless user access to shared accounts in databases such as Oracle and MySQL, via group definitions and outbound authentication domains in metadata. However, the configurations necessary to leverage shared accounts in Kerberized Hadoop clusters are more complicated. Kerberos tickets must often be generated and maintained in order to simply access the Hadoop environment, and those tickets must allow access as the shared account instead of as an individual user's account. In all cases, key prerequisites and configurations must be put into place in order for seamless Hadoop access to function with the shared account. Methods for implementing these arrangements in SAS environments can be non-intuitive. This paper starts by outlining general architectures of shared accounts in third-party database environments. It then presents several methods of managing remote access to shared accounts in Kerberized Hadoop environments using SAS, including specific implementation details, code samples, and security implications.

Read the paper (PDF)

Transformation of raw data into sensible and useful information for prediction purposes is a priceless skill nowadays. Vast amounts of data, easily accessible at each step in a process, gives us a great opportunity to use it for countless applications. Unfortunately, not all of the valuable data is available for processing using classical data mining techniques. What happens if textual data is also used to create the analytical base table (ABT)? The goal of this study is to investigate whether scoring models that also use textual data are significantly better than models that include only quantitative data. This thesis is focused on estimating the probability of default (PD) for the social lending platform kokos.pl. The same methods used in banks are used to evaluate the accuracy of reported PDs. Data used for analysis is gathered directly from the platform via the API. This paper describes in detail the steps of the data mining process that is built using SAS^® Enterprise Miner . The results of the study support the thesis that models with a properly conducted text-mining process have better classification quality than models without text variables. Therefore, the use of this data mining approach is recommended when input data includes text variables.

Read the paper (PDF)

In industrial systems, vibration signals are the most important measurements for indicating asset health. Based on these measurements, an engineer with expert knowledge about the assets, industrial process, and vibration monitoring can perform spectral analysis to identify failure modes. However, this is still a manual process that heavily depends on the experience and knowledge of the engineer analyzing the vibration data. Moreover, when measurements are performed continuously, it becomes impossible to act in real time on this data. The objective of this paper is to examine using analytics to perform vibration spectral analysis in real time to predict asset failures. The first step in this approach is to translate engineering knowledge and features into analytic features in order to perform predictive modeling. This process involves converting the time signal into the frequency domain by applying a fast Fourier transform (FFT). Based on the specific design characteristics of the asset, it is possible to derive the relevant features of the vibration signal to predict asset failures. This approach is illustrated using a bearing data set available from the Prognostics Data Repository of the National Aeronautics and Space Administration (NASA). Modeling is done using R and is integrated within SAS^® Asset Performance Analytics. In essence, this approach helps the engineers to make better data-driven decisions. The approach described in this paper shows the strength of combining ex

Read the paper (PDF)

Panel data, which are collected on a set (panel) of individuals over several time points, are ubiquitous in economics and other analytic fields because their structure allows for individuals to act as their own control groups. The PANEL procedure in SAS/ETS^® software models panel data that have a continuous response, and it provides many options for estimating regression coefficients and their standard errors. Some of the available estimation methods enable you to estimate a dynamic model by using a lagged dependent variable as a regressor, thus capturing the autoregressive nature of the underlying process. Including lagged dependent variables introduces correlation between the regressors and the residual error, which necessitates using instrumental variables. This paper guides you through the process of using the typical estimation method for this situation-the generalized method of moments (GMM)-and the process of selecting the optimal set of instrumental variables for your model. Your goal is to achieve unbiased, consistent, and efficient parameter estimates that best represent the dynamic nature of the model.

Read the paper (PDF)

In many healthcare settings, patients are like customers they have a choice. One example is whether to participate in a procedure. In population-based screening in which the goal is to reduce deaths, the success of a program hinges on the patient's choice to accept and comply with the procedure. Like in many other industries, this not only relies on the program to attract new eligible patients to attend for the first time, but it also relies on the ability of the program to retain existing customers. The success of a new customer retention strategy within a breast screening environment is examined by applying a population averaged model (also know as marginal models), which uses generalized estimating equations (GEEs) to account for the lack of independence of the observations. Arguments for why a population average model was applied instead of a mixed effects model (or random effects model) are provided. This business case provides a great introductory session for people to better understand the difference between mixed effects and marginal models, and illustrates how to implement a population average model within SAS^® by using the GENMOD procedure.

Read the paper (PDF)

It is often necessary to assess multi-rater agreement for multiple-observation categories in case-controlled studies. The Kappa statistic is one of the most common agreement measures for categorical data. The purpose of this paper is to show an approach for using SAS^® 9.4 procedures and the SAS^® Macro Language to estimate Kappa with 95% CI for pairs of nurses that used two different triage systems during a computer-simulated chemical mass casualty incident (MCI). Data from the Validating Triage for Chemical Mass Casualty Incidents A First Step R01 grant was used to assess the performance of a typical hospital triage system called the Emergency Severity Index (ESI), compared with an Irritant Gas Syndrome Agent (IGSA) triage algorithm being developed from this grant, to quickly prioritize the treatment of victims of IGSA incidents. Six different pairs of nurses used ESI triage, and seven pairs of nurses used the IGSA triage prototype to assess 25 patients exposed to an IGSA and 25 patients not exposed. Of the 13 pairs of nurses in this study, two pairs were randomly selected to illustrate the use of the SAS Macro Language for this paper. If the data was not square for two nurses, a square-form table for observers using pseudo-observations was created. A weight of 1 for real observations and a weight of .0000000001 for pseudo-observations were assigned. Several macros were used to reduce programming. In this paper, we show only the results of one pair of nurses for ESI.

Read the paper (PDF) | View the e-poster or slides (PDF)

The challenge is to assign outbound calling agents in a telemarketing campaign to geographic districts. The districts have a variable number of leads, and each agent needs to be assigned entire districts with the total number of leads being as close as possible to a specified number for each of the agents (usually, but not always, an equal number). In addition, there are constraints concerning the distribution of assigned districts across time zones, in order to maximize productivity and availability. The SAS/OR^® CLP procedure solves the problem by formulating the challenge as a constraint satisfaction problem (CSP). Our use of PROC CLP places the actual leads within a specified percentage of the target number.

Read the paper (PDF)

The rapidly evolving informatics capabilities of the past two decades have resulted in amazing new data-based opportunities. Large public use data sets are now available for easy download and utilization in the classroom. Days of classroom exercises based on static, clean, easily maneuverable samples of 100 or less are over. Instead, we have large and messy real-world data at our fingertips allowing for educational opportunities not available in years past. There are now hundreds of public-use data sets available for download and analysis in the classroom. Many of these sources are survey-based and require the understanding of weighting techniques. These techniques are necessary for proper variance estimation allowing for sound inferences through statistical analysis. This example uses the California Health Interview Survey to present and compare weighted and non-weighted results using the SURVEYLOGISTIC procedure.

Read the paper (PDF)

The ODS EXCEL destination has made sharing SAS^® reports and graphs much easier. What is even more exciting is that this destination is available for use regardless of the platform. This is extremely useful when reporting is performed on remote servers. This presentation goes through the basics of using the ODS EXCEL destination and shows specific examples of how to use this in a remote environment. Examples for both SAS^® on Windows and in SAS^® Enterprise Guide^® are provided.

Read the paper (PDF)

Students now have access to a SAS^® learning tool called SAS^® University Edition. This online tool is freely available to all, for non-commercial use. This means it is basically a free version of SAS that can be used to teach yourself or someone else how to use SAS. Since a large part of my body of writings has focused upon moving data between SAS and Microsoft Excel, I thought I would take some time to highlight the tasks that permit movement of data between SAS and Excel using SAS University Edition. This paper is directed toward sending graphs to Excel using the new ODS EXCEL destination.

Read the paper (PDF)

Graphs are mathematical structures capable of representing networks of objects and their relationships. Clustering is an area in graph theory where objects are split into groups based on their connections. Depending on the application domain, object clusters have various meanings (for example, in market basket analysis, clusters are families of products that are frequently purchased together). This paper provides a SAS^® macro featuring PROC OPTGRAPH, which enables the transformation of transactional data, or any data with a many-to-many relationship between two entities, into graph data, allowing for the generation and application of the co-occurrence graph and the probability graph.

Read the paper (PDF)

JMP^® integrates very nicely with SAS^® software, so you can do some pretty amazing things by combining the power of JMP and SAS. You can submit some code to run something on a SAS server and bring the results back as a JMP table. Then you can do lots of things with the JMP table to analyze the data returned. This workshop shows you how to access data via SAS servers, run SAS code and bring data back to JMP, and use JMP to do many things very quickly and easily. Explore the synergies between these tools; having both is a powerful combination that far outstrips just having one, or not using them together.

Read the paper (PDF) | Download the data file (ZIP)

More than ever, customers are demanding consistent and relevant interaction across all channels. Businesses are having to develop omnichannel marketing capabilities to please these customers. Implementing omnichannel marketing is often difficult, especially when using digital channels. Most products designed solely for digital channels lack capabilities to integrate with traditional channels that have on-premises processes and data. SAS^® Customer Intelligence 360 is a new offering that enables businesses to leverage both cloud and on-premises channels and data. This is possible due to the solution's hybrid cloud architecture. This paper discusses the SAS Customer Intelligence 360 approach to the hybrid cloud, and covers key capabilities on security, throughput, and integration.

Read the paper (PDF)

Bivariate Cox proportional models are used when we test the association between a single covariate and the outcome. The test repeats for each covariate of interest. SAS^® uses the last category as the default reference. This raises problems when we want to keep using 0 as our reference for each covariate. The reference group can be changed in the CLASS statement. But, if a format is associated with a covariate, we have to use the corresponding format instead of raw numeric data. This problem becomes even worse when we have to repeat the test and manually enter the reference every single time. This presentation demonstrates one way of fixing the problem using the MACRO function and SYMPUT function.

Read the paper (PDF) | Download the data file (ZIP)

Increasingly, customers are using social media and other Internet-based applications such as review sites and discussion boards to voice their opinions and express their sentiments about brands. Such spontaneous and unsolicited customer feedback can provide brand managers with valuable insights about competing brands. There is a general consensus that listening to and reacting to the voice of the customer is a vital component of brand management. However, the unstructured, qualitative, and textual nature of customer data that is obtained from customers poses significant challenges for data scientists and business analysts. In this paper, we propose a methodology that can help brand managers visualize the competitive structure of a market based on an analysis of customer perceptions and sentiments that are obtained from blogs, discussion boards, review sites, and other similar sources. The brand map is designed to graphically represent the association of product features with brands, thus helping brand managers assess a brand's true strengths and weaknesses based on the voice of customers. Our multi-stage methodology uses the principles of topic modeling and sentiment analysis in text mining. The results of text mining are analyzed using correspondence analysis to graphically represent the differentiating attributes of each brand. We empirically demonstrate the utility of our methodology by using data collected from Edmunds.com, a popular review site for car buyers.

Read the paper (PDF)

SAS^® Theme Designer provides a rich set of colors and graphs that enables customers to create a custom application and report themes. Users can also preview their work within SAS^® Visual Analytics. The features of SAS Theme Designer enable the user to bring a new look and feel to their entire application and to their reports. Users can customize their reports to use a unique theme across the organization, yet they have the ability to customize these reports based on their individual business requirements. Providing this capability involves meeting the customers demands from the theming perspectives of customization, branding, and logo, and making them seamless within their application. This paper walks users through the process of using SAS Theme Designer in SAS Visual Analytics. It further highlights the following features of SAS Theme Designer: creating and modifying application and report themes, previewing output in SAS Visual Analytics, and importing and exporting themes for reuse.

Read the paper (PDF)

Visualization of complex data can be a valuable tool for researchers and policy makers, and Base SAS^® has powerful tools for such data exploration. In particular, SAS/GRAPH^® software is a flexible tool that enables the analyst to create a wide variety of data visualizations. This paper uses SAS^® to visualize complex demographic data related to the membership of a large American healthcare provider. Kaiser Permanente (KP) has demographic data about 4 million active members in Southern California. We use SAS to create a number of geographic visualizations of KP demographic data related to membership at the census-block level of detail and higher. Demographic data available from the US Census' American Community Survey (ACS) at the same level of geographic organization are also used as comparators to show how the KP membership differs from the demographics of the geographies from which it draws. In addition, we use SAS to create a number of visualizations of KP demographic data related to utilizations (inpatient and outpatient) at the medical center area level through time. As with the membership data, data available from the ACS is used as a comparator to show how patterns of KP utilizations at various medical centers compare to the demographics of the populations that these medical centers serve. The paper will be of interest to programmers learning how to use SAS to visualize data and to researchers interested in the demographics of one of the largest health care providers in the US.

Read the paper (PDF)

Over the years, the use of SAS^® has grown immensely within Royal Bank of Scotland (RBS), making platform support and maintenance overly complicated and time consuming. At RBS, we realized that we have been living 'war and peace' every day for many years and that the time has come to re-think how we support SAS platforms. With our approach to rationalize and consolidate the ways our organization uses SAS came the need to review and improve the processes and procedures we have in place. This paper explains why we did it, what we've changed or reinvented, and how all these have changed our way of operation by bringing us closer to DevOps and helping us to improve our relationship with our customers as well as building trust in the service we deliver.

Read the paper (PDF)

Weight of evidence (WOE) coding of a nominal or discrete variable X is widely used when preparing predictors for usage in binary logistic regression models. The concept of WOE is extended to ordinal logistic regression for the case of the cumulative logit model. If the target (dependent) variable has L levels, then L-1 WOE variables are needed to recode X. The appropriate setting for implementing WOE coding is the cumulative logit model with partial proportionate odds. As in the binary case, it is important to bin X to achieve parsimony before the WOE coding. SAS^® code to perform this binning is discussed. An example is given that shows the implementation of WOE coding and binning for a cumulative logit model with the target variable having three levels.

Read the paper (PDF)

Health care has long been focused on providing reactive care for illness, injury, or chronic conditions. But the rising cost of providing health care has forced many countries, health insurance payers, and health care providers to shift approaches. A new focus on patient value includes providing financial incentives that emphasize clinical outcomes instead of treatments. This focus also means that providers and wellness programs are required to take a segmentation approach to the population under their care, targeting specific people based on their individual risks. This session discusses the benefits of a shift from thinking about health care data as a series of clinical or financial transactions, to one that is centered on patients and their respective clinical conditions. This approach allows for insights pertaining to care delivery processes and treatment patterns, including identification of potentially avoidable complications, variations in care provided, and inefficient care that contributes to waste. All of which contributes to poor clinical outcomes.

Read the paper (PDF)

In the last few years, machine learning and statistical learning methods have gained increasing popularity among data scientists and analysts. Statisticians have sometimes been reluctant to embrace these methodologies, partly due to a lack of familiarity, and partly due to concerns with interpretability and usability. In fact, statisticians have a lot to gain by using these modern, highly computational tools. For certain types of problems, machine learning methods can be much more accurate predictors than traditional methods for regression and classification, and some of these methods are particularly well suited for the analysis of big and wide data. Many of these methods have origins in statistics or at the boundary of statistics and computer science, and some are already well established in statistical procedures, including LASSO and elastic net for model selection and cross validation for model evaluation. In this talk, I go through some examples illustrating the application of machine learning methods and interpretation of their results, and show how these methods are similar to, and differ from, traditional statistical methodology for the same problems.

Read the paper (PDF)

It has become a need-it-now world, and many managers and decision-makers need their reports and information quicker than ever before to compete. As SAS^® developers, we need to acknowledge this fact and write code that gets us the results we need in seconds or minutes, rather than in hours. SAS is a great tool for extracting, transferring, and loading data, but as with any tool, it is most efficient when used in the most appropriate way. Using the SQL pass-through techniques presented in this paper can reduce run time by up to 90% by passing the processing to the database instead of moving the data back to SAS to be consumed. You can reap these benefits with only a minor increase in coding difficulty.

Read the paper (PDF) | View the e-poster or slides (PDF)

The latest releases of SAS^® Data Management software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types including Hadoop, cloud data sources, RDBMS, files, unstructured data, streaming, and others, and the ability to perform ETL and ELT transformations in diverse run-time environments including SAS^®, database systems, Hadoop, Spark, SAS^® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features for enhancements to master data and support metadata management. This paper provides an overview of the latest features of the SAS^® Data Management product suite and includes use cases and examples for leveraging product capabilities.

Read the paper (PDF)

SAS^® Visual Analytics gives customers the power to quickly and easily make sense of any data that matters to them. SAS^® Visual Analytics 7.4 delivers requested enhancements to familiar features. These enhancements include dynamic text, custom geographical regions, improved PDF printing, and enhanced prompted filter controls. There are also enhancements to report parameters and calculated data items. This paper provides an overview of the latest features of SAS Visual Analytics 7.4, including use cases and examples for leveraging these new capabilities.

Whether you are a new SAS^® administrator or you are switching to a Linux environment, you have a complex mission. This job becomes even more formidable when you are working with a system like SAS^® Visual Analytics that requires multiple users loading data daily. Eventually a user has data issues or creates a disruption that causes the system to malfunction. When that happens, what do you do next? In this paper, we go through the basics of a SAS Visual Analytics Linux environment and how to troubleshoot the system when issues arise.

Read the paper (PDF)

Have you ever been working on a task and wondered whether there might be a SAS^® function that could save you some time? Let alone, one that might be able to do the work for you? Data review and validation tasks can be time-consuming efforts. Any gain in efficiency is highly beneficial, especially if you can achieve a standard level where the data itself can drive parts of the process. The ANY and NOT functions can help alleviate some of the manual work in many tasks such as data review of variable values, data compliance, data formats, and derivation or validation of a variable's data type. The list goes on. In this poster, we cover the functions and their details and use them in an example of handling date and time data and mapping it to ISO 8601 date and time formats.

Read the paper (PDF) | View the e-poster or slides (PDF)

In the world of gambling, superstition drives behavior, which can be difficult to explain. Conflicting evidence suggests that slot machines, like BCLC's Cleopatra, perform well regardless of where they are placed on a casino floor. Other evidence disputes this, arguing that performance is driven by their strategic placement (for example, in high-traffic areas). We explore and quantify the location sensitivity of slot machines by leveraging SAS^® to develop robust models. We test various methodologies and data import techniques (such as casino CAD floor plans) to unlock some of the nebulous concepts of player behavior, product performance, and superstition. By demystifying location sensitivity, key drivers of performance can be identified to aid in optimizing the placement of slot machines.

Read the paper (PDF)

We are at a tipping point for credit risk modeling. To meet the technical and regulatory challenges of IFRS 9 and stress testing, and to strengthen model risk management, CBS aims to create an integrated, end-to-end, tools-based solution across the model lifecycle, with strong governance and controls and an improved scenario testing and forecasting capability. SAS has been chosen as the technology partner to enable CBS to meet these aims. A new predictive analytics platform combining well-known tools such as SAS^® Enterprise Miner , SAS^® Model Manager, and SAS^® Data Management alongside SAS^® Model Implementation Platform powered by SAS^® High-Performance Risk is being deployed. Driven by technology, CBS has also considered the operating model for credit risk, restructuring resources around the new technology with clear lines of accountability, and has incorporated a dedicated data engineering function within the risk modeling team. CBS is creating a culture of collaboration across credit risk that supports the development of technology-led, innovative solutions that not only meet regulatory and model risk management requirements but that set a platform for the effective use of predictive analytics enterprise-wide.

For the past couple of years, it seems that big data has been a buzzword in the industry. We have more and more data coming in from more and more places, and it is our job to figure out how best to handle it. One way to attempt to organize data is with arrays, but what do you do when the array you are attempting to populate is so large that it cannot be handled in memory? How do you handle a large array when most of the elements are missing? This paper and presentation deals with the concept of a sparse matrix. A sparse matrix is a large array with relatively few actual elements. We address methods for handling such a construct while keeping memory, CPU, clock, and programmer time to their respective minimums.

High-quality analytics works best with the best-quality data. Preparing your data ranges from activities like text manipulation and filtering to creating calculated items and blending data from multiple tables. This paper covers the range of activities you can easily perform to get your data ready. High-performance analytics works best with in-memory data. Getting your data into an in-memory server, as well as keeping it fresh and secure, are considerations for in-memory data management. This paper covers how to make small or large data available and how to manage it for analytics. You can choose to perform these activities in a graphical user interface or via batch scripts. This paper describes both ways to perform these activities. You ll be well-prepared to get your data wrangled into shape for analytics!

Read the paper (PDF)

A SAS^® program with the extension .SAS is simply a text file. This fact opens the door to many powerful results. You can read a typical SAS program into a SAS data set as a text file with a character variable, with one line of the program being one record in the data set. The program's code can be changed, and a new program can be written as a simple text file with a .SAS extension. This presentation shows an example of dynamically editing SAS code on the fly and generating statistics about SAS programs.

Read the paper (PDF)

SAS^® has many methods of doing table lookups in DATA steps: formats, arrays, hash objects, the SASMSG function, indexed data sets, and so on. Of these methods, hash objects and indexed data sets enable you to specify multiple lookup keys and to return multiple table values. Both methods can be updated dynamically in the middle of a DATA step as you obtain new information (such as reading new keys from an input file or creating new synthetic keys). Hash objects are very flexible, fast, and fairly easy to use, but they are limited by the amount of data that can be held in memory. Indexed data sets can be slower, but they are not limited by what can be held in memory. As a result, they might be your only option in some circumstances. This presentation discusses how to use an indexed data set for table lookup and how to update it dynamically using the MODIFY statement and its allies.

Read the paper (PDF)

The SAS RAKING macro, introduced in 2000, has been implemented by countless survey researchers worldwide. The authors receive messages from users who tirelessly rake survey data using all three generations of the macro. In this poster, we present the fourth generation of the macro, cleaning up remnants from the previous versions, and resolving user-reported confusion. Most important, we introduce a few helpful enhancements including: 1) An explicit indicator for trimming (or not trimming) the weight that substantially saves run time when no trimming is needed. 2) Two methods of weight trimming, AND and OR, that enable users to overcome a stubborn non-convergence. When AND is indicated, weight trimming occurs only if both (individual and global) high weight cap values are true. Conversely, weight increase occurs only if both low weight cap values are true. When OR is indicated, weight trimming occurs if either of the two (individual or global) high weight cap values is true. Conversely, weight increase occurs if either of the two low weight cap values is true. 3) Summary statistics related to the number of cases with trimmed or increased weights have been expanded. 4) We introduce parameters that enable users to use different criteria of convergence for different raking marginal variables. We anticipate that these innovations will be enthusiastically received and implemented by the survey research community.

View the e-poster or slides (PDF)

Global trade and more people and freight moving across international borders present border and security agencies with a difficult challenge. While supporting freedom of movement, agencies must minimize risks, preserve national security, guarantee that correct duties are collected, deploy human resources to the right place, and ensure that additional checks do not result in increased delays for passengers or cargo. To meet these objectives, border agencies must make the most efficient use of their data, which is often found across disparate intelligence sources. Bringing this data together with powerful analytics can help them identify suspicious events, highlight areas of risk, process watch lists, and notify relevant agents so that they can investigate, take immediate action to intercept illegal or high-risk activities, and report findings. With SAS^® Visual Investigator, organizations can use advanced analytical models and surveillance scenarios to identify and score events, and to deliver them to agents and intelligence analysts for investigation and action. SAS Visual Investigator provides analysts with a holistic view of people, cargo, relationships, social networks, patterns, and anomalies, which they can explore through interactive visualizations before capturing their decision and initiating an action.

Read the paper (PDF)

Since databases often lack the extensive string-handling capabilities available in SAS^®, SAS users are often forced to extract complex character data from the database into SAS for string manipulation. As database vendors make regular expression functionality more widely available for use in SQL, the need to move data into SAS for pattern matching, string replacement, and character extraction is necessary less often. This paper covers enough regular expression patterns to make you dangerous, demonstrates the various REGEXP SQL functions, and provides practical applications for each.

Read the paper (PDF)

How often have you pulled oodles of data out of the corporate data warehouse down into SAS^® for additional processing? Additional processing, sometimes thought to be unique to SAS, includes FIRST. logic, cumulative totals, lag functionality, specialized summarization, and advanced date manipulation. Using the Analytical/OLAP and Windowing functionality available in many databases (for example, Teradata and Netezza) all of this processing can be performed directly in the database without moving and reprocessing detail data unnecessarily. This presentation illustrates how to increase your coding and execution efficiency by using the database's power through your SAS environment.

Read the paper (PDF)

In 2013, the Centers for Medicare & Medicaid Services (CMS) changed the pharmacy mail-order member-acquisition process so that Humana Pharmacy may only call a member with cost savings greater than $2.00 to educate the member on the potential savings and instruct the member to call back. The Rx Education call center asked for analytics work to help prioritize member outreach, improve conversions, and decrease the number of members who are unable to be contacted. After a year of contacting members using this additional insight, the conversions after agreement rate rose from 71.5% to 77.5% and the unable to contact rate fell from 30.7% to 17.4%. This case study takes you on an analytics journey from the initial problem diagnosis and analytics solution, followed by refinements, as well as test and learn campaigns.