Data Scientist Papers A-Z

A
Session 1066-2017:
A Big-Data Challenge: Visualizing Social Media Trends about Cancer Using SAS® Text Miner
Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a big data conference, data scientists from several companies were invited to participate in tackling this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach using SAS® analytical methods was superior to other big data approaches in investigating these trends.
Read the paper (PDF)
Scott Koval, Pinnacle Solutions, Inc
Yijie Li, Pinnacle Solutions, Inc
Mia Lyst, Pinnacle Solutions, Inc
Session 0823-2017:
A Hadoop Journey along the SAS® Road
As the open-source community has been taking the technology world by storm, especially in the big data space, large corporations such as SAS, IBM, and Oracle have been working to embrace this new, quickly evolving ecosystem to continue to foster innovation and to remain competitive. For example, SAS, IBM, and others have aligned with the Open Data Platform initiative and are continuing to build out Hadoop and Spark solutions. And, Oracle has partnered with Cloudera to create the Big Data Appliance. This movement challenges companies that are consuming these products to select the right products and support partners. The hybrid approach using all tools available seems to be the methodology chosen by most successful companies. West Corporation, an Omaha-based provider of technology-enabled communication solutions, is no exception. West has been working with SAS for 10 years in the ETL, BI, and advanced analytics space, and West began its Hadoop journey a year ago. This paper focuses on how West data teams use both technologies to improve customer experience in the interactive voice response (IVR) system by storing massive semi-structured call logs in HDFS and by building models that predict a caller's intent to route the caller more efficiently and to reduce customer effort, using familiar SAS code and the very user-friendly SAS® Enterprise Miner™.
Read the paper (PDF)
Sumit Sukhwani, West Corporation
Krutharth Peravalli, West Corporation
Amit Gautam, West Corporation
Session 0689-2017:
A Practical Guide to Getting Started with Propensity Scores
This presentation gives you the tools to begin using propensity scoring in SAS® to answer research questions involving observational data. It is for both those attendees who have never used propensity scores and those who have a basic understanding of propensity scores but are unsure how to begin using them in SAS. It provides a brief introduction to the concept of propensity scores, and then turns its attention to giving you the tips and resources you need to get started. The presentation walks you through how the code in the book 'Analysis of Observational Health Care Data Using SAS®', which was published by SAS Institute, is used to answer how a particular health care intervention impacted a health care outcome. It details how propensity scores are created and how propensity score matching is used to balance covariates between treated and untreated observations. With this case study in hand, you will feel confident that you have the tools necessary to begin answering some of your own research questions using propensity scores.
Read the paper (PDF)
Thomas Gant, Kaiser Permanente
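To give a flavor of the first step the presentation describes, here is a minimal sketch of generating propensity scores (the data set and variable names are hypothetical, and the book's own programs are more complete); the propensity score is simply the predicted probability of treatment:

   proc logistic data=study;
      class sex;
      model treated(event='1') = age sex comorbidity_count;
      output out=ps_data pred=pscore;   /* pscore = estimated propensity score */
   run;

Matching treated and untreated observations on pscore (for example, greedy 1:1 matching within a caliper) then balances the covariates; users of SAS/STAT® 14.2 or later can also look at PROC PSMATCH, which combines both steps.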
Session 0839-2017:
Accelerate Your Data Prep with SAS® Code Accelerator
Accelerate your data preparation by having your DS2 execute without translation inside the Teradata database or on the Hadoop platform with SAS® Code Accelerator. This presentation shows how easy it is to use SAS Code Accelerator via a live demonstration.
Read the paper (PDF)
Paul Segal, Teradata
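For readers who have not seen the accelerator invoked, it is a one-option change on an ordinary DS2 program (librefs and table names here are hypothetical; the thread program executes inside Teradata or Hadoop when acceleration is available):

   proc ds2 ds2accel=yes;                      /* request in-database execution */
      thread prep / overwrite=yes;
         method run();
            set tdlib.transactions;            /* table living in the database */
            amount_usd = amount * fx_rate;     /* row-level data preparation   */
         end;
      endthread;
      data tdlib.transactions_prep (overwrite=yes);
         dcl thread prep t;
         method run();
            set from t;                        /* threads run where data lives */
         end;
      enddata;
   run;
   quit;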
Session SAS0245-2017:
Accessing DBMS with the GROOVY Procedure and a JDBC Connection
SAS/ACCESS® software grants access to data in third-party database management systems (DBMS), but how do you access data in DBMS not supported by SAS/ACCESS products? The introduction of the GROOVY procedure in SAS® 9.3 lets you retrieve this formerly inaccessible data through a JDBC connection. Groovy is an object-oriented, dynamic programming language executed on the Java Virtual Machine (JVM). Using Microsoft Azure HDInsight as an example, this paper demonstrates how to access and read data into a SAS data set using PROC GROOVY and a JDBC connection.
Read the paper (PDF)
Lilyanne Zhang, SAS
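A skeleton of the approach (the driver path, connection string, credentials, and table are placeholders): the Groovy program fetches rows over JDBC and lands them in a text file, which a DATA step then reads into a SAS data set:

   proc groovy classpath="/drivers/hive-jdbc-standalone.jar";  /* hypothetical JDBC driver jar */
      submit;
         import groovy.sql.Sql
         def db = Sql.newInstance("jdbc:hive2://myhdi.azurehdinsight.net:443/default",
                                  "user", "password", "org.apache.hive.jdbc.HiveDriver")
         new File("/tmp/extract.csv").withWriter { w ->
            db.eachRow("select id, amount from sales") { r ->
               w.writeLine("${r.id},${r.amount}")
            }
         }
         db.close()
      endsubmit;
   quit;

   data work.sales;                    /* read the landed file into SAS */
      infile "/tmp/extract.csv" dsd;
      input id amount;
   run;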
Session SAS1492-2017:
An Overview of SAS® Visual Data Mining and Machine Learning on SAS® Viya™
Machine learning is in high demand. Whether you are a citizen data scientist who wants to work interactively or you are a hands-on data scientist who wants to code, you have access to the latest analytic techniques with SAS® Visual Data Mining and Machine Learning on SAS® Viya™. This offering surfaces in-memory machine learning techniques such as gradient boosting, factorization machines, neural networks, and much more through its interactive visual interface, SAS® Studio tasks, procedures, and a Python client. Learn about this multi-faceted new product and see it in action.
Read the paper (PDF)
Jonathan Wexler, SAS
Susan Haller, SAS
Session SAS0752-2017:
Analytics Using SAS® Customer Intelligence: Multivariate Testing Processes for Digital Marketing Campaigns
This presentation illustrates new technology that has been added to the SAS® Customer Intelligence 360 analytics testing suite for digital campaign marketing. SAS Customer Intelligence 360 now has a multivariate testing (MVT) tool. In digital marketing, MVT has become an increasingly popular process by which multiple components of a campaign can be tested in a live environment with the goal of finding an optimal mix, which drives a defined response metric. In simple terms, MVT is equivalent to running numerous A/B tests performed simultaneously. In theory, MVT can test the effectiveness of limitless combinations of factors. The number of factor-level combinations determines the test duration and the number of samples required to make statistically sound predictions for all permutations of a full factorial design. SAS® applies experimental design analytics, driven by an interactive process that considers constraints and control cell definition, to guide the user toward an optimal reduced design of the test that can still adequately predict all factor-level combinations, given available resources. The results of the analysis enable the marketer to compare the responses of observed factor-level combinations with the predicted responses of untested combinations.
Read the paper (PDF)
Thomas Lehman, SAS
Session 0334-2017:
Analytics of Healthcare Things (AoHT) IS THE Next Generation of Real World Data
As you know, real world data (RWD) provides highly valuable and practical insights. But as valuable as RWD is, it still has limitations. It is encounter-based, and we are largely blind to what happens between encounters in the health-care system. The encounters generally occur in a clinical setting that might not reflect actual patient experience. Many of the encounters are subjective interviews, observations, or self-reports rather than objective data. Information flow can be slow (even real time is not fast enough in health care anymore). And some data that could be transformative cannot be captured currently. Select Internet of Things (IoT) data can fill the gaps in our current RWD for certain key conditions and provide missing components that are key to conducting Analytics of Healthcare Things (AoHT), such as direct, objective measurements; data collected in usual patient settings rather than artificial clinical settings; data collected continuously in a patient's setting; insights that carry greater weight in Regulatory and Payer decision-making; and insights that lead to greater commercial value. Teradata has partnered with an IoT company whose technology generates unique data for conditions impacted by mobility or activity. This data can fill important gaps and provide new insights that can help distinguish your value in your marketplace. Join us to hear details of successful pilots that have been conducted as well as ongoing case studies.
Read the paper (PDF)
Joy King, Teradata
Session SAS0282-2017:
Applying Text Analytics and Machine Learning to Assess Consumer Financial Complaints
The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable model of text analytics techniques to the publicly available CFPB data. Specifically, we use SAS® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS® Visual Analytics and SAS® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.
Read the paper (PDF)
Tom Sabo, SAS
Session SAS0514-2017:
Automated Hyperparameter Tuning for Effective Machine Learning
Machine learning predictive modeling algorithms are governed by hyperparameters that have no clear defaults agreeable to a wide range of applications. A few examples of quantities that must be prescribed for these algorithms are the depth of a decision tree, number of trees in a random forest, number of hidden layers and neurons in each layer in a neural network, and degree of regularization to prevent overfitting. Not only do ideal settings for the hyperparameters dictate the performance of the training process, but more importantly they govern the quality of the resulting predictive models. Recent efforts to move from a manual or random adjustment of these parameters have included rough grid search and intelligent numerical optimization strategies. This paper presents an automatic tuning implementation that uses SAS/OR® local search optimization for tuning hyperparameters of modeling algorithms in SAS® Visual Data Mining and Machine Learning. The AUTOTUNE statement in the NNET, TREESPLIT, FOREST, and GRADBOOST procedures defines tunable parameters, default ranges, user overrides, and validation schemes to avoid overfitting. Given the inherent expense of training numerous candidate models, the paper addresses efficient distributed and parallel paradigms for training and tuning in SAS® Viya™. It also presents sample tuning results that demonstrate improved model accuracy over default configurations and offers recommendations for efficient and effective model tuning.
Read the paper (PDF)
Patrick Koch, SAS
Brett Wujek, SAS
Oleg Golovidov, SAS
Steven Gardner, SAS
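For example, a gradient boosting model can be tuned in a single statement; the sketch below assumes a CAS session with a hypothetical table mycas.train and illustrative variables (see the paper for the full set of AUTOTUNE options):

   proc gradboost data=mycas.train;
      target bad / level=nominal;
      input income debt utilization / level=interval;
      autotune maxtime=1800                     /* cap total tuning time (seconds) */
               tuningparameters=(ntrees(lb=50 ub=300) learningrate maxdepth);
   run;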
B
Session SAS0761-2017:
Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis
As in any analytical process, data sampling is a definitive step for unstructured data analysis. Sampling is of paramount importance if your data is fed from social media reservoirs such as Twitter, Facebook, Amazon, and Reddit, where information proliferation happens minute by minute. Unless you have a sophisticated analytical engine and robust physical servers to handle billions of pieces of data, you can't use all your data for analysis without sampling. So, how do you sample textual data? The standard method is to generate either a simple random sample, or a stratified random sample if a stratification variable exists in the data. Neither of these two methods can reliably produce a representative sample of documents from the population data simply because the process does not encompass a step to ensure that the distribution of terms between the population and sample sets remains similar. This shortcoming can cause the supervised or unsupervised learning to yield inaccurate results. If the generated sample is not representative of the population data, it is difficult to train and validate categories or sub-categories for those rare events during taxonomy development. In this paper, we show you new methods for sampling text data. We rely on a term-by-document matrix and SAS® macros to generate representative samples. Using these methods helps generate sufficient samples to train and validate every category in rule-based modeling approaches using SAS® Contextual Analysis.
Read the paper (PDF)
Murali Pagolu, SAS
Session 0921-2017:
Bridging the Gap between Agile Model Development and IT Productionisation
Often the burden of productionisation of analytical models falls upon the analyst, so every Monday morning the analyst comes in and presses the Run button. Now this is obviously fraught with danger (for example, the source data isn't available, the analyst goes on holidays, or the analyst resigns), and might lead to invalid results being consumed by downstream systems. There are many reasons that this might occur, but the most common one is that it takes IT too long to put a model into full production (especially if that model contains new data sources). In this presentation, I show a tested architecture that allows for the typical rapid development of models (and in fact it actually significantly speeds up the discovery phase), as well as allows for an orderly handover to IT for them to productionise without disrupting the regular run of the models. This allows for notification of downstream users if there is a delay in the arrival of data, as well as rapid IT Operations response if there is a problem during data loading and model creation.
Paul Segal, Teradata
Session SAS0474-2017:
Building Bayesian Network Classifiers Using the HPBNET Procedure
A Bayesian network is a directed acyclic graphical model that represents probability relationships and conditional independence structure between random variables. SAS® Enterprise Miner implements a Bayesian network primarily as a classification tool; it includes naïve Bayes, tree-augmented naïve Bayes, Bayesian-network-augmented naïve Bayes, parent-child Bayesian network, and Markov blanket Bayesian network classifiers. The HPBNET procedure uses a score-based approach and a constraint-based approach to model network structures. This paper compares the performance of Bayesian network classifiers to other popular classification methods, such as classification tree, neural network, logistic regression, and support vector machines. The paper also shows some real-world applications of the implemented Bayesian network classifiers and a useful visualization of the results.
Read the paper (PDF)
Ye Liu, SAS
Weihua Shi, SAS
Wendy Czika, SAS
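A minimal call, sketched from the description above (the data set, option values, and variable names are illustrative; consult the paper for the full syntax):

   proc hpbnet data=train structure=TAN maxparents=2 bestmodel;
      target bad;                       /* class variable to predict            */
      input x1-x8 / level=interval;
      output network=net;               /* learned structure, for visualization */
   run;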
Session 0929-2017:
Building a Member-Centric World from a Transactional Data Galaxy
Health insurers have terabytes of transactional data. However, transactional data does not tell a member-level story. Humana Inc. is often faced with requirements for tagging (identifying) members with various clinical conditions such as diabetes, depression, hypertension, hyperlipidemia, and various member-level utilization metrics. For example, Consumer Health Tags are built to identify the condition (that is, diabetes, hypertension, and so on) and to estimate the intensity of the disease using medical and pharmacy administrative claims data. This case study takes you on an analytics journey from initial problem diagnosis to the analytics solution using SAS®.
Read the paper (PDF)
Brian Mitchell, Humana Inc.
C
Session 1484-2017:
Can Incumbents Take the Digital Curve?!
Digital transformation and analytics for incumbents isn't a question of choice or strategy. It's a question of business survival. Go analytics!
Liav Geffen, Harel Insurance & Finance
Session 1254-2017:
Change in Themes of Billboard Top 100 Songs Over Time
Rapid advances in technology have empowered musicians all across the globe to share their music easily, resulting in intensified competition in the music industry. For this reason, musicians and record labels need to be aware of factors that can influence the popularity of their songs. The focus of our study is to determine how themes, topics, and terms within song lyrics have changed over time and how these changes might have influenced the popularity of songs. Moreover, we plan to run time series analysis on the numeric attributes of Billboard Top 100 songs in order to determine the appropriate combination of relevant attributes that influences a song's popularity. The findings of our study can potentially benefit musicians and record labels in understanding the necessary lyrical construction, overall themes, and topics that might enable a song to reach the highest chart position on the Billboard Top 100. The Billboard Top 100 is an optimal source of data, as it is an objective measure of popularity. Our data has been collected from open sources. Our data set consists of all 334,784 Billboard Top 100 observations for the years 1955-2015, with metadata covering all 26,869 unique songs that have appeared on the chart for that period. Our expanding lyric data set currently contains 18,002 of those songs, which were used to conduct our analysis. SAS® Enterprise Miner and SAS® Sentiment Analysis Studio were the primary tools of our analysis.
View the e-poster or slides (PDF)
Jayant Sharma, Oklahoma State University
John Harden, Sandia National Laboratories
Session SAS1414-2017:
Churn Prevention in the Telecom Services Industry: A Systematic Approach to Prevent B2B Churn Using SAS®
"It takes months to find a customer and only seconds to lose one" (Unknown). Though the Business-to-Business (B2B) churn problem might not be as common as Business-to-Consumer (B2C) churn, it has become crucial for companies to address this effectively as well. Using statistical methods to predict churn is the first step in the process of retaining customers, which also includes model evaluation, prescriptive analytics (including outreach optimization), and performance reporting. Providing visibility into model and treatment performance enables the Data and Ops teams to tune models and adjust treatment strategy. West Corporation's Center for Data Science (CDS) has partnered with one of the lines of businesses in order to measure and prevent B2B customer churn. CDS has coupled firmographic and demographic data with internal CRM and past outreach data to build a Propensity to Churn model using SAS®. CDS has provided the churn model output to an internal Client Success Team (CST), who focuses on high-risk/high-value customers in order to understand and provide resolution to any potential concerns that might be expressed by such customers. Furthermore, CDS automated weekly performance reporting using SAS and Microsoft Excel that not only focuses on model statistics, but also on CST actions and impact. This paper focuses on all of the steps involved in the churn-prevention process, including building and reviewing the model, treatment design and implementation, as well as performance reporting.
Krutharth Peravalli, West Corporation
Dmitriy Khots, West Corporation
Session 1409-2017:
Classification Decision Accuracy and Consistency Using SAS/IML® Software
In this paper, we introduce a SAS/IML® program for Classification Accuracy and Classification Consistency (CA/CC) that provides useful resources to test analysts and psychometricians. Our program extends SAS® functionality by offering the CA/CC statistics not only for dichotomous items, but also for polytomous items. Classification Decision (CD) is a method to categorize examinees into achievement groups based on cut scores (Quinn and Cheng, 2013). CD has been predominantly used in educational and vocational situations such as admissions, selection, placement, or certification. This method needs to be accurate because its use has been important to examinees' professional and academic futures. Classification Accuracy and Classification Consistency (CA/CC) statistics are indices representing the precision of CD, and they need to be reported in order to affirm the validity of the CD. Classification Accuracy refers to the degree to which the classification of observed scores matches the classification of true scores, and Classification Consistency is defined as the degree to which examinees are classified in the same category when taking two parallel test forms (Lee, 2010). Under item response theory (IRT), there are two methods to calculate CA/CC: the Rudner (2001) and Lee (2010) approaches. This research deals with these two approaches for CA/CC at the examinee level.
View the e-poster or slides (PDF)
Sung-Hyuck Lee, ACT
Kyung Yong Kim, University of Iowa
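To give a flavor of the computation, the following SAS/IML sketch implements a Rudner-style estimate under a normal approximation: each examinee's probability of falling in each achievement level is computed from an ability estimate and its standard error (the data set, cut scores, and three-level setup are hypothetical):

   proc iml;
      use scores;  read all var {theta se};  close scores;  /* one row per examinee */
      cuts = {-0.5 0.5};                     /* cut scores defining 3 levels        */
      n = nrow(theta);  k = ncol(cuts) + 1;
      P = j(n, k, 0);                        /* P[i,c] = Pr(examinee i in level c)  */
      do i = 1 to n;
         lo = 0;
         do c = 1 to k-1;
            hi = cdf("Normal", cuts[c], theta[i], se[i]);
            P[i,c] = hi - lo;  lo = hi;
         end;
         P[i,k] = 1 - lo;
      end;
      level = (theta > cuts[1]) + (theta > cuts[2]) + 1;  /* observed classification */
      acc = j(n, 1, 0);
      do i = 1 to n;  acc[i] = P[i, level[i]];  end;
      CA = mean(acc);                        /* classification accuracy             */
      CC = mean((P#P)[,+]);                  /* consistency: two parallel forms agree */
      print CA CC;
   quit;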
Session 1312-2017:
Construction of a Disease Network and a Prediction Model for Dementia
Most studies of human disease networks have estimated the associations between disorders primarily from gene or protein information. Those studies, however, face difficulties because of the massive volume of data and the huge computational cost. Instead, we constructed a human disease network that can describe the associations between diseases, using the claims data of Korean health insurance. Through several statistical analyses, we show the applicability and suitability of the disease network. Furthermore, we develop a statistical model that can predict the prevalence rate of dementia by using significant associations of the network from a statistical perspective.
Read the paper (PDF)
Jinwoo Cho, Sung Kyun Kwan University
Session SAS0687-2017:
Convergence of Big Data, the Cloud, and Analytics: A Docker Toolbox for the Data Scientist
Learn how SAS® works with analytics, big data, and the cloud all in one product: SAS® Analytics for Containers. This session describes the architecture of containers running in the public, private, or hybrid cloud. The reference architecture also shows how SAS leverages the distributed compute of Hadoop. Topics include how SAS products such as Base SAS®, SAS/STAT® software, and SAS/GRAPH® software can all run in a container in the cloud. This paper discusses how to work with a SAS container running in a variety of Infrastructure as a Service (IaaS) models, including Amazon Web Services and OpenStack cloud. Additional topics include provisioning web-browser-based clients via Jupyter Notebooks and SAS® Studio to provide data scientists with the tool of their choice. A customer use case is discussed that describes how SAS Analytics for Containers enables an IT department to meet the ad hoc, compute-intensive, and scaling demands of the organization. An exciting differentiator for the data scientist is the ability to send some or all of the analytic workload to run inside their Hadoop cluster by using the SAS accelerators for Hadoop. Doing so enables data scientists to dive inside the data lake and harness the power of all the data.
Read the paper (PDF)
Donna De Capite, SAS
Session 0142-2017:
Create a Unique Datetime Stamp for Filenames or Many Other Purposes
This paper shows how to use Base SAS® to create unique datetime stamps that can be used for naming external files. These filenames provide automatic versioning for systems and are intuitive and completely sortable. In addition, they provide enhanced flexibility compared to generation data sets, which can be created by SAS® or by the operating system.
Read the paper (PDF)
Joe DeShon, Boehringer Ingelheim Animal Health
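The core idea can be sketched in a few lines (the directory and file naming are illustrative); the stamp sorts correctly because its fields run from most to least significant:

   data _null_;
      call symputx('dtstamp',
           put(date(), yymmddn8.) || '_' || compress(put(time(), tod8.), ':'));
   run;

   filename rpt "/reports/sales_&dtstamp..csv";   /* e.g., sales_20170402_093015.csv */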
D
Session SAS0118-2017:
DATA Step in SAS® Viya™: Essential New Features
The DATA step is the familiar and powerful data processing language in SAS® and now SAS® Viya™. The DATA step's simple syntax provides row-at-a-time operations to edit, restructure, and combine data. New to the DATA step in SAS Viya are a varying-size character data type and parallel execution. Varying-size character data enables intuitive string operations that go beyond the 32KB limit of current DATA step operations. Parallel execution speeds the processing of big data by starting the DATA step on multiple machines and dividing data processing among threads on these machines. To avoid multi-threaded programming errors, the run-time environment for the DATA step is presented along with potential programming pitfalls. Come see how the DATA step in SAS Viya makes your data processing simpler and faster.
Read the paper (PDF)
Jason Secosky, SAS
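Both new features are visible in a few lines of code (the CAS session and input table are hypothetical):

   cas mysess;                            /* start a CAS session                    */
   libname mycas cas;                     /* CAS engine libref                      */

   data mycas.notes_out;
      length note varchar(*);             /* varying-size character: no 32KB limit  */
      set mycas.notes_in;                 /* runs in parallel across threads/nodes  */
      note = catx(' ', note, 'reviewed');
      thread = _threadid_;                /* which thread processed this row        */
   run;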
Session SAS0677-2017:
Developing Your Own SAS® Studio Custom Tasks for Advanced Analytics
Standard SAS® Studio tasks already include many advanced analytic procedures for data mining and other high-performance models, enabling point-and-click generation and execution of SAS® code. However, you can extend the power of tasks by creating tasks of your own to enable point-and-click access to the latest SAS statistical procedures, to your own default model definitions, or to your previously developed SAS/STAT® or SAS macro code. Best of all, these point-and-click tasks can be developed directly in SAS Studio without the need to compile binaries or build DLL files using third-party software. In this paper, we demonstrate three approaches to developing custom tasks. First, we build a custom task to provide point-and-click access to PROC IRT, including recently added functionality to PROC IRT used to analyze educational test and opinion survey data. Second, we build a custom task that calls a macro for previously developed SAS code, and we show how point-and-click options can be set up to allow users to guide the execution of complex macro code. Third, we demonstrate just enough of the underlying Apache Velocity Template Language code to enable developers to take advantage of the benefits of that language to support their SAS process. Finally, we show how these tasks can easily be shared with a user community, increasing the efficiency of analytic modeling across the organization.
Read the paper (PDF)
Elliot Inman, SAS
Olivia Wright, SAS
E
Session 1139-2017:
Enabling Advanced Customer Value Management with SAS®
Join this breakout session hosted by the Customer Value Management team from Saudi Telecommunications Company to understand the journey we took with SAS® to evolve from simple below-the-line campaign communication to advanced customer value management (CVM). Learn how the team leveraged SAS tools ranging from SAS® Enterprise Miner to SAS® Customer Intelligence Suite in order to gain a deeper understanding of customers and move toward targeting customers with the right offer at the right time through the right channel.
Read the paper (PDF)
Noorulain Malik, Saudi Telecom
Session 0986-2017:
Estimation of Student Growth Percentile Using SAS® Procedures
Student growth percentile (SGP) is one of the most widely used score metrics for measuring a student's academic growth. Using longitudinal data, SGP describes a student's growth as the relative standing among students who had a similar level of academic achievement in previous years. Although several models for SGP estimation have been introduced, and some have been implemented in R, no studies have yet described how to do so with SAS®. As a result, this research describes various types of SGP models and demonstrates how practitioners can use SAS procedures to fit them. Specifically, this study covers three types of statistical models for SGP: 1) a quantile regression-based model; 2) a conditional cumulative density function-based model; and 3) a multidimensional item response theory-based model. Each of the three models partly uses procedures in SAS, such as PROC QUANTREG, PROC LOGISTIC, PROC TRANSREG, PROC IRT, or PROC MCMC, for its computation. The program code is illustrated using a simulated longitudinal data set over two consecutive years, which is generated by SAS/IML®. In addition, the interpretation of the estimation results and the advantages and disadvantages of implementing these three approaches in SAS are discussed.
View the e-poster or slides (PDF)
Hongwook Suh, ACT
Robert Ankenmann, The University of Iowa
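As a sketch of the quantile regression-based approach (variable names and the coarse decile grid are illustrative; operational SGP models use finer grids and spline terms), fit conditional percentiles of the current-year score given the prior-year score, then assign each student the highest percentile whose fitted value does not exceed the observed score:

   proc quantreg data=growth;
      model score_y2 = score_y1 score_y1*score_y1
            / quantile=0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9;
      output out=qfit p=q;             /* q1-q9: fitted conditional percentiles */
   run;

   data sgp;
      set qfit;
      array q{9};
      sgp = 0;
      do k = 1 to 9;
         if score_y2 >= q{k} then sgp = k*10;
      end;
   run;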
Session SAS0587-2017:
Exploring the Art and Science of SAS® Text Analytics: Best Practices in Developing Rule-Based Models
Traditional analytical modeling, with roots in statistical techniques, works best on structured data. Structured data enables you to impose certain standards and formats in which to store the data values. For example, a variable indicating gas mileage in miles per gallon should always be a number (for example, 25). However, with unstructured data analysis, the free-form text no longer limits you to expressing this information in only one way (25 mpg, twenty-five mpg, and 25M/G). The nuances of language, context, and subjectivity of text make it more complex to fit generalized models. Although statistical methods using supervised learning prove efficient and effective in some cases, sometimes you need a different approach. These situations are when rule-based models with Natural Language Processing capabilities can add significant value. In what context would you choose a rule-based modeling versus a statistical approach? How do you assess the tradeoffs of choosing a rule-based modeling approach with higher interpretability versus a statistical model that is black-box in nature? How can we develop rule-based models that optimize model performance without compromising accuracy? How can we design, construct, and maintain a complex rule-based model? What is a data-driven approach to rule writing? What are the common pitfalls to avoid? In this paper, we discuss all these questions based on our experiences working with SAS® Contextual Analysis and SAS® Sentiment Analysis.
Read the paper (PDF)
Murali Pagolu, SAS
Cheyanne Baird, SAS
Christina Engelhardt, SAS
F
Session SAS0388-2017:
Factorization Machines: A New Tool for Sparse Data
Factorization machines are a new type of model that is well suited to very high-cardinality, sparsely observed transactional data. This paper presents the new FACTMAC procedure, which implements factorization machines in SAS® Visual Data Mining and Machine Learning. This powerful and flexible model can be thought of as a low-rank approximation of a matrix or a tensor, and it can be efficiently estimated when most of the elements of that matrix or tensor are unknown. Thanks to a highly parallel stochastic gradient descent optimization solver, PROC FACTMAC can quickly handle data sets that contain tens of millions of rows. The paper includes examples that show you how to use PROC FACTMAC to recommend movies to users based on tens of millions of past ratings, predict whether fine food will be highly rated by connoisseurs, restore heavily damaged high-resolution images, and discover shot styles that best fit individual basketball players.
Read the paper (PDF)
Jorge Silva, SAS
Ray Wright, SAS
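The movie recommendation example reduces to a short program (the CAS libref and columns are assumed to follow a MovieLens-style layout):

   proc factmac data=mycas.ratings nfactors=20 learnstep=0.15 maxiter=20;
      input userid itemid / level=nominal;    /* high-cardinality factors */
      target rating / level=interval;
      output out=mycas.scored copyvars=(userid itemid rating);
   run;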
Session 0202-2017:
Fitting a Flexible Model for Longitudinal Count Data Using the NLMIXED Procedure
Longitudinal count data arise when a subject's outcomes are measured repeatedly over time. Repeated measures count data have an inherent within-subject correlation that is commonly modeled with random effects in the standard Poisson regression. A Poisson regression model with random effects is easily fit in SAS® using existing options in the NLMIXED procedure. This model allows for overdispersion via the nature of the repeated measures; however, departures from equidispersion can also exist due to the underlying count process mechanism. We present an extension of the cross-sectional COM-Poisson (CMP) regression model established by Sellers and Shmueli (2010) (a generalized regression model for count data in light of inherent data dispersion) to incorporate random effects for analysis of longitudinal count data. We detail how to fit the CMP longitudinal model via a user-defined log-likelihood function in PROC NLMIXED. We demonstrate the model flexibility of the CMP longitudinal model via simulated and real data examples.
Read the paper (PDF)
Darcy Morris, U.S. Census Bureau
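The heart of the method is the user-defined log-likelihood passed to the GENERAL() distribution. The sketch below (variable names and the series truncation point are illustrative, not the paper's exact code) fits a CMP model with a subject-level random intercept:

   proc nlmixed data=counts;
      parms b0=0 b1=0 lognu=0 s2u=1;
      nu = exp(lognu);                        /* nu>1 under-, nu<1 overdispersion */
      lambda = exp(b0 + b1*x + u);
      Z = 0;                                  /* normalizing constant, truncated  */
      do j = 0 to 100;
         Z = Z + exp(j*log(lambda) - nu*lgamma(j+1));
      end;
      ll = y*log(lambda) - nu*lgamma(y+1) - log(Z);
      model y ~ general(ll);
      random u ~ normal(0, s2u) subject=id;
   run;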
Session 0881-2017:
Fuzzy Matching and Predictive Models for Acquisition of New Customers
The acquisition of new customers is fundamental to the success of every business. Data science methods can greatly improve the effectiveness of acquiring prospective customers and can contribute to the profitability of business operations. In a business-to-business (B2B) setting, a predictive model might target business prospects as individual firms listed by, for example, Dun & Bradstreet. A typical acquisition model can be defined using a binary response with the values categorizing a firm's customers and non-customers, for which it is then necessary to identify which of the prospects are actually customers of the firm. The methods of fuzzy logic, for example, based on the distance between strings, might help in matching customers' names and addresses with the overall universe of prospects. However, two errors can occur: false positives (when the prospect is incorrectly classified as a firm's customer), and false negatives (when the prospect is incorrectly classified as a non-customer). In the current practice of building acquisition models, these errors are typically ignored. In this presentation, we assess how these errors affect the performance of the predictive model as measured by a lift. In order to improve the model's performance, we suggest using a pre-determined sample of correct matches and to calibrate its predicted probabilities based on actual take-up rates. The presentation is illustrated with real B2B data and includes elements of SAS® code that was used in the research.
Read the paper (PDF)
Daniel Marinescu, Concordia University
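A common building block for this kind of name matching is generalized edit distance; a minimal sketch (tables, columns, and the cutoff are hypothetical, and the cutoff should be calibrated against the labeled sample described above):

   proc sql;
      create table candidate_matches as
      select p.prospect_id, c.customer_id,
             compged(upcase(strip(p.name)), upcase(strip(c.name)), 300) as gedist
      from prospects as p cross join customers as c
      where calculated gedist <= 150;      /* tune on a sample of known matches */
   quit;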
G
Session 0187-2017:
Guidelines for Protecting Your Computer, Network, and Data from Malware Threats
Because many SAS® users either work for or own companies that house big data, the threat that malicious software poses becomes even more extreme. Malicious software, often abbreviated as malware, includes many different classifications, ways of infection, and methods of attack. This E-Poster highlights the types of malware, detection strategies, and removal methods. It provides guidelines to secure essential assets and prevent future malware breaches.
Read the paper (PDF) | View the e-poster or slides (PDF)
Ryan Lafler
H
Session SAS2010-2017:
Hands-On Workshop: Accessing and Manipulating Data in SAS® Viya™
In this course, you will learn how to access and manage SAS and Microsoft Excel data in SAS® Viya™.
Davetta Dunlap, SAS
Session SAS2005-2017:
Hands-On Workshop: Exploring SAS Visual Analytics on SAS® Viya™
Nicole Ball, SAS
Session SAS2002-2017:
Hands-On Workshop: SAS® Studio for SAS Programmers
This workshop provides hands-on experience with SAS® Studio. Workshop participants will use SAS's new web-based interface to access data, write SAS programs, and generate SAS code through predefined tasks. This workshop is intended for SAS programmers from all experience levels.
Stacey Syphus, SAS
Session SAS2004-2017:
Hands-On Workshop: Text Mining using SAS® Text Miner
This workshop provides hands-on experience using SAS® Text Miner. For a collection of documents, workshop participants will learn how to: read and convert documents for use by SAS Text Miner; retrieve information from the collection using query features of the software; identify the dominant themes and concepts in the collection; and classify documents having pre-assigned categories.
Terry Woodfield, SAS
Session 1082-2017:
Hash Objects: When LAGging Behind Just Doesn't Work
When modeling time series data, we often use a LAG of the dependent variable. The LAG function works great for this, until you try to predict out into the future and need the model's predicted value from one record as an independent value for a future record. This paper examines exactly how the LAG function works, and explains why it doesn't in this case. It also explains how to create a hash object that will accomplish a LAG of any value, how to load the initial data, how to add predicted values to the hash object, and how to extract those values when needed as an independent variable for future observations.
Read the paper (PDF)
Andrea Wainwright-Zimmerman, Experis
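The essence of the technique is a hash object keyed by period, preloaded with history and updated with each new prediction so that the next iteration can look it up. A minimal sketch (table names and the fitted coefficients are hypothetical):

   data preds;
      if 0 then set history(keep=date value);     /* define host variables       */
      if _n_ = 1 then do;
         declare hash h(dataset:'history');       /* preload actual history      */
         h.defineKey('date');
         h.defineData('value');
         h.defineDone();
      end;
      set future;                                 /* one row per future period   */
      rc = h.find(key: intnx('month', date, -1)); /* fetch last period's value   */
      value = 0.8*value + 1.5;                    /* hypothetical fitted model   */
      rc = h.replace(key: date, data: value);     /* store prediction for reuse  */
   run;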
Session SAS0378-2017:
How SAS® Customers Are Using Hadoop: Year in Review
Another year implementing, validating, securing, optimizing, migrating, and adopting the Hadoop platform. What were the top 10 Hadoop accomplishments of the last year? We also review issues, concerns, and resolutions from the past year. We discuss where implementations are and some best practices for moving forward with Hadoop and SAS® releases.
Read the paper (PDF)
Howard Plemmons, SAS
Mauro Cazzari, SAS
Session 0340-2017:
How to Use SAS® to Filter Stock for Trade
Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one based on a methodology from so many candidates is usually a big challenge. This paper presents the methodology based on the CANSLIM theorem and the momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in the CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS®. This paper presents the methods on how to apply the logic and graphically present them through SAS. All SAS users, especially those who work directly on capital market business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.
Read the paper (PDF)
Brian Shen, Merlin Clinical Service LLC
I
Session SAS0668-2017:
I Am Multilingual: A Comparison of the Python, Java, Lua, and REST Interfaces to SAS® Viya™
The openness of SAS® Viya™, the new cloud analytic platform that uses SAS® Cloud Analytic Services (CAS), emphasizes a unified experience for data scientists. You can now execute the analytics capabilities of SAS® in different programming languages including Python, Java, and Lua, as well as use a RESTful endpoint to execute CAS actions directly. This paper provides an introduction to these programming languages. For each language, we illustrate how the API is surfaced from the CAS server, the types of data that you can upload to a CAS server, and the result tables that are returned. This paper also provides a comprehensive comparison of using these programming languages to build a common analytical process, including loading data to a CAS server; exploring, manipulating, and visualizing data; and building statistical and machine learning models.
Read the paper (PDF)
Xiangxiang Meng, SAS
Kevin Smith, SAS
Session SAS0645-2017:
Identifying Abnormal Equipment Behavior and Filtering Data near the Edge for IoT Applications
What if you had analytics near the edge for your Internet of Things (IoT) devices that would tell you whether a piece of equipment is operating within its normal range? And what if those same analytics could help you intelligently determine what data you should keep and what data should be filtered at the edge? This session focuses on classifying streaming data near the edge by showcasing a demo that implements a single-class classification model within a gateway device. The model identifies observations that are normal and abnormal to help determine possible machine issues and preventative maintenance opportunities. The classification also helps to provide a method for filtering data at the edge by capturing all abnormal data but taking only a sample of the normal operating data. The model is developed using SAS® Viya and implemented on a gateway device using SAS® Event Stream Processing. By using a single-class classification technique, the demo also illustrates how to avoid issues with binary classification that would require failure observations in order to build an accurate model. Problems that this demo addresses include: identifying potential and future equipment failures in near real time; filtering sensor data near the edge to prevent unnecessary transport and storage of less valuable data; and building a classification model for failure that doesn't require observations relating to failures.
Read the paper (PDF)
Ryan Gillespie, SAS
Robert Moreira, SAS
Session 1349-2017:
Inference from Smart Meter Data Using the Fourier Transform
This presentation demonstrates that applying the Fast Fourier Transform (FFT) to smart meter data can provide enhanced customer segmentation and discovery. The FFT is a mathematical method for transforming a function of time into a function of frequency. It is widely used in analyzing sound but is also relevant for utilities. Advanced Metering Infrastructure (AMI) refers to the full measurement and collection system that includes meters at the customer site and communication networks between the customer and the utility. With the inception of AMI, utilities experienced an explosion of data that provides vast analytical opportunities to improve reliability, customer satisfaction, and safety. However, the data explosion comes with its own challenges. The first challenge is the volume. Consider that just 20,000 customers with AMI data can reach over 300 GB of data per year. Simply aggregating the data from minutes to hours or even days can skew results and not provide accurate segmentations. The second challenge is the bad data that is being collected. Outliers caused by missing or incorrect reads, outages, or other factors must be addressed. FFT can eliminate this noise. The proposed framework is expected to identify various customer segments that could be used for demand response programs. The framework also has the potential to investigate diversion or fraud or failing meters (revenue protection), which is a big problem for many utilities.
Tom Anderson, SAS
Prasenjit Shil, Ameren
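In SAS, the periodogram behind this kind of frequency analysis is available through the SPECTRA procedure in SAS/ETS (table and variable names are hypothetical, and the series must be equally spaced per meter); the dominant frequencies per meter then become segmentation features:

   proc sort data=ami; by meter_id datetime; run;

   proc spectra data=ami out=spec p adjmean;   /* P_01 = periodogram of kwh */
      var kwh;
      by meter_id;
   run;

   proc sort data=spec; by meter_id descending p_01; run;
   data dominant;                    /* strongest cycle for each meter */
      set spec; by meter_id;
      if first.meter_id;
   run;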
Session 1513-2017:
Integrating SAS® Visual Analytics with Google Maps for Analysis and Information Visualization
Exploring, analyzing, and presenting information are strengths of SAS® Visual Analytics. However, when we need to extend the viewing of this information to a broad public outside the boundaries of the organization, we must add geographic information to facilitate the interaction and use of the information. This application uses the JavaScript and CSS programming languages, integrated with SAS® programming, to present information about 4,239 programs of postgraduate study in Brazil. This information was evaluated by the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES, Brazil) with cartographic precision, enabling the visualization of the data generated in SAS Visual Analytics integrated with Google Maps and Google Street View. Users can select from Brazilian postgraduate programs, view information about each program, learn about its theses and dissertations, and see the location of the institution and the campus. The application presented here can be accessed at http://goo.gl/uAjvGw.
Read the paper (PDF)
Marcus Palheta, CAPES
Sergio Costa Cortes, CAPES
Session SAS0539-2017:
Interactive Modeling in SAS® Visual Analytics
SAS® Visual Analytics has two offerings, SAS® Visual Statistics and SAS® Visual Data Mining and Machine Learning, that provide knowledge workers and data scientists an interactive interface for data partition, data exploration, feature engineering, and rapid modeling. These offerings are powered by the SAS® Viya platform, thus enabling big data and big analytic problems to be solved. This paper focuses on the steps a user would perform during an interactive modeling session.
Read the paper (PDF)
Don Chapman, SAS
Jonathan Wexler, SAS
J
Session 0933-2017:
Jumping and Cutting: Using the Hash Object to Implement a Polygon Clipping Algorithm
Data with a location component is naturally displayed on a map. Base SAS® 9.4 provides libraries of map data sets to assist in creating these images. Sometimes, a particular sub-region is all that needs to be displayed. SAS/GRAPH® software can create a new subset of the map using the GPROJECT procedure minimum and maximum latitude and longitude options. However, this method is capable only of cutting out a rectangular area. This paper presents a polygon clipping algorithm that can be used to create arbitrarily shaped custom map regions. Maps are nothing more than sets of polygons, defined by sets of border points. Here, a custom polygon shape overlays the map polygons and saves the intersection of the two. The DATA step hash object is used for easier bookkeeping of the added and deleted points needed to maintain the correct shape of the clipped polygons.
Read the paper (PDF)
Seth Hoffman, GEICO
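For contrast, the rectangular cutout that the paper goes beyond is produced with the GPROJECT clipping options (bounds are illustrative and given in degrees; note that the traditional SAS map data sets store west longitudes as positive values):

   proc gproject data=maps.states out=clipped project=none
                 latmin=33 latmax=37 longmin=75 longmax=84;
      id state;
   run;

The paper's hash-based algorithm replaces this fixed rectangle with an arbitrary clipping polygon.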
L
Session SAS0395-2017:
Location Analytics: Minority Report Is Here. Real-Time Geofencing Using SAS® Event Stream Processing
Geofencing is one of the most promising and exciting concepts that has developed with the advent of the internet of things. Like John Anderton in the 2002 movie Minority Report, you can now enter a mall and immediately receive commercial ads and offers based on your personal taste and past purchases. Authorities can track vessels' positions and detect when a ship is not in the area it should be, or they can forecast and optimize harbor arrivals. When a truck driver breaks from the route, the dispatcher can be alerted and can act immediately. And there are countless examples from manufacturing, industry, security, or even households. All of these applications are based on the core concept of geofencing, which consists of detecting whether a device's position is within a defined geographical boundary. Geofencing requires real-time processing in order to react appropriately. In this session, we explain how to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve high-performance processing, in terms of millions of events per second, over hundreds of millions of geofences.
Read the paper (PDF)
Frederic Combaneyre, SAS
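At the core of every geofence check is a point-in-polygon test. A DATA step version of the classic ray-crossing algorithm (fence vertices and the test point are illustrative; the streaming implementation in SAS Event Stream Processing is, of course, more elaborate):

   data _null_;
      array px{4} _temporary_ (0 10 10 0);      /* fence vertex longitudes */
      array py{4} _temporary_ (0 0 10 10);      /* fence vertex latitudes  */
      x = 3; y = 4; n = 4;                      /* device position         */
      inside = 0;  j = n;
      do i = 1 to n;                            /* count edge crossings    */
         if ((py{i} > y) ne (py{j} > y)) and
            (x < (px{j}-px{i}) * (y-py{i}) / (py{j}-py{i}) + px{i})
            then inside = not inside;
         j = i;
      end;
      put inside=;                              /* 1: inside the geofence  */
   run;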
M
Session SAS0434-2017:
Methods of Multinomial Classification Using Support Vector Machines
Many practitioners of machine learning are familiar with support vector machines (SVMs) for solving binary classification problems. Two established methods of using SVMs in multinomial classification are the one-versus-all approach and the one-versus-one approach. This paper describes how to use SAS® software to implement these two methods of multinomial classification, with emphasis on both training the model and scoring new data. A variety of data sets are used to illustrate the pros and cons of each method.
Read the paper (PDF)
Ralph Abbey, SAS
Taiping He, SAS
Tao Wang, SAS
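To make the one-versus-all recipe concrete, here is a rough sketch (the data set, class values, and inputs are hypothetical): each pass recodes the target to one-against-the-rest, trains a binary SVM, and emits DATA step score code; after scoring, the class with the largest posterior wins.

   %macro ova_svm(class);
      data train2;
         set train;
         y_bin = (species = "&class");       /* 1 for &class, 0 otherwise */
      run;
      proc hpsvm data=train2;
         input x1-x4 / level=interval;
         target y_bin;
         code file="/tmp/score_&class..sas"; /* DATA step scoring code    */
      run;
   %mend ova_svm;
   %ova_svm(setosa)  %ova_svm(versicolor)  %ova_svm(virginica)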
Session SAS0366-2017:
Microservices and Many-Task Computing for High-Performance Analytics
A microservice architecture prescribes the design of your software application as suites of independently deployable services. In this paper, we detail how you can design your SAS® 9.4 programs so that they adhere to a microservice architecture. We also describe how you can leverage Many-Task Computing (MTC) in your SAS® programs to gain a high level of parallelism. Under these paradigms, your SAS code will gain encapsulation, robustness, reusability, and performance. The design principles discussed in this paper are implemented in the SAS® Infrastructure for Risk Management (IRM) solution. Readers with an intermediate knowledge of Base SAS® and the SAS macro language will understand how to design their SAS code so that it follows these principles and reaps the benefits of a microservice architecture.
Read the paper (PDF)
Henry Bequet, SAS
Session SAS0324-2017:
Migrating Dashboards from SAS® BI Dashboard to SAS® Visual Analytics
SAS® BI Dashboard is an important business intelligence and data visualization product used by many customers worldwide. They still rely on SAS BI Dashboard for performance monitoring and decision support. SAS® Visual Analytics is a new-generation product, which empowers customers to explore huge volumes of data very quickly and view visualized results with web browsers and mobile devices. As more and more customers adopt SAS Visual Analytics, some SAS BI Dashboard customers might want to migrate existing dashboards to SAS Visual Analytics to take advantage of new technologies. In addition, some customers might hope to deploy the two products in parallel and keep everyone on the same page. Because the two products use different data models and formats, a special conversion tool is developed to convert SAS BI Dashboard dashboards into SAS Visual Analytics dashboards and reports. This paper comprehensively describes the guidelines, methods, and detailed steps to migrate dashboards from SAS BI Dashboard to SAS Visual Analytics. Then the converted dashboards can be shown in supported viewers of SAS Visual Analytics including mobile devices and modern browsers.
Read the paper (PDF)
Roc (Yipeng) Zhang, SAS
Junjie Li, SAS
Wei Lu, SAS
Huazhang Shao, SAS
N
Session 0770-2017:
Name That Function: Punny Function Names with Multiple MEANings and Why You Do Not Want to MISS Out
The SAS® DATA step is one of the best (if not the best) data manipulators in the programming world. One of the areas that gives the DATA step its power is the wealth of functions that are available to it. This paper takes a PEEK at some of the functions whose names have more than one MEANing. While the subject matter is very serious, the material is presented in a humorous way that is guaranteed not to BOR the audience. With so many functions available, we have to TRIM our list so that the presentation can be made within the TIME allotted. This paper also discusses syntax and shows several examples of how these functions can be used to manipulate data.
Read the paper (PDF)
Ben Cochran, The Bedford Group, Inc.
Art Carpenter, California Occidental Consultants
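A quick taste of the wordplay, with three of the functions in question at work (values are illustrative):

   data _null_;
      avg = mean(2, 4, .);            /* MEAN ignores missing arguments: 3  */
      nm  = nmiss(2, 4, .);           /* NMISS counts the missing ones: 1   */
      s   = trim('SAS   ') || '!';    /* TRIM drops trailing blanks: SAS!   */
      put avg= nm= s=;
   run;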
Session SAS0127-2017:
New for SAS® 9.4: Including Text and Graphics in Your Microsoft Excel Workbooks, Part 2
A new ODS destination for creating Microsoft Excel workbooks is available starting in the third maintenance release for SAS® 9.4. This destination creates native Microsoft Excel XLSX files, supports graphic images, and offers other advantages over the older ExcelXP tagset. In this presentation, you learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS® output. The techniques can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on demand and in real time using SAS server technology is discussed. Using earlier versions of SAS to create multi-sheet workbooks is also discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.
Read the paper (PDF) | Download the data file (ZIP)
Vince DelGobbo, SAS
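The flavor of the new destination in a few lines (the file path and option values are illustrative):

   ods excel file="/reports/class.xlsx"
             options(sheet_interval="bygroup"    /* one worksheet per BY group */
                     sheet_label="Age"
                     embedded_titles="yes");
   proc print data=sashelp.class noobs;
      by age;
   run;
   ods excel close;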
O
Session SAS0747-2017:
Open Your Mind: Use Cases for SAS® and Open-Source Analytics
As a data scientist, you need analytical tools and algorithms, whether commercial or open source, and you have some favorites. But how do you decide when to use what? And how can you integrate their use to your maximum advantage? This presentation provides several best practices for deploying both SAS® and open-source analytical tools to increase productivity and efficiency in your enterprise ecosystem. See an example of a marketing analysis using SAS and R algorithms in SAS® Enterprise Miner to develop a predictive model, and then operationalize that model for performance monitoring and in-database scoring. Also learn about using Python and SAS integration for developing predictive models from a Jupyter Notebook environment. Seeing these cases will help you decide how to improve your analytics with similar integration of SAS and open source.
Read the paper (PDF)
Tuba Islam, SAS
Session 0302-2017:
Optimize My Stock Portfolio! A Case Study with Three Different Estimates of Risk
People typically invest in more than one stock to help diversify their risk. These stock portfolios are a collection of assets that each have their own inherent risk. If you know the future risk of each of the assets, you can optimize how much of each asset to keep in the portfolio. The real challenge is trying to evaluate the potential future risk of these assets. Different techniques provide different forecasts, which can drastically change the optimal allocation of assets. This talk presents a case study of portfolio optimization under three different scenarios: historical standard deviation estimation, the capital asset pricing model (CAPM), and GARCH-based volatility modeling. The structure and results of these three approaches are discussed.
Read the paper (PDF)
Aric LaBarr, Institute for Advanced Analytics at NC State University
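Whatever the source of the risk estimates, the allocation step itself is a quadratic program. A minimal long-only Markowitz formulation in PROC OPTMODEL (the input tables, return target, and the covariance source — historical, CAPM, or GARCH — are assumptions):

   proc optmodel;
      set <str> ASSETS;
      num mu {ASSETS};                    /* expected returns           */
      num cov {ASSETS, ASSETS};           /* forecast covariance matrix */
      read data means into ASSETS=[asset] mu;
      read data covmat into [asset1 asset2] cov;
      var w {ASSETS} >= 0;                /* long-only weights          */
      min Risk = sum {i in ASSETS, j in ASSETS} cov[i,j]*w[i]*w[j];
      con Budget: sum {i in ASSETS} w[i] = 1;
      con Return: sum {i in ASSETS} mu[i]*w[i] >= 0.07;
      solve with qp;
      print w;
   quit;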
Session 0851-2017:
Optimizing Delivery Routes with SAS® Software
Optimizing delivery routes and efficiently using delivery drivers are examples of classic problems in Operations Research, such as the Traveling Salesman Problem. In this paper, Oberweis and Zencos collaborate to describe how to leverage SAS/OR® procedures to solve these problems and optimize delivery routes for a retail delivery service. Oberweis Dairy specializes in home delivery service that delivers premium dairy products directly to customers' homes. Because freshness is critical to delivering an excellent customer experience, Oberweis is especially motivated to optimize their delivery logistics. As Oberweis works to develop an expanding footprint and a growing business, Zencos is helping to ensure that delivery routes are optimized and delivery drivers are used efficiently.
Read the paper (PDF)
Ben Murphy, Zencos
Bruce Bedford, Oberweis Dairy, Inc.
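The routing core of such a problem maps directly onto the TSP statement of PROC OPTNET in SAS/OR (the toy link data below stands in for real drive times between stops):

   data links;                       /* symmetric drive times between stops */
      input from $ to $ weight;
      datalines;
   A B 3
   A C 5
   A D 7
   B C 2
   B D 6
   C D 4
   ;

   proc optnet data_links=links;
      tsp out=route;                 /* optimal visiting order */
   run;

   proc print data=route; run;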
Session SAS0278-2017:
Optimizing SAS® Grid Computing with SAS® Scalable Performance Data Server and Dynamic Data Partitioning
Making optimal use of SAS® Grid Computing relies on the ability to spread the workload effectively across all of the available nodes. With SAS® Scalable Performance Data Server (SPD Server), it is possible to partition your data and spread the processing across the SAS Grid Computing environment. In an ideal world it would be possible to adjust the size and number of partitions according to the data volumes being processed on any given day. This paper discusses a technique that enables the processing performed in the SAS Grid Computing environment to be dynamically reconfigured, automatically at run time, to optimize the use of SAS Grid Computing, and to provide significant performance benefits.
Read the paper (PDF)
Andy Knight, SAS
P
Session 0942-2017:
Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data
The most commonly reported model evaluation metric is the accuracy. This metric can be misleading when the data are imbalanced. In such cases, other evaluation metrics should be considered in addition to the accuracy. This study reviews alternative evaluation metrics for assessing the effectiveness of a model in highly imbalanced data. We used credit card clients in Taiwan as a case study. The data set contains 30,000 instances (22.12% risky and 77.88% non-risky) assessing the likelihood of a customer defaulting on a payment. Three different techniques were used during the model building process. The first technique involved down-sampling the majority class in the training subset. The second used the original imbalanced data, whereas the third set prior probabilities to account for oversampling. The same sets of predictive models were then built for each technique, after which the evaluation metrics were computed. The results suggest that model evaluation metrics might reveal more about distribution of classes than they do about the actual performance of models when the data are imbalanced. Moreover, some of the predictive models were identified to be very sensitive to imbalance. The final decision in model selection should consider a combination of different measures instead of relying on one measure. To minimize imbalance-biased estimates of performance, we recommend reporting both the obtained metric values and the degree of imbalance in the data.
Read the paper (PDF)
Josephine Akosa, Oklahoma State University
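The alternative metrics are easy to compute once a model's predictions are in hand. This sketch assumes a scored data set named SCORED with 0/1 columns ACTUAL and PREDICTED (hypothetical names):

   /* Tally the confusion matrix */
   proc sql noprint;
      select sum(actual=1 and predicted=1),
             sum(actual=0 and predicted=1),
             sum(actual=1 and predicted=0),
             sum(actual=0 and predicted=0)
      into :tp, :fp, :fn, :tn
      from scored;
   quit;

   /* Accuracy alone can look excellent while sensitivity is poor */
   data metrics;
      sensitivity = &tp / (&tp + &fn);
      precision   = &tp / (&tp + &fp);
      f1          = 2*precision*sensitivity / (precision + sensitivity);
      accuracy    = (&tp + &tn) / (&tp + &fp + &fn + &tn);
   run;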
Session 1334-2017:
Predictive Models: Storing, Scoring, and Evaluating
Predictive modeling might just be the single most thrilling aspect of data science. Who among us can deny the allure: to observe a naturally occurring phenomenon, conjure a mathematical model to explain it, and then use that model to make predictions about the future? Though many SAS® users are familiar with using a data set to generate a model, they might not use the awesome power of SAS to store the model and score other data sets. In this paper, we distinguish between parametric and nonparametric models and discuss the tools that SAS provides for storing and scoring each. Along the way, you come to know the STORE statement and the SCORE procedure. We conclude with a brief overview of the PLM procedure and demonstrate how to effectively load and evaluate models that have been stored during the model building process.
Read the paper (PDF)
Matthew Duchnowski, Educational Testing Service (ETS)
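A minimal version of the store-then-score workflow for a parametric model, with hypothetical data set and variable names:

   /* Fit a model and save it as an item store */
   proc logistic data=train;
      model y(event='1') = x1 x2;
      store work.mymodel;
   run;

   /* Later, or in a different program: reload the model and score new data */
   proc plm restore=work.mymodel;
      score data=newdata out=scored predicted / ilink;
   run;

The ILINK option returns predictions on the probability scale rather than the logit scale.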
Session 1326-2017:
Price Recommendation Engine for Airbnb
Airbnb is the world's largest home-sharing company, with over 800,000 listings in more than 34,000 cities and 190 countries. The pricing of each property, which is set by the host, is therefore crucial to the business. Setting low prices during a high-demand period might hinder profits, while setting high prices during a low-demand period might result in no bookings at all. In this paper, we suggest a price recommendation methodology for Airbnb hosts that helps overcome the problems of overpricing and underpricing. Through this methodology, we identify key factors related to Airbnb pricing: the factors influential in determining a price for a property; the relation between the price of a property and the frequency of its booking; and the similarities among successful and profitable properties. The constraints outlined in the analysis were entered into SAS® optimization procedures to arrive at the best possible price. As part of this methodology, we built a scraping tool to collect details of New York City hosts along with their metrics. Using this data, we built a pricing model to predict the optimal price of an Airbnb home.
Read the paper (PDF)
Praneeth Guggilla, Oklahoma State University
Singdha Gutha, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
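The optimization step might look something like the following PROC OPTMODEL sketch, in which the linear demand curve and its coefficients are purely illustrative assumptions, not the authors' fitted model:

   proc optmodel;
      /* assumed demand model: expected bookings per month = a - b*price */
      num a = 30;
      num b = 0.08;
      var price >= 50 <= 400;        /* plausible nightly price bounds */
      max revenue = price * (a - b*price);
      solve;
      print price revenue;
   quit;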
Q
Session 0173-2017:
Quick Results with SAS® Enterprise Guide®
SAS® Enterprise Guide® empowers organizations, programmers, business analysts, statisticians, and end users with all the capabilities that SAS has to offer. This hands-on workshop presents the SAS Enterprise Guide graphical user interface (GUI). It covers access to multi-platform enterprise data sources, various data manipulation techniques that do not require you to learn complex coding constructs, built-in wizards for performing reporting and analytical tasks, the delivery of data and results to a variety of mediums and outlets, and support for data management and documentation requirements. Attendees learn how to use the graphical user interface to access SAS® data sets and tab-delimited and Microsoft Excel input files; to subset and summarize data; to join (or merge) two tables together; to flexibly export results to HTML, PDF, and Excel; and to visually manage projects using flow diagrams.
Read the paper (PDF)
Kirk Paul Lafler, Software Intelligence Corporation
Ryan Lafler
R
Session 0242-2017:
Random Forests with Approximate Bayesian Model Averaging
A random forest is an ensemble of decision trees that often produces more accurate results than a single decision tree. The predictions of the individual trees in the forest are averaged to produce a final prediction. The question arises whether a more accurate final prediction can be obtained by a more intelligent use of the trees in the forest. In particular, as random forests are currently defined, every tree contributes the same fraction to the final result (for example, if there are 50 trees, each tree contributes 1/50th). This ignores model uncertainty, as less accurate trees are treated exactly like more accurate trees. Replacing averaging with Bayesian Model Averaging gives better trees the opportunity to contribute more to the final result, which might lead to more accurate predictions. However, several complications of this approach have to be resolved, such as the computation of an SBC value for a decision tree. Two novel approaches to solving this problem are presented, and the results are compared to those obtained with the standard random forest approach.
Read the paper (PDF)
Tiny Du Toit, North-West University
Andre De Waal, SAS
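For context, a standard SBC-based approximation to the posterior model weights (a textbook result, not necessarily the authors' exact formulation) replaces the uniform 1/k weights with

   w_i = \frac{\exp(-\mathrm{SBC}_i/2)}{\sum_{j=1}^{k} \exp(-\mathrm{SBC}_j/2)},
   \qquad \hat{y} = \sum_{i=1}^{k} w_i \, \hat{y}_i ,

so trees with better (lower) SBC values contribute proportionally more to the ensemble prediction.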
Session 1323-2017:
Real AdaBoost: Boosting for Credit Scorecards and Similarity to WOE Logistic Regression
AdaBoost (or Adaptive Boosting) is a machine learning method that builds a series of decision trees, adapting each tree to predict difficult cases missed by the previous trees and combining all trees into a single model. I discuss the AdaBoost methodology and introduce its extension, Real AdaBoost, which is so similar to stepwise weight of evidence logistic regression (SWOELR) that it might offer a framework for understanding the power of the SWOELR approach. I discuss the advantages of Real AdaBoost, including variable interaction and adaptive, stage-wise binning, and demonstrate a SAS® macro that uses Real AdaBoost to generate predictive models.
Read the paper (PDF)
Paul Edwards, ScotiaBank
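To see why the connection to weight of evidence is natural, recall the textbook Real AdaBoost update (the standard form, not specific to this paper): each stage fits a class-probability estimate p_t(x) and contributes half its log-odds to the score,

   f_t(x) = \tfrac{1}{2} \ln \frac{p_t(x)}{1 - p_t(x)},
   \qquad F(x) = \sum_{t=1}^{T} f_t(x),

and weight of evidence is itself a log-odds quantity, so each stage behaves much like a binned WOE term entering a logistic model.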
Session 0847-2017:
Revenue Score: Forecasting Credit Card Products with Zero Inflated Beta Regression and Gradient Boosting
Using zero-inflated beta regression and gradient boosting, we developed a solution to forecast the gross revenue of credit card products. The solution was based on: 1) a set of attributes from invoice information; 2) zero-inflated beta regression for forecasts of interchange and revolving revenue, using PROC NLMIXED and data processing routines that build the attributes and target variable; 3) gradient boosting models for the other product forecasts (annuity, insurance, and so on), using PROC TREEBOOST, exploring its parameters, and creating a routine for selecting and adjusting models; and 4) construction of revenue ranges for policies and monitoring. This presentation introduces this credit card revenue forecasting solution.
Read the paper (PDF)
Marc Witarsa, Serasa Experian
Paulo Di Cellio Dias, Serasa Experian
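A hedged sketch of the zero-inflated beta likelihood in PROC NLMIXED follows. The data set, the single predictor X, and the response Y (revenue scaled into the [0,1) interval) are hypothetical placeholders for the authors' attribute set:

   proc nlmixed data=revenue;
      parms b0=0 b1=0 g0=0 g1=0 phi=1;
      pzero = logistic(g0 + g1*x);        /* probability of zero revenue */
      mu    = logistic(b0 + b1*x);        /* mean of the beta component  */
      if y = 0 then loglik = log(pzero);
      else loglik = log(1 - pzero)
                  + lgamma(phi) - lgamma(mu*phi) - lgamma((1-mu)*phi)
                  + (mu*phi - 1)*log(y) + ((1-mu)*phi - 1)*log(1 - y);
      model y ~ general(loglik);
   run;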
Session 0838-2017:
Revolutionizing Statistical Computing in SAS® with the Jupyter Notebook
From state-of-the-art research to routine analytics, the Jupyter Notebook offers an unprecedented reporting medium. Historically, tables, graphics, and other types of output had to be created separately, and then integrated into a report piece by piece, amidst the drafting of text. The Jupyter Notebook interface enables you to create code cells and markdown cells in any arrangement. Markdown cells allow all typical formatting. Code cells can run code in the document. As a result, report creation happens naturally and in a completely reproducible way. Handing a colleague a Jupyter Notebook file to be re-run or revised is much easier and simpler for them than passing along, at a minimum, two files: one for the code and one for the text. Traditional reports become dynamic documents that include both text and living SAS® code that is run during document creation. With the new SAS kernel for Jupyter, all of this is possible and more!
Read the paper (PDF)
Hunter Glanz
S
Session 0911-2017:
Self-Service Data Management for Analytics Users across the Enterprise
With the proliferation of analytics expanding across every function of the enterprise, the need for broader access to data by experienced data scientists and non-technical users to produce reports and do discovery is growing exponentially. The unintended consequence of this trend is a bottleneck within IT to deliver the necessary data while still maintaining the necessary governance and data security standards required to safeguard this critical corporate asset. This presentation illustrates how organizations are solving this challenge and enabling users to both access larger quantities of existing data and add new data to their own models without negatively impacting the quality, security, or cost to store that data. It also highlights some of the cost and performance benefits achieved by enabling self-service data management.
Ken Pikulik, Teradata
Session 0383-2017:
Setting Relative Server Paths in SAS® Enterprise Guide®
Imagine if you will a program, a program that loves its data, a program that loves its data to be in the same directory as the program itself. Together, in the same directory. True love. The program loves its data so much, it just refers to it by filename. No need to say what directory the data is in; it is the same directory. Now imagine that program being thrust into the world of the server. The server knows not what directory this program resides in. The server is an uncaring, but powerful, soul. Yet, when the program is executing, and the program refers to the data just by filename, the server bellows, "Nay, no path, no data." A knight in shining armor emerges, in the form of a SAS® macro, who says, "Lo, with the help of the SAS® Enterprise Guide® macro variable minions, I can gift you with the location of the program directory and send that with you to yon mighty server." And there was much rejoicing. Yay. This paper shows you a SAS macro that you can include in your SAS Enterprise Guide pre-code to automatically set your present working directory to the same directory where your program is saved on your UNIX or Linux operating system. This is applicable to submitting to any type of server, including a SAS Grid Server. It gives you the flexibility of moving your code and data to different locations without having to worry about modifying the code. It also helps save time by not specifying complete pathnames in your programs. And can't we all use a little more time?
Read the paper (PDF)
Michelle Buchecker, ThotWave Technologies, LLC.
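The core trick can be sketched in a few lines, assuming a UNIX-style path separator. &_SASPROGRAMFILE is the SAS Enterprise Guide macro variable that holds the saved program's full path; the macro and libref names here are illustrative:

   %macro set_pwd;
      %global pwd;
      /* full path of the current program, with the quotation marks removed */
      %let pwd = %sysfunc(dequote(&_SASPROGRAMFILE));
      /* keep everything up to and including the last '/' */
      %let pwd = %substr(&pwd, 1, %sysfunc(findc(&pwd, %str(/), b)));
      libname here "&pwd";
   %mend set_pwd;
   %set_pwd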
Session SAS0437-2017:
Stacked Ensemble Models for Improved Prediction Accuracy
Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second-level learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS® 9.4 and SAS® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and naïve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.
Read the paper (PDF)
Funda Gunes, SAS
Russ Wolfinger, SAS
Pei-Yi Tan, SAS
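Conceptually, the second-level step reduces to fitting a model on the level-one predictions. A simplified sketch, with hypothetical table and column names, omitting the cross validation folds the paper rightly insists on:

   /* Out-of-fold predictions from three level-one models, merged by ID */
   data stack_train;
      merge scores_nn scores_gbm scores_mf;
      by id;
   run;

   /* Level-two meta-learner: a generalized linear model on the predictions */
   proc logistic data=stack_train;
      model y(event='1') = p_nn p_gbm p_mf;
   run;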
Session 1157-2017:
Statistical Volunteering with SAS: Experiences and Opportunities
This presentation brings together experiences from SAS® professionals working as volunteers for organizations, charities, and academic research. Pro bono work, much like that done by physicians, attorneys, and professionals in other areas, is rapidly growing in statistical practice as an important part of a statistical career, offering the opportunity to use your skills in places where they are sorely needed but cannot be supported by a paid position. Statistical volunteers also gain important learning experiences, mentoring, networking, and other opportunities for professional development. The presenter shares experiences from volunteering for local charities, non-governmental organizations (NGOs), and other organizations and causes, both in the US and around the world. The mission, methods, and focus of some organizations are presented, including DataKind, Statistics Without Borders, Peacework, and others.
Read the paper (PDF)
David Corliss, Peace-Work
T
Session SAS0523-2017:
Temporal Text Mining: A Thematic Exploration of Don Quixote
Temporal text mining (TTM) is the discovery of temporal patterns in documents that are collected over time. It involves discovery of latent themes, construction of a thematic evolution graph, and analysis of thematic patterns. This paper uses text mining and time series analysis techniques to explore Don Quixote de la Mancha, a two-volume master work of Western literature. First, it uses singular value decomposition in SAS® Text Miner to discover 25 key themes that characterize the two volumes. Then it treats the chapters of the two books as time-ordered documents and creates a semiautomated visual summary of the two volumes. It also explores the trajectory of individual themes over the course of the chapters and identifies episodes, recurring themes, and climaxes. Finally, it uses time series clustering in SAS® Enterprise Miner to group chapters that have similar themes and to group themes that have similar trajectories. The TTM methods demonstrated in this paper lend themselves to business applications such as monitoring changes in customer sentiment and summarizing research and legislative trends.
Read the paper (PDF)
Ray Wright, SAS
Session SAS0309-2017:
The Architecture of the SAS® Cloud Analytic Services in SAS® Viya™
SAS® Cloud Analytic Services (CAS) is the cloud-based run-time environment for data management and analytics in SAS®. By run-time environment, we refer to the combination of hardware and software where data management and analytics take place. In a sense, CAS is simply another SAS platform on which to get work done: a platform for high-performance analytics and distributed computing. The CAS server provides data management and an analytics framework that can run in the cloud, that can act as a cloud, and that provides the best-in-class analytics that SAS is known for. This new architecture functions as a public API, allowing access from many different clients such as Lua, Python, Java, REST, and, yes, even SAS. The CAS server is designed to provide user-level sessions, to share data between sessions, and to provide fault tolerance, which allows a worker node to crash without losing data and allows the user action to continue running to completion. The isolation provided to each session allows one session to crash without affecting other sessions. The concept of 'always in memory' in CAS means that an action is not aware of what the server does to allow it to access the data: the entire file might be in memory, or just pieces of the file might be mapped into memory, just in time for the action to access the data. This allows CAS tables to be loaded that are larger than the memory available across the grid. Hadoop can be used to provide data redundancy. The server is elastic and can add or remove nodes as needed, and users can specify how many nodes they want their session to use, so that the session fits their needs.
Read the paper (PDF)
Jerry Pendergrass, SAS
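A small sketch of the session model from the SAS client side; the session name, worker count, and table are illustrative:

   cas mysess sessopts=(nworkers=4);    /* request a four-node session */
   caslib _all_ assign;                 /* surface caslibs as librefs  */

   proc casutil;
      load data=sashelp.cars outcaslib="casuser" casout="cars";
   quit;

   proc cas;
      simple.summary / table={caslib="casuser", name="cars"};
   run;
   quit;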
Session SAS1407-2017:
The Benefit of Using Clustering as Input to a Propensity to Buy Predictive Model
Propensity-to-buy models are among the most widely used techniques in supporting business strategy for customer segmentation and targeting. Some of the key challenges every data scientist faces in building predictive models are utilizing all known predictor variables, uncovering unknown signals, and adjusting for latent-variable errors. Often, the business demands the inclusion of certain variables based on a previous understanding of process dynamics. To meet such client requirements, these inputs are forced into the model, resulting in either a complex model with too many inputs or a fragile model that might decay faster than expected. West Corporation's Center for Data Science (CDS) has found a workaround that strikes a balance between meeting client requirements and building a robust model by using clustering techniques. A leading telecom services provider uses West's SMS Outbound Notification Platform to notify its customers about an upcoming Pay-Per-View event. As part of the modeling process, the client identified a few variables as key business drivers, and CDS used those variables to build clusters, which were then used as inputs to the predictive model. In doing so, not only were the effects of the client-mandated variables captured successfully, but the number of inputs to the model was also reduced, making it parsimonious. This paper illustrates how West has used clustering in the data preparation process to build a robust model.
Krutharth Peravalli, West Corporation
Sumit Sukhwani, West Corporation
Dmitriy Khots, West Corporation
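In outline, the approach looks like the following sketch, where the client-mandated drivers and all other names are hypothetical:

   /* Build k clusters from the client-mandated variables only */
   proc fastclus data=custbase maxclusters=5 out=clustered;
      var driver1 driver2 driver3 driver4;
   run;

   /* Use the single CLUSTER assignment in place of the raw drivers */
   proc logistic data=clustered;
      class cluster;
      model respond(event='1') = cluster other_input1 other_input2;
   run;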
Session 1061-2017:
The Rise of Chef Curry: Studying Advanced Basketball Metrics with Quantile Regression in SAS®
In the 2015-2016 season of the National Basketball Association (NBA), the Golden State Warriors achieved a record-breaking 73 regular-season wins. This accomplishment would not have been possible without their reigning Most Valuable Player (MVP) champion Stephen Curry and his historic shooting performance. Shattering his previous NBA record of 286 three-point shots made during the 2014-2015 regular season, he accrued an astounding 402 in the next season. With an increased emphasis on the advantages of the three-point shot and guard-heavy offenses in the NBA today, organizations are naturally eager to investigate player statistics related to shooting at long ranges, especially for the best of shooters. Furthermore, the addition of more advanced data-collecting entities such as SportVU creates an incredible opportunity for data analysis, moving beyond simply using aggregated box scores. This work uses quantile regression within SAS® 9.4 to explore the relationships between the three-point shot and other relevant advanced statistics, including some SportVU player-tracking data, for the top percentile of three-point shooters from the 2015-2016 NBA regular season.
View the e-poster or slides (PDF)
Taylor Larkin, The University of Alabama
Denise McManus, The University of Alabama
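A representative call looks like the sketch below, with hypothetical player-level columns standing in for the SportVU metrics:

   /* Model the median and upper quantiles of three-point makes,
      not just the conditional mean */
   proc quantreg data=nba;
      model threes_made = catch_shoot_pct touches avg_shot_dist
                          / quantile=0.5 0.75 0.9;
   run;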
Session 1489-2017:
Timing is Everything: Detecting Important Behavior Triggers
Predictive analytics is powerful in its ability to predict likely future behavior and is widely used in marketing. However, the old adage "timing is everything" continues to hold true: the right customer and the right offer at the wrong time is less than optimal. In life, timing matters a great deal, but predictive analytics seldom takes timing into account explicitly. We should be able to do better. Financial service consumption changes often have precursor changes in behavior, and a behavior change can lead to multiple subsequent consumption changes. One way to improve our awareness of the customer situation is to detect significant events that have meaningful consequences and warrant proactive outreach. This session presents a simple time series approach to event detection that has proven to work successfully. Real case studies are discussed to illustrate the approach and implementation. Adoption of this practice can augment and enhance a predictive analytics practice, elevating our game to the next level.
Read the paper (PDF)
Daymond Ling, Seneca College
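One minimal version of such a trigger, sketched under assumed table and column names: flag a month whose spending sits more than two standard deviations above the customer's trailing six-month mean.

   /* Trailing six-month mean and standard deviation, lagged one month so
      the current month is compared against its own history only */
   proc expand data=monthly out=rolled method=none;
      by cust_id;
      convert spend = roll_mean / transformout=(movave 6 lag 1);
      convert spend = roll_std  / transformout=(movstd 6 lag 1);
   run;

   data triggers;
      set rolled;
      trigger = (roll_std > 0 and spend > roll_mean + 2*roll_std);
   run;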
Session SAS0763-2017:
Toward End-to-End Automated Machine Learning in SAS® Viya™
Trends in predictive modeling, specifically in machine learning, are moving toward automated approaches where well-known predictive algorithms are applied and the best one is chosen according to some evaluation metric. This approach's efficacy relies on the underlying data preparation in general, and on data preprocessing in particular. Commonly used data preprocessing techniques include missing value imputation, outlier detection and treatment, functional transformation, discretization, and nominal grouping. Excluding toy problems, the composition of the best data preprocessing step depends on the modeling task at hand. This necessitates an iterative generation and evaluation of predictive pipelines, which consist of a mix of data preprocessing techniques and predictive algorithms. This is a combinatorial problem that can be a bottleneck in the analytics workflow. In this paper, we discuss the SAS® Cloud Analytic Services (CAS) actions in SAS® Viya that can be used to effect this end-to-end predictive pipeline in a scalable way, with special emphasis on CAS actions for data exploration, preprocessing, and feature transformation. In addition, we discuss how the whole process can be automated.
Biruk Gebremariam, SAS
U
Session 0612-2017:
Using Big Data to Visualize People Movement Using SAS® Basics
Visualizing the movement of people over time in an animation can provide insights that tables and static graphs cannot. There are many options, but what if you want to base the visualization on large amounts of data from several sources? SAS® is a great tool for this type of project. This paper summarizes how visualizing movement is accomplished using several data sets, large and small, and using various SAS procedures to pull it together. The use of a custom shape file is also highlighted. The end result is a GIF, which can be shared, that provides insights not available with other methods.
Read the paper (PDF)
Stephanie Thompson, Datamum
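The GIF itself can be produced with the SAS 9.4 animation system options. A skeletal version, assuming a MOVES data set with coordinates and a TIME_PERIOD frame variable (both hypothetical):

   options printerpath=gif animation=start
           animduration=0.5 animloop=yes noanimoverlay;
   ods printer file='movement.gif';

   proc sgplot data=moves;
      by time_period;               /* one animation frame per BY group */
      scatter x=longitude y=latitude;
   run;

   options animation=stop;
   ods printer close;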
Session 1169-2017:
Using Graph Analytics for Predictive Modeling in Life Insurance
This paper discusses a specific example of using graph analytics or social network analysis (SNA) in predictive modeling in the life insurance industry. The methods of social network analysis are applied to agents that share compensation, and the results are used to derive input variables for a model to predict the likelihood of certain behavior by insurance agents. Both SAS® code and SAS® Enterprise Miner are used to illustrate implementing different graph analytical methods. This paper assumes that the reader is familiar with the basic process of creating predictive models using multiple (linear or logistic) regression, and, in some sections, familiarity with SAS Enterprise Miner.
Read the paper (PDF)
Robert Moore, Thrivent Financial
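As a rough sketch of how such inputs can be derived in SAS code (names illustrative, and not the author's exact implementation), agents sharing compensation become the links of a graph whose node-level measures feed the model:

   /* Degree centrality per agent from a FROM/TO links table;
      the resulting node metrics become candidate model inputs */
   proc optgraph data_links=comp_links out_nodes=agent_metrics;
      centrality degree=both;
   run;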
Session 1216-2017:
Using ODS EXCEL to Integrate Tables, Graphics, and Text into Multi-Tabbed Microsoft Excel Reports
Do you have a complex report involving multiple tables, text items, and graphics that could best be displayed in a multi-tabbed spreadsheet format? The Output Delivery System (ODS) destination for Excel, introduced in SAS® 9.4, enables you to create Microsoft Excel workbooks that easily integrate graphics, text, and tables, including column labels, filters, and formatted data values. In this paper, we examine the syntax used to generate a multi-tabbed Excel report that incorporates output from the REPORT, PRINT, SGPLOT, and SGPANEL procedures.
Read the paper (PDF)
Caroline Walker, Warren Rogers Associates
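The essential pattern is short. This sketch uses SASHELP.CARS so that it runs anywhere:

   ods excel file='report.xlsx'
      options(sheet_name='Summary' embedded_titles='yes');

   proc report data=sashelp.cars;            /* first worksheet: a table */
      column type n;
      define type / group;
   run;

   ods excel options(sheet_name='Mileage');  /* start a new worksheet */

   proc sgplot data=sashelp.cars;            /* second worksheet: a graph */
      scatter x=horsepower y=mpg_city;
   run;

   ods excel close;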
Session SAS0152-2017:
Using Python with SAS® Cloud Analytic Services
With SAS® Viya and SAS® Cloud Analytic Services (CAS), SAS is moving into a new territory where SAS® Analytics is accessible to popular scripting languages using open APIs. Python is one of those client languages. We demonstrate how to connect to CAS, run CAS actions, explore data, build analytical models, and then manipulate and visualize the results using standard Python packages such as Pandas and Matplotlib. We cover a wide variety of topics to give you a bird's eye view of what is possible when you combine the best of SAS with the best of open source.
Read the paper (PDF)
Kevin Smith, SAS
Xiangxiang Meng, SAS
Session 0484-2017:
Using Text Analysis to Improve the Quality of Scoring Models with SAS® Enterprise Miner™
Transformation of raw data into sensible and useful information for prediction purposes is a priceless skill nowadays. The vast amount of data easily accessible at each step in a process gives us a great opportunity to use it for countless applications. Unfortunately, not all of the valuable data is available for processing using classical data mining techniques. What happens if textual data is also used to create the analytical base table (ABT)? The goal of this study is to investigate whether scoring models that also use textual data are significantly better than models that include only quantitative data. The study is focused on estimating the probability of default (PD) for the social lending platform kokos.pl, and the accuracy of the reported PDs is evaluated with the same methods used in banks. Data used for the analysis is gathered directly from the platform via the API. This paper describes in detail the steps of the data mining process built using SAS® Enterprise Miner™. The results of the study support the thesis that models with a properly conducted text mining process have better classification quality than models without text variables. Therefore, the use of this data mining approach is recommended when input data includes text variables.
Read the paper (PDF)
Piotr Malaszek, SCS Expert
Session SAS0527-2017:
Using Vibration Spectral Analysis to Predict Failures by Integrating R into SAS® Asset Performance Analytics
In industrial systems, vibration signals are the most important measurements for indicating asset health. Based on these measurements, an engineer with expert knowledge about the assets, industrial process, and vibration monitoring can perform spectral analysis to identify failure modes. However, this is still a manual process that heavily depends on the experience and knowledge of the engineer analyzing the vibration data. Moreover, when measurements are performed continuously, it becomes impossible to act in real time on this data. The objective of this paper is to examine the use of analytics to perform vibration spectral analysis in real time to predict asset failures. The first step in this approach is to translate engineering knowledge and features into analytic features in order to perform predictive modeling. This process involves converting the time signal into the frequency domain by applying a fast Fourier transform (FFT). Based on the specific design characteristics of the asset, it is possible to derive the relevant features of the vibration signal to predict asset failures. This approach is illustrated using a bearing data set available from the Prognostics Data Repository of the National Aeronautics and Space Administration (NASA). Modeling is done using R and is integrated within SAS® Asset Performance Analytics. In essence, this approach helps the engineers to make better data-driven decisions. The approach described in this paper shows the strength of combining expert engineering knowledge with advanced analytics.
Read the paper (PDF)
Adriaan Van Horenbeek, SAS
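Although the paper integrates R for its spectral computations, the periodogram step can also be sketched natively with SAS/ETS; the VIBRATION data set and ACCEL signal are hypothetical:

   /* Estimate the power spectrum of the vibration signal; peaks at a
      bearing's characteristic frequencies suggest developing faults */
   proc spectra data=vibration out=spec p s adjmean;
      var accel;
   run;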
V
Session 1185-2017:
Visualizing Market Structure Using Brand Sentiments
Increasingly, customers are using social media and other Internet-based applications such as review sites and discussion boards to voice their opinions and express their sentiments about brands. Such spontaneous and unsolicited customer feedback can provide brand managers with valuable insights about competing brands. There is a general consensus that listening and reacting to the voice of the customer is a vital component of brand management. However, the unstructured, qualitative, and textual nature of the feedback obtained from customers poses significant challenges for data scientists and business analysts. In this paper, we propose a methodology that can help brand managers visualize the competitive structure of a market based on an analysis of customer perceptions and sentiments that are obtained from blogs, discussion boards, review sites, and other similar sources. The brand map is designed to graphically represent the association of product features with brands, thus helping brand managers assess a brand's true strengths and weaknesses based on the voice of customers. Our multi-stage methodology uses the principles of topic modeling and sentiment analysis in text mining. The results of text mining are analyzed using correspondence analysis to graphically represent the differentiating attributes of each brand. We empirically demonstrate the utility of our methodology by using data collected from Edmunds.com, a popular review site for car buyers.
Read the paper (PDF)
Praveen Kumar Kotekal, Oklahoma State University
Amit K Ghosh, Cleveland State University
Goutam Chakraborty, Oklahoma State University
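The mapping step might be sketched as follows, assuming a BRAND_ATTR table of brand-attribute mention counts distilled from the text mining stage (hypothetical names):

   /* Correspondence analysis places brands and attributes in a shared
      low-dimensional space, which becomes the brand map */
   proc corresp data=brand_attr outc=coords short;
      tables brand, attribute;
      weight count;
   run;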
W
Session 0883-2017:
What Statisticians Should Know about Machine Learning
In the last few years, machine learning and statistical learning methods have gained increasing popularity among data scientists and analysts. Statisticians have sometimes been reluctant to embrace these methodologies, partly due to a lack of familiarity, and partly due to concerns with interpretability and usability. In fact, statisticians have a lot to gain by using these modern, highly computational tools. For certain types of problems, machine learning methods can be much more accurate predictors than traditional methods for regression and classification, and some of these methods are particularly well suited for the analysis of big and wide data. Many of these methods have origins in statistics or at the boundary of statistics and computer science, and some are already well established in statistical procedures, including LASSO and elastic net for model selection and cross validation for model evaluation. In this talk, I go through some examples illustrating the application of machine learning methods and interpretation of their results, and show how these methods are similar to, and differ from, traditional statistical methodology for the same problems.
Read the paper (PDF)
D. Cutler, Utah State University
Session SAS0195-2017:
What's New in SAS® Data Management
The latest releases of SAS® Data Management software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud data sources, RDBMS, files, unstructured data, streaming, and others, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, as well as enhancements to master data and metadata management support. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Read the paper (PDF)
Nancy Rausch, SAS
Session SAS0567-2017:
Wrangling Your Data into Shape for In-Memory Analytics
High-quality analytics works best with the best-quality data. Preparing your data ranges from activities like text manipulation and filtering to creating calculated items and blending data from multiple tables. This paper covers the range of activities you can easily perform to get your data ready. High-performance analytics works best with in-memory data. Getting your data into an in-memory server, as well as keeping it fresh and secure, are considerations for in-memory data management. This paper covers how to make small or large data available and how to manage it for analytics. You can choose to perform these activities in a graphical user interface or via batch scripts. This paper describes both ways to perform these activities. You'll be well-prepared to get your data wrangled into shape for analytics!
Read the paper (PDF)
Gary Mehler, SAS
Z
Session 0935-2017:
Zeroing In on Effective Member Communication: An Rx Education Study
In 2013, the Centers for Medicare & Medicaid Services (CMS) changed the pharmacy mail-order member-acquisition process so that Humana Pharmacy may only call a member with cost savings greater than $2.00 to educate the member on the potential savings and instruct the member to call back. The Rx Education call center asked for analytics work to help prioritize member outreach, improve conversions, and decrease the number of members who could not be contacted. After a year of contacting members using this additional insight, the conversion-after-agreement rate rose from 71.5% to 77.5%, and the unable-to-contact rate fell from 30.7% to 17.4%. This case study takes you on an analytics journey from the initial problem diagnosis and analytics solution through subsequent refinements and test-and-learn campaigns.
Read the paper (PDF)
Brian Mitchell, Humana Inc.