Although rapid development of information technologies in the past decade has provided forecasters with both huge amounts of data and massive computing capabilities, these advancements do not necessarily translate into better forecasts. Different industries and products have unique demand patterns. There is not yet a one-size-fits-all forecasting model or technique. A good forecasting model must be tailored to the data in order to capture the salient features and satisfy the business needs. This paper presents a multistage modeling strategy for demand forecasting. The proposed strategy, which has no restrictions on the forecasting techniques or models, provides a general framework for building a forecasting system in multiple stages.
Pu Wang, SAS
This work presents a macro for generating a ternary graph that can be used to solve a problem in the ceramics industry. The ceramics industry uses mixtures of clay types to generate a product with good properties that can be assessed according to breakdown pressure after firing, porosity, and water absorption. Beyond these properties, the industry is concerned with managing the geological reserves of each type of clay. Thus, it is important to seek alternative compositions that present properties similar to the default mixture. This can be done by analyzing the response surface of these properties as a function of the clay composition, which is easily done in a ternary graph. SAS® documentation does not describe how to draw an analysis grid in nonrectangular forms on graphs. The TRIAXIAL macro, however, can generate ternary graphical analysis by creating a special grid, using an annotate data set, and making linear transformations of the original data. The GCONTOUR procedure is used to generate the three-dimensional analysis.
IGOR NASCIMENTO, UNB
Zacarias Linhares, IFPI
ROBERTO SOARES, IFPI
This paper presents supervised and unsupervised pattern recognition techniques that use Base SAS® and SAS® Enterprise Miner™ software. A simple preprocessing technique creates many small image patches from larger images. These patches encourage the learned patterns to have local scale, which follows well-known statistical properties of natural images. In addition, these patches reduce the number of features that are required to represent an image and can decrease the training time that algorithms need in order to learn from the images. If a training label is available, a classifier is trained to identify patches of interest. In the unsupervised case, a stacked autoencoder network is used to generate a dictionary of representative patches, which can be used to locate areas of interest in new images. This technique can be applied to pattern recognition problems in general, and this paper presents examples from the oil and gas industry and from a solar power forecasting application.
Patrick Hall, SAS
Alex Chien, SAS Institute
ILKNUR KABUL, SAS Institute
JORGE SILVA, SAS Institute
This paper demonstrates techniques for using SAS® software to combine data from devices and sensors for analytics. Base SAS®, SAS® Data Integration Studio, and SAS® Visual Analytics are used to query external services and to import, fuzzy match, analyze, and visualize data. Finally, the benefits of SAS® Event Stream Processing models and alerts are discussed. To bring the analytics of things to life, the following data are collected: GPS data from countryside walks, GPS and scorecard data from smartphones while playing golf, and meteorological data feeds. These data are combined to test the old adage that golf is a good walk spoiled. Further, the use of alerts and the potential for predictive analytics are discussed.
David Shannon, Amadeus Software
Massive Open Online Courses (MOOCs) have attracted increasing attention in educational data mining research. MOOC platforms provide free higher education courses to Internet users worldwide. However, MOOCs have high enrollment but notoriously low completion rates. The goal of this study is to apply frequentist and Bayesian logistic regression to investigate whether and how students' engagement, intentions, education levels, and other demographics are conducive to course completion on MOOC platforms. The original data used in this study came from an online eight-week course titled Big Data in Education, taught on the Coursera platform (MOOC) by Teachers College, Columbia University. The data sets for analysis were created from three different sources: clickstream data, a pre-course survey, and homework assignment files. The SAS® System provides multiple procedures for performing logistic regression, each with different features and functions. In this study, we apply the two approaches, frequentist and Bayesian logistic regression, to the MOOC data: PROC LOGISTIC is used for the frequentist approach, and PROC GENMOD is used for the Bayesian analysis. The results obtained from the two approaches are compared. All statistical analyses are conducted in SAS® 9.3. Our preliminary results show that MOOC students with higher course engagement and higher motivation are more likely to complete the MOOC course.
Yan Zhang, Educational Testing Service
Ryan Baker, Columbia University
Yoav Bergner, Educational Testing Service
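As a minimal sketch of the two modeling approaches compared above, the steps below fit a frequentist logistic regression with PROC LOGISTIC and a Bayesian logistic regression with the BAYES statement in PROC GENMOD. The data set and variable names (mooc, completed, engagement, motivation, education_level) are hypothetical stand-ins, not the variables used in the paper.

/* Frequentist logistic regression */
proc logistic data=mooc descending;
   class education_level;
   model completed = engagement motivation education_level;
run;

/* Bayesian logistic regression via the BAYES statement in PROC GENMOD */
proc genmod data=mooc descending;
   class education_level;
   model completed = engagement motivation education_level / dist=binomial link=logit;
   bayes seed=20160418 nmc=10000 outpost=posterior;
run;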
Building representative machine learning models that generalize well on future data requires careful consideration both of the data at hand and of assumptions about the various available training algorithms. Data are rarely in an ideal form that enables algorithms to train effectively. Some algorithms are designed to account for important considerations such as variable selection and handling of missing values, whereas other algorithms require additional preprocessing of the data or appropriate tweaking of the algorithm options. Ultimate evaluation of a model's quality requires appropriate selection and interpretation of an assessment criterion that is meaningful for the given problem. This paper discusses the most common issues faced by machine learning practitioners and provides guidance for using these powerful algorithms to build effective models.
Brett Wujek, SAS
Funda Gunes, SAS Institute
Patrick Hall, SAS
Higher education is taking on big data initiatives to help improve student outcomes and operational efficiencies. This paper shows how one institution went from white paper to action. Challenges such as developing a common definition of big data, gaining access to certain types of data, and bringing together institutional groups that had little previous contact are highlighted, along with the stepwise, collaborative approach taken. How the data were eventually brought together and analyzed to generate actionable results is also covered. Learn how to make big data a part of your analytics toolbox.
Stephanie Thompson, Datamum
The electrical grid has become more complex; utilities are revisiting their approaches, methods, and technology to accurately predict energy demands across all time horizons in a timely manner. With the advanced analytics of SAS® Energy Forecasting, Old Dominion Electric Cooperative (ODEC) provides data-driven load predictions from the next hour to the next year and beyond. Accurate intraday forecasts mean meeting daily peak demands, saving millions of dollars during critical seasons and events. Mid-term forecasts provide a baseline for the cooperative and its members to accurately anticipate regional growth and customer needs, in addition to signaling power marketers where, when, and how much to hedge future energy purchases to meet weather-driven demands. Long-term forecasts create defensible numbers for large capital expenditures such as generation and transmission projects. Much of the data for determining load comes from disparate systems such as supervisory control and data acquisition (SCADA) and internal billing systems, combined with external market data (PJM Energy Market), weather, and economic data. This data needs to be analyzed, validated, and shaped to fully leverage predictive methods. Business insights and planning metrics are achieved when flexible data integration capabilities are combined with advanced analytics and visualization. These increased computing demands at ODEC are being met by leveraging Amazon Web Services (AWS) for expanded business discovery and operational capacity. Flexible and scalable data and discovery environments allow ODEC analysts to efficiently develop and test models that are I/O intensive. SAS® visualization gives the analyst a memory-intensive graphical compute environment for information sharing. Also, ODEC IT operations require deployment options tuned for process optimization to meet service level agreements that can be quickly evaluated, tested, and promoted into production. What was once very difficult for most utilities to embrace is now achievable with new approaches, methods, and technology like never before.
David Hamilton, ODEC
Steve Becker, SAS
Emily Forney, SAS
The Health Indicators Warehouse (HIW) is part of the US Department of Health and Human Services' (DHHS) response to make federal data more accessible. Through it, users can access data and metadata for over 1,200 indicators from approximately 180 federal and nonfederal sources. The HIW also supports data access by applications such as SAS® Visual Analytics through the use of an application programming interface (API). An API serves as a communication interface for integration. As a result of the API, HIW data consumers using SAS Visual Analytics can avoid difficult manual data processing. This paper provides detailed information about how to access HIW data with SAS Visual Analytics in order to produce easily understood visualizations with minimal effort through a methodology that automates HIW data processing. This paper also shows how to run SAS® macros inside a stored process to make HIW data available in SAS Visual Analytics for exploration and reporting via API calls; the SAS macros are provided. Use cases involving dashboards are also examined in order to demonstrate the value of streaming data directly from the HIW. Both IT professionals and population health analysts will benefit from understanding how to import HIW data into SAS Visual Analytics using SAS® Stored Processes, macros, and APIs. This can be very helpful to organizations that want to lower maintenance costs associated with data management while gaining insights into health data with visualizations. This paper provides a starting point for any organization interested in deriving full value from SAS Visual Analytics while augmenting their work with HIW data.
Li Hui Chen, US Consumer Product Safety Commission
Manuel Figallo, SAS
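As a hedged illustration of the kind of API call that the paper automates, the steps below retrieve an XML response with PROC HTTP and read it with the XML LIBNAME engine and an auto-generated map. The URL and file paths are placeholders, not the actual HIW endpoint or the stored process macros provided by the authors.

filename resp "/tmp/hiw_response.xml";     /* placeholder path for the API response */
filename hiwmap "/tmp/hiw.map";            /* placeholder path for the generated XML map */

proc http
   url="https://example.gov/hiw/api/indicators?format=xml"   /* placeholder URL */
   method="GET"
   out=resp;
run;

/* Read the returned XML and copy the resulting tables to WORK for use in SAS Visual Analytics loads */
libname hiw xmlv2 "/tmp/hiw_response.xml" automap=replace xmlmap=hiwmap;

proc copy in=hiw out=work;
run;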
Nowadays, recommender systems are popular tools for online retail businesses to predict a customer's next product to buy (NPTB). Based on statistical techniques and the information collected by the retailer, an efficient recommender system can suggest a meaningful NPTB to customers. A useful suggestion reduces the customer's search time for a wanted product and improves the buying experience, thus increasing the chance of cross-selling for online retailers and helping them build customer loyalty. Within a recommender system, the combination of advanced statistical techniques with available information (such as customer profiles, product attributes, and popular products) is the key to using the retailer's database to produce a useful NPTB suggestion for customers. This paper illustrates how to create a recommender system with the SAS® RECOMMEND procedure for online business. Using the recommender system, we can produce predictions, compare the performance of different predictive models (such as decision trees or multinomial discrete-choice models), and make business-oriented recommendations from the analysis.
Miao Nie, ABN AMRO Bank
Shanshan Cong, SAS Institute
Fire department facilities in Belgium are in desperate need of renovation due to relentless budget cuts during the last decade. Times have also changed: the current locations are troubled by city-center traffic jams and narrow streets. In other words, the current locations are not ideal. The approach for our project was to find the best possible locations for new facilities, based on historical interventions and live traffic information for fire trucks. The goal was not to give just one answer, but instead to provide SAS® Visual Analytics dashboards that allow policy makers to play with some of the deciding factors and view the impact of cutting or adding facilities on the estimated intervention time. For each option, the optimal locations are given. SAS® software offers many great products, but the true power of SAS as a data mining tool lies in using the full SAS suite. SAS® Data Integration Studio was used for its parallel processing capabilities in extracting and parsing travel times via an online API to MapQuest. SAS/OR® was used to optimize the model using a Lagrangian multiplier. SAS® Enterprise Guide® was used to combine and integrate the data preparation. SAS Visual Analytics was used to visualize the results and to serve as an interface for executing the model via a stored process.
Wouter Travers, Deloitte
Customer lifetime value (LTV) estimation involves two parts: the survival probabilities and the profit margins. This article describes the estimation of those probabilities using discrete-time logistic hazard models. The estimation of profit margins is based on linear regression. When outliers are present among the margins, we suggest applying robust regression with PROC ROBUSTREG.
Vadim Pliner, Verizon Wireless
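A minimal sketch of the two estimation steps described above, assuming hypothetical data set and variable names (cust_months is a customer-month, person-period data set; margins holds one margin observation per customer):

/* Discrete-time hazard of attrition: logistic regression on customer-month records */
proc logistic data=cust_months descending;
   class month_idx plan_type;
   model churn = month_idx tenure plan_type;
run;

/* Profit margins with outliers present: robust regression with PROC ROBUSTREG */
proc robustreg data=margins method=m;
   class plan_type;
   model margin = tenure plan_type usage / diagnostics;
run;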
Ensemble models are a popular class of methods for combining the posterior probabilities of two or more predictive models in order to create a potentially more accurate model. This paper summarizes the theoretical background of recent ensemble techniques and presents examples of real-world applications. Examples of these novel ensemble techniques include weighted combinations (such as stacking or blending) of predicted probabilities in addition to averaging or voting approaches that combine the posterior probabilities by adding one model at a time. Fit statistics across several data sets are compared to highlight the advantages and disadvantages of each method, and process flow diagrams that can be used as ensemble templates for SAS® Enterprise Miner™ are presented.
Wendy Czika, SAS
Ye Liu, SAS Institute
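As a minimal sketch of combining posterior probabilities outside of SAS Enterprise Miner, the DATA step below averages and blends the scores from two previously scored data sets. The data set and variable names are hypothetical, and both inputs are assumed to be sorted by id.

data ensemble;
   merge scored_tree (rename=(p_target1=p_tree))
         scored_logit(rename=(p_target1=p_logit));
   by id;
   p_average  = mean(p_tree, p_logit);      /* simple averaging of posterior probabilities */
   p_weighted = 0.7*p_tree + 0.3*p_logit;   /* weighted (blended) combination */
run;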
SAS® In-Memory Statistics uses a powerful interactive programming interface for analytics, aimed squarely at the data scientist. We show how the custom tasks that you can create in SAS® Studio (a web-based programming interface) can make everyone a data scientist! We explain the Common Task Model of SAS Studio, and we build a simple task in steps that carries out the basic functionality of the IMSTAT procedure. This task can then be shared amongst all users, empowering everyone on their journey to becoming a data scientist. During the presentation, it will become clear that not only can shareable tasks be created but the developer does not have to understand coding in Java, JavaScript, or ActionScript. We also use the task we created in the Visual Programming perspective in SAS Studio.
Stephen Ludlow, SAS
We introduce age-period-cohort (APC) models, which analyze data in which performance is measured by age of an account, account open date, and performance date. We demonstrate this flexible technique with an example from a recent study that seeks to explain the root causes of the US mortgage crisis. In addition, we show how APC models can predict website usage, retail store sales, salesperson performance, and employee attrition. We even present an example in which APC was applied to a database of tree rings to reveal climate variation in the southwestern United States.
Joseph Breeden, Prescient Models
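A hedged sketch of one common way to fit an APC model, not necessarily the author's specification: a Poisson regression in PROC GENMOD with age-on-book, vintage (cohort), and performance date as CLASS effects. The data set and variables (loan_perf, defaults, log_accts as the log of accounts on book) are hypothetical, and because age, period, and cohort are linearly dependent, an additional constraint on one of the factors is typically required in practice.

proc genmod data=loan_perf;
   class age_mob vintage per_date;
   model defaults = age_mob vintage per_date / dist=poisson link=log offset=log_accts;
run;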
As more of your work is performed in an off-premises cloud environment, understanding how to get the data you want to analyze and report on becomes important. In addition, working in a cloud environment like Amazon Web Services might be something that is new to you or to your organization when you use a product like SAS® Visual Analytics. This presentation discusses how to get the best use out of cloud resources, how to efficiently transport information between systems, and how to continue to leverage data in an on-premises database management system (DBMS) in your future cloud system.
Gary Mehler, SAS
Since its inception, SAS® Text Miner has used the singular value decomposition (SVD) to convert a term-document matrix to a representation that is crucial for building successful supervised and unsupervised models. In this presentation, using SAS® code and SAS Text Miner, we compare these models with those that are based on SVD representations of subcomponents of documents. These more granular SVD representations are also used to provide further insights into the collection. Examples featuring visualizations, discovery of term collocations, and near-duplicate subdocument detection are shown.
Russell Albright, SAS
James Cox, SAS Institute
Ning Jin, SAS Institute
The SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine are data formats from SAS® that support the creation of analytical base tables with tens of thousands of columns. These analytical base tables are used to support daily predictive analytical routines. Traditionally, Storage Area Network (SAN) storage has been, and continues to be, the primary storage platform for the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine formats. Due to the cost constraints associated with SAN storage, companies have added Hadoop to their environments to help minimize storage costs. In this paper, we explore how the SAS Scalable Performance Data Server and SAS Scalable Performance Data Engine leverage the Hadoop Distributed File System.
Steven Sober, SAS
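A hedged sketch of pointing the SAS Scalable Performance Data Engine at HDFS. The HDFS path and data set names are placeholders, and HDFSHOST=DEFAULT assumes the Hadoop configuration and JAR locations are already set in the SAS environment.

/* SPD Engine library stored in the Hadoop Distributed File System */
libname spdehdfs spde '/user/sasdemo/spde' hdfshost=default;

/* Write an analytical base table into the SPD Engine format on HDFS */
data spdehdfs.base_table (compress=binary);
   set work.analytical_base_table;
run;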
Working with big data is often time-consuming and challenging. The primary goal in programming is to maximize throughput while minimizing the use of computer processing time, real time, and programmers' time. By using the Multiprocessing (MP) CONNECT method on a symmetric multiprocessing (SMP) computer, a programmer can divide a job into independent tasks and execute the tasks as threads in parallel on several processors. This paper demonstrates the development and application of a parallel processing program on a large amount of health-care data.
Shuhua Liang, Kaiser Permanente
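A minimal sketch of the MP CONNECT pattern described above: two independent summarization tasks run as parallel SAS sessions on the same SMP host. The library, path, data sets, and variables are hypothetical, and SASCMD="sas" assumes the SAS executable is on the system path.

libname health "/data/health";          /* hypothetical shared library */
options autosignon sascmd="sas";        /* spawn local SAS sessions for each task */

rsubmit task1 wait=no inheritlib=(health);
   proc means data=health.claims_2015 noprint;
      var paid_amount;
      output out=health.sum_2015 sum=total_paid;
   run;
endrsubmit;

rsubmit task2 wait=no inheritlib=(health);
   proc means data=health.claims_2016 noprint;
      var paid_amount;
      output out=health.sum_2016 sum=total_paid;
   run;
endrsubmit;

waitfor _all_ task1 task2;   /* block until both threads finish */
signoff task1;
signoff task2;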
While there is no shortage of MS and MA programs in data science, PhD programs have been slow to develop. Is there a role for a PhD in data science? How would this differ from an advanced degree in statistics? Computer science? This session explores the issues related to the curriculum content for a PhD in data science and what those students would be expected to do after graduation. Come join the conversation.
Jennifer Priestley, Kennesaw State University
The current study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Three different analyses for latent variable discovery are briefly reviewed and explored. The latent analysis procedures explored in this paper are PROC LCA, PROC LTA, PROC TRAJ, and PROC CALIS. The latent variables are then included in separate regression models. The effect of the latent variables on the fit and use of the regression model, compared to a similar model using observed data, is briefly reviewed. The data used for this study were obtained from the National Longitudinal Study of Adolescent Health (Add Health). Data were analyzed using SAS® 9.4. This paper is intended for any level of SAS® user and is written for an audience with a background in behavioral science and/or statistics.
Deanna Schreiber-Gregory, National University
Propensity scores produced by logistic models have been used intensively to assist direct marketing name selection. As a result, only customers with a higher likelihood to respond are mailed offers in order to reduce cost, and event ROI is thereby increased. There is a fly in the ointment, however. Compared to the model-building performance time window, usually 6 to 12 months, a marketing event time period is usually much shorter. As such, this approach lacks the ability to deselect those who have a high propensity score but are unlikely to respond to an upcoming campaign. Dynamically building a complete propensity model for every upcoming campaign is nearly impossible, but incorporating time to respond has been of great interest to marketers as another dimension for enhancing response prediction. Hence, this paper presents an inventive modeling technique that combines logistic regression and the Cox proportional hazards model. The objective of the fusion approach is to allow a customer's shorter time to next repurchase to compensate for an insignificantly lower propensity score when winning selection opportunities. The method is accomplished using PROC LOGISTIC, PROC LIFETEST, PROC LIFEREG, and PROC PHREG on the fusion model built in a SAS® environment. This paper also touches on how to use the results to predict repurchase response by demonstrating a case of repurchase time-shift prediction on the 12-month inactive customers of a big-box store retailer. The paper also shares a results comparison between the fusion approach and logit alone. Comprehensive SAS macros are provided in the appendix.
Hsin-Yi Wang, Alliance Data Systems
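A hedged sketch of the two component models that the fusion approach combines, with hypothetical data set and variable names; the paper's actual macros and fusion logic are not reproduced here.

/* Logit component: propensity to respond or repurchase */
proc logistic data=custbase descending;
   model respond = recency frequency monetary tenure;
   output out=logit_scored predicted=p_respond;
run;

/* Cox proportional hazards component: time to next repurchase */
proc phreg data=custbase;
   model days_to_repurchase*censored(1) = recency frequency monetary tenure;
   output out=phreg_scored xbeta=risk_score;
run;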
In recent years, big data has been in the limelight as a solution for business issues, and implementation of big data mining has begun in a variety of industries. The variety of data types and the velocity at which data are growing have been astonishing, whether represented as structured data stored in a relational database or as unstructured data (for example, text data, GPS data, image data, and so on). In the pharmaceutical industry, big data means real-world data such as electronic health records, genomics data, medical imaging data, social network data, and so on. Handling these types of big data often requires a special infrastructure for statistical computing. Our presentation covers case study 1: an IMSTAT implementation as a large-scale parallel computation environment; the conversion of a business issue into a data science issue in pharma; case study 2: data handling and machine learning for vertical and horizontal big data by using PROC IMSTAT; the importance of integrating analysis results; and cautionary points for big data mining.
Yoshitake Kitanishi, Shionogi & Co., Ltd.
Ryo Kiguchi, Shionogi & Co., Ltd.
Akio Tsuji, Shionogi & Co., Ltd.
Hideaki Watanabe, Shionogi & Co., Ltd.
Fast detection of forest fires is a great concern among environmental experts and national park managers because forest fires create economic and ecological damage and endanger human lives. For effective fire control and resource preparation, it is necessary to predict fire occurrences in advance and estimate the possible losses caused by fires. For this purpose, using real-time sensor data on weather conditions and fire occurrence is highly recommended to support the prediction mechanism. The objective of this presentation is to use SAS® 9.4 and SAS® Enterprise Miner™ 14.1 to predict the probability of fires and to identify the particular weather conditions that result in incremental burned area in the Montesinho Park forest (Portugal). The data set was obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, and contains 517 observations and 13 variables from January 2000 to December 2003. Support vector machine analyses with variable selection were performed on this data set for fire occurrence prediction, with a validation accuracy of approximately 60%. The study also incorporates the incremental response technique and hypothesis testing to estimate the increased probability of fire, as well as the extra burned area, under various conditions. For example, when there is no rain, a 27% higher chance of fire and 4.8 hectares of extra burned area are recorded, compared to when there is rain.
Quyen Nguyen, Oklahoma State University
In recent years, many companies have been trying to understand the rare events that are critical in the current business environment. But a data set with rare events is always imbalanced, and models developed using such a data set cannot predict the rare events precisely. To overcome this issue, the data set needs to be sampled using specialized sampling techniques such as over-sampling, under-sampling, or the synthetic minority over-sampling technique (SMOTE). The over-sampling technique randomly duplicates minority class observations, but this might bias the results. The under-sampling technique randomly deletes majority class observations, but this might lose information. SMOTE sampling creates new synthetic minority observations instead of duplicating minority class observations or deleting majority class observations, and it can therefore overcome the problems, such as biased results and lost information, found in the other sampling techniques. In our research, we used an imbalanced data set containing results from a thyroid test with 3,163 observations, of which only 4.7 percent had positive test results. Using SAS® procedures such as PROC SURVEYSELECT and PROC MODECLUS, we created over-sampled, under-sampled, and SMOTE-sampled data sets in SAS® Enterprise Guide®. We then built decision tree, gradient boosting, and rule induction models using four different data sets (non-sampled; majority under-sampled; minority over-sampled with majority under-sampled; and minority SMOTE-sampled with majority under-sampled) in SAS® Enterprise Miner™. Finally, based on the receiver operating characteristic (ROC) index, Kolmogorov-Smirnov statistics, and the misclassification rate, we found that the models built using the minority SMOTE-sampled with majority under-sampled data yield better results for this data set.
Rhupesh Damodaran Ganesh Kumar, Oklahoma State University (SAS and OSU data mining Certificate)
Kiren Raj Mohan Jagan Mohan, Zions Bancorporation
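A minimal sketch of the majority-class under-sampling step with PROC SURVEYSELECT. The variable names are hypothetical; the sample size of 149 roughly matches the number of positive records in a 3,163-observation data set with 4.7 percent positives.

/* Separate the rare positives from the majority negatives */
data minority majority;
   set thyroid;
   if test_result = 1 then output minority;
   else output majority;
run;

/* Randomly under-sample the majority class to roughly a 1:1 ratio */
proc surveyselect data=majority out=majority_under method=srs sampsize=149 seed=1234;
run;

/* Recombine into a balanced modeling data set */
data balanced;
   set minority majority_under;
run;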
A bubble map is a useful tool for identifying trends and visualizing the geographic proximity and intensity of events. This session describes how to use readily available map data sets in SAS® along with PROC GEOCODE and PROC GMAP to turn a data set of addresses and events into a bubble map of the United States with scaled bubbles depicting the location and intensity of events.
Caroline Walker, Warren Rogers Associates
JavaScript Object Notation (JSON) has quickly become the de facto standard for data transfer on the Internet, due to an increase in both web data and the usage of full-stack JavaScript. JSON has become dominant in the emerging technologies of the web today, such as the Internet of Things and the mobile cloud. JSON offers a light and flexible format for data transfer and can be processed directly from JavaScript without the need for an external parser. The SAS® JSON procedure writes JSON but lacks the ability to read it. However, the addition of the GROOVY procedure in SAS® 9.3 allows for execution of Java code from within SAS, allowing JSON data to be read into a SAS data set through XML conversion. This paper demonstrates the method for parsing JSON into data sets with Groovy and the XML LIBNAME engine, all within Base SAS®.
John Kennedy, Mesa Digital
Value-based reimbursement is the emerging strategy in the US healthcare system. The premise of value-based care is simple in concept: high quality and low cost provide the greatest value to patients and the various parties that fund their coverage. The basic equation for value is equally simple to compute: value = quality/cost. However, there are significant challenges to measuring it accurately. Error or bias in measuring value could result in the failure of this strategy to ultimately improve the healthcare system. This session discusses various methods and issues with risk adjustment in a value-based reimbursement model. Risk adjustment is an essential tool for ensuring that fair comparisons are made when deciding which health services and health providers have high value. The goal of this presentation is to give analysts an overview of risk adjustment and to provide guidance for when, why, and how to use risk adjustment when quantifying the performance of health services and healthcare providers on both cost and quality. Statistical modeling approaches are reviewed, and practical issues with developing and implementing the models are discussed. Real-world examples are also provided.
Daryl Wansink, Conifer Value Based Care
Create a model with SAS® Contextual Analysis, deploy the model to your Hadoop cluster, and run the model using MapReduce driven by the SAS® Code Accelerator, not alongside Hadoop but in Hadoop. Hadoop took the world by storm as a cost-efficient software framework for storing data, but Hadoop is more than that: it is an analytics platform. Analytics is more than just crunching numbers; it is also about gaining knowledge from your organization's unstructured data. SAS has the tools to do both. In this paper, we demonstrate how to access your unstructured data, automatically create a text analytics model, deploy that model in Hadoop, and run the SAS Code Accelerator to give you insights into the big, unstructured data lying dormant in Hadoop. We show how the insights gained can help you make better-informed decisions for your organization.
David Bultman, SAS
Adam Pilz, SAS
This presentation describes the new SAS® Customer Intelligence 360 solution for multi-armed bandit analysis of controlled experiments. Multi-armed bandit analysis has been in existence since the 1950s and is sometimes referred to as the K- or N-armed bandit problem. In this problem, a gambler at a row of slot machines (sometimes known as 'one-armed bandits') has to decide which machines to play, as well as the frequency and order in which to play each machine, with the objective of maximizing returns. In practice, multi-armed bandits have been used to model the problem of managing research projects in a large organization. The same technology can also be applied in the marketing space where, for example, variants of campaign creatives or variants of customer journey sub-paths are compared to better understand customer behavior by collecting their respective response rates. Distribution weights of the variants are adjusted as time passes and conversions are observed, to reflect what customers are responding to, thus optimizing the performance of the campaign. The SAS Customer Intelligence 360 analytic engine collects data at regularly scheduled time intervals and processes the data through a simulation engine to return an updated weighting schema. The process is automated in the digital space and is driven by the solution's content serving and execution engine for interactive communication channels. Both marketing gains and costs are studied during the process to illustrate the business value of multi-armed bandit analysis. Results from coupling traditional A/B testing with the multi-armed bandit engine are also presented.
Thomas Lehman, SAS
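The paper describes the bandit engine conceptually. As a generic, hedged illustration of the underlying idea, and not the SAS Customer Intelligence 360 implementation, the DATA step below recomputes variant serving weights by Thompson sampling from Beta posteriors: each variant's weight for the next interval is its share of simulated wins. The variant names and counts are made-up illustration data.

data variants;
   input variant $ conversions exposures;
   datalines;
A  30 1000
B  45 1000
C  25  800
;
run;

data weights(keep=variant weight);
   array name[3] $ 8 _temporary_;
   array conv[3]      _temporary_;
   array expo[3]      _temporary_;
   array wins[3]      _temporary_ (0 0 0);

   /* load the observed counts for each variant */
   do i = 1 to 3;
      set variants point=i;
      name[i] = variant; conv[i] = conversions; expo[i] = exposures;
   end;

   /* draw from each variant's Beta posterior and count how often it wins */
   call streaminit(2016);
   do rep = 1 to 10000;
      best = 1; bestdraw = -1;
      do i = 1 to 3;
         draw = rand('BETA', conv[i] + 1, expo[i] - conv[i] + 1);
         if draw > bestdraw then do; bestdraw = draw; best = i; end;
      end;
      wins[best] = wins[best] + 1;
   end;

   /* the serving weight for the next interval is each variant's share of wins */
   do i = 1 to 3;
      variant = name[i];
      weight  = wins[i] / 10000;
      output;
   end;
   stop;
run;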
In today's business intelligence world, self-service, which allows an everyday knowledge worker to explore data and personalize business reports without being tech-savvy, is a prerequisite. The new release of SAS® Visual Statistics introduces an HTML5-based, easy-to-use user interface that combines statistical modeling, business reporting, and mobile sharing into a one-stop self-service shop. The backbone analytic server of SAS Visual Statistics is also updated, allowing an end user to analyze data of various sizes in the cloud. The paper illustrates this new self-service modeling experience in SAS Visual Statistics using telecom churn data, including the steps of identifying distinct user subgroups using a decision tree, building and tuning regression models, designing business reports for customer churn, and sharing the final modeling outcome on a mobile device.
Xiangxiang Meng, SAS
Don Chapman, SAS
Cheryl LeSaint, SAS
The third maintenance release of SAS® 9.4 was a huge release with respect to the interoperability between SAS® and Hadoop, the industry standard for big data. This talk brings you up-to-date with where we are: more distributions, more data types, more options. Come and learn about the exciting new developments for blending your SAS processing with your shared Hadoop cluster. Grid processing. Check. SAS data sets on HDFS. Check.
Paul Kent, SAS
Similarity analysis is used to classify an entire time series by type. In this method, a distance measure is calculated to reflect the difference in form between two ordered sequences, such as those found in time series data. The mutual differences between many sequences or time series can be compiled into a similarity matrix, similar in form to a correlation matrix. When cluster methods are applied to a similarity matrix, time series with similar patterns are grouped together and placed into clusters distinct from others that contain time series with different patterns. In this example, similarity analysis is used to classify supernovae by the way the amount of light they produce changes over time.
David Corliss, Ford Motor Company
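A hedged sketch of the clustering step described above: once pairwise dissimilarities between light curves have been assembled into a matrix, a TYPE=DISTANCE data set can be clustered directly. The small four-series matrix below is made-up illustration data, not the supernova measurements used in the presentation.

data lightcurve_dist(type=distance);
   input id $ sn1-sn4;
   datalines;
sn1  0  .  .  .
sn2  2  0  .  .
sn3  6  5  0  .
sn4  7  6  1  0
;
run;

proc cluster data=lightcurve_dist method=average outtree=tree noprint;
   id id;
run;

/* Cut the dendrogram into two groups of similarly shaped light curves */
proc tree data=tree nclusters=2 out=curve_clusters noprint;
   id id;
run;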
Big data is often distinguished as encompassing high volume, high velocity, or high variability of data. While big data can signal big business intelligence and big business value, it can also wreak havoc on systems and software ill-prepared for its profundity. Scalability describes the ability of a system or software to adequately meet the needs of additional users or its ability to use additional processors or resources to fulfill those added requirements. Scalability also describes the adequate and efficient response of a system to increased data throughput. Because sorting data is one of the most common and most resource-intensive operations in any software language, inefficiencies or failures caused by big data often are first observed during sorting routines. Much SAS® literature has been dedicated to optimizing big data sorts for efficiency, including minimizing execution time and, to a lesser extent, minimizing resource usage (that is, memory and storage consumption). However, less attention has been paid to implementing big data sorting that is reliable and robust even when confronted with resource limitations. To that end, this text introduces the SAFESORT macro, which facilitates a priori exception-handling routines (which detect environmental and data set attributes that could cause process failure) and post hoc exception-handling routines (which detect actual failed sorting routines). If exception handling is triggered, SAFESORT automatically reroutes program flow from the default sort routine to a less resource-intensive routine, thus sacrificing execution speed for reliability. Moreover, macro modularity enables developers to select their favorite sort procedure and, for data-driven disciples, to build fuzzy logic routines that dynamically select a sort algorithm based on environmental and data set attributes.
Troy Hughes, Datmesis Analytics
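The following is a simplified sketch of the fallback idea behind the approach described above, not the author's SAFESORT macro itself: attempt a default PROC SORT and, if the step fails, rerun it with the less memory- and space-intensive TAGSORT option. The macro name, data sets, and BY variables are hypothetical.

%macro safesort_lite(data=, out=, by=);
   proc sort data=&data out=&out;
      by &by;
   run;
   %if &syserr ne 0 %then %do;
      %put NOTE: Default sort failed (SYSERR=&syserr). Retrying with TAGSORT.;
      proc sort data=&data out=&out tagsort;
         by &by;
      run;
   %end;
%mend safesort_lite;

%safesort_lite(data=work.big, out=work.big_sorted, by=account_id txn_date)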
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®. Topics include recommendations gleaned from actual deployments from a variety of implementations and distributions. Techniques cover tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Nancy Rausch, SAS
Wilbram Hazejager, SAS
Machine learning is gaining popularity, fueled by the rapid advancement of computing technology. Machine learning offers the ability to continuously adjust predictive models based on real-time data and to visualize changes and take action on the new information. Hear from two PhDs from SAS and Intel about how SAS® machine learning, working with SAS streaming analytics and Intel microprocessor technology, makes this possible.
SAS® High-Performance Risk distributes financial risk data and big data portfolios with complex analyses across a networked Hadoop Distributed File System (HDFS) grid to support rapid in-memory queries for hundreds of simultaneous users. This data is extremely complex and must be stored in a proprietary format to guarantee data affinity for rapid access. However, customers still desire the ability to view and process this data directly. This paper demonstrates how to use the HPRISK custom file reader to directly access risk data in Hadoop MapReduce jobs, using the HPDS2 procedure and the LASR procedure.
Mike Whitcher, SAS
Stacey Christian, SAS
Phil Hanna, SAS Institute
Don McAlister, SAS
Over the years, complex medical data has been captured and retained in a variety of legacy platforms that are characterized by special formats, hierarchical/network relationships, and relational databases. Due to the complex nature of the data structures and the capacity constraints associated with outdated technologies, high-value healthcare data locked in legacy systems has been restricted in its use for analytics. With the emergence of highly scalable, big data technologies such as Hadoop, now it's possible to move, transform, and enable previously locked legacy data economically. In addition, sourcing such data once on an enterprise data hub enables the ability to efficiently store, process, and analyze the data from multiple perspectives. This presentation illustrates how legacy data is not only unlocked but enabled for rapid visual analytics by using Cloudera's enterprise data hub for data transformation and the SAS® ecosystem.
This paper proposes a technique that implements wavelet analysis (WA) to improve the forecasting accuracy of the autoregressive integrated moving average (ARIMA) model for nonlinear time series. Because of its assumption of linear correlation and the conventional seasonal adjustment methods used in ARIMA (that is, differencing, X-11, and X-12), the model might fail to capture nonlinear patterns. Rather than directly model such a signal, we decompose it into less complex components such as trend, seasonality, process variations, and noise using WA. Then we use these components as exogenous variables in the autoregressive integrated moving average with explanatory variables (ARIMAX) model. We describe the background of WA and then demonstrate the code and a detailed explanation of WA based on multi-resolution analysis (MRA) in SAS/IML® software. The idea and mathematical basis of ARIMA and ARIMAX are also given. Next, we demonstrate our technique in forecasting applications using SAS® Forecast Studio. The demonstrated time series are nonlinear in nature and come from different fields. The results suggest that the wavelet components are good regressors in ARIMAX, which captures the nonlinear patterns well.
Woranat Wongdhamma, Oklahoma State University
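A hedged sketch of the ARIMAX step described above, assuming the wavelet-derived components have already been computed and stored as regressors; the data set, variable names, and ARMA orders are hypothetical.

proc arima data=decomposed_series;
   identify var=y crosscorr=(trend_w season_w);
   estimate p=1 q=1 input=(trend_w season_w) method=ml;   /* wavelet components as exogenous inputs */
   forecast lead=12 out=fcst;
run;
quit;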
The HPSPLIT procedure, a powerful new addition to SAS/STAT®, fits classification and regression trees and uses the fitted trees for interpretation of data structures and for prediction. This presentation shows the use of PROC HPSPLIT to analyze a number of data sets from different subject areas, with a focus on prediction among subjects and across landscapes. A critical component of these analyses is the use of graphical tools to select a model, interpret the fitted trees, and evaluate the accuracy of the trees. The use of different kinds of cross validation for model selection and model assessment is also discussed.
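A minimal PROC HPSPLIT sketch using the familiar home-equity (HMEQ) example data; the data set and the growing, pruning, and partitioning choices shown are illustrative, not necessarily those used in the presentation.

proc hpsplit data=sampsio.hmeq seed=123;
   class bad reason job;
   model bad = reason job loan value delinq derog clage ninq;
   grow entropy;                       /* grow the tree using the entropy criterion */
   prune costcomplexity;               /* prune using cost-complexity */
   partition fraction(validate=0.3);   /* hold out a validation fraction for assessment */
run;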
An important strength of observational studies is the ability to estimate a key behavior's or treatment's effect on a specific health outcome. This is a crucial strength, as most health outcomes research studies are unable to use experimental designs due to ethical and other constraints. Keeping this in mind, one drawback of observational studies (which experimental studies naturally control for) is that they lack the ability to randomize their participants into treatment groups. This can result in the unwanted introduction of selection bias. One way to adjust for selection bias is through the use of a propensity score analysis. In this study, we provide an example of how to use these types of analyses. Our concern is whether recent substance abuse has an effect on an adolescent's identification of suicidal thoughts. In order to conduct this analysis, a selection bias was identified and adjustment was sought through three common forms of propensity scoring: stratification, matching, and regression adjustment. Each form is separately conducted, reviewed, and assessed as to its effectiveness in improving the model. Data for this study were gathered through the Youth Risk Behavior Surveillance System, an ongoing nationwide project of the Centers for Disease Control and Prevention. This presentation is designed for any level of statistician, SAS® programmer, or data analyst with an interest in controlling for selection bias, as well as for anyone who has an interest in the effects of substance abuse on mental illness.
Deanna Schreiber-Gregory, National University
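A minimal sketch of one of the three propensity-scoring forms described above (regression adjustment); the variable names are hypothetical stand-ins for the YRBSS items, not the actual survey variables.

/* Step 1: model the probability of the "treatment" (recent substance abuse) */
proc logistic data=yrbs descending;
   class sex race grade / param=ref;
   model substance_abuse = sex race grade age;
   output out=ps_out predicted=pscore;
run;

/* Step 2: include the propensity score as a covariate when estimating the
   effect of substance abuse on suicidal thoughts */
proc logistic data=ps_out descending;
   model suicidal_thoughts = substance_abuse pscore;
run;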
A number of SAS® tools can be used to report data, such as the PRINT, MEANS, TABULATE, and REPORT procedures. The REPORT procedure is a single tool that can produce many of the same results as other SAS tools. Not only can it create detailed reports like PROC PRINT can, but it can also summarize and calculate data like the MEANS and TABULATE procedures do. Unfortunately, despite its power, PROC REPORT seems to be used less often than the other tools, possibly due to its seemingly complex coding. This paper uses PROC REPORT and the Output Delivery System (ODS) to export a big data set into a customized XML file that a user who is not familiar with SAS can easily read. Several options for the COLUMN, DEFINE, and COMPUTE statements are shown that enable you to present your data in a more colorful way. We show how to control the format of selected columns and rows, how to make column headings more meaningful, and how to color selected cells differently to bring attention to the most important data.
Guihong Chen, TCF Bank
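A minimal sketch of the PROC REPORT and ODS approach described above, here writing an Excel-readable XML file with the ExcelXP tagset; the data set, variables, file path, threshold, and style choices are hypothetical and not necessarily those used in the paper.

ods tagsets.excelxp file="/tmp/report.xml" style=sasweb
    options(sheet_name="Summary" embedded_titles="yes");

proc report data=accounts nowd;
   column region product balance;
   define region  / group "Region";
   define product / group "Product";
   define balance / analysis sum format=dollar14.2 "Total Balance";
   compute balance;
      /* color cells above a threshold to draw attention to them */
      if balance.sum > 1000000 then
         call define(_col_, "style", "style={background=lightyellow}");
   endcomp;
run;

ods tagsets.excelxp close;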