Although rapid development of information technologies in the past decade has provided forecasters with both huge amounts of data and massive computing capabilities, these advancements do not necessarily translate into better forecasts. Different industries and products have unique demand patterns. There is not yet a one-size-fits-all forecasting model or technique. A good forecasting model must be tailored to the data in order to capture the salient features and satisfy the business needs. This paper presents a multistage modeling strategy for demand forecasting. The proposed strategy, which has no restrictions on the forecasting techniques or models, provides a general framework for building a forecasting system in multiple stages.
Pu Wang, SAS
Although the statistical foundations of predictive analytics overlap substantially across industries, the business objectives, data availability, and regulations differ among the property and casualty insurance, life insurance, banking, pharmaceutical, and genetics industries. A common process used in property and casualty insurance companies with large data sets is introduced, including data acquisition, data preparation, variable creation, variable selection, model building (also known as fitting), model validation, and model testing. The variable selection and model validation stages are described in more detail. Some successful models used in insurance companies are introduced. Base SAS®, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process.
Mei Najim, Sedgwick
Geographically Weighted Negative Binomial Regression (GWNBR) was developed by Silva and Rodrigues (2014). It is a generalization of the Geographically Weighted Poisson Regression (GWPR) proposed by Nakaya and others (2005) and of the Poisson and negative binomial regressions. This paper presents a SAS® macro, written in SAS/IML® software, that estimates the GWNBR model and shows how to use the GMAP procedure to draw the resulting maps.
Alan Silva, University of Brasilia
Thais Rodrigues, University of Brasilia
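The paper's full GWNBR macro is not reproduced here; the sketch below only illustrates, in SAS/IML®, the geographic kernel weighting that underlies GWR-type estimators, assuming a hypothetical data set COUNTIES with coordinates X and Y and an assumed fixed bandwidth. The row of weights for each location would feed the local negative binomial fit at that point, and the resulting local estimates can then be mapped with PROC GMAP as the paper describes.

proc iml;
   use counties;                          /* hypothetical input data set */
   read all var {x y} into coords;
   close counties;
   n = nrow(coords);
   h = 2.0;                               /* assumed fixed kernel bandwidth */
   start bisquare(d, h);
      return( ((1 - (d/h)##2)##2) # (d < h) );   /* bisquare kernel, zero beyond h */
   finish;
   W = j(n, n, 0);                        /* row i = weights for the local fit at location i */
   do i = 1 to n;
      d = sqrt((coords[,1] - coords[i,1])##2 + (coords[,2] - coords[i,2])##2);
      W[i,] = t(bisquare(d, h));
   end;
quit;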
Smart clothing with integrated wearable medical sensors for tracking human health is now attracting many researchers. However, body movement caused by daily human activities introduces artificial noise into physiological data signals, which affects the output of a health monitoring/alert system. To overcome this problem, recognizing human activities, determining the relationship between activities and physiological signals, and removing noise from the collected signals are essential steps. This paper focuses on the first step, human activity recognition. Our literature review found no other study that used SAS® for classifying human activities. For this study, two data sets were collected from an open repository. Both data sets have 561 input variables and one nominal target variable with four levels. Principal component analysis, along with other variable reduction and selection techniques, was applied to reduce dimensionality in the input space. Several modeling techniques with different optimization parameters were used to classify human activity. The gradient boosting model was selected as the best model based on a test misclassification rate of 0.1233; that is, 87.67% of total events were classified correctly.
Minh Pham, Oklahoma State University
Mary Ruppert-Stroescu, Oklahoma State University
Mostakim Tanjil, Oklahoma State University
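As a rough illustration of the dimension-reduction step described above, the sketch below applies principal component analysis with PROC PRINCOMP to a hypothetical training data set HAR_TRAIN whose 561 inputs are named X1-X561; the component scores would then replace the raw inputs in the downstream classifiers.

proc princomp data=har_train out=har_scores n=50 standard;
   var x1-x561;                 /* the 561 sensor-derived input variables */
run;
/* HAR_SCORES contains the standardized component scores Prin1-Prin50 */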
Massive Open Online Courses (MOOCs) have attracted increasing attention in educational data mining research. MOOC platforms provide free higher education courses to Internet users worldwide. However, MOOCs have high enrollment but notoriously low completion rates. The goal of this study is to apply frequentist and Bayesian logistic regression to investigate whether and how students' engagement, intentions, education levels, and other demographics are conducive to course completion on MOOC platforms. The original data used in this study came from an online eight-week course titled Big Data in Education, taught within the Coursera platform (MOOC) by Teachers College, Columbia University. The data sets for analysis were created from three different sources--clickstream data, a pre-course survey, and homework assignment files. The SAS system provides multiple procedures to perform logistic regression, with each procedure having different features and functions. In this study, we apply two approaches--frequentist and Bayesian logistic regression--to MOOC data. PROC LOGISTIC is used for the frequentist approach, and PROC GENMOD is used for Bayesian analysis. The results obtained from the two approaches are compared. All the statistical analyses are conducted in SAS® 9.3. Our preliminary results show that MOOC students with higher course engagement and higher motivation are more likely to complete the MOOC course.
Yan Zhang, Educational Testing Service
Ryan Baker, Columbia University
Yoav Bergner, Educational Testing Service
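A minimal sketch of the two approaches follows, assuming a hypothetical data set MOOC with a binary completion flag COMPLETED and illustrative predictors; the actual model specifications in the paper differ.

/* Frequentist logistic regression */
proc logistic data=mooc;
   class education_level;
   model completed(event='1') = engagement intent education_level;
run;

/* Bayesian logistic regression via the BAYES statement */
proc genmod data=mooc;
   class education_level;
   model completed(event='1') = engagement intent education_level / dist=binomial link=logit;
   bayes seed=12345 nbi=2000 nmc=10000 outpost=posterior;
run;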
Building representative machine learning models that generalize well on future data requires careful consideration both of the data at hand and of assumptions about the various available training algorithms. Data are rarely in an ideal form that enables algorithms to train effectively. Some algorithms are designed to account for important considerations such as variable selection and handling of missing values, whereas other algorithms require additional preprocessing of the data or appropriate tweaking of the algorithm options. Ultimate evaluation of a model's quality requires appropriate selection and interpretation of an assessment criterion that is meaningful for the given problem. This paper discusses the most common issues faced by machine learning practitioners and provides guidance for using these powerful algorithms to build effective models.
Brett Wujek, SAS
Funda Gunes, SAS Institute
Patrick Hall, SAS
The surge of data and data sources in marketing has created an analytical bottleneck in most organizations. Analytics departments have been pushed into a difficult decision: either purchase black-box analytical tools to generate efficiencies or hire more analysts, modelers, and data scientists. Knowledge gaps stemming from restrictions in black-box tools or from backlogs in the work of analytical teams have resulted in lost business opportunities. Existing big data analytics tools respond well when dealing with large record counts and small variable counts, but they fall short in bringing efficiencies when dealing with wide data. This paper discusses the importance of an agile modeling engine designed to deliver productivity, irrespective of the size of the data or the complexity of the modeling approach.
Mariam Seirafi, Cornerstone Group of Companies
Nowadays, recommender systems are a popular tool for online retail businesses to predict a customer's next product to buy (NPTB). Based on statistical techniques and the information collected by the retailer, an efficient recommender system can suggest a meaningful NPTB to customers. A useful suggestion can reduce the customer's search time for a wanted product and improve the buying experience, thus increasing the chance of cross-selling for online retailers and helping them build customer loyalty. Within a recommender system, the combination of advanced statistical techniques with available information (such as customer profiles, product attributes, and popular products) is the key element in using the retailer's database to produce a useful NPTB suggestion for customers. This paper illustrates how to create a recommender system with the SAS® RECOMMEND procedure for online business. Using the recommender system, we can produce predictions, compare the performance of different predictive models (such as decision trees or multinomial discrete-choice models), and make business-oriented recommendations from the analysis.
Miao Nie, ABN AMRO Bank
Shanshan Cong, SAS Institute
One of the major diseases that records a high number of readmissions is bacterial pneumonia in Medicare patients. This study aims at comparing and contrasting the Northeast and South regions of the United States based on the factors causing 30-day readmissions. The study also identifies some of the ICD-9 medical procedures associated with those readmissions. Using SAS® Enterprise Guide® 7.1 and SAS® Enterprise Miner™ 14.1, this study analyzes patient and hospital demographics of the Northeast and South regions of the United States using the Cerner Health Facts Database. Further, the study suggests some preventive measures to reduce readmissions. The 30-day readmissions are computed based on admission and discharge dates from 2003 until 2013. Using clustering, various hospitals, along with discharge disposition levels (where a patient is sent after discharge), are grouped. In both regions, patients who are discharged to home have shown significantly lower chances of readmission (odds ratio = 0.562 for the Northeast region); in both regions, all other disposition levels have significantly higher odds (> 1) compared with discharge to home. For the South region, females have around 53 percent higher odds of readmission compared to males (odds ratio = 1.535). Also, some of the hospital groups have higher readmission cases. The association analysis using the Market Basket node finds catheterization procedures to be highly significant (lift = 1.25 for the Northeast and 3.84 for the South; confidence = 52.94% for the Northeast and 85.71% for the South) in bacterial pneumonia readmissions. Research has shown that during these procedures, patients are highly susceptible to acquiring Methicillin-resistant Staphylococcus aureus (MRSA) bacteria, which can cause pneumonia. Providing timely follow-up for patients who undergo these procedures might help limit readmissions. These patients might also be discharged to home under medical supervision, because such patients have shown significantly lower chances of readmission.
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Dr. William Paiva, Center for Health Systems Innovation, OSU, Tulsa
Aditya Sharma, Oklahoma State University
Quasi-complete separation is a commonly detected issue in logit/probit models. It occurs when a dependent variable separates an independent variable or a combination of several independent variables to a certain degree. In other words, levels in a categorical variable or values in a numeric variable are separated by groups of discrete outcome variables. Most of the time, this happens in categorical independent variables. Quasi-complete separation can cause convergence failures in logistic regression, which consequently result in inflated coefficient estimates and standard errors, and it can therefore yield biased results. This paper provides a comprehensive review of various approaches for correcting quasi-complete separation in binary logistic regression models and presents hands-on examples from the healthcare insurance industry. First, it introduces the concept of quasi-complete separation and how to diagnose the issue. Then, it provides step-by-step guidelines for fixing the problem, from a straightforward data configuration approach to more involved statistical modeling approaches such as the EXACT method, the FIRTH method, and so on.
Xinghe Lu, Amerihealth Caritas
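As a hedged illustration of two of the modeling remedies mentioned above, the sketch below applies Firth's penalized likelihood and exact conditional logistic regression in PROC LOGISTIC to a hypothetical data set CLAIMS with binary target EVENT and a sparse classification predictor PLAN_TYPE.

/* Firth's penalized maximum likelihood */
proc logistic data=claims;
   class plan_type / param=ref;
   model event(event='1') = age plan_type / firth clodds=pl;
run;

/* Exact conditional analysis for the effect causing the separation */
proc logistic data=claims;
   class plan_type / param=ref;
   model event(event='1') = age plan_type;
   exact plan_type / estimate=both;
run;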
Credit card profitability prediction is a complex problem because of the variety of cardholders' behavior patterns and the different sources of interest and transactional income. Each consumer account can move among a number of states such as inactive, transactor, revolver, delinquent, or defaulted. This paper i) describes an approach to credit card account-level profitability estimation based on multistate and multistage conditional probability models and different types of income estimation, and ii) compares methods for the most efficient and accurate estimation. We use application, behavioral, card state dynamics, and macroeconomic characteristics, and their combinations, as predictors. We use different types of logistic regression, such as multinomial logistic regression, ordered logistic regression, and multistage conditional binary logistic regression, with the LOGISTIC procedure for state transition probability estimation. The state transition probabilities are used as weights for the interest rate and non-interest income models (which one is applied depends on the account state). Thus, the scoring model is split according to the customer behavior segment and the source of generated income. The total income consists of interest and non-interest income. Interest income is estimated with credit limit utilization rate models. We test and compare five proportion models with the NLMIXED, LOGISTIC, and REG procedures in SAS/STAT® software. Non-interest income depends on the probability of being in a particular state, the two-stage model of the conditional probability of making a point-of-sale (POS) transaction or cash withdrawal (ATM), and the amount of income generated by this transaction. We use the LOGISTIC procedure for conditional probability prediction and the GLIMMIX and PANEL procedures for direct amount estimation with pooled and random-effects panel data. The validation results confirm that traditional techniques can be effectively applied to complex tasks with many parameters and multilevel business logic. The model is used in credit limit management, risk prediction, and client behavior analytics.
Denys Osipenko, the University of Edinburgh Business School
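A simplified sketch of the transition-probability models follows, using hypothetical account-level data ACCOUNTS in which NEXT_STATE records the state in the next period; the multinomial form uses the generalized logit link and the ordered form the cumulative logit link in PROC LOGISTIC.

/* Multinomial (generalized logit) state transition model */
proc logistic data=accounts;
   class cur_state / param=ref;
   model next_state = cur_state utilization payment_ratio / link=glogit;
run;

/* Ordered (cumulative logit) alternative when the states are treated as ordinal */
proc logistic data=accounts;
   class cur_state / param=ref;
   model next_state = cur_state utilization payment_ratio / link=clogit;
run;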
The recent advances in regulatory stress testing, including stress testing regulated by the Comprehensive Capital Analysis and Review (CCAR) in the US, the Prudential Regulation Authority (PRA) in the UK, and the European Banking Authority in the EU, as well as the new international accounting requirement known as IFRS 9 (International Financial Reporting Standard), all pose new challenges to credit risk modeling. The increasing sophistication of the models that are supposed to cover all the material risks in the underlying assets in various economic scenarios makes models harder to implement. Banks are spending substantial resources on model implementation but still face issues due to long deployment times and disconnection between the model development and implementation teams. Models are also required at a more granular level, in many cases down to the trade and account levels. Efficient model execution becomes valuable for banks to get timely responses to analysis requests. At the same time, models are subject to more stringent internal and external scrutiny. This paper introduces a suite of risk modeling solutions from credit risk modeling leader SAS® to help banks overcome these new challenges and meet the regulatory requirements.
Wei Chen, SAS
Martim Rocha, SAS
Jimmy Skoglund, SAS
Introduction: Cycling is on the rise in many urban areas across the United States. The number of cyclist fatalities is also increasing, by 19% in the last 3 years. With the broad-ranging personal and public health benefits of cycling, it is important to understand factors that are associated with these traffic-related deaths. There are more distracted drivers on the road than ever before, but the question remains to what extent these drivers are affecting cycling fatality rates. Methods: This paper uses the Fatality Analysis Reporting System (FARS) data to examine factors related to cyclist death when the driver is distracted. We use a novel machine learning approach, adaptive LASSO, to determine the relevant features and estimate their effect. Results: If a cyclist makes an improper action at or just before the time of the crash, the likelihood that the driver of the vehicle was distracted decreases. At the same time, if the driver is speeding or has failed to obey a traffic sign and fatally hits a cyclist, the likelihood that the driver was also distracted increases. Being distracted is related to other risky driving practices when cyclists are fatally injured. Environmental factors such as weather and road condition did not affect the likelihood that a driver was distracted when a cyclist fatality occurred.
Lysbeth Floden, University of Arizona
Dr. Melanie Bell, Dept. of Epidemiology & Biostatistics, University of Arizona
Patrick O'Connor, University of Arizona
Phishing is the attempt of a malicious entity to acquire personal, financial, or otherwise sensitive information such as user names and passwords from recipients through the transmission of seemingly legitimate emails. By quickly alerting recipients of known phishing attacks, an organization can reduce the likelihood that a user will succumb to the request and unknowingly provide sensitive information to attackers. Methods to detect phishing attacks typically require the body of each email to be analyzed. However, most academic institutions do not have the resources to scan individual emails as they are received, nor do they wish to retain and analyze message body data. Many institutions simply rely on the education and participation of recipients within their network. Recipients are encouraged to alert information security (IS) personnel of potential attacks as they are delivered to their mailboxes. This paper explores a novel and more automated approach that uses SAS® to examine email header and transmission data to determine likely phishing attempts that can be further analyzed by IS personnel. Previously, a collection of 2,703 emails from an external filtering appliance was examined with moderate success. This paper focuses on the gains from analyzing an additional 50,000 emails, with the inclusion of an additional 30 known attacks. Real-time email traffic is exported from Splunk Enterprise into SAS for analysis. The resulting model aids in alerting IS personnel to potential phishing attempts faster than a user simply forwarding a suspicious email to IS personnel.
Taylor Anderson, University of Alabama
Denise McManus, University of Alabama
Scotiabank's Colombian division, Colpatria, is the national leader in credit cards, with more than 1,600,000 active cards--equivalent to a portfolio of approximately 700 million dollars. The behavior score is used to offer credit cards through a cross-sell process, which happens only if customers have completed six months on books after using their first product with the bank. This is the minimum period of time required by the behavior Artificial Neural Network (ANN) model. The six-months-on-books internal policy suggests that the maturation of the client in this period is adequate, but this has never been proven. The following research aims to evaluate this hypothesis and calculate the appropriate time to offer cross-sales to new customers using Logistic Regression (Logit), while also segmenting these sales targets by their level of seniority using Discrete-Time Markov Chains (DTMC).
Oscar Javier Cortés Arrigui, Scotiabank - Colpatria
Miguel Angel Diaz Rodriguez, Scotiabank - Colpatria
Road safety is a major concern for all United States citizens. According to the National Highway Traffic Safety Administration, 30,000 deaths are caused by automobile accidents annually. Oftentimes fatalities occur due to a number of factors such as driver carelessness, speed of operation, impairment due to alcohol or drugs, and road environment. Some studies suggest that car crashes are solely due to driver factors, while other studies suggest car crashes are due to a combination of roadway and driver factors. However, other factors not mentioned in previous studies may be contributing to automobile accident fatalities. The objective of this project was to identify the significant factors that lead to multiple fatalities in the event of a car crash.
Bill Bentley, Value-Train
Gina Colaianni, Kennesaw State University
Cheryl Joneckis, Kennesaw State University
Sherry Ni, Kennesaw State University
Kennedy Onzere, Kennesaw State University
Logistic regression models are commonly used in direct marketing and consumer finance applications. In this context the paper discusses two topics about the fitting and evaluation of logistic regression models. Topic 1 is a comparison of two methods for finding multiple candidate models. The first method is the familiar best subsets approach. Best subsets is then compared to a proposed new method that combines the models produced by backward and forward selection with the predictors considered by backward and forward selection. This second method uses the SAS® procedure HPLOGISTIC with selection of models by the Schwarz Bayes criterion (SBC). Topic 2 is a discussion of model evaluation statistics to measure predictive accuracy and goodness of fit in support of the choice of a final model. Base SAS® and SAS/STAT® software are used.
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
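The sketch below is a hedged outline of the two search strategies on a hypothetical data set MARKETING with binary target RESPOND and candidate predictors X1-X20; the exact option names for the HPLOGISTIC selection step should be checked against the documentation.

/* Topic 1a: best subsets via the score statistic in PROC LOGISTIC */
proc logistic data=marketing;
   model respond(event='1') = x1-x20 / selection=score best=3;
run;

/* Topic 1b: backward selection judged by SBC in PROC HPLOGISTIC */
proc hplogistic data=marketing;
   model respond(event='1') = x1-x20;
   selection method=backward(select=sbc choose=sbc stop=sbc);
run;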
Analytics is now pervasive in our society, informing millions of decisions every year, and even making decisions without human intervention in automated systems. The implicit assumption is that all completed analytical projects are accurate and successful, but the reality is that some are not. Bad decisions that result from faulty analytics can be costly, embarrassing and even dangerous. By understanding the root causes of analytical failures, we can minimize the chances of repeating the mistakes of the past.
Analysts find the standard linear regression and analysis-of-variance models to be extremely convenient and useful tools. The standard linear model has the form observations = (sum of explanatory variables) + residual, with the assumptions of normality and homogeneity of variance. However, these tools are generally unsuitable for non-normal response variables. Various transformations can be used to stabilize the variance, but these transformations are often ineffective because they fail to address the skewness problem, and it can be complicated to transform the estimates back to their original scale and interpret the results of the analysis. At this point, we reach the limits of the standard linear model. This paper introduces generalized linear models (GzLM) using a systematic approach to adapting linear model methods to non-normal data. Why GzLM? Generalized linear models have greater power to identify model effects as statistically significant when the data are not normally distributed (Stroup, xvii). Unlike the standard linear model, the generalized linear model specifies the distribution of the observations, the linear predictor or predictors, the variance function, and the link function. A few examples show how to build a GzLM for a variety of response variables that follow a Poisson, negative binomial, exponential, or gamma distribution. The SAS/STAT® GENMOD procedure is used to compute the basic analyses.
Theresa Ngo, Warner Bros. Entertainment
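As a brief illustration, the sketch below fits two of the count-response GzLMs discussed above with the GENMOD procedure, using a hypothetical data set VISITS with count target N_VISITS and illustrative predictors.

/* Poisson regression with a log link */
proc genmod data=visits;
   class region;
   model n_visits = region age income / dist=poisson link=log;
run;

/* Negative binomial regression for overdispersed counts */
proc genmod data=visits;
   class region;
   model n_visits = region age income / dist=negbin link=log;
run;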
As a result of globalization, the durable goods market has become increasingly competitive, with market conditions that challenge profitability for manufacturers. Moreover, high material costs and the capital-intensive nature of the industry make it essential that companies understand demand signals and utilize supply chain capacity as effectively as possible. To grow and increase profitability under these challenging market conditions, a major durable goods company has partnered with SAS to streamline analysis of pricing and profitability, optimize inventory, and improve service levels to its customers. The price of a product is determined by a number of factors, such as the strategic importance of customers, supply chain costs, market conditions, and competitive prices. Offering promotions is an important part of a marketing strategy; it impacts purchasing behaviors of business customers and end consumers. This paper describes how this company developed a system to analyze product profitability and the impact of promotion on purchasing behaviors of both their business customers and end consumers. This paper also discusses how this company uses integrated demand planning and inventory optimization to manage its complex multi-echelon supply chain. The process uses historical order data to create a statistical forecast of demand, and then optimizes inventory across the supply chain to satisfy the forecast at desired service levels.
Varunraj Valsaraj, SAS
Bahadir Aral, SAS Institute Inc
Baris Kacar, SAS Institute Inc
Jinxin Yi, SAS Institute Inc
Retailers need critical information about the expected inventory pattern over the life of a product to make pricing, replenishment, and staffing decisions. Hotels rely on booking curves to set rates and staffing levels for future dates. This paper explores a linear state space approach to understanding these industry challenges, applying the SAS/ETS® SSM procedure. We also use the SAS/ETS SIMILARITY procedure to provide additional insight. These advanced techniques help us quantify the relationship between the current inventory level and all previous inventory levels (in the retail case). In the hospitality example, we can evaluate how current total bookings relate to historical booking levels. Applying these procedures can produce valuable new insights about the nature of the retail inventory cycle and the hotel booking curve.
Beth Cubbage, SAS
Logit-model-produced propensity scores have been intensively used to assist direct marketing name selection. As a result, only customers with a higher likelihood to respond are mailed offers in order to achieve cost reduction, and event ROI is increased. There is a fly in the ointment, however. Compared to the model-building performance time window, usually 6 to 12 months, a marketing event time period is usually much shorter. As such, this approach lacks the ability to deselect those who have a high propensity score but are unlikely to respond to an upcoming campaign. Dynamically building a complete propensity model for every upcoming campaign is nearly impossible, but incorporating time to respond has been of great interest to marketers as another dimension for enhancing response prediction. Hence, this paper presents an inventive modeling technique combining logistic regression and the Cox proportional hazards model. The objective of the fusion approach is to allow a customer's shorter time to next repurchase to compensate for a marginally lower propensity score in winning selection opportunities. The method is accomplished using PROC LOGISTIC, PROC LIFETEST, PROC LIFEREG, and PROC PHREG on the fusion model, which is built in a SAS® environment. This paper also touches on how to use the results to predict repurchase response by demonstrating a case of repurchase time-shift prediction on the 12-month inactive customers of a big box store retailer. The paper also shares a comparison of results between the fusion approach and logit alone. Comprehensive SAS macros are provided in the appendix.
Hsin-Yi Wang, Alliance Data Systems
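A hedged sketch of the two building blocks of the fusion approach appears below, using a hypothetical data set CUSTOMERS in which RESPOND flags campaign response, T_REPURCHASE is the time to repurchase, and CENSOR = 1 marks customers with no observed repurchase; how the two scores are combined follows the paper's method and is only indicated in the closing comment.

/* Propensity component */
proc logistic data=customers;
   model respond(event='1') = recency frequency monetary;
   output out=prop_scores p=p_respond;
run;

/* Time-to-repurchase component (Cox proportional hazards) */
proc phreg data=customers;
   model t_repurchase*censor(1) = recency frequency monetary;
   output out=surv_scores xbeta=risk_score;
run;
/* P_RESPOND and RISK_SCORE are then combined so that a shorter expected
   repurchase time can offset a marginally lower propensity score */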
Today's financial institutions employ tens of thousands of employees globally in the execution of manually intensive processes, from transaction processing to fraud investigation. While heuristics-based solutions have widely penetrated the industry, these solutions often work in the realm of broad generalities, lacking the nuance and experience of a human decision-maker. In today's regulatory environment, that translates to operational inefficiency and overhead as employees labor to rationalize human decisions that run counter to computed predictions. This session explores options that financial services institutions have to augment and automate day-to-day decision-making, leading to improvements in consistency, accuracy, and business efficiency. The session focuses on financial services case studies, including anti-money laundering, fraud, and transaction processing, to demonstrate real-world examples of how organizations can make the transition from predictions to decisions.
An ad publisher like eBay has multiple ad sources from which to serve ads. Specifically, for eBay's text ads there are two ad sources, viz. eBay Commerce Network (ECN) and Google. The problem and solution we have formulated consider two ad sources; however, the solution can be extended to any number of ad sources. A study was done on the performance of the ECN and Google ad sources with respect to the revenue per mille (RPM, or revenue per thousand impressions) of the ad queries they serve. It was found that ECN performs better for some ad queries and Google performs better for others. Thus, our problem is to optimally allocate ad traffic between the two ad sources to maximize RPM.
Huma Zaidi, eBay
Chitta Ranjan, GTU
Financial institutions are working hard to optimize credit and pricing strategies at adjudication and for ongoing account and customer management. For cards and other personal lending products, intense competitive pressure, together with relentless revenue challenges, creates a huge requirement for sophisticated credit and pricing optimization tools. Numerous credit and pricing optimization applications are available on the market to satisfy these needs. We present a relatively new approach that relies heavily on the effect modeling (uplift or net lift) technique for continuous target metrics--revenue, cost, losses, and profit. Examples of effect modeling to optimize the impact of marketing campaigns are well known. We discuss five essential steps on the credit and pricing optimization path: (1) setting up critical credit and pricing champion/challenger tests, (2) measuring the performance of specific test campaigns, (3) effect modeling, (4) defining the best effect model, and (5) moving from the effect model to the optimal solution. These steps require specific applications that are not readily available in SAS®; therefore, the necessary tools have been developed in SAS/STAT® software. We go through numerous examples to illustrate credit and pricing optimization solutions.
Yuri Medvedev, Bank of Montreal
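The paper's own tools are not reproduced here; the following is only a minimal sketch of one common effect-modeling setup, the two-model (champion versus challenger) difference approach for a continuous revenue target, with hypothetical data set and variable names.

proc glm data=test_campaign(where=(group='CHALLENGER'));
   model revenue = limit_change rate_change risk_score;
   store work.eff_challenger;           /* save the fitted model */
run; quit;

proc glm data=test_campaign(where=(group='CHAMPION'));
   model revenue = limit_change rate_change risk_score;
   store work.eff_champion;
run; quit;

proc plm restore=work.eff_challenger;
   score data=all_accounts out=pred_t predicted=rev_challenger;
run;

proc plm restore=work.eff_champion;
   score data=all_accounts out=pred_c predicted=rev_champion;
run;
/* merging PRED_T and PRED_C and taking rev_challenger - rev_champion gives
   the estimated incremental revenue (net lift) for each account */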
Due to advances in medical care and rising living standards, life expectancy in the US has increased to an average of 79 years. This has resulted in an aging population and increased demand for technologies that help elderly people live independently and safely. It is possible to meet this challenge through ambient-assisted living (AAL) technologies that help elderly people live more independently. Much research has been done on human activity recognition (HAR) in the last decade. This research can be used in the development of assistive technologies, and HAR is expected to be a key technology for future e-health systems. In this research, I discuss the need to predict human activity accurately by building various models in SAS® Enterprise Miner™ 14.1 on a free public data set that contains 165,633 observations and 19 attributes. The variables used in this research represent the readings of accelerometers mounted on the waist, left thigh, right arm, and right ankle of four individuals performing five different activities, recorded over a period of eight hours. The target variable identifies the human activity: sitting, sitting down, standing, standing up, or walking. Upon comparing different models, the winner is an auto neural model whose inputs are selected by stepwise logistic regression. The model has an accuracy of 98.73% and sensitivity of 98.42%.
Venkata Vennelakanti, Oklahoma State University
There is a growing recognition that mental health is a vital public health and development issue worldwide. Numerous studies have reported close interactions between poverty and mental illness. The association between mental illness and poverty is cyclic and negative: impoverished people are generally more prone to mental illness and are less able to afford treatment, and likewise, people with mental health problems are more likely to be in poverty. The availability of free mental health treatment to people in poverty is critical to breaking this vicious cycle. Based on this hypothesis, a model was developed using the responses provided by mental health facilities to a federally supported survey. We examined whether we can predict which mental health facilities in the United States will offer free treatment to patients who cannot afford treatment costs. About a third of the 9,076 mental health facilities that responded to the survey stated that they offer free treatment to patients incapable of paying. Different machine learning algorithms and regression models were assessed to predict correctly which facilities would offer treatment at no cost. Using a neural network model in conjunction with a decision tree for input variable selection, we found that the best performing model can predict the mental health facilities with an overall accuracy of 71.49%. Sensitivity and specificity of the selected neural network model were 82.96% and 56.57%, respectively. The top five most important covariates that explain the model's predictive power are: ownership of the mental health facility, whether the facility provides a sliding fee scale, the type of facility, whether the facility provides mental health treatment services in Spanish, and whether the facility offers smoking cessation services. More detailed spatial and descriptive analysis of these key variables will be conducted.
Ram Poudel, Oklahoma State University
Lynn Xiang, Oklahoma State University
Years ago, doctors advised women with autoimmune diseases such as systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) not to become pregnant out of concern for maternal health. Now it is known that healthy pregnancy is possible for women with lupus, but at the expense of a higher pregnancy complication rate. The main objective of this research is to identify key factors contributing to these diseases and to predict the occurrence rate of SLE and RA in pregnant women. Based on the approach used in this study, the prediction of adverse pregnancy outcomes for women with SLE, RA, and other diseases such as diabetes mellitus (DM) and antiphospholipid antibody syndrome (APS) can be carried out. These results will help pregnant women undergo a healthy pregnancy by receiving proper medication at an earlier stage. The data set was obtained from the Cerner Health Facts data warehouse. The raw data set contains 883,473 records and 85 variables such as diagnosis code, age, race, procedure code, admission date, discharge date, total charges, and so on. Analyses were carried out with two different data sets--one for SLE patients and the other for RA patients. The final data sets had 398,742 and 397,898 records for modeling SLE and RA patients, respectively. To provide an honest assessment of the models, the data was split into training and validation sets using the Data Partition node. Variable selection techniques such as LASSO, LARS, stepwise regression, and forward regression were used. Using a decision tree, prominent factors that determine the SLE and RA occurrence rates were identified separately. Of all the predictive models run using SAS® Enterprise Miner™ 12.3, the Model Comparison node identified the decision tree (Gini) as the best model, with the lowest misclassification rate: 0.308 for predicting SLE patients and 0.288 for predicting RA patients.
Ravikrishna Vijaya Kumar, Oklahoma State University
Shrie Raam Sathyanarayanan, Oklahoma State University - Center for Health Sciences
In early 2006, the United States experienced a housing bubble that affected over half of the American states. It was one of the leading causes of the 2007-2008 financial recession. Primarily, the overvaluation of housing units resulted in foreclosures and prolonged unemployment during and after the recession period. The main objective of this study is to predict the current market value of a housing unit with respect to fair market rent, census region, metropolitan statistical area, area median income, household income, poverty income, number of units in the building, number of bedrooms in the unit, utility costs, other costs of the unit, and so on, in order to determine which factors affect the market value of the housing unit. For the purpose of this study, data was collected from the Housing Affordability Data System of the US Department of Housing and Urban Development. The data set contains 20 variables and 36,675 observations. To select the best possible input variables, several variable selection techniques were tested: LARS (least angle regression), LASSO (least absolute shrinkage and selection operator), adaptive LASSO, variable selection, variable clustering, stepwise regression, principal component analysis (PCA) with only numeric variables, and PCA with all variables. After selecting input variables, numerous modeling techniques were applied to predict the current market value of a housing unit. An in-depth analysis of the findings revealed that the current market value of a housing unit is significantly affected by the fair market value, insurance and other costs, structure type, household income, and more. Furthermore, a higher household income and a higher area median income are associated with a higher market value of a housing unit.
Mostakim Tanjil, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
This session discusses challenges and considerations typically faced in credit risk scoring as well as options and practical ways to address them. No two problems are ever the same even if the general approach is clear. Every model has its own unique characteristics and creative ways to address them. Successful credit scoring modeling projects are always based on a combination of both advanced analytical techniques and data, and a deep understanding of the business and how the model will be applied. Different aspects of the process are discussed, including feature selection, reject inferencing, sample selection and validation, and model design questions and considerations.
Regina Malina, Equifax
In customer relationship management (CRM) or consumer finance it is important to predict the time of repeated events. These repeated events might be a purchase, service visit, or late payment. Specifically, the goal is to find the probability density for the time to first event, the probability density for the time to second event, and so on. Two approaches are presented and contrasted. One approach uses discrete time hazard modeling (DTHM). The second, a distinctly different approach, uses a multinomial logistic model (MLM). Both DTHM and MLM satisfy an important consistency criterion, which is to provide probability densities whose average across all customers equals the empirical population average. DTHM requires fewer models if the number of repeated events is small. MLM requires fewer models if the time horizon is small. SAS® code is provided through a simulation in order to illustrate the two approaches.
Bruce Lund, Magnify Analytic Solutions, Division of Marketing Associates
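The paper's simulation code is not reproduced here, but the sketch below outlines how each approach is typically specified in PROC LOGISTIC, using hypothetical data sets: PERIODS is a person-period file with one record per customer per month and EVENT = 1 in the month the repeated event occurs, while CUSTOMERS has one record per customer with TIME_BIN holding the (possibly censored) event-time category.

/* Discrete-time hazard model (DTHM): logistic regression on the person-period file */
proc logistic data=periods;
   class month / param=ref;
   model event(event='1') = month recency frequency;
run;

/* Multinomial logistic model (MLM): generalized logit over the time-bin outcome */
proc logistic data=customers;
   model time_bin = recency frequency / link=glogit;
run;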
This paper describes the merging of designed experiments and regularized regression to find the significant factors to use for a real-time predictive model. The real-time predictive model is used in a Weyerhaeuser modified fiber mill to estimate the value of several key quality characteristics. The modified fiber product is accepted or rejected based on the model prediction or the predicted values' deviation from lab tests. To develop the model, a designed experiment was needed at the mill. The experiment was planned by a team of engineers, managers, operators, lab techs, and a statistician. The data analysis used the actual values of the process variables manipulated in the designed experiment. There were instances of the set point not being achieved or maintained for the duration of the run. The lab tests of the key quality characteristics were used as the responses. The experiment was designed with JMP® 64-bit Edition 11.2.0 and estimated the two-way interactions of the process variables. It was thought that one of the process variables was curvilinear and this effect was also made estimable. The LASSO method, as implemented in the SAS® GLMSELECT procedure, was used to analyze the data after cleaning and validating the results. Cross validation was used as the variable selection operator. The resulting prediction model passed all the predetermined success criteria. The prediction model is currently being used real time in the mill for product quality acceptance.
Jon Lindenauer, Weyerhaeuser
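A condensed sketch of the selection step is shown below for a hypothetical experiment data set TRIAL, with lab response Y, process variables X1-X8, their two-way interactions, and a quadratic term for the variable thought to be curvilinear; five-fold cross validation chooses the LASSO tuning point.

proc glmselect data=trial plots=coefficients;
   model y = x1|x2|x3|x4|x5|x6|x7|x8 @2  x3*x3 /
         selection=lasso(choose=cv stop=none)
         cvmethod=random(5);             /* 5-fold cross validation */
run;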
As credit unions market themselves to increase their market share against the big banks, they understandably focus on gaining new members. However, they must also retain their existing members. Otherwise, the new members they gain can easily be offset by existing members who leave. Happily, by using predictive analytics as described in this paper, keeping (and further engaging) existing members can actually be much easier and less expensive than enlisting new members. This paper provides a step-by-step overview of a relatively simple but comprehensive approach to reduce member attrition. We first prepare the data for a statistical analysis. With some basic predictive analytics techniques, we can then identify those members who have the highest chance of leaving and the highest value. For each of these members, we can also identify why they would leave, thus suggesting the best way to intervene to retain them. We then make suggestions to improve the model for better accuracy. Finally, we provide suggestions to extend this approach to further engaging existing members and thus increasing their lifetime value. This approach can also be applied to many other organizations and industries. Code snippets are shown for any version of SAS® software; they also require SAS/STAT® software.
Nate Derby, Stakana Analytics
Mark Keintz, Wharton Research Data Services
Logistic regression is one of the most widely used regression models in statistics. Its dependent variable is a discrete variable with at least two levels. It measures the relationship between categorical dependent variables and independent variables and predicts the likelihood of the event associated with the outcome variable. Variable reduction and screening are techniques that reduce redundant independent variables and filter out independent variables with less predictive power. For several reasons, variable reduction and screening are important and critical, especially when there are hundreds of covariates that could possibly fit into the model. These techniques decrease model estimation time, help avoid non-convergence, and support unbiased inference about the outcome. This paper reviews various approaches for variable reduction, including traditional methods starting from univariate single logistic regressions and methods based on variable clustering techniques. It also presents hands-on examples using SAS/STAT® software and illustrates three selection nodes available in SAS® Enterprise Miner™: the Variable Selection node, the Variable Clustering node, and the Decision Tree node.
Xinghe Lu, Amerihealth Caritas
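As a small illustration of two of the SAS/STAT® approaches reviewed above, the sketch below clusters a hypothetical set of candidates X1-X200 with PROC VARCLUS and shows the form of a univariate screening fit; in practice the single-variable logistic regression is repeated (for example, in a macro loop) for every candidate.

/* Variable clustering: keep one or two representatives per cluster */
proc varclus data=members maxeigen=0.7 short hi;
   var x1-x200;
run;

/* Univariate screening: a single-predictor logistic regression, one per candidate */
proc logistic data=members;
   model event(event='1') = x1;
run;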
Coronal mass ejections (CMEs) are massive explosions of magnetic field and plasma from the Sun. While responsible for the northern lights, these eruptions can cause geomagnetic storms and cataclysmic damage to Earth's telecommunications systems and power grid infrastructures. Hence, it is imperative to construct highly accurate predictive processes to determine whether an incoming CME will produce devastating effects on Earth. One such process, called stacked generalization, trains a variety of models, or base-learners, on a data set. Then, using the predictions from the base-learners, another model is trained to learn from the metadata. The goal of this meta-learner is to deduce information about the biases from the base-learners to make more accurate predictions. Studies have shown success in using linear methods, especially within regularization frameworks, at the meta-level to combine the base-level predictions. Here, SAS® Enterprise Miner™ 13.1 is used to reinforce the advantages of regularization via the Least Absolute Shrinkage and Selection Operator (LASSO) on this type of metadata. This work compares the LASSO model selection method to other regression approaches when predicting the occurrence of strong geomagnetic storms caused by CMEs.
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
Similarity analysis is used to classify an entire time series by type. In this method, a distance measure is calculated to reflect the difference in form between two ordered sequences, such as are found in time series data. The mutual differences between many sequences or time series can be compiled into a similarity matrix, similar in form to a correlation matrix. When cluster methods are applied to a similarity matrix, time series with similar patterns are grouped together and placed into clusters distinct from others that contain time series with different patterns. In this example, similarity analysis is used to classify supernovae by the way the amount of light they produce changes over time.
David Corliss, Ford Motor Company
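A hedged sketch of the clustering step follows: once the pairwise dissimilarities between light curves have been assembled into a TYPE=DISTANCE data set (here called DISTMAT, a hypothetical name), standard clustering procedures group the series by shape.

proc cluster data=distmat(type=distance) method=ward outtree=tree noprint;
   id sn_id;                       /* supernova identifier */
run;

proc tree data=tree out=clusters nclusters=3 noprint;
   id sn_id;                       /* assign each light curve to one of 3 clusters */
run;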
Machine learning is gaining popularity, fueled by the rapid advancement of computing technology. Machine learning offers the ability to continuously adjust predictive models based on real-time data, and to visualize changes and take action on the new information. Hear from two PhDs from SAS and Intel about how SAS® machine learning, working with SAS streaming analytics and Intel microprocessor technology, makes this possible.
Most businesses have benefited from using advanced analytics for marketing and other decision making. But applying analytical techniques to pharmaceutical marketing is challenging and still emerging, because it is critical to ensure that the analysis makes sense from the medical side. The drug ultimately consumed for a specific disease is directly or indirectly influenced by many factors, including the disease origins, health-care system policy, physicians' clinical decisions, and patients' perceptions and behaviors. The key to pharmaceutical marketing is to identify the targeted populations for specific diseases and to focus on those populations. Because the health-care environment changes constantly, predictive models are important to predict the change in the targeted population over time based on the patient journey and epidemiology. Time series analysis is used to forecast the number of cases of infectious diseases; correspondingly, demand for over-the-counter and prescribed medicines for the specific disease can be predicted. Accurate prediction provides valuable information for the strategic planning of campaigns. For different diseases, different analytical techniques are applied. By taking the medical features of the disease and its epidemiology into account, the prediction of the potential and total addressable markets can reveal more insightful marketing trends. And by simulating the important factors and quantifying how they affect the patient journey within the typical health-care system, the most accurate demand for specific medicines or treatments can be discovered. By monitoring the parameters in the dynamic simulation, smart decisions can be made using what-if comparisons to optimize the marketing result.
Xue Yao, Winnipeg Regional Health Authority
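As a minimal sketch of the time series forecasting step mentioned above, the code below uses the ESM procedure on a hypothetical monthly series of infectious-disease case counts; the forecasts would then drive the demand estimates for the corresponding medicines.

proc esm data=flu_cases outfor=flu_forecast lead=12 back=6;
   id month interval=month;
   forecast cases / model=addwinters;   /* additive Winters exponential smoothing */
run;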