SAS Enterprise Miner Papers A-Z

A
Paper 3282-2015:
A Case Study: Improve Classification of Rare Events with SAS® Enterprise Miner™
Imbalanced data are frequently seen in fraud detection, direct marketing, disease prediction, and many other areas. Rare events are sometimes of primary interest. Classifying them correctly is the challenge that many predictive modelers face today. In this paper, we use SAS® Enterprise Miner™ on a marketing data set to demonstrate and compare several approaches that are commonly used to handle imbalanced data problems in classification models. The approaches are based on cost-sensitive measures and sampling measures. A rather novel technique called SMOTE (Synthetic Minority Over-sampling TEchnique), which has achieved the best result in our comparison, will be discussed.
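For readers unfamiliar with SMOTE, the heart of the technique is interpolation between a minority-class record and one of its nearest minority-class neighbors. The SAS data step below is a minimal sketch of that interpolation step only, not the authors' implementation; it assumes a hypothetical table work.pairs that already holds each minority record (x1-x3) alongside one of its nearest minority-class neighbors (n1-n3).
   /* Sketch of the SMOTE interpolation step; the neighbor search is assumed done. */
   data work.synthetic;
      set work.pairs;
      array x[3] x1-x3;                      /* original minority record     */
      array n[3] n1-n3;                      /* one of its nearest neighbors */
      array s[3] s1-s3;                      /* new synthetic record         */
      gap = ranuni(20150101);                /* random weight in (0,1)       */
      do j = 1 to 3;
         s[j] = x[j] + gap * (n[j] - x[j]);  /* point on the segment x -> n  */
      end;
      keep s1-s3;
   run;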
Read the paper (PDF). | Download the data file (ZIP).
Ruizhe Wang, GuideWell Connect
Novik Lee, GuideWell Connect
Yun Wei, GuideWell Connect
Paper 2641-2015:
A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis
In big data, many variables are polytomous, with many levels. When the outcome is binary, the common way to handle a polytomous independent variable is a series of design variables, which correspond to the levels specified in the CLASS statement of PROC LOGISTIC. If big data contains many polytomous independent variables with many levels, design variables make the analysis complicated in both computation time and results, and they might add little to the prediction of the outcome. This paper presents a new, simple method for logistic regression with polytomous independent variables in big data analysis. In the proposed method, the first step is an iterative statistical analysis run from a SAS® macro program. Similar to the algorithm used to create spline variables, this analysis searches a polytomous independent variable for aggregated groups of levels that differ in a statistically significant way. The SAS macro program iteratively searches for new level groups with statistically significant differences. It starts from level 1, the level with the smallest outcome mean, and tests it against level 2, the level with the second-smallest outcome mean. If these two groups differ significantly, the procedure moves on to test level 2 against level 3. If level 1 and level 2 do not differ significantly, they are combined into a new level group 1, which is then tested against level 3. The process continues until all levels have been tested. The original level values of the polytomous variable are then replaced by the new level values that carry statistically significant differences. The polytomous variable with the new levels can be described by the means of those levels because of the one-to-one equivalence, in logit, of a piecewise function from the variable's levels to the outcome means. It is easy to show, based on maximum likelihood analysis, that the conditional mean of an outcome y given a polytomous variable x is a very good approximation. Compared with design variables, the new piecewise variable, which carries the information of all levels in a single independent variable, captures the impact of all levels in a much simpler way. We have used this method in predictive models of customer attrition with polytomous variables such as state, business type, and customer claim type. All of these polytomous variables yield significantly better prediction of customer attrition than models built without them or with design variables.
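As a rough illustration of the encoding idea only (the significance-driven merging of adjacent levels is what the paper adds), the sketch below replaces each level of a hypothetical polytomous variable with its observed outcome mean and then uses that single numeric variable in PROC LOGISTIC; the data set and variable names (work.train, state, y) are assumptions.
   /* Sketch: encode a polytomous variable by its outcome means, then model it
      as a single numeric predictor. Names are hypothetical; y is numeric 0/1. */
   proc sql;
      create table work.state_means as
      select state, mean(y) as state_mean
      from work.train
      group by state;
      create table work.train_enc as
      select a.*, b.state_mean
      from work.train as a left join work.state_means as b
      on a.state = b.state;
   quit;
   proc logistic data=work.train_enc;
      model y(event='1') = state_mean;   /* one predictor instead of many design variables */
   run;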
Read the paper (PDF).
Jian Gao, Constant Contact
Jesse Harriot, Constant Contact
Lisa Pimentel, Constant Contact
Paper 3492-2015:
Alien Nation: Text Analysis of UFO Sightings in the US Using SAS® Enterprise Miner™ 13.1
Are we alone in this universe? This is a question that undoubtedly passes through every mind several times during a lifetime. We often hear stories about close encounters, Unidentified Flying Object (UFO) sightings, and other mysterious things, but we lack documented evidence for analysis on this topic. UFOs have been a matter of public interest for a long time. The objective of this paper is to analyze a database that holds a collection of documented reports of UFO sightings and to uncover any fascinating stories related to the data. Using SAS® Enterprise Miner™ 13.1, the powerful capabilities of text analytics and topic mining are leveraged to summarize the associations between reported sightings. We used PROC GEOCODE to convert the addresses of sightings into locations on a map. Then we used the GMAP procedure to produce a heat map representing the frequency of sightings in various locations. The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values). These geographic coordinates can then be used on a map to calculate distances or to perform spatial analysis. A preliminary analysis of the data associated with sightings found that the most popular words associated with UFOs describe their shapes, formations, movements, and colors. The Text Profile node in SAS Enterprise Miner 13.1 was leveraged to build a model and cluster the data into different levels of a segment variable. We also explain how opinions about the UFO sightings change over time using text profiling. Further, this analysis uses the Text Profile node to find interesting terms or topics that were used to describe the UFO sightings. Based on feedback received at the SAS® Analytics Conference, we plan to incorporate a technique to filter duplicate comments and to include weather data for each location.
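The geocode-then-map step described above can be sketched as follows, assuming ZIP-level geocoding and a state-level frequency map; the data set and variable names (work.sightings with ZIP and STATECODE variables) are hypothetical, and the paper's exact options may differ.
   /* Hypothetical sketch: geocode by ZIP (adds X/Y coordinates), then map
      sighting counts by state with PROC GMAP. */
   proc geocode method=zip data=work.sightings out=work.sightings_geo;
   run;
   proc freq data=work.sightings_geo noprint;
      tables statecode / out=work.state_counts;     /* COUNT per state */
   run;
   data work.state_counts;
      set work.state_counts;
      state = stfips(statecode);                    /* FIPS code to match maps.us */
   run;
   proc gmap data=work.state_counts map=maps.us;
      id state;
      choro count / levels=5;                       /* heat map of sighting frequency */
   run;
   quit;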
Read the paper (PDF). | Download the data file (ZIP).
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
Naresh Abburi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Zabiulla Mohammed, Oklahoma State University
Paper 3197-2015:
All Payer Claims Databases (APCDs) in Data Transparency and Quality Improvement
Since Maine established the first All Payer Claims Database (APCD) in 2003, 10 additional states have established APCDs and 30 others are in development or show strong interest in establishing APCDs. APCDs are generally mandated by legislation, though voluntary efforts exist. They are administered through various agencies, including state health departments or other governmental agencies and private not-for-profit organizations. APCDs receive funding from various sources, including legislative appropriations and private foundations. To ensure sustainability, APCDs must also consider the sale of data access and reports as a source of revenue. With the advent of the Affordable Care Act, there has been an increased interest in APCDs as a data source to aid in health care reform. The call for greater transparency in health care pricing and quality, development of Patient-Centered Medical Homes (PCMHs) and Accountable Care Organizations (ACOs), expansion of state Medicaid programs, and establishment of health insurance and health information exchanges have increased the demand for the type of administrative claims data contained in an APCD. Data collection, management, analysis, and reporting issues are examined with examples from implementations of live APCDs. The development of data intake, processing, warehousing, and reporting standards is discussed in light of achieving the triple aim of improving the individual experience of care; improving the health of populations; and reducing the per capita costs of care. APCDs are compared and contrasted with other sources of state-level health care data, including hospital discharge databases, state departments of insurance records, and institutional and consumer surveys. The benefits and limitations of administrative claims data are reviewed. Specific issues addressed with examples include implementing transparent reporting of service prices and provider quality, maintaining master patient and provider identifiers, validating APCD data and comparing it with other state health care data available to researchers and consumers, defining data suppression rules to ensure patient confidentiality and HIPAA-compliant data release and reporting, and serving multiple end users, including policy makers, researchers, and consumers, with appropriately consumable information.
Read the paper (PDF). | Watch the recording.
Paul LaBrec, 3M Health Information Systems
Paper 3293-2015:
Analyzing Direct Marketing Campaign Performance Using Weight of Evidence Coding and Information Value through SAS® Enterprise Miner™ Incremental Response
Data mining and predictive models are extensively used to find the optimal customer targets in order to maximize return on investment. Traditional direct marketing techniques target all customers who are likely to buy, regardless of customer classification. In a real sense, this mechanism cannot identify the customers who would buy even without a marketing contact, thereby resulting in a loss on investment. This paper focuses on the incremental lift modeling approach using weight of evidence coding and information value, followed by incremental response and outcome model diagnostics. This model identifies the additional purchases that would not have taken place without a marketing campaign. Modeling work was conducted using a combined model. The research was carried out on Travel Center data. This data identifies an increase in the average response rate of 2.8% and an increase of 244 in the number of fuel gallons when compared with the results from the traditional campaign, which targeted everyone. This paper discusses in detail the implementation of the Incremental Response node to direct marketing campaigns and the corresponding incremental revenue and profit analysis.
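For reference, weight of evidence (WoE) and information value (IV) for a grouped predictor can be computed in a few lines of SAS; the sketch below is generic (one common sign convention), not the paper's code, and the data set and variable names (work.train, bin, a numeric 0/1 target) are assumptions.
   /* Sketch: WoE and IV for one grouped predictor; target is numeric 0/1. */
   proc sql;
      create table work.woe as
      select bin,
             sum(target=1) / (select sum(target=1) from work.train) as pct_event,
             sum(target=0) / (select sum(target=0) from work.train) as pct_nonevent
      from work.train
      group by bin;
   quit;
   data work.woe;
      set work.woe;
      woe     = log(pct_nonevent / pct_event);       /* one common sign convention */
      iv_term = (pct_nonevent - pct_event) * woe;
   run;
   proc means data=work.woe sum;
      var iv_term;                                   /* information value = sum over bins */
   run;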
Read the paper (PDF). | Download the data file (ZIP).
Sravan Vadigepalli, Best Buy
Paper 3472-2015:
Analyzing Marine Piracy from Structured and Unstructured Data Using SAS® Text Miner
Approximately 80% of world trade at present uses the seaways, with around 110,000 merchant vessels, 1.25 million seafarers, and almost 6 billion tons of goods transported every year. Marine piracy stands as a serious challenge to sea trade. Understanding how pirate attacks occur is crucial to effectively countering marine piracy. Predictive modeling using the combination of textual data with numeric data provides an effective methodology to derive insights from both structured and unstructured data. 2,266 text descriptions about pirate incidents that occurred over the past seven years, from 2008 to the second quarter of 2014, were collected from the International Maritime Bureau (IMB) website. Analysis of the textual data using SAS® Enterprise Miner™ 12.3, with the help of concept links, answered questions on certain aspects of pirate activities, such as the following: 1. What are the arms used by pirates for attacks? 2. How do pirates steal the ships? 3. How do pirates escape after the attacks? 4. What are the reasons for occasional unsuccessful attacks? Topics are extracted from the text descriptions using a Text Topic node, and the varying trends of these topics are analyzed with respect to time. Using the Cluster node, attack descriptions are classified into different categories based on the attack style and pirate behavior described by a set of terms. A target variable called Attack Type is derived from the clusters and is combined with other structured input variables such as Ship Type, Status, Region, Part of Day, and Part of Year. A predictive model is built with Attack Type as the target variable and the other structured data variables as input predictors. The predictive model is used to predict the possible type of attack given the details of a ship and its travel. Thus, the results of this paper could be very helpful for the shipping industry to become more aware of possible attack types for different vessel types when traversing different routes, and to devise counter-strategies to reduce the effects of piracy on crews, vessels, and cargo.
Read the paper (PDF).
Raghavender Reddy Byreddy, Oklahoma State University
Nitish Byri, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Tejeshwar Gurram, Oklahoma State University
Anvesh Reddy Minukuri, Oklahoma State University
Paper 3330-2015:
Analyzing and Visualizing the Sentiment of the Ebola Outbreak via Tweets
The Ebola virus outbreak is producing some of the most significant and fastest trending news throughout the globe today. There is a lot of buzz surrounding the deadly disease and the drastic consequences that it potentially poses to mankind. Social media provides the basic platforms for millions of people to discuss the issue and allows them to openly voice their opinions. There has been a significant increase in the magnitude of responses all over the world since the death of an Ebola patient in a Dallas, Texas hospital. In this paper, we aim to analyze the overall sentiment that is prevailing in the world of social media. For this, we extracted the live streaming data from Twitter at two different times using the Python scripting language. One instance relates to the period before the death of the patient, and the other relates to the period after the death. We used SAS® Text Miner nodes to parse, filter, and analyze the data and to get a feel for the patterns that exist in the tweets. We then used SAS® Sentiment Analysis Studio to further analyze and predict the sentiment of the Ebola outbreak in the United States. In our results, we found that the issue was not taken very seriously until the death of the Ebola patient in Dallas. After the death, we found that prominent personalities across the globe were talking about the disease and then raised funds to fight it. We are continuing to collect tweets. We analyze the locations of the tweets to produce a heat map that corresponds to the intensity of the varying sentiment across locations.
Read the paper (PDF).
Dheeraj Jami, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Shivkanth Lanka, Oklahoma State University
Paper 3426-2015:
Application of Text Mining on Tweets to Collect Rational Information about Type-2 Diabetes
Twitter is a powerful form of social media for sharing information about various issues and can be used to raise awareness and collect pointers about associated risk factors and preventive measures. Type-2 diabetes is a national problem in the US. We analyzed Twitter feeds about Type-2 diabetes in order to suggest a rational use of social media for an evidence-based study of any ailment. To accomplish this task, 900 tweets were collected using Twitter API v1.1 in a Python script. Tweets, follower counts, and user information were extracted via the scripts. The tweets were segregated into different groups on the basis of their annotations related to risk factors, complications, preventions and precautions, and so on. We then used SAS® Text Miner to analyze the data. We found that 70% of the tweets stated the status quo, based on marketing and awareness campaigns. The remaining 30% of tweets contained various key terms and labels associated with Type-2 diabetes. It was observed that influential users tweeted more about precautionary measures, whereas non-influential people gave suggestions about treatments as well as preventions and precautions.
Read the paper (PDF).
Shubhi Choudhary, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Vijay Singh, Oklahoma State University
Paper 3449-2015:
Applied Analytics in Festival Tourism: A Case Study of Intention-to-Revisit Prediction in an Annual Local Food Festival in Thailand
Improving tourists' satisfaction and their intention to revisit a festival is an ongoing area of interest to the tourism industry. Many festival organizers strive very hard to attract and retain attendees by investing heavily in their marketing and promotion strategies for the festival. To meet this challenge, an advanced analytical model based on a data mining approach is proposed to answer the following research question: What are the most important factors that influence tourists' intentions to revisit the festival site? Cluster analysis, neural networks, decision trees, stepwise regression, polynomial regression, and support vector machines are applied in this study. The main goal is to determine what it takes not only to retain loyal attendees, but also to attract new attendees and encourage them to return to the site.
Read the paper (PDF).
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Thanathorn Vajirakachorn, Dept. of Tourism Management, School of Business, University of the Thai Chamber of Commerce
Paper 3354-2015:
Applying Data-Driven Analytics to Relieve Suffering Associated with Natural Disasters
Managing the large-scale displacement of people and communities caused by a natural disaster has historically been reactive rather than proactive. Following a disaster, data is collected to inform and prompt operational responses. In many countries prone to frequent natural disasters, such as the Philippines, large amounts of longitudinal data are collected and available to apply to new disaster scenarios. However, because of the nature of natural disasters, it is difficult to analyze all of the data until long after the emergency has passed. For this reason, little research and analysis have been conducted to derive deeper analytical insight for proactive responses. This paper demonstrates the application of SAS® analytics to this data and establishes predictive alternatives that can improve conventional storm responses. Humanitarian organizations can use this data to understand displacement patterns and trends and to optimize evacuation routing and planning. Identifying the main contributing factors and leading indicators for the displacement of communities in a timely and efficient manner prevents detrimental incidents at disaster evacuation sites. Using quantitative and qualitative methods, responding organizations can make data-driven decisions that innovate and improve approaches to managing disaster response on a global basis. The benefits of creating a data-driven analytical model can help reduce response time, improve the health and safety of displaced individuals, and optimize scarce resources in a more effective manner. The International Organization for Migration (IOM), an intergovernmental organization, is one of the first-response organizations on the ground in most emergencies. IOM is the global co-lead for the Camp Coordination and Camp Management (CCCM) cluster in natural disasters. This paper shows how SAS® Visual Analytics and SAS® Visual Statistics were used for the Philippines in response to Super Typhoon Haiyan in November 2013 to develop increasingly accurate models for better emergency preparedness. Using data collected from IOM's Displacement Tracking Matrix (DTM), the final analysis shows how to better coordinate service delivery to evacuation centers sheltering large numbers of displaced individuals, applying accurate hindsight to develop foresight on how to better respond to emergencies and disasters. Predictive models build on patterns found in historical and transactional data to identify risks and opportunities. The capacity to predict trends and behavior patterns related to displacement and mobility has the potential to enable the IOM to respond in a more timely and targeted manner. By predicting the locations of displacement, the numbers of persons displaced, the number of vulnerable groups, and the sites most at risk of security incidents, humanitarians can respond quickly and more effectively with the appropriate resources (material and human) from the outset. The end analysis uses the SAS® Storm Optimization model combined with human mobility algorithms to predict population movement.
Lorelle Yuen, International Organization for Migration
Kathy Ball, Devon Energy
B
Paper 3082-2015:
Big Data Meets Little Data: Hadoop and Arduino Integration Using SAS®
SAS® has been an early leader in big data technology architecture that more easily integrates unstructured files across multi-tier data system platforms. By using SAS® Data Integration Studio and SAS® Enterprise Business Intelligence software, you can easily automate big data using SAS® system accommodations for Hadoop open-source standards. At the same time, another seminal technology has emerged, which involves real-time multi-sensor data integration using Arduino microprocessors. This break-out session demonstrates the use of SAS® 9.4 coding to define Hadoop clusters and to automate Arduino data acquisition to convert custom unstructured log files into structured tables, which can be analyzed by SAS in near real time. Examples include the use of SAS Data Integration Studio to create and automate stored processes, as well as tips for C language object coding to integrate to SAS data management, with a simple temperature monitoring application for Hadoop to Arduino using SAS.
Keith Allan Jones, PhD, QUALIMATIX.com
C
Paper SAS1913-2015:
Clustering Techniques to Uncover Relative Pricing Opportunities: Relative Pricing Corridors Using SAS® Enterprise Miner and SAS® Visual Analytics
The impact of price on brand sales is not always linear or independent of other brand prices. We demonstrate, using sales information and SAS® Enterprise Miner, how to uncover relative price bands where prices might be increased without losing market share or decreased slightly to gain share.
Read the paper (PDF).
Ryan Carr, SAS
Charles Park, Lenovo
Paper 3511-2015:
Credit Scorecard Generation Using the Credit Scoring Node in SAS® Enterprise Miner™
In today's competitive world, acquiring new customers is crucial for businesses, but what if most of the acquired customers turn out to be defaulters? This decision would backfire on the business and might lead to losses. Extant statistical methods have enabled businesses to identify good-risk customers rather than judging them intuitively. The objective of this paper is to build a credit risk scorecard using the Credit Scoring node inside SAS® Enterprise Miner™ 12.3, which a manager can use to make an instant decision on whether to accept or reject a customer's credit application. The data set used for credit scoring was extracted from the UCI Machine Learning Repository and consists of 15 variables that capture details such as the status of the customer's existing checking account, purpose of the credit, credit amount, employment status, and property. To ensure generalization of the model, the data set was partitioned using the Data Partition node into training and validation groups in a 70:30 ratio. The target is a binary variable that categorizes customers into good-risk and bad-risk groups. After identifying the key variables required to generate the credit scorecard, a particular score was assigned to each of their subgroups. The final model generating the scorecard has a prediction accuracy of about 75%. A cumulative cut-off score of 120 was generated by SAS to mark the demarcation between good-risk and bad-risk customers. Even in the case of future variations in the data, model refinement is easy because the whole process is already defined and does not need to be rebuilt from scratch.
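The 70:30 partition and the accept/reject decision at the cutoff described above can be sketched outside SAS Enterprise Miner as follows; the data set and score variable names are hypothetical, the direction of the rule is illustrative, and only the cutoff of 120 comes from the abstract.
   /* Sketch: 70/30 training/validation split, then a simple accept/reject
      rule at a total scorecard score of 120. Names are hypothetical. */
   proc surveyselect data=work.applicants out=work.split
                     samprate=0.70 outall seed=2015;
   run;
   data work.train work.valid;
      set work.split;
      if selected then output work.train;    /* Selected=1 -> 70% training   */
      else output work.valid;                /* Selected=0 -> 30% validation */
   run;
   data work.decisions;
      set work.valid;
      length decision $6;
      if total_score >= 120 then decision = 'accept';   /* illustrative direction */
      else decision = 'reject';
   run;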
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Kushal Kathed, Oklahoma State University
Shilpi Prasad, Oklahoma State University
D
Paper SAS4780-2015:
Deriving Insight Across the Enterprise from Digital Data
Learn how leading retailers are developing key findings in digital data to be leveraged across marketing, merchandising, and IT.
Rachel Thompson, SAS
Paper 3368-2015:
Determining the Key Success Factors for Hit Songs in the Billboard Music Charts
Analyzing the key success factors for hit songs in the Billboard music charts is an ongoing area of interest to the music industry. Although there have been many studies over the past decades on predicting whether a song has the potential to become a hit, the following research question remains: Can hit songs be predicted? And, if so, what are the characteristics of those hit songs? This study applies data mining techniques using SAS® Enterprise Miner™ to understand why some music is more popular than other music. In particular, certain songs are considered one-hit wonders, which appear in the Billboard music charts only once. Meanwhile, other songs are acknowledged as masterpieces. With 2,139 data records, the results demonstrate the practical validity of our approach.
Read the paper (PDF).
Piboon Banpotsakun, National Institute of Development Administration
Jongsawas Chongwatpol, NIDA Business School, National Institute of Development Administration
Paper 3021-2015:
Discovering Personality Type through Text Mining in SAS® Enterprise Miner™ 12.3
Data scientists and analytic practitioners have become obsessed with quantifying the unknown. Through text mining third-person posthumous narratives in SAS® Enterprise Miner™ 12.1, we measured tangible aspects of personalities based on the broadly accepted big-five characteristics: extraversion, agreeableness, conscientiousness, neuroticism, and openness. These measurable attributes are linked to common descriptive terms used throughout our data to establish statistical relationships. The data set contains over 1,000 obituaries from newspapers throughout the United States, covering individuals who vary in age, gender, demographics, and socio-economic circumstances. In our study, we leveraged existing literature to build the ontology used in the analysis. This literature suggests that a third person's perspective gives insight into one's personality, solidifying the use of obituaries as a source for analysis. We statistically linked target topics such as career, education, religion, art, and family to the five characteristics. With these taxonomies, we developed multivariate models in order to assign scores that predict an individual's personality type. With a trained model, this study has implications for predicting an individual's personality, allowing for better decisions on human capital deployment. Even outside the traditional application of personality assessment for organizational behavior, the methods used to extract intangible characteristics from text enable us to identify valuable information across multiple industries and disciplines.
Read the paper (PDF).
Mark Schneider, Deloitte & Touche
Andrew Van Der Werff, Deloitte & Touche, LLP
Paper 3347-2015:
Donor Sentiment and Characteristic Analysis Using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
It has always been a million-dollar question: What inhibits a donor from donating? Many successful universities have deep roots in annual giving. We know donor sentiment is a key factor in drawing attention and engaging donors. This paper is a summary of findings about donor behaviors using textual analysis combined with the power of predictive modeling. In addition to identifying the characteristics of donors, the paper focuses on identifying the characteristics of a first-time donor. It distinguishes the features of the first-time donor from the general donor pattern. It leverages the variations in the data to provide deeper insights into behavioral patterns. A data set containing 247,000 records was obtained from the XYZ University Foundation alumni database, Facebook, and Twitter. Solicitation content, such as email subject lines sent to the prospect base, was considered. Time-dependent and time-independent data were categorized to make unbiased predictions about the first-time donor. The predictive models use inputs such as age, educational records, scholarships, events, student memberships, and solicitation methods. Models such as decision trees, Dmine regression, and neural networks were built to predict the prospects. SAS® Sentiment Analysis Studio and SAS® Enterprise Miner™ were used to analyze the sentiment.
Read the paper (PDF).
Ramcharan Kakarla, Comcast
Goutam Chakraborty, Oklahoma State University
Paper SAS1865-2015:
Drilling for Deepwater Data: A Forensic Analysis of the Gulf of Mexico Deepwater Horizon Disaster
During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS® Asset Performance Analytics and SAS® Enterprise Miner™ to build a model for drilling and well-control anomalies, to fingerprint key well-control measures of the transient fluid properties, and to operationalize these analytics on the drilling assets with SAS® Event Stream Processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables rapid differentiation between safe and unsafe modes of operation.
Read the paper (PDF).
Jim Duarte, SAS
Keith Holdaway, SAS
Moray Laing, SAS
E
Paper 3190-2015:
Educating Future Business Leaders in the Era of Big Data
At NC State University, our motto is Think and Do. When it comes to educating students in the Poole College of Management, that means that we want them to not only learn to think critically but also to gain hands-on experience with the tools that will enable them to be successful in their careers. And, in the era of big data, we want to ensure that our students develop skills that will help them to think analytically in order to use data to drive business decisions. One method that lends itself well to thinking and doing is the case study approach. In this paper, we discuss the case study approach for teaching analytical skills and highlight the use of SAS® software for providing practical, hands-on experience with manipulating and analyzing data. The approach is illustrated with examples from specific case studies that have been used for teaching introductory and intermediate courses in business analytics.
Read the paper (PDF).
Tonya Balan, NC State University
G
Paper SAS2602-2015:
Getting Started with Enabling Your End-User Applications to Use SAS® Grid Manager 9.4
A SAS® Grid Manager environment provides your organization with a powerful and flexible way to manage many forms of SAS® computing workloads. For the business and IT user community, the benefits can range from data management jobs effectively utilizing the available processing resources, complex analyses being run in parallel, and reassurance that statutory reports are generated in a highly available environment. This workshop begins the process of familiarizing users with the core concepts of how to grid-enable tasks within SAS® Studio, SAS® Enterprise Guide®, SAS® Data Integration Studio, and SAS® Enterprise Miner™ client applications.
Edoardo Riva, SAS
Paper 3198-2015:
Gross Margin Percent Prediction: Using the Power of SAS® Enterprise Miner™ 12.3 to Predict the Gross Margin Percent for a Steel Manufacturing Company
Predicting the profitability of future sales orders in a price-sensitive, highly competitive make-to-order market can create a competitive advantage for an organization. Order size and specifications vary from order to order and customer to customer, and might or might not be repeated. While it is the intent of the sales groups to take orders for a profit, because of the volatility of steel prices and the competitive nature of the markets, gross margins can range dramatically from one order to the next and in some cases can be negative. Understanding the key factors affecting the gross margin percent and their impact can help the organization reduce the risk of non-profitable orders and at the same time improve its decision-making ability for market planning and forecasting. The objective of this paper is to identify the model, among multiple predictive models built inside SAS® Enterprise Miner™, that most accurately predicts the gross margin percent for future orders. The data used for the project consisted of over 30,000 transactional records and 33 input variables. The sales records were collected from multiple manufacturing plants of the steel manufacturing company. Variables such as order quantity, customer location, sales group, and others were used to build predictive models. The target variable, gross margin percent, is the net profit on sales, considering factors such as labor cost, cost of raw materials, and so on. The Model Comparison node of SAS Enterprise Miner was used to determine the best among different variations of regression models, decision trees, and neural networks, as well as ensemble models. Average squared error was used as the fit statistic to evaluate each model's performance. Based on the preliminary model analysis, the ensemble model outperforms the other models with the lowest average squared error.
Read the paper (PDF).
Kushal Kathed, Oklahoma State University
Patti Jordan
Ayush Priyadarshi, Oklahoma State University
H
Paper SAS1704-2015:
Helpful Hints for Transitioning to SAS® 9.4
A group tasked with testing SAS® software from the customer perspective has gathered a number of helpful hints for SAS® 9.4 that will smooth the transition to its new features and products. These hints will help with the 'huh?' moments that crop up when you are getting oriented and will provide short, straightforward answers. We also share insights about changes in your order contents. Gleaned from extensive multi-tier deployments, SAS® Customer Experience Testing shares insiders' practical tips to ensure that you are ready to begin your transition to SAS 9.4. The target audience for this paper is primarily system administrators who will be installing, configuring, or administering the SAS 9.4 environment. (This paper is an updated version of the paper presented at SAS Global Forum 2014 and includes new features and software changes since the original paper was delivered, plus any relevant content that still applies. This paper includes information specific to SAS 9.4 and SAS 9.4 maintenance releases.)
Read the paper (PDF).
Cindy Taylor, SAS
Paper 3431-2015:
How Latent Analyses within Survey Data Can Be Valuable Additions to Any Regression Model
This study looks at several ways to investigate latent variables in longitudinal surveys and their use in regression models. Three different analyses for latent variable discovery are briefly reviewed and explored. The procedures explored in this paper are PROC LCA, PROC LTA, PROC CATMOD, PROC FACTOR, PROC TRAJ, and PROC SURVEYLOGISTIC. The analyses defined through these procedures are latent profile analyses, latent class analyses, and latent transition analyses. The latent variables are included in three separate regression models. The effect of the latent variables on the fit and use of the regression model compared to a similar model using observed data is briefly reviewed. The data used for this study was obtained via the National Longitudinal Study of Adolescent Health, a study distributed and collected by Add Health. Data was analyzed using SAS® 9.3. This paper is intended for any level of SAS® user. This paper is also aimed at an audience with a background in behavioral science or statistics.
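As a simple illustration of the final step (carrying a discovered latent variable into a survey-aware regression), the sketch below feeds an assigned latent class into PROC SURVEYLOGISTIC; the data set, design variables, and class assignments are hypothetical and are not taken from the paper.
   /* Sketch: an assigned latent class used as a predictor in a
      survey-weighted logistic regression. Names are hypothetical. */
   proc surveylogistic data=work.addhealth;
      cluster psu;                         /* primary sampling unit */
      strata  stratum;                     /* sampling stratum      */
      weight  sampweight;                  /* survey weight         */
      class   latent_class / param=ref;
      model   outcome(event='1') = latent_class age;
   run;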
Read the paper (PDF).
Deanna Schreiber-Gregory, National University
Paper 3185-2015:
How to Hunt for Utility Customer Electric Usage Patterns Armed with SAS® Visual Statistics with Hadoop and Hive
Your electricity usage patterns reveal a lot about your family and routines. Information collected from electrical smart meters can be mined to identify patterns of behavior that can in turn be used to help change customer behavior for the purpose of altering system load profiles. Demand Response (DR) programs represent an effective way to cope with rising energy needs and increasing electricity costs. The Federal Energy Regulatory Commission (FERC) defines demand response as changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to lower electricity use at times of high wholesale market prices or when system reliability is jeopardized. In order to effectively motivate customers to voluntarily change their consumption patterns, it is important to identify customers whose load profiles are similar so that targeted incentives can be directed toward these customers. Hence, it is critical to use tools that can accurately cluster similar time series patterns while providing a means to profile these clusters. In order to solve this problem, though, hardware and software capable of storing, extracting, transforming, loading, and analyzing large amounts of data must first be in place. Utilities receive customer data from smart meters, which track and store customer energy usage. The data collected is sent to the energy companies every fifteen minutes or hourly. With millions of meters deployed, this quantity of information creates a data deluge for utilities, because each customer generates about three thousand data points monthly, and more than thirty-six billion reads are collected annually for a million customers. The data scientist is the hunter, and DR candidate patterns are the prey in this cat-and-mouse game of finding customers willing to curtail electrical usage for a program benefit. The data scientist must connect large siloed data sources, external data, and even unstructured data to detect common customer electrical usage patterns, build dependency models, and score them against the customer population. Taking advantage of Hadoop's ability to store and process data on commodity hardware with distributed parallel processing is a game changer. With Hadoop, no data set is too large, and SAS® Visual Statistics leverages machine learning, artificial intelligence, and clustering techniques to build descriptive and predictive models. All data from disparate systems can be used, including structured data, unstructured data, and log files. The data scientist can use Hadoop to ingest all available data at rest and analyze customer usage patterns, system electrical flow data, and external data such as weather. This paper uses Cloudera Hadoop with Apache Hive queries for analysis on platforms such as SAS® Visual Analytics and SAS Visual Statistics. The paper showcases options within Hadoop for querying large data sets with open-source tools and importing these data into SAS® for robust customer analytics: clustering customers by usage profiles, modeling propensity to respond to a demand response event, and analyzing the electrical system for Demand Response events.
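One way to pull Hive-resident smart-meter reads into SAS for the kind of clustering work described above is explicit SQL pass-through; the sketch below is hypothetical (server, credentials, and table/column names are assumptions) and is not the configuration used in the paper.
   /* Hypothetical sketch: pass-through to Hive, aggregating interval reads
      into daily kWh per meter before clustering in SAS. */
   proc sql;
      connect to hadoop (server="hive-node.example.com" port=10000
                         subprotocol=hive2);
      create table work.daily_profile as
      select * from connection to hadoop (
         select meter_id,
                to_date(read_ts) as read_date,
                sum(kwh)         as daily_kwh
         from   meter_reads
         group by meter_id, to_date(read_ts)
      );
      disconnect from hadoop;
   quit;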
Read the paper (PDF).
Kathy Ball, SAS
Paper 3446-2015:
How to Implement Two-Phase Regression Analysis to Predict Profitable Revenue Units
Is it a better business decision to determine the profitability of all business units/kiosks and then decide to prune the nonprofitable ones? Or does model performance improve if we first find the units that meet the break-even point and then try to calculate their profits? In our project, we used a two-stage regression process because of the highly skewed distribution of the variables. First, we performed logistic regression to predict which kiosks would be profitable. Then, we used linear regression to predict the average monthly revenue at each kiosk. We used SAS® Enterprise Guide® and SAS® Enterprise Miner™ for the modeling process. The linear regression model is much more effective at predicting the target variable for profitable kiosks than for unprofitable kiosks. The two-phase regression model seemed to perform better than simply performing a linear regression, particularly when the target variable has too many levels. In real-life situations, the dependent and independent variables can have highly skewed distributions, and two-phase regression can help improve model performance and accuracy. Some results: The logistic regression model has an overall accuracy of 82.9%, sensitivity of 92.6%, and specificity of 61.1%, with comparable figures for the training data set of 81.8%, 90.7%, and 63.8%, respectively. This indicates that the regression model seems to be consistently predicting the profitable kiosks at a reasonably good level. Linear regression model: For the training data set, the mean absolute percentage error (MAPE) of the predicted values (not log-transformed) versus the actual values of the target is 7.2% for the kiosks that earn more than $350, whereas the MAPE for kiosks that earn less than $350 is -102%. For the validation data set, the MAPE is 7.6% for the kiosks that earn more than $350, whereas the MAPE for kiosks that earn less than $350 is -142%. This means that average monthly revenue seems to be better predicted for kiosks earning more than the threshold value of $350--that is, for those kiosks with a flag variable of 1. The model seems to predict the target variable with lower APE for higher values of the target variable for both the training data set and the entire data set. In fact, if the threshold value for the kiosks is moved even to, say, $500, the predictive power of the model in terms of APE will substantially increase. The validation data set (Selection Indicator=0) has fewer data points, and, therefore, the contrast in APEs is higher and more varied.
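The two-phase flow itself can be sketched in a few SAS steps: a logistic model flags kiosks predicted to be profitable, and a linear model is then fit only on that subset; the data set and variable names below are hypothetical, not the authors' code.
   /* Sketch of the two-phase idea; names are hypothetical. */
   proc logistic data=work.kiosks;
      model profitable(event='1') = footfall rent competition;
      output out=work.scored p=p_profit;        /* stage 1: predicted probability */
   run;
   proc reg data=work.scored(where=(p_profit >= 0.5));
      model monthly_revenue = footfall rent competition;   /* stage 2: revenue model */
   run;
   quit;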
Read the paper (PDF).
Shrey Tandon, Sobeys West
Paper 3151-2015:
How to Use Internal and External Data to Realize the Potential for Changing the Game in Handset Campaigns
The telecommunications industry is the fastest-changing business ecosystem in this century. Therefore, handset campaigning to increase loyalty is the top issue for telco companies. However, these handset campaigns carry great fraud and payment risks if the companies do not have the ability to classify and assess customers properly according to their risk propensity. For many years, telco companies managed the risk with business rules such as customer tenure, until the launch of analytics solutions into the market. But few business rules can be applied when selling handsets to new customers. On the other hand, with increasing competitive pressure on telco companies, it is necessary to use external credit data to sell handsets to new customers. Credit bureau data was a good opportunity to measure and understand the behaviors of applicants. But using external data required system integration and real-time decision systems. For those reasons, we needed a solution that enables us to predict risky customers and then integrate risk scores and all other information into one real-time decision engine for optimized handset application vetting. After an assessment period, the SAS® analytics platform and SAS® Real-Time Decision Manager (RTDM) were chosen as the most suitable solution because they provide a flexible, user-friendly interface, high integration, and fast deployment capability. In this project, we built a process that includes three main stages to transform data into knowledge. These stages are data collection, predictive modelling, and deployment and decision optimization. a) Data Collection: We designed a specific, daily updated data mart that connects internal payment behavior, demographics, and customer experience data with external credit bureau data. In this way, we can turn data into meaningful knowledge for a better understanding of customer behavior. b) Predictive Modelling: To use the company's potential, it is critically important to use an analytics approach that is based on state-of-the-art technologies. We built nine models to predict customer propensity to pay. As a result of better classification of customers, we obtained satisfactory results in designing collection scenarios and the decision model for handset application vetting. c) Deployment and Decision Optimization: Knowledge is not enough to reach success in business. It should be turned into optimized decisions and deployed in real time. For this reason, we have been using SAS® predictive analytics tools and SAS Real-Time Decision Manager, primarily to turn data into knowledge and then knowledge into strategy and execution. With this system, we are now able to assess customers properly and to sell handsets even to our brand-new customers as part of the application vetting process. As a result, while decreasing nonpayment risk, we generated extra revenue from brand-new contracted customers. In three months, 13% of all handset sales were concluded via RTDM. Another benefit of the RTDM is a 30% cost saving in external data inquiries. Thanks to the RTDM, Avea has become the first telecom operator in the Turkish telco industry to use bureau data.
Read the paper (PDF).
Hurcan Coskun, Avea
I
Paper 3343-2015:
Improving SAS® Global Forum Papers
Just as research is built on existing research, the references section is an important part of a research paper. The purpose of this study is to find the differences between professionals and academicians with respect to the references section of a paper. Data is collected from SAS® Global Forum 2014 Proceedings. Two research hypotheses are supported by the data. First, the average number of references in papers by academicians is higher than those by professionals. Second, academicians follow standards for citing references more than professionals. Text mining is performed on the references to understand the actual content. This study suggests that authors of SAS Global Forum papers should include more references to increase the quality of the papers.
Read the paper (PDF).
Vijay Singh, Oklahoma State University
Pankush Kalgotra, Oklahoma State University
Paper SAS1965-2015:
Improving the Performance of Data Mining Models with Data Preparation Using SAS® Enterprise Miner™
In data mining modelling, data preparation is the most crucial, most difficult, and longest part of the mining process. A lot of steps are involved. Consider the simple distribution analysis of the variables, the diagnosis and reduction of the influence of variables' multicollinearity, the imputation of missing values, and the construction of categories in variables. In this presentation, we use data mining models in different areas like marketing, insurance, retail and credit risk. We show how to implement data preparation through SAS® Enterprise Miner™, using different approaches. We use simple code routines and complex processes involving statistical insights, cluster variables, transform variables, graphical analysis, decision trees, and more.
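As one small example of the kind of coded preparation step discussed here, median imputation of interval inputs can be done outside the Impute node with PROC STDIZE; the data set and variable names are hypothetical, and this is a generic sketch rather than the presentation's code.
   /* Sketch: replace only the missing values of interval inputs with
      their medians. Names are hypothetical. */
   proc stdize data=work.customers out=work.customers_imp
               method=median reponly;
      var income balance tenure;
   run;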
Read the paper (PDF).
Ricardo Galante, SAS
Paper 3356-2015:
Improving the Performance of Two-Stage Modeling Using the Association Node of SAS® Enterprise Miner™ 12.3
Over the years, very few published studies have discussed ways to improve the performance of two-stage predictive models. This study, based on 10 years (1999-2008) of data from 130 US hospitals and integrated delivery networks, is an attempt to demonstrate how we can leverage the Association node in SAS® Enterprise Miner™ to improve the classification accuracy of a two-stage model. We prepared the data with imputation operations and data cleaning procedures. Variable selection methods and domain knowledge were used to choose 43 key variables for the analysis. The prominent association rules revealed interesting relationships between prescribed medications and patient readmission/no-readmission. The rules with lift values greater than 1.6 were used to create dummy variables for use in the subsequent predictive modeling. Next, we used two-stage sequential modeling, where the first stage predicted whether the diabetic patient was readmitted and the second stage predicted whether the readmission happened within 30 days. The backward logistic regression model outperformed competing models for the first stage. After including the dummy variables from the association analysis, several fit indices improved: the validation average squared error (ASE) improved from 0.238 to 0.228, and cumulative lift improved from 1.40 to 1.56. Likewise, the performance of the second stage improved after including the dummy variables from the association analysis: the misclassification rate improved from 0.243 to 0.240, and the final prediction error improved from 0.18 to 0.17.
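The rule-to-dummy step and the stage-two filter described above reduce to simple data steps; the sketch below uses hypothetical medication and data set names and is not the authors' code.
   /* Sketch: turn a high-lift association rule into a dummy input, then
      restrict stage two to readmitted patients. Names are hypothetical. */
   data work.model_input;
      set work.patients;
      rule_insulin_metformin = (insulin = 'Yes' and metformin = 'Yes');
   run;
   data work.stage2;
      set work.model_input;
      if readmitted = 1;        /* stage 2: among readmitted, was it within 30 days? */
   run;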
Read the paper (PDF).
Girish Shirodkar, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Ankita Chaudhari, Oklahoma State University
K
Paper 3023-2015:
Killing Them with Kindness: Policies Not Based on Data Might Do More Harm Than Good
Educational administrators sometimes have to make decisions based on what they believe is in the best interest of their students because they do not have the data they need at the time. Some administrators do not even know that the data exist to help them make their decisions. However, well-intentioned policies that are not based on facts can sometimes do more harm than good for the students and the institution. This presentation discusses the results of the policy analyses conducted by the Office of Institutional Research at Western Kentucky University using Base SAS®, SAS/STAT®, SAS® Enterprise Miner™, and SAS® Visual Analytics. The researchers analyzed Western Kentucky University's math course placement procedure for incoming students and assessed the criteria used for admissions decisions, including those for first-time first-year students, transfer students, and students readmitted to the University after being dismissed for unsatisfactory academic progress--procedures and criteria previously designed with the students' best interests at heart. The presenters discuss the statistical analyses used to evaluate the policies and the use of SAS Visual Analytics to present their results to administrators in a visual manner. In addition, the presenters discuss subsequent changes in the policies, and where possible, the results of the policy changes.
Read the paper (PDF).
Tuesdi Helbig, Western Kentucky University
Matthew Foraker, Western Kentucky University
L
Paper SAS1955-2015:
Latest and Greatest: Best Practices for Migrating to SAS® 9.4
SAS® customers benefit greatly when they are using the functionality, performance, and stability available in the latest version of SAS. However, the task of moving all SAS collateral such as programs, data, catalogs, metadata (stored processes, maps, queries, reports, and so on), and content to SAS® 9.4 can seem daunting. This paper provides an overview of the steps required to move all SAS collateral from systems based on SAS® 9.2 and SAS® 9.3 to the current release of SAS® 9.4.
Read the paper (PDF).
Alec Fernandez, SAS
Paper 3342-2015:
Location-Based Association of Customer Sentiment and Retail Sales
There are various economic factors that affect retail sales. One important factor that is expected to correlate with sales is overall customer sentiment toward a brand. In this paper, we analyze how location-specific customer sentiment varies and correlates with sales at retail stores. In our attempt to find any dependency, we used location-specific Twitter feeds related to a national-brand chain retail store. We opinion-mine their overall sentiment using SAS® Sentiment Analysis Studio. We estimate the correlation between the opinion index and retail sales within the studied geographic areas. Later in the analysis, using ArcGIS Online from Esri, we estimate whether other location-specific variables that could potentially correlate with customer sentiment toward the brand are significant predictors of the brand's retail sales.
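The correlation estimate described above amounts to a call to PROC CORR on a location-level table; the data set and variable names below are hypothetical.
   /* Sketch: correlation between the location-level opinion index and sales. */
   proc corr data=work.by_location pearson spearman;
      var opinion_index retail_sales;
   run;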
Read the paper (PDF).
Asish Satpathy, University of California, Riverside
Goutam Chakraborty, Oklahoma State University
Tanvi Kode, Oklahoma State University
M
Paper 3375-2015:
Maximizing a Churn Campaign's Profitability with Cost-sensitive Predictive Analytics
Predictive analytics has been widely studied in recent years, and it has been applied to solve a wide range of real-world problems. Nevertheless, current state-of-the-art predictive analytics models are not well aligned with managers' requirements in that the models fail to include the real financial costs and benefits during the training and evaluation phases. Churn predictive modeling is one of those examples in which evaluating a model based on a traditional measure such as accuracy or predictive power does not yield the best results when measured by investment per subscriber in a loyalty campaign and the financial impact of failing to detect a real churner versus wrongly predicting a non-churner as a churner. In this paper, we propose a new financially based measure for evaluating the effectiveness of a voluntary churn campaign, taking into account the available portfolio of offers, their individual financial cost, and the probability of acceptance depending on the customer profile. Then, using a real-world churn data set, we compared different cost-insensitive and cost-sensitive predictive analytics models and measured their effectiveness based on their predictive power and cost optimization. The results show that using a cost-sensitive approach yields an increase in profitability of up to 32.5%.
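For intuition only, an example-dependent cost evaluation can be sketched as below; this is a generic formulation with hypothetical names and is not the authors' proposed measure, which additionally models the portfolio of offers and their acceptance probabilities.
   /* Generic sketch: each customer carries its own costs; total campaign
      cost is compared across models. Names are hypothetical. */
   data work.campaign_cost;
      set work.scored;                        /* y = actual churn, p = model score */
      predicted = (p >= 0.5);
      cost = predicted * offer_cost                  /* retention offer for contacted customers */
           + (1 - predicted) * y * customer_value;   /* lost value of missed churners           */
   run;
   proc means data=work.campaign_cost sum;
      var cost;                               /* total cost to compare across models */
   run;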
Alejandro Correa Bahnsen, University of Luxembourg
Darwin Amezquita, DIRECTV
Juan Camilo Arias, Smartics
Paper 2524-2015:
Methodology of Model Creation
The goal of this session is to describe the whole process of model creation, from the business request through model specification, data preparation, iterative model creation, model tuning, implementation, and model servicing. Each phase consists of several steps, for which we describe the main goal of the step, the expected outcome, the tools used, our own SAS® code, useful nodes and settings in SAS® Enterprise Miner™, procedures in SAS® Enterprise Guide®, measurement criteria, and expected duration in man-days. For three steps, we also present deep insights with examples of practical usage, explanations of the code used, settings, and ways of exploring and interpreting the output. During the actual model creation process, we suggest using Microsoft Excel to keep all input metadata along with information about transformations performed in SAS Enterprise Miner. To get faster information about model results, we use an automatic SAS code generator implemented in Excel; we then input this code into SAS Enterprise Guide and create a specific profile of results directly from the output tables of the SAS Enterprise Miner nodes. This paper also presents an example of checking the stability of a binary model over time in SAS Enterprise Guide by measuring the optimal cut-off percentage and lift. These measurements are visualized and automated using our own code. With this methodology, users have direct contact with the transformed data and can analyze and explore any intermediate results. Furthermore, the proposed approach can be used for several types of modeling (for example, binary and nominal predictive models or segmentation models). Generally, we have summarized our best practices for combining specific procedures performed in SAS Enterprise Guide, SAS Enterprise Miner, and Microsoft Excel to create and interpret models faster and more effectively.
Read the paper (PDF).
Peter Kertys, VÚB a.s.
Paper 1381-2015:
Model Risk and Corporate Governance of Models with SAS®
Banks can create a competitive advantage in their business by using business intelligence (BI) and by building models. In the credit domain, the best practice is to build risk-sensitive models (Probability of Default, Exposure at Default, Loss-given Default, Unexpected Loss, Concentration Risk, and so on) and implement them in decision-making, credit granting, and credit risk management. There are models and tools on the next level built on these models and that are used to help in achieving business targets, risk-sensitive pricing, capital planning, optimizing of ROE/RAROC, managing the credit portfolio, setting the level of provisions, and so on. It works remarkably well as long as the models work. However, over time, models deteriorate and their predictive power can drop dramatically. Since the global financial crisis in 2008, we have faced a tsunami of regulation and accelerated frequency of changes in the business environment, which cause models to deteriorate faster than ever before. As a result, heavy reliance on models in decision-making (some decisions are automated following the model's results--without human intervention) might result in a huge error that can have dramatic consequences for the bank's performance. In my presentation, I share our experience in reducing model risk and establishing corporate governance of models with the following SAS® tools: model monitoring, SAS® Model Manager, dashboards, and SAS® Visual Analytics.
Read the paper (PDF).
Boaz Galinson, Bank Leumi
Paper 3406-2015:
Modeling to Improve the Customer Unit Target Selection for Inspections of Commercial Losses in the Brazilian Electric Sector: The case of CEMIG
Electricity is an extremely important product for society. In Brazil, the electric sector is regulated by ANEEL (Agência Nacional de Energia Elétrica), and one of the regulated aspects is power loss in the distribution system. In 2013, 13.99% of all injected energy was lost in the Brazilian system. Commercial loss is one of the power loss classifications, which can be countered by inspections of the electrical installation in a search for irregularities in power meters. CEMIG (Companhia Energética de Minas Gerais) currently serves approximately 7.8 million customers, which makes it unfeasible (in financial and logistic terms) to inspect all customer units. Thus, the ability to select potential inspection targets is essential. In this paper, logistic regression models, decision tree models, and the Ensemble model were used to improve the target selection process in CEMIG. The results indicate an improvement in the positive predictive value from 35% to 50%.
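As an illustration of the target selection step, here is a minimal sketch with placeholder data set and variable names, not CEMIG's actual model: fit a logistic regression on historical inspection results, score all customer units, and keep the top-ranked units as inspection targets.

proc logistic data=inspections_hist descending;
   class tariff_class region / param=ref;            /* hypothetical inputs          */
   model irregular = avg_kwh_drop meter_age tariff_class region;
   score data=customer_units out=scored;             /* adds P_1, the event probability */
run;

proc sort data=scored;
   by descending p_1;
run;

data target_list;                                    /* keep the top 5% as targets   */
   set scored nobs=n;
   if _n_ <= ceil(0.05 * n);
run;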
Read the paper (PDF).
Sergio Henrique Ribeiro, Cemig
Iguatinan Monteiro, CEMIG
P
Paper 3326-2015:
Predicting Hospitalization of a Patient Using SAS® Enterprise Miner™
Inpatient treatment is the most common type of treatment ordered for patients who have a serious ailment and need immediate attention. Using a data set about diabetes patients downloaded from the UCI Network Data Repository, we built a model to predict the probability that the patient will be rehospitalized within 30 days of discharge. The data has about 100,000 rows and 51 columns. In our preliminary analysis, a neural network turned out to be the best model, followed closely by the decision tree model and regression model.
Nikhil Kapoor, Oklahoma State University
Ganesh Kumar Gangarajula, Oklahoma State University
Paper 3254-2015:
Predicting Readmission of Diabetic Patients Using the High-Performance Support Vector Machine Algorithm of SAS® Enterprise Miner™ 13.1
Diabetes is a chronic condition affecting people of all ages and is present in around 25.8 million people in the U.S. The objective of this research is to predict the probability of a diabetic patient being readmitted. The results from this research will help hospitals design a follow-up protocol to ensure that patients with a higher readmission probability are doing well, in order to promote a healthy doctor-patient relationship. The data was obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. The data set contains over 100,000 instances and 55 variables, such as insulin and length of stay. The data set was split into training and validation to provide an honest assessment of the models. Various variable selection techniques such as stepwise regression, forward regression, LARS, and LASSO were used. Using LARS, prominent factors were identified in determining the patient readmission rate. Numerous predictive models were built: Decision Tree, Logistic Regression, Gradient Boosting, MBR, SVM, and others. The model comparison algorithm in SAS® Enterprise Miner™ 13.1 showed that the High-Performance Support Vector Machine outperformed the other models, having the lowest misclassification rate of 0.363. The chosen model has a sensitivity of 49.7% and a specificity of 75.1% on the validation data.
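A rough sketch of the two modeling steps, with assumed data set and variable names (the authors worked in SAS Enterprise Miner nodes, not this code): LASSO-based screening of candidate inputs followed by a high-performance support vector machine.

proc glmselect data=train;                 /* variable screening on the 0/1 target   */
   model readmit_flag = time_in_hospital num_medications number_inpatient
                        number_emergency number_diagnoses
         / selection=lasso(choose=cv) cvmethod=random(5);
run;

proc hpsvm data=train;                     /* high-performance SVM on retained inputs */
   input time_in_hospital num_medications number_inpatient / level=interval;
   input admission_type discharge_disposition / level=nominal;
   target readmit_flag;
   kernel linear;
run;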
Read the paper (PDF).
Hephzibah Munnangi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 3501-2015:
Predicting Transformer Lifetime Using Survival Analysis and Modeling Risk Associated with Overloaded Transformers Using SAS® Enterprise Miner™ 12.1
Utility companies in America are always challenged when it comes to knowing when their infrastructure will fail. One of the most critical components of a utility company's infrastructure is the transformer. It is important to assess the remaining lifetime of transformers so that the company can reduce costs, plan expenditures in advance, and largely mitigate the risk of failure. It is equally important to identify high-risk transformers in advance and maintain them accordingly in order to avoid sudden loss of equipment due to overloading. This paper uses SAS® to predict the lifetime of transformers, identify the various factors that contribute to their failure, and classify transformers into High, Medium, and Low risk categories based on load for easier maintenance. The data set from a utility company contains around 18,000 observations and 26 variables from 2006 to 2013, including the failure and installation dates of the transformers. The data set also comprises many transformers that were installed before 2006 (there are 190,000 transformers on which several regression models are built in this paper to identify their risk of failure), but no age-related parameter is available for them. Survival analysis was performed on this left-truncated and right-censored data. The data set has variables such as Age, Average Temperature, Average Load, and Normal and Overloaded Conditions for residential and commercial transformers. Data creation involved merging 12 different tables. Nonparametric models for failure time data were built to explore the lifetime and failure rate of the transformers. By building a Cox regression model, the important factors contributing to the failure of a transformer are also analyzed in this paper. Several risk-based models are then built to categorize transformers into High, Medium, and Low risk categories based on their loads. This categorization can help utility companies better manage the risks associated with transformer failures.
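A minimal sketch of the survival step on left-truncated, right-censored data, using the counting-process syntax of PROC PHREG; the variable names (entry_age, exit_age, failed, and the covariates) are assumptions, not the paper's code.

proc phreg data=transformers;
   class load_class customer_type;        /* hypothetical classification inputs          */
   model (entry_age, exit_age)*failed(0) = avg_load avg_temp load_class customer_type;
   /* entry_age = age when observation began (2006 or installation),                     */
   /* exit_age  = age at failure or censoring; failed = 0 indicates a censored record    */
run;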
Read the paper (PDF).
Balamurugan Mohan, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
S
Paper SAS2520-2015:
SAS® Does Data Science: How to Succeed in a Data Science Competition
First introduced in 2013, the Cloudera Data Science Challenge is a rigorous competition in which candidates must provide a solution to a real-world big data problem that surpasses a benchmark specified by some of the world's elite data scientists. The Cloudera Data Science Challenge 2 (in 2014) involved detecting anomalies in the United States Medicare insurance system. Finding anomalous patients, procedures, providers, and regions in the competition's large, complex, and intertwined data sets required industrial-strength tools for data wrangling and machine learning. This paper shows how I did it with SAS®.
Read the paper (PDF). | Download the data file (ZIP).
Patrick Hall, SAS
Paper SAS4083-2015:
SAS® Workshop: Data Mining
This workshop provides hands-on experience using SAS® Enterprise Miner. Workshop participants will learn to: open a project, create and explore a data source, build and compare models, and produce and examine score code that can be used for deployment.
Read the paper (PDF).
Chip Wells, SAS
Paper SAS1856-2015:
SAS® and SAP Business Warehouse on SAP HANA--What's in the Handshake?
Is your company using or considering using SAP Business Warehouse (BW) powered by SAP HANA? SAS® provides various levels of integration with SAP BW in an SAP HANA environment. This integration enables you to not only access SAP BW components from SAS, but to also push portions of SAS analysis directly into SAP HANA, accelerating predictive modeling and data mining operations. This paper explains the SAS toolset for different integration scenarios, highlights the newest technologies contributing to integration, and walks you through examples of using SAS with SAP BW on SAP HANA. The paper is targeted at SAS and SAP developers and architects interested in building a productive analytical environment with the help of the latest SAS and SAP collaborative advancements.
Read the paper (PDF).
Tatyana Petrova, SAS
T
Paper 2920-2015:
Text Mining Kaiser Permanente Member Complaints with SAS® Enterprise Miner™
This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.
Read the paper (PDF).
Amanda Pasch, Kaiser Permanente
Paper 3361-2015:
The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS® Enterprise Miner™
Random Forest (RF) is a trademarked term for an ensemble approach to decision trees. RF was introduced by Leo Breiman in 2001. Because of our familiarity with decision trees--one of the intuitive, easily interpretable models that divides the feature space with recursive partitioning and uses sets of binary rules to classify the target--we also know some of their limitations, such as over-fitting and high variance. RF uses decision trees, but takes a different approach. Instead of growing one deep tree, it aggregates the output of many shallow trees to make a strong classifier. RF significantly improves the accuracy of classification by growing an ensemble of trees and letting them vote for the most popular class. Unlike a single decision tree, RF is robust against over-fitting and high variance, since it randomly selects a subset of variables at each split node. This paper demonstrates this simple yet powerful classification algorithm by building an income-level prediction system. Data extracted from the 1994 Census Bureau database was used for this study. The data set comprises information about 14 key attributes for 45,222 individuals. Using SAS® Enterprise Miner™ 13.1, models such as random forest, decision tree, probability decision tree, gradient boosting, and logistic regression were built to classify the income level (>50K or <=50K) of the population. The results showed that the random forest model was the best model for this data, based on the misclassification rate criterion. The RF model predicts the income-level group of the individuals with an accuracy of 85.1%, with the predictors capturing specific characteristic patterns. This demonstration using SAS® can lead to useful insights into RF for solving classification problems.
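A random forest with comparable settings can also be fit outside the Enterprise Miner GUI with the HPFOREST procedure. The following is a minimal sketch; the data set name, target, inputs, and tuning values are placeholders rather than the author's configuration.

proc hpforest data=census maxtrees=200 vars_to_try=4 leafsize=6 seed=42;
   target income_level / level=binary;                               /* >50K vs <=50K  */
   input age education_num capital_gain hours_per_week / level=interval;
   input workclass occupation marital_status relationship / level=nominal;
run;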
Read the paper (PDF).
Narmada Deve Panneerselvam, OSU
Paper 3820-2015:
Time to Harvest: Operationalizing SAS® Analytics on the SAP HANA Platform
A maximum harvest in farming analytics is achieved only if analytics can also be operationalized at the level of core business applications. Mapped to the use of SAS® Analytics, the fruits of SAS can be shared with Enterprise Business Applications by SAP. Learn how your SAS environment, including the latest SAS® In-Memory Analytics, can be integrated with SAP applications based on the SAP In-Memory Platform SAP HANA. We'll explore how a SAS® Predictive Modeling environment can be embedded inside SAP HANA and how native SAP HANA data management capabilities such as SAP HANA Views, Smart Data Access, and more can be leveraged by SAS applications and contribute to an end-to-end in-memory data management and analytics platform. Come and see how you can extend the reach of your SAS® Analytics efforts with the SAP HANA integration!
Read the paper (PDF).
Morgen Christoph, SAP SE
U
Paper SAS1910-2015:
Unconventional Data-Driven Methodologies Forecast Performance in Unconventional Oil and Gas Reservoirs
How does historical production data relate a story about subsurface oil and gas reservoirs? Business and domain experts must perform accurate analysis of reservoir behavior using only rate and pressure data as a function of time. This paper introduces innovative data-driven methodologies to forecast oil and gas production in unconventional reservoirs that, owing to the nature of the tightness of the rocks, render the empirical functions less effective and accurate. You learn how implementations of the SAS® MODEL procedure provide functional algorithms that generate data-driven type curves on historical production data. Reservoir engineers can now gain more insight to the future performance of the wells across their assets. SAS enables a more robust forecast of the hydrocarbons in both an ad hoc individual well interaction and in an automated batch mode across the entire portfolio of wells. Examples of the MODEL procedure arising in subsurface production data analysis are discussed, including the Duong data model and the stretched exponential data model. In addressing these examples, techniques for pattern recognition and for implementing TREE, CLUSTER, and DISTANCE procedures in SAS/STAT® are highlighted to explicate the importance of oil and gas well profiling to characterize the reservoir. The MODEL procedure analyzes models in which the relationships among the variables comprise a system of one or more nonlinear equations. Primary uses of the MODEL procedure are estimation, simulation, and forecasting of nonlinear simultaneous equation models, and generating type curves that fit the historical rate production data. You will walk through several advanced analytical methodologies that implement the SEMMA process to enable hypotheses testing as well as directed and undirected data mining techniques. SAS® Visual Analytics Explorer drives the exploratory data analysis to surface trends and relationships, and the data QC workflows ensure a robust input space for the performance forecasting methodologies that are visualized in a web-based thin client for interactive interpretation by reservoir engineers.
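As an illustration of the type-curve fitting described above, here is a minimal PROC MODEL sketch for a stretched exponential decline, q(t) = qi*exp(-(t/tau)**n). The data set well_history, the variables rate and month, and the starting values are assumptions, not the authors' settings.

proc model data=well_history;
   parms qi 1000 tau 24 n 0.5;            /* initial guesses for the decline parameters */
   rate = qi * exp( -(month/tau)**n );    /* stretched exponential rate equation        */
   fit rate;                              /* nonlinear least squares fit                */
run;
quit;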
Read the paper (PDF).
Keith Holdaway, SAS
Louis Fabbi, SAS
Dan Lozie, SAS
Paper 1640-2015:
Understanding Characteristics of Insider Threats by Using Feature Extraction
This paper explores feature extraction from unstructured text variables using Term Frequency-Inverse Document Frequency (TF-IDF) weighting algorithms coded in Base SAS®. Data sets with unstructured text variables can often hold a lot of potential to enable better predictive analysis and document clustering. Each of these unstructured text variables can be used as inputs to build an enriched data set-specific inverted index, and the most significant terms from this index can be used as single word queries to weight the importance of the term to each document from the corpus. This paper also explores the usage of hash objects to build the inverted indices from the unstructured text variables. We find that hash objects provide a considerable increase in algorithm efficiency, and our experiments show that a novel weighting algorithm proposed by Paik (2013) best enables meaningful feature extraction. Our TF-IDF implementations are tested against a publicly available data breach data set to understand patterns specific to insider threats to an organization.
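For orientation, a minimal Base SAS sketch of classic TF-IDF weighting with a hash lookup (not the authors' algorithm, which evaluates several weighting schemes including Paik 2013); it assumes a table work.term_counts with one row per (doc_id, term) and a raw term frequency tf.

proc sql noprint;
   create table doc_freq as                        /* document frequency per term */
   select term, count(distinct doc_id) as df
   from work.term_counts
   group by term;
   select count(distinct doc_id) into :ndocs trimmed from work.term_counts;
quit;

data tfidf;
   if 0 then set doc_freq;                         /* define df in the PDV        */
   if _n_ = 1 then do;                             /* load document frequencies   */
      declare hash h(dataset:'doc_freq');
      h.defineKey('term');
      h.defineData('df');
      h.defineDone();
   end;
   set work.term_counts;
   if h.find() = 0 then tfidf = tf * log(&ndocs / df);   /* tf * idf              */
run;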
Read the paper (PDF). | Watch the recording.
Ila Gokarn, Singapore Management University
Clifton Phua, SAS
Paper 3141-2015:
Unstructured Data Mining to Improve Customer Experience in Interactive Voice Response Systems
Interactive Voice Response (IVR) systems are likely one of the best and worst gifts to the world of communication, depending on who you ask. Businesses love IVR systems because they take out hundreds of millions of dollars of call center costs in automation of routine tasks, while consumers hate IVRs because they want to talk to an agent! It is a delicate balancing act to manage an IVR system that saves money for the business, yet is smart enough to minimize consumer abrasion by knowing who they are, why they are calling, and providing an easy automated solution or a quick route to an agent. There are many aspects to designing such IVR systems, including engineering, application development, omni-channel integration, user interface design, and data analytics. For larger call volume businesses, IVRs generate terabytes of data per year, with hundreds of millions of rows per day that track all system and customer-facing events. The data is stored in various formats and is often unstructured (lengthy character fields that store API return information or text fields containing consumer utterances). The focus of this talk is the development of a data mining framework based on SAS® that is used to parse and analyze IVR data in order to provide insights into usability of the application across various customer segments. Certain use cases are also provided.
Read the paper (PDF).
Dmitriy Khots, West Corp
Paper SAS1716-2015:
Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE's Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a category is very small at lower levels of the hierarchy. Blindly applying a standard categorization model is likely to perform poorly if this fact is not taken into consideration. This paper introduces a novel technique for text categorization, Boolean rule extraction, which enables you to effectively address this situation. In addition, models that are generated by a rule-based technique have good interpretability and can be easily modified by a human expert, enabling better human-machine interaction. The paper demonstrates how to use SAS® Text Miner macros and procedures to obtain effective predictive models at all hierarchy levels in a taxonomy.
Read the paper (PDF).
Zheng Zhao, SAS
Russ Albright, SAS
James Cox, SAS
Ning Jin, SAS
Paper SAS1925-2015:
Using SAS to Deliver Web Content Personalization Using Cloud-Based Clickstream Data and On-premises Customer Data
Real-time web content personalization has come into its teen years, but recently a spate of marketing solutions has enabled marketers to finely personalize web content for visitors based on browsing behavior, geo-location, preferences, and so on. In an age where the attention span of a web visitor is measured in seconds, marketers hope that tailoring the digital experience will pique each visitor's interest just long enough to increase corporate sales. The range of solutions spans the entire spectrum from completely cloud-based installations to completely on-premises installations. Marketers struggle to find the optimal solution that would meet their corporation's marketing objectives, provide them the highest agility and time-to-market, and still keep a low marketing budget. In the last decade or so, marketing strategies that involved personalizing using purely on-premises customer data quickly got replaced by ones that involved personalizing using only web-browsing behavior (a.k.a. clickstream data). This was possible because of a spate of cloud-based solutions that enabled marketers to de-couple themselves from the underlying IT infrastructure and the storage issues of capturing large volumes of data. However, this new trend meant that corporations weren't using much of their treasure trove of on-premises customer data. Of late, however, enterprises have been trying hard to find solutions that give them the best of both--the ease of gathering clickstream data using cloud-based applications and on-premises customer data--to perform analytics that lead to better web content personalization for a visitor. This paper explains a process that attempts to address this rapidly evolving need. The paper assumes that the enterprise already has tools for capturing clickstream data, developing analytical models, and presenting the content. It provides a roadmap to implementing a phased approach where enterprises continue to capture clickstream data, but bring that data in-house to be merged with customer data, enabling their analytics team to build sophisticated predictive models that can be deployed into the real-time web-personalization application. The final phase requires enterprises to keep improving their predictive models on a periodic basis.
Read the paper (PDF).
Mahesh Subramanian, SAS Institute Inc.
Suneel Grover, SAS
Paper 3503-2015:
Using SAS® Enterprise Guide®, SAS® Enterprise Miner™, and SAS® Marketing Automation to Make a Collection Campaign Smarter
Companies are increasingly relying on analytics as the right solution to their problems. In order to use analytics and create value for the business, companies first need to store, transform, and structure the data to make it available and functional. This paper shows a successful business case where the extraction and transformation of the data combined with analytical solutions were developed to automate and optimize the management of the collections cycle for a TELCO company (DIRECTV Colombia). SAS® Data Integration Studio is used to extract, process, and store information from a diverse set of sources. SAS Information Map is used to integrate and structure the created databases. SAS® Enterprise Guide® and SAS® Enterprise Miner™ are used to analyze the data, find patterns, create profiles of clients, and develop churn predictive models. SAS® Customer Intelligence Studio is the platform on which the collection campaigns are created, tested, and executed. SAS® Web Report Studio is used to create a set of operational and management reports.
Read the paper (PDF).
Darwin Amezquita, DIRECTV
Paulo Fuentes, Directv Colombia
Andres Felipe Gonzalez, Directv
Paper 3101-2015:
Using SAS® Enterprise Miner™ to Predict Breast Cancer at an Early Stage
Breast cancer is the leading cause of cancer-related deaths among women worldwide, and its early detection can reduce the mortality rate. Using a data set containing information about breast screening provided by the Royal Liverpool University Hospital, we constructed a model that can provide early indication of a patient's tendency to develop breast cancer. This data set has information about breast screening from patients who were believed to be at risk of developing breast cancer. The most important aspect of this work is that we excluded variables that are in one way or another associated with breast cancer, while keeping as input predictors the variables that are less likely to be associated with breast cancer or whose associations with breast cancer are unknown. The target variable is a binary variable with two values, 1 (indicating a type of cancer is present) and 0 (indicating a type of cancer is not present). SAS® Enterprise Miner™ 12.1 was used to perform data validation and data cleansing, to identify potentially related predictors, and to build models that can be used to predict at an early stage the likelihood of patients developing breast cancer. We compared two models: the first model was built with an interactive node and a cluster node, and the second was built without an interactive node and a cluster node. Classification performance was compared using a receiver operating characteristic (ROC) curve and average squared error. Interestingly, we found significantly improved model performance by using only variables that have a lesser or unknown association with breast cancer. The result shows that the logistic model with an interactive node and a cluster node has better performance, with a lower average squared error (0.059614) than the model without an interactive node and a cluster node. Among other benefits, this model will assist inexperienced oncologists in saving time in disease diagnosis.
Read the paper (PDF).
Gibson Ikoro, Queen Mary University of London
Beatriz de la Iglesia, University of East Anglia, Norwich, UK
Paper 3484-2015:
Using SAS® Enterprise Miner™ to Predict the Number of Rings on an Abalone Shell Using Its Physical Characteristics
Abalone is a common name given to sea snails or mollusks. These creatures are highly iridescent, with shells of strong changeable colors. This characteristic makes the shells attractive to humans as decorative objects and jewelry. The abalone structure is also being researched to build body armor. The value of a shell varies by its age and the colors it displays. Determining the number of rings on an abalone is a tedious and cumbersome task and is usually done by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. In this poster, I aim to predict the number of rings on an abalone by using its physical characteristics. The data, obtained from the UCI Machine Learning Repository, consists of 4,177 observations with 8 attributes. I considered the number of rings to be my target variable. The abalone's age can be reasonably approximated as the number of rings on its shell plus 1.5. Using SAS® Enterprise Miner™, I have built regression models and neural network models to identify the physical measurements responsible for determining the number of rings on the abalone. While I have obtained a coefficient of determination of 54.01%, my aim is to improve and expand the analysis using the power of SAS Enterprise Miner. Initial results indicate that the height, the shucked weight, and the viscera weight of the shell are the three most influential variables in predicting the number of rings on an abalone.
Ganesh Kumar Gangarajula, Oklahoma State University
Yogananda Domlur Seetharam
Paper 3212-2015:
Using SAS® to Combine Regression and Time Series Analysis on U.S. Financial Data to Predict the Economic Downturn
During the financial crisis of 2007-2009, the U.S. labor market lost 8.4 million jobs, causing the unemployment rate to increase from 5% to 9.5%. One indicator of economic recession is negative gross domestic product (GDP) growth for two consecutive quarters. This poster combines quantitative and qualitative techniques to predict the economic downturn by forecasting recession probabilities. Data was collected from the Board of Governors of the Federal Reserve System and the Federal Reserve Bank of St. Louis, containing 29 variables and quarterly observations from 1976-Q1 to 2013-Q3. Eleven variables were selected as inputs based on their effects on recession and to limit multicollinearity: long-term treasury yield (5-year and 10-year), mortgage rate, CPI inflation rate, prime rate, market volatility index, BBB-rated corporate bond yield, house price index, stock market index, commercial real estate price index, and one calculated variable, the yield spread (Treasury yield-curve spread). The target variable was a binary variable depicting the economic recession for each quarter (1=Recession). Data was prepared for modeling by applying imputation and transformation on variables. A two-step analysis was used to forecast the recession probabilities for the short-term period. Predicted recession probabilities were first obtained from the Backward Elimination Logistic Regression model, which was selected on the basis of misclassification (validation misclassification = 0.115). These probabilities were then forecasted using the Exponential Smoothing method, which was selected on the basis of mean absolute error (MAE = 11.04). Results show the recession periods, including the Great Recession of 2008, and the forecast for eight quarters (up to 2015-Q3).
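A compressed sketch of the two-step flow described above, with assumed data set and variable names rather than the authors' code: backward-elimination logistic regression to score recession probabilities, then simple exponential smoothing to forecast them eight quarters ahead.

proc logistic data=macro_qtrly descending;
   model recession = yield_spread mortgage_rate cpi_inflation prime_rate
                     volatility_idx house_price_idx stock_idx
         / selection=backward slstay=0.05;
   output out=scored p=p_recession;           /* predicted recession probability     */
run;

proc esm data=scored out=fcst lead=8;         /* forecast eight quarters ahead       */
   id date interval=qtr;                      /* quarterly date variable (assumed)   */
   forecast p_recession / model=simple;       /* simple exponential smoothing        */
run;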
Read the paper (PDF).
Avinash Kalwani, Oklahoma State University
Nishant Vyas, Oklahoma State University
Paper 3508-2015:
Using Text from Repair Tickets of a Truck Manufacturing Company to Predict Factors that Contribute to Truck Downtime
In this era of big data, the use of text analytics to discover insights is rapidly gaining popularity in businesses. On average, more than 80 percent of the data in enterprises may be unstructured. Text analytics can help discover key insights and extract useful topics and terms from the unstructured data. The objective of this paper is to build a model using textual data that predicts the factors that contribute to downtime of a truck. This research analyzes the data of over 200,000 repair tickets of a leading truck manufacturing company. After the terms were grouped into fifteen key topics using the Text Topic node of SAS® Text Miner, a regression model was built using these topics to predict truck downtime, the target variable. Data was split into training and validation for developing the predictive models. Knowledge of the factors contributing to downtime and their associations helped the organization streamline their repair process and improve customer satisfaction.
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
W
Paper 3600-2015:
When Two Are Better Than One: Fitting Two-Part Models Using SAS
In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling health care costs, some patients have zero costs, and the distribution of positive costs is often extremely right-skewed. When modeling charitable donations, many potential donors give nothing, and the majority of donations are relatively small, with a few very large donors. In the analysis of count data, there are also times when there are more zeros than would be expected using standard methodology, or cases where the zeros might differ substantially from the non-zeros, such as the number of cavities a patient has at a dental appointment or the number of children born to a mother. If the data have such structure and ordinary least squares methods are used, then predictions and estimates might be inaccurate. The two-part model gives us a flexible and useful modeling framework in many such situations. Methods for fitting these models with SAS® software are illustrated.
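A minimal sketch of one common two-part specification for semicontinuous cost data, under assumed variable names (cost, age, sex, chronic_count): a logistic model for whether any cost is incurred, and a gamma regression with a log link for the size of the positive costs. This is one reasonable choice, not necessarily the exact models illustrated in the paper.

data costs2;
   set costs;
   any_cost = (cost > 0);                  /* part 1 indicator                   */
run;

proc logistic data=costs2 descending;      /* part 1: P(cost > 0)                */
   class sex;
   model any_cost = age sex chronic_count;
run;

proc genmod data=costs2;                   /* part 2: positive costs only        */
   where cost > 0;
   class sex;
   model cost = age sex chronic_count / dist=gamma link=log;
run;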
Read the paper (PDF).
Laura Kapitula, Grand Valley State University
Y
Paper 3262-2015:
Yes, SAS® Can Do! Manage External Files with SAS Programming
Managing and organizing external files and directories play an important part in our data analysis and business analytics work. A good file management system can streamline project management and file organization and significantly improve work efficiency. Therefore, under many circumstances, it is necessary to automate and standardize file management processes through SAS® programming. Compared with managing SAS files via PROC DATASETS, managing external files is a much more challenging task that requires advanced programming skills. This paper presents and discusses various methods and approaches to managing external files with SAS programming. The illustrated methods and skills have important applications in a wide variety of analytic fields.
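One of the building blocks for this kind of automation is the set of external-file functions available in the DATA step. A minimal sketch (the directory path and file name are placeholders, not examples from the paper): list the members of a directory with DOPEN/DNUM/DREAD, and delete a single file with FDELETE.

data file_list;
   length fname $256;
   rc  = filename('dir', 'C:\project\archive');   /* assign a fileref to the directory */
   did = dopen('dir');                            /* open the directory                */
   do i = 1 to dnum(did);                         /* loop over its members             */
      fname = dread(did, i);
      output;
   end;
   rc = dclose(did);
   keep fname;
run;

data _null_;                                      /* delete one external file          */
   rc = filename('f', 'C:\project\archive\old_log.txt');
   if fdelete('f') = 0 then put 'NOTE: file deleted.';
run;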
Read the paper (PDF).
Justin Jia, Trans Union
Amanda Lin, CIBC