Data Scientist Papers A-Z

A
Paper 1603-2015:
A Model Family for Hierarchical Data with Combined Normal and Conjugate Random Effects
Non-Gaussian outcomes are often modeled using members of the so-called exponential family. Well-known members are the Bernoulli model for binary data, leading to logistic regression, and the Poisson model for count data, leading to Poisson regression. Two of the main reasons for extending this family are (1) the occurrence of overdispersion, meaning that the variability in the data is not adequately described by the models, which often exhibit a prescribed mean-variance link, and (2) the accommodation of hierarchical structure in the data, stemming from clustering in the data, which, in turn, might result from repeatedly measuring the outcome, for various members of the same family, and so on. The first issue is dealt with through a variety of overdispersion models, such as the beta-binomial model for grouped binary data and the negative-binomial model for counts. Clustering is often accommodated through the inclusion of random subject-specific effects. Such random effects are conventionally, though not always, assumed to be normally distributed. Although both of these phenomena might occur simultaneously, models combining them are uncommon. This paper proposes a broad class of generalized linear models accommodating overdispersion and clustering through two separate sets of random effects. We place particular emphasis on so-called conjugate random effects at the level of the mean for the first aspect and normal random effects embedded within the linear predictor for the second aspect, even though our family is more general. The binary, count, and time-to-event cases are given particular emphasis. Apart from model formulation, we present an overview of estimation methods and then settle for maximum likelihood estimation with analytic-numerical integration. Implications for the derivation of marginal correlation functions are discussed. The methodology is applied to data from a study of epileptic seizures, a clinical trial for a toenail infection named onychomycosis, and survival data in children with asthma.
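As an illustrative aside (not code from the paper), a minimal PROC NLMIXED sketch of a combined model for counts, with hypothetical data set and variable names (epilepsy, y, trt, id): the gamma (conjugate) overdispersion component is integrated out analytically into a negative binomial likelihood, while a normal random intercept remains in the linear predictor.

proc nlmixed data=epilepsy;                       /* hypothetical data set and variables */
   parms b0=1 b1=0 logalpha=0 s2u=1;
   eta   = b0 + b1*trt + u;                       /* normal random effect in the linear predictor */
   mu    = exp(eta);
   alpha = exp(logalpha);                         /* gamma (conjugate) overdispersion parameter */
   /* negative binomial log likelihood: Poisson mixed analytically over a gamma random effect */
   ll = lgamma(y + 1/alpha) - lgamma(1/alpha) - lgamma(y + 1)
        + y*log(alpha*mu) - (y + 1/alpha)*log(1 + alpha*mu);
   model y ~ general(ll);
   random u ~ normal(0, s2u) subject=id;
run;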
Read the paper (PDF). | Watch the recording.
Geert Molenberghs, Universiteit Hasselt & KU Leuven
Paper 2641-2015:
A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis
In big data, many variables are polytomous with many levels. The common way to handle a polytomous independent variable is to use a series of design variables, which correspond to the CLASS or BY specification for that variable in PROC LOGISTIC when the outcome is binary. If big data contain many polytomous independent variables with many levels, using design variables makes the analysis very complicated in both computation time and results, and it might provide little help in predicting the outcome. This paper presents a new, simple method for logistic regression with polytomous independent variables in big data analysis. In the proposed method, the first step is to conduct an iterative statistical analysis with a SAS® macro program. Similar to an algorithm for creating spline variables, this analysis searches for the proper aggregation groups, with statistically significant differences, among all levels of a polytomous independent variable. The SAS macro iteratively searches for new level groups that differ significantly from one another. The search starts from level 1, the level with the smallest outcome mean. We then conduct a statistical test comparing the level 1 group with the level 2 group, which has the second smallest outcome mean. If these two groups differ significantly, we proceed to test the level 2 group against the level 3 group. If level 1 and level 2 do not differ significantly, we combine them into a new level group 1 and test that group against level 3. The processing continues until all levels have been tested. We then replace the original level values of the polytomous variable with the new level values that show statistically significant differences. The polytomous variable with new levels can be described by the outcome means of those levels because of the one-to-one equivalence, through a piecewise function in the logit, between the variable's levels and the outcome means. It is straightforward to show that the conditional mean of an outcome y given the polytomous variable x is well approximated under maximum likelihood analysis. Compared with design variables, the new piecewise variable, which is based on the information from all levels yet enters the model as a single independent variable, captures the impact of all levels in a much simpler way. We have used this method in predictive models of customer attrition with polytomous variables such as state, business type, and customer claim type. All of these polytomous variables show significantly better prediction of customer attrition than either omitting them or using design variables in the model development.
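As a hedged sketch of the basic idea only (not the paper's macro, and without the significance-based level merging), each level of a hypothetical polytomous variable STATE is replaced by its outcome mean, which then enters PROC LOGISTIC as a single numeric predictor.

/* Step 1: outcome mean for each level of a hypothetical polytomous variable STATE */
proc sql;
   create table state_means as
   select state, mean(attrition) as state_mean
   from customers
   group by state;
quit;

/* Step 2: attach the level means and use the single numeric variable in the model */
proc sql;
   create table scored as
   select a.*, b.state_mean
   from customers a left join state_means b
   on a.state = b.state;
quit;

proc logistic data=scored;
   model attrition(event='1') = state_mean;
run;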
Read the paper (PDF).
Jian Gao, Constant Contact
Jesse Harriot, Constant Contact
Lisa Pimentel, Constant Contact
Paper 2141-2015:
A SAS® Macro to Compare Predictive Values of Diagnostic Tests
Medical tests are used for various purposes, including diagnosis, prognosis, risk assessment, and screening. Statistical methodology is often used to evaluate such tests; the most frequently used measures for binary data are sensitivity, specificity, and positive and negative predictive values. An important goal in diagnostic medicine research is to estimate and compare the accuracies of such tests. In this paper, I give a gentle introduction to measures of diagnostic test accuracy and introduce a SAS® macro to calculate the generalized score statistic and the weighted generalized score statistic for comparing predictive values, using formulas generalized and proposed by Andrzej S. Kosinski.
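For orientation only (this is the descriptive step, not the macro's comparison statistics), a sketch with a hypothetical data set named diagnostic, where test and disease are coded 1/0:

proc freq data=diagnostic;
   tables test*disease / nocol nopercent;   /* row percentages: PPV in the test=1 row, NPV in the test=0 row */
   tables disease*test / nocol nopercent;   /* row percentages: sensitivity and specificity */
run;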
Read the paper (PDF).
Lovedeep Gondara, University of Illinois Springfield
Paper 2980-2015:
A Set of SAS® Macros for Generating Survival Analysis Reports for Lifetime Data with or without Competing Risks
The paper introduces users to how they can use a set of SAS® macros, %LIFETEST and %LIFETESTEXPORT, to generate survival analysis reports for data with or without competing risks. The macros provide a wrapper of PROC LIFETEST and an enhanced version of the SAS autocall macro %CIF to give users an easy-to-use interface to report both survival estimates and cumulative incidence estimates in a unified way. The macros also provide a number of parameters to enable users to flexibly adjust how the final reports should look without the need to manually input or format the final reports.
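As context for what the macros wrap, a minimal PROC LIFETEST call with hypothetical names (transplant, months, status, group); the competing-risks side of the reports is handled through the %CIF-based portion of the macros.

proc lifetest data=transplant plots=survival(atrisk);
   time months*status(0);        /* 0 = censored */
   strata group;
run;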
Read the paper (PDF). | Download the data file (ZIP).
Zhen-Huan Hu, Medical College of Wisconsin
Paper 3311-2015:
Adaptive Randomization Using PROC MCMC
Based on work by Thall et al. (2012), we implement a method for randomizing patients in a Phase II trial. We accumulate evidence that identifies which dose(s) of a cancer treatment provide the most desirable profile, per a matrix of efficacy and toxicity combinations rated by expert oncologists (0-100). Experts also define the region of Good utility scores and criteria of dose inclusion based on toxicity and efficacy performance. Each patient is rated for efficacy and toxicity at a specified time point. Simulation work is done mainly using PROC MCMC in which priors and likelihood function for joint outcomes of efficacy and toxicity are defined to generate posteriors. Resulting joint probabilities for doses that meet the inclusion criteria are used to calculate the mean utility and probability of having Good utility scores. Adaptive randomization probabilities are proportional to the probabilities of having Good utility scores. A final decision of the optimal dose will be made at the end of the Phase II trial.
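As a hedged illustration of the Bayesian building block (not the authors' full simulation), a PROC MCMC sketch with a hypothetical one-row data set dose_counts holding the totals for the dose under evaluation (n_pat patients, n_eff efficacy responses, n_tox toxicities); posterior draws in the OUTPOST= data set can then be post-processed to compute utility and randomization probabilities.

proc mcmc data=dose_counts nmc=20000 nbi=2000 seed=2015 outpost=post;
   parms p_eff 0.3 p_tox 0.1;
   prior p_eff ~ beta(0.5, 0.5);            /* vague priors on efficacy and toxicity rates */
   prior p_tox ~ beta(0.5, 0.5);
   model n_eff ~ binomial(n_pat, p_eff);
   model n_tox ~ binomial(n_pat, p_tox);
run;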
Read the paper (PDF).
Qianyi Huang, McDougall Scientific Ltd.
John Amrhein, McDougall Scientific Ltd.
Paper SAS2603-2015:
Addressing AML Regulatory Pressures by Creating Customer Risk Rating Models with Ordinal Logistic Regression
With increasing regulatory emphasis on using more scientific statistical processes and procedures in the Bank Secrecy Act/Anti-Money Laundering (BSA/AML) compliance space, financial institutions are being pressured to replace their heuristic, rule-based customer risk rating models with well-established, academically supported, statistically based models. As part of their customer-enhanced due diligence, firms are expected to both rate and monitor every customer for the overall risk that the customer poses. Firms with ineffective customer risk rating models can face regulatory enforcement actions such as matters requiring attention (MRAs); the Office of the Comptroller of the Currency (OCC) can issue consent orders for federally chartered banks; and the Federal Deposit Insurance Corporation (FDIC) can take similar actions against state-chartered banks. Although there is a reasonable amount of information available that discusses the use of statistically based models and adherence to the OCC bulletin Supervisory Guidance on Model Risk Management (OCC 2011-12), there is only limited material about the specific statistical techniques that financial institutions can use to rate customer risk. This paper discusses some of these techniques; compares heuristic, rule-based models and statistically based models; and suggests ordinal logistic regression as an effective statistical modeling technique for assessing customer BSA/AML compliance risk. In discussing the ordinal logistic regression model, the paper addresses data quality and the selection of customer risk attributes, as well as the importance of following the OCC's key concepts for developing and managing an effective model risk management framework. Many statistical models can be used to assign customer risk, but logistic regression, and in this case ordinal logistic regression, is a fairly common and robust statistical method of assigning customers to ordered classifications (such as Low, Medium, High-Low, High-Medium, and High-High risk). Using ordinal logistic regression, a financial institution can create a customer risk rating model that is effective in assigning risk, justifiable to regulators, and relatively easy to update, validate, and maintain.
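As an illustrative sketch only (hypothetical customer-level data set and risk attributes, not the authors' model), an ordinal (cumulative logit) logistic regression in PROC LOGISTIC:

proc logistic data=customer_profile;
   class business_type geography_risk / param=ref;
   model risk_rating = business_type geography_risk wire_volume product_count
         / link=clogit;                     /* cumulative logit for ordered risk ratings */
run;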
Read the paper (PDF).
Edwin Rivera, SAS
Jim West, SAS
Paper SAS1919-2015:
Advanced Techniques for Fitting Mixed Models Using SAS/STAT® Software
Fitting mixed models to complicated data, such as data that include multiple sources of variation, can be a daunting task. SAS/STAT® software offers several procedures and approaches for fitting mixed models. This paper provides guidance on how to overcome obstacles that commonly occur when you fit mixed models using the MIXED and GLIMMIX procedures. Examples are used to showcase procedure options and programming techniques that can help you overcome difficult data and modeling situations.
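A minimal generalized linear mixed model sketch of the kind the paper troubleshoots, with hypothetical names (trial, response, trt, center); the Laplace method gives a true likelihood so the random-intercept variance can be tested.

proc glimmix data=trial method=laplace;
   class center trt;
   model response(event='1') = trt / dist=binary link=logit solution;
   random intercept / subject=center;
   covtest zerog;                           /* test whether the G-side variance is zero */
run;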
Read the paper (PDF).
Kathleen Kiernan, SAS
Paper 3492-2015:
Alien Nation: Text Analysis of UFO Sightings in the US Using SAS® Enterprise Miner™ 13.1
Are we alone in this universe? This is a question that undoubtedly passes through every mind several times during a lifetime. We often hear a lot of stories about close encounters, Unidentified Flying Object (UFO) sightings, and other mysterious things, but we lack the documented evidence for analysis on this topic. UFOs have been a matter of public interest for a long time. The objective of this paper is to analyze one database that has a collection of documented reports of UFO sightings to uncover any fascinating story related to the data. Using SAS® Enterprise Miner™ 13.1, the powerful capabilities of text analytics and topic mining are leveraged to summarize the associations between reported sightings. We used PROC GEOCODE to convert addresses of sightings to locations on the map. Then we used the GMAP procedure to produce a heat map to represent the frequency of the sightings in various locations. The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values). These geographic coordinates can then be used on a map to calculate distances or to perform spatial analysis. On preliminary analysis of the data associated with sightings, it was found that the most popular words associated with UFOs tell us about their shapes, formations, movements, and colors. The Text Profile node in SAS Enterprise Miner 13.1 was leveraged to build a model and cluster the data into different levels of a segment variable. We also explain how opinions about the UFO sightings change over time using text profiling. Further, this analysis uses the Text Profile node to find interesting terms or topics that were used to describe the UFO sightings. Based on the feedback received at the SAS® Analytics Conference, we plan to incorporate a technique to filter duplicate comments and to include weather data for each location.
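A hedged sketch of the mapping step described above, with hypothetical data set and variable names; PROC GEOCODE with METHOD=ZIP expects a ZIP variable in the input, and the GMAP data set's state variable must hold the numeric FIPS codes used by maps.us.

proc geocode method=zip data=sightings out=sightings_geo lookup=sashelp.zipcode;
run;

proc gmap data=state_counts map=maps.us;      /* hypothetical frequencies by state */
   id state;
   choro sighting_count / levels=5;
run;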
Read the paper (PDF). | Download the data file (ZIP).
Pradeep Reddy Kalakota, Federal Home Loan Bank of Des Moines
Naresh Abburi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Zabiulla Mohammed, Oklahoma State University
Paper 3412-2015:
Alternative Methods of Regression When Ordinary Least Squares Regression Is Not Right
Ordinary least squares regression is one of the most widely used statistical methods. However, it is a parametric model and relies on assumptions that are often not met. Alternative methods of regression for continuous dependent variables relax these assumptions in various ways. This paper explores procedures such as QUANTREG, ADAPTIVEREG, and TRANSREG for these kinds of data.
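Minimal calls for the three procedures named above, with hypothetical data set and variable names, shown only to indicate what each alternative looks like:

proc quantreg data=mydata;                 /* median (and other quantile) regression */
   model y = x1 x2 / quantile=0.5;
run;

proc adaptivereg data=mydata;              /* multivariate adaptive regression splines */
   model y = x1 x2;
run;

proc transreg data=mydata;                 /* Box-Cox transformation of the response */
   model boxcox(y) = identity(x1 x2);
run;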
Read the paper (PDF). | Watch the recording.
Peter Flom, Peter Flom Consulting
Paper 3371-2015:
An Application of the DEA Optimization Methodology to Make More Effective and Efficient Collection Calls
In our management and collection area, there was no methodology that provided the optimal number of collection calls needed to get the customer to make the minimum payment on his or her financial obligation. We wanted to determine the optimal number of calls using the data envelopment analysis (DEA) optimization methodology. Using this methodology, we obtained results that positively impacted the way our customers were contacted. We can maintain a healthy bank and customer relationship, keep management and collection at an operational level, and obtain a more effective and efficient portfolio recovery. The DEA optimization methodology has been successfully used in various fields of manufacturing production. It has solved multi-criteria optimization problems, but it has not been commonly used in the financial sector, especially in the collection area. This methodology requires specialized software, such as SAS® Enterprise Guide® and its robust optimization capabilities. In this presentation, we present PROC OPTMODEL and show how to formulate the optimization problem, write the program, and process the available data.
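As a hedged, toy-sized illustration of an input-oriented DEA envelopment problem in PROC OPTMODEL (fabricated numbers and unit names, not the bank's formulation):

proc optmodel;
   set UNITS = 1..5;                            /* hypothetical collection strategies */
   num calls{UNITS}    = [80 100 120 90 110];   /* input: collection calls made       */
   num payments{UNITS} = [30  50  60 40  50];   /* output: minimum payments obtained  */
   num k = 1;                                   /* unit under evaluation              */

   var theta >= 0;
   var lambda{UNITS} >= 0;

   min Efficiency = theta;                      /* input-oriented envelopment model   */
   con InputSide : sum{j in UNITS} lambda[j]*calls[j]    <= theta*calls[k];
   con OutputSide: sum{j in UNITS} lambda[j]*payments[j] >= payments[k];

   solve;
   print theta lambda;
quit;

In practice the model is re-solved with k looping over every unit, and units with theta equal to 1 form the efficient frontier.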
Read the paper (PDF).
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Ana Nieto, Banco Colpatria of Scotiabank Group
Paper 3300-2015:
An Empirical Comparison of Multiple Imputation Approaches for Treating Missing Data in Observational Studies
Missing data are a common and significant problem that researchers and data analysts encounter in applied research. Because most statistical procedures require complete data, missing data can substantially affect the analysis and the interpretation of results if left untreated. Methods to treat missing data have been developed so that missing values are imputed and analyses can be conducted using standard statistical procedures. Among these missing data methods, multiple imputation has received considerable attention and its effectiveness has been explored (for example, in the context of survey and longitudinal research). This paper compares four multiple imputation approaches for treating missing continuous covariate data under MCAR, MAR, and NMAR assumptions, in the context of propensity score analysis and observational studies. The comparison of the four MI approaches in terms of bias in parameter estimates, Type I error rates, and statistical power is presented. In addition, complete case analysis (listwise deletion) is presented as the default analysis that would be conducted if missing data are not treated. Issues are discussed, and conclusions and recommendations are provided.
Read the paper (PDF).
Patricia Rodriguez de Gil, University of South Florida
Shetay Ashford, University of South Florida
Chunhua Cao, University of South Florida
Eun-Sook Kim, University of South Florida
Rheta Lanehart, University of South Florida
Reginald Lee, University of South Florida
Jessica Montgomery, University of South Florida
Yan Wang, University of South Florida
Paper 3497-2015:
Analytics to Inform Name Your Own Price Reserve Setting
Start with an e-commerce site selling many thousands of live events, with inventory from thousands of ticket suppliers who can and do change prices constantly; add all the historical data on prices for this and similar events; then layer in customer bidding behavior, and you have a big data opportunity on your hands. I will talk about the evolution of pricing at ScoreBig in this framework and the models we've developed to set our reserve pricing. These models and the underlying data are also used by our inventory partners to continue to refine their pricing. I will also highlight how having a name-your-own-price framework helps with the development of pricing models.
Read the paper (PDF).
Alison Burnham, ScoreBig Inc
Paper 3369-2015:
Analyzing Customer Answers in Calls from Collections Using SAS® Text Miner to Respond in an Efficient and Effective Way
At the Multibanca Colpatria of Scotiabank, we offer a broad range of financial services and products in Colombia. In collection management, we currently manage more than 400,000 customers each month. In the call center, agents collect answers from each contact with the customer, and this information is saved in databases. However, this information has not been explored to know more about our customers and our own operation. The objective of this paper is to develop a classification model using the words in the answers from each customer from the call about receiving payment. Using a combination of text mining and cluster methodologies, we identify the possible conversations that can occur in each stage of delinquency. This knowledge makes developing specialized scripts for collection management possible.
Read the paper (PDF).
Oscar Ayala, Colpatria
Jenny Lancheros, Banco Colpatria of Scotiabank Group
Paper 3293-2015:
Analyzing Direct Marketing Campaign Performance Using Weight of Evidence Coding and Information Value through SAS® Enterprise Miner™ Incremental Response
Data mining and predictive models are extensively used to find the optimal customer targets in order to maximize the return on investment. Direct marketing techniques target all the customers who are likely to buy, regardless of the customer classification. In a real sense, this mechanism cannot identify the customers who are going to buy even without a marketing contact, thereby resulting in a loss on investment. This paper focuses on the Incremental Lift modeling approach using Weight of Evidence Coding and Information Value, followed by Incremental Response and Outcome Model Diagnostics. This model identifies the additional purchases that would not have taken place without a marketing campaign. Modeling work was conducted using a combined model. The research work is carried out on Travel Center data. The data show an increase of 2.8% in the average response rate and of 244 fuel gallons when compared with the results from the traditional campaign, which targeted everyone. This paper discusses in detail the implementation of the 'Incremental Response' node to direct the marketing campaigns and its Incremental Revenue and Profit analysis.
Read the paper (PDF). | Download the data file (ZIP).
Sravan Vadigepalli, Best Buy
Paper 3472-2015:
Analyzing Marine Piracy from Structured and Unstructured Data Using SAS® Text Miner
Approximately 80% of world trade at present uses the seaways, with around 110,000 merchant vessels and 1.25 million seafarers transporting almost 6 billion tons of goods every year. Marine piracy stands as a serious challenge to sea trade. Understanding how pirate attacks occur is crucial in effectively countering marine piracy. Predictive modeling using the combination of textual data with numeric data provides an effective methodology to derive insights from both structured and unstructured data. A total of 2,266 text descriptions of pirate incidents that occurred over the past seven years, from 2008 to the second quarter of 2014, were collected from the International Maritime Bureau (IMB) website. Analysis of the textual data using SAS® Enterprise Miner™ 12.3, with the help of concept links, answered questions on certain aspects of pirate activities, such as the following: 1. What are the arms used by pirates for attacks? 2. How do pirates steal the ships? 3. How do pirates escape after the attacks? 4. What are the reasons for occasional unsuccessful attacks? Topics are extracted from the text descriptions using a text topic node, and the varying trends of these topics are analyzed with respect to time. Using the cluster node, attack descriptions are classified into different categories based on attack style and pirate behavior described by a set of terms. A target variable called Attack Type is derived from the clusters and is combined with other structured input variables such as Ship Type, Status, Region, Part of Day, and Part of Year. A predictive model is built with Attack Type as the target variable and the other structured data variables as input predictors. The predictive model is used to predict the possible type of attack given the details of the ship and its travel. Thus, the results of this paper could be very helpful for the shipping industry to become more aware of possible attack types for different vessel types when traversing different routes, and to devise counter-strategies to reduce the effects of piracy on crews, vessels, and cargo.
Read the paper (PDF).
Raghavender Reddy Byreddy, Oklahoma State University
Nitish Byri, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Tejeshwar Gurram, Oklahoma State University
Anvesh Reddy Minukuri, Oklahoma State University
Paper SAS2082-2015:
Analyzing Messy and Wide Data on a Desktop
Data comes from a rich variety of sources in a rich variety of types, shapes, sizes, and properties. The analysis can be challenged by data that is too tall or too wide; too full of miscodings, outliers, or holes; or that contains funny data types. Wide data, in particular, has many challenges, requiring the analysis to adapt with different methods. Making covariance matrices with 2.5 billion elements is just not practical. JMP® 12 will address these challenges.
Read the paper (PDF).
John Sall, SAS
Paper SAS1332-2015:
Analyzing Spatial Point Patterns Using the New SPP Procedure
In many spatial analysis applications (including crime analysis, epidemiology, ecology, and forestry), spatial point process modeling can help you study the interaction between different events and help you model the process intensity (the rate of event occurrence per unit area). For example, crime analysts might want to estimate where crimes are likely to occur in a city and whether they are associated with locations of public features such as bars and bus stops. Forestry researchers might want to estimate where trees grow best and test for association with covariates such as elevation and gradient. This paper describes the SPP procedure, new in SAS/STAT® 13.2, for exploring and modeling spatial point pattern data. It describes methods that PROC SPP implements for exploratory analysis of spatial point patterns and for log-linear intensity modeling that uses covariates. It also shows you how to use specialized functions for studying interactions between points and how to use specialized analytical graphics to diagnose log-linear models of spatial intensity. Crime analysis, forestry, and ecology examples demonstrate key features of PROC SPP.
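A heavily hedged sketch of a PROC SPP call, patterned loosely on the documentation-style usage and using hypothetical data and covariate names; statement details should be checked against the PROC SPP documentation.

proc spp data=crimes;
   process incidents = (x, y / area=(0, 0, 10, 10));   /* events on a 10 x 10 study window */
   trend dist_bar = field(x, y, dist_to_bar);          /* covariate observed on a grid     */
   model incidents = dist_bar;                         /* log-linear intensity model       */
run;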
Read the paper (PDF).
Pradeep Mohan, SAS
Randy Tobias, SAS
Paper 3330-2015:
Analyzing and Visualizing the Sentiment of the Ebola Outbreak via Tweets
The Ebola virus outbreak is producing some of the most significant and fastest trending news throughout the globe today. There is a lot of buzz surrounding the deadly disease and the drastic consequences that it potentially poses to mankind. Social media provides the basic platforms for millions of people to discuss the issue and allows them to openly voice their opinions. There has been a significant increase in the magnitude of responses all over the world since the death of an Ebola patient in a Dallas, Texas hospital. In this paper, we aim to analyze the overall sentiment that is prevailing in the world of social media. For this, we extracted the live streaming data from Twitter at two different times using the Python scripting language. One instance relates to the period before the death of the patient, and the other relates to the period after the death. We used SAS® Text Miner nodes to parse, filter, and analyze the data and to get a feel for the patterns that exist in the tweets. We then used SAS® Sentiment Analysis Studio to further analyze and predict the sentiment of the Ebola outbreak in the United States. In our results, we found that the issue was not taken very seriously until the death of the Ebola patient in Dallas. After the death, we found that prominent personalities across the globe were talking about the disease and then raised funds to fight it. We are continuing to collect tweets. We analyze the locations of the tweets to produce a heat map that corresponds to the intensity of the varying sentiment across locations.
Read the paper (PDF).
Dheeraj Jami, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Shivkanth Lanka, Oklahoma State University
Paper 3050-2015:
Automation of Statistics Summary for Census Data in SAS®
Census data, such as education and income, has been extensively used for various purposes. The data is usually collected in percentages of census unit levels, based on the population sample. Such presentation of the data makes it hard to interpret and compare. A more convenient way of presenting the data is to use the geocoded percentage to produce counts for a pseudo-population. We developed a very flexible SAS® macro to automatically generate the descriptive summary tables for the census data as well as to conduct statistical tests to compare the different levels of the variable by groups. The SAS macro is not only useful for census data but can be used to generate summary tables for any data with percentages in multiple categories.
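One hedged way to realize the pseudo-population idea outside the macro (hypothetical tract-level variables, with pct_college measured 0-100): convert each percentage to counts and weight the frequency tables by those counts.

data pseudo_long;
   set census_tracts;
   college = 'Yes'; count = round(pct_college/100 * population); output;
   college = 'No '; count = population - count;                  output;
run;

proc freq data=pseudo_long;
   tables group*college / chisq;
   weight count;            /* pseudo-population counts derived from the percentages */
run;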
Read the paper (PDF).
Janet Lee, Kaiser Permanente Southern California
C
Paper 2080-2015:
Calculate Decision Consistency Statistics for a Single Administration of a Test
Many certification programs classify candidates into performance levels. For example, the SAS® Certified Base Programmer breaks down candidates into two performance levels: Pass and Fail. It is important to note that because all test scores contain measurement error, the performance level categorizations based on those test scores are also subject to measurement error. An important part of psychometric analysis is to estimate the decision consistency of the classifications. This study helps fill a gap in estimating decision consistency statistics for a single administration of a test using SAS.
Read the paper (PDF).
Fan Yang, The University of Iowa
Yi Song, University of Illinois at Chicago
Paper 3148-2015:
Catering to Your Tastes: Using PROC OPTEX to Design Custom Experiments, with Applications in Food Science and Field Trials
The success of an experimental study almost always hinges on how you design it. Does it provide estimates for everything you're interested in? Does it take all the experimental constraints into account? Does it make efficient use of limited resources? The OPTEX procedure in SAS/QC® software enables you to focus on specifying your interests and constraints, and it takes responsibility for handling them efficiently. With PROC OPTEX, you skip the step of rifling through tables of standard designs to try to find the one that's right for you. You concentrate on the science and the analytics and let SAS® do the computing. This paper reviews the features of PROC OPTEX and shows them in action using examples from field trials and food science experimentation. PROC OPTEX is a useful tool for all these situations, doing the designing and freeing the scientist to think about the food and the biology.
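A small, hedged PROC OPTEX sketch with invented factors and levels: a full-factorial candidate set is built in a DATA step and a 12-run D-optimal design is selected from it.

data candidates;                               /* full factorial candidate set */
   do temp = 160, 180, 200;
      do time = 10, 20, 30;
         do flour = 'A', 'B';
            output;
         end;
      end;
   end;
run;

proc optex data=candidates seed=2015;
   class flour;
   model temp time flour temp*time;
   generate n=12 method=m_fedorov;             /* 12-run D-optimal design */
   output out=design;
run;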
Read the paper (PDF). | Download the data file (ZIP).
Cliff Pereira, Dept of Statistics, Oregon State University
Randy Tobias, SAS
Paper 2380-2015:
Chi-Square and T Tests Using SAS®: Performance and Interpretation
Data analysis begins with cleaning up data, calculating descriptive statistics, and examining variable distributions. Before more rigorous statistical analysis begins, many statisticians perform basic inferential statistical tests such as chi-square and t tests to assess unadjusted associations. These tests help guide the direction of the more rigorous statistical analysis. How to perform chi-square and t tests is presented. We explain how to interpret the output and where to look for the association or difference based on the hypothesis being tested. We propose the next steps for further analysis using example data.
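The two basic tests described above, sketched with hypothetical names (study, group with two levels, a categorical outcome, and a continuous variable sbp):

proc freq data=study;
   tables group*outcome / chisq expected;    /* chi-square test of association */
run;

proc ttest data=study;
   class group;                              /* two-sample t test, with equal-variance and unequal-variance results */
   var sbp;
run;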
Read the paper (PDF).
Maribeth Johnson, Georgia Regents University
Jennifer Waller, Georgia Regents University
Paper 3291-2015:
Coding Your Own MCMC Algorithm
In Bayesian statistics, Markov chain Monte Carlo (MCMC) algorithms are an essential tool for sampling from probability distributions. PROC MCMC is useful for these algorithms. However, it is often desirable to code an algorithm from scratch. This is especially true in academia, where students are expected to be able to understand and code an MCMC. The ability of SAS® to accomplish this is relatively unknown yet quite straightforward. We use SAS/IML® to demonstrate methods for coding an MCMC algorithm, with examples of a Gibbs sampler and a Metropolis-Hastings random walk.
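As an illustrative sketch in the same spirit (not the authors' code), a random-walk Metropolis sampler in SAS/IML for the mean of a unit-variance normal likelihood with a N(0, 100) prior, using fabricated data:

proc iml;
call randseed(2015);
y = {1.2, 0.8, 1.9, 1.4, 0.6};                      /* hypothetical data */

start logpost(theta) global(y);
   return( -0.5*ssq(y - theta) - 0.5*(theta/10)##2 );   /* log likelihood + log prior (up to a constant) */
finish;

nsim  = 5000;
chain = j(nsim, 1, .);
theta = 0;
curr  = logpost(theta);
z = .;  u = .;
do i = 1 to nsim;
   call randgen(z, "Normal", 0, 0.5);               /* random-walk proposal */
   prop = theta + z;
   cand = logpost(prop);
   call randgen(u, "Uniform");
   if log(u) < cand - curr then do;                 /* Metropolis acceptance step */
      theta = prop;
      curr  = cand;
   end;
   chain[i] = theta;
end;

print (mean(chain))[label="Posterior mean"] (std(chain))[label="Posterior SD"];
quit;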
Read the paper (PDF).
Chelsea Lofland, University of California Santa Cruz
Paper 1424-2015:
Competing Risk Survival Analysis Using SAS®: When, Why, and How
Competing risks arise in time-to-event data when the event of interest cannot be observed because a competing event occurs first. For example, if the event of interest is a specific cause of death, then death from any other cause is a competing event; if the focus is on relapse, then death before relapse constitutes a competing event. It is well established that, in the presence of competing risks, the standard product-limit methods yield biased results because their basic assumption is violated. The effect of competing events on parameter estimation depends on their distribution and frequency. Fine and Gray's subdistribution hazard model can be used in the presence of competing events and is available in PROC PHREG with the release of version 9.4 of SAS® software.
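A minimal Fine-Gray sketch with hypothetical names (ftime, status coded 0 = censored, 1 = event of interest, 2 = competing event); the EVENTCODE= option requests the subdistribution hazard model:

proc phreg data=bmt;
   class disease_group / param=ref;
   model ftime*status(0) = disease_group / eventcode=1;
run;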
Read the paper (PDF).
Lovedeep Gondara, University of Illinois Springfield
Paper 3249-2015:
Cutpoint Determination Methods in Survival Analysis Using SAS®: Updated %FINDCUT Macro
Statistical analyses that use data from clinical or epidemiological studies include continuous variables such as patient age, blood pressure, and various biomarkers. Over the years, there has been an increase in studies that focus on assessing associations between biomarkers and the disease of interest. Many of the biomarkers are measured as continuous variables. Investigators seek to identify the possible cutpoint to classify patients as high risk versus low risk based on the value of the biomarker. Several data-oriented techniques, such as median and upper quartile, and outcome-oriented techniques based on score, Wald, and likelihood ratio tests are commonly used in the literature. Contal and O'Quigley (1999) presented a technique that used the log-rank test statistic in order to estimate the cutpoint. Their method was computationally intensive and hence was overlooked due to the unavailability of built-in options in standard statistical software. In 2003, we provided the %FINDCUT macro that used Contal and O'Quigley's approach to identify a cutpoint when the outcome of interest was measured as time to event. Over the past decade, demand for this macro has continued to grow, which has led us to consider updating the %FINDCUT macro to incorporate new tools and procedures from SAS® such as array processing, Graph Template Language, and the REPORT procedure. New and updated features include: results presented in a much cleaner report format, user-specified cutpoints, macro parameter error checking, temporary data set cleanup, preserving current option settings, and increased processing speed. We present the utility and added options of the revised %FINDCUT macro using a real-life data set. In addition, we critically compare this method to some existing methods and discuss the use and misuse of categorizing a continuous covariate.
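For intuition only (the macro automates the search over all candidate cutpoints with the appropriate adjustment), a sketch of the log-rank statistic at a single hypothetical cutpoint:

data one_cut;
   set cohort;                                /* hypothetical biomarker and survival data */
   high_marker = (biomarker >= 4.2);          /* one candidate cutpoint                   */
run;

ods output HomTests=logrank_result;           /* capture the log-rank chi-square          */
proc lifetest data=one_cut;
   time os_months*death(0);
   strata high_marker;
run;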
Read the paper (PDF).
Jay Mandrekar, Mayo Clinic
Jeffrey Meyers, Mayo Clinic
D
Paper SAS4780-2015:
Deriving Insight Across the Enterprise from Digital Data
Learn how leading retailers are developing key findings in digital data to be leveraged across marketing, merchandising, and IT.
Rachel Thompson, SAS
Paper 3021-2015:
Discovering Personality Type through Text Mining in SAS® Enterprise Miner™ 12.3
Data scientists and analytic practitioners have become obsessed with quantifying the unknown. Through text mining third-person posthumous narratives in SAS® Enterprise Miner™ 12.1, we measured tangible aspects of personalities based on the broadly accepted big-five characteristics: extraversion, agreeableness, conscientiousness, neuroticism, and openness. These measurable attributes are linked to common descriptive terms used throughout our data to establish statistical relationships. The data set contains over 1,000 obituaries from newspapers throughout the United States, with individuals who vary in age, gender, demographic, and socio-economic circumstances. In our study, we leveraged existing literature to build the ontology used in the analysis. This literature suggests that a third person's perspective gives insight into one's personality, solidifying the use of obituaries as a source for analysis. We statistically linked target topics such as career, education, religion, art, and family to the five characteristics. With these taxonomies, we developed multivariate models in order to assign scores to predict an individual's personality type. With a trained model, this study has implications for predicting an individual's personality, allowing for better decisions on human capital deployment. Even outside the traditional application of personality assessment for organizational behavior, the methods used to extract intangible characteristics from text enable us to identify valuable information across multiple industries and disciplines.
Read the paper (PDF).
Mark Schneider, Deloitte & Touche
Andrew Van Der Werff, Deloitte & Touche, LLP
Paper 2442-2015:
Don't Copy and Paste--Use BY Statement Processing with ODS to Make Your Summary Tables
Most manuscripts in medical journals contain summary tables that combine simple summaries and between-group comparisons. These tables typically combine estimates for categorical and continuous variables. The statistician generally summarizes the data using the FREQ procedure for categorical variables and compares percentages between groups using a chi-square or a Fisher's exact test. For continuous variables, the MEANS procedure is used to summarize data as either means and standard deviation or medians and quartiles. Then these statistics are generally compared between groups by using the GLM procedure or NPAR1WAY procedure, depending on whether one is interested in a parametric test or a non-parametric test. The outputs from these different procedures are then combined and presented in a concise format ready for publications. Currently there is no straightforward way in SAS® to build these tables in a presentable format that can then be customized to individual tastes. In this paper, we focus on presenting summary statistics and results from comparing categorical variables between two or more independent groups. The macro takes the dataset, the number of treatment groups, and the type of test (either chi-square or Fisher's exact) as input and presents the results in a publication-ready table. This macro automates summarizing data to a certain extent and minimizes risky typographical errors when copying results or typing them into a table.
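As a hedged sketch of the capture step underlying such a macro (hypothetical data set and variables), ODS OUTPUT routes the procedure tables into data sets that can later be merged and formatted with PROC REPORT:

ods output CrossTabFreqs=freqs ChiSq=chisq FisherExact=fisher;
proc freq data=study;
   tables treatment*response / chisq fisher;
run;

/* freqs, chisq, and fisher can now be restructured with DATA steps and
   printed with PROC REPORT in a publication-ready layout */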
Read the paper (PDF).
Jeff Gossett, University of Arkansas for Medical Sciences
Mallikarjuna Rettiganti, UAMS
Paper 3347-2015:
Donor Sentiment and Characteristic Analysis Using SAS® Enterprise Miner™ and SAS® Sentiment Analysis Studio
It has always been a million-dollar question: what inhibits a donor from donating? Many successful universities have deep roots in annual giving. We know donor sentiment is a key factor in drawing attention to engage donors. This paper is a summary of findings about donor behaviors using textual analysis combined with the power of predictive modeling. In addition to identifying the characteristics of donors, the paper focuses on identifying the characteristics of a first-time donor. It distinguishes the features of the first-time donor from the general donor pattern. It leverages the variations in data to provide deeper insights into behavioral patterns. A data set containing 247,000 records was obtained from the XYZ University Foundation alumni database, Facebook, and Twitter. Solicitation content such as email subject lines sent to the prospect base was considered. Time-dependent data and time-independent data were categorized to make unbiased predictions about the first-time donor. The predictive models use inputs such as age, educational records, scholarships, events, student memberships, and solicitation methods. Models such as decision trees, Dmine regression, and neural networks were built to predict the prospects. SAS® Sentiment Analysis Studio and SAS® Enterprise Miner™ were used to analyze the sentiment.
Read the paper (PDF).
Ramcharan Kakarla, Comcast
Goutam Chakraborty, Oklahoma State University
Paper 3381-2015:
Double Generalized Linear Models Using SAS®: The %DOUBLEGLM Macro
The purpose of this paper is to introduce a SAS® macro named %DOUBLEGLM that enables users to model the mean and dispersion jointly using the double generalized linear models described in Nelder (1991) and Lee (1998). The R functions FITJOINT and DGLM (R Development Core Team, 2011) were used to verify the suitability of the %DOUBLEGLM macro estimates. The results showed that the macro estimates were close to those obtained from the R functions.
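As a hedged illustration of the underlying idea (not the macro itself, which automates the full iteration), one hand-coded cycle of the mean/dispersion alternation with hypothetical variable names: a mean submodel is fit, and a gamma GLM with a log link is fit to the squared deviance residuals as the dispersion submodel.

/* mean submodel */
proc genmod data=mydata;
   model y = x1 x2 / dist=normal link=identity;
   output out=mean_fit resdev=rdev;
run;

/* dispersion submodel on the squared deviance residuals */
data disp_input;
   set mean_fit;
   d = rdev**2;
run;

proc genmod data=disp_input;
   model d = z1 z2 / dist=gamma link=log;
run;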
Read the paper (PDF). | Download the data file (ZIP).
Paulo Silva, Universidade de Brasilia
Alan Silva, Universidade de Brasilia
Paper SAS1865-2015:
Drilling for Deepwater Data: A Forensic Analysis of the Gulf of Mexico Deepwater Horizon Disaster
During the cementing and pumps-off phase of oil drilling, drilling operations need to know, in real time, about any loss of hydrostatic or mechanical well integrity. This phase involves not only big data, but also high-velocity data. Today's state-of-the-art drilling rigs have tens of thousands of sensors. These sensors and their data output must be correlated and analyzed in real time. This paper shows you how to leverage SAS® Asset Performance Analytics and SAS® Enterprise Miner™ to build a model for drilling and well control anomalies, to fingerprint key well control measures of the transient fluid properties, and to operationalize these analytics on the drilling assets with SAS® Event Stream Processing. We cover the implementation and results from the Deepwater Horizon case study, demonstrating how SAS analytics enables the rapid differentiation between safe and unsafe modes of operation.
Read the paper (PDF).
Jim Duarte, SAS
Keith Holdaway, SAS
Moray Laing, SAS
E
Paper 3083-2015:
Easing into Analytics Using SAS® Enterprise Guide® 6.1
Do you need to deliver business insight and analytics to support decision-making? Using SAS® Enterprise Guide®, you can access the full power of SAS® for analytics, without needing to learn the details of SAS programming. This presentation focuses on the following uses of SAS Enterprise Guide: Exploring and understanding--getting a feel for your data and for its issues and anomalies; Visualizing--looking at the relationships, trends, and surprises; Consolidating--starting to piece together the story; and Presenting--building the insight and analytics into a presentation using SAS Enterprise Guide.
Read the paper (PDF).
Marje Fecht, Prowerk Consulting
Paper 3329-2015:
Efficiently Using SAS® Data Views
For the Research Data Centers (RDCs) of the United States Census Bureau, the demand for disk space substantially increases with each passing year. Efficiently using the SAS® data view might successfully address the concern about disk space challenges within the RDCs. This paper discusses the usage and benefits of the SAS data view to save disk space and reduce the time and effort required to manage large data sets. The ability and efficiency of the SAS data view to process regular ASCII, compressed ASCII, and other commonly used file formats are analyzed and evaluated in detail. The authors discuss ways in which using SAS data views is more efficient than the traditional methods in processing and deploying the large census and survey data in the RDCs.
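A minimal DATA step view sketch with hypothetical file and variable names: the view stores only the program, so the ASCII file is read (and no SAS data set is written to disk) when a procedure uses the view.

data rdc.survey_v / view=rdc.survey_v;
   infile 'survey_extract.dat' lrecl=120;      /* hypothetical fixed-format ASCII file */
   input person_id $ 1-9 state $ 11-12 income 14-21;
run;

proc means data=rdc.survey_v n mean;           /* the file is processed only at this point */
   var income;
run;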
Read the paper (PDF).
Shigui Weng, US Bureau of the Census
Shy Degrace, US Bureau of the Census
Ya Jiun Tsai, US Bureau of the Census
Paper SAS1775-2015:
Encore: Introduction to Bayesian Analysis Using SAS/STAT®
The use of Bayesian methods has become increasingly popular in modern statistical analysis, with applications in numerous scientific fields. In recent releases, SAS® software has provided a wealth of tools for Bayesian analysis, with convenient access through several popular procedures in addition to the MCMC procedure, which is designed for general Bayesian modeling. This paper introduces the principles of Bayesian inference and reviews the steps in a Bayesian analysis. It then uses examples from the GENMOD and PHREG procedures to describe the built-in Bayesian capabilities, which became available for all platforms in SAS/STAT® 9.3. Discussion includes how to specify prior distributions, evaluate convergence diagnostics, and interpret the posterior summary statistics.
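A brief sketch of the built-in capability with hypothetical data and variables: adding a BAYES statement to PROC GENMOD requests posterior sampling, convergence diagnostics, and posterior summaries for an otherwise ordinary model specification.

proc genmod data=trial;
   class trt;
   model events = trt dose / dist=poisson link=log;
   bayes seed=1 nmc=10000 nbi=2000 coeffprior=normal outpost=post
         diagnostics=(ess autocorr) statistics=summary;
run;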
Read the paper (PDF).
Maura Stokes, SAS
Paper 3920-2015:
Entity Resolution and Master Data Life Cycle Management in the Era of Big Data
Proper management of master data is a critical component of any enterprise information system. However, effective master data management (MDM) requires that both IT and Business understand the life cycle of master data and the fundamental principles of entity resolution (ER). This presentation provides a high-level overview of current practices in data matching, record linking, and entity information life cycle management that are foundational to building an effective strategy to improve data integration and MDM. Particular areas of focus are: 1) The need for ongoing ER analytics--the systematic and quantitative measurement of ER performance; 2) Investing in clerical review and asserted resolution for continuous improvement; and 3) Addressing the large-scale ER challenge through distributed processing.
Read the paper (PDF). | Watch the recording.
John Talburt, Black Oak Analytics, Inc
Paper 3242-2015:
Entropy-Based Measures of Weight of Evidence and Information Value for Variable Reduction and Segmentation for Continuous Dependent Variables
My SAS® Global Forum 2013 paper 'Variable Reduction in SAS® by Using Weight of Evidence (WOE) and Information Value (IV)' has become the most sought-after online article on variable reduction in SAS since its publication. But the methodology provided by the paper is limited to reduction of numeric variables for logistic regression only. Building on a similar process, the current paper adds several major enhancements: 1) The use of WOE and IV has been expanded to the analytics and modeling for continuous dependent variables. After the standardization of a continuous outcome, all records can be divided into two groups: positive performance (outcome y above sample average) and negative performance (outcome y below sample average). This treatment is rigorously consistent with the concept of entropy in Information Theory: the juxtaposition of two opposite forces in one equation, where a stronger contrast between the two suggests a higher intensity, that is, more information delivered by the variable in question. As the standardization keeps the outcome variable continuous and quantified, the revised formulas for WOE and IV can be used in the analytics and modeling for continuous outcomes such as sales volume, claim amount, and so on. 2) Categorical and ordinal variables can be assessed together with numeric ones. 3) Users of big data usually need to evaluate hundreds or thousands of variables, but it is not uncommon that over 90% of variables contain little useful information. We have added a SAS macro that trims these variables efficiently in a broad-brushed manner without a thorough examination. Afterward, we examine the retained variables more carefully for their behavior with respect to the target outcome. 4) We add Chi-Square analysis for categorical/ordinal variables and Gini coefficients for numeric variables in order to provide additional suggestions for segmentation and regression. With the above enhancements added, a SAS macro program is provided at the end of the paper as a complete suite for variable reduction/selection that efficiently evaluates all variables together. The paper provides a detailed explanation for how to use the SAS macro and how to read the SAS outputs that provide useful insights for subsequent linear regression, logistic regression, or scorecard development.
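A hedged sketch of the continuous-outcome WOE/IV idea described above (not the paper's macro), with hypothetical names: records are flagged as above or below the average of sales_amt, and WOE = ln(dist_pos/dist_neg) and the IV contribution are computed per level of one categorical predictor.

proc means data=sales noprint;
   var sales_amt;
   output out=_avg mean=avg_amt;
run;

data flagged;
   if _n_ = 1 then set _avg(keep=avg_amt);
   set sales;
   pos = (sales_amt > avg_amt);          /* above-average = "positive" performance */
run;

proc sql;
   create table woe_iv as
   select region,
          sum(pos)     / (select sum(pos)     from flagged) as dist_pos,
          sum(1 - pos) / (select sum(1 - pos) from flagged) as dist_neg,
          log(calculated dist_pos / calculated dist_neg)    as woe,
          (calculated dist_pos - calculated dist_neg) * calculated woe as iv_piece
   from flagged
   group by region;
quit;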
Read the paper (PDF).
Alec Zhixiao Lin, PayPal Credit
F
Paper 3419-2015:
Forest Plotting Analysis Macro %FORESTPLOT
A powerful tool for visually analyzing regression analysis is the forest plot. Model estimates, ratios, and rates with confidence limits are graphically stacked vertically in order to show how they overlap with each other and to show values of significance. The ability to see whether two values are significantly different from each other or whether a covariate has a significant meaning on its own is made much simpler in a forest plot rather than sifting through numbers in a report table. The amount of data preparation needed in order to build a high-quality forest plot in SAS® can be tremendous because the programmer needs to run analyses, extract the estimates to be plotted, structure the estimates in a format conducive to generating a forest plot, and then run the correct plotting procedure or create a graph template using the Graph Template Language (GTL). While some SAS procedures can produce forest plots using Output Delivery System (ODS) Graphics automatically, the plots are not generally publication-ready and are difficult to customize even if the programmer is familiar with GTL. The macro %FORESTPLOT is designed to perform all of the steps of building a high-quality forest plot in order to save time for both experienced and inexperienced programmers, and is currently set up to perform regression analyses common to the clinical oncology research areas, Cox proportional hazards and logistic, as well as calculate Kaplan-Meier event-free rates. To improve flexibility, the user can specify a pre-built data set to transform into a forest plot if the automated analysis options of the macro do not fit the user's needs.
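For comparison with the macro's GTL output, a stripped-down forest plot can also be drawn with PROC SGPLOT from a pre-built estimates data set (hypothetical variables label, hr, lcl, ucl):

proc sgplot data=estimates noautolegend;
   scatter y=label x=hr / xerrorlower=lcl xerrorupper=ucl
           markerattrs=(symbol=squarefilled);
   refline 1 / axis=x;
   xaxis type=log label="Hazard Ratio (95% CI)";
   yaxis discreteorder=data reverse label=" ";
run;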
Read the paper (PDF).
Jeffrey Meyers, Mayo Clinic
Qian Shi, Mayo Clinic
G
Paper SAS4121-2015:
Getting Started with Logistic Regression in SAS
This presentation provides a brief introduction to logistic regression analysis in SAS. Learn the differences between linear regression and logistic regression, including ordinary least squares versus maximum likelihood estimation. Learn to understand LOGISTIC procedure syntax, use continuous and categorical predictors, and interpret output from ODS Graphics.
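A getting-started sketch using the shipped sashelp.heart data as a stand-in example, with one categorical and two continuous predictors and ODS Graphics output:

proc logistic data=sashelp.heart plots(only)=(oddsratio effect);
   class sex / param=ref;
   model status(event='Dead') = sex ageatstart cholesterol;
run;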
Danny Modlin, SAS
Paper SAS4140-2015:
Getting Started with Mixed Models in Business
For decades, mixed models have been used by researchers to account for random sources of variation in regression-type models. Now they are gaining favor in business statistics for giving better predictions for naturally occurring groups of data, such as sales reps, store locations, or regions. Learn how predictions based on a mixed model differ from predictions in ordinary regression, and see examples of mixed models with business data.
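A minimal business-flavored sketch with hypothetical names (weekly sales by store within region); the store-level random intercept is what shrinks each store's prediction toward its region's trend:

proc mixed data=store_sales;
   class region store;
   model sales = price promo region / solution;
   random intercept / subject=store(region);
run;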
Catherine Truxillo, SAS
Paper SAS4122-2015:
Getting Started with SAS® Contextual Analysis: Easily Build Models from Unstructured Data
Text data constitutes more than half of the unstructured data held in organizations. Buried within the narrative of customer inquiries, the pages of research reports, and the notes in servicing transactions are the details that describe concerns, ideas, and opportunities. The manual effort historically needed to develop a training corpus is no longer required, making it simpler to gain the insight buried in unstructured text. With the ease of machine learning refined with the specificity of linguistic rules, SAS Contextual Analysis helps analysts identify and evaluate the meaning of the electronic written word. From a single point-and-click GUI, the process of developing text models is guided and visually intuitive. This presentation will walk through the text model development process with SAS Contextual Analysis. The results are in SAS format, ready for text-based insights to be used in any other SAS application.
George Fernandez, SAS
Paper SAS4123-2015:
Getting Started with Time Series Data and Forecasting in SAS
SAS/ETS provides many tools to improve the productivity of the analyst who works with time series data. This tutorial will take an analyst through the process of turning transaction-level data into a time series. The session will then cover some basic forecasting techniques that use past fluctuations to predict future events. We will then extend this modeling technique to include explanatory factors in the prediction equation.
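A hedged sketch of the two steps described above, with hypothetical transaction data: PROC TIMESERIES rolls transactions up to a monthly series, and PROC ESM produces a 12-month-ahead forecast with exponential smoothing.

proc timeseries data=transactions out=monthly;
   id txn_date interval=month accumulate=total;
   var amount;
run;

proc esm data=monthly out=forecasts lead=12;
   id txn_date interval=month;
   forecast amount / model=addwinters;          /* additive Holt-Winters smoothing */
run;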
Kenneth Sanford, SAS
H
Paper SAS1708-2015:
How SAS® Uses SAS to Analyze SAS Blogs
SAS® blogs (hosted at http://blogs.sas.com/content) attract millions of page views annually. With hundreds of authors, thousands of posts, and constant chatter within the blog comments, it's impossible for one person to keep track of all of the activity. In this paper, you learn how SAS technology is used to gather data and report on SAS blogs from the inside out. The beneficiaries include personnel from all over the company, including marketing, technical support, customer loyalty, and executives. The author describes the business case for tracking and reporting on the activity of blogging. You learn how SAS tools are used to access the WordPress database and how to create a 'blog data mart' for reporting and analytics. The paper includes specific examples of the insight that you can gain from examining the blogs analytically, and which techniques are most useful for achieving that insight. For example, the blog transactional data are combined with social media metrics (also gathered by using SAS) to show which blog entries and authors yield the most engagement on Twitter, Facebook, and LinkedIn. In another example, we identified the growing trend of 'blog comment spam' on the SAS blog properties and measured its cost to the business. These metrics helped to justify the investment in a solution. Many of the tools used are part of SAS® Foundation, including SAS/ACCESS®, the DATA step and SQL, PROC REPORT, PROC SGPLOT, and more. The results are shared in static reports, automated daily email summaries, dynamic reports hosted in SAS/IntrNet®, and even a corporate dashboard hosted in SAS® Visual Analytics.
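As a hedged sketch of the access pattern only (connection details and macro variables are hypothetical; wp_posts and post_status are part of the standard WordPress schema), SAS/ACCESS® Interface to MySQL exposes the blog database through a libref:

libname wp mysql server='blogsdb.example.com' database=wordpress
        user=&dbuser password=&dbpass;

proc sql;
   create table blog_posts as
   select post_date, post_title, comment_count
   from wp.wp_posts
   where post_status = 'publish';
quit;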
Read the paper (PDF).
Chris Hemedinger, SAS
Paper 3185-2015:
How to Hunt for Utility Customer Electric Usage Patterns Armed with SAS® Visual Statistics with Hadoop and Hive
Your electricity usage patterns reveal a lot about your family and routines. Information collected from electrical smart meters can be mined to identify patterns of behavior that can in turn be used to help change customer behavior for the purpose of altering system load profiles. Demand Response (DR) programs represent an effective way to cope with rising energy needs and increasing electricity costs. The Federal Energy Regulatory Commission (FERC) defines demand response as changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to lower electricity use at times of high wholesale market prices or when system reliability is jeopardized. In order to effectively motivate customers to voluntarily change their consumption patterns, it is important to identify customers whose load profiles are similar so that targeted incentives can be directed toward these customers. Hence, it is critical to use tools that can accurately cluster similar time series patterns while providing a means to profile these clusters. In order to solve this problem, though, hardware and software that is capable of storing, extracting, transforming, loading, and analyzing large amounts of data must first be in place. Utilities receive customer data from smart meters, which track and store customer energy usage. The data collected is sent to the energy companies every fifteen minutes or hourly. With millions of meters deployed, this quantity of information creates a data deluge for utilities, because each customer generates about three thousand data points monthly, and more than thirty-six billion reads are collected annually for a million customers. The data scientist is the hunter, and DR candidate patterns are the prey in this cat-and-mouse game of finding customers willing to curtail electrical usage for a program benefit. The data scientist must connect large siloed data sources, external data, and even unstructured data to detect common customer electrical usage patterns, build dependency models, and score them against their customer population. Taking advantage of Hadoop's ability to store and process data on commodity hardware with distributed parallel processing is a game changer. With Hadoop, no data set is too large, and SAS® Visual Statistics leverages machine learning, artificial intelligence, and clustering techniques to build descriptive and predictive models. All data can be usable from disparate systems, including structured, unstructured, and log files. The data scientist can use Hadoop to ingest all available data at rest, and analyze customer usage patterns, system electrical flow data, and external data such as weather. This paper will use Cloudera Hadoop with Apache Hive queries for analysis on platforms such as SAS® Visual Analytics and SAS Visual Statistics. The paper will showcase optionality within Hadoop for querying large data sets with open-source tools and importing these data into SAS® for robust customer analytics, clustering customers by usage profiles, propensity to respond to a demand response event, and an electrical system analysis for Demand Response events.
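As a hedged sketch of the connectivity piece only (server, schema, table, and column names are hypothetical), SAS/ACCESS® Interface to Hadoop lets implicit SQL pass-through push the subsetting to Hive before the result lands in SAS:

libname hdp hadoop server='cdh-master.example.com' port=10000
        schema=ami user=&hiveuser password=&hivepass;

proc sql;
   create table one_day as
   select meter_id, read_ts, kwh
   from hdp.interval_reads
   where read_dt = '2014-07-01';
quit;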
Read the paper (PDF).
Kathy Ball, SAS
Paper 3252-2015:
How to Use SAS® for GMM Logistic Regression Models for Longitudinal Data with Time-Dependent Covariates
In longitudinal data, it is important to account for the correlation due to repeated measures and time-dependent covariates. Generalized method of moments can be used to estimate the coefficients in longitudinal data, although there are currently limited procedures in SAS® to produce GMM estimates for correlated data. In a recent paper, Lalonde, Wilson, and Yin provided a GMM model for estimating the coefficients in this type of data. SAS PROC IML was used to generate equations that needed to be solved to determine which estimating equations to use. In addition, this study extended classifications of moment conditions to include a type IV covariate. Two data sets were evaluated using this method, including re-hospitalization rates from a Medicare database as well as body mass index and future morbidity rates among Filipino children. Both examples contain binary responses, repeated measures, and time-dependent covariates. However, while this technique is useful, it is tedious and can also be complicated when determining the matrices necessary to obtain the estimating equations. We provide a concise and user-friendly macro to fit GMM logistic regression models with extended classifications.
Read the paper (PDF).
Katherine Cai, Arizona State University
I
Paper 3411-2015:
Identifying Factors Associated with High-Cost Patients
Research has shown that the top five percent of patients can account for nearly fifty percent of the total healthcare expenditure in the United States. Using SAS® Enterprise Guide® and PROC LOGISTIC, a statistical methodology was developed to identify factors (for example, patient demographics, diagnostic symptoms, comorbidity, and the type of procedure code) associated with the high cost of healthcare. Analyses were performed using the FAIR Health National Private Insurance Claims (NPIC) database, which contains information about healthcare utilization and cost in the United States. The analyses focused on treatments for chronic conditions, such as trans-myocardial laser revascularization for the treatment of coronary heart disease (CHD) and pressurized inhalation for the treatment of asthma. Furthermore, bubble plots and heat maps were created using SAS® Visual Analytics to provide key insights into potentially high-cost treatments for heart disease and asthma patients across the nation.
Read the paper (PDF). | Download the data file (ZIP).
Jeff Dang, FAIR Health
Paper 3295-2015:
Imputing Missing Data using SAS®
Missing data is an unfortunate reality of statistics. However, there are various ways to estimate and deal with missing data. This paper explores the pros and cons of traditional imputation methods versus maximum likelihood estimation, as well as single versus multiple imputation. These differences are displayed by comparing parameter estimates from a known data set while simulating random missing data of varying severity. In addition, this paper demonstrates PROC MI and PROC MIANALYZE and shows how to use these procedures with a longitudinal data set.
Read the paper (PDF).
Christopher Yim, Cal Poly San Luis Obispo
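As a brief illustration of the multiple-imputation workflow described above, the following sketch imputes, analyzes, and combines results; the data set STUDY, the variables Y, X1, and X2, and the choice of five imputations are hypothetical and not taken from the paper.

proc mi data=study out=study_mi nimpute=5 seed=54321;
   var y x1 x2;                     /* variables entering the imputation model */
run;

proc reg data=study_mi outest=est covout noprint;
   model y = x1 x2;                 /* fit the analysis model in each imputed data set */
   by _imputation_;
run;
quit;

proc mianalyze data=est;
   modeleffects intercept x1 x2;    /* pool the five sets of estimates (Rubin's rules) */
run;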
Paper 3052-2015:
Introduce a Linear Regression Model by Using the Variable Transformation Method
This paper explains how to build a linear regression model using the variable transformation method. Testing the assumptions, which is required for linear modeling and testing the fit of a linear model, is included. This paper is intended for analysts who have limited exposure to building linear models. This paper uses the REG, GLM, CORR, UNIVARIATE, and GPLOT procedures.
Read the paper (PDF). | Download the data file (ZIP).
Nancy Hu, Discover
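A minimal sketch of the transform-and-refit cycle the abstract outlines, assuming a hypothetical data set SALES with a skewed response REVENUE and a single predictor ADVERTISING; the log transformation is only one of the candidates an analyst would try.

proc univariate data=sales;
   var revenue;                      /* check skewness before choosing a transformation */
run;

data sales_t;
   set sales;
   log_revenue = log(revenue);       /* candidate transformation toward linearity */
run;

proc reg data=sales_t;
   model log_revenue = advertising;
   output out=diag r=resid p=pred;   /* keep residuals to recheck the model assumptions */
run;
quit;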
Paper SAS1742-2015:
Introducing the HPGENSELECT Procedure: Model Selection for Generalized Linear Models and More
Generalized linear models are highly useful statistical tools in a broad array of business applications and scientific fields. How can you select a good model when numerous models that have different regression effects are possible? The HPGENSELECT procedure, which was introduced in SAS/STAT® 12.3, provides forward, backward, and stepwise model selection for generalized linear models. In SAS/STAT 14.1, the HPGENSELECT procedure also provides the LASSO method for model selection. You can specify common distributions in the family of generalized linear models, such as the Poisson, binomial, and multinomial distributions. You can also specify the Tweedie distribution, which is important in ratemaking by the insurance industry and in scientific applications. You can run the HPGENSELECT procedure in single-machine mode on the server where SAS/STAT is installed. With a separate license for SAS® High-Performance Statistics, you can also run the procedure in distributed mode on a cluster of machines that distribute the data and the computations. This paper shows you how to use the HPGENSELECT procedure both for model selection and for fitting a single model. The paper also explains the differences between the HPGENSELECT procedure and the GENMOD procedure.
Read the paper (PDF).
Gordon Johnston, SAS
Bob Rodriguez, SAS
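A minimal sketch of LASSO selection with PROC HPGENSELECT for a Poisson model, run in single-machine mode; the data set CLAIMS and its predictors are hypothetical, and the available selection options depend on the SAS/STAT release.

proc hpgenselect data=claims;
   model numclaims = income age vehicle_age tenure
         / distribution=poisson link=log;
   selection method=lasso;           /* forward, backward, and stepwise are also available */
run;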
Paper 2986-2015:
Introduction to Output Delivery System (ODS)
This presentation teaches the audience how to use ODS Graphics. Now part of Base SAS®, ODS Graphics are a great way to easily create clear graphics that enable any user to tell their story well. SGPLOT and SGPANEL are two of the procedures that can be used to produce powerful graphics that used to require a lot of work. The core of the procedures is explained, as well as some of the many options available. Furthermore, we explore the ways to combine the individual statements to make more complex graphics that tell the story better. Any user of Base SAS on any platform will find great value in the SAS ODS Graphics procedures.
Chuck Kincaid, Experis
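A small example of the kind of one-step graph the presentation covers, using the SASHELP.CLASS data set that ships with SAS; the overlaid statements illustrate how PROC SGPLOT combines plot types.

proc sgplot data=sashelp.class;
   scatter x=height y=weight / group=sex;   /* grouped scatter plot */
   reg x=height y=weight;                   /* overlay a regression fit */
   xaxis label="Height (inches)";
   yaxis label="Weight (pounds)";
run;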
J
Paper 3020-2015:
Jeffreys Interval for One-Sample Proportion with SAS/STAT® Software
This paper introduces Jeffreys interval for one-sample proportion using SAS® software. It compares the credible interval from a Bayesian approach with the confidence interval from a frequentist approach. Different ways to calculate the Jeffreys interval are presented using PROC FREQ, the QUANTILE function, a SAS program of the random walk Metropolis sampler, and PROC MCMC.
Read the paper (PDF).
Wu Gong, The Children's Hospital of Philadelphia
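Two of the approaches mentioned above can be sketched in a few lines; the sample size, success count, and the data set TRIAL with a binary variable RESPONSE are hypothetical, and the BINOMIAL(CL=JEFFREYS) request assumes a reasonably recent release of PROC FREQ.

/* Jeffreys interval from the beta quantile: the posterior is Beta(s+0.5, n-s+0.5) */
data jeffreys;
   n = 100; s = 23;
   lower = quantile('BETA', 0.025, s+0.5, n-s+0.5);
   upper = quantile('BETA', 0.975, s+0.5, n-s+0.5);
run;

/* The same interval requested directly from PROC FREQ */
proc freq data=trial;
   tables response / binomial(cl=jeffreys);
run;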
K
Paper 2480-2015:
Kaplan-Meier Survival Plotting Macro %NEWSURV
The research areas of pharmaceuticals and oncology clinical trials greatly depend on time-to-event endpoints such as overall survival and progression-free survival. One of the best graphical displays of these analyses is the Kaplan-Meier curve, which can be simple to generate with the LIFETEST procedure but difficult to customize. Journal articles generally prefer that statistics such as median time-to-event, number of patients, and time-point event-free rate estimates be displayed within the graphic itself, and this was previously difficult to do without an external program such as Microsoft Excel. The macro %NEWSURV takes advantage of the Graph Template Language (GTL) that was added with the SG graphics engine to create this level of customizability without the need for back-end manipulation. Taking this one step further, the macro was improved to be able to generate a lattice of multiple unique Kaplan-Meier curves for side-by-side comparisons or for condensing figures for publications. This paper describes the functionality of the macro and describes how the key elements of the macro work.
Read the paper (PDF).
Jeffrey Meyers, Mayo Clinic
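For context, the baseline Kaplan-Meier plot that %NEWSURV customizes can be produced with PROC LIFETEST as below; the ADTTE data set and the AVAL, CNSR, and TRT01P variables follow common time-to-event naming conventions and are assumptions here, and the macro's own parameters are documented in the paper.

proc lifetest data=adtte plots=survival(atrisk);
   time aval * cnsr(1);     /* CNSR=1 marks censored observations */
   strata trt01p;           /* one curve per treatment arm */
run;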
L
Paper 3297-2015:
Lasso Regularization for Generalized Linear Models in Base SAS® Using Cyclical Coordinate Descent
The cyclical coordinate descent method is a simple algorithm that has been used for fitting generalized linear models with lasso penalties by Friedman et al. (2007). The coordinate descent algorithm can be implemented in Base SAS® to perform efficient variable selection and shrinkage for GLMs with the L1 penalty (the lasso).
Read the paper (PDF).
Robert Feyerharm, Beacon Health Options
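The paper implements the algorithm in Base SAS; the sketch below restates the core cyclical coordinate descent update in SAS/IML for compactness, for a linear (squared-error) lasso with a fixed penalty. The data set TRAIN, its variables, and the value of LAMBDA are hypothetical, and the soft-thresholding update is the standard one from Friedman et al.

proc iml;
   /* soft-threshold operator used by each coordinate update */
   start soft(z, g);
      return( sign(z) # max(abs(z) - g, 0) );
   finish;

   use train;
   read all var {x1 x2 x3 x4} into X;
   read all var {y} into y;
   close train;

   n = nrow(X);  p = ncol(X);
   X = (X - repeat(X[:,], n, 1)) / repeat(std(X), n, 1);   /* standardize the predictors */
   y = y - y[:];                                           /* center the response */

   lambda = 0.1;                            /* fixed penalty for this sketch */
   beta = j(p, 1, 0);
   do iter = 1 to 100;                      /* cycle repeatedly through the coordinates */
      do j = 1 to p;
         r = y - X*beta + X[,j]*beta[j];    /* partial residual excluding x_j */
         beta[j] = soft(X[,j]`*r, n*lambda) / ssq(X[,j]);
      end;
   end;
   print beta;
quit;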
Paper SPON2000-2015:
Leveraging In-Database Technology to Enhance Data Governance and Improve Performance
In-database processing refers to the integration of advanced analytics into the data warehouse. With this capability, analytic processing is optimized to run where the data reside, in parallel, without having to copy or move the data for analysis. From a data governance perspective there are many good reasons to embrace in-database processing. Many analytical computing solutions and large databases use this technology because it provides significant performance improvements over more traditional methods. Come learn how Blue Cross Blue Shield of Tennessee (BCBST) uses in-database processing from SAS and Teradata.
Harold Klagstad, BlueCross BlueShield of TN
Paper 3342-2015:
Location-Based Association of Customer Sentiment and Retail Sales
There are various economic factors that affect retail sales. One important factor that is expected to correlate is overall customer sentiment toward a brand. In this paper, we analyze how location-specific customer sentiment can vary and correlate with sales at retail stores. In our attempt to find any dependency, we use location-specific Twitter feeds related to a national-brand chain retail store. We opinion-mine their overall sentiment using SAS® Sentiment Analysis Studio. We estimate the correlation between the opinion index and retail sales within the studied geographic areas. Later in the analysis, using ArcGIS Online from Esri, we examine whether other location-specific variables that could correlate with customer sentiment toward the brand add significant power for predicting the brand's retail sales.
Read the paper (PDF).
Asish Satpathy, University of California, Riverside
Goutam Chakraborty, Oklahoma State University
Tanvi Kode, Oklahoma State University
M
Paper 2481-2015:
Managing Extended Attributes With a SAS® Enterprise Guide® Add-In
SAS® 9.4 introduced extended attributes, which are name-value pairs that can be attached to either the data set or to individual variables. Extended attributes are managed through PROC DATASETS and can be viewed through PROC CONTENTS or through Dictionary.XATTRS. This paper describes the development of a SAS® Enterprise Guide® custom add-in that allows for the entry and editing of extended attributes, with the possibility of using a controlled vocabulary. The controlled vocabulary used in the initial application is derived from the lifecycle branch of the Data Documentation Initiative metadata standard (DDI-L).
Read the paper (PDF).
Larry Hoyle, IPSR, Univ. of Kansas
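A short sketch of the PROC DATASETS syntax that such an add-in generates behind the scenes; the library, data set, and attribute names here are hypothetical examples rather than the DDI-L vocabulary used in the paper, and the DICTIONARY.XATTRS query assumes SAS 9.4.

proc datasets library=work nolist;
   modify survey;
   xattr set ds  StudyTitle='Household Panel 2014';            /* data set level attribute */
   xattr set var income (Units='USD' SourceQuestion='Q17');    /* variable-level attributes */
quit;

proc sql;                                     /* view the attributes through the dictionary */
   select * from dictionary.xattrs
      where libname='WORK' and memname='SURVEY';
quit;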
Paper 3193-2015:
Mapping out SG Procedures and Using PROC SGPLOT for Mapping
This paper describes the main functions of the SAS® SG procedures and their relations. It also offers a way to create data-colored maps using these procedures. The basics of the SG procedures: For a few years, the SG procedures (PROC SGPLOT, PROC SGSCATTER, PROC SGPANEL, and so on) have been part of Base SAS® and thus available to everybody. SG originated as Statistical Graphics, but nowadays the procedures are often referred to as SAS® ODS Graphics. With the syntax in a 1000+ page document, it is quite a challenge to start using them. Also, SAS® Enterprise Guide® currently has no graphics tasks that generate code for the SG procedures (except those in the statistical arena). For a long time SAS/GRAPH® has been the vehicle for producing presentation-ready graphs of your data. In particular, SAS users who have experience with those SAS/GRAPH procedures will hesitate to change over. But the SG procedures continue to be enhanced with new features. And, because the appearance of many elements is governed by the ODS styles, they are very well suited to provide a consistent style across all your output, text and graphics. PROC SGPLOT - PROC SGPANEL - PROC SGSCATTER: The paper first describes the basic procedures that a user will start with; PROC SGPLOT is the first place. Then the more elaborate possibilities of PROC SGPANEL and PROC SGSCATTER are described. Both these procedures can create a matrix or panel of graphs. The different goals of these two procedures are explained: comparing a group of variables versus comparing the levels of two variables. PROC SGPLOT can create many different graphs: histograms, time series, scatterplots, and so on. PROC SGPANEL has essentially the same possibilities. The nature of PROC SGSCATTER (and the name says it already) limits it to scatter-like graphs. But many statements and options are common to lots of types of graphs. This paper groups them logically, making clear what the procedures have in common and where they differ. Related to the SG procedures are also two utilities (the Graphics Editor and the Graphics Designer), which are delivered as SAS® Foundation applications. The paper describes these utilities, the objects they produce, and how they relate to the SG procedures. Creating a map: For virtually all tasks that can be performed with the well-known SAS/GRAPH procedures, the counterpart in the SG procedures is easily pointed out, often with more extensive features. This is not the case for the maps produced with PROC GMAP. This paper shows the mere few steps that are necessary to convert the data sets that contain your data and your map coordinates into data sets that enable you to use the power and features of PROC SGPLOT to create your map in any projection system and any coordinate window.
Read the paper (PDF). | Download the data file (ZIP).
Frank Poppe, PW Consulting
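The key idea of the mapping section can be sketched with the POLYGON statement of PROC SGPLOT (available in recent SAS 9.4 releases); the data set MAPCOORDS, with one row per boundary point and the variables LONG, LAT, and STATE, is a hypothetical stand-in for a converted map coordinate data set.

proc sgplot data=mapcoords noborder noautolegend;
   polygon x=long y=lat id=state / fill outline;   /* one closed polygon per ID value */
run;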
Paper SAS1822-2015:
Master Data and Command Results: Combine Master Data Management with SAS® Analytics for Improved Insights
It's well known that SAS® is the leader in advanced analytics, but often overlooked is the intelligent data preparation that combines information from disparate sources to enable confident creation and deployment of compelling models. Improving data-based decision making is among the top reasons why organizations decide to embark on master data management (MDM) projects and why you should consider incorporating MDM functionality into your analytics-based processes. MDM is a discipline that includes the people, processes, and technologies for creating an authoritative view of core data elements in enterprise operational and analytic systems. This paper demonstrates why MDM functionality is a natural fit for many SAS solutions that need to have access to timely, clean, and unique master data. Because MDM shares many of the same technologies that power SAS analytic solutions, it has never been easier to add MDM capabilities to your advanced analytics projects.
Read the paper (PDF).
Ron Agresta, SAS
Paper 3375-2015:
Maximizing a Churn Campaign's Profitability with Cost-sensitive Predictive Analytics
Predictive analytics has been widely studied in recent years, and it has been applied to solve a wide range of real-world problems. Nevertheless, current state-of-the-art predictive analytics models are not well aligned with managers' requirements in that the models fail to include the real financial costs and benefits during the training and evaluation phases. Churn predictive modeling is one example in which evaluating a model based on a traditional measure such as accuracy or predictive power does not yield the best results when measured by investment per subscriber in a loyalty campaign and the financial impact of failing to detect a real churner versus wrongly predicting a non-churner as a churner. In this paper, we propose a new financially based measure for evaluating the effectiveness of a voluntary churn campaign, taking into account the available portfolio of offers, their individual financial cost, and the probability of acceptance depending on the customer profile. Then, using a real-world churn data set, we compared different cost-insensitive and cost-sensitive predictive analytics models and measured their effectiveness based on their predictive power and cost optimization. The results show that using a cost-sensitive approach yields an increase in profitability of up to 32.5%.
Alejandro Correa Bahnsen, University of Luxembourg
Darwin Amezquita, DIRECTV
Juan Camilo Arias, Smartics
Paper 2524-2015:
Methodology of Model Creation
The goal of this session is to describe the whole process of model creation from the business request through model specification, data preparation, iterative model creation, model tuning, implementation, and model servicing. Each mentioned phase consists of several steps in which we describe the main goal of the step, the expected outcome, the tools used, our own SAS code, useful nodes, and settings in SAS® Enterprise Miner™, procedures in SAS® Enterprise Guide®, measurement criteria, and expected duration in man-days. For three steps, we also present deep insights with examples of practical usage, explanations of the code used, settings, and ways of exploring and interpreting the output. During the actual model creation process, we suggest using Microsoft Excel to keep all input metadata along with information about transformations performed in SAS Enterprise Miner. To get information about model results faster, we use an automatic SAS® code generator implemented in Excel, input this code into SAS Enterprise Guide, and create a specific profile of results directly from the node output tables of SAS Enterprise Miner. This paper also focuses on an example of a binary model stability check over time, performed in SAS Enterprise Guide by measuring the optimal cut-off percentage and lift. These measurements are visualized and automated using our own code. By using this methodology, users have direct contact with the transformed data along with the possibility to analyze and explore any intermediate results. Furthermore, the proposed approach can be used for several types of modeling (for example, binary and nominal predictive models or segmentation models). Generally, we have summarized our best practices of combining specific procedures performed in SAS Enterprise Guide, SAS Enterprise Miner, and Microsoft Excel to create and interpret models faster and more effectively.
Read the paper (PDF).
Peter Kertys, VÚB a.s.
Paper 2400-2015:
Modeling Effect Modification and Higher-Order Interactions: A Novel Approach for Repeated Measures Design Using the LSMESTIMATE Statement in SAS® 9.4
Effect modification occurs when the association between a predictor of interest and the outcome is differential across levels of a third variable--the modifier. Effect modification is statistically tested as the interaction effect between the predictor and the modifier. In repeated measures studies (with more than two time points), higher-order (three-way) interactions must be considered to test effect modification by adding time to the interaction terms. Custom fitting and constructing these repeated measures models are difficult and time consuming, especially with respect to estimating post-fitting contrasts. With the advancement of the LSMESTIMATE statement in SAS®, a simplified approach can be used to custom test for higher-order interactions with post-fitting contrasts within a mixed model framework. This paper provides a simulated example with tips and techniques for using an application of the nonpositional syntax of the LSMESTIMATE statement to test effect modification in repeated measures studies. This approach, which is applicable to exploring modifiers in randomized controlled trials (RCTs), goes beyond the treatment effect on outcome to a more functional understanding of the factors that can enhance, reduce, or change this relationship. Using this technique, we can easily identify differential changes for specific subgroups of individuals or patients that subsequently impact treatment decision making. We provide examples of conventional approaches to higher-order interaction and post-fitting tests using the ESTIMATE statement and compare and contrast this to the nonpositional syntax of the LSMESTIMATE statement. The merits and limitations of this approach are discussed.
Read the paper (PDF). | Download the data file (ZIP).
Pronabesh DasMahapatra, PatientsLikeMe Inc.
Ryan Black, NOVA Southeastern University
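To make the nonpositional syntax concrete, here is a small repeated measures sketch with a two-way interaction; the data set TRIAL, its variables, and the chosen contrast (treatment difference at the third time point) are hypothetical, and the paper's examples extend this idea to three-way interactions.

proc mixed data=trial;
   class trt time id;
   model outcome = trt time trt*time / solution;
   repeated time / subject=id type=un;
   /* nonpositional syntax: [coefficient, trt-level time-level] */
   lsmestimate trt*time 'Trt 1 vs Trt 2 at time 3' [1, 1 3] [-1, 2 3];
run;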
Paper 3359-2015:
Modelling Operational Risk Using Extreme Value Theory and Skew t-Copulas via Bayesian Inference Using SAS®
Operational risk losses are heavy tailed and likely to be asymmetric and extremely dependent among business lines and event types. We propose a new methodology to assess, in a multivariate way, the asymmetry and extreme dependence between severity distributions and to calculate the capital for operational risk. This methodology simultaneously uses several parametric distributions and an alternative mixture distribution (the lognormal for the body of losses and the generalized Pareto distribution for the tail) via extreme value theory using SAS®; the multivariate skew t-copula, applied for the first time to operational losses; and Bayesian inference theory to estimate new n-dimensional skew t-copula models via Markov chain Monte Carlo (MCMC) simulation. This paper analyzes a new operational loss data set, SAS® Operational Risk Global Data (SAS OpRisk Global Data), to model operational risk at international financial institutions. All of the severity models are constructed in SAS® 9.2 with PROC SEVERITY and PROC NLMIXED, and this paper describes the implementation.
Read the paper (PDF).
Betty Johanna Garzon Rozo, The University of Edinburgh
Paper 2720-2015:
Multinomial Logistic Model for Long-Term Value
Customer Long-Term Value (LTV) is a concept that is readily explained at a high level to marketing management of a company, but its analytic development is complex. This complexity involves the need to forecast customer behavior well into the future. This behavior includes the timing, frequency, and profitability of a customer's future purchases of products and services. This paper describes a method for computing LTV. First, a multinomial logistic regression provides probabilities for time-of-first-purchase, time-of-second-purchase, and so on, for each customer. Then the profits for the first purchase, second purchase, and so on, are forecast but only after adjustment for non-purchaser selection bias. Finally, these component models are combined in the LTV formula.
Read the paper (PDF).
Bruce Lund, Marketing Associates, LLC
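A minimal sketch of the first component, a multinomial (generalized logit) model for time of first purchase; the data set CUSTOMERS, the response FIRST_PURCHASE_WINDOW, and the RFM-style predictors are hypothetical, and the selection-bias adjustment and LTV formula from the paper are not shown.

proc logistic data=customers;
   model first_purchase_window = recency frequency monetary / link=glogit;
   output out=scored predprobs=(individual);   /* one probability per customer per window */
run;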
Paper 3264-2015:
Multiple Product Affinity Makes Much More Sense
Retailers proactively seek a data-driven approach to providing customized product recommendations that increase sales and customer loyalty. Product affinity models have been recognized as one of the vital tools for this purpose. The algorithm assigns a customer to a product affinity group when the likelihood of purchasing is the highest and that likelihood meets a minimum absolute requirement. However, in practice, valuable customers, up to 30% of the total universe, who buy across multiple product categories with two or more balanced product affinity likelihoods, remain unassigned and cannot be effectively targeted with product recommendations. This paper presents multiple product affinity models, developed using the SAS® macro language, to address this problem. We demonstrate how the innovative assignment algorithm successfully assigns these customers to appropriate multiple product affinity groups using nationwide retailer transactional data. In addition, the results show that customers establish loyalty through migration from single to multiple product affinity groups. This comprehensive business solution is shared in the paper, along with a clustering algorithm and nonparametric tree model used for model building. The SAS macro code for the customer assignment is provided in an appendix.
Read the paper (PDF).
Hsin-Yi Wang, Alliance Data Systems
Paper 2900-2015:
Multiple Ways to Detect Differential Item Functioning in SAS®
Differential item functioning (DIF), as an assessment tool, has been widely used in quantitative psychology, educational measurement, business management, insurance, and health care. The purpose of DIF analysis is to detect response differences of items in questionnaires, rating scales, or tests across different subgroups (for example, gender) and to ensure the fairness and validity of each item for those subgroups. The goal of this paper is to demonstrate several ways to conduct DIF analysis by using different SAS® procedures (PROC FREQ, PROC LOGISTIC, PROC GENMOD, PROC GLIMMIX, and PROC NLMIXED) and their applications. There are three general methods to examine DIF: generalized Mantel-Haenszel (MH), logistic regression, and item response theory (IRT). The SAS® System provides flexible procedures for all these approaches. There are two types of DIF: uniform DIF, which remains consistent across ability levels, and non-uniform DIF, which varies across ability levels. Generalized MH is a nonparametric method and is often used to detect uniform DIF, while the other two are parametric methods and examine both uniform and non-uniform DIF. In this study, I first describe the underlying theories and mathematical formulations for each method. Then I show the SAS statements, input data format, and SAS output for each method, followed by a detailed demonstration of the differences among the three methods. Specifically, PROC FREQ is used to calculate generalized MH only for dichotomous items. PROC LOGISTIC and PROC GENMOD are used to detect DIF by using logistic regression. PROC NLMIXED and PROC GLIMMIX are used to examine DIF by applying an exploratory item response theory model. Finally, I use SAS/IML® to call two R packages (that is, difR and lordif) to conduct DIF analysis and then compare the results between SAS procedures and R packages. An example data set, the Verbal Aggression assessment, which includes 316 subjects and 24 items, is used in this study. Following the general DIF analysis, the male group is used as the reference group, and the female group is used as the focal group. All the analyses are conducted with SAS® 9.3 and R 2.15.3. The paper closes with the conclusion that the SAS System provides different flexible and efficient ways to conduct DIF analysis. However, it is essential for SAS users to understand the underlying theories and assumptions of different DIF methods and apply them appropriately in their DIF analyses.
Read the paper (PDF).
Yan Zhang, Educational Testing Service
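Two of the three approaches can be sketched briefly; the data set RESPONSES, with a binary item ITEM1, a GENDER grouping variable, a TOTAL_SCORE ability measure, and an ABILITY_STRATUM formed from it, is hypothetical.

/* Generalized Mantel-Haenszel test of uniform DIF, stratified by ability */
proc freq data=responses;
   tables ability_stratum * gender * item1 / cmh;
run;

/* Logistic regression DIF: the GENDER term tests uniform DIF, the interaction non-uniform DIF */
proc logistic data=responses descending;
   class gender / param=ref;
   model item1 = total_score gender total_score*gender;
run;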
N
Paper SAS1866-2015:
Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?
Well, Hadoop community, now that you have your data in Hadoop, how are you staging your analytical base tables? In my discussions with clients about this, we all agree on one thing: Data sizes stored in Hadoop prevent us from moving that data to a different platform in order to generate the analytical base tables. To address this dilemma, I want to introduce to you the SAS® In-Database Code Accelerator for Hadoop.
Read the paper (PDF).
Steven Sober, SAS
Donna DeCapite, SAS
O
Paper 3296-2015:
Out of Control! A SAS® Macro to Recalculate QC Statistics
SAS/QC® provides procedures, such as PROC SHEWHART, to produce control charts with centerlines and control limits. When quality improvement initiatives create an out-of-control process of improvement, centerlines and control limits need to be recalculated. While this is not a complicated process, producing many charts with multiple centerline shifts can quickly become difficult. This paper illustrates the use of a macro to efficiently compute centerlines and control limits when one or more recalculations are needed for multiple charts.
Read the paper (PDF).
Jesse Pratt, Cincinnati Children's Hospital Medical Center
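For orientation, the basic chart and its stored limits look like the following; the data set MEASUREMENTS with the subgroup variable BATCH and response THICKNESS is hypothetical, and the paper's macro handles recomputing the limits when the process shifts.

proc shewhart data=measurements;
   xchart thickness * batch / outlimits=limits1;   /* save the centerline and control limits */
run;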
P
Paper 3516-2015:
Piecewise Linear Mixed Effects Models Using SAS
Evaluation of the impact of critical or high-risk events or periods in longitudinal studies of growth might provide clues to the long-term effects of life events and the efficacies of preventive and therapeutic interventions. Conventional linear longitudinal models typically involve a single growth profile to represent linear changes in an outcome variable across time, which sometimes does not fit the empirical data. Piecewise linear mixed-effects models allow different linear functions of time corresponding to the trends before and after the critical time point. This presentation shows: 1) how to fit piecewise linear mixed-effects models in SAS step by step, in the context of a clinical trial with two-arm interventions and a predictive covariate of interest; 2) how to obtain the slopes and corresponding p-values for the intervention and control groups during the pre- and post-critical periods, conditional on different values of the predictive covariate; and 3) how to make meaningful comparisons and present the results in a scientific manuscript. A SAS macro that generates summary tables to assist interpretation of the results is also provided.
Qinlei Huang, St Jude Children's Research Hospital
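A minimal sketch of the piecewise specification, assuming a hypothetical data set GROWTH with subject ID, treatment ARM, measurement time TIME (in months), outcome Y, and a critical point at month 12; the paper's models also condition on a predictive covariate.

data growth2;
   set growth;
   time_pre  = min(time, 12);        /* slope before the critical point */
   time_post = max(time - 12, 0);    /* additional slope after the critical point */
run;

proc mixed data=growth2;
   class id arm;
   model y = arm time_pre time_post arm*time_pre arm*time_post / solution;
   random intercept / subject=id;    /* subject-specific random intercept */
run;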
Paper 3307-2015:
Preparing Output from Statistical Procedures for Publication, Part 1: PROC REG to APA Format
Many scientific and academic journals require that statistical tables be created in a specific format, with one of the most common formats being that of the American Psychological Association (APA). The APA publishes a substantial guide book to writing and formatting papers, including an extensive section on creating tables (Nichol 2010). However, the output generated by SAS® procedures does not match this style. This paper discusses techniques to change the SAS procedure output to match the APA guidelines using SAS ODS (Output Delivery System).
Read the paper (PDF).
Vince DelGobbo, SAS
Peter Flom, Peter Flom Consulting
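The starting point for the reformatting is to capture the procedure's tables as data sets with ODS OUTPUT, as sketched below on the SASHELP.CLASS data set; the subsequent APA styling steps are described in the paper.

ods output ParameterEstimates=pe FitStatistics=fit;
proc reg data=sashelp.class;
   model weight = height age;
run;
quit;

proc print data=pe noobs label;   /* these columns can now be rounded and restyled */
run;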
Paper 2863-2015:
"Puck Pricing": Dynamic Hockey Ticket Price Optimization
Dynamic pricing is a real-time strategy where corporations attempt to alter prices based on varying market demand. The hospitality industry has been doing this for quite a while, altering prices significantly during the summer months or weekends when demand for rooms is at a premium. In recent years, the sports industry has started to catch on to this trend, especially within Major League Baseball (MLB). The purpose of this paper is to explore the methodology of applying this type of pricing to the hockey ticketing arena.
Read the paper (PDF).
Christopher Jones, Deloitte Consulting
Sabah Sadiq, Deloitte Consulting
Jing Zhao, Deloitte Consulting LLP
Paper 3382-2015:
Put Your Data on the Map
A bubble map can be a useful tool for identifying trends and visualizing the geographic proximity and intensity of events. This session shows how to use PROC GEOCODE and PROC GMAP to turn a data set of addresses and events into a map of the United States with scaled bubbles depicting the location and intensity of the events.
Read the paper (PDF).
Caroline Cutting, Warren Rogers Associates
R
Paper 1341-2015:
Random vs. Fixed Effects: Which Technique More Effectively Addresses Selection Bias in Observational Studies
Retrospective case-control studies are frequently used to evaluate health care programs when it is not feasible to randomly assign members to a respective cohort. Without randomization, observational studies are more susceptible to selection bias where the characteristics of the enrolled population differ from those of the entire population. When the participant sample is different from the comparison group, the measured outcomes are likely to be biased. Given this issue, this paper discusses how propensity score matching and random effects techniques can be used to reduce the impact selection bias has on observational study outcomes. All results shown are drawn from an ROI analysis using a participant (cases) versus non-participant (controls) observational study design for a fitness reimbursement program aiming to reduce health care expenditures of participating members.
Read the paper (PDF). | Download the data file (ZIP).
Jess Navratil-Strawn, Optum
Paper SAS1941-2015:
Real Time--Is That Your Final Decision?
Streaming data is becoming more and more prevalent. Everything is generating data now--social media, machine sensors, the 'Internet of Things'. And you need to decide what to do with that data right now. And 'right now' could mean 10,000 times or more per second. SAS® Event Stream Processing provides an infrastructure for capturing streaming data and processing it on the fly--including applying analytics and deciding what to do with that data. All in milliseconds. There are some basic tenets on how SAS® provides this extremely high-throughput, low-latency technology to meet whatever streaming analytics your company might want to pursue.
Read the paper (PDF). | Watch the recording.
Diane Hatcher, SAS
Jerry Baulier, SAS
Steve Sparano, SAS
Paper 3740-2015:
Risk-Adjusting Provider Performance Utilization Metrics
Pay-for-performance programs are putting increasing pressure on providers to better manage patient utilization through care coordination, with the philosophy that good preventive services and routine care can prevent the need for some high-resource services. Evaluation of provider performance frequently includes measures such as acute care events (ER and inpatient), imaging, and specialist services, yet rarely are these indicators adjusted for the underlying risk of providers' patient panel. In part, this is because standard patient risk scores are designed to predict costs, not the probability of specific service utilization. As such, Blue Cross Blue Shield of North Carolina has developed a methodology to model our members' risk of these events in an effort to ensure that providers are evaluated fairly and to prevent our providers from adverse selection practices. Our risk modeling takes into consideration members' underlying health conditions and limited demographic factors during the previous 12 month period, and employs two-part regression models using SAS® software. These risk-adjusted measures will subsequently be the basis of performance evaluation of primary care providers for our Accountable Care Organizations and medical home initiatives.
Read the paper (PDF).
Stephanie Poley, Blue Cross Blue Shield of North Carolina
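A simplified sketch of a two-part utilization model of the kind described above: a logistic model for whether a member has any ER visit and a count model for the number of visits among users. The data set PANEL and its variables are hypothetical, and the actual BCBSNC specification is more extensive.

/* Part 1: probability of any ER use during the year */
proc logistic data=panel descending;
   model any_er_visit = age risk_score condition_count;
   output out=part1 p=p_any;
run;

/* Part 2: expected number of visits among members with at least one visit */
proc genmod data=panel;
   where any_er_visit = 1;
   model er_visits = age risk_score condition_count / dist=negbin link=log;
   output out=part2 p=mu_pos;
run;

/* The risk-adjusted expectation is the product p_any * mu_pos for each member */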
S
Paper SAS4800-2015:
SAS Certification Overview
Join us for lunch as we discuss the benefits of being part of the elite group that is SAS Certified Professionals. The SAS Global Certification program has awarded more than 79,000 credentials to SAS users across the globe. Come listen to Terry Barham, Global Certification Manager, give an overview of the SAS Certification program, explain the benefits of becoming SAS certified and discuss exam preparation tips. This session will also include a Q&A section where you can get answers to your SAS Certification questions.
Paper SAS4283-2015:
SAS Retail Roadmap
The goal of this presentation is to give the user group an update on retail solution releases over the past year and the roadmap moving forward.
Saurabh Gupta, SAS
Paper SAS1940-2015:
SAS/STAT® 14.1: Methods for Massive, Missing, or Multifaceted Data
The latest release of SAS/STAT® software brings you powerful techniques that will make a difference in your work, whether your data are massive, missing, or somewhere in the middle. New imputation software for survey data adds to an expansive array of methods in SAS/STAT for handling missing data, as does the production version of the GEE procedure, which provides the weighted generalized estimating equation approach for longitudinal studies with dropouts. An improved quadrature method in the GLIMMIX procedure gives you accelerated performance for certain classes of models. The HPSPLIT procedure provides a rich set of methods for statistical modeling with classification and regression trees, including cross validation and graphical displays. The HPGENSELECT procedure adds support for spline effects and lasso model selection for generalized linear models. And new software implements generalized additive models by using an approach that handles large data easily. Other updates include key functionality for Bayesian analysis and pharmaceutical applications.
Read the paper (PDF).
Maura Stokes, SAS
Bob Rodriguez, SAS
Paper SAS1907-2015:
SAS® Data Management: Technology Options for Ensuring a Quality Journey Through the Data Management Process
When planning for a journey, one of the main goals is to get the best value possible. The same thing could be said for your corporate data as it journeys through the data management process. It is your goal to get the best data in the hands of decision makers in a timely fashion, with the lowest cost of ownership and the minimum number of obstacles. The SAS® Data Management suite of products provides you with many options for ensuring value throughout the data management process. The purpose of this session is to focus on how the SAS® Data Management solution can be used to ensure the delivery of quality data, in the right format, to the right people, at the right time. The journey is yours, the technology is ours--together, we can make it a fulfilling and rewarding experience.
Read the paper (PDF).
Mark Craver, SAS
Paper SAS2520-2015:
SAS® Does Data Science: How to Succeed in a Data Science Competition
First introduced in 2013, the Cloudera Data Science Challenge is a rigorous competition in which candidates must provide a solution to a real-world big data problem that surpasses a benchmark specified by some of the world's elite data scientists. The Cloudera Data Science Challenge 2 (in 2014) involved detecting anomalies in the United States Medicare insurance system. Finding anomalous patients, procedures, providers, and regions in the competition's large, complex, and intertwined data sets required industrial-strength tools for data wrangling and machine learning. This paper shows how I did it with SAS®.
Read the paper (PDF). | Download the data file (ZIP).
Patrick Hall, SAS
Paper SAS4083-2015:
SAS® Workshop: Data Mining
This workshop provides hands-on experience using SAS® Enterprise Miner. Workshop participants will learn to: open a project, create and explore a data source, build and compare models, and produce and examine score code that can be used for deployment.
Read the paper (PDF).
Chip Wells, SAS
Paper SAS4082-2015:
SAS® Workshop: Forecasting
This workshop provides hands-on experience using SAS® Forecast Server. Workshop participants will learn to: create a project with a hierarchy, generate multiple forecasts automatically, evaluate forecast accuracy, and build a custom model.
Read the paper (PDF).
Catherine Truxillo, SAS
George Fernandez, SAS
Terry Woodfield, SAS
Paper SAS4280-2015:
SAS® Workshop: SAS Data Loader for Hadoop
This workshop provides hands-on experience with SAS® Data Loader for Hadoop. Workshop participants will configure SAS Data Loader for Hadoop and use various directives inside SAS Data Loader for Hadoop to interact with data in the Hadoop cluster.
Read the paper (PDF).
Kari Richardson, SAS
Paper SAS4120-2015:
SAS® Workshop: SAS® Visual Analytics
This workshop provides hands-on experience with SAS® Visual Analytics. Workshop participants will explore data with SAS® Visual Analytics Explorer and design reports with SAS® Visual Analytics Designer.
Read the paper (PDF).
Nicole Ball, SAS
Paper SAS4081-2015:
SAS® Workshop: SAS® Visual Statistics 7.1
This workshop provides hands-on experience with SAS® Visual Statistics. Workshop participants will learn to: move between the Visual Analytics Explorer interface and Visual Statistics, fit automatic statistical models, perform exploratory statistical analysis, compare models using a variety of metrics, and create score code.
Read the paper (PDF).
Catherine Truxillo, SAS
Xiangxiang Meng, SAS
Mike Jenista, SAS
Paper SAS4084-2015:
SAS® Workshop: Text Analytics
This workshop provides hands-on experience using SAS® Text Miner. Workshop participants will learn to: read a collection of text documents and convert them for use by SAS Text Miner using the Text Import node, use the simple query language supported by the Text Filter node to extract information from a collection of documents, use the Text Topic node to identify the dominant themes and concepts in a collection of documents, and use the Text Rule Builder node to classify documents having pre-assigned categories.
Read the paper (PDF).
Terry Woodfield, SAS
Paper SAS1856-2015:
SAS® and SAP Business Warehouse on SAP HANA--What's in the Handshake?
Is your company using or considering using SAP Business Warehouse (BW) powered by SAP HANA? SAS® provides various levels of integration with SAP BW in an SAP HANA environment. This integration enables you to not only access SAP BW components from SAS, but to also push portions of SAS analysis directly into SAP HANA, accelerating predictive modeling and data mining operations. This paper explains the SAS toolset for different integration scenarios, highlights the newest technologies contributing to integration, and walks you through examples of using SAS with SAP BW on SAP HANA. The paper is targeted at SAS and SAP developers and architects interested in building a productive analytical environment with the help of the latest SAS and SAP collaborative advancements.
Read the paper (PDF).
Tatyana Petrova, SAS
Paper 2687-2015:
Selection and Transformation of Continuous Predictors for Logistic Regression
This paper discusses the selection and transformation of continuous predictor variables for the fitting of binary logistic models. The paper has two parts: (1) A procedure and associated SAS® macro are presented that can screen hundreds of predictor variables and 10 transformations of these variables to determine their predictive power for a logistic regression. The SAS macro passes the training data set twice to prepare the transformations and one more time through PROC TTEST. (2) The FSP (function selection procedure) and a SAS implementation of FSP are discussed. The FSP tests all transformations from among a class of FSP transformations and finds the one with maximum likelihood when fitting the binary target. In a 2008 book, Patrick Royston and Willi Sauerbrei popularized the FSP.
Read the paper (PDF).
Bruce Lund, Marketing Associates, LLC
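A compressed sketch of the screening idea: create candidate transformations and compare their separation of the binary target with PROC TTEST. The data set TRAIN, the target BAD, and the predictor X are hypothetical, and the full macro in the paper automates this over hundreds of predictors and ten transformations.

data trans;
   set train;
   x_log  = log(max(x, 1e-6));     /* guard against non-positive values */
   x_sqrt = sqrt(max(x, 0));
   x_sq   = x*x;
run;

proc ttest data=trans;
   class bad;                      /* the binary target defines the two groups */
   var x x_log x_sqrt x_sq;        /* larger |t| suggests stronger univariate separation */
run;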
Paper 3209-2015:
Standardizing the Standardization Process
A prevalent problem surrounding Extract, Transform, and Load (ETL) development is the ability to apply consistent logic and manipulation of source data when migrating to target data structures. Certain inconsistencies that add a layer of complexity include, but are not limited to, naming conventions and data types associated with multiple sources, numerous solutions applied by an array of developers, and multiple points of updating. In this paper, we examine the evolution of implementing a best practices solution during the process of data delivery, with respect to standardizing data. The solution begins with injecting the transformations of the data directly into the code at the standardized layer via Base SAS® or SAS® Enterprise Guide®. A more robust method that we explore is to apply these transformations with SAS® macros. This provides the capability to apply these changes in a consistent manner across multiple sources. We further explore this solution by implementing the macros within SAS® Data Integration Studio processes on the DataFlux® Data Management Platform. We consider these issues within the financial industry, but the proposed solution can be applied across multiple industries.
Read the paper (PDF).
Avery Long, Financial Risk Group
Frank Ferriola, Financial Risk Group
Paper SAS1864-2015:
Statistics for Gamers--Using SAS® Visual Analytics and SAS® Visual Statistics to Analyze World of Warcraft Logs
Video games used to be child's play. Today, millions of gamers of all ages kill countless in-game monsters and villains every day. Gaming is big business, and the data it generates is even bigger. Massive multi-player online games like World of Warcraft by Blizzard Entertainment not only generate data that Blizzard Entertainment can use to monitor users and their environments, but they can also be set up to log player data and combat logs client-side. Many users spend time analyzing their playing 'rotations' and use the information to adjust their playing style to deal more damage or, more appropriately, to heal themselves and other players. This paper explores World of Warcraft logs by using SAS® Visual Analytics and applies statistical techniques by using SAS® Visual Statistics to discover trends.
Mary Osborne, SAS
Adam Maness
Paper 1521-2015:
Sums of Squares: The Basics and a Surprise
Most 'Design of Experiment' textbooks cover Type I, Type II, and Type III sums of squares, but many researchers and statisticians fall into the habit of using one type mindlessly. This breakout session reviews the basics and illustrates the importance of the choice of type as well as the variable definitions in PROC GLM and PROC REG.
Read the paper (PDF).
Sheila Barron, University of Iowa
Michelle Mengeling, Comprehensive Access & Delivery Research & Evaluation-CADRE, Iowa City VA Health Care System
T
Paper SAS2604-2015:
Take a Bite Out of Crime with SAS® Visual Scenario Designer
The vast and increasing demands of fraud detection and description have promoted the broad application of statistics and machine learning in fields as diverse as banking, credit card application and usage, insurance claims, trader surveillance, health care claims, and government funding and allowance management. SAS® Visual Scenario Designer enables you to derive interactive business rules, along with descriptive and predictive models, to detect and describe fraud. This paper focuses on building interactive decision trees to classify fraud. Attention to optimizing the feature space (candidate predictors) prior to modeling is also covered. Because big data plays an increasingly vital role in fraud detection and description, SAS Visual Scenario Designer leverages the in-memory, parallel, and distributed computing abilities of SAS® LASR™ Analytic Server as a back end to support real-time performance on massive amounts of data.
Read the paper (PDF).
Yue Qi, SAS
Paper SAS1387-2015:
Ten Tips for Simulating Data with SAS®
Data simulation is a fundamental tool for statistical programmers. SAS® software provides many techniques for simulating data from a variety of statistical models. However, not all techniques are equally efficient. An efficient simulation can run in seconds, whereas an inefficient simulation might require days to run. This paper presents 10 techniques that enable you to write efficient simulations in SAS. Examples include how to simulate data from a complex distribution and how to use simulated data to approximate the sampling distribution of a statistic.
Read the paper (PDF). | Download the data file (ZIP).
Rick Wicklin, SAS
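In the spirit of the tips above, here is a small, efficient simulation that generates many samples in one DATA step and uses BY-group processing (rather than a macro loop) to approximate the sampling distribution of the mean; the distribution, sample size, and number of samples are arbitrary choices for illustration.

data sim;
   call streaminit(12345);              /* set the seed once per DATA step */
   do sampleid = 1 to 1000;             /* 1000 samples ... */
      do i = 1 to 50;                   /* ... of size 50 each */
         x = rand('Normal', 10, 2);
         output;
      end;
   end;
run;

proc means data=sim noprint;
   by sampleid;
   var x;
   output out=sampdist mean=xbar;       /* approximate sampling distribution of the mean */
run;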
Paper 3488-2015:
Text Analytics on Electronic Medical Record Data
This session describes our journey from data acquisition to text analytics on clinical, textual data.
Mark Pitts, Highmark Health
Paper 2920-2015:
Text Mining Kaiser Permanente Member Complaints with SAS® Enterprise Miner™
This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text Filtering, and Text Topic nodes were used to generate topics that described the complaints. Text mining results are reviewed (slightly modified for confidentiality), as well as conclusions and lessons learned from the project.
Read the paper (PDF).
Amanda Pasch, Kaiser Permanente
Paper 3860-2015:
The Challenges with Governing Big Data: How SAS® Can Help
In this session, I discuss an overall approach to governing Big Data. I begin with an introduction to Big Data governance and the governance framework. Then I address the disciplines of Big Data governance: data ownership, metadata, privacy, data quality, and master and reference data management. Finally, I discuss the reference architecture of Big Data, and how SAS® tools can address Big Data governance.
Sunil Soares, Information Asset
Paper 3060-2015:
The Knight's Tour in Chess--Implementing a Heuristic Solution
The knight's tour is a sequence of moves on a chess board such that a knight visits each square only once. Using a heuristic method, it is possible to find a complete path, beginning from any arbitrary square on the board and landing on the remaining squares only once. However, the implementation poses challenging programming problems. For example, it is necessary to discern viable knight moves, which change throughout the tour. Even worse, the heuristic approach does not guarantee a solution. This paper explains a SAS® solution that finds a knight's tour beginning from every initial square on a chess board...well, almost.
Read the paper (PDF).
John R Gerlach, Dataceutics, Inc.
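The heuristic referred to above is commonly Warnsdorff's rule: always move to the reachable square with the fewest onward moves. The paper's solution is written in Base SAS; the sketch below expresses the same rule in SAS/IML for brevity, with the starting square chosen arbitrarily, and, as the abstract notes, the heuristic can still dead-end for some starting squares.

proc iml;
   dx = {1  2  2  1 -1 -2 -2 -1};          /* the eight knight moves */
   dy = {2  1 -1 -2 -2 -1  1  2};

   /* number of onward moves available from square (r,c) */
   start degree(r, c, board) global(dx, dy);
      d = 0;
      do k = 1 to 8;
         nr = r + dy[k];  nc = c + dx[k];
         if nr>=1 & nr<=8 & nc>=1 & nc<=8 then
            if board[nr, nc] = 0 then d = d + 1;
      end;
      return(d);
   finish;

   board = j(8, 8, 0);
   r = 1;  c = 1;  board[r, c] = 1;        /* arbitrary starting square */
   do move = 2 to 64;
      best = .;  bestR = .;  bestC = .;
      do k = 1 to 8;                        /* Warnsdorff: pick the unvisited neighbor */
         nr = r + dy[k];  nc = c + dx[k];   /* with the fewest onward moves            */
         if nr>=1 & nr<=8 & nc>=1 & nc<=8 then
            if board[nr, nc] = 0 then do;
               d = degree(nr, nc, board);
               if best = . | d < best then do;
                  best = d;  bestR = nr;  bestC = nc;
               end;
            end;
      end;
      if best = . then do;
         print "Dead end reached at move" move;   /* the heuristic offers no guarantee */
         stop;
      end;
      r = bestR;  c = bestC;  board[r, c] = move;
   end;
   print board;                              /* move numbers 1 to 64 trace the tour */
quit;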
Paper SPON3000-2015:
The New Analytics Experience at SAS®--an Analytics Culture Driven by Millennials
This unique culture has access to lots of data, unstructured and structured; is innovative, experimental, and groundbreaking, and doesn't follow convention; and has access to powerful new infrastructure technologies and scalable, industry-standard computing power as never seen before. The convergence of data, an innovative spirit, and the means to process it is what makes this a truly unique culture. In response, SAS® proposes the New Analytics Experience. Attend this session to hear more about the New Analytics Experience and the latest Intel technologies that make it possible.
Mark Pallone, Intel
Paper 3741-2015:
The Spatio-Temporal Impact of Urgent Care Centers on Physician and ER Use
The unsustainable trend in healthcare costs has led to efforts to shift some healthcare services to less expensive sites of care. In North Carolina, the expansion of urgent care centers introduces the possibility that non-emergent and non-life threatening conditions can be treated at a less intensive care setting. BCBSNC conducted a longitudinal study of density of urgent care centers, primary care providers, and emergency departments, and the differences in how members access care near those locations. This talk focuses on several analytic techniques that were considered for the analysis. The model needed to account for the complex relationship between the changes in the population (including health conditions and health insurance benefits) and the changes in the types of services and supply of services offered by healthcare providers proximal to them. Results for the chosen methodology are discussed.
Read the paper (PDF).
Laurel Trantham, Blue Cross and Blue Shield North Carolina
Paper 3820-2015:
Time to Harvest: Operationalizing SAS® Analytics on the SAP HANA Platform
A maximum harvest in farming analytics is achieved only if analytics can also be operationalized at the level of core business applications. Mapped to the use of SAS® Analytics, the fruits of SAS can be shared with Enterprise Business Applications by SAP. Learn how your SAS environment, including the latest SAS® In-Memory Analytics, can be integrated with SAP applications based on the SAP In-Memory Platform SAP HANA. We explore how a SAS® Predictive Modeling environment can be embedded inside SAP HANA and how native SAP HANA data management capabilities such as SAP HANA Views, Smart Data Access, and more can be leveraged by SAS applications and contribute to an end-to-end in-memory data management and analytics platform. Come and see how you can extend the reach of your SAS® Analytics efforts with the SAP HANA integration!
Read the paper (PDF).
Morgen Christoph, SAP SE
Paper 3081-2015:
Tweet-O-Matic: An Automated Approach to Batch Processing of Tweets
Currently, there are several methods for reading JSON formatted files into SAS® that depend on the version of SAS and which products are licensed. These methods include user-defined macros, visual analytics, PROC GROOVY, and more. The user-defined macro %GrabTweet, in particular, provides a simple way to directly read JSON-formatted tweets into SAS® 9.3. The main limitation of %GrabTweet is that it requires the user to repeatedly run the macro in order to download large amounts of data over time. Manually downloading tweets while conforming to the Twitter rate limits might cause missing observations and is time-consuming overall. Imagine having to sit by your computer the entire day to continuously grab data every 15 minutes, just to download a complete data set of tweets for a popular event. Fortunately, the %GrabTweet macro can be modified to automate the retrieval of Twitter data based on the rate that the tweets are coming in. This paper describes the application of the %GrabTweet macro combined with batch processing to download tweets without manual intervention. Users can specify the phrase parameters they want, run the batch processing macro, leave their computer to automatically download tweets overnight, and return to a complete data set of recent Twitter activity. The batch processing implements an automated retrieval of tweets through an algorithm that assesses the rate of tweets for the specified topic in order to make downloading large amounts of data simpler and effortless for the user.
Read the paper (PDF).
Isabel Litton, California Polytechnic State University, SLO
Rebecca Ottesen, City of Hope and Cal Poly SLO
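A bare-bones version of the batch idea: wrap the tweet-downloading macro in a loop that pauses between calls to stay under the rate limit. The wrapper below is hypothetical, the %GrabTweet argument list is assumed (the real macro's parameters are described in the paper), and the fixed 15-minute pause is a simplification of the paper's adaptive, rate-based algorithm.

%macro batch_tweets(phrase=, cycles=4);
   %do k = 1 %to &cycles;
      %GrabTweet(&phrase)                 /* assumed call; see the paper for the real signature */
      %if &k < &cycles %then %do;
         data _null_;
            rc = sleep(900, 1);           /* pause 900 seconds; SLEEP arguments can vary by host */
         run;
      %end;
   %end;
%mend batch_tweets;

%batch_tweets(phrase=%str(sasglobalforum), cycles=4)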
U
Paper SAS1910-2015:
Unconventional Data-Driven Methodologies Forecast Performance in Unconventional Oil and Gas Reservoirs
How does historical production data tell a story about subsurface oil and gas reservoirs? Business and domain experts must perform accurate analysis of reservoir behavior using only rate and pressure data as a function of time. This paper introduces innovative data-driven methodologies to forecast oil and gas production in unconventional reservoirs that, owing to the nature of the tightness of the rocks, render the empirical functions less effective and accurate. You learn how implementations of the SAS® MODEL procedure provide functional algorithms that generate data-driven type curves on historical production data. Reservoir engineers can now gain more insight into the future performance of the wells across their assets. SAS enables a more robust forecast of the hydrocarbons, both in ad hoc individual well interactions and in an automated batch mode across the entire portfolio of wells. Examples of the MODEL procedure arising in subsurface production data analysis are discussed, including the Duong data model and the stretched exponential data model. In addressing these examples, techniques for pattern recognition and for implementing the TREE, CLUSTER, and DISTANCE procedures in SAS/STAT® are highlighted to explicate the importance of oil and gas well profiling to characterize the reservoir. The MODEL procedure analyzes models in which the relationships among the variables comprise a system of one or more nonlinear equations. Primary uses of the MODEL procedure are estimation, simulation, and forecasting of nonlinear simultaneous equation models, and generating type curves that fit the historical rate production data. You will walk through several advanced analytical methodologies that implement the SEMMA process to enable hypothesis testing as well as directed and undirected data mining techniques. SAS® Visual Analytics Explorer drives the exploratory data analysis to surface trends and relationships, and the data QC workflows ensure a robust input space for the performance forecasting methodologies that are visualized in a web-based thin client for interactive interpretation by reservoir engineers.
Read the paper (PDF).
Keith Holdaway, SAS
Louis Fabbi, SAS
Dan Lozie, SAS
Paper 1640-2015:
Understanding Characteristics of Insider Threats by Using Feature Extraction
This paper explores feature extraction from unstructured text variables using Term Frequency-Inverse Document Frequency (TF-IDF) weighting algorithms coded in Base SAS®. Data sets with unstructured text variables can often hold a lot of potential to enable better predictive analysis and document clustering. Each of these unstructured text variables can be used as inputs to build an enriched data set-specific inverted index, and the most significant terms from this index can be used as single word queries to weight the importance of the term to each document from the corpus. This paper also explores the usage of hash objects to build the inverted indices from the unstructured text variables. We find that hash objects provide a considerable increase in algorithm efficiency, and our experiments show that a novel weighting algorithm proposed by Paik (2013) best enables meaningful feature extraction. Our TF-IDF implementations are tested against a publicly available data breach data set to understand patterns specific to insider threats to an organization.
Read the paper (PDF). | Watch the recording.
Ila Gokarn, Singapore Management University
Clifton Phua, SAS
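A condensed sketch of the term-counting step with a DATA step hash object; it assumes a hypothetical data set CORPUS with one row per document (DOCID plus a long TEXT variable), and it stops at raw term frequencies, leaving the inverse document frequency and the Paik (2013) weighting to the full implementation described in the paper.

data tf(keep=docid term tf);
   length term $40 tf 8;
   if _n_ = 1 then do;
      declare hash h();
      h.defineKey('term');
      h.defineData('term', 'tf');
      h.defineDone();
      declare hiter hi('h');
   end;
   set corpus;
   rc = h.clear();                                   /* reset term counts for each document */
   i = 1;
   term = lowcase(scan(text, i, ' .,;:!?()'));
   do while (term ne '');
      if h.find() = 0 then tf = tf + 1;              /* existing term: increment its count */
      else tf = 1;                                   /* first occurrence in this document */
      rc = h.replace();
      i = i + 1;
      term = lowcase(scan(text, i, ' .,;:!?()'));
   end;
   rc = hi.first();                                  /* write one row per document-term pair */
   do while (rc = 0);
      output;
      rc = hi.next();
   end;
run;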
Paper 3333-2015:
Understanding Patient Populations in New Hampshire using SAS® Visual Analytics
The NH Citizens Health Initiative and the University of New Hampshire Institute for Health Policy and Practice, in collaboration with Accountable Care Project (ACP) participants, have developed a set of analytic reports to provide systems undergoing transformation a capacity to compare performance on the measures of quality, utilization, and cost across systems and regions. The purpose of these reports is to provide data and analysis on which our ACP learning collaborative can share knowledge and develop action plans that can be adopted by health-care innovators in New Hampshire. This breakout session showcases the claims-based reports, powered by SAS® Visual Analytics and driven by the New Hampshire Comprehensive Health Care Information System (CHIS), which includes commercial, Medicaid, and Medicare populations. With the power of SAS Visual Analytics, hundreds of pages of PDF files were distilled down to a manageable, dynamic, web-based portal that allows users to target information most appealing to them. This streamlined approach reduces barriers to obtaining information, offers that information in a digestible medium, and creates a better user experience. For more information about the ACP or to access the public reports, visit http://nhaccountablecare.org/.
Read the paper (PDF).
Danna Hourani, SAS
Paper 3408-2015:
Understanding Patterns in the Utilization and Costs of Elbow Reconstruction Surgeries: A Healthcare Procedure that is Common among Baseball Pitchers
Athletes in sports such as baseball and softball commonly undergo elbow reconstruction surgeries. Research suggests that the rate of elbow reconstruction surgeries among professional baseball pitchers continues to rise by leaps and bounds. Given the trend found among professional pitchers, the current study reviews patterns of elbow reconstruction surgery among the privately insured population. The study examined trends (for example, cost, age, geography, and utilization) in elbow reconstruction surgeries among privately insured patients using analytic tools such as SAS® Enterprise Guide® and SAS® Visual Analytics, based on medical and surgical claims data from the FAIR Health National Private Insurance Claims (NPIC) database. The findings of the study suggested that there are discernible patterns in the prevalence of elbow reconstruction surgeries over time and across specific geographic regions.
Read the paper (PDF). | Download the data file (ZIP).
Jeff Dang, FAIR Health
Paper 3141-2015:
Unstructured Data Mining to Improve Customer Experience in Interactive Voice Response Systems
Interactive Voice Response (IVR) systems are likely one of the best and worst gifts to the world of communication, depending on who you ask. Businesses love IVR systems because they eliminate hundreds of millions of dollars in call center costs by automating routine tasks, while consumers hate IVRs because they want to talk to an agent! It is a delicate balancing act to manage an IVR system that saves money for the business yet is smart enough to minimize consumer abrasion by knowing who the callers are and why they are calling, and by providing either an easy automated solution or a quick route to an agent. There are many aspects to designing such IVR systems, including engineering, application development, omni-channel integration, user interface design, and data analytics. For businesses with larger call volumes, IVRs generate terabytes of data per year, with hundreds of millions of rows per day that track all system and customer-facing events. The data is stored in various formats and is often unstructured (lengthy character fields that store API return information or text fields containing consumer utterances). The focus of this talk is the development of a SAS®-based data mining framework used to parse and analyze IVR data in order to provide insights into the usability of the application across various customer segments. Certain use cases are also provided.
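As a purely illustrative sketch (not the framework described in the talk), the DATA step below pulls structured fields out of a lengthy character variable holding key=value pairs; the data set IVR_EVENTS, the variable API_RETURN, and the keys INTENT and DISPOSITION are hypothetical.

/* Parse semicolon-delimited key=value pairs from an API return string */
data parsed;
   set ivr_events;
   length pair $200 key $32 value $200 intent $60 disposition $60;
   do i = 1 to countw(api_return, ';');
      pair  = scan(api_return, i, ';');
      key   = lowcase(strip(scan(pair, 1, '=')));
      value = strip(scan(pair, 2, '='));
      if key = 'intent' then intent = value;
      else if key = 'disposition' then disposition = value;
   end;
   drop i pair key value;
run;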
Read the paper (PDF).
Dmitriy Khots, West Corp
Paper 3339-2015:
Using Analytics To Help Win The Presidential Election
In 2012, the Obama campaign used advanced analytics to target voters, especially in social media channels. Millions of voters were scored on models each night to predict their voting patterns. These models were used as the driver for all campaign decisions, including TV ads, budgeting, canvassing, and digital strategies. This presentation covers how the Obama campaign strategies worked, what's in store for analytics in future elections, and how these strategies can be applied in the business world.
Read the paper (PDF). | Watch the recording.
Peter Tanner, Capital One
Paper 4640-2015:
Using Analytics to Become The USA Memory Champion
Becoming one of the best memorizers in the world doesn't happen overnight. With hard work, dedication, a bit of obsession, and the assistance of some clever analytics metrics, Nelson Dellis was able to climb to the top of the memory rankings in under a year and become the now three-time USA Memory Champion. In this talk, he explains what it takes to become the best at memory, what is involved in such grueling memory competitions, and how analytics helped him get there.
Nelson Dellis, Climb for Memory
Paper 2740-2015:
Using Heat Maps to Compare Clusters of Ontario DWI Drivers
SAS® PROC FASTCLUS generates five clusters for the group of repeat clients of Ontario's Remedial Measures program. Heat map tables are shown for selected variables, such as demographics, scales, factors, and drug use, to visualize the differences between clusters.
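A minimal sketch of the clustering step, with a hypothetical data set and input variables standing in for the program's actual measures:

/* Standardize the inputs, then request five clusters */
proc stdize data=remedial out=remedial_std method=std;
   var age drinks_per_week depression_scale;      /* illustrative inputs */
run;

proc fastclus data=remedial_std maxclusters=5 out=clustered;
   var age drinks_per_week depression_scale;
run;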
Rosely Flam-Zalcman, CAMH
Robert Mann, CAMH
Rita Thomas, CAMH
Paper 3208-2015:
Using SAS/STAT® Software to Implement a Multivariate Adaptive Outlier Detection Approach to Distinguish Outliers from Extreme Values
Hawkins (1980) defines an outlier as an observation that deviates so much from other observations as to arouse the suspicion that it was generated by a different mechanism. To identify data outliers, a classic multivariate outlier detection approach implements the Robust Mahalanobis Distance Method by splitting the distribution of distance values into two subsets (within-the-norm and out-of-the-norm), with the threshold value usually set to the 97.5% quantile of the chi-square distribution with p (number of variables) degrees of freedom; items whose distance values exceed this threshold are labeled out-of-the-norm. This threshold value is an arbitrary number, however, and it might flag as out-of-the-norm a number of items that are actually extreme values of the baseline distribution rather than outliers. Therefore, it is desirable to identify an additional threshold, a cutoff point that divides the set of out-of-norm points into two subsets: extreme values and outliers. One way to do this, in particular for larger databases, is to increase the threshold value to another arbitrary number, but this approach requires taking the size of the data set into consideration, since size affects the threshold separating outliers from extreme values. A 2003 article by Gervini (Journal of Multivariate Statistics) proposes an adaptive threshold that increases with the number of items n if the data is clean but remains bounded if there are outliers in the data. In 2005, Filzmoser, Garrett, and Reimann (Computers & Geosciences) built on Gervini's contribution to derive by simulation a relationship between the number of items n, the number of variables in the data p, and a critical ancillary variable for the determination of outlier thresholds. This paper implements the Gervini adaptive threshold value estimator by using PROC ROBUSTREG and the SAS® chi-square functions CINV and PROBCHI, available in the SAS/STAT® environment. It also provides data simulations to illustrate the reliability and the flexibility of the method in distinguishing true outliers from extreme values.
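A hedged sketch of the mechanics (the fixed 97.5% cutoff, not the full Gervini adaptive estimator): obtain robust MCD distances from PROC ROBUSTREG, then compare them to the chi-square-based threshold with CINV and PROBCHI. The data set INDATA, the variables Y and X1-X5, and the diagnostics column name ROBUSTDIST are assumptions to verify against your release.

proc robustreg data=indata method=lts;
   /* a response Y is required even though interest is in the leverage      */
   /* (multivariate outlier) diagnostics of the covariates                  */
   model y = x1-x5 / diagnostics leverage;
   ods output Diagnostics=diag;
run;

data flagged;
   set diag;
   p      = 5;                                  /* number of variables        */
   cutoff = sqrt(cinv(0.975, p));               /* classic fixed threshold    */
   out_of_norm = (robustdist > cutoff);         /* beyond the 97.5% quantile  */
   tail_prob   = 1 - probchi(robustdist**2, p); /* chi-square tail probability*/
run;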
Read the paper (PDF).
Paulo Macedo, Integrity Management Services, LLC
Paper 3202-2015:
Using SAS® Mapping Functionality to Measure and Present the Veracity of Location Data
Crowdsourcing of data is growing rapidly, enabled by smart devices equipped with assisted GPS location, tagging of photos, and mapping of other aspects of users' lives and activities. When such data is used, a fundamental assumption is made that the reported locations are accurate within the usual GPS limitation of approximately 10 m. However, as a result of a wide range of technical issues, it turns out that the accuracy of the reported locations is highly variable and cannot be relied on; some locations are accurate, but many are highly inaccurate, and that can affect many of the decisions being made based on the data. An analysis of a data set is presented that demonstrates that this assumption is flawed, along with examples of levels of inaccuracy that have significant consequences in a range of contexts. Using Base SAS®, the paper demonstrates the quality and veracity of the data and the scale of the errors that can be present. This analysis has critical significance in fields such as mobile location-based marketing, forensics, and law.
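As a hedged illustration (not the paper's code), the Base SAS step below quantifies the discrepancy between a reported position and a known reference position with a haversine great-circle distance; the data set LOCATIONS and its variables LAT, LON, REF_LAT, and REF_LON are assumptions, and the built-in GEODIST function is an alternative.

data location_error;
   set locations;
   r    = 6371000;                                 /* earth radius, metres    */
   dlat = (lat - ref_lat) * constant('pi') / 180;
   dlon = (lon - ref_lon) * constant('pi') / 180;
   a = sin(dlat/2)**2 +
       cos(ref_lat*constant('pi')/180) * cos(lat*constant('pi')/180) *
       sin(dlon/2)**2;
   error_m    = 2 * r * arsin(sqrt(a));            /* reported-location error  */
   within_10m = (error_m <= 10);                   /* the usual GPS assumption */
run;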
Read the paper (PDF).
Richard Self, University of Derby
Paper 3440-2015:
Using SAS® Text Analytics to Examine Gubernatorial Rhetoric and Mass Incarceration
Throughout the latter part of the twentieth century, the United States of America has experienced an incredible boom in the rate of incarceration of its citizens. This increase arguably began in the 1970s, when the Nixon administration oversaw the beginning of the war on drugs in America. The U.S. now has one of the highest rates of incarceration among industrialized nations. However, the citizens who have been incarcerated on drug charges have disproportionately been African American or other racial minorities, even though many studies have concluded that drug use is fairly equal among racial groups. In order to remedy this situation, it is essential to first understand why so many more people have been arrested and incarcerated. In this research, I explore a potential explanation for the epidemic of mass incarceration. I intend to answer the question: does gubernatorial rhetoric have an effect on the rate of incarceration in a state? More specifically, I am interested in examining the language that the governor of a state uses at the annual State of the State address in order to see if there is any correlation between rhetoric and the subsequent rate of incarceration in that state. In order to understand any possible correlation, I use SAS® Text Miner and SAS® Contextual Analysis to examine the attitude towards crime in each speech. The political phenomenon that I am trying to understand is how state government employees are affected by the tone that the chief executive of a state uses towards crime, and whether the actions of these state employees subsequently lead to higher rates of incarceration. The governor is the top government official in charge of employees of a state, so when this official addresses the state, the employees may take the governor's message as an order for how they do their jobs. While many political factors can affect legislation and its enforcement, a governor has the ability to set the tone of a state when it comes to policy issues such as crime.
Catherine Lachapelle, UNC Chapel Hill
Paper 3212-2015:
Using SAS® to Combine Regression and Time Series Analysis on U.S. Financial Data to Predict the Economic Downturn
During the financial crisis of 2007-2009, the U.S. labor market lost 8.4 million jobs, causing the unemployment rate to increase from 5% to 9.5%. One of the indicators of economic recession is negative gross domestic product (GDP) for two consecutive quarters. This poster combines quantitative and qualitative techniques to predict the economic downturn by forecasting recession probabilities. Data was collected from the Board of Governors of the Federal Reserve System and the Federal Reserve Bank of St. Louis, containing 29 variables and quarterly observations from 1976-Q1 to 2013-Q3. Eleven variables were selected as inputs based on their effects on recession and to limit multicollinearity: long-term Treasury yields (5-year and 10-year), mortgage rate, CPI inflation rate, prime rate, market volatility index, BBB corporate bond yield, house price index, stock market index, commercial real estate price index, and one calculated variable, the yield spread (Treasury yield-curve spread). The target variable was a binary variable depicting the economic recession for each quarter (1 = recession). Data was prepared for modeling by applying imputation and transformation to variables. A two-step analysis was used to forecast recession probabilities for the short term. Predicted recession probabilities were first obtained from a backward elimination logistic regression model that was selected on the basis of misclassification (validation misclassification = 0.115). These probabilities were then forecast using an exponential smoothing method that was selected on the basis of mean absolute error (MAE = 11.04). Results show the recession periods, including the great recession of 2008, and the forecast for eight quarters (up to 2015-Q3).
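A hedged sketch of the two-step approach, with hypothetical data set and variable names standing in for the inputs listed above; QTR is assumed to be a SAS date variable identifying the quarter.

/* Step 1: backward elimination logistic regression for recession probability */
proc logistic data=macro_qtr;
   model recession(event='1') = treas_5yr treas_10yr mortgage_rate
         cpi_inflation prime_rate volatility_idx bbb_yield house_price_idx
         stock_idx cre_price_idx yield_spread
         / selection=backward slstay=0.05;
   output out=scored p=p_recession;
run;

/* Step 2: forecast the predicted probabilities eight quarters ahead */
proc esm data=scored out=fcst lead=8;
   id qtr interval=qtr;
   forecast p_recession / model=simple;
run;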
Read the paper (PDF).
Avinash Kalwani, Oklahoma State University
Nishant Vyas, Oklahoma State University
Paper 3508-2015:
Using Text from Repair Tickets of a Truck Manufacturing Company to Predict Factors that Contribute to Truck Downtime
In this era of big data, the use of text analytics to discover insights is rapidly gaining popularity in businesses. On average, more than 80 percent of the data in enterprises may be unstructured. Text analytics can help discover key insights and extract useful topics and terms from the unstructured data. The objective of this paper is to build a model using textual data that predicts the factors that contribute to downtime of a truck. This research analyzes the data of over 200,000 repair tickets of a leading truck manufacturing company. After the terms were grouped into fifteen key topics using the text topic node of SAS® Text Miner, a regression model was built using these topics to predict truck downtime, the target variable. Data was split into training and validation for developing the predictive models. Knowledge of the factors contributing to downtime and their associations helped the organization to streamline their repair process and improve customer satisfaction.
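The modeling step might look like the hedged sketch below, in which the data set TICKETS, the topic score variables TOPIC1-TOPIC15, and the target DOWNTIME_HOURS are assumptions rather than the authors' actual names.

/* Regress truck downtime on the fifteen text topic scores */
proc reg data=tickets;
   model downtime_hours = topic1-topic15;
run;
quit;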
Read the paper (PDF).
Ayush Priyadarshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Paper 3644-2015:
Using and Understanding the LSMEANS and LSMESTIMATE Statements
The concept of least squares means, or population marginal means, seems to confuse a lot of people. We explore least squares means as implemented by the LSMEANS statement in SAS®, beginning with the basics. Particular emphasis is placed on the effect of alternative parameterizations (for example, whether binary variables are in the CLASS statement) and the effect of the OBSMARGINS option. We use examples to show how to mimic LSMEANS using ESTIMATE statements and the advantages of the relatively new LSMESTIMATE statement. The basics of estimability are discussed, including how to get around the dreaded non-estimable messages. Emphasis is placed on using the STORE statement and PROC PLM to test hypotheses without having to redo all the model calculations. This material is appropriate for all levels of SAS experience, but some familiarity with linear models is assumed.
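A minimal sketch, with a hypothetical data set and variable names, of the statements discussed above: LSMEANS with OBSMARGINS, LSMESTIMATE for a custom contrast of least squares means, and STORE plus PROC PLM so that hypotheses can be tested later without refitting the model.

proc glimmix data=study;
   class treatment sex;
   model outcome = treatment sex treatment*sex;
   lsmeans treatment / obsmargins;                       /* observed margins */
   lsmestimate treatment 'A vs average of B and C' 2 -1 -1 / divisor=2;
   store glmmfit;                                        /* save the fit     */
run;

proc plm restore=glmmfit;                                /* no refitting     */
   lsmeans treatment / diff;
run;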
Read the paper (PDF). | Watch the recording.
David Pasta, ICON Clinical Research
Paper 1335-2015:
Using the GLIMMIX and GENMOD Procedures to Analyze Longitudinal Data from a Department of Veterans Affairs Multisite Randomized Controlled Trial
Many SAS® procedures can be used to analyze longitudinal data. This study employed a multisite randomized controlled trial design to demonstrate the effectiveness of two SAS procedures, GLIMMIX and GENMOD, in analyzing longitudinal data from five Department of Veterans Affairs Medical Centers (VAMCs). Older male veterans (n = 1222) seen in VAMC primary care clinics were randomly assigned to two behavioral health models, integrated (n = 605) and enhanced referral (n = 617). Data was collected at baseline and at 3-, 6-, and 12-month follow-up. A mixed-effects repeated measures model was used to examine the dependent variable, problem drinking, which was defined as both a count and a dichotomous variable from baseline to 12-month follow-up. Sociodemographics and depressive symptoms were included as covariates. First, bivariate analyses included general linear model and chi-square tests to examine covariates by group and group by problem-drinking outcomes. All significant covariates were included in the GLIMMIX and GENMOD models. Then, multivariate analysis included mixed models with Generalized Estimating Equations (GEEs). The effects of group, time, and the group-by-time interaction were examined after controlling for covariates. Multivariate results were inconsistent between GLIMMIX and GENMOD using lognormal, Gaussian, Weibull, and gamma distributions. SAS is a powerful statistical program for the analysis of longitudinal data.
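Hedged sketches of the two procedures compared in the paper, using hypothetical data set and variable names and a negative binomial distribution for the count outcome as one example; PROC GENMOD fits a marginal GEE model, and PROC GLIMMIX fits a mixed model with a random subject intercept.

proc genmod data=drinking;
   class group time subject_id;
   model drinks = group time group*time age depression / dist=negbin link=log;
   repeated subject=subject_id / type=exch;   /* GEE working correlation     */
run;

proc glimmix data=drinking;
   class group time subject_id;
   model drinks = group time group*time age depression / dist=negbin link=log;
   random intercept / subject=subject_id;     /* subject-specific intercept  */
run;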
Read the paper (PDF).
Abbas Tavakoli, University of South Carolina/College of Nursing
Marlene Al-Barwani, University of South Carolina
Sue Levkoff, University of South Carolina
Selina McKinney, University of South Carolina
Nikki Wooten, University of South Carolina
Paper SAS1855-2015:
Using the PHREG Procedure to Analyze Competing-Risks Data
Competing risks arise in studies in which individuals are subject to a number of potential failure events and the occurrence of one event might impede the occurrence of other events. For example, after a bone marrow transplant, a patient might experience a relapse or might die while in remission. You can use one of the standard methods of survival analysis, such as the log-rank test or Cox regression, to analyze competing-risks data, whereas other methods, such as the product-limit estimator, might yield biased results. An increasingly common practice of assessing the probability of a failure in competing-risks analysis is to estimate the cumulative incidence function, which is the probability subdistribution function of failure from a specific cause. This paper discusses two commonly used regression approaches for evaluating the relationship of the covariates to the cause-specific failure in competing-risks data. One approach models the cause-specific hazard, and the other models the cumulative incidence. The paper shows how to use the PHREG procedure in SAS/STAT® software to fit these models.
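The two approaches can be requested as follows; this sketch uses the bone marrow transplant framing from the abstract with hypothetical data set and variable names, where STATUS = 1 for relapse, 2 for death in remission, and 0 for censored.

/* Cause-specific hazard model for relapse: competing events are censored */
proc phreg data=bmt;
   class treatment;
   model time*status(0, 2) = treatment age;
run;

/* Fine and Gray model for the cumulative incidence of relapse */
proc phreg data=bmt;
   class treatment;
   model time*status(0) = treatment age / eventcode=1;
run;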
Read the paper (PDF).
Ying So, SAS
Paper 3376-2015:
Using the SAS-PIRT Macro for Estimating the Parameters of Polytomous Items
Polytomous items have been widely used in educational and psychological settings. As a result, the demand for statistical programs that estimate the parameters of polytomous items has been increasing. For this purpose, Samejima (1969) proposed the graded response model (GRM), in which category characteristic curves are characterized by the difference of the two adjacent boundary characteristic curves. In this paper, we show how the SAS-PIRT macro (a SAS® macro written in SAS/IML®) was developed based on the GRM and how it performs in recovering the parameters of polytomous items using simulated data.
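As a hedged illustration (a SAS/IML sketch, not the SAS-PIRT macro itself), the module below builds graded response model category probabilities by differencing adjacent boundary characteristic curves; the discrimination and boundary parameters are illustrative values.

proc iml;
   start grm_probs(theta, a, b);
      /* boundary curves P*_k = 1 / (1 + exp(-a*(theta - b_k)))             */
      pstar  = 1 / (1 + exp(-a # (theta - b)));
      /* pad with P*_0 = 1 and P*_m = 0, then difference adjacent curves    */
      bounds = 1 || pstar || 0;
      return ( bounds[, 1:ncol(bounds)-1] - bounds[, 2:ncol(bounds)] );
   finish;

   a     = 1.2;                      /* item discrimination                 */
   b     = {-1.0 0.0 1.5};           /* boundaries for a 4-category item    */
   theta = 0.5;                      /* examinee ability                    */
   print ( grm_probs(theta, a, b) ); /* category probabilities sum to 1     */
quit;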
Read the paper (PDF).
Sung-Hyuck Lee, ACT, Inc.
V
Paper SAS1888-2015:
Visualizing Clinical Trial Data: Small Data, Big Insights
Data visualization is synonymous with big data, for which billions of records and millions of variables are analyzed simultaneously. But that does not mean data scientists analyzing clinical trial data that include only thousands of records and hundreds of variables cannot take advantage of data visualization methodologies. This paper presents a comprehensive process for loading standard clinical trial data into SAS® Visual Analytics, an interactive analytic solution. The process implements template reporting for a wide variety of point-and-click visualizations. Data operations required to support this reporting are explained and examples of the actual visualizations are presented so that users can implement this reporting using their own data.
Read the paper (PDF).
Michael Drutar, SAS
Elliot Inman, SAS
Paper 3486-2015:
Visualizing Student Enrollment Trends Compared across Calendar Periods and Grouped by Categories with SAS® Visual Analytics
Enrollment management is very important to all colleges. Having the correct tools to help you better understand your enrollment patterns of the past and the future is critical to any school. This session describes how Valencia College went from manually updating static charts for enrollment management to building dynamic, interactive visualizations that compare how students register across different calendar-date periods (current versus previous period), grouped by different start-of-registration dates--from start of registration, days into registration, and calendar date to previous-year calendar date. This includes being able to see the trend by college campus, instructional method (onsite or online), or type of session (part of semester, full, and so on), all available in one visual and sliced and diced via checklists. The trend data, 4-6 million rows, loads nightly to the SAS® LASR™ Analytics Server in a snap, with no performance issues on the back end or in the presentation visual. We will give a brief history of how we used to load data into Excel and manually build charts. Then we will describe the current environment, which is an automated approach through SAS® Visual Analytics. We will show pictures of our old, static reports and then show the audience the power and functionality of our new, interactive reports through SAS Visual Analytics.
Read the paper (PDF).
Juan Olvera, Valencia College
W
Paper SAS1440-2015:
Want an Early Picture of the Data Quality Status of Your Analysis Data? SAS® Visual Analytics Shows You How
When you are analyzing your data and building your models, you often find out that the data cannot be used in the intended way. Systematic patterns, incomplete data, and inconsistencies from a business point of view are often the reasons. You wish you could get a complete picture of the quality status of your data much earlier in the analytic lifecycle. SAS® analytics tools like SAS® Visual Analytics help you to profile and visualize the quality status of your data in an easy and powerful way. In this session, you learn advanced methods for analytic data quality profiling. You will see case studies based on real-life data, where we look at time series data from a bird's-eye view and interactively profile GPS trackpoint data from a sail race.
Read the paper (PDF). | Download the data file (ZIP).
Gerhard Svolba, SAS
Paper 3600-2015:
When Two Are Better Than One: Fitting Two-Part Models Using SAS
In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling health care costs, some patients have zero costs, and the distribution of positive costs is often extremely right-skewed. When modeling charitable donations, many potential donors give nothing, and the majority of donations are relatively small, with a few very large donors. In the analysis of count data, there are also times when there are more zeros than would be expected using standard methodology, or cases where the zeros might differ substantially from the non-zeros, such as the number of cavities a patient has at a dentist appointment or the number of children born to a mother. If data have such structure and ordinary least squares methods are used, then predictions and estimation might be inaccurate. The two-part model gives us a flexible and useful modeling framework in many situations. Methods for fitting these models with SAS® software are illustrated.
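A hedged sketch of a two-part model for right-skewed cost data, using hypothetical data set and variable names: part 1 models the probability of any cost, and part 2 models the size of the cost among those with positive costs, here with a gamma distribution and log link.

data costs2;
   set costs;
   any_cost = (cost > 0);                    /* part 1 target               */
run;

proc logistic data=costs2;
   class sex;
   model any_cost(event='1') = age sex chronic_conditions;
run;

proc genmod data=costs2;
   where cost > 0;                           /* positive costs only         */
   class sex;
   model cost = age sex chronic_conditions / dist=gamma link=log;
run;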
Read the paper (PDF).
Laura Kapitula, Grand Valley State University