If are independent binary (Bernoulli) random variables with common success probability , then their sum is a binomial random variable. In other words, a binomial random variable with parameters and can be generated as the sum of Bernoulli() random experiments. The HPLOGISTIC procedure uses a special syntax to express data in binomial form, the events/trials syntax.
Consider the following data, taken from Cox and Snell (1989, pp. 10–11), of the number, r
, of ingots not ready for rolling, out of n
tested, for a number of combinations of heating time and soaking time. If each test is carried out independently and if for
a particular combination of heating and soaking time there is a constant probability that the tested ingot is not ready for
rolling, then the random variable follows a Binomial distribution, where the success probability is a function of heating and soaking time.
data Ingots; input Heat Soak r n @@; Obsnum= _n_; datalines; 7 1.0 0 10 14 1.0 0 31 27 1.0 1 56 51 1.0 3 13 7 1.7 0 17 14 1.7 0 43 27 1.7 4 44 51 1.7 0 1 7 2.2 0 7 14 2.2 2 33 27 2.2 0 21 51 2.2 0 1 7 2.8 0 12 14 2.8 0 31 27 2.8 1 22 51 4.0 0 1 7 4.0 0 9 14 4.0 0 19 27 4.0 1 16 ;
The following statements show the use of the events/trials syntax to model the binomial response. The events variable in this situation is r
, the number of ingots not ready for rolling, and the trials variable is n
, the number of ingots tested. The dependency of the probability of not being ready for rolling is modeled as a function of
heating time, soaking time, and their interaction. The OUTPUT statement stores the linear predictors and the predicted probabilities in the Out
data set along with the ID variable.
proc hplogistic data=Ingots; model r/n = Heat Soak Heat*Soak; id Obsnum; output out=Out xbeta predicted=Pred; run;
The “Performance Information” table in Output 5.2.1 shows that the procedure executes in single-machine mode. The example is executed on a single machine with the same number of cores as the number of threads used; that is, one computational thread was spawned per CPU.
Output 5.2.1: Performance Information
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
The “Model Information” table shows that the data are modeled as binomially distributed with a logit link function (Output 5.2.2). This is the default link function in the HPLOGISTIC procedure for binary and binomial data. The procedure estimates the parameters of the model by a Newton-Raphson algorithm.
Output 5.2.2: Model Information and Number of Observations
Model Information | |
---|---|
Data Source | WORK.INGOTS |
Response Variable (Events) | r |
Response Variable (Trials) | n |
Distribution | Binomial |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Number of Observations Read | 19 |
---|---|
Number of Observations Used | 19 |
Number of Events | 12 |
Number of Trials | 387 |
The second table in Output 5.2.2 shows that all 19 observations in the data set were used in the analysis, and that the total number of events and trials
equal 12 and 387, respectively. These are the sums of the variables r
and n
across all observations.
Output 5.2.3 displays the “Iteration History” and convergence status tables for this run. The HPLOGISTIC procedure converged after four iterations (not counting the initial setup iteration) and meets the GCONV= convergence criterion.
Output 5.2.3: Iteration History and Convergence Status
Iteration History | ||||
---|---|---|---|---|
Iteration | Evaluations | Objective Function |
Change | Max Gradient |
0 | 4 | 0.7676329445 | . | 6.378002 |
1 | 2 | 0.7365832479 | 0.03104970 | 0.754902 |
2 | 2 | 0.7357086248 | 0.00087462 | 0.023623 |
3 | 2 | 0.7357075299 | 0.00000109 | 0.00003 |
4 | 2 | 0.7357075299 | 0.00000000 | 5.42E-11 |
Convergence criterion (GCONV=1E-8) satisfied. |
Output 5.2.4 displays the “Dimensions” table for the model. There are four columns in the design matrix of the model (the matrix); they correspond to the intercept, the Heat
effect, the Soak
effect, and the interaction of the Heat
and Soak
effects. The model is nonsingular, since the rank of the crossproducts matrix equals the number of columns in . All parameters are estimable and participate in the optimization.
Output 5.2.4: Dimensions in Binomial Logistic Regression
Dimensions | |
---|---|
Columns in X | 4 |
Number of Effects | 4 |
Max Effect Columns | 1 |
Rank of Cross-product Matrix | 4 |
Parameters in Optimization | 4 |
Output 5.2.5 displays the “Fit Statistics” table for this run. Evaluated at the converged estimates, –2 times the value of the log-likelihood function equals 27.9569. Further fit statistics are also given, all of them in “smaller is better” form. The AIC, AICC, and BIC criteria are used to compare non-nested models and to penalize the model fit for the number of observations and parameters. The –2 log-likelihood value can be used to compare nested models by way of a likelihood ratio test.
Output 5.2.5: Fit Statistics
Fit Statistics | |
---|---|
-2 Log Likelihood | 27.9569 |
AIC (smaller is better) | 35.9569 |
AICC (smaller is better) | 38.8140 |
BIC (smaller is better) | 39.7346 |
Output 5.2.6 shows the test of the global hypothesis that the effects jointly do not impact the probability of ingot readiness. The chi-square
test statistic can be obtained by comparing the –2 log-likelihood value of the model with covariates to the value in the intercept-only
model. The test is significant with a p-value of 0.0082. One or more of the effects in the model have a significant impact on the probability of ingot readiness.
Output 5.2.6: Null Test
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 11.7663 | 3 | 0.0082 |
The “Parameter Estimates” table in Output 5.2.7 displays the estimates and standard errors of the model effects.
Output 5.2.7: Parameter Estimates
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error |
DF | t Value | Pr > |t| |
Intercept | -5.9902 | 1.6666 | Infty | -3.59 | 0.0003 |
Heat | 0.09634 | 0.04707 | Infty | 2.05 | 0.0407 |
Soak | 0.2996 | 0.7551 | Infty | 0.40 | 0.6916 |
Heat*Soak | -0.00884 | 0.02532 | Infty | -0.35 | 0.7270 |
You can construct the prediction equation of the model from the “Parameter Estimates” table. For example, an observation with Heat
equal to 14 and Soak
equal to 1.7 has linear predictor
The probability that an ingot with these characteristics is not ready for rolling is
The OUTPUT statement computes these linear predictors and probabilities and stores them in the Out
data set. This data set also contains the ID variable, which is used by the following statements to attach the covariates
to these statistics. Output 5.2.8 shows the probability that an ingot with Heat
equal to 14 and Soak
equal to 1.7 is not ready for rolling.
data Out; merge Out Ingots; by Obsnum; proc print data=Out; where Heat=14 & Soak=1.7; run;
Output 5.2.8: Predicted Probability for Heat=14 and Soak=1.7
Obs | Obsnum | Pred | Xbeta | Heat | Soak | r | n |
---|---|---|---|---|---|---|---|
6 | 6 | 0.012836 | -4.34256 | 14 | 1.7 | 0 | 43 |
Binomial data are a form of grouped binary data where “successes” in the underlying Bernoulli trials are totaled. You can thus unwind data for which you use the events/trials syntax and fit it with techniques for binary data.
The following DATA step expands the Ingots
data set with 12 events in 387 trials into a binary data set with 387 observations.
data Ingots_binary; set Ingots; do i=1 to n; if i <= r then y=1; else y = 0; output; end; run;
The following HPLOGISTIC statements fit the model with Heat
effect, Soak
effect, and their interaction to the binary data set. The event=’1’
response-variable option in the MODEL statement ensures that the HPLOGISTIC procedure models the probability that the variable y
takes on the value ('1').
proc hplogistic data=Ingots_binary; model y(event='1') = Heat Soak Heat*Soak; run;
Output 5.2.9 displays the “Performance Information”, “Model Information,” “Number of Observations,” and the “Response Profile” tables. The data are now modeled as binary (Bernoulli distributed) with a logit link function. The “Response Profile” table shows that the binary response breaks down into 375 observations where y
equals zero and 12 observations where y
equals 1.
Output 5.2.9: Model Information in Binary Model
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
Model Information | |
---|---|
Data Source | WORK.INGOTS_BINARY |
Response Variable | y |
Distribution | Binary |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Number of Observations Read | 387 |
---|---|
Number of Observations Used | 387 |
Response Profile | ||
---|---|---|
Ordered Value |
y | Total Frequency |
1 | 0 | 375 |
2 | 1 | 12 |
You are modeling the probability that y='1'. |
Output 5.2.10 displays the result for the test of the global null hypothesis and the parameter estimates. These results match those in Output 5.2.6 and Output 5.2.7.
Output 5.2.10: Null Test and Parameter Estimates
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 11.7663 | 3 | 0.0082 |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error |
DF | t Value | Pr > |t| |
Intercept | -5.9902 | 1.6666 | Infty | -3.59 | 0.0003 |
Heat | 0.09634 | 0.04707 | Infty | 2.05 | 0.0407 |
Soak | 0.2996 | 0.7551 | Infty | 0.40 | 0.6916 |
Heat*Soak | -0.00884 | 0.02532 | Infty | -0.35 | 0.7270 |