The PROBIT Procedure

Example 75.4 An Epidemiology Study

The data in this example, which are from an epidemiology study, consist of five variables: the number, r, of individuals surviving after an epidemic, out of n treated, for combinations of medicine dosage (dose), treatment (treat = A, B), and sex (sex = 0(Female), 1(Male)).

To see whether the two treatments have different effects on male and female individual survival rates, the interaction term between the two variables treat and sex is included in the model.

The following invocation of PROC PROBIT fits the binary probit model to the grouped data:

data epidemic;
   input treat$ dose n r sex @@;
   label dose = Dose;
   datalines;
A  2.17 142 142  0   A   .57 132  47  1
A  1.68 128 105  1   A  1.08 126 100  0
A  1.79 125 118  0   B  1.66 117 115  1
B  1.49 127 114  0   B  1.17  51  44  1
B  2.00 127 126  0   B   .80 129 100  1
;

data xval;
   input treat $ dose sex;
   datalines;
B  2.  1
;

proc probit optc lackfit covout data=epidemic
            outest = out1 xdata = xval
            Plots=(predpplot ippplot lpredplot);
   class treat sex;
   model r/n = dose treat sex sex*treat/corrb covb inversecl;
   output out = out2 p =p;
run;

The results of this analysis are shown in the outputs that follow.

Output 75.4.1 displays the table of level information for all classification variables in the CLASS statement.

Output 75.4.1: Class Level Information

The Probit Procedure

Class Level Information
Name	Levels	Values
treat	2	A B
sex	2	0 1

Output 75.4.2 displays the table of parameter information for the effects in the MODEL statement.

Output 75.4.2: Parameter Information

Parameter Information
Parameter	Effect	treat	sex
Intercept	Intercept
dose	dose
treatA	treat	A
treatB	treat	B
sex0	sex		0
sex1	sex		1
treatAsex0	treat*sex	A	0
treatAsex1	treat*sex	A	1
treatBsex0	treat*sex	B	0
treatBsex1	treat*sex	B	1

Output 75.4.3 displays background information about the model fit. Included are the name of the input data set, the response variables used, the numbers of observations, events, and trials, the type of distribution, and the final value of the log-likelihood function.

Output 75.4.3: Model Information

The Probit Procedure

Model Information
Data Set	WORK.EPIDEMIC
Events Variable	r
Trials Variable	n
Number of Observations	10
Number of Events	1011
Number of Trials	1204
Name of Distribution	Normal
Log Likelihood	-387.2467391

Output 75.4.4 displays the table of goodness-of-fit tests requested with the LACKFIT option in the PROC PROBIT statement. Two goodness-of-fit statistics, the Pearson’s chi-square statistic and the likelihood ratio chi-square statistic, are computed. The grouping method for computing these statistics can be specified by the AGGREGATE= option. The details can be found in the AGGREGATE= option, and an example can be found in the second part of this example. By default, the PROBIT procedure uses the covariates in the MODEL statement to do grouping. Observations with the same values of the covariates in the MODEL statement are grouped into cells and the two statistics are computed according to these cells. The total number of cells and the number of levels for the response variable are reported next in the “Response-Covariate Profile.”

In this example, neither the Pearson’s chi-square nor the log-likelihood ratio chi-square tests are significant at the 0.1 level, which is the default test level used by the PROBIT procedure. That means that the model, which includes the interaction of treat and sex, is suitable for this epidemiology data set. (Further investigation shows that models without the interaction of treat and sex are not acceptable by either test.)

Output 75.4.4: Goodness-of-Fit Tests and Response-Covariate Profile

Goodness-of-Fit Tests
Statistic	Value	DF	Value/DF	Pr > ChiSq
Pearson Chi-Square	4.9317	4	1.2329	0.2944
L.R. Chi-Square	5.7079	4	1.4270	0.2220

Response-Covariate Profile
Response Levels	2
Number of Covariate Values	10

Output 75.4.5 displays the Type III test results for all effects specified in the MODEL statement, which include the degrees of freedom for the effect, the Wald Chi-Square test statistic, and the p-value.

Output 75.4.5: Type III Tests

Type III Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
dose	1	42.1691	<.0001
treat	1	16.1421	<.0001
sex	1	1.7710	0.1833
treat*sex	1	13.9343	0.0002

Output 75.4.6 displays the table of parameter estimates for the model. The PROBIT procedure displays information for all the parameters of an effect. Degenerate parameters are indicated by 0 degree of freedom. Confidence intervals are computed for all parameters with nonzero degrees of freedom, including the natural threshold C if the OPTC option is specified in the PROC PROBIT statement. The confidence level can be specified by the ALPHA= option in the MODEL statement. The default confidence level is 95%.

Output 75.4.6: Analysis of Parameter Estimates

Analysis of Maximum Likelihood Parameter Estimates
Parameter			DF	Estimate	Standard Error	95% Confidence Limits		Chi-Square	Pr > ChiSq
Intercept			1	-0.8871	0.3632	-1.5991	-0.1752	5.96	0.0146
dose			1	1.6774	0.2583	1.1711	2.1837	42.17	<.0001
treat	A		1	-1.2537	0.2616	-1.7664	-0.7410	22.97	<.0001
treat	B		0	0.0000	.	.	.	.	.
sex	0		1	-0.4633	0.2289	-0.9119	-0.0147	4.10	0.0429
sex	1		0	0.0000	.	.	.	.	.
treat*sex	A	0	1	1.2899	0.3456	0.6126	1.9672	13.93	0.0002
treat*sex	A	1	0	0.0000	.	.	.	.	.
treat*sex	B	0	0	0.0000	.	.	.	.	.
treat*sex	B	1	0	0.0000	.	.	.	.	.
_C_			1	0.2735	0.0946	0.0881	0.4589

From Output 75.4.6, you can see the following results:

The variable dose has a significant positive effect on the survival rate.
Individuals under treatment A have a lower survival rate.
Male individuals have a higher survival rate.
Female individuals under treatment A have a higher survival rate.

Output 75.4.7 and Output 75.4.8 display tables of estimated covariance matrix and estimated correlation matrix for estimated parameters with a nonzero degree of freedom, respectively. They are computed by the inverse of the Hessian matrix of the estimated parameters.

Output 75.4.7: Estimated Covariance Matrix

Estimated Covariance Matrix
	Intercept	dose	treatA	sex0	treatAsex0	_C_
Intercept	0.131944	-0.087353	0.053551	0.030285	-0.067056	-0.028073
dose	-0.087353	0.066723	-0.047506	-0.034081	0.058620	0.018196
treatA	0.053551	-0.047506	0.068425	0.036063	-0.075323	-0.017084
sex0	0.030285	-0.034081	0.036063	0.052383	-0.063599	-0.008088
treatAsex0	-0.067056	0.058620	-0.075323	-0.063599	0.119408	0.019134
_C_	-0.028073	0.018196	-0.017084	-0.008088	0.019134	0.008948

Output 75.4.8: Estimated Correlation Matrix

Estimated Correlation Matrix
	Intercept	dose	treatA	sex0	treatAsex0	_C_
Intercept	1.000000	-0.930998	0.563595	0.364284	-0.534227	-0.817027
dose	-0.930998	1.000000	-0.703083	-0.576477	0.656744	0.744699
treatA	0.563595	-0.703083	1.000000	0.602359	-0.833299	-0.690420
sex0	0.364284	-0.576477	0.602359	1.000000	-0.804154	-0.373565
treatAsex0	-0.534227	0.656744	-0.833299	-0.804154	1.000000	0.585364
_C_	-0.817027	0.744699	-0.690420	-0.373565	0.585364	1.000000

Output 75.4.9 displays the computed values and fiducial limits for the first single continuous variable dose in the MODEL statement, given the probability levels, without the effect of the natural threshold, and when the option INSERSECL in the MODEL statement is specified. If there is no single continuous variable in the MODEL specification but the INVERSECL option is specified, an error is reported.

Output 75.4.9: Probit Analysis on Dose

The Probit Procedure

Probit Analysis on dose
Probability	dose	95% Fiducial Limits
0.01	-0.85801	-1.81301	-0.33743
0.02	-0.69549	-1.58167	-0.21116
0.03	-0.59238	-1.43501	-0.13093
0.04	-0.51482	-1.32476	-0.07050
0.05	-0.45172	-1.23513	-0.02130
0.06	-0.39802	-1.15888	0.02063
0.07	-0.35093	-1.09206	0.05742
0.08	-0.30877	-1.03226	0.09039
0.09	-0.27043	-0.97790	0.12040
0.10	-0.23513	-0.92788	0.14805
0.15	-0.08900	-0.72107	0.26278
0.20	0.02714	-0.55706	0.35434
0.25	0.12678	-0.41669	0.43322
0.30	0.21625	-0.29095	0.50437
0.35	0.29917	-0.17477	0.57064
0.40	0.37785	-0.06487	0.63387
0.45	0.45397	0.04104	0.69546
0.50	0.52888	0.14481	0.75654
0.55	0.60380	0.24800	0.81819
0.60	0.67992	0.35213	0.88157
0.65	0.75860	0.45879	0.94803
0.70	0.84151	0.56985	1.01942
0.75	0.93099	0.68770	1.09847
0.80	1.03063	0.81571	1.18970
0.85	1.14677	0.95926	1.30171
0.90	1.29290	1.12867	1.45386
0.91	1.32819	1.16747	1.49273
0.92	1.36654	1.20867	1.53590
0.93	1.40870	1.25284	1.58450
0.94	1.45579	1.30084	1.64012
0.95	1.50949	1.35397	1.70515
0.96	1.57258	1.41443	1.78353
0.97	1.65015	1.48626	1.88238
0.98	1.75326	1.57833	2.01720
0.99	1.91577	1.71776	2.23537

If the XDATA= option is used to input a data set for the independent variables in the MODEL statement, the PROBIT procedure uses these values for the independent variables other than the single continuous variable. Missing values are not permitted in the XDATA= data set for the independent variables, although the value for the single continuous variable is not used in the computing of the fiducial limits. A suitable valid value should be given. In the data set xval created by the SAS statements on , dose = 2. Only one observation from the XDATA= data set is used to produce a probit analysis table for a combination of classification variable levels. If more than one observation is present in the XDATA= data set, only the last observation is used.

See the section XDATA= SAS-data-set for the default values for those effects other than the single continuous variable, for which the fiducial limits are computed.

In this example, there are two classification variables, treat and sex. Fiducial limits for the dose variable are computed for the highest level of the classification variables, treat = B and sex = 1, which is the default specification. Since these are the default values, you would get the same values and fiducial limits if you did not specify the XDATA= option in this example. The confidence level for the fiducial limits can be specified by the ALPHA= option in the MODEL statement. The default level is 95%.

If a LOG10 or LOG option is used in the PROC PROBIT statement, the values and the fiducial limits are computed for both the single continuous variable and its logarithm.

Output 75.4.10 displays the OUTEST= data set. All parameters for an effect are included. The name of a parameter is generated by combining the variable names and levels in the effect. The maximum length of a parameter name is 32.

Output 75.4.10: Outest Data Set for Epidemiology Study

Obs	_NAME_	_TYPE_	_DIST_	_STATUS_	_LNLIKE_	r	Intercept	dose	treatA	sex0	treatAsex0	_C_
1	r	PARMS	Normal	0 Converged	-387.247	-1.00000	-0.88714	1.67739	-1.25367	-0.46329	1.28991	0.27347
2	Intercept	COV	Normal	0 Converged	-387.247	-0.88714	0.13194	-0.08735	0.05355	0.03029	-0.06706	-0.02807
3	dose	COV	Normal	0 Converged	-387.247	1.67739	-0.08735	0.06672	-0.04751	-0.03408	0.05862	0.01820
4	treatA	COV	Normal	0 Converged	-387.247	-1.25367	0.05355	-0.04751	0.06843	0.03606	-0.07532	-0.01708
5	treatB	COV	Normal	0 Converged	-387.247	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
6	sex0	COV	Normal	0 Converged	-387.247	-0.46329	0.03029	-0.03408	0.03606	0.05238	-0.06360	-0.00809
7	sex1	COV	Normal	0 Converged	-387.247	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
8	treatAsex0	COV	Normal	0 Converged	-387.247	1.28991	-0.06706	0.05862	-0.07532	-0.06360	0.11941	0.01913
9	treatAsex1	COV	Normal	0 Converged	-387.247	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
10	treatBsex0	COV	Normal	0 Converged	-387.247	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
11	treatBsex1	COV	Normal	0 Converged	-387.247	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
12	_C_	COV	Normal	0 Converged	-387.247	0.27347	-0.02807	0.01820	-0.01708	-0.00809	0.01913	0.00895

The plots in the following three outputs, Output 75.4.11, Output 75.4.12, and Output 75.4.13, are generated by the PLOTS= option. The first plot, specified with the PREDPPLOT option, is the plot of the predicted probability against the first single continuous variable dose in the MODEL statement. You can specify values of other independent variables in the MODEL statement by using an XDATA= data set or by using the default values.

The second plot, specified with the IPPPLOT option, is the inverse of the predicted probability plot with the fiducial limits. It should be pointed out that the fiducial limits are not just the inverse of the confidence limits in the predicted probability plot; see the section Inverse Confidence Limits for the computation of these limits. The third plot, specified with the LPREDPLOT option, is the plot of the linear predictor $\mb {x^{\prime }\bbeta }$ against the first single continuous variable with the Wald confidence intervals.

Output 75.4.11: Predicted Probability Plot

Output 75.4.12: Inverse Predicted Probability Plot

Output 75.4.13: Linear Predictor Plot

When you combine the INEST= data set and the MAXIT= option in the MODEL statement, the PROBIT procedure can do prediction, if the parameterizations for the models used for the training data and the validation data are exactly the same. The following SAS statements show an example:

data validate;
   input treat $ dose  sex  n r group @@;
   datalines;
B  2.0  0  44 43  1   B  2.0  1  54 52  2
B  1.5  1  36 32  3   B  1.5  0  45 40  4
A  2.0  0  66 64  5   A  2.0  1  89 89  6
A  1.5  1  45 39  7   A  1.5  0  66 60  8
B  2.0  0  44 44  1   B  2.0  1  54 54  2
B  1.5  1  36 30  3   B  1.5  0  45 41  4
A  2.0  0  66 65  5   A  2.0  1  89 88  6
A  1.5  1  45 38  7   A  1.5  0  66 59  8
;

proc probit optc data=validate inest=out1;
   class treat sex;
   model r/n = dose treat sex sex*treat  / maxit = 0;
   output out = out3 p =p;
run;

proc probit optc lackfit data=validate inest=out1;
   class treat sex;
   model r/n = dose treat sex sex*treat / aggregate = group;
   output out = out4 p =p;
run;

After the first invocation of PROC PROBIT, you have the estimated parameters and their covariance matrix in the data set OUTEST = Out1, and the fitted probabilities for the training data set epidemic in the data set OUTPUT = Out2. See Output 75.4.10 for the data set Out1 and Output 75.4.14 for the data set Out2.

The validation data are collected in data set validate. The second invocation of PROC PROBIT simply passes the estimated parameters from the training data set epidemic to the validation data set validate for prediction. The predicted probabilities are stored in the data set OUTPUT = Out3 (see Output 75.4.15). The third invocation of PROC PROBIT passes the estimated parameters as initial values for a new fit of the validation data set with the same model. Predicted probabilities are stored in the data set OUTPUT = Out4 (see Output 75.4.16). Goodness-of-fit tests are computed based on the cells grouped by the AGGREGATE= group variable. Results are shown in Output 75.4.17.

Output 75.4.14: Out2

Obs	treat	dose	n	r	sex	p
1	A	2.17	142	142	0	0.99272
2	A	0.57	132	47	1	0.35925
3	A	1.68	128	105	1	0.81899
4	A	1.08	126	100	0	0.77517
5	A	1.79	125	118	0	0.96682
6	B	1.66	117	115	1	0.97901
7	B	1.49	127	114	0	0.90896
8	B	1.17	51	44	1	0.89749
9	B	2.00	127	126	0	0.98364
10	B	0.80	129	100	1	0.76414

Output 75.4.15: Out3

Obs	treat	dose	sex	n	r	group	p
1	B	2.0	0	44	43	1	0.98364
2	B	2.0	1	54	52	2	0.99506
3	B	1.5	1	36	32	3	0.96247
4	B	1.5	0	45	40	4	0.91145
5	A	2.0	0	66	64	5	0.98500
6	A	2.0	1	89	89	6	0.91835
7	A	1.5	1	45	39	7	0.74300
8	A	1.5	0	66	60	8	0.91666
9	B	2.0	0	44	44	1	0.98364
10	B	2.0	1	54	54	2	0.99506
11	B	1.5	1	36	30	3	0.96247
12	B	1.5	0	45	41	4	0.91145
13	A	2.0	0	66	65	5	0.98500
14	A	2.0	1	89	88	6	0.91835
15	A	1.5	1	45	38	7	0.74300
16	A	1.5	0	66	59	8	0.91666

Output 75.4.16: Out4

Obs	treat	dose	sex	n	r	group	p
1	B	2.0	0	44	43	1	0.98954
2	B	2.0	1	54	52	2	0.98262
3	B	1.5	1	36	32	3	0.86187
4	B	1.5	0	45	40	4	0.90095
5	A	2.0	0	66	64	5	0.98768
6	A	2.0	1	89	89	6	0.98614
7	A	1.5	1	45	39	7	0.88075
8	A	1.5	0	66	60	8	0.88964
9	B	2.0	0	44	44	1	0.98954
10	B	2.0	1	54	54	2	0.98262
11	B	1.5	1	36	30	3	0.86187
12	B	1.5	0	45	41	4	0.90095
13	A	2.0	0	66	65	5	0.98768
14	A	2.0	1	89	88	6	0.98614
15	A	1.5	1	45	38	7	0.88075
16	A	1.5	0	66	59	8	0.88964

Output 75.4.17: Goodness-of-Fit Table

The Probit Procedure

Goodness-of-Fit Tests
Statistic	Value	DF	Value/DF	Pr > ChiSq
Pearson Chi-Square	2.8101	2	1.4050	0.2454
L.R. Chi-Square	2.8080	2	1.4040	0.2456