Assessing the Accuracy of Cluster Allocations Obtained from Finite Mixture Models


Overview

 

Model-based clustering is one of the many uses for finite mixture models and SAS/STAT software’s FMM procedure. The finite mixture model approach to clustering assumes that the observations to be clustered are drawn from a mixture of a specified number of populations in varying proportions (McLachlan and Basford, 1988). After the finite mixture model is fit to estimate the model parameters and the posterior probabilities of population membership, each observation is assigned membership to the population for which it has the highest estimated posterior probability of belonging. As with any method of cluster analysis, the practitioner faces the problem of assessing the accuracy of the population membership allocations that are obtained, because in practice, population membership is not observed. To address this assessment problem, Basford and McLachlan (1985) propose a method of estimating the correct allocation rates for individual populations and for the overall mixture that is based on averaging appropriate functions of the maximum posterior probabilities. Because the proposed estimators for the allocation rates are known to be biased, bootstrap methods are used to estimate the bias, enabling the production of bias-adjusted estimates of the correct allocation rates. This example shows how to produce the bias-adjusted allocation rate estimates by using the FMM procedure and a little SAS programming.

 

The SAS output in this example is generated using SAS/STAT 12.1.

 

Analysis

 

The underlying premise of the finite mixture model approach to clustering is that the population from which the response variable of interest is sampled can be partitioned into mutually exclusive subpopulations. It is assumed that there is a distinct probability distribution with a known parametric form associated with each subpopulation. These subpopulation probability distributions govern the distribution of values of the response variable, given that an observation originates from a particular subpopulation. Thus, the marginal distribution of the response variable is a mixture of J distinct distributions. The prior probability that an observation is drawn from a particular subpopulation is equal to the proportion of the overall population that is a member of that particular subpopulation. Finite mixture models represent the marginal distribution of the response variable as a linearly weighted sum of component probability distributions:

$$f(y_i) = \sum_{j=1}^{J} \pi_j \, f_j(y_i; \theta_j)$$

The component (subpopulation) distributions, $f_j(y_i; \theta_j)$, can be discrete or continuous distributions; $\theta_j$ is the vector of parameters for the $j$th component probability distribution. The mixing probabilities, $\pi_j$, measure the prior probabilities of component (subpopulation) membership.
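For example, ignoring the offset that is used later in this example, a two-component Poisson mixture has the marginal distribution

$$f(y_i) = \pi \, \frac{e^{-\lambda_1} \lambda_1^{\,y_i}}{y_i!} + (1 - \pi) \, \frac{e^{-\lambda_2} \lambda_2^{\,y_i}}{y_i!}$$

where $\lambda_1$ and $\lambda_2$ are the component means and $\pi$ is the mixing probability for the first component.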

 

Given a realization of the response variable and the parameters of the component distributions, the posterior probability that the ith observation is a member of the jth subpopulation is

$$\tau_{ij} = \frac{\pi_j \, f_j(y_i; \theta_j)}{\sum_{k=1}^{J} \pi_k \, f_k(y_i; \theta_k)}$$

Each realization is assigned membership to the subpopulation to which it has the highest posterior probability of belonging. Because $\sum_{j=1}^{J} \hat{\tau}_{ij} = 1$, the maximum estimated posterior probabilities have a range of $(1/J, 1]$. If the maximum estimated posterior probabilities are near 1 for most of the observations in a sample, this is compelling evidence that the finite mixture model can cluster the sample with a high degree of certainty. As the maximum estimated posterior probabilities approach $1/J$, this indicates that the components of the fitted mixture model are too close together for the sample to be clustered with any certainty. A visual inspection of a graph of the maximum posterior probabilities can be informative, but such assessments are necessarily subjective. It would be useful to have a summary statistic that provides an objective measure of how well the data have been clustered.
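For example, with the two-component ($J = 2$) mixture that is used later in this example, the maximum posterior probabilities lie in $(0.5, 1]$: values near 1 indicate observations that can be allocated with near certainty, whereas values near 0.5 indicate observations that are essentially a coin toss between the two components.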

 

The correct allocation rates measure the proportion of observations that have been allocated to the correct subpopulation according to the maximum posterior probability criterion. If you could observe component (subpopulation) membership, computing the correct allocation rates would be straightforward. To do so, you would create three sets of indicator variables. Denote the first set as $z_1, \ldots, z_J$, where each $z_j$ is associated with a particular subpopulation. For each observation in the sample, set $z_{ij} = 1$ if the $i$th observation originates from the $j$th subpopulation, and set $z_{ij} = 0$ otherwise. Denote the second set as $\hat{z}_1, \ldots, \hat{z}_J$, where each variable similarly corresponds to a particular subpopulation. For each observation in the sample, set $\hat{z}_{ij} = 1$ if $\hat{\tau}_{ij} = \max_k \hat{\tau}_{ik}$ and set $\hat{z}_{ij} = 0$ otherwise, where $\hat{\tau}_{ij}$ is the estimated posterior probability of subpopulation membership. Denote the third set as $d_1, \ldots, d_J$, where $d_{ij} = 1$ if $z_{ij} = \hat{z}_{ij}$ and $d_{ij} = 0$ otherwise. The correct allocation rate for the $j$th subpopulation is computed as

$$A_j = \frac{1}{n_j} \sum_{i=1}^{n} z_{ij} \, d_{ij}$$

where

$$n_j = \sum_{i=1}^{n} z_{ij}$$

The true overall correct allocation rate for the mixture is

$$A = \frac{1}{n} \sum_{j=1}^{J} n_j A_j$$

Because the correct allocation rates $A_j$ and $A$ depend on the unobserved indicator variables $z_{ij}$, in practice they must be estimated. Ganesalingam and McLachlan (1980) propose estimating the overall correct allocation rate by

$$T = \frac{1}{n} \sum_{i=1}^{n} \max_j \hat{\tau}_{ij}$$

Basford and McLachlan (1985) propose estimating the individual correct allocation rates by

$$T_j = \frac{1}{n \, \hat{\pi}_j} \sum_{i=1}^{n} \hat{z}_{ij} \, \hat{\tau}_{ij}$$

If the parameters $\theta$ of the finite mixture model are estimated consistently, then the biases, $T - A$ and $T_j - A_j$, converge in probability to 0 as $n \rightarrow \infty$. However, results from both McLachlan and Basford (1988) and Basford and McLachlan (1985) indicate that $T$ and $T_j$ tend to overestimate the correct allocation rates in finite samples, and so some method of bias correction is recommended. Basford and McLachlan (1985) propose using the parametric bootstrap method to estimate the bias in order to produce bias-adjusted estimates of the correct allocation rates. The parametric bootstrap method works because, although you cannot observe the subpopulation of origin in the original sample, you can observe the subpopulation of origin in the bootstrap samples. This makes it possible to compute the overall correct allocation rate $A$ and the subpopulation correct allocation rates $A_j$ for each bootstrap sample. You can then estimate the bias by taking the average of the differences $T - A$ and $T_j - A_j$ over the bootstrap samples.

 

Conceptually, there are five steps in the process of computing the bias-adjusted estimates of the correct allocation rates by using the parametric bootstrap method:

 

  1. Generate parametric bootstrap samples. To do this, generate K replicates of the original data set. For each observation in the K data sets, generate a pseudorandom variable that takes the values $1, \ldots, J$ with probabilities $\hat{\pi}_1, \ldots, \hat{\pi}_J$. Then replace the response variable observations with pseudorandom variates that have a mixture distribution equivalent to that of the finite mixture model that was fitted by using the original data.

  2. Fit a finite mixture model to each of the K bootstrap samples by using the model specification that was applied to the original sample, and compute the posterior probabilities of component membership.

  3. Compute $A$, $A_j$, $T$, and $T_j$ for each of the bootstrap samples.

  4. Estimate the bias of $T$ and the standard error of the bias estimate as

    $$b = \frac{1}{K} \sum_{k=1}^{K} \left( T^{(k)} - A^{(k)} \right), \qquad \mathrm{SE}(b) = \sqrt{ \frac{1}{K(K-1)} \sum_{k=1}^{K} \left( T^{(k)} - A^{(k)} - b \right)^2 }$$

    where $T^{(k)}$ and $A^{(k)}$ denote the values that are computed from the $k$th bootstrap sample. Then estimate the bias of each $T_j$ and the standard error of the bias estimate as

    $$b_j = \frac{1}{K} \sum_{k=1}^{K} \left( T_j^{(k)} - A_j^{(k)} \right), \qquad \mathrm{SE}(b_j) = \sqrt{ \frac{1}{K(K-1)} \sum_{k=1}^{K} \left( T_j^{(k)} - A_j^{(k)} - b_j \right)^2 }$$

  5. Compute the bias-adjusted estimate of the overall correct allocation rate as

    $$T^{*} = T - b$$

    Then compute the bias-adjusted estimates of the subpopulation correct allocation rates as

    $$T_j^{*} = T_j - b_j, \qquad j = 1, \ldots, J$$

 

You can also compute the root mean square errors for $T$ and $T_j$ to use as measures of precision:

$$\mathrm{RMSE}(T) = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( T^{(k)} - A^{(k)} \right)^2 }$$

and

$$\mathrm{RMSE}(T_j) = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( T_j^{(k)} - A_j^{(k)} \right)^2 }$$

The ratio of the bias of an estimator to its root mean square error can be used to indicate how serious a problem bias is for a given estimator. If the ratio $b / \mathrm{RMSE}(T)$ is less than 0.25, then bias is probably not a serious problem.
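For instance, with illustrative values of $b = 0.005$ and $\mathrm{RMSE}(T) = 0.046$, the ratio is roughly 0.11, comfortably below the 0.25 threshold, so bias adjustment would change the estimate only slightly.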

 

Example

 

The data used in this example are from a study by Symons, Grimson, and Yuan (1983) that investigates the incidence of sudden infant death syndrome (SIDS) and the problem of identifying the counties at high risk of SIDS in North Carolina. The data available are the number of SIDS deaths and the number of live births in the 100 counties of North Carolina for the years 1974–1978. The study models the number of incidences of SIDS as a mixture of two Poisson distributions, one representing a "normal" subpopulation and the other representing a "high-risk" subpopulation.

 

The following SAS statements create a SAS data set named NCSIDS. In addition to the original variables SIDS and Births, the variables Logrisk and Rate are created. Logrisk is the natural logarithm of the number of births, and Rate is the SIDS incidence rate (deaths per live birth).

data ncsids;
   input County $ 1-12 Births SIDS;
   Rate=SIDS/Births;
   Logrisk=log(Births);
   datalines;
Alamance        4672     13
Alexander       1333      0
Alleghany        487      0
Anson           1570     15
Ashe            1091      1
Avery            781      0
Beaufort        2692      7

   ... more lines ...

Wayne           6638     18
Wilkes          3146      4
Wilson          3702     11
Yadkin          1269      1
Yancey           770      0
;

 

The following SAS statements use the FMM procedure to fit the two-component finite mixture model. The K= option in the MODEL statement specifies that two component distributions be fit. The DIST= option specifies that the two distributions be Poisson, and the OFFSET= option specifies that the variable Logrisk be used as an offset. The CLASS option in the OUTPUT statement adds the estimated component membership to the OUTPUT data set in a variable that is named Class by default. The MAXPOST option adds the maximum posterior probability to the output data set in a variable that is named Maxpost by default, and the PRIOR option adds the estimated mixing (prior) probabilities. The ODS OUTPUT statement saves the "Parameter Estimates," "Number of Observations," and "Mixing Probabilities" tables to SAS data sets.

proc fmm data=ncsids;
   model SIDS = / dist=poisson k=2 offset=Logrisk;
   output out=model class maxpost prior;
   ods output ParameterEstimates=Parameters
              NObs=NObs(where=(label="Number of Observations Used") keep=label N)
              MixingProbs=MixingProbs(keep=prob);
run;

 

Output 1 displays selected portions of the FMM procedure’s output. In the Inverse Linked Estimates column of the "Parameter Estimates" table, you can see that the mean for the component 1 Poisson distribution is 0.001693 and the mean for the component 2 Poisson distribution is 0.003805. Thus, component 1 represents the "normal-risk" subpopulation and component 2 represents the "high-risk" subpopulation. You can also see from the "Mixing Probabilities" table that the prior probability for component 1 is 0.7969; because the prior probabilities sum to 1, the prior probability for component 2 is 1 − 0.7969 = 0.2031.

 

Output 1 Two Component Finite Mixture Model of SIDS Data
 
The FMM Procedure

 

Parameter Estimates for 'Poisson' Model

                                    Standard                         Inverse Linked
Component   Parameter   Estimate       Error   z Value   Pr > |z|          Estimate
        1   Intercept    -6.3813     0.08812    -72.42     <.0001          0.001693
        2   Intercept    -5.5715      0.2288    -24.35     <.0001          0.003805

 

Parameter Estimates for Mixing Probabilities

                            Linked Scale
              ---------------------------------------
Parameter     Estimate   Standard Error   z Value   Pr > |z|   Probability
Probability     1.3669           0.8024      1.70     0.0885        0.7969

 

Fit Statistics

-2 Log Likelihood            474.3
AIC (smaller is better)      480.3
AICC (smaller is better)     480.5
BIC (smaller is better)      488.1
Pearson Statistic            113.3
Effective Parameters             3
Effective Components             2

 


 

Figure 1 is a plot of the ordered incidence rates, with color-coded markers to indicate whether a county is classified by the finite mixture model as normal risk or high risk. The plot reveals that, with a couple of exceptions, the counties with the highest rates tend to be classified as high risk by the finite mixture model.

 

Figure 1 Ordered Incidence Rates
 
[Figure 1: scatter plot of SIDS incidence rates ordered from highest to lowest, with markers color-coded by the estimated risk class]
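The statements that produce Figure 1 are not shown in this example. A minimal sketch that could produce a similar display is given below; it ranks the incidence rates in descending order and plots them with markers grouped by the estimated class (the data set name Rates and the rank variable RateOrder are illustrative choices, not part of the original program):

proc rank data=model out=rates descending;
   var rate;
   ranks rateorder;
run;
title "Ordered Incidence Rates";
proc sgplot data=rates;
   scatter y=rate x=rateorder / markerattrs=(symbol=CircleFilled size=4px)
                                group=class;
run;
title;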

 

You can generate plots of the maximum posterior probabilities to visually assess how accurately the finite mixture model can cluster the data. For example, the following SAS statements generate a plot of the ordered maximum posterior probabilities and a plot of the maximum posterior probabilities versus the incidence rates:

proc rank data=model out=model descending;
   var maxpost;
   ranks postorder;
run;
title "Ordered Maximum Posterior Probabilities";
proc sgplot data=model;
   scatter y=maxpost x=postorder / markerattrs=(symbol=CircleFilled size=4px)
                                   group=class;
run;
title;
title "Maximum Posterior Probabilities by Incidence Rates";
proc sgplot data=model;
   scatter y=maxpost x=rate / markerattrs=(symbol=CircleFilled size=4px)
                              group=class;
run;
title;
The plot on the left in Figure 2 shows that about half the counties have relatively large (> 0.9) maximum posterior probabilities, indicating that they can be classified with a high degree of certainty. The maximum posterior probabilities then decline at an accelerating rate, indicating a deteriorating degree of certainty about the correct allocation. The counties that are classified as high risk appear to account for a disproportionate share of those with relatively low (< 0.7) maximum posterior probabilities. The plot on the right shows that counties with incidence rates between approximately 0.0026 and 0.005 tend to have the highest degree of uncertainty about the correct allocation.

 

 
Figure 2 Maximum Posterior Probabilities
 

[Left panel: Ordered Maximum Posterior Probabilities]

[Right panel: Maximum Posterior Probabilities by Incidence Rates]

You can compute the estimated correct allocation rates T, T1, and T2 as follows by using the data that are saved in the output data sets Model, Nobs, and MixingProbs. The results are saved in the data set ModelPost.

proc sort data=model out=model;
   by Class;
run;

proc means data=model sum;
   by Class;
   var Maxpost;
   output out=ModelPost(drop=_TYPE_ _FREQ_)
          sum(Maxpost)=Maxpost;
run;

proc transpose data=ModelPost prefix=MaxPost
               out=ModelPost(drop=_LABEL_ _NAME_);
   var Maxpost;
   id Class;
run;
data ModelPost;
   merge ModelPost NObs(drop=label) MixingProbs;
   T=(MaxPost1 + MaxPost2)/N;
   T1=MaxPost1/(N*Prob);
   T2=MaxPost2/(N*(1-Prob));
   label T=Estimated Overall Correct Allocation Rate
         T1=Estimated Component 1 Correct Allocation Rate
         T2=Estimated Component 2 Correct Allocation Rate;
run;
proc print data=ModelPost noobs label;
   var T T1 T2;
run;

Output 2 shows the estimated correct allocation rates. The estimate of the overall rate is .88, the rate for the normal-risk subpopulation is .96, and the rate for the high-risk subpopulation is .59.

 

Even though these estimated allocation rates might be biased, they indicate that the finite mixture model can correctly classify observations from the normal-risk population with a high degree of accuracy, but they also indicate that the degree of accuracy is considerably lower for observations from the high-risk population. This implies that the model is much more likely to incorrectly classify a high-risk county as normal risk than to incorrectly classify a normal-risk county as high risk.

 

Output 2 Allocation Rates
 
  Estimated Overall      Estimated Component 1    Estimated Component 2
 Correct Allocation       Correct Allocation       Correct Allocation
        Rate                     Rate                     Rate

      0.88775                  0.96290                  0.59292

 

Computing Bias-Adjusted Correct Allocation Rates by Using the Parametric Bootstrap Method

 

The section Analysis describes a five-step procedure that uses the parametric bootstrap method to produce bias-adjusted correct allocation rates. The five subsections that follow describe each step in greater detail.

 

Step 1: Generate the Bootstrap Samples

 

Step 1 is to generate the bootstrap samples. One of the easiest ways to do this is to use the SURVEYSELECT procedure. If you use the METHOD=SRS option, the SAMPRATE=1 option, and the REPS=200 option, PROC SURVEYSELECT generates 200 exact replicates of the original data set. The procedure automatically generates a variable named Replicate that indexes the samples. Using the SEED= option ensures reproducibility. The OUT= option saves the 200 bootstrap samples in a data set named Bootstrap.

proc surveyselect data=model(keep=county sids logrisk)
                  out=bootstrap
                  seed=872398
                  method=srs
                  samprate=1
                  reps=200;
run;

Next, you retrieve the estimated Poisson parameters and the first component mixing probability from the output data sets Parameters and MixingProbs and store those values in macro variables for later use.

data parameters;
   set parameters;
   if component=1 then call symput('theta1',estimate);
   if component=2 then call symput('theta2',estimate);
run;

data MixingProbs;
   set MixingProbs;
   call symput('pi1',prob);
run;

Then, you generate the pseudorandom variable Group, which takes on the value 1 or 2 with probability $\hat{\pi}_1$ or $\hat{\pi}_2 = 1 - \hat{\pi}_1$, respectively. Also, you replace the response variable SIDS with pseudorandom numbers that have a distribution equivalent to the estimated finite mixture model. Then, to prepare the data set Bootstrap for BY-group processing, you sort it by Replicate.

data bootstrap;
   set bootstrap;
   call streaminit(987346598);
   Group=rand('TABLE',&pi1);
   mu1=exp(&theta1+logrisk);
   mu2=exp(&theta2+logrisk);
   SIDS=ifn(group=1,rand('POISSON',mu1),rand('POISSON',mu2));
run;

 

proc sort data=bootstrap out=bootstrap;
   by Replicate;
run;

 

Step 2: Fit a Finite Mixture Model to Each Bootstrap Sample

 

Step 2 is to use PROC FMM with the original finite mixture model specification and BY-group processing to fit a finite mixture model to each of the 200 bootstrap samples. In an OUTPUT statement, you specify the OUT=, CLASS, MAXPOST, and PRIOR options. The ODS OUTPUT statement saves the parameter estimates for each bootstrap sample in the data set BootParameters and saves the model convergence status in the data set Converge. These two data sets and the output data set ML will be merged so that you can perform some recommended filtering of the output data set ML.

ods select none;
proc fmm data=bootstrap;
   by Replicate;
   model SIDS = / dist=poisson k=2 offset=Logrisk;
   output out=ml class maxpost prior;
   ods output ParameterEstimates=BootParameters(keep=replicate component estimate)
              ConvergenceStatus=Converge(keep=replicate status);
run;

Before you perform any analysis, it is recommended that you perform two filtering operations on the output data set ML. The first filtering operation concerns the convergence status of the models that have been fit to each bootstrap sample. Before you accept the estimates, you should check the convergence status of each model and discard any estimates that were generated by a model that failed to converge. The data set Converge contains a variable named Status that has a value of 0 if the model converged and a nonzero value if the model failed to converge.
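For example, a quick way to see how many of the 200 bootstrap fits failed to converge is to tabulate the variable Status with PROC FREQ (a simple diagnostic check, not part of the original program):

proc freq data=Converge;
   tables Status;
run;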

The second filtering operation requires a slightly more elaborate explanation. Output 1 shows that PROC FMM names the two component distributions component 1 and component 2. Because the parameter estimate for component 1 is smaller than the estimate for component 2, component 1 is referred to as the normal-risk component and component 2 is referred to as the high-risk component. Thus, when the variable Class has a value of 1, it is interpreted to mean that a particular observation has been assigned to the normal-risk subpopulation; when Class has a value of 2, it is interpreted to mean that a particular observation has been assigned to the high-risk subpopulation. However, this naming of the components by PROC FMM is completely arbitrary. If you fit the same model to a different sample, there is no guarantee that component 1 will be associated with the smaller parameter estimate and thus represent the normal-risk component. This means that when the variable Class has a value of 1, it cannot necessarily be interpreted to mean that the observation has been assigned to the normal-risk subpopulation. To ensure that the classifications have the same interpretations across the bootstrap samples, you must ensure that the magnitudes of the parameter estimates are in the same order. When they are not in the same order, you must switch the values of the variable Class and the values of the prior probabilities that are stored in the output data set ML.

 

To switch these values, you first transpose the data set BootParameters and sort it by Replicate. Then you sort the data set Converge by Replicate and merge BootParameters and Converge with ML.

proc transpose data=BootParameters
               out=BootParameters(drop=_NAME_)
               prefix=Estimate;
  by Replicate;
  id Component;
run;

proc sort data=BootParameters out=BootParameters;
   by Replicate;
run;

proc sort data=Converge;
   by Replicate;
run;
data ml;
   merge ml BootParameters Converge;
   by Replicate;
run;

Next, you use a DATA step with a WHERE= data set option to exclude all observations for which Status is nonzero. Then, you create a set of variables that are duplicates of the variables that contain the parameter estimates, the prior probabilities, and the variable Class. Finally, you perform a conditional (IF-THEN-DO) operation based on the relative magnitudes of the parameter estimates and switch the values in the original variables as needed. The parameter estimates are stored in the variables Estimate1 and Estimate2, and the prior probabilities are stored in the variables Prior_1 and Prior_2.

data ml(drop=Prior1 Prior2 Lambda1 Lambda2 Class2 Status);
   set ml(where=(Status=0));
   Prior1=Prior_1;
   Prior2=Prior_2;
   Post1=Post_1;
   Post2=Post_2;
   Lambda1=Estimate1;
   Lambda2=Estimate2;
   Class2=Class;
   if Estimate1>Estimate2 then do;
      Prior_1=Prior2;
      Prior_2=Prior1;
      if Class2=1 then Class=2;
      if Class2=2 then Class=1;
   end;
run;

 

Step 3: Compute the True Correct Allocation Rates and the Estimated Correct Allocation Rates for Each Bootstrap Sample

 

Step 3 is to compute the true correct allocation rates A, A1, and A2 and the estimated correct allocation rates T, T1, and T2 for the 200 bootstrap samples.

 

The following SAS statements compute the true correct allocation rates A, A1 , and A2 for the bootstrap samples and save the results in a data set named Allocation:

data ml;
   set ml;
   z1=ifn(Group=1,1,0);
   zhat1=ifn(Class=1,1,0);
   d1=ifn(z1=zhat1,1,0);
   z2=ifn(Group=2,1,0);
   zhat2=ifn(Class=2,1,0);
   d2=ifn(z2=zhat2,1,0);
   A1=z1*d1;
   A2=z2*d2;
run;
proc sort data=ml out=ml;
   by Replicate Class;
run;

proc means data=ml(keep= replicate A1 A2 z1 z2) sum;
   by Replicate;
   var A1 A2 z1 z2;
   output out=allocation(drop=_TYPE_ _FREQ_)
          sum(A1)=A1
          sum(A2)=A2
          sum(z1)=n1
          sum(z2)=n2;
run;
data allocation;
   set allocation;
   A1=A1/n1;
   A2=A2/n2;
   n=n1+n2;
   A=(n1*A1 + n2*A2)/n;
run;

The next group of SAS statements computes the estimated correct allocation rates T, T1 , and T2 for the bootstrap samples and saves the results to the same data set, Allocation:

proc means data=ml sum;
   by Replicate Class;
   var maxpost;
   output out=PostSums(drop=_TYPE_ _FREQ_)
          sum(maxpost)=Maxpost;
run;

proc transpose data=PostSums prefix=SumMaxPost
               out=PostSums(drop=_LABEL_ _NAME_);
   var Maxpost;
   by Replicate;
   id Class;
run;

proc sort data=allocation out=allocation;
   by Replicate;
run;

proc sort data=PostSums out=PostSums;
   by Replicate;
run;
data prior;
   set ml(keep=Replicate Prior_1 Prior_2);
   by Replicate;
   if First.Replicate;
run;
proc sort data=prior out=prior;
   by Replicate;
run;
data allocation;
   merge allocation PostSums prior;
   by Replicate;
run;

data allocation;
   set allocation;
   T=(SumMaxPost1 + SumMaxPost2)/n;
   T1=SumMaxPost1/(n*Prior_1);
   T2=SumMaxPost2/(n*Prior_2);
run;

Step 4: Estimate the Biases of the Correct Allocation Rate Estimators

 

Step 4 is to compute the biases b, b1, and b2; the standard errors of the bias estimates SE(b), SE(b1), and SE(b2); and the root mean square errors RMSE(T), RMSE(T1), and RMSE(T2).

 

The following SAS statements accomplish this task and save the results in the data set Bias:

data allocation;
   set allocation;
   B=T-A;
   Bsqr=B**2;
   B1=T1-A1;
   B1sqr=B1**2;
   B2=T2-A2;
   B2sqr=B2**2;
run;
proc means data=allocation mean;
   var B B1 B2 Bsqr B1sqr B2sqr;
   output out=bias
          mean(B)=B
          var(B)=varB
          mean(B1)=B1
          var(B1)=varB1
          mean(B2)=B2
          var(B2)=varB2
          mean(Bsqr)=Bsqr
          mean(B1sqr)=B1sqr
          mean(B2sqr)=B2sqr
          N=K;
run;
data bias;
   set bias;
   se_B=sqrt(varB/K);
   se_B1=sqrt(varB1/K);
   se_B2=sqrt(varB2/K);
   RMSE_T1=sqrt(B1sqr);
   RMSE_T2=sqrt(B2sqr);
   RMSE_T=sqrt(Bsqr);
run;

Step 5: Compute the Bias-Adjusted Estimates of the Correct Allocation Rates

 

Step 5 is to compute the bias-adjusted estimates of the correct allocation rates T*, T*1, and T*2. You do this by merging the data set Bias with the data set ModelPost, which contains the estimated correct allocation rates from the original sample, and subtracting the bias estimates b, b1, and b2 from the corresponding correct allocation rate estimates T, T1, and T2. The results are saved in the data set BiasAdjusted.

data BiasAdjusted;
   merge bias(drop = _TYPE_ _FREQ_) ModelPost;
   BiasAdjT=T-B;
   BiasAdjT1=T1-B1;
   BiasAdjT2=T2-B2;
run;

The data set BiasAdjusted holds the results, but it is not in the best form for printing. All the data are currently stored in wide form, as a single row of values. To prepare the results for printing, it is recommended that you reshape the data set to a long form; that is, you stack the data so that there are separate rows for the overall population and the two subpopulations. You can do this by breaking BiasAdjusted into three data sets and then appending those three data sets together.

data T;
   set BiasAdjusted(keep=T B SE_B BiasAdjT RMSE_T);
   length Label $ 12;
   label='Overall';
   rename T=Estimate
          B=Bias
          SE_B=SE
          BiasAdjT=Adjusted
          RMSE_T=RMSE;
run;

data T1;
   set BiasAdjusted(keep=T1 B1 SE_B1 BiasAdjT1 RMSE_T1);
   length Label $ 12;
   Label='Normal Risk';
   rename T1=Estimate
          B1=Bias
          SE_B1=SE
          BiasAdjT1=Adjusted
          RMSE_T1=RMSE;
run;

data T2;
   set BiasAdjusted(keep=T2 B2 SE_B2 BiasAdjT2 RMSE_T2);
   length Label $ 12;
   Label='High Risk';
   rename T2=Estimate
          B2=Bias
          SE_B2=SE
          BiasAdjT2=Adjusted
          RMSE_T2=RMSE;
run;
proc append base=results data=T;
run;
proc append base=results data=T1;
run;
proc append base=results data=T2;
run;

Next, you compute the ratios $b / \mathrm{RMSE}(T)$, $b_1 / \mathrm{RMSE}(T_1)$, and $b_2 / \mathrm{RMSE}(T_2)$ and create appropriate labels to prepare the data set Results for printing:

data results;
   set results;
   ratio=bias/rmse;
   label Label=Population
         Estimate=Estimated Correct Allocation Rate
         Bias=Estimate of Bias
         SE=Standard Error of Bias Estimate
         Adjusted=Bias-Adjusted Correct Allocation Rate
         RMSE=RMSE of Correct Allocation Rate
         ratio=ratio of Bias to RMSE;
run;

Finally, you print the data set Results:

ods select all;
proc print data=results noobs label;
   var Label Estimate RMSE Bias SE ratio Adjusted;
   title 'Estimated Correct Allocation Rates';
   title2 'Parametric Bootstrap Method';
run;

Output 3 shows the results. As you can see, for this example the bias estimates are all positive but fairly small. The ratios of bias to RMSE are all below the .25 threshold, indicating that bias is not a significant issue for this model and sample.

 

Output 3 Parametric Bootstrap Results
 
Estimated Correct Allocation Rates
Parametric Bootstrap Method

                Estimated     RMSE of                      Standard                    Bias-Adjusted
                  Correct     Correct                      Error of      ratio of           Correct
               Allocation  Allocation      Estimate            Bias       Bias to        Allocation
Population           Rate        Rate       of Bias        Estimate          RMSE              Rate

Overall           0.88775     0.04590    .004583433      .003237460      0.099858           0.88316
Normal Risk       0.96290     0.01902    .001455327      .001344381      0.076513           0.96145
High Risk         0.59292     0.12715    .003564499      .009010140      0.028033           0.58935

 

References

 

Basford, K. E. and McLachlan, G. J. (1985), “Estimation of Allocation Rates in a Cluster Analysis Context,” Journal of the American Statistical Association, 80(390), 286–293.

 

Ganesalingam, S. and McLachlan, G. J. (1980), “Error Rate Estimation on the Basis of Posterior Probabilities,” Pattern Recognition, 12, 405–413.

 

McLachlan, G. J. and Basford, K. E. (1988), Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker.

 

Symons, M. J., Grimson, R. C., and Yuan, Y. C. (1983), “Clustering of Rare Events,” Biometrics, 39, 193–205.
