Sample 66969: Generate multivariate binary data with specified means and correlation matrix

Generate multivariate binary data with given means and correlation matrix

Contents: Purpose / History / Requirements / Usage / Details / Limitations / References

 

PURPOSE:
The RanMBin macro generates values from multiple binary variables with specified means and correlation matrix. It creates an output data set with the specified number of observations, arranged in either wide or long format. Exchangeable and autoregressive AR(1) correlation structures are supported, as are unstructured correlation matrices. Banded correlation structures with autoregressive, power-decaying correlations are also supported.
HISTORY:
The version of the RanMBin macro that you are using is displayed when you specify version (or any string) as the first argument. For example:
    %RanMBin(version, )

The RanMBin macro always attempts to check for a later version of itself. If it cannot do so (for example, because no internet connection is available), the macro issues the following message:

   NOTE: Unable to check for newer version

The computations performed by the macro are not affected by the appearance of this message.

Version   Update Notes
1.1       Corrected Prentice constraint computation
1.0       Initial coding
REQUIREMENTS:
Only Base SAS® is required.
USAGE:
Follow the instructions on the Downloads tab of this sample to save the RanMBin macro definition. Replace the text within quotation marks in the following statement with the location of the RanMBin macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the RanMBin macro and make it available for use:
   %inc "<location of your file containing the RanMBin macro>";

Following this statement, you can call the RanMBin macro. See the Results tab for examples.

The following parameters are required when using the RanMBin macro:

means=value-list , or
inmeans=data-set
Either means= or inmeans= is required to specify the number of binary variables, m, to be created and their means. If means= and inmeans= are both specified, inmeans= is ignored.

If inmeans= is specified, the first observation in the data set is read. The number of numeric variables in the data set determines the number of binary variables to be created. The values in all numeric variables are used as means of the desired binary variables. The values in the variables must be real numbers between 0 and 1. The names of the numeric variables in the data set are not restricted.

If means= is specified, value-list is a space-separated list of real numbers between 0 and 1. Additionally, any value in the list can be integer*value, where integer is a positive integer to specify that value should be replicated integer times. For example, means=0.4 2*0.1 3*0.15 is equivalent to means=0.4 0.1 0.1 0.15 0.15 0.15.
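
The two ways of supplying means can be sketched as follows. This is an illustration only: it assumes the macro has already been defined with %inc, the data set name MeanVals and the variable names p1-p6 are hypothetical, and exch=0.05 is an arbitrary small correlation chosen for the example.

```sas
/* Six binary variables, using the integer*value replication shorthand. */
%RanMBin(means=0.4 2*0.1 3*0.15, exch=0.05)

/* The same six means supplied in a data set. Only the first            */
/* observation is read; MeanVals and p1-p6 are hypothetical names.      */
data MeanVals;
   input p1-p6;
   datalines;
0.4 0.1 0.1 0.15 0.15 0.15
;

%RanMBin(inmeans=MeanVals, exch=0.05)
```

Both calls request the same means, so they should describe the same target distribution. Whether a particular combination of means and correlation is accepted depends on the constraint checks described under DETAILS.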

 

exch=value , or
ar=value , or
incorr=data-set
One of exch=, ar=, or incorr= must be specified to determine the correlations among the m binary variables to be created. Specify exch= to create a correlation matrix with exchangeable structure in which every pair of variables has the specified correlation value. An autoregressive structure is created by specifying ar=. See k= for how it modifies that structure. If exch= or ar= is specified, value must be a real number between 0 and 1. If incorr= is specified, the data set must have exactly m numeric variables. The names of the numeric variables in the data set are not restricted. The observations in the data set must contain the desired correlation matrix with positive correlations. The ith row and column in the data set should contain the correlations corresponding with the ith mean. Only the values in the upper triangle above the main diagonal are read. Values on and below the main diagonal are ignored.
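
As a sketch of the incorr= layout described above, the following step builds a 3-by-3 correlation matrix for three binary variables. The data set name Corr, the variable names c1-c3, and the numeric values are hypothetical. The full symmetric matrix is entered for readability, although only the values above the main diagonal are read.

```sas
/* Row i and column i hold the correlations for the ith mean. */
data Corr;
   input c1-c3;
   datalines;
1    0.20 0.10
0.20 1    0.15
0.10 0.15 1
;

/* Three means, so the incorr= data set has exactly three numeric variables. */
%RanMBin(means=0.3 0.4 0.5, incorr=Corr)
```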

 

The following parameters are optional:

out=data-set
Specifies the name of the data set to be created that contains the generated random variables. The variables, the number of observations, and the shape of the data set depend on outshape=. See the descriptions of outshape= and n= for details. If omitted, out=Mbin.
outshape=wide|long
Specifies the shape and structure of the out= data set. If outshape=wide, each observation contains one set of m random values (either 0 or 1) for the m binary variables named Y1-Ym. The data set contains n observations as specified by n=. If outshape=long, each set of m random values creates m observations and all random values are placed in a single binary variable named Y. The data set contains n×m observations. The probability that a random value equals 1 is equal to the corresponding mean as specified in means= or inmeans=. A numeric variable identifying the separate sets of generated random values is included in the data set. It is named ID by default with values 1, 2, and so on. If outshape=long, an additional variable is added to identify the binary variable within each generated set. It is named SubID by default with values 1, 2, ... , m in each set. If omitted, outshape=wide.
n=n
Specifies the number of sets of random values to create in the out= data set. Each set contains m random values for the m binary variables with the specified means and correlation matrix. Each value is either 0 or 1. If omitted, n=100.
subject=variable-name
Specifies the name of the numeric variable in the out= data set that identifies each set of m generated random values. See the description of outshape= for details. If omitted, subject=ID.
within=variable-name
Specifies the name of the numeric variable in the out= data set that identifies the binary variable within each set of m generated random values when outshape=long. within= is ignored if outshape=wide. See the description of outshape= for details. If omitted, within=SubID.
k=k
Specifies the number of bands, k=1, 2, ..., m-1, parallel to the main diagonal that will contain non-zero correlation values. All correlations in bands beyond the kth band will be set to zero. k= can be specified with ar= or incorr=. When specified with ar=α, the correlation matrix has value α^j for all elements in the jth band (j=1, 2, ..., k) and zero for all bands beyond the kth. If specified with incorr=, any non-zero correlations specified beyond the kth band are ignored and set to zero. If omitted, k=m-1.
seed=number
Specifies an integer used to start the generation of random values. If you do not specify a seed, or if you specify a value less than or equal to zero, the seed is generated from the time of day on the computer's clock. If omitted, seed=0.
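
The optional parameters above can be combined in a single call, sketched here. The data set name Banded, the identifier names Patient and Visit, and all numeric values are hypothetical choices for illustration. The macro is assumed to have been defined with %inc.

```sas
/* Five binary variables with mean 0.3 and a banded AR structure:      */
/* correlations are non-zero only in the first two bands (k=2).        */
%RanMBin(means=5*0.3, ar=0.3, k=2,
         n=200, out=Banded, outshape=long,
         subject=Patient, within=Visit, seed=20240101)

/* Look at the first two generated sets (5 observations each). */
proc print data=Banded(obs=10) noobs;
run;
```

With outshape=long, n=200, and m=5, the descriptions above imply 1000 observations in Banded, each holding Patient, Visit, and the binary variable Y. Specifying a positive seed makes the generated values reproducible.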
DETAILS:
Wei et al. (2020) present algorithms that "generate high-dimensional binary data with specified correlation structures and unequal probabilities." The RanMBin macro implements three of these algorithms and can generate random values from m binary variables that have specified means (means= or inmeans=) and a correlation matrix with exchangeable (exch=), autoregressive AR(1) (ar=), or banded power-decaying structures (ar= with k=). Random values can also be generated when the correlation matrix is fully unstructured (incorr=).
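
For reference, the three structured correlation matrices can be written out as below. This is a reading of the exch=, ar=, and k= parameter descriptions, not a formula quoted from the macro or from Wei et al.; here \rho denotes the exch= value, \alpha the ar= value, and i, j index the m binary variables.

```latex
% Exchangeable (exch=\rho): every pair has the same correlation.
\operatorname{Corr}(Y_i, Y_j) = \rho, \qquad i \neq j

% AR(1) (ar=\alpha): correlation decays with the distance between indices.
\operatorname{Corr}(Y_i, Y_j) = \alpha^{|i-j|}, \qquad i \neq j

% Banded AR (ar=\alpha with k=k): AR(1) within the first k bands, zero beyond.
\operatorname{Corr}(Y_i, Y_j) =
\begin{cases}
  \alpha^{|i-j|}, & 1 \le |i-j| \le k \\
  0,              & |i-j| > k
\end{cases}
```

With the default k=m-1, every band is retained and the banded form reduces to the ordinary AR(1) structure.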

The macro creates an output data set (out=) with n observations (n=) and random values in variables Y1, Y2, ..., Ym if a wide data set structure (outshape=wide) is selected, or n×m observations if a long structure (outshape=long) is desired. The output data set contains a variable (subject=) that numbers the observations in the wide format. In the long format, each generated set of random values for the m variables creates m output observations that are placed in a single variable, Y. Also included is a variable (subject=) that identifies the set of random values and another variable (within=) that identifies the random variable in the set associated with each observation. Reproducible sets of random values can be obtained by specifying a seed (seed=).

Errors and constraints

As noted by Wei et al. (2020) and in "Generalized Estimating Equations" in the Details section of the GENMOD documentation, there are natural restrictions on the correlations among binary variables. Prentice (1988) developed formulas expressing the constraints and these are presented by Wei et al. The macro checks the specified means and correlations to verify that the Prentice constraints are satisfied. Any violations are displayed. When violations occur, changes to the specified means and/or correlations can be made in a subsequent run of the macro. In addition, the algorithm used when incorr= or k= is specified can fail when computation of a binomial probability results in an invalid value. A note is displayed in the log when either this failure or violations of the Prentice constraints occur.
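
To see what the Prentice constraints imply for a given pair of means before calling the macro, the bounds can be computed directly. The DATA step below is a stand-alone sketch and not the macro's own code. It uses the standard form of the bounds for two binary variables with means p1 and p2, where q = 1 - p: the correlation must lie between max(-sqrt(p1*p2/(q1*q2)), -sqrt(q1*q2/(p1*p2))) and min(sqrt(p1*q2/(p2*q1)), sqrt(p2*q1/(p1*q2))). The data set name Bounds and the variable names are hypothetical.

```sas
/* Prentice (1988) bounds on the correlation of two binary variables. */
data Bounds;
   p1 = 0.4;  p2 = 0.1;          /* example means (illustrative values) */
   q1 = 1 - p1;  q2 = 1 - p2;
   lower = max(-sqrt((p1*p2)/(q1*q2)), -sqrt((q1*q2)/(p1*p2)));
   upper = min( sqrt((p1*q2)/(p2*q1)),  sqrt((p2*q1)/(p1*q2)));
run;

proc print data=Bounds noobs;
   var p1 p2 lower upper;
run;
```

For means 0.4 and 0.1 the upper bound is about 0.41, so a requested correlation of 0.5 between those two variables would violate the constraint. Because the macro accepts only positive correlations, the upper bound is the one that matters in practice.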

The macro does not create any displayed output unless violations of the Prentice constraints are detected as noted above.

The RanMBin macro can be used to simulate multivariate binary data for many situations, such as when a binary response is measured over time or under many conditions. Such data is often modeled using a Generalized Estimating Equations (GEE) or random effects model such as those available in the GEE, GENMOD, and GLIMMIX procedures. The correlation structures available in the macro are commonly used in GEE models.
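
As one possible workflow, simulated long-format data can be passed to PROC GENMOD to fit a GEE model with a matching working correlation structure. The sketch below assumes the macro has been defined with %inc. The data set name Sim and the parameter values are hypothetical, while ID, SubID, and Y are the default output names described under USAGE.

```sas
/* Simulate 500 subjects, each with 4 binary responses, AR(1) correlation. */
%RanMBin(means=4*0.3, ar=0.4, n=500, outshape=long, out=Sim, seed=2718)

/* Intercept-only GEE with an AR(1) working correlation matrix.        */
/* DESCENDING models the probability that Y=1; CORRW prints the        */
/* estimated working correlation matrix.                               */
proc genmod data=Sim descending;
   class ID SubID;
   model Y = / dist=binomial link=logit;
   repeated subject=ID / within=SubID type=ar(1) corrw;
run;
```

If generation succeeds, the fitted intercept would be expected to lie near the logit of 0.3 and the estimated working correlation near 0.4, subject to sampling variability.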

Multiple mean vectors and/or correlation matrices

While the RanMBin macro does not directly support random value generation for multiple mean vectors and/or correlation matrices, this can be done using the RunBY macro. With it, you can run the RanMBin macro repeatedly for each mean vector in the inmeans= data set and/or each correlation matrix in the incorr= data set. See the RunBY macro documentation for details on its use. Also see the example titled "Multiple mean vectors" on the Results tab above.

LIMITATIONS:
All correlations must be positive.
REFERENCES:
Prentice, R. L. (1988), "Correlated Binary Regression with Covariates Specific to Each Binary Observation," Biometrics, 44, 1033-1048.

Wei J., Shuang S., Lin H., and Hongyu Z. (2020), "A Set of Efficient Methods to Generate High-Dimensional Binary Data With Specified Correlation Structures," The American Statistician, DOI: 10.1080/00031305.2020.1816213.

 




These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.