66969 - Generate multivariate binary data with specified means and correlation matrix

Contents:

Purpose / History / Requirements / Usage / Details / Limitations / References

PURPOSE:

The RanMBin macro generates random values for multiple binary variables with specified means and a specified correlation matrix. It creates an output data set with the specified number of observations, arranged in either wide or long format. Exchangeable and autoregressive AR(1) correlation structures are supported, as are unstructured correlation matrices. Banded correlation structures with autoregressive, power-decaying correlations are also handled.

HISTORY:

The version of the RanMBin macro that you are using is displayed when you specify version (or any string) as the first argument. For example:

    %RanMBin(version, )

The RanMBin macro always attempts to check for a later version of itself. If it is unable to do this (such as if there is no active internet connection available), the macro will issue the following message:

   NOTE: Unable to check for newer version

The computations performed by the macro are not affected by the appearance of this message.

Version	Update Notes
1.1	Corrected Prentice constraint computation
1.0	Initial coding

REQUIREMENTS:

Only Base SAS® is required.

USAGE:

Follow the instructions on the Downloads tab of this sample to save the RanMBin macro definition. Replace the text within quotation marks in the following statement with the location of the RanMBin macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the RanMBin macro and make it available for use:

   %inc "<location of your file containing the RanMBin macro>";

Following this statement, you can call the RanMBin macro. See the Results tab for examples.

The following parameters are required when using the RanMBin macro:

means=value-list , or
inmeans=data-set

Either means= or inmeans= is required to specify the number of binary variables, m, to be created and their means. If means= and inmeans= are both specified, inmeans= is ignored.

If inmeans= is specified, only the first observation in the data set is read. The number of numeric variables in the data set determines the number of binary variables to be created, and the values of all numeric variables are used as the means of the desired binary variables. Each value must be a real number between 0 and 1. The names of the numeric variables in the data set are not restricted.

If means= is specified, value-list is a space-separated list of real numbers between 0 and 1. Additionally, any value in the list can be integer*value, where integer is a positive integer to specify that value should be replicated integer times. For example, means=0.4 2*0.1 3*0.15 is equivalent to means=0.4 0.1 0.1 0.15 0.15 0.15.
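As a minimal sketch of the two ways to supply means, the calls below use only parameters described on this page. The data set name MeansDS, the variable names p1-p4, and all numeric values are hypothetical, and exch= is included only because a correlation specification is required; whether a given combination passes the macro's constraint checks depends on the values chosen.

```sas
/* Means given directly: 2*0.1 expands to 0.1 0.1, so six variables are requested */
%RanMBin(means=0.4 2*0.1 3*0.15, exch=0.1)

/* Means read from the first observation of a data set (hypothetical names) */
data MeansDS;
   input p1-p4;
   datalines;
0.2 0.3 0.3 0.5
;

/* Four numeric variables in MeansDS, so four binary variables are requested */
%RanMBin(inmeans=MeansDS, exch=0.1)
```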

exch=value , or
ar=value , or
incorr=data-set

One of exch=, ar=, or incorr= must be specified to determine the correlations among the m binary variables to be created. Specify exch= to create a correlation matrix with exchangeable structure, in which every pair of variables has the specified correlation value. Specify ar= to create an autoregressive structure; see k= for how it modifies that structure. If exch= or ar= is specified, value must be a real number between 0 and 1.

If incorr= is specified, the data set must have exactly m numeric variables. The names of the numeric variables in the data set are not restricted. The observations in the data set must contain the desired correlation matrix, and all correlations must be positive. The ith row and ith column in the data set should contain the correlations corresponding to the ith mean. Only the values in the upper triangle, above the main diagonal, are read. Values on and below the main diagonal are ignored.
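The following sketch shows one way an incorr= data set might be laid out for m=3 variables. The data set name CorrDS, the variable names c1-c3, the output name Sim, and all numeric values are hypothetical. The diagonal and lower triangle are filled in only for readability, since the macro is documented to read only the values above the main diagonal.

```sas
/* Row i and column i correspond to the ith mean in means= */
data CorrDS;
   input c1-c3;
   datalines;
1    0.20 0.10
0.20 1    0.15
0.10 0.15 1
;

/* Unstructured correlation matrix supplied through incorr= */
%RanMBin(means=0.3 0.4 0.5, incorr=CorrDS, out=Sim, n=1000, seed=12345)
```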

The following parameters are optional:

out=data-set: Specifies the name of the data set to be created that contains the generated random variables. The variables, the number of observations, and the shape of the data set depend on outshape=. See the descriptions of outshape= and n= for details. If omitted, out=Mbin.
outshape=wide|long: Specifies the shape and structure of the out= data set. If outshape=wide, each observation contains one set of m random values (either 0 or 1) for the m binary variables, which are named Y1-Ym, and the data set contains n observations as specified by n=. If outshape=long, each set of m random values creates m observations, all random values are placed in a single binary variable named Y, and the data set contains n × m observations. In either shape, the probability that a random value equals 1 is the corresponding mean specified in means= or inmeans=. A numeric variable identifying the separate sets of generated random values is included in the data set. It is named ID by default and has values 1, 2, and so on. If outshape=long, an additional variable is added to identify the binary variable within each generated set. It is named SubID by default and has values 1, 2, ..., m in each set. If omitted, outshape=wide.
n=n: Specifies the number of sets of random values to create in the out= data set. Each set contains m random values for the m binary variables with the specified means and correlation matrix. Each value is either 0 or 1. If omitted, n=100.
subject=variable-name: Specifies the name of the numeric variable in the out= data set that identifies each set of m generated random values. See the description of outshape= for details. If omitted, subject=ID.
within=variable-name: Specifies the name of the numeric variable in the out= data set that identifies the binary variable within each set of m generated random values when outshape=long. within= is ignored if outshape=wide. See the description of outshape= for details. If omitted, within=SubID.
k=k: Specifies the number of bands, k=1, 2, ..., m-1, parallel to the main diagonal that contain nonzero correlation values. All correlations in bands beyond the kth band are set to zero. k= can be specified with ar= or incorr=. When specified with ar=α, the correlation matrix has value α^j for all elements in the jth band (j = 1, ..., k) and zero in all bands beyond the kth. When specified with incorr=, any nonzero correlations beyond the kth band are ignored and set to zero. If omitted, k=m-1.
seed=number: Specifies an integer used to start the generation of random values. If you do not specify a seed, or if you specify a value less than or equal to zero, the seed is generated from the time of day on the computer's clock. If omitted, seed=0.
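To illustrate how the optional parameters combine, here is a hedged sketch of a long-format request with custom identifier names. The names LongSim, Patient, and Visit and all numeric values are hypothetical choices, not defaults. The PROC MEANS step is just one quick way to compare the simulated proportions against the requested means.

```sas
/* 200 sets of 4 correlated binary values, banded AR(1) with 2 nonzero bands */
%RanMBin(means=4*0.3, ar=0.5, k=2, n=200, outshape=long,
         out=LongSim, subject=Patient, within=Visit, seed=2024)

/* Proportion of 1s in Y for each position within a set; each should be near 0.3 */
proc means data=LongSim mean;
   class Visit;
   var Y;
run;
```

With these settings the output data set should contain 200 × 4 = 800 observations, one per Patient and Visit combination, as described for outshape=long.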

DETAILS:

Wei et al. (2020) present algorithms that "generate high-dimensional binary data with specified correlation structures and unequal probabilities." The RanMBin macro implements three of these algorithms and can generate random values from m binary variables that have specified means (means= or inmeans=) and a correlation matrix with exchangeable (exch=), autoregressive AR(1) (ar=), or banded power-decaying (ar= with k=) structure. Random values can also be generated when the correlation matrix is fully unstructured (incorr=).
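The banded power-decaying structure can be visualized with a short DATA step. This is an illustration of the target correlation matrix only, not the macro's internal code, and the data set name BandedAR and the values m=5, alpha=0.5, and k=2 are assumptions chosen for the example. Element (i, j) is alpha raised to the power |i-j| when |i-j| is at most k, and zero otherwise.

```sas
/* Illustration only: banded AR(1) correlation matrix for m=5, alpha=0.5, k=2 */
data BandedAR;
   array r[5] r1-r5;
   alpha = 0.5;
   k = 2;
   m = 5;
   do i = 1 to m;
      do j = 1 to m;
         lag = abs(i - j);                         /* band index of element (i, j) */
         if lag = 0 then r[j] = 1;                 /* main diagonal */
         else if lag <= k then r[j] = alpha**lag;  /* power decay inside the bands */
         else r[j] = 0;                            /* zero beyond the kth band */
      end;
      output;                                      /* one observation per matrix row */
   end;
   keep r1-r5;
run;

proc print data=BandedAR noobs;
run;
```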

The macro creates an output data set (out=). If a wide structure (outshape=wide) is selected, the data set has n observations (n=) with random values in variables Y1, Y2, ..., Ym, and a variable (subject=) numbers the observations. If a long structure (outshape=long) is selected, the data set has n × m observations: each generated set of random values for the m variables creates m observations, with the values placed in a single variable, Y. The long data set also includes a variable (subject=) that identifies the set of random values and another variable (within=) that identifies the random variable in the set associated with each observation. Reproducible sets of random values can be obtained by specifying a seed (seed=).
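One way to sanity-check a wide-format result is to compare sample means and sample correlations against the requested values. The sketch below relies on the documented defaults (out=Mbin, variables Y1-Ym); the means, correlation, sample size, and seed are hypothetical. Sample statistics should approach the targets as n grows but will not match them exactly.

```sas
/* Wide output goes to the default data set Mbin with variables Y1-Y4 */
%RanMBin(means=0.3 0.4 0.5 0.6, exch=0.2, n=5000, seed=98765)

/* Sample means should be close to 0.3, 0.4, 0.5, and 0.6 */
proc means data=Mbin n mean;
   var Y1-Y4;
run;

/* Off-diagonal sample correlations should be close to 0.2 */
proc corr data=Mbin nosimple;
   var Y1-Y4;
run;
```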

Errors and constraints

As noted by Wei et al. (2020) and in "Generalized Estimating Equations" in the Details section of the GENMOD documentation, there are natural restrictions on the correlations among binary variables. Prentice (1988) developed formulas expressing these constraints, and they are presented by Wei et al. The macro checks the specified means and correlations to verify that the Prentice constraints are satisfied, and any violations are displayed. When violations occur, you can change the specified means, the correlations, or both in a subsequent run of the macro. In addition, the algorithm used when incorr= or k= is specified can fail when computation of a binomial probability results in an invalid value. A note is displayed in the log when either this failure or violations of the Prentice constraints occur.
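For a pair of binary variables with means p_i and p_j, the Prentice (1988) bounds can be written in terms of the odds o = p/(1-p): the correlation must lie between max(-sqrt(o_i*o_j), -sqrt(1/(o_i*o_j))) and min(sqrt(o_i/o_j), sqrt(o_j/o_i)). The DATA step below is a minimal sketch for checking candidate means before calling the macro. It is not the macro's own code, the data set name PrenticeBounds and the means are hypothetical, and the macro's algorithms may impose additional conditions beyond these pairwise bounds.

```sas
/* Illustration only: pairwise Prentice bounds on the correlation */
data PrenticeBounds;
   array p[3] _temporary_ (0.4 0.1 0.15);   /* hypothetical means */
   do i = 1 to dim(p) - 1;
      do j = i + 1 to dim(p);
         oi = p[i] / (1 - p[i]);             /* odds for variable i */
         oj = p[j] / (1 - p[j]);             /* odds for variable j */
         lower = max(-sqrt(oi * oj), -sqrt(1 / (oi * oj)));
         upper = min(sqrt(oi / oj), sqrt(oj / oi));
         output;                             /* one observation per pair (i, j) */
      end;
   end;
   keep i j lower upper;
run;

proc print data=PrenticeBounds noobs;
run;
```

Because the macro accepts only positive correlations, the upper bound is the one that matters in practice: a requested correlation for pair (i, j) that exceeds upper violates the constraint.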

The macro does not create any displayed output unless violations of the Prentice constraints are detected as noted above.

The RanMBin macro can be used to simulate multivariate binary data for many situations, such as when a binary response is measured over time or under many conditions. Such data are often modeled using a Generalized Estimating Equations (GEE) model or a random effects model, such as those available in the GEE, GENMOD, and GLIMMIX procedures. The correlation structures available in the macro are commonly used in GEE models.

Multiple mean vectors and/or correlation matrices

While the RanMBin macro does not directly support random value generation for multiple mean vectors and/or correlation matrices, this can be done using the RunBY macro. With it, you can run the RanMBin macro repeatedly for each mean vector in the inmeans= data set and/or each correlation matrix in the incorr= data set. See the RunBY macro documentation for details on its use. Also see the example titled "Multiple mean vectors" on the Results tab.

LIMITATIONS:

All correlations must be positive.

REFERENCES:

Prentice, R. L. (1988), "Correlated Binary Regression with Covariates Specific to Each Binary Observation," Biometrics, 44, 1033-1048.

Wei J., Shuang S., Lin H., and Hongyu Z. (2020), "A Set of Efficient Methods to Generate High-Dimensional Binary Data With Specified Correlation Structures," The American Statistician, DOI: 10.1080/00031305.2020.1816213.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

Type:	Sample
Topic:	Analytics ==> Simulation SAS Reference ==> Macro

Date Modified:	2021-10-14 13:05:26
Date Created:	2020-11-18 14:58:29

Product Family	Product	Host	SAS Release (Starting)	SAS Release (Ending)
SAS System	N/A	Aster Data nCluster on Linux x64
		DB2 Universal Database on AIX
		DB2 Universal Database on Linux x64
		Netezza TwinFin 32-bit SMP Hosts
		Netezza TwinFin 32bit blade
		Netezza TwinFin 64-bit S-Blades
		Netezza TwinFin 64-bit SMP Hosts
		Teradata on Linux
		Cloud Foundry
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for AArch64
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX
		z/OS
		z/OS 64-bit
		IBM AS/400
		OpenVMS VAX
		N/A
		Android Operating System
		Apple Mobile Operating System
		Chrome Web Browser
		Macintosh
		Macintosh on x64
		Microsoft Windows 10
		Microsoft Windows 7
		Microsoft Windows 8 Enterprise 32-bit
		Microsoft Windows 8 Enterprise x64
		Microsoft Windows 8 Pro 32-bit
		Microsoft Windows 8 Pro x64
		Microsoft Windows 8 x64
		Microsoft Windows Server 2008 R2
		Microsoft Windows Server 2012 R2 Datacenter
		Microsoft Windows Server 2012 R2 Std
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		SAS Cloud
		Microsoft Windows 8.1 Enterprise 32-bit
		Microsoft Windows 8.1 Enterprise x64
		Microsoft Windows 8.1 Pro 32-bit
		Microsoft Windows 8.1 Pro x64
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows Server 2003 for x64
		Microsoft Windows Server 2008
		Microsoft Windows Server 2008 for x64
		Microsoft Windows Server 2012 Datacenter
		Microsoft Windows Server 2012 Std
		Microsoft Windows Server 2016
		Microsoft Windows Server 2019
		Microsoft Windows XP Professional
		Windows 7 Enterprise 32 bit
		Windows 7 Enterprise x64
		Windows 7 Home Premium 32 bit
		Windows 7 Home Premium x64
		Windows 7 Professional 32 bit
		Windows 7 Professional x64
		Windows 7 Ultimate 32 bit
		Windows 7 Ultimate x64
		Windows Millennium Edition (Me)
		Windows Vista
		Windows Vista for x64
