Generate correlated, synthetic normal variables with a user-specified probability of values missing at random (MAR). Specify the column length, correlation coefficients, standard deviations, number of columns, and desired probability of missing values to obtain a data frame of correlated observations with missing values.

Usage

gen.mar(len, rho, sigma, n_vars, na_prob = 0.1)

Arguments

len

number of rows per column

rho

desired correlation coefficients of the generated variables. The length of rho must equal n_vars * (n_vars - 1) / 2, i.e. one coefficient per pair of variables (for example, 3 coefficients when n_vars is 3).

sigma

desired standard deviation for each generated variable.

n_vars

total number of variables to be generated. Must be at least 2.

na_prob

desired probability of missingness in each variable. Defaults to 0.1 (10%).
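
To illustrate how rho, sigma, and n_vars relate, the following is a minimal base R sketch of one common way to draw correlated normal variables from pairwise coefficients, using a Cholesky factor of the covariance matrix. It is not taken from the gen.mar source: the helper name rho_to_matrix and the order in which rho fills the matrix (lower triangle, column by column) are assumptions made for illustration only.

```r
# Hypothetical helper (not part of the package): build a symmetric correlation
# matrix from a vector of pairwise coefficients.
# ASSUMPTION: rho fills the lower triangle column by column.
rho_to_matrix <- function(rho, n_vars) {
  stopifnot(n_vars >= 2, length(rho) == choose(n_vars, 2))
  R <- diag(n_vars)                        # ones on the diagonal
  R[lower.tri(R)] <- rho                   # pairwise coefficients
  R[upper.tri(R)] <- t(R)[upper.tri(R)]    # mirror to make R symmetric
  R
}

len    <- 50
n_vars <- 3
sigma  <- c(1.1, 0.56, 1.56)
R      <- rho_to_matrix(c(0.25, 0.75, 0.044), n_vars)

# Covariance matrix: scale the correlations by the standard deviations.
Sigma <- diag(sigma) %*% R %*% diag(sigma)

# Independent standard normals times the upper Cholesky factor of Sigma
# give draws whose population covariance is Sigma.
set.seed(42)
Z <- matrix(rnorm(len * n_vars), nrow = len)
X <- as.data.frame(Z %*% chol(Sigma))

dim(X)   # len rows, n_vars columns
```

With only 50 rows, the sample correlations from cor(X) will scatter around the requested values rather than match them exactly.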

Value

a data frame of at least 2 columns

Details

The gen.mar algorithm randomly picks a causative feature (column) and uses the lowest indices of this causative column to assign missing values to the remaining columns. There are other ways of creating values missing at random, for example conditional probabilities with set thresholds, or logistic regression-based masking; however, the truncation algorithm used here is a straightforward implementation of the MAR mechanism.
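
The truncation idea described above can be sketched in base R as follows. This is an illustrative approximation, not the package's actual code: the function name mask_mar_sketch is hypothetical, "lowest indices" is read here as the rows holding the smallest values of the causative column, and rounding the number of masked rows up with ceiling() is an assumption.

```r
# Hypothetical sketch of truncation-based MAR masking (not the gen.mar source).
mask_mar_sketch <- function(df, na_prob = 0.1) {
  stopifnot(ncol(df) >= 2, na_prob >= 0, na_prob <= 1)
  n_na  <- ceiling(nrow(df) * na_prob)     # ASSUMPTION: round up
  cause <- sample(ncol(df), 1)             # randomly chosen causative column
  # Rows with the smallest values of the causative column
  # (one possible reading of "lowest indices").
  idx <- order(df[[cause]])[seq_len(n_na)]
  # Mask those rows in every other column; the causative column stays complete.
  for (j in setdiff(seq_along(df), cause)) {
    df[idx, j] <- NA
  }
  df
}

set.seed(1)
toy    <- data.frame(V1 = rnorm(50), V2 = rnorm(50), V3 = rnorm(50))
masked <- mask_mar_sketch(toy, na_prob = 0.15)
colSums(is.na(masked))   # one column has no NAs, the other two have 8 each
```

Because missingness in the masked columns depends only on the fully observed causative column, the result is missing at random rather than missing completely at random.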

Examples

syn_na <- gen.mar(50,c(.25,.75,.044),c(1.1,.56,1.56),3,.15)
summary(syn_na)
#>        V1                 V2                 V3         
#>  Min.   :-1.86105   Min.   :-1.20421   Min.   :-1.9815  
#>  1st Qu.:-0.66588   1st Qu.:-0.41935   1st Qu.:-0.4629  
#>  Median : 0.06021   Median :-0.09583   Median : 0.3545  
#>  Mean   : 0.06996   Mean   :-0.01700   Mean   : 0.4134  
#>  3rd Qu.: 0.60729   3rd Qu.: 0.42676   3rd Qu.: 1.3482  
#>  Max.   : 2.09555   Max.   : 1.66331   Max.   : 4.0359  
#>                     NA's   :8          NA's   :8