Generate correlated, synthetic normal variables with a user-specified probability of values missing at random (MAR). Specify the column length, correlation coefficients, standard deviations, number of columns, and desired probability of missing values to obtain a data frame of correlated observations with missing values.
Arguments
- len
number of rows per column
- rho
desired correlation coefficients of the generated variables. The length of rho must equal n_vars * (n_vars - 1) / 2, that is, one coefficient per pair of variables.
- sigma
desired standard deviation for each generated variable.
- n_vars
total number of variables to be generated. At least two variables must be provided.
- na_prob
desired probability of missingness in each variable. Defaults to 10%.
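The length requirement on rho is simply the number of distinct variable pairs. As a quick arithmetic check in base R (the helper name `expected_rho_length` is hypothetical and not part of the package):

```r
# Hypothetical helper, not part of the package: number of pairwise
# correlations needed for n_vars variables.
expected_rho_length <- function(n_vars) {
  stopifnot(n_vars >= 2)
  n_vars * (n_vars - 1) / 2
}

expected_rho_length(3)  # 3
expected_rho_length(4)  # 6
```

With three variables, rho therefore needs three values, which matches the call in the Examples section.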
Details
The gen.mar algorithm randomly picks a causative feature (column) and
uses the lowest indices of this causative column to assign missing values to
the remaining columns. Note that there are other ways of creating values
missing at random, for example, using conditional probabilities and set thresholds,
or logistic regression-based masking; however, the truncation algorithm
used here is a straightforward implementation of the MAR mechanism.
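The truncation idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's actual implementation: the function name `sketch_mar` is hypothetical, rho is assumed to fill the upper triangle of the correlation matrix column by column, and "lowest" is read as the rows holding the smallest values of the causative column.

```r
# Illustrative sketch only; NOT the package's gen.mar implementation.
# Assumptions: rho fills the upper triangle of the correlation matrix
# column by column, and the rows to mask are those with the smallest
# values in the causative column.
sketch_mar <- function(len, rho, sigma, n_vars, na_prob = 0.1) {
  stopifnot(n_vars >= 2, length(rho) == n_vars * (n_vars - 1) / 2)

  # Build a symmetric correlation matrix from the pairwise correlations.
  R <- diag(n_vars)
  R[upper.tri(R)] <- rho
  R[lower.tri(R)] <- t(R)[lower.tri(R)]

  # Scale to a covariance matrix and draw correlated normals via Cholesky.
  D <- diag(sigma, n_vars)
  S <- D %*% R %*% D
  X <- matrix(rnorm(len * n_vars), nrow = len) %*% chol(S)

  # Pick the causative column at random, then mask the rows holding its
  # lowest values in every other column.
  cause <- sample(n_vars, 1)
  n_na <- round(na_prob * len)
  rows <- order(X[, cause])[seq_len(n_na)]
  X[rows, -cause] <- NA

  as.data.frame(X)
}

set.seed(1)
out <- sketch_mar(50, c(.25, .75, .044), c(1.1, .56, 1.56), 3, .15)
colSums(is.na(out))  # one column has no NAs; the others share the same count
```

Because missingness in the masked columns depends only on the fully observed causative column, the pattern is MAR rather than missing completely at random.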
Examples
# 50 rows, 3 variables, 15% missingness; rho has 3 * (3 - 1) / 2 = 3 entries
syn_na <- gen.mar(50, c(.25, .75, .044), c(1.1, .56, 1.56), 3, .15)
summary(syn_na)
#> V1 V2 V3
#> Min. :-1.86105 Min. :-1.20421 Min. :-1.9815
#> 1st Qu.:-0.66588 1st Qu.:-0.41935 1st Qu.:-0.4629
#> Median : 0.06021 Median :-0.09583 Median : 0.3545
#> Mean : 0.06996 Mean :-0.01700 Mean : 0.4134
#> 3rd Qu.: 0.60729 3rd Qu.: 0.42676 3rd Qu.: 1.3482
#> Max. : 2.09555 Max. : 1.66331 Max. : 4.0359
#> NA's :8 NA's :8