Predictive mean matching imputation (PMM)

Predictive mean matching (PMM) is an imputation technique introduced by Donald B. Rubin in 1987. Because it fills each gap with a value that was actually observed, it maintains the natural variability of the data and avoids the implausible imputations that can occur with other univariate imputation methods.

Usage

pmean.match(
  data,
  family = "AUTO",
  robust = FALSE,
  k = 3,
  char_to_factor = FALSE,
  seed = NULL,
  verbose = FALSE
)

Arguments

data: a numeric matrix or data frame of at least 2 columns.
family: the distribution family of the observations. Defaults to 'AUTO', which automatically selects a distribution family (gaussian, binomial, multinomial) based on the type of each variable (numeric or factor). The distribution family dictates the regression model used (lm, glm, multinom). The user can also set family explicitly to match the distribution of the response variable, in which case the function uses the corresponding generalized linear model or beta regression.
robust: logical indicating whether to use robust estimation methods. If set to 'TRUE', the function uses robust linear and generalized linear models to make its predictions. Defaults to 'FALSE'.
k: numeric value giving the number of nearest neighbors from which the imputed value is sampled. Defaults to 3.
char_to_factor: logical indicating whether to convert character variables to unordered factor variables. Defaults to 'FALSE'.
seed: numeric value used to make results reproducible. With the same seed, the same donor values are sampled on every run. Defaults to 'NULL'.
verbose: logical indicating whether error handling should be verbose. Defaults to 'FALSE'.
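
As a rough illustration of what type-based family selection could look like, the sketch below maps a column to a family name. The helper guess_family() is hypothetical and not part of this package, and the rule that separates two-level factors (binomial) from factors with more levels (multinomial) is an assumption, not the documented behavior of family = 'AUTO'.

```r
# Hypothetical helper: pick a family name from a column's type.
# The two-level vs. multi-level factor split is an assumption.
guess_family <- function(x) {
  if (is.numeric(x)) {
    "gaussian"
  } else if (is.factor(x) && nlevels(x) == 2) {
    "binomial"
  } else if (is.factor(x)) {
    "multinomial"
  } else {
    stop("unsupported column type: ", class(x)[1])
  }
}

dat <- data.frame(
  y = c(1.2, 3.4, 5.6, 7.8),
  b = factor(c("no", "yes", "no", "yes")),
  g = factor(c("a", "b", "c", "a"))
)
families <- vapply(dat, guess_family, character(1))
print(families)
```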

Value

a matrix or data frame containing the imputed dataset.

Details

How is predictive mean matching different from conditional mean imputation (CMI)?

PMM is a combination of CMI and hot deck imputation (HDI). Like CMI, PMM uses a regression on the observed variables to predict each missing value. Unlike CMI, it does not impute the prediction itself: it fills in the missing value by randomly sampling from the observed values whose predicted values are closest to the predicted value of the missing observation. Donors are currently selected with a nearest neighbor approach; the number of neighbors defaults to 3 and can be changed through the k argument to suit your data.
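
The matching step can be sketched in base R for the simplest case: one numeric variable with missing values and fully observed predictors. The helper pmm_sketch() and the toy data are hypothetical and only illustrate the idea; this is not the implementation behind pmean.match, which also handles other families and robust models.

```r
# Hypothetical helper: PMM for a single numeric target column.
# Assumes all other columns are fully observed predictors.
pmm_sketch <- function(df, target, k = 3, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  miss <- is.na(df[[target]])
  if (!any(miss)) return(df)

  # 1. Regress the target on the other columns, using observed rows only.
  fit <- stats::lm(stats::as.formula(paste(target, "~ .")),
                   data = df[!miss, , drop = FALSE])

  # 2. Predicted means for every row, observed and missing alike.
  pred <- stats::predict(fit, newdata = df)
  pred_obs <- pred[!miss]
  y_obs <- df[[target]][!miss]

  # 3. For each missing row, take the k observed rows with the closest
  #    predicted mean and draw one of their observed values at random.
  df[[target]][miss] <- vapply(pred[miss], function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(min(k, length(y_obs)))]
    y_obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
  df
}

# Toy data: y depends linearly on x, with 10 values removed.
set.seed(1)
toy <- data.frame(x = stats::rnorm(50))
toy$y <- 2 * toy$x + stats::rnorm(50)
toy$y[sample.int(50, 10)] <- NA
imputed <- pmm_sketch(toy, "y", k = 3, seed = 42)
```

Every imputed value is copied from an observed donor, so the imputations always stay within the range of the observed data, which is what distinguishes PMM from imputing the regression prediction directly.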

Examples

set.seed(123)
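# Simulate three numeric variables
data <- data.frame(x1 = stats::rnorm(100), x2 = stats::rnorm(100), y = stats::rnorm(100))
# Introduce missing values at random positions in each column
data$x1[sample(1:100, 20)] <- NA
data$x2[sample(1:100, 15)] <- NA
data$y[sample(1:100, 10)] <- NA
# Add a fully observed five-level factor
fact_dat <- data.frame(data, c = gl(5, 20))
# Impute the missing values using robust regression models
matched_data <- pmean.match(fact_dat, robust = TRUE)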
summary(matched_data)
#>        x1                x2                 y            c     
#>  Min.   :-2.3092   Min.   :-2.05325   Min.   :-1.75653   1:20  
#>  1st Qu.:-0.4136   1st Qu.:-0.80110   1st Qu.:-0.53131   2:20  
#>  Median : 0.2173   Median :-0.24669   Median : 0.07063   3:20  
#>  Mean   : 0.2115   Mean   :-0.08539   Mean   : 0.12123   4:20  
#>  3rd Qu.: 0.7904   3rd Qu.: 0.53973   3rd Qu.: 0.69013   5:20  
#>  Max.   : 2.1873   Max.   : 3.24104   Max.   : 2.19881