Cold deck imputation — coldeck.impute • describe

Cold deck imputation (CDI) is a univariate imputation technique in which each recipient (a respondent with a missing value) is matched to a donor drawn from an external, pre-existing data source. The donor is chosen for its similarity to the recipient across a subset of categorical or numerical predictors, and the donor's observed value fills the recipient's missing observation. Because donors are matched on predictor values, cold deck imputation has been used with stratification across categorical variables.
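The matching step described above can be sketched in base R. This is only an illustration under stated assumptions, not the code behind coldeck.impute: the toy data frames, the column names (x1, x2, region, y), and the choice of Euclidean distance are all hypothetical, since this page does not say which similarity measure the package uses.

```r
# Hypothetical toy data: 'recipients' is the current dataset with a missing y,
# 'donors' is the external, complete source (the "cold deck").
recipients <- data.frame(
  x1 = c(1.0, 4.0), x2 = c(2.0, 0.5),
  region = c("a", "b"), y = c(NA, 3.2),
  stringsAsFactors = FALSE
)
donors <- data.frame(
  x1 = c(0.9, 1.5, 3.0), x2 = c(2.1, 2.5, 0.0),
  region = c("b", "a", "a"), y = c(10, 20, 30),
  stringsAsFactors = FALSE
)

predictors <- c("x1", "x2")
filled <- recipients

for (i in which(is.na(recipients$y))) {
  # Stratify: only donors in the recipient's own category are eligible
  pool <- donors[donors$region == recipients$region[i], ]
  # Euclidean distance on the numeric predictors (an assumed similarity measure)
  target <- unlist(recipients[i, predictors])
  diffs <- sweep(as.matrix(pool[, predictors]), 2, target)
  d <- sqrt(rowSums(diffs^2))
  # Copy the closest eligible donor's value into the recipient
  filled$y[i] <- pool$y[which.min(d)]
}

filled
```

In this toy example the donor closest to the first recipient overall sits in region "b", so stratification excludes it and the recipient receives 20 from the nearest donor in region "a".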

The current implementation of CDI lets the user choose one of four selection methods: deterministic selection, random sampling from all possible donors, random sampling from the k nearest neighbors, and random sampling using weights as probabilities. The function iteratively imputes missing values across all variables with missing observations, using the selection method specified in the function arguments.
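To make the four selection rules concrete, here is a minimal base R sketch for a single recipient. It is an illustration, not the package's internals. The donor values and distances are made up, reading "deterministic" as "always take the closest donor" is an assumption, and so are the inverse-distance weights used for the weighted rule.

```r
# Hypothetical candidate donors for one recipient: their values and their
# distances to the recipient (smaller distance = more similar).
donor_values <- c(10, 20, 30, 40, 50)
dists <- c(0.4, 0.1, 0.9, 0.2, 0.6)
n <- length(donor_values)

set.seed(1)  # only the three random rules depend on the seed

# Deterministic: the same (closest) donor on every repetition
pick_det <- donor_values[which.min(dists)]

# Random from all donors: every donor is equally likely
pick_all <- donor_values[sample.int(n, 1)]

# Random from the k nearest donors
k <- 3
nearest <- order(dists)[seq_len(k)]
pick_knn <- donor_values[nearest[sample.int(k, 1)]]

# Weighted random: selection probability proportional to an assumed
# inverse-distance weight
w <- 1 / dists
pick_w <- donor_values[sample.int(n, 1, prob = w / sum(w))]

c(deterministic = pick_det, rand_from_all = pick_all,
  rand_nearest_k = pick_knn, weighted_rand = pick_w)
```

Indices are drawn with sample.int() rather than sample() on the values, which avoids the surprising behaviour of sample() when it is handed a single number.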

Usage

coldeck.impute(
  data,
  ext_data = NULL,
  method = "deterministic",
  k = NULL,
  seed = NULL,
  na.rm = TRUE
)

Arguments

data

a matrix or data frame containing missing values in at least one predictor

ext_data

external data source of complete cases to be used as donor values

method

selection method for imputing missing values based on donor similarity. Can be one of:

"deterministic": Select the same donor value for multiple repetitions of the CDI.
"rand_from_all": Select a different donor value for each repetition of the CDI.
"rand_nearest_k": Select one random donor value from a subset of k nearest neighbors for each repetition of the CDI.
"weighted_rand": Select one random donor through a probability-weighted choice for each repetition of the CDI.

k

number of nearest neighbors to select from when using the rand_nearest_k method.

seed

a numeric seed that makes results reproducible; it applies to every method except deterministic selection

na.rm

logical indicating whether NA values are removed from every row in the matrix or data frame

Value

a matrix or data frame of imputed values

Details

CDI is valuable when a reliable external data source is available. It can yield more standardized imputations, particularly in large-scale studies or when historical consistency is important. However, it requires care: mismatches between the current dataset and the external source can introduce bias.
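One rough way to look for such mismatches before imputing is to compare each variable's distribution in the two sources. The sketch below uses simulated data and a hypothetical helper, std_diff, that computes a standardized mean difference. Neither is part of this package, and any cut-off for what counts as "too different" is a judgment call.

```r
set.seed(42)

# Simulated stand-ins: 'current' has missing values, 'external' is complete
# and has a deliberately shifted V1.
current <- data.frame(V1 = rnorm(200, 0, 1), V2 = rnorm(200, 0, 2))
current$V1[sample.int(200, 30)] <- NA
external <- data.frame(V1 = rnorm(200, 0.8, 1), V2 = rnorm(200, 0, 2))

# Hypothetical helper: difference in means divided by the pooled standard
# deviation, ignoring missing values.
std_diff <- function(a, b) {
  pooled_sd <- sqrt((var(a, na.rm = TRUE) + var(b, na.rm = TRUE)) / 2)
  (mean(b, na.rm = TRUE) - mean(a, na.rm = TRUE)) / pooled_sd
}

# One standardized difference per variable, matched by column position
smd <- mapply(std_diff, current, external)
round(smd, 2)
```

A large absolute value for a variable (here V1, by construction) suggests that donor values taken from the external source would pull the imputed data away from the current sample.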

Examples

data <- gen.mcar(100, rho = c(.56, .23, .18), sigma = c(1, 2, .5), n_vars = 3, na_prob = .18)
ext_data <- gen.mcar(100, rho = c(.45, .26, .21), sigma = c(1.67, 2.23, .56), n_vars = 3, na_prob = 0)
cold_data <- coldeck.impute(data, ext_data)
summary(cold_data)
#>        V1                 V2                V3          
#>  Min.   :-2.46590   Min.   :-4.4916   Min.   :-0.95783  
#>  1st Qu.:-0.44313   1st Qu.:-0.9340   1st Qu.:-0.36091  
#>  Median : 0.01414   Median : 0.4071   Median : 0.02361  
#>  Mean   : 0.06888   Mean   : 0.2878   Mean   : 0.01502  
#>  3rd Qu.: 0.64923   3rd Qu.: 1.5895   3rd Qu.: 0.29061  
#>  Max.   : 2.57146   Max.   : 5.9800   Max.   : 1.33431