Cold deck imputation (CDI) is a univariate imputation technique in which, for each respondent (recipient) with a missing value, an external, pre-existing data source is searched for a donor with similar values across a subset of categorical or numerical predictors, and that donor's value is used to fill the recipient's missing observation. Because donors are matched on predictor values, cold deck imputation has been used with stratification across categorical variables.
The current implementation of CDI allows the user to choose one of four selection methods: deterministic selection, random sampling from all possible donors, random sampling from the k nearest neighbors, and random sampling using weights as probabilities. The function iteratively imputes missing values across all variables with missing observations, using the selection method specified in the function arguments.
Usage
coldeck.impute(
data,
ext_data = NULL,
method = "deterministic",
k = NULL,
seed = NULL,
na.rm = TRUE
)
Arguments
- data
a matrix or data frame containing missing values in at least one predictor
- ext_data
external data source of complete cases to be used as donor values
- method
selection method for imputing missing values based on donor similarity. Can be one of:
"deterministic"
Select the same donor value for multiple repetitions of the CDI.
"rand_from_all"
Select a different donor value for each repetition of the CDI.
"rand_nearest_k"
Select one random donor value from a subset of k nearest neighbors for each repetition of the CDI.
"weighted_rand"
Select one random donor through a probability-weighted choice for each repetition of the CDI.
- k
number of nearest neighbors to select from when using the rand_nearest_k method
- seed
a numeric seed for reproducible results for every method except deterministic selection
- na.rm
indicates removal of NA values from every row in the matrix or data frame
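The four selection methods listed under method can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name select_donor, the use of Euclidean distance as the similarity measure, and the inverse-distance weights for "weighted_rand" are all assumptions made for illustration.

```r
# Hypothetical helper (not part of the package): return the row index of one
# donor for a single recipient. Similarity is ASSUMED to be Euclidean distance
# over the predictors that are observed for the recipient.
select_donor <- function(recipient, donors, method = "deterministic", k = NULL) {
  donors <- as.matrix(donors)
  obs <- !is.na(recipient)                       # predictors observed for the recipient
  diffs <- sweep(donors[, obs, drop = FALSE], 2, recipient[obs])
  d <- sqrt(rowSums(diffs^2))                    # distance from each donor to the recipient
  n <- length(d)
  switch(method,
    deterministic  = which.min(d),               # always the single closest donor
    rand_from_all  = sample.int(n, 1),           # any donor, with equal probability
    rand_nearest_k = {
      nearest <- order(d)[seq_len(min(k, n))]    # indices of the k closest donors
      nearest[sample.int(length(nearest), 1)]    # one of them, chosen at random
    },
    # ASSUMED weighting: closer donors are proportionally more likely
    weighted_rand  = sample.int(n, 1, prob = 1 / (d + 1e-8)),
    stop("unknown method: ", method)
  )
}

# Three complete donors; the recipient is missing its third value.
donors <- matrix(c(0, 0, 10,
                   1, 1, 20,
                   5, 5, 30), ncol = 3, byrow = TRUE)
recipient <- c(0.9, 1.2, NA)

idx <- select_donor(recipient, donors, "deterministic")
recipient[3] <- donors[idx, 3]                   # copy the donor's value into the gap
```

In this sketch only "deterministic" is independent of the random number generator, which is consistent with the seed argument applying to every method except deterministic selection.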
Details
CDI is valuable when a reliable external source of data is available: it can yield more standardized imputations, particularly in large-scale studies or when historical consistency is important. However, it can introduce bias when the current dataset and the external source differ systematically, so the two should be compared before imputing.
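One simple way to look for such mismatches is to compare summary statistics of the shared variables in both sources before imputing. The sketch below uses base R only; the helper name compare_sources is hypothetical and not part of the package, and comparing means and standard deviations is just one possible check, assuming all shared columns are numeric.

```r
# Hypothetical helper (not part of the package): compare per-column means and
# standard deviations of the working data (ignoring NAs) with those of the
# external donor source. Assumes the shared columns are numeric.
compare_sources <- function(data, ext_data) {
  data <- as.data.frame(data)
  ext_data <- as.data.frame(ext_data)
  shared <- intersect(names(data), names(ext_data))  # variables present in both
  data.frame(
    variable  = shared,
    mean_data = sapply(data[shared], mean, na.rm = TRUE),
    mean_ext  = sapply(ext_data[shared], mean, na.rm = TRUE),
    sd_data   = sapply(data[shared], sd, na.rm = TRUE),
    sd_ext    = sapply(ext_data[shared], sd, na.rm = TRUE),
    row.names = NULL
  )
}

# Toy inputs for illustration only
toy_data <- data.frame(V1 = c(1, NA, 3), V2 = c(2, 4, 6))
toy_ext  <- data.frame(V1 = c(2, 2, 2),  V2 = c(0, 0, 3))
res <- compare_sources(toy_data, toy_ext)
```

Large gaps between the data and external columns for the same variable suggest that donor values may not be representative of the recipients.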
Examples
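# Working data: 3 variables with 18% of values missing
data <- gen.mcar(100, rho = c(.56, .23, .18), sigma = c(1, 2, .5), n_vars = 3, na_prob = .18)
# External donor source: complete cases (no missing values)
ext_data <- gen.mcar(100, rho = c(.45, .26, .21), sigma = c(1.67, 2.23, .56), n_vars = 3, na_prob = 0)
# Impute with the default "deterministic" selection method
cold_data <- coldeck.impute(data, ext_data)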
summary(cold_data)
#>        V1                 V2                V3
#>  Min.   :-2.46590   Min.   :-4.4916   Min.   :-0.95783
#>  1st Qu.:-0.44313   1st Qu.:-0.9340   1st Qu.:-0.36091
#>  Median : 0.01414   Median : 0.4071   Median : 0.02361
#>  Mean   : 0.06888   Mean   : 0.2878   Mean   : 0.01502
#>  3rd Qu.: 0.64923   3rd Qu.: 1.5895   3rd Qu.: 0.29061
#>  Max.   : 2.57146   Max.   : 5.9800   Max.   : 1.33431