Hot deck imputation (HDI) is a univariate imputation technique: for each recipient (a respondent with a missing value), we find a donor with similar values across a subset of categorical or numerical predictors and use the donor's observed value to fill the recipient's missing observation. Because donors are matched on similarity, HDI has been used with stratification across categorical variables.
The current implementation of HDI allows the user to choose one of four selection methods: deterministic selection, random sampling from all possible donors, random sampling from the k nearest neighbors, and random sampling using weights as probabilities. The function iteratively imputes missing values across all variables with missing observations, using the selection method specified in the function arguments.
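The four selection rules can be sketched in base R for a single recipient. This is an illustration only, not the package's implementation: the helper `pick_donor`, the absolute-difference distance, and the inverse-distance weights are all assumptions made for the example.

```r
# Hypothetical helper: choose a donor row index for one recipient.
# `dists` holds the distance from the recipient to each candidate donor.
pick_donor <- function(dists, method = "deterministic", k = 3) {
  n <- length(dists)
  switch(method,
    # Always the closest donor, so repeated runs give the same answer.
    deterministic = which.min(dists),
    # Any donor, each with equal probability.
    rand_from_all = sample.int(n, 1),
    # One donor drawn uniformly from the k closest.
    rand_nearest_k = {
      nearest <- order(dists)[seq_len(min(k, n))]
      nearest[sample.int(length(nearest), 1)]
    },
    # Closer donors are more likely (inverse-distance weights, an assumption).
    weighted_rand = sample.int(n, 1, prob = 1 / (dists + 1e-8)),
    stop("unknown method: ", method)
  )
}

# Toy data: four complete donors and one recipient missing its second value.
donors    <- rbind(c(1.0, 10), c(1.2, 12), c(3.0, 30), c(5.0, 50))
recipient <- c(1.05, NA)

# Distance on the observed predictor only (column 1).
dists <- abs(donors[, 1] - recipient[1])

set.seed(1)
idx <- pick_donor(dists, "rand_nearest_k", k = 2)
recipient[2] <- donors[idx, 2]  # fill the gap with the donor's observed value
```

In this sketch only the three random rules depend on the random number generator, so only they need a seed to be reproducible.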
Arguments
- data
a matrix or data frame containing missing values in at least one predictor
- method
selection method for imputing missing values based on donor similarity. Can be one of:
"deterministic"
Select the same donor value for multiple repetitions of the HDI.
"rand_from_all"
Select a different donor value for each repetition of the HDI.
"rand_nearest_k"
Select one random donor value from a subset of k nearest neighbors for each repetition of the HDI.
"weighted_rand"
Select one random donor through a probability-weighted choice for each repetition of the HDI.
- k
number of nearest neighbors to select from when using the rand_nearest_k method.
- seed
a numeric seed for reproducible results; applies to every method except deterministic selection
- na.rm
indicates removal of NA values from every row in the matrix or data frame
Examples
data <- gen.mcar(100, rho = c(.56, .23, .18), sigma = c(1, 2, .5), n_vars = 3, na_prob = .18)
hot_data <- hotdeck.impute(data)
summary(hot_data)
#> V1 V2 V3
#> Min. :-2.8766 Min. :-4.8334 Min. :-1.13091
#> 1st Qu.:-1.0105 1st Qu.:-1.3556 1st Qu.:-0.26354
#> Median :-0.1079 Median :-0.5331 Median : 0.02834
#> Mean :-0.1892 Mean :-0.3364 Mean : 0.01080
#> 3rd Qu.: 0.4801 3rd Qu.: 0.5525 3rd Qu.: 0.30642
#> Max. : 2.3067 Max. : 5.1222 Max. : 0.96153
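The pass over every variable with missing observations can also be illustrated without the package. The following is a minimal base R sketch, assuming deterministic nearest-donor selection and a mean squared difference over the predictors two rows have both observed. The helper `impute_nearest` is hypothetical and is not how `hotdeck.impute` is necessarily implemented.

```r
# Hypothetical helper: fill every NA in a numeric matrix with the value of the
# nearest donor row (deterministic selection, an assumption for this sketch).
impute_nearest <- function(x) {
  out <- x
  for (j in seq_len(ncol(x))) {               # sweep over variables
    miss   <- which(is.na(x[, j]))
    donors <- which(!is.na(x[, j]))           # rows that observed variable j
    if (length(donors) == 0) next             # nothing to borrow from
    for (i in miss) {
      # Mean squared difference over predictors observed in both rows.
      d <- vapply(donors, function(r) {
        shared <- !is.na(x[i, ]) & !is.na(x[r, ])
        shared[j] <- FALSE
        if (!any(shared)) return(Inf)
        mean((x[i, shared] - x[r, shared])^2)
      }, numeric(1))
      out[i, j] <- x[donors[which.min(d)], j] # copy the donor's observed value
    }
  }
  out
}

x <- rbind(c(1.0, 10, NA),
           c(1.1, NA, 0.5),
           c(5.0, 50, 2.5),
           c(NA,  49, 2.4))
filled <- impute_nearest(x)
```

Distances and donor values are always read from the original matrix `x`, so a value imputed for one variable is never reused as a donor value for another.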