
This method replaces missing values with the expected value of the missing variable, given the other variables in the dataset. The conditional means are predicted using regression models suited to the type of the variable being imputed. Conditional mean imputation (CMI) is a univariate imputation method that leverages the relationships between variables in the data to make informed predictions about missing values. Although this method reduces bias and aims to maintain the relationships among variables, the imputed values fall exactly on the fitted regression, so the completed data will show less residual variation than the original data.
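The idea can be illustrated with base R alone. The following is a minimal sketch for a single numeric target, assuming a plain linear model; it is not the package's internal code, and the simulated variables and object names (`df`, `fit`, `imputed`) are made up for illustration.

```r
# Conceptual sketch of conditional mean imputation (not cmean.impute internals).
set.seed(1)
n  <- 100
x2 <- stats::rnorm(n)
x1 <- 0.8 * x2 + stats::rnorm(n, sd = 0.5)
x1[sample(n, 15)] <- NA                      # make 15 values of x1 missing
df <- data.frame(x1 = x1, x2 = x2)

miss <- is.na(df$x1)

# Fit the regression on complete rows; lm() drops rows with NA by default.
fit <- stats::lm(x1 ~ x2, data = df)

# Replace each missing x1 with its predicted conditional mean given x2.
imputed <- df
imputed$x1[miss] <- stats::predict(fit, newdata = df[miss, , drop = FALSE])

# Compare the spread of the imputed values with that of the observed values.
stats::var(imputed$x1[miss])
stats::var(df$x1, na.rm = TRUE)
```

Because the imputed values are fitted values, they carry no residual noise, which is the source of the reduced variation mentioned above.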

CMI will fail if a regressor also contains missing values, because the target's missing values can then no longer be predicted. Investigate and fill in as much of the missing data as possible before imputing with this method. If the regressors contain missing values, multiple imputation techniques are a better choice.
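A quick way to see whether this situation applies is to count missing values per column before imputing. The sketch below uses base R only; the toy data frame is invented for illustration, and the "more than one incomplete column" check is a simplifying assumption rather than the exact condition the function tests.

```r
# Toy data: both x1 and x2 are incomplete, so each would be an
# incomplete regressor when imputing the other.
df <- data.frame(
  x1 = c(1.2, NA, 0.7, 2.1),
  x2 = c(0.5, 0.1, NA, 1.3),
  y  = c(3.0, 2.2, 1.9, 4.1)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(df))
na_counts

# Columns that contain at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]

# If more than one column is incomplete, some regressor is itself incomplete.
if (length(cols_with_na) > 1) {
  message("Incomplete columns: ", paste(cols_with_na, collapse = ", "))
}
```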

Usage

cmean.impute(
  data,
  family = "AUTO",
  robust = FALSE,
  char_to_factor = FALSE,
  verbose = FALSE
)

Arguments

data

a numeric matrix or data frame of at least 2 columns.

family

the distribution family of your observations. The family argument defaults to 'AUTO', which automatically selects a distribution family (gaussian, binomial, multinomial) based on the type of variable (numeric or factor). The distribution family dictates the regression model used (lm, glm, multinom). The user can instead set the family argument to match the distribution of the response variable, and the function will adapt to this input by using a generalized linear model or beta regression.
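The kind of type-based selection described for 'AUTO' might look like the sketch below. `guess_family` is a hypothetical helper, not a function of the package, and the rule that two-level factors map to binomial while factors with more levels map to multinomial is an assumption made for illustration.

```r
# Hypothetical helper: pick a family name from the variable's type.
guess_family <- function(x) {
  if (is.numeric(x)) {
    "gaussian"                      # numeric variable: linear model
  } else if (is.factor(x) && nlevels(x) == 2) {
    "binomial"                      # assumed: two-level factor
  } else if (is.factor(x)) {
    "multinomial"                   # assumed: factor with other level counts
  } else {
    NA_character_                   # other types are not handled here
  }
}

toy <- data.frame(
  num = c(1.5, 2.0, 3.2),
  bin = factor(c("a", "b", "a")),
  cat = factor(c("u", "v", "w"))
)
sapply(toy, guess_family)
```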

robust

logical indicating whether to use robust estimation methods. If set to 'TRUE', the function will use robust linear and generalized linear models to make its predictions.

char_to_factor

logical indicating whether to transform character variables into unordered factor variables.

verbose

logical indicating whether to use verbose error handling.

Value

a matrix or data frame containing the imputed dataset.

Examples

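set.seed(123)
# x1 has 13 missing values; x2 and y are complete and act as regressors
data <- data.frame(
  x1 = c(stats::rnorm(87), rep(NA, 13)),
  x2 = stats::rnorm(100),
  y = stats::rnorm(100)
)
# impute the missing values of x1 with their conditional means
cmi_data <- cmean.impute(data)
summary(cmi_data)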
#>        x1                 x2                 y           
#>  Min.   :-2.30917   Min.   :-2.05325   Min.   :-1.75653  
#>  1st Qu.:-0.45091   1st Qu.:-0.66656   1st Qu.:-0.57945  
#>  Median : 0.01496   Median :-0.15831   Median :-0.08944  
#>  Mean   : 0.03790   Mean   :-0.02793   Mean   : 0.01478  
#>  3rd Qu.: 0.47015   3rd Qu.: 0.55205   3rd Qu.: 0.60503  
#>  Max.   : 2.16896   Max.   : 3.24104   Max.   : 2.29308