To generate missing values in a dataset with missMethods you can use
one of the delete_
functions. The names of these functions
always starts with delete_
and the next part of the name
shows the used missing data mechanism. There are three basic types of
missing data mechanisms: missing completely at random
(MCAR
), missing at random (MAR
) and missing
not at random (MNAR
). A list of all available functions for
the different mechanisms is given below:
MCAR
delete_MCAR()
MAR
delete_MAR_1_to_x()
delete_MAR_censoring()
delete_MAR_one_group()
delete_MAR_rank()
MNAR
delete_MNAR_1_to_x()
delete_MNAR_censoring()
delete_MNAR_one_group()
delete_MNAR_rank()
All these functions share a common interface. The first argument
ds
takes the dataset in which missing values should be
generated. The next argument p
specifies the proportion of
missing values to include in every column with missing value. These
columns are specified with the third argument cols_mis
. The
further arguments depend on the chosen function and are documented for
every function separately. In most cases, reasonable defaults are set
for these further arguments. Only the MAR
functions need
one additional argument with no default: cols_ctrl
. The
argument cols_ctrl
specifies the columns that control the
generation of missing data in a MAR settings.
One further remark: All MAR
functions have a
MNAR
twin. These twins behave exactly the same way. The
only difference is the columns that controls the generation of missing
values. In the MAR
functions separate
cols_ctrl
columns controls the generation of missing values
in the cols_mis
columns. In contrast, in the
MNAR
functions the generation of missing values in the
cols_mis
columns is controlled by the cols_mis
columns themselves.
In the following, different examples for the generation of missing values will be presented. Furthermore, connections between the functions from missMethods and the paper by Santos et al. (2019) will be shown.
The examples below show the use of some delete_
functions in a 2-dimensional dataset. Missing values are always
generated in the variable “X” and 30 % of the values are deleted. At
first, a basic set-up:
library(missMethods)
library(ggplot2)
set.seed(123)
<- function(ds_comp, ds_mis) {
make_simple_MDplot $missX <- is.na(ds_mis$X)
ds_compggplot(ds_comp, aes(x = X, y = Y, col = missX)) +
geom_point()
}
# generate complete data frame
<- data.frame(X = rnorm(100), Y = rnorm(100)) ds_comp
Generate MCAR values:
<- delete_MCAR(ds_comp, 0.3, "X")
ds_mcar make_simple_MDplot(ds_comp, ds_mcar)
Generate MAR values using a censoring mechanism. This leads to a missing value in “X”, if the y-value is below the 30 % quantile of “Y”:
<- delete_MAR_censoring(ds_comp, 0.3, "X", cols_ctrl = "Y")
ds_mar make_simple_MDplot(ds_comp, ds_mar)
The censoring mechanism is a rather strong form of MAR. A function
that allows to control the strength of the MAR mechanism is
delete_MAR_1_to_x
. The strength is controlled through the
argument x
: the bigger x
, the stronger the
simulated MAR
mechanism:
# x = 2
<- delete_MAR_1_to_x(ds_comp, 0.3, "X", cols_ctrl = "Y", x = 2)
ds_mar make_simple_MDplot(ds_comp, ds_mar)
# x = 10
<- delete_MAR_1_to_x(ds_comp, 0.3, "X", cols_ctrl = "Y", x = 10)
ds_mar make_simple_MDplot(ds_comp, ds_mar)
Generate MAR values using a censoring mechanism. This leads to a missing value in “X”, if the x-value is below the 30 % quantile of “X”:
<- delete_MNAR_censoring(ds_comp, 0.3, "X")
ds_mnar make_simple_MDplot(ds_comp, ds_mnar)
The following table shows the connections between the algorithm names of the missing data creation methods in Santos et al. (2019) and the functions of missMethods:
Santos et al. (2019) | Function | Arguments |
---|---|---|
MCAR1univa | delete_MCAR |
n_mis_stochastic = FALSE |
MCAR2univa | delete_MCAR |
all default |
MCAR3univa | delete_MCAR |
all default |
MAR1univa | delete_MAR_censoring |
sorting = FALSE |
MAR2univa | delete_MAR_rank |
all default |
MAR3univa | delete_MAR_1_to_x |
x = 1/9 |
MAR4univa | delete_MAR_censoring |
where = “upper” |
MAR5univa | delete_MAR_censoring |
where = “both” |
MNAR1univa | delete_MNAR_censoring |
sorting = FALSE |
MNAR2univa | delete_MNAR_censoring |
where = “upper” |
MCAR1unifo | delete_MCAR |
n_mis_stochastic = FALSE |
MCAR2unifo | delete_MCAR |
p_overall = TRUE |
MAR1unifo | delete_MAR_censoring |
all default |
MAR2unifo | delete_MAR_censoring |
sorting = FALSE |
MAR3unifo | delete_MAR_one_group |
all default |
MAR4unifo | delete_MAR_one_group |
all default |
MNAR1unifo | delete_MNAR_censoring |
all default |
MNAR2unifo | delete_MNAR_censoring |
sorting = FALSE |
MNAR3unifo | delete_MAR_one_group |
all default |
MNAR4unifo | delete_MAR_one_group |
all default |
Only the argument(s), which default values must be altered, are shown in the table. Notice that most functions in missMethods are more general than the described algorithms in Santos et al. (2019). Therefore, some functions of missMethods are able to replace different algorithms from Santos et al. (2019). In contrast to Santos et al. (2019), the user must always specify the missing column(s). However, this may change in the future.