"""
Overview
--------
This module implements the Multiple Imputation through Chained
Equations (MICE) approach to handling missing data in statistical data
analyses. The approach has the following steps:
0. Impute each missing value with the mean of the observed values of
the same variable.
1. For each variable in the data set with missing values (termed the
'focus variable'), do the following:
1a. Fit an 'imputation model', which is a regression model for the
focus variable, regressed on the observed and (current) imputed values
of some or all of the other variables.
1b. Impute the missing values for the focus variable. Currently this
imputation must use the 'predictive mean matching' (pmm) procedure.
2. Once all variables have been imputed, fit the 'analysis model' to
the data set.
3. Repeat steps 1-2 multiple times and combine the results using a
'combining rule' to produce point estimates of all parameters in the
analysis model and standard errors for them.
The imputations for each variable are based on an imputation model
that is specified via a model class and a formula for the regression
relationship. The default model is OLS, with a formula specifying
main effects for all other variables.
The MICE procedure can be used in one of two ways:
* If the goal is only to produce imputed data sets, the MICEData class
can be used to wrap a data frame, providing facilities for doing the
imputation. Summary plots are available for assessing the performance
of the imputation.
* If the imputed data sets are to be used to fit an additional
'analysis model', a MICE instance can be used. After specifying the
MICE instance and running it, the results are combined using the
`combine` method. Results and various summary plots are then
available.
Terminology
-----------
The primary goal of the analysis is usually to fit and perform
inference using an 'analysis model'. If an analysis model is not
specified, then imputed datasets are produced for later use.
The MICE procedure involves a family of imputation models. There is
one imputation model for each variable with missing values. An
imputation model may be conditioned on all or a subset of the
remaining variables, using main effects, transformations,
interactions, etc. as desired.
A 'perturbation method' is a method for setting the parameter estimate
in an imputation model. The 'gaussian' perturbation method first fits
the model (usually using maximum likelihood, but it could use any
statsmodels fit procedure), then sets the parameter vector equal to a
draw from the Gaussian approximation to the sampling distribution for
the fit. The 'bootstrap' perturbation method sets the parameter
vector equal to a fitted parameter vector obtained when fitting the
conditional model to a bootstrapped version of the data set.
Class structure
---------------
There are two main classes in the module:
* 'MICEData' wraps a Pandas dataframe, incorporating information about
the imputation model for each variable with missing values. It can
be used to produce multiply imputed data sets that are to be further
processed or distributed to other researchers. A number of plotting
procedures are provided to visualize the imputation results and
missing data patterns. The `history_func` hook allows any features
of interest of the imputed data sets to be saved for further
analysis.
* 'MICE' takes both a 'MICEData' object and an analysis model
specification. It runs the multiple imputation, fits the analysis
models, and combines the results to produce a `MICEResults` object.
The summary method of this results object can be used to see the key
estimands and inferential quantities.
Notes
-----
By default, to conserve memory 'MICEData' saves very little
information from one iteration to the next. The data set passed by
the user is copied on entry, but then is over-written each time new
imputations are produced. If using 'MICE', the fitted
analysis models and results are saved. MICEData includes a
`history_callback` hook that allows arbitrary information from the
intermediate datasets to be saved for future use.
References
----------
JL Schafer: 'Multiple Imputation: A Primer', Stat Methods Med Res,
1999.
TE Raghunathan et al.: 'A Multivariate Technique for Multiply
Imputing Missing Values Using a Sequence of Regression Models', Survey
Methodology, 2001.
SAS Institute: 'Predictive Mean Matching Method for Monotone Missing
Data', SAS 9.2 User's Guide, 2014.
A Gelman et al.: 'Multiple Imputation with Diagnostics (mi) in R:
Opening Windows into the Black Box', Journal of Statistical Software,
2009.
"""
import pandas as pd
import numpy as np
import patsy
from statsmodels.base.model import LikelihoodModelResults
from statsmodels.regression.linear_model import OLS
from collections import defaultdict
_mice_data_example_1 = """
>>> imp = mice.MICEData(data)
>>> imp.set_imputer('x1', formula='x2 + np.square(x2) + x3')
>>> for j in range(20):
... imp.update_all()
... imp.data.to_csv('data%02d.csv' % j)"""
_mice_data_example_2 = """
>>> imp = mice.MICEData(data)
>>> j = 0
>>> for data in imp:
... imp.data.to_csv('data%02d.csv' % j)
... j += 1"""
class PatsyFormula(object):
"""
A simple wrapper for a string to be interpreted as a Patsy formula.
"""
def __init__(self, formula):
self.formula = "0 + " + formula
class MICEData(object):
__doc__ = """\
Wrap a data set to allow missing data handling with MICE.
Parameters
----------
data : Pandas data frame
The data set, which is copied internally.
perturbation_method : str
The default perturbation method
k_pmm : int
The number of nearest neighbors to use during predictive mean
matching. Can also be specified in `fit`.
history_callback : function
A function that is called after each complete imputation
cycle. The return value is appended to `history`. The
MICEData object is passed as the sole argument to
`history_callback`.
Examples
--------
Draw 20 imputations from a data set called `data` and save them in
separate files with filename pattern `dataXX.csv`. The variables
other than `x1` are imputed using linear models fit with OLS, with
mean structures containing main effects of all other variables in
`data`. The variable named `x1` has a conditional mean structure
that includes an additional term for x2^2.
%(_mice_data_example_1)s
Impute using default models, using the MICEData object as an
iterator.
%(_mice_data_example_2)s
Notes
-----
Allowed perturbation methods are 'gaussian' (the model parameters
are set to a draw from the Gaussian approximation to the posterior
distribution), and 'boot' (the model parameters are set to the
estimated values obtained when fitting a bootstrapped version of
the data set).
`history_callback` can be implemented to have side effects such as
saving the current imputed data set to disk.
""" % {'_mice_data_example_1': _mice_data_example_1,
'_mice_data_example_2': _mice_data_example_2}
def __init__(self, data, perturbation_method='gaussian',
k_pmm=20, history_callback=None):
if data.columns.dtype != np.dtype('O'):
msg = "MICEData data column names should be string type"
raise ValueError(msg)
self.regularized = dict()
# Drop observations where all variables are missing. This
# also has the effect of copying the data frame.
self.data = data.dropna(how='all').reset_index(drop=True)
self.history_callback = history_callback
self.history = []
self.predict_kwds = {}
# Assign the same perturbation method for all variables.
# Can be overridden when calling 'set_imputer'.
self.perturbation_method = defaultdict(lambda:
perturbation_method)
# Map from variable name to indices of observed/missing
# values.
self.ix_obs = {}
self.ix_miss = {}
for col in self.data.columns:
ix_obs, ix_miss = self._split_indices(self.data[col])
self.ix_obs[col] = ix_obs
self.ix_miss[col] = ix_miss
# Most recent model instance and results instance for each variable.
self.models = {}
self.results = {}
# Map from variable names to the conditional formula.
self.conditional_formula = {}
# Map from variable names to init/fit args of the conditional
# models.
self.init_kwds = defaultdict(lambda: dict())
self.fit_kwds = defaultdict(lambda: dict())
# Map from variable names to the model class.
self.model_class = {}
# Map from variable names to most recent params update.
self.params = {}
# Set default imputers.
for vname in data.columns:
self.set_imputer(vname)
# The order in which variables are imputed in each cycle.
# Impute variables with the fewest missing values first.
vnames = list(data.columns)
nmiss = [len(self.ix_miss[v]) for v in vnames]
nmiss = np.asarray(nmiss)
ii = np.argsort(nmiss)
ii = ii[sum(nmiss == 0):]
self._cycle_order = [vnames[i] for i in ii]
self._initial_imputation()
self.k_pmm = k_pmm
def next_sample(self):
"""
Returns the next imputed dataset in the imputation process.
Returns
-------
data : array_like
An imputed dataset from the MICE chain.
Notes
-----
`MICEData` does not have a `skip` parameter. Consecutive
values returned by `next_sample` are immediately consecutive
in the imputation chain.
The returned value is a reference to the data attribute of
the class and should be copied before making any changes.
"""
self.update_all(1)
return self.data
def _initial_imputation(self):
"""
Use a PMM-like procedure for initial imputed values.
For each variable, missing values are imputed as the observed
value that is closest to the mean over all observed values.
"""
for col in self.data.columns:
di = self.data[col] - self.data[col].mean()
di = np.abs(di)
ix = di.idxmin()
imp = self.data[col].loc[ix]
self.data[col].fillna(imp, inplace=True)
def _split_indices(self, vec):
null = pd.isnull(vec)
ix_obs = np.flatnonzero(~null)
ix_miss = np.flatnonzero(null)
if len(ix_obs) == 0:
raise ValueError("variable to be imputed has no observed values")
return ix_obs, ix_miss
def set_imputer(self, endog_name, formula=None, model_class=None,
init_kwds=None, fit_kwds=None, predict_kwds=None,
k_pmm=20, perturbation_method=None, regularized=False):
"""
Specify the imputation process for a single variable.
Parameters
----------
endog_name : str
Name of the variable to be imputed.
formula : str
Conditional formula for imputation. Defaults to a formula
with main effects for all other variables in dataset. The
formula should only include an expression for the mean
structure, e.g. use 'x1 + x2' not 'x4 ~ x1 + x2'.
model_class : statsmodels model
Conditional model for imputation. Defaults to OLS. See below
for more information.
init_kwds : dit-like
Keyword arguments passed to the model init method.
fit_kwds : dict-like
Keyword arguments passed to the model fit method.
predict_kwds : dict-like
Keyword arguments passed to the model predict method.
k_pmm : int
Determines number of neighboring observations from which
to randomly sample when using predictive mean matching.
perturbation_method : str
Either 'gaussian' or 'bootstrap'. Determines the method
for perturbing parameters in the imputation model. If
None, uses the default specified at class initialization.
regularized : dict
If regularized[name]=True, `fit_regularized` rather than
`fit` is called when fitting imputation models for this
variable. When regularized[name]=True for any variable,
perturbation_method must be set to boot.
Notes
-----
Loading ...