"""
Overview
--------

This module implements the Multiple Imputation through Chained
Equations (MICE) approach to handling missing data in statistical data
analyses. The approach has the following steps:

0. Impute each missing value with an initial value; here this is the
observed value of the same variable that is closest to the mean of the
observed values.

1. For each variable in the data set with missing values (termed the
'focus variable'), do the following:

1a. Fit an 'imputation model', which is a regression model for the
focus variable, regressed on the observed and (current) imputed values
of some or all of the other variables.

1b. Impute the missing values for the focus variable.  Currently this
imputation must use the 'predictive mean matching' (pmm) procedure.

2. Once all variables have been imputed, fit the 'analysis model' to
the data set.

3. Repeat steps 1-2 multiple times and combine the results using a
'combining rule' to produce point estimates of all parameters in the
analysis model and standard errors for them.

The imputations for each variable are based on an imputation model
that is specified via a model class and a formula for the regression
relationship.  The default model is OLS, with a formula specifying
main effects for all other variables.

The MICE procedure can be used in one of two ways:

* If the goal is only to produce imputed data sets, the MICEData class
can be used to wrap a data frame, providing facilities for doing the
imputation.  Summary plots are available for assessing the performance
of the imputation.

* If the imputed data sets are to be used to fit an additional
'analysis model', a MICE instance can be used.  After specifying the
MICE instance and running it, the results are combined using the
`combine` method.  Results and various summary plots are then
available.
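
For example, assuming ``data`` is a pandas data frame with missing
values, ``statsmodels.api`` is imported as ``sm``, and the analysis
model of interest is an OLS regression of ``y`` on ``x1`` and ``x2``
(the variable names are illustrative), a typical workflow looks
roughly like this:

>>> imp = mice.MICEData(data)
>>> fml = 'y ~ x1 + x2'
>>> mi = mice.MICE(fml, sm.OLS, imp)
>>> results = mi.fit(10, 10)  # burn-in cycles, number of imputations
>>> print(results.summary())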

Terminology
-----------

The primary goal of the analysis is usually to fit and perform
inference using an 'analysis model'. If an analysis model is not
specified, then imputed datasets are produced for later use.

The MICE procedure involves a family of imputation models.  There is
one imputation model for each variable with missing values.  An
imputation model may be conditioned on all or a subset of the
remaining variables, using main effects, transformations,
interactions, etc. as desired.

A 'perturbation method' is a method for setting the parameter estimate
in an imputation model.  The 'gaussian' perturbation method first fits
the model (usually using maximum likelihood, but it could use any
statsmodels fit procedure), then sets the parameter vector equal to a
draw from the Gaussian approximation to the sampling distribution for
the fit.  The 'bootstrap' perturbation method sets the parameter
vector equal to a fitted parameter vector obtained when fitting the
conditional model to a bootstrapped version of the data set.
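
For example, the default perturbation method can be chosen when the
``MICEData`` object is constructed and then overridden for individual
variables (``x1`` is an illustrative variable name):

>>> imp = mice.MICEData(data, perturbation_method='boot')
>>> imp.set_imputer('x1', perturbation_method='gaussian')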

Class structure
---------------

There are two main classes in the module:

* 'MICEData' wraps a Pandas dataframe, incorporating information about
  the imputation model for each variable with missing values. It can
  be used to produce multiply imputed data sets that are to be further
  processed or distributed to other researchers.  A number of plotting
  procedures are provided to visualize the imputation results and
  missing data patterns.  The `history_callback` hook allows any features
  of interest of the imputed data sets to be saved for further
  analysis.

* 'MICE' takes both a 'MICEData' object and an analysis model
  specification.  It runs the multiple imputation, fits the analysis
  models, and combines the results to produce a `MICEResults` object.
  The summary method of this results object can be used to see the key
  estimands and inferential quantities.

Notes
-----

By default, to conserve memory, 'MICEData' saves very little
information from one iteration to the next.  The data set passed by
the user is copied on entry, but is then overwritten each time new
imputations are produced.  If using 'MICE', the fitted
analysis models and results are saved.  MICEData includes a
`history_callback` hook that allows arbitrary information from the
intermediate datasets to be saved for future use.
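
As a sketch, a callback that records the mean of an illustrative
column ``x1`` after each imputation cycle could be supplied as
follows; the recorded values accumulate in the `history` attribute:

>>> def track_mean(x):
...     return x.data['x1'].mean()
>>> imp = mice.MICEData(data, history_callback=track_mean)
>>> imp.update_all(5)
>>> print(imp.history)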

References
----------

JL Schafer: 'Multiple Imputation: A Primer', Stat Methods Med Res,
1999.

TE Raghunathan et al.: 'A Multivariate Technique for Multiply
Imputing Missing Values Using a Sequence of Regression Models', Survey
Methodology, 2001.

SAS Institute: 'Predictive Mean Matching Method for Monotone Missing
Data', SAS 9.2 User's Guide, 2014.

A Gelman et al.: 'Multiple Imputation with Diagnostics (mi) in R:
Opening Windows into the Black Box', Journal of Statistical Software,
2009.
"""

import pandas as pd
import numpy as np
import patsy
from statsmodels.base.model import LikelihoodModelResults
from statsmodels.regression.linear_model import OLS
from collections import defaultdict


_mice_data_example_1 = """
    >>> imp = mice.MICEData(data)
    >>> imp.set_imputer('x1', formula='x2 + np.square(x2) + x3')
    >>> for j in range(20):
    ...     imp.update_all()
    ...     imp.data.to_csv('data%02d.csv' % j)"""

_mice_data_example_2 = """
    >>> imp = mice.MICEData(data)
    >>> j = 0
    >>> for data in imp:
    ...     imp.data.to_csv('data%02d.csv' % j)
    ...     j += 1"""


class PatsyFormula(object):
    """
    A simple wrapper for a string to be interpreted as a Patsy formula.
    """
    def __init__(self, formula):
        self.formula = "0 + " + formula


class MICEData(object):

    __doc__ = """\
    Wrap a data set to allow missing data handling with MICE.

    Parameters
    ----------
    data : Pandas data frame
        The data set, which is copied internally.
    perturbation_method : str
        The default perturbation method
    k_pmm : int
        The number of nearest neighbors to use during predictive mean
        matching.  Can also be specified in `fit`.
    history_callback : function
        A function that is called after each complete imputation
        cycle.  The return value is appended to `history`.  The
        MICEData object is passed as the sole argument to
        `history_callback`.

    Examples
    --------
    Draw 20 imputations from a data set called `data` and save them in
    separate files with filename pattern `dataXX.csv`.  The variables
    other than `x1` are imputed using linear models fit with OLS, with
    mean structures containing main effects of all other variables in
    `data`.  The variable named `x1` has a conditional mean structure
    that includes an additional term for x2^2.
    %(_mice_data_example_1)s

    Impute using default models, using the MICEData object as an
    iterator.
    %(_mice_data_example_2)s

    Notes
    -----
    Allowed perturbation methods are 'gaussian' (the model parameters
    are set to a draw from the Gaussian approximation to the posterior
    distribution), and 'boot' (the model parameters are set to the
    estimated values obtained when fitting a bootstrapped version of
    the data set).

    `history_callback` can be implemented to have side effects such as
    saving the current imputed data set to disk.
    """ % {'_mice_data_example_1': _mice_data_example_1,
           '_mice_data_example_2': _mice_data_example_2}

    def __init__(self, data, perturbation_method='gaussian',
                 k_pmm=20, history_callback=None):

        if data.columns.dtype != np.dtype('O'):
            msg = "MICEData data column names should be string type"
            raise ValueError(msg)

        self.regularized = dict()

        # Drop observations where all variables are missing.  This
        # also has the effect of copying the data frame.
        self.data = data.dropna(how='all').reset_index(drop=True)

        self.history_callback = history_callback
        self.history = []
        self.predict_kwds = {}

        # Assign the same perturbation method for all variables.
        # Can be overridden when calling 'set_imputer'.
        self.perturbation_method = defaultdict(lambda:
                                               perturbation_method)

        # Map from variable name to indices of observed/missing
        # values.
        self.ix_obs = {}
        self.ix_miss = {}
        for col in self.data.columns:
            ix_obs, ix_miss = self._split_indices(self.data[col])
            self.ix_obs[col] = ix_obs
            self.ix_miss[col] = ix_miss

        # Most recent model instance and results instance for each variable.
        self.models = {}
        self.results = {}

        # Map from variable names to the conditional formula.
        self.conditional_formula = {}

        # Map from variable names to init/fit args of the conditional
        # models.
        self.init_kwds = defaultdict(lambda: dict())
        self.fit_kwds = defaultdict(lambda: dict())

        # Map from variable names to the model class.
        self.model_class = {}

        # Map from variable names to most recent params update.
        self.params = {}

        # Set default imputers.
        for vname in data.columns:
            self.set_imputer(vname)

        # The order in which variables are imputed in each cycle.
        # Impute variables with the fewest missing values first.
        vnames = list(data.columns)
        nmiss = [len(self.ix_miss[v]) for v in vnames]
        nmiss = np.asarray(nmiss)
        ii = np.argsort(nmiss)
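        # Fully observed variables are skipped; they need no imputation.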
        ii = ii[sum(nmiss == 0):]
        self._cycle_order = [vnames[i] for i in ii]

        self._initial_imputation()

        self.k_pmm = k_pmm

    def next_sample(self):
        """
        Returns the next imputed dataset in the imputation process.

        Returns
        -------
        data : array_like
            An imputed dataset from the MICE chain.

        Notes
        -----
        `MICEData` does not have a `skip` parameter.  Consecutive
        values returned by `next_sample` are immediately consecutive
        in the imputation chain.

        The returned value is a reference to the data attribute of
        the class and should be copied before making any changes.
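
        For example (a sketch in which ``data`` is the incomplete data
        set), three consecutive imputed data sets can be collected,
        copying each as recommended above:

        >>> imp = mice.MICEData(data)
        >>> samples = [imp.next_sample().copy() for _ in range(3)]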
        """

        self.update_all(1)
        return self.data

    def _initial_imputation(self):
        """
        Use a PMM-like procedure for initial imputed values.

        For each variable, missing values are imputed as the observed
        value that is closest to the mean over all observed values.
        """

        for col in self.data.columns:
            di = self.data[col] - self.data[col].mean()
            di = np.abs(di)
            ix = di.idxmin()
            imp = self.data[col].loc[ix]
            self.data[col].fillna(imp, inplace=True)

    def _split_indices(self, vec):
        null = pd.isnull(vec)
        ix_obs = np.flatnonzero(~null)
        ix_miss = np.flatnonzero(null)
        if len(ix_obs) == 0:
            raise ValueError("variable to be imputed has no observed values")
        return ix_obs, ix_miss

    def set_imputer(self, endog_name, formula=None, model_class=None,
                    init_kwds=None, fit_kwds=None, predict_kwds=None,
                    k_pmm=20, perturbation_method=None, regularized=False):
        """
        Specify the imputation process for a single variable.

        Parameters
        ----------
        endog_name : str
            Name of the variable to be imputed.
        formula : str
            Conditional formula for imputation. Defaults to a formula
            with main effects for all other variables in dataset.  The
            formula should only include an expression for the mean
            structure, e.g. use 'x1 + x2' not 'x4 ~ x1 + x2'.
        model_class : statsmodels model
            Conditional model for imputation. Defaults to OLS.  See below
            for more information.
        init_kwds : dict-like
            Keyword arguments passed to the model init method.
        fit_kwds : dict-like
            Keyword arguments passed to the model fit method.
        predict_kwds : dict-like
            Keyword arguments passed to the model predict method.
        k_pmm : int
            Determines number of neighboring observations from which
            to randomly sample when using predictive mean matching.
        perturbation_method : str
            Either 'gaussian' or 'boot'. Determines the method
            for perturbing parameters in the imputation model.  If
            None, uses the default specified at class initialization.
        regularized : bool
            If True, `fit_regularized` rather than `fit` is called
            when fitting the imputation model for this variable.  In
            that case, `perturbation_method` must be set to 'boot'.

        Notes
        -----