Learn more  » Push, build, and install  RubyGems npm packages Python packages Maven artifacts PHP packages Go Modules Bower components Debian packages RPM packages NuGet packages

alkaline-ml / statsmodels   python

Repository URL to install this package:

Version: 0.11.1 

/ genmod / bayes_mixed_glm.py

r"""
Bayesian inference for generalized linear mixed models.

Currently only families without additional scale or shape parameters
are supported (binomial and Poisson).

Two estimation approaches are supported: Laplace approximation
('maximum a posteriori'), and variational Bayes (mean field
approximation to the posterior distribution).

All realizations of random effects are modeled to be mutually
independent in this implementation.

The `exog_vc` matrix is the design matrix for the random effects.
Every column of `exog_vc` corresponds to an independent realization of
a random effect.  These random effects have mean zero and an unknown
standard deviation.  The standard deviation parameters are constrained
to be equal within subsets of the columns. When not using formulas,
these subsets are specified through the parameter `ident`.  `ident`
must have the same length as the number of columns of `exog_vc`, and
two columns whose `ident` values are equal have the same standard
deviation.  When formulas are used, the columns of `exog_vc` derived
from a common formula are constrained to have the same standard
deviation.

In many applications, `exog_vc` will be sparse.  A sparse matrix may
be passed when constructing a model class.  If a dense matrix is
passed, it will be converted internally to a sparse matrix.  There
currently is no way to avoid creating a temporary dense version of
`exog_vc` when using formulas.

Model and parameterization
--------------------------
The joint density of data and parameters factors as:

.. math::

    p(y | vc, fep) p(vc | vcp) p(vcp) p(fe)

The terms :math:`p(vcp)` and :math:`p(fe)` are prior distributions
that are taken to be Gaussian (the :math:`vcp` parameters are log
standard deviations so the standard deviations have log-normal
distributions).  The random effects distribution :math:`p(vc | vcp)`
is independent Gaussian (random effect realizations are independent
within and between values of the `ident` array).  The model
:math:`p(y | vc, fep)` depends on the specific GLM being fit.
"""

import numpy as np
from scipy.optimize import minimize
from scipy import sparse
import statsmodels.base.model as base
from statsmodels.iolib import summary2
from statsmodels.genmod import families
import pandas as pd
import warnings
import patsy

# Gauss-Legendre weights
glw = [
    [0.2955242247147529, -0.1488743389816312],
    [0.2955242247147529, 0.1488743389816312],
    [0.2692667193099963, -0.4333953941292472],
    [0.2692667193099963, 0.4333953941292472],
    [0.2190863625159820, -0.6794095682990244],
    [0.2190863625159820, 0.6794095682990244],
    [0.1494513491505806, -0.8650633666889845],
    [0.1494513491505806, 0.8650633666889845],
    [0.0666713443086881, -0.9739065285171717],
    [0.0666713443086881, 0.9739065285171717],
]

_init_doc = r"""
    Generalized Linear Mixed Model with Bayesian estimation

    The class implements the Laplace approximation to the posterior
    distribution (`fit_map`) and a variational Bayes approximation to
    the posterior (`fit_vb`).  See the two fit method docstrings for
    more information about the fitting approaches.

    Parameters
    ----------
    endog : array_like
        Vector of response values.
    exog : array_like
        Array of covariates for the fixed effects part of the mean
        structure.
    exog_vc : array_like
        Array of covariates for the random part of the model.  A
        scipy.sparse array may be provided, or else the passed
        array will be converted to sparse internally.
    ident : array_like
        Array of integer labels showing which random terms (columns
        of `exog_vc`) have a common variance.
    vcp_p : float
        Prior standard deviation for variance component parameters
        (the prior standard deviation of log(s) is vcp_p, where s is
        the standard deviation of a random effect).
    fe_p : float
        Prior standard deviation for fixed effects parameters.
    family : statsmodels.genmod.families instance
        The GLM family.
    fep_names : list[str]
        The names of the fixed effects parameters (corresponding to
        columns of exog).  If None, default names are constructed.
    vcp_names : list[str]
        The names of the variance component parameters (corresponding
        to distinct labels in ident).  If None, default names are
        constructed.
    vc_names : list[str]
        The names of the random effect realizations.

    Returns
    -------
    MixedGLMResults object

    Notes
    -----
    There are three types of values in the posterior distribution:
    fixed effects parameters (fep), corresponding to the columns of
    `exog`, random effects realizations (vc), corresponding to the
    columns of `exog_vc`, and the standard deviations of the random
    effects realizations (vcp), corresponding to the unique integer
    labels in `ident`.

    All random effects are modeled as being independent Gaussian
    values (given the variance structure parameters).  Every column of
    `exog_vc` has a distinct realized random effect that is used to
    form the linear predictors.  The elements of `ident` determine the
    distinct variance structure parameters.  Two random effect
    realizations that have the same value in `ident` have the same
    variance.  When fitting with a formula, `ident` is constructed
    internally (each element of `vc_formulas` yields a distinct label
    in `ident`).

    The random effect standard deviation parameters (`vcp`) have
    log-normal prior distributions with mean 0 and standard deviation
    `vcp_p`.

    Note that for some families, e.g. Binomial, the posterior mode may
    be difficult to find numerically if `vcp_p` is set to too large of
    a value.  Setting `vcp_p` to 0.5 seems to work well.

    The prior for the fixed effects parameters is Gaussian with mean 0
    and standard deviation `fe_p`.

    Examples
    --------{example}


    References
    ----------
    Introduction to generalized linear mixed models:
    https://stats.idre.ucla.edu/other/mult-pkg/introduction-to-generalized-linear-mixed-models

    SAS documentation:
    https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_intromix_a0000000215.htm

    An assessment of estimation methods for generalized linear mixed
    models with binary outcomes
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866838/
    """

# The code in the example should be identical to what appears in
# the test_doc_examples unit test
_logit_example = """
    A binomial (logistic) random effects model with random intercepts
    for villages and random slopes for each year within each village:

    >>> random = {"a": '0 + C(Village)', "b": '0 + C(Village)*year_cen'}
    >>> model = BinomialBayesMixedGLM.from_formula(
                   'y ~ year_cen', random, data)
    >>> result = model.fit_vb()
"""

# The code in the example should be identical to what appears in
# the test_doc_examples unit test
_poisson_example = """
    A Poisson random effects model with random intercepts for villages
    and random slopes for each year within each village:

    >>> random = {"a": '0 + C(Village)', "b": '0 + C(Village)*year_cen'}
    >>> model = PoissonBayesMixedGLM.from_formula(
                    'y ~ year_cen', random, data)
    >>> result = model.fit_vb()
"""


class _BayesMixedGLM(base.Model):
    def __init__(self,
                 endog,
                 exog,
                 exog_vc=None,
                 ident=None,
                 family=None,
                 vcp_p=1,
                 fe_p=2,
                 fep_names=None,
                 vcp_names=None,
                 vc_names=None,
                 **kwargs):

        if exog.ndim == 1:
            if isinstance(exog, np.ndarray):
                exog = exog[:, None]
            else:
                exog = pd.DataFrame(exog)

        if exog.ndim != 2:
            msg = "'exog' must have one or two columns"
            raise ValueError(msg)

        if exog_vc.ndim == 1:
            if isinstance(exog_vc, np.ndarray):
                exog_vc = exog_vc[:, None]
            else:
                exog_vc = pd.DataFrame(exog_vc)

        if exog_vc.ndim != 2:
            msg = "'exog_vc' must have one or two columns"
            raise ValueError(msg)

        ident = np.asarray(ident)
        if ident.ndim != 1:
            msg = "ident must be a one-dimensional array"
            raise ValueError(msg)

        if len(ident) != exog_vc.shape[1]:
            msg = "len(ident) should match the number of columns of exog_vc"
            raise ValueError(msg)

        if not np.issubdtype(ident.dtype, np.integer):
            msg = "ident must have an integer dtype"
            raise ValueError(msg)

        # Get the fixed effects parameter names
        if fep_names is None:
            if hasattr(exog, "columns"):
                fep_names = exog.columns.tolist()
            else:
                fep_names = ["FE_%d" % (k + 1) for k in range(exog.shape[1])]

        # Get the variance parameter names
        if vcp_names is None:
            vcp_names = ["VC_%d" % (k + 1) for k in range(int(max(ident)) + 1)]
        else:
            if len(vcp_names) != len(set(ident)):
                msg = "The lengths of vcp_names and ident should be the same"
                raise ValueError(msg)

        if not sparse.issparse(exog_vc):
            exog_vc = sparse.csr_matrix(exog_vc)

        ident = ident.astype(np.int)
        vcp_p = float(vcp_p)
        fe_p = float(fe_p)

        # Number of fixed effects parameters
        if exog is None:
            k_fep = 0
        else:
            k_fep = exog.shape[1]

        # Number of variance component structure parameters and
        # variance component realizations.
        if exog_vc is None:
            k_vc = 0
            k_vcp = 0
        else:
            k_vc = exog_vc.shape[1]
            k_vcp = max(ident) + 1

        # power might be better but not available in older scipy
        exog_vc2 = exog_vc.multiply(exog_vc)

        super(_BayesMixedGLM, self).__init__(endog, exog, **kwargs)

        self.exog_vc = exog_vc
        self.exog_vc2 = exog_vc2
        self.ident = ident
        self.family = family
        self.k_fep = k_fep
        self.k_vc = k_vc
        self.k_vcp = k_vcp
        self.fep_names = fep_names
        self.vcp_names = vcp_names
        self.vc_names = vc_names
        self.fe_p = fe_p
        self.vcp_p = vcp_p
        self.names = fep_names + vcp_names
        if vc_names is not None:
            self.names += vc_names

    def _unpack(self, vec):

        ii = 0

        # Fixed effects parameters
        fep = vec[:ii + self.k_fep]
        ii += self.k_fep

        # Variance component structure parameters (standard
        # deviations).  These are on the log scale.  The standard
        # deviation for random effect j is exp(vcp[ident[j]]).
        vcp = vec[ii:ii + self.k_vcp]
        ii += self.k_vcp

        # Random effect realizations
        vc = vec[ii:]

        return fep, vcp, vc

    def logposterior(self, params):
        """
        The overall log-density: log p(y, fe, vc, vcp).

        This differs by an additive constant from the log posterior
        log p(fe, vc, vcp | y).
        """

        fep, vcp, vc = self._unpack(params)

        # Contributions from p(y | x, vc)
        lp = 0
        if self.k_fep > 0:
            lp += np.dot(self.exog, fep)
        if self.k_vc > 0:
            lp += self.exog_vc.dot(vc)

        mu = self.family.link.inverse(lp)
        ll = self.family.loglike(self.endog, mu)

        if self.k_vc > 0:

            # Contributions from p(vc | vcp)
            vcp0 = vcp[self.ident]
            s = np.exp(vcp0)
            ll -= 0.5 * np.sum(vc**2 / s**2) + np.sum(vcp0)

            # Contributions from p(vc)
            ll -= 0.5 * np.sum(vcp**2 / self.vcp_p**2)

        # Contributions from p(fep)
        if self.k_fep > 0:
            ll -= 0.5 * np.sum(fep**2 / self.fe_p**2)
Loading ...