2.1.1. dodiscover.ci.CMITest

class dodiscover.ci.CMITest(k=0.2, transform='rank', n_jobs=-1, n_shuffle_nbrs=5, n_shuffle=100, random_seed=None)

Conditional mutual information independence test.

Implements the conditional independence test using conditional mutual information proposed in [1].

Parameters:

k (float, optional): Number of nearest neighbors for each sample point. If the number is smaller than 1, it is interpreted as a fraction of the number of samples, by default 0.2.
transform (str, optional): Transformation applied to the data before estimation, by default ‘rank’, which converts data to ranks. Can be ‘rank’, ‘uniform’, or ‘standardize’.
n_jobs (int, optional): The number of CPUs to use, by default -1, which corresponds to using all available CPUs.
n_shuffle_nbrs (int, optional): Number of nearest neighbors within the Z covariates for shuffling, by default 5.
n_shuffle (int, optional): The number of times to shuffle the dataset to generate the null distribution, by default 100.
random_seed (int, optional): The random seed used to seed the generator via np.random.default_rng, by default None.
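The conventions for k and for the ‘rank’ transform can be illustrated with a small sketch. The helpers `resolve_k` and `rank_transform` are hypothetical names and are not part of dodiscover. Truncating a fractional k to an integer, clamping it to at least one neighbor, and using zero-based ordinal ranks are all assumptions made for illustration.

```python
def resolve_k(k, n_samples):
    """Turn k into an integer neighbor count (assumed convention)."""
    if k < 1:
        # k is read as a fraction of the number of samples.
        return max(1, int(k * n_samples))
    return max(1, int(k))


def rank_transform(values):
    """Replace each value in one column by its zero-based rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


n_neighbors = resolve_k(0.2, 500)            # fraction of 500 samples
ranks = rank_transform([0.3, -1.2, 2.5])     # smallest value gets rank 0
```

Under these assumptions, the default k=0.2 on a dataset of 500 rows would use 100 neighbors, while an integer such as k=5 would be used as given.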

Notes

Conditional mutual information (CMI) is defined as:

\[I(X;Y|Z) = \iiint p(z) p(x,y|z) \log \frac{ p(x,y|z)}{p(x|z)\cdot p(y |z)} \,dx dy dz\]

When \(X \perp Y | Z\), the conditional density factorizes as \(p(x,y|z) = p(x|z)\cdot p(y|z)\), so the logarithm vanishes and CMI is equal to 0. Hence, CMI is a general measure for conditional dependence. The estimator for CMI proposed in [1] is a k-nearest-neighbor based estimator:

\[\widehat{I}(X;Y|Z) = \psi (k) + \frac{1}{T} \sum_{t=1}^T (\psi(k_{Z,t}) - \psi(k_{XZ,t}) - \psi(k_{YZ,t}))\]

where \(\psi\) is the digamma function (see scipy.special.digamma) and \(T\) is the number of samples. \(k\) determines the size of the hyper-cubes around each (high-dimensional) sample point, and \(k_{Z,t}, k_{XZ,t}, k_{YZ,t}\) are the numbers of neighbors of sample \(t\) in the respective subspaces. \(k\) can be viewed as a density smoothing parameter (although it is data-adaptive, unlike fixed-bandwidth estimators). For large \(k\), the underlying dependencies are smoothed more strongly and the CMI estimate has a larger bias but a lower variance; the lower variance matters more for significance testing. Note that the estimated CMI values can be slightly negative even though CMI is a non-negative quantity.
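The estimator formula above can be sketched in a few lines of pure Python. This is a brute-force illustration and not the dodiscover implementation: the names `digamma_int`, `chebyshev`, and `knn_cmi` are hypothetical, and the maximum norm, the strict inequality, and counting each point as its own neighbor in the subspaces are assumptions. The digamma function is only evaluated at positive integers here, where it has the closed form \(\psi(n) = -\gamma + \sum_{j=1}^{n-1} 1/j\).

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def chebyshev(a, b):
    """Maximum-norm distance between two points given as tuples."""
    return max(abs(u - v) for u, v in zip(a, b))


def knn_cmi(x, y, z, k=5):
    """Brute-force kNN estimate of I(X;Y|Z); x, y, z are lists of tuples."""
    n = len(x)
    xz = [xi + zi for xi, zi in zip(x, z)]
    yz = [yi + zi for yi, zi in zip(y, z)]
    xyz = [xi + yi + zi for xi, yi, zi in zip(x, y, z)]
    total = 0.0
    for t in range(n):
        # Distance from point t to its k-th nearest neighbor in the joint space.
        eps = sorted(chebyshev(xyz[t], xyz[s]) for s in range(n) if s != t)[k - 1]

        def count(points):
            # Points strictly closer than eps in a subspace; point t itself counts.
            return max(1, sum(chebyshev(points[t], p) < eps for p in points))

        total += digamma_int(count(z)) - digamma_int(count(xz)) - digamma_int(count(yz))
    return digamma_int(k) + total / n


# Synthetic continuous data: Z is independent of X; Y either ignores or follows X.
rng = random.Random(0)
n = 200
x = [(rng.gauss(0, 1),) for _ in range(n)]
z = [(rng.gauss(0, 1),) for _ in range(n)]
y_indep = [(rng.gauss(0, 1),) for _ in range(n)]
y_dep = [(xi[0] + 0.1 * rng.gauss(0, 1),) for xi in x]

cmi_indep = knn_cmi(x, y_indep, z)
cmi_dep = knn_cmi(x, y_dep, z)
print(cmi_indep, cmi_dep)
```

On such data the estimate for the independent pair should be close to zero and may be slightly negative, while the dependent pair should give a clearly larger value. The quadratic cost of the brute-force neighbor search is acceptable only for small samples.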

The estimator implemented here assumes the data is continuous.

References

Methods

test(df, x_vars, y_vars[, z_covariates])

Abstract method for all conditional independence tests.

test(df, x_vars, y_vars, z_covariates=None)

Abstract method for all conditional independence tests.

Parameters:

df (pd.DataFrame): The dataframe containing the dataset.
x_vars (Set of column): A column in df.
y_vars (Set of column): A column in df.
z_covariates (Set, optional): A set of columns in df, by default None. If None, then the test should run a standard independence test.

Returns:

Tuple[float, float]: The test statistic and the p-value.
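The parameters n_shuffle_nbrs and n_shuffle indicate that the p-value comes from a null distribution built by shuffling among nearest neighbors in Z. A minimal sketch of that idea follows. It is not the dodiscover implementation: `local_shuffle` and `permutation_pvalue` are hypothetical names, and the choice of the maximum norm, the fallback when all neighbors are already used, and the one-sided p-value formula are assumptions.

```python
import random


def local_shuffle(z, n_shuffle_nbrs, rng):
    """Map each index i to one of its nearest neighbors in Z (z[i] included)."""
    n = len(z)
    mapping = [None] * n
    used = set()
    for i in rng.sample(range(n), n):  # visit the samples in random order
        # Indices sorted by maximum-norm distance to z[i]; index i comes first.
        order = sorted(
            range(n),
            key=lambda j: max(abs(a - b) for a, b in zip(z[i], z[j])),
        )
        candidates = order[:n_shuffle_nbrs]
        rng.shuffle(candidates)
        # Prefer a neighbor that is not used yet; otherwise reuse the last candidate.
        unused = [j for j in candidates if j not in used]
        mapping[i] = unused[0] if unused else candidates[-1]
        used.add(mapping[i])
    return mapping


def permutation_pvalue(observed, null_stats):
    """Fraction of null statistics at least as large as the observed one."""
    return sum(s >= observed for s in null_stats) / len(null_stats)


rng = random.Random(0)
z = [(rng.gauss(0, 1),) for _ in range(50)]
mapping = local_shuffle(z, n_shuffle_nbrs=5, rng=rng)
# Reordering X as [x[j] for j in mapping] keeps every X value near its original Z.

# Illustrative numbers only: one observed statistic against four null statistics.
pvalue = permutation_pvalue(0.12, [0.01, 0.05, 0.20, -0.02])
```

Shuffling only among neighbors in Z is meant to break the dependence between X and Y while preserving the relation of each to Z. Repeating the shuffle n_shuffle times and recomputing the statistic each time would give the list passed as `null_stats`.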