dowhy.gcm.independence_test package

Submodules

dowhy.gcm.independence_test.generalised_cov_measure module

dowhy.gcm.independence_test.generalised_cov_measure.generalised_cov_based(X: ndarray, Y: ndarray, Z: Optional[ndarray] = None, prediction_model_X: Union[AssignmentQuality, Callable[[], PredictionModel]] = AssignmentQuality.BETTER, prediction_model_Y: Union[AssignmentQuality, Callable[[], PredictionModel]] = AssignmentQuality.BETTER)[source]

(Conditional) independence test based on the Generalised Covariance Measure.

Note:

  • Currently, only univariate and continuous X and Y are supported.

  • Residuals are based on the training data.

  • The relationships need to be non-deterministic, i.e., the residuals cannot be constant!

For more details, see: R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48(3), 2018.

Parameters
  • X – Data matrix for observations from X.

  • Y – Data matrix for observations from Y.

  • Z – Optional data matrix for observations from Z. This is the conditional variable.

  • prediction_model_X – Either a model class that will be used as prediction model for regressing X on Z (e.g., a linear regressor) or an AssignmentQuality for automatically selecting a model.

  • prediction_model_Y – Either a model class that will be used as prediction model for regressing Y on Z (e.g., a linear regressor) or an AssignmentQuality for automatically selecting a model.

Returns

The p-value for the null hypothesis that X and Y are independent (given Z).
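The statistic behind this test can be illustrated with a minimal NumPy sketch. It is not the dowhy implementation: it assumes plain least-squares regressions in place of the configurable prediction models, and the helper names linear_residuals and gcm_p_value are hypothetical. The residuals of X on Z and of Y on Z are multiplied, and the normalised mean of the products is compared against a standard normal distribution.

```python
import math

import numpy as np


def linear_residuals(target, Z):
    """Residuals of a least-squares fit of target on Z (with intercept)."""
    design = np.column_stack([np.ones(len(target)), Z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def gcm_p_value(X, Y, Z):
    """Two-sided p-value of the normalised residual-product statistic."""
    # Products of the two residual vectors; their mean is close to zero
    # when X and Y are independent given Z.
    R = linear_residuals(X, Z) * linear_residuals(Y, Z)
    # Normalised statistic, approximately standard normal under the null.
    T = math.sqrt(len(R)) * R.mean() / R.std()
    # 2 * (1 - Phi(|T|)) expressed through the complementary error function.
    return math.erfc(abs(T) / math.sqrt(2))


rng = np.random.default_rng(0)
Z = rng.normal(size=1000)
X = Z + rng.normal(size=1000)
Y_dep = Z + X + rng.normal(size=1000)  # still depends on X given Z
Y_ind = Z + rng.normal(size=1000)      # independent of X given Z

p_dep = gcm_p_value(X, Y_dep, Z)
p_ind = gcm_p_value(X, Y_ind, Z)
print(p_dep, p_ind)
```

Note that R.std() is zero when the residuals are constant, which is why deterministic relationships are not supported.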

dowhy.gcm.independence_test.kernel module

Functions in this module should be considered experimental, meaning there might be breaking API changes in the future.

dowhy.gcm.independence_test.kernel.approx_kernel_based(X: ~numpy.ndarray, Y: ~numpy.ndarray, Z: ~typing.Optional[~numpy.ndarray] = None, num_random_features_X: int = 50, num_random_features_Y: int = 50, num_random_features_Z: int = 50, num_permutations: int = 100, approx_kernel: ~typing.Callable[[~numpy.ndarray], ~numpy.ndarray] = <function approximate_rbf_kernel_features>, scale_data: bool = False, use_bootstrap: bool = True, bootstrap_num_runs: int = 10, bootstrap_num_samples: int = 1000, bootstrap_n_jobs: ~typing.Optional[int] = None, p_value_adjust_func: ~typing.Callable[[~typing.Union[~numpy.ndarray, ~typing.List[float]]], float] = <function quantile_based_fwer>) float[source]

Implementation of the Randomized Conditional Independence Test. The independence test estimates a p-value for the null hypothesis that X and Y are independent (given Z). Depending on whether Z is given, a conditional or pairwise independence test is performed.

If Z is given, RCIT is used as the conditional independence test. If Z is not given, RIT is used as the pairwise independence test.

Note:

  • The data can be multivariate, i.e. the given input matrices can have multiple columns.

  • Categorical data need to be represented as strings.

  • It is possible to apply a different kernel to each column in the matrices. For instance, an RBF kernel for the first dimension in X and a delta kernel for the second.

Based on the work:

Strobl, Eric V., Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference 7.1 (2019).

Parameters
  • X – Data matrix for observations from X.

  • Y – Data matrix for observations from Y.

  • Z – Optional data matrix for observations from Z. This is the conditional variable.

  • num_random_features_X – Number of features sampled from the approximated kernel map for X.

  • num_random_features_Y – Number of features sampled from the approximated kernel map for Y.

  • num_random_features_Z – Number of features sampled from the approximated kernel map for Z.

  • num_permutations – Number of permutations for estimating the test statistic.

  • approx_kernel – The approximated kernel map. The expected input is an n x d numpy array and the output is expected to be an n x k numpy array with k << d. By default, the Nystroem method with an RBF kernel is used.

  • scale_data – If set to True, the data will be standardized. If set to False, the data is taken as it is. Standardizing the data helps in identifying weak dependencies. If one is only interested in stronger ones, consider setting this to False.

  • use_bootstrap – If True, the independence tests are performed on multiple subsets of the data and the final p-value is constructed based on the provided p_value_adjust_func function.

  • bootstrap_num_runs – Number of bootstrap runs (only relevant if use_bootstrap is True).

  • bootstrap_num_samples – Maximum number of used samples per bootstrap run.

  • bootstrap_n_jobs – Number of parallel jobs for the bootstrap runs.

  • p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.

Returns

The p-value for the null hypothesis that X and Y are independent (given Z).
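The interface expected of p_value_adjust_func can be sketched as follows. The function bonferroni_min is a hypothetical example and not part of dowhy; it only illustrates the contract of taking several p-values and returning a single one, here by applying a Bonferroni correction to the smallest p-value.

```python
import numpy as np


def bonferroni_min(p_values):
    """Hypothetical adjust function: Bonferroni-corrected minimum p-value."""
    p_values = np.asarray(p_values, dtype=float)
    # Scale the smallest p-value by the number of tests and cap it at 1.
    return float(min(1.0, p_values.size * p_values.min()))


combined = bonferroni_min([0.04, 0.20, 0.50, 0.01])
print(combined)
```

Under these assumptions, such a callable could be passed as p_value_adjust_func in place of the default quantile_based_fwer.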

dowhy.gcm.independence_test.kernel.kernel_based(X: ~numpy.ndarray, Y: ~numpy.ndarray, Z: ~typing.Optional[~numpy.ndarray] = None, use_bootstrap: bool = True, bootstrap_num_runs: int = 10, bootstrap_num_samples_per_run: int = 2000, bootstrap_n_jobs: ~typing.Optional[int] = None, p_value_adjust_func: ~typing.Callable[[~typing.Union[~numpy.ndarray, ~typing.List[float]]], float] = <function quantile_based_fwer>, **kwargs) float[source]

Prepares the data and runs a kernel-based (conditional) independence test. The independence test estimates a p-value for the null hypothesis that X and Y are independent (given Z). Depending on whether Z is given, a conditional or pairwise independence test is performed.

Here, we utilize the implementations of the https://github.com/cmu-phil/causal-learn package.

If Z is given: Using KCI as conditional independence test, i.e. we use https://github.com/cmu-phil/causal-learn/blob/main/causallearn/utils/KCI/KCI.py#L238. If Z is not given: Using KCI as pairwise independence test, i.e. we use https://github.com/cmu-phil/causal-learn/blob/main/causallearn/utils/KCI/KCI.py#L17.

Note:

  • The data can be multivariate, i.e. the given input matrices can have multiple columns.

  • Categorical data need to be represented as strings.

Based on the work:

  • K. Zhang, J. Peters, D. Janzing, B. Schölkopf. Kernel-based Conditional Independence Test and Application in Causal Discovery. UAI’11, Pages 804–813, 2011.

  • A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. A Kernel Statistical Test of Independence. NIPS 21, 2007.

For more information about configuring the kernel independence test, see:

  • https://github.com/cmu-phil/causal-learn/blob/main/causallearn/utils/KCI/KCI.py#L17 (if Z is not given)

  • https://github.com/cmu-phil/causal-learn/blob/main/causallearn/utils/KCI/KCI.py#L238 (if Z is given)

Parameters
  • X – Data matrix for observations from X.

  • Y – Data matrix for observations from Y.

  • Z – Optional data matrix for observations from Z. This is the conditional variable.

  • use_bootstrap – If True, the independence tests are performed on multiple subsets of the data and the final p-value is constructed based on the provided p_value_adjust_func function.

  • bootstrap_num_runs – Number of bootstrap runs (only relevant if use_bootstrap is True).

  • bootstrap_num_samples_per_run – Number of samples used in a bootstrap run (only relevant if use_bootstrap is True).

  • bootstrap_n_jobs – Number of parallel jobs for the bootstrap runs.

  • p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.

Returns

The p-value for the null hypothesis that X and Y are independent (given Z).
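The bootstrap scheme described by use_bootstrap, bootstrap_num_runs and bootstrap_num_samples_per_run can be sketched roughly as below. This is an illustration only: the helper names are hypothetical, a simple Fisher z-test for correlation stands in for the kernel test, and drawing the subsets without replacement is an assumption rather than a documented detail.

```python
import math

import numpy as np


def correlation_p_value(x, y):
    """Stand-in test: Fisher z-test for zero Pearson correlation."""
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(r) * math.sqrt(len(x) - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def subsampled_p_values(test_func, X, Y, num_runs, num_samples_per_run, rng):
    """Runs test_func on random subsets and returns one p-value per run."""
    n = X.shape[0]
    size = min(n, num_samples_per_run)
    p_values = []
    for _ in range(num_runs):
        # Assumption: each run draws its subset without replacement.
        idx = rng.choice(n, size=size, replace=False)
        p_values.append(test_func(X[idx], Y[idx]))
    return np.array(p_values)


rng = np.random.default_rng(0)
X = rng.normal(size=5000)
Y = X + 0.1 * rng.normal(size=5000)
p_values = subsampled_p_values(
    correlation_p_value, X, Y, num_runs=10, num_samples_per_run=2000, rng=rng
)
print(p_values)
```

The resulting array of p-values is what a p_value_adjust_func would then reduce to a single p-value.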

dowhy.gcm.independence_test.kernel_operation module

Functions in this module should be considered experimental, meaning there might be breaking API changes in the future.

dowhy.gcm.independence_test.kernel_operation.apply_delta_kernel(X: ndarray) ndarray[source]

Applies the delta kernel, i.e. the kernel value is 1 if two entries are equal and 0 otherwise.

Parameters

X – Input data.

Returns

The outcome of the delta kernel, a binary matrix.
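A minimal sketch of a delta kernel for a single column of values, assuming a hypothetical helper delta_kernel (this is not the library function): every pair of entries is compared, and equal pairs give 1 while unequal pairs give 0.

```python
import numpy as np


def delta_kernel(X):
    """Returns an n x n matrix with 1.0 where two entries are equal, else 0.0."""
    X = np.asarray(X).reshape(-1)
    # Broadcasting compares every entry with every other entry.
    return (X[:, None] == X[None, :]).astype(float)


K = delta_kernel(np.array(["a", "b", "a"]))
print(K)
```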

dowhy.gcm.independence_test.kernel_operation.apply_rbf_kernel(X: ndarray, precision: Optional[float] = None) ndarray[source]

Estimates the RBF (Gaussian) kernel for the given input data.

Parameters
  • X – Input data.

  • precision – Specific precision matrix for the RBF kernel. If None is given, this is inferred from the data.

Returns

The outcome of applying a RBF (Gaussian) kernel on the data.
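As an illustration, an RBF kernel matrix can be computed as exp(-precision * squared distance). The helper rbf_kernel below is a hypothetical sketch, and the median heuristic used when precision is None is an assumption; the library may infer the precision differently.

```python
import numpy as np


def rbf_kernel(X, precision=None):
    """RBF kernel matrix exp(-precision * ||x_i - x_j||^2) for an n x d array."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = (diff ** 2).sum(axis=-1)
    if precision is None:
        # Assumption: median heuristic over the non-zero squared distances.
        positive = sq_dists[sq_dists > 0]
        precision = 1.0 / np.median(positive) if positive.size else 1.0
    return np.exp(-precision * sq_dists)


rng = np.random.default_rng(0)
K = rbf_kernel(rng.normal(size=(5, 2)))
print(K.shape)
```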

dowhy.gcm.independence_test.kernel_operation.apply_rbf_kernel_with_adaptive_precision(X: ndarray) ndarray[source]

Estimates the RBF (Gaussian) kernel for the given input data. Here, each column is scaled by an individual precision parameter which is automatically inferred from the data.

Parameters

X – Input data.

Returns

The outcome of applying a RBF (Gaussian) kernel on the data.

dowhy.gcm.independence_test.kernel_operation.approximate_delta_kernel_features(X: ndarray, num_random_components: int) ndarray[source]

Applies the Nystroem method to create a NxD (D << N) approximated delta kernel map using a subset of the data, where N is the number of samples in X and D the number of components. The delta kernel gives 1 if two entries are equal and 0 otherwise.

Parameters
  • X – Input data.

  • num_random_components – Number of components D for the approximated kernel map.

Returns

An NxD approximated delta kernel map, where N is the number of samples in X and D the number of components.

dowhy.gcm.independence_test.kernel_operation.approximate_rbf_kernel_features(X: ndarray, num_random_components: int, precision: Optional[float] = None) ndarray[source]

Applies the Nystroem method to create a NxD (D << N) approximated RBF kernel map using a subset of the data, where N is the number of samples in X and D the number of components.

Parameters
  • X – Input data.

  • num_random_components – Number of components D for the approximated kernel map.

  • precision – Specific precision matrix for the RBF kernel. If None is given, this is inferred from the data.

Returns

A NxD approximated RBF kernel map, where N is the number of samples in X and D the number of components.
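The Nystroem idea can be sketched in plain NumPy as follows. This is an illustrative sketch and not the library code: the helper names rbf and nystroem_rbf_features are hypothetical, and a fixed precision of 1.0 is assumed instead of one inferred from the data. D landmark rows are sampled, and the kernel between all rows and the landmarks is multiplied by the inverse square root of the landmark kernel matrix.

```python
import numpy as np


def rbf(A, B, precision):
    """RBF kernel values between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-precision * sq_dists)


def nystroem_rbf_features(X, num_components, precision=1.0, rng=None):
    """Maps an N x d array to N x D features approximating the RBF kernel."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick D landmark rows at random (without replacement).
    idx = rng.choice(X.shape[0], size=num_components, replace=False)
    landmarks = X[idx]
    K_mm = rbf(landmarks, landmarks, precision)
    K_nm = rbf(X, landmarks, precision)
    # Inverse square root of K_mm via eigendecomposition; tiny eigenvalues
    # are clipped for numerical stability.
    eigvals, eigvecs = np.linalg.eigh(K_mm)
    inv_sqrt = eigvecs / np.sqrt(np.clip(eigvals, 1e-12, None))
    return K_nm @ inv_sqrt


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
features = nystroem_rbf_features(X, num_components=20, rng=rng)
print(features.shape)
```

The product features @ features.T then approximates the full N x N kernel matrix while only N x D values have to be stored.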

dowhy.gcm.independence_test.regression module

Regression based (conditional) independence test. Testing independence via regression, i.e. if a variable has information about another variable, then they are dependent.

dowhy.gcm.independence_test.regression.regression_based(X: ~numpy.ndarray, Y: ~numpy.ndarray, Z: ~typing.Optional[~numpy.ndarray] = None, num_components_all_inputs: int = 40, num_runs: int = 20, p_value_adjust_func: ~typing.Callable[[~typing.Union[~numpy.ndarray, ~typing.List[float]]], float] = <function quantile_based_fwer>, f_test_samples_ratio: ~typing.Optional[float] = 0.3, max_samples_per_run: int = 10000) float[source]

The main idea is that if X and Y are dependent, then X should help in predicting Y. If there is no dependency, then X should not help. When Z is given, the idea remains the same, but here X and Y are conditionally independent given Z if X does not help in predicting Y when knowing Z. That is, X has no additional information about Y given Z. In the pairwise case (Z is not given), the performance (in terms of squared error) of predicting Y based on X is compared with that of predicting Y by returning its mean (the best estimator without any inputs). Note that categorical inputs are transformed via the sklearn one-hot-encoder.

Here, we use the sklearn.kernel_approximation.Nystroem approach to approximate a kernel map of the inputs that serves as new input features. These new features make it possible to model complex non-linear relationships. In case of categorical data, we first apply a one-hot encoding and then map it into the kernel feature space. Afterwards, we use linear regression as a prediction model based on the non-linear input features. The idea is then the same as in Granger causality, where we apply an F-test to see if the additional input features significantly help in predicting the target or not.

Note: Compared to kernel_based(), this method is quite fast and provides reasonably good results. However, there are types of dependencies that this test cannot detect. For instance, if X determines the variance of Y, then this cannot be captured. For these more complex dependencies, consider using the kernel_based() independence test instead.

This test is motivated by Granger causality, the approx_kernel_based test and the following paper:

K. Chalupka, P. Perona, F. Eberhardt. Fast Conditional Independence Test for Vector Variables with Large Sample Sizes. arXiv:1804.02747, 2018.

Parameters
  • X – Input data for X.

  • Y – Input data for Y.

  • Z – Input data for Z. The set of variables to (optionally) condition on.

  • num_components_all_inputs – Number of kernel features when combining X and Z. If Z is not given, it will be replaced with an empty array. If Z is given, half of the number is used to generate features for Z.

  • num_runs – Number of runs. This equals the number of estimated p-values, which get adjusted by the p_value_adjust_func.

  • p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.

  • f_test_samples_ratio – Ratio for splitting the data into test and training data sets for calculating the f-statistic. A ratio of 0.3 means that 30% of the samples are used for the f-test (test samples) and 70% are used for training the prediction model (training samples). If set to None, training and test data set are the same, which could help in settings where only a few samples are available.

  • max_samples_per_run – Maximum number of samples used per run.

Returns

The p-value for the null hypothesis that X and Y are independent given Z. If Z is not given, then for the hypothesis that X and Y are independent.
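The comparison of nested prediction models can be illustrated with a simplified sketch. It deviates from the method described above in two labeled ways: it assumes plain linear features instead of Nystroem kernel features, and it computes the statistic on the training samples (comparable in spirit to f_test_samples_ratio=None). The helper names fit_rss and nested_f_statistic are hypothetical.

```python
import numpy as np


def fit_rss(features, target):
    """Least-squares fit with intercept; returns (RSS, number of parameters)."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    rss = float(((target - design @ coef) ** 2).sum())
    return rss, design.shape[1]


def nested_f_statistic(X, Y, Z):
    """F-statistic comparing 'Y from Z' against 'Y from Z and X'."""
    rss_small, p_small = fit_rss(Z, Y)
    rss_full, p_full = fit_rss(np.column_stack([Z, X]), Y)
    n = len(Y)
    # Reduction in squared error per added parameter, relative to the
    # remaining error per degree of freedom of the larger model.
    numerator = (rss_small - rss_full) / (p_full - p_small)
    denominator = rss_full / (n - p_full)
    return numerator / denominator


rng = np.random.default_rng(0)
Z = rng.normal(size=500)
X = Z + rng.normal(size=500)
Y = Z + 2 * X + rng.normal(size=500)  # X helps in predicting Y given Z
f_stat = nested_f_statistic(X, Y, Z)
print(f_stat)
```

A large F-statistic indicates that X improves the prediction of Y beyond Z. A p-value could then be read from the F-distribution with (p_full - p_small, n - p_full) degrees of freedom, for example via scipy.stats.f.sf.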

Module contents

dowhy.gcm.independence_test.independence_test(X, Y, conditioned_on=None, method='kernel', **kwargs)[source]

Performs a (conditional) independence test. Four methods for (conditional) independence testing are currently supported:

  • kernel: Kernel-based (conditional) independence test.

      K. Zhang, J. Peters, D. Janzing, B. Schölkopf. Kernel-based Conditional Independence Test and Application in Causal Discovery. UAI’11, Pages 804–813, 2011.

      A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. A Kernel Statistical Test of Independence. NIPS 21, 2007.

    Here, we utilize the implementations of the https://github.com/cmu-phil/causal-learn package.

  • approx_kernel: Approximate kernel-based (conditional) independence test.

      E. Strobl, K. Zhang, S. Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 2019.

  • regression: Regression-based (conditional) independence test using an F-test. See regression_based() for more details.

  • gcm: (Conditional) independence test based on the Generalised Covariance Measure. See generalised_cov_based() for more details.

      R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48(3), 2018.

Parameters
  • X – Observations of X.

  • Y – Observations of Y.

  • conditioned_on – Observations of the conditioning variable if a conditional independence test should be performed. By default, an unconditional (pairwise) independence test is carried out.

  • method – Method for the (conditional) independence test. The choices are:

      kernel (default): kernel_based() (conditional) independence test.

      approx_kernel: approx_kernel_based() (conditional) independence test.

      regression: regression_based() (conditional) independence test.

      gcm: generalised_cov_based() (conditional) independence test.

    For more information about these methods, see above.

Returns

p-value of the (conditional) independence test. (Conditional) Independence is the null hypothesis.