Independence Tests

Assume we have the following data:

>>> import numpy as np, pandas as pd
>>>
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

To test whether \(X\) is conditionally independent of \(Z\) given \(Y\) using the kernel dependence measure, pass the conditioning variable via conditioned_on:

>>> import dowhy.gcm as gcm
>>>
>>> # Null hypothesis: X is independent of Z given Y
>>> p_value = gcm.independence_test(X, Z, conditioned_on=Y)
>>> p_value
0.48386151342564865

With a significance threshold of 0.05 (a common default), the p-value is clearly above the threshold, so we cannot reject the null hypothesis: the data are consistent with \(X\) and \(Z\) being independent when we condition on \(Y\). This is what we would expect, given that we generated the data using the causal graph \(X \rightarrow Y \rightarrow Z\), where \(Z\) is conditionally independent of \(X\) given \(Y\). Because the data are drawn without a fixed random seed, the exact p-value will differ from run to run.
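
As a rough cross-check of this reasoning, the sketch below compares the marginal correlation of \(X\) and \(Z\) with their partial correlation given \(Y\). This is not the kernel test used by gcm.independence_test. It is a simpler linear stand-in that is only appropriate here because the data are linear with Gaussian noise. The helpers pearson and residuals are hypothetical names written for this illustration, and the data are regenerated with the standard library and a fixed seed so that the block runs on its own.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def residuals(target, regressor):
    """Residuals of a least-squares line fit of target on regressor."""
    n = len(target)
    mean_t, mean_r = sum(target) / n, sum(regressor) / n
    slope = sum(
        (r - mean_r) * (t - mean_t) for r, t in zip(regressor, target)
    ) / sum((r - mean_r) ** 2 for r in regressor)
    intercept = mean_t - slope * mean_r
    return [t - (intercept + slope * r) for t, r in zip(target, regressor)]


# Same generating process as above (X -> Y -> Z), with a fixed seed.
rng = random.Random(0)
n = 1000
X = [rng.gauss(0, 1) for _ in range(n)]
Y = [2 * x + rng.gauss(0, 1) for x in X]
Z = [3 * y + rng.gauss(0, 1) for y in Y]

# Marginal association between X and Z (expected to be strong).
marginal_corr = pearson(X, Z)

# Association left after the linear effect of Y is removed from both
# variables (expected to be close to zero for this chain).
partial_corr = pearson(residuals(X, Y), residuals(Z, Y))

print(f"marginal corr(X, Z):    {marginal_corr:.3f}")
print(f"partial corr(X, Z | Y): {partial_corr:.3f}")
```

A partial correlation near zero only rules out linear dependence given \(Y\). For non-linear relationships, a general-purpose test such as the kernel-based one is the safer choice.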

To test whether \(X\) is independent of \(Z\) (without conditioning on \(Y\)), we can use the same function without the third argument.

>>> # Null hypothesis: X is independent of Z
>>> p_value = gcm.independence_test(X, Z)
>>> p_value
0.0

Using the same threshold of 0.05, the p-value is now clearly below it, so we reject the null hypothesis and conclude that \(X\) and \(Z\) are dependent. Again, this is what we would expect: \(Z\) depends on \(Y\) and \(Y\) depends on \(X\), and this time we do not condition on \(Y\).
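
To make the meaning of such a p-value more concrete, here is a minimal sketch of a permutation test for marginal independence. It is an assumption-laden illustration and not how gcm.independence_test works internally: the test statistic is the absolute Pearson correlation instead of a kernel dependence measure, and abs_correlation and permutation_p_value are hypothetical helper names. Shuffling one variable breaks any dependence, so the shuffled statistics approximate what the statistic looks like under the null hypothesis.

```python
import math
import random


def abs_correlation(a, b):
    """Absolute sample Pearson correlation, used here as the test statistic."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return abs(cov / math.sqrt(var_a * var_b))


def permutation_p_value(a, b, num_permutations=200, seed=0):
    """Share of shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = abs_correlation(a, b)
    shuffled = list(b)
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # destroys any link between a and b
        if abs_correlation(a, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed ordering as one of the permutations.
    return (exceed + 1) / (num_permutations + 1)


# Same generating process as above (X -> Y -> Z), with a fixed seed.
rng = random.Random(0)
n = 1000
X = [rng.gauss(0, 1) for _ in range(n)]
Y = [2 * x + rng.gauss(0, 1) for x in X]
Z = [3 * y + rng.gauss(0, 1) for y in Y]
W = [rng.gauss(0, 1) for _ in range(n)]  # noise unrelated to X, for contrast

p_dependent = permutation_p_value(X, Z)
p_independent = permutation_p_value(X, W)

print("p-value for X vs. Z:", p_dependent)
print("p-value for X vs. W:", p_independent)
```

With this construction the smallest attainable p-value is 1 / (num_permutations + 1), so a larger number of permutations is needed to resolve very small p-values.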