# Computing Counterfactuals

By computing counterfactuals, we answer the question:

I observed a certain outcome z for a variable Z where variable X was set to a value x. What would have happened to the value of Z, had I intervened on X to assign it a different value x’?

As a concrete example, we can imagine the following:

I’m seeing unhealthy high levels of my cholesterol LDL (Z=10). I didn’t take any medication against it in recent months (X=0). What would have happened to my cholesterol LDL level (Z), had I taken a medication dosage of 5g a day (X := 5)?

## How to use it

To see how the method works, let’s generate some data:

>>> import networkx as nx, numpy as np, pandas as pd
>>> from dowhy import gcm

>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))


Next, we’ll model cause-effect relationships as an invertible SCM and fit it to the data:

>>> causal_model = gcm.InvertibleStructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
>>> causal_model.set_causal_mechanism('X', gcm.EmpiricalDistribution())

>>> gcm.fit(causal_model, training_data)


Finally, let’s compute the counterfactual when intervening on X:

>>> gcm.counterfactual_samples(
>>>     causal_model,
>>>     {'X': lambda x: 2},
>>>     observed_data=pd.DataFrame(data=dict(X=[1], Y=[2], Z=[3])))
X         Y         Z
0  2  4.034229  9.073294


As we can see, $$X$$ takes our treatment-/intervention-value of 2, and $$Y$$ and $$Z$$ take deterministic values, based on our trained causal models and fixed observed data. I.e., based on the data generation process, if $$X = 1$$, $$Y = 2$$, we would expect $$Z$$ to be 6, but we observed $$Z = 3$$, which means the particular noise value for $$Z$$ in this particular sample is approximately -2.98. Now, given that we know this hidden noise factor, we can estimate the counterfactual value of $$Z$$, had we set $$X := 2$$, which is approximately 9.07 (as can be seen in the result above).

This shows that the observed data is used to calculate the noise data in the system. We can also provide these noise values directly, via:

>>> gcm.counterfactual_samples(
>>>     causal_model,
>>>     {'X': lambda x: 2},
>>>     noise_data=pd.DataFrame(data=dict(X=[0], Y=[-0.007913], Z=[-2.97568])))
X         Y         Z
0  2  4.034229  9.073293


As we see, with $$X = 2$$ and $$Y \approx 4.03$$, $$Z$$ should be approximately 12. But we know the hidden noise for this sample, approximately -2.98. So the counterfactual outcome is again $$Z \approx 9.07$$.

## Understanding the method

Counterfactuals are very similar to Simulating the Impact of Interventions, with an important difference: when performing interventions, we look into the future, for counterfactuals we look into an alternative past. To reflect this in the computation, when performing interventions, we generate all noise using our causal models. For counterfactuals, we use the noise from actual observed data.

To expand on our example above, we assume there are other factors that contribute to cholesterol levels, e.g. exercising or genetic predisposition. While we assume medication helps against high LDL levels, it’s important to take into account all other factors that could also help against it. We want to prove what has helped. Hence, it’s important to use the noise from the real data, not some generated noise from our generative models. Otherwise, I may be able to reduce my cholesterol LDL level in the counterfactual world, where I take medication (X := 5), but not because I took the medication, but because the generated noise of Z also just happened to be low and so caused a low value for Z. By taking the real noise value of Z (derived from the observed data of Z), I can prove that it was the medication that helped.