DoWhy: Different estimation methods for causal inference#

This is a quick introduction to the DoWhy causal inference library. We will load a sample dataset and use different methods to estimate the causal effect of a (pre-specified) treatment variable on a (pre-specified) outcome variable.

We will see that not all estimators return the correct effect for this dataset.

First, let us load all the required packages.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import numpy as np
import pandas as pd
import logging

import dowhy
dowhy.enable_notebook_rendering()
from dowhy import CausalModel
import dowhy.datasets

Now, let us load a dataset. For simplicity, we simulate a dataset with linear relationships between common causes and treatment, and common causes and outcome.

Beta is the true causal effect. Here it is set to 10, so a good estimator should return a value close to 10.

[3]:
data = dowhy.datasets.linear_dataset(beta=10,
        num_common_causes=5,
        num_instruments = 2,
        num_treatments=1,
        num_samples=10000,
        treatment_is_binary=True,
        outcome_is_binary=False,
        stddev_treatment_noise=10)
df = data["df"]
df
[3]:
       Z0        Z1        W0        W1        W2        W3        W4     v0          y
0     0.0  0.521189 -1.639186 -0.276627 -0.296830  2.271959  2.408246   True  16.223471
1     0.0  0.325566 -0.332861  0.718347 -2.136903 -0.271269  1.088424  False  -3.370176
2     0.0  0.693185  0.051449  0.346476 -1.036660 -1.505436  0.382447  False  -3.466292
3     0.0  0.807450 -0.657515 -1.643131 -0.833240  1.613055 -1.094913   True  -5.200987
4     0.0  0.584739 -2.028879 -2.678483 -0.166404 -1.582827 -1.340383  False -22.972843
...   ...       ...       ...       ...       ...       ...       ...    ...        ...
9995  0.0  0.362923 -1.127461 -1.044314 -0.509937 -1.281489  0.084620  False -10.079448
9996  0.0  0.522627 -1.112406 -2.002737  1.694356 -0.127747 -1.139429  False  -7.414506
9997  0.0  0.721496 -0.239057 -0.719546  0.468047 -0.576893  1.610039  False   2.966132
9998  0.0  0.985682 -0.863647 -0.691356 -0.032836 -1.804727  0.262590   True   4.352545
9999  0.0  0.673087 -1.121177 -2.102748 -0.638730 -0.342560  1.675938   True   0.028921

10000 rows × 9 columns

Note that the data is stored in a pandas DataFrame.
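
To see why the common causes matter, here is a minimal sketch in plain Python (no DoWhy) of a much smaller simulated setup. Everything in it is an assumption made for illustration: a single binary common cause `w`, treatment probabilities of 0.8 and 0.2, and an outcome coefficient of 5 on `w`. Under these assumptions the naive difference in means is expected to be about 10 + 5 * (0.8 - 0.2) = 13 rather than the true effect of 10.

```python
import random
from statistics import fmean

def simulate(n=20000, beta=10.0, seed=0):
    """Toy data: a binary common cause w raises both treatment odds and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.random() < 0.5                          # common cause
        t = rng.random() < (0.8 if w else 0.2)          # treatment depends on w
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)    # outcome depends on t and w
        rows.append((w, t, y))
    return rows

rows = simulate()
treated = [y for _, t, y in rows if t]
control = [y for _, t, y in rows if not t]

# Comparing raw group means ignores w, so the estimate is confounded.
naive = fmean(treated) - fmean(control)
print(f"naive difference in means: {naive:.2f}, true effect: 10.00")
```

The estimators in the rest of this notebook are different ways of removing this kind of bias.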

Identifying the causal estimand#

We now input a causal graph in the GML graph format.

[4]:
# With graph
model=CausalModel(
        data = df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        graph=data["gml_graph"],
        instruments=data["instrument_names"]
        )
[5]:
model.view_model()
[Figure: causal graph of the model]
[6]:
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
[Figure: the same causal graph, loaded from causal_model.png]

We get a causal graph. Next, we identify the causal estimand and then estimate the effect.

[7]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                              -1⎤
 ⎢    d        ⎛    d          ⎞  ⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟  ⎥
 ⎣d[Z₀  Z₁]    ⎝d[Z₀  Z₁]      ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
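
As a rough illustration of what the backdoor estimand means, the sketch below computes it by hand for a toy dataset with one binary common cause `w` (an assumption made here; the dataset above has five continuous ones). For a discrete common cause the estimand reduces to a weighted average of within-stratum differences, sum over w of P(w) * (E[y | t=1, w] - E[y | t=0, w]). The data-generating numbers are hypothetical.

```python
import random
from statistics import fmean

rng = random.Random(1)
rows = []
for _ in range(20000):
    w = rng.random() < 0.5                          # binary common cause
    t = rng.random() < (0.8 if w else 0.2)          # treatment depends on w
    y = 10.0 * t + 5.0 * w + rng.gauss(0.0, 1.0)    # true effect of t is 10
    rows.append((w, t, y))

def backdoor_ate(rows):
    """Average the within-stratum treated/control gap, weighted by P(w)."""
    ate = 0.0
    for level in (False, True):
        stratum = [(t, y) for w, t, y in rows if w == level]
        y1 = fmean([y for t, y in stratum if t])
        y0 = fmean([y for t, y in stratum if not t])
        ate += len(stratum) / len(rows) * (y1 - y0)
    return ate

adjusted = backdoor_ate(rows)
print(f"backdoor-adjusted estimate: {adjusted:.2f}")
```

Conditioning on `w` blocks the path treatment <- w -> outcome, which is what the unconfoundedness assumption printed above requires of the observed common causes.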

Method 1: Regression#

We use linear regression of the outcome on the treatment and the common causes.

[8]:
causal_estimate_reg = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True)
print(causal_estimate_reg)
print("Causal Estimate is " + str(causal_estimate_reg.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

## Realized estimand
b: y~v0+W3+W1+W4+W0+W2
Target units: ate

## Estimate
Mean value: 10.00031962605583
p-value: 0.0

Causal Estimate is 10.00031962605583
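
The idea behind regression adjustment can be sketched without any library. The example below assumes a toy dataset with a single continuous common cause `w` and a linear outcome (both assumptions for illustration). It uses the Frisch-Waugh-Lovell result: the coefficient on `t` in the regression y ~ t + w equals the slope of residualized `y` on residualized `t`.

```python
import random
from statistics import fmean

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after regressing it on x with an intercept."""
    b = slope(x, y)
    a = fmean(y) - b * fmean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

rng = random.Random(2)
w = [rng.gauss(0, 1) for _ in range(20000)]                          # common cause
t = [1.0 if wi + rng.gauss(0, 1) > 0 else 0.0 for wi in w]           # treatment
y = [10.0 * ti + 5.0 * wi + rng.gauss(0, 1) for ti, wi in zip(t, w)] # outcome

unadjusted = slope(t, y)                                 # ignores w, confounded
beta_hat = slope(residuals(w, t), residuals(w, y))       # coefficient on t in y ~ t + w
print(f"unadjusted: {unadjusted:.2f}, adjusted: {beta_hat:.2f}")
```

Because the outcome here really is linear in `t` and `w`, the adjusted coefficient lands close to 10, which mirrors why linear regression does well on the linear dataset used in this notebook.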

Method 2: Distance Matching#

Define a distance metric and then use the metric to match closest points between treatment and control.

[9]:
causal_estimate_dmatch = model.estimate_effect(identified_estimand,
                                              method_name="backdoor.distance_matching",
                                              target_units="att",
                                              method_params={'distance_metric':"minkowski", 'p':2})
print(causal_estimate_dmatch)
print("Causal Estimate is " + str(causal_estimate_dmatch.value))
/home/runner/.cache/pypoetry/virtualenvs/dowhy-n6DJFijf-py3.9/lib/python3.9/site-packages/sklearn/neighbors/_unsupervised.py:179: SyntaxWarning: Parameter p is found in metric_params. The corresponding parameter from __init__ is ignored.
  return self._fit(X)
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

## Realized estimand
b: y~v0+W3+W1+W4+W0+W2
Target units: att

## Estimate
Mean value: 10.431665973408665

Causal Estimate is 10.431665973408665
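
A minimal sketch of the matching idea, assuming a toy dataset with a single continuous common cause `w`, so that the Minkowski distance with p=2 is just the absolute difference. Each treated unit is paired with its nearest control (with replacement), and the mean paired difference estimates the effect on the treated (ATT). The data-generating numbers are hypothetical.

```python
import bisect
import math
import random
from statistics import fmean

rng = random.Random(3)
units = []
for _ in range(5000):
    w = rng.gauss(0, 1)                                  # common cause
    t = rng.random() < 1 / (1 + math.exp(-w))            # treatment odds rise with w
    y = 10.0 * t + 5.0 * w + rng.gauss(0, 1)             # true effect of t is 10
    units.append((w, t, y))

# Sort controls by w so the nearest neighbour can be found by binary search.
controls = sorted((w, y) for w, t, y in units if not t)
control_w = [w for w, _ in controls]

def nearest_control_outcome(w):
    """Outcome of the control unit whose covariate value is closest to w."""
    i = bisect.bisect_left(control_w, w)
    candidates = controls[max(i - 1, 0): i + 1]          # neighbours on either side
    return min(candidates, key=lambda c: abs(c[0] - w))[1]

att = fmean([y - nearest_control_outcome(w) for w, t, y in units if t])
print(f"matching estimate of the ATT: {att:.2f}")
```

Matching is only as good as the matches it can find: where treated units have no nearby controls, the paired differences still carry some of the confounding, which is one reason a matching estimate can sit further from the true value than a regression estimate.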

Method 3: Propensity Score Stratification#

We will be using propensity scores to stratify units in the data.

[10]:
causal_estimate_strat = model.estimate_effect(identified_estimand,
                                              method_name="backdoor.propensity_score_stratification",
                                              target_units="att")
print(causal_estimate_strat)
print("Causal Estimate is " + str(causal_estimate_strat.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

## Realized estimand
b: y~v0+W3+W1+W4+W0+W2
Target units: att

## Estimate
Mean value: 10.094333265243263

Causal Estimate is 10.094333265243263
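
The stratification step can be sketched as follows. For brevity this toy example assumes the true propensity score is known (a logistic function of one common cause `w`); in practice the score would be estimated from the common causes. Units are sorted by score, cut into equal-sized strata, and the within-stratum differences in means are averaged with weights proportional to the number of treated units, which targets the ATT. The choice of 20 strata is arbitrary.

```python
import math
import random
from statistics import fmean

rng = random.Random(4)
units = []
for _ in range(20000):
    w = rng.gauss(0, 1)
    ps = 1 / (1 + math.exp(-w))              # propensity score, assumed known here
    t = rng.random() < ps
    y = 10.0 * t + 5.0 * w + rng.gauss(0, 1) # true effect of t is 10
    units.append((ps, t, y))

def stratified_att(units, num_strata=20):
    """Per-stratum difference in means, weighted by the treated count."""
    ordered = sorted(units)                  # order by propensity score
    size = len(ordered) // num_strata
    total, n_treated = 0.0, 0
    for s in range(num_strata):
        stratum = ordered[s * size:(s + 1) * size]
        y1 = [y for _, t, y in stratum if t]
        y0 = [y for _, t, y in stratum if not t]
        if not y1 or not y0:                 # skip strata without overlap
            continue
        total += len(y1) * (fmean(y1) - fmean(y0))
        n_treated += len(y1)
    return total / n_treated

att = stratified_att(units)
print(f"stratification estimate of the ATT: {att:.2f}")
```

Within a stratum the propensity score still varies a little, so a small residual bias remains; using more strata reduces it at the cost of noisier per-stratum means.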

Method 4: Propensity Score Matching#

We will be using propensity scores to match units in the data.

[11]:
causal_estimate_match = model.estimate_effect(identified_estimand,
                                              method_name="backdoor.propensity_score_matching",
                                              target_units="atc")
print(causal_estimate_match)
print("Causal Estimate is " + str(causal_estimate_match.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

## Realized estimand
b: y~v0+W3+W1+W4+W0+W2
Target units: atc

## Estimate
Mean value: 9.809813300369935

Causal Estimate is 9.809813300369935

Method 5: Weighting#

We will be using (inverse) propensity scores to assign weights to units in the data. DoWhy supports a few different weighting schemes:

  1. Vanilla Inverse Propensity Score weighting (IPS) (weighting_scheme="ips_weight")

  2. Self-normalized IPS weighting (also known as the Hajek estimator) (weighting_scheme="ips_normalized_weight")

  3. Stabilized IPS weighting (weighting_scheme="ips_stabilized_weight")

[12]:
causal_estimate_ipw = model.estimate_effect(identified_estimand,
                                            method_name="backdoor.propensity_score_weighting",
                                            target_units = "ate",
                                            method_params={"weighting_scheme":"ips_weight"})
print(causal_estimate_ipw)
print("Causal Estimate is " + str(causal_estimate_ipw.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

## Realized estimand
b: y~v0+W3+W1+W4+W0+W2
Target units: ate

## Estimate
Mean value: 9.994893941263896

Causal Estimate is 9.994893941263896
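
The three weighting schemes can be sketched in their textbook forms. This is an illustration under stated assumptions, not DoWhy's internal code: the toy data has one binary common cause `w`, and the propensity score is estimated as the treated share within each `w` group.

```python
import random

rng = random.Random(5)
rows = []
for _ in range(20000):
    w = rng.random() < 0.5
    t = rng.random() < (0.8 if w else 0.2)
    y = 10.0 * t + 5.0 * w + rng.gauss(0.0, 1.0)     # true effect of t is 10
    rows.append((w, t, y))

n = len(rows)
# Propensity score e[w] = P(t=1 | w), estimated by the treated share per group.
e = {}
for level in (False, True):
    group = [t for w, t, _ in rows if w == level]
    e[level] = sum(group) / len(group)
p_treated = sum(t for _, t, _ in rows) / n

# 1. Vanilla IPS: weighted outcomes are summed and divided by n.
ips = sum(y / e[w] if t else -y / (1 - e[w]) for w, t, y in rows) / n

def weighted_difference(weight):
    """Difference of weighted outcome means, each arm divided by its own weight sum."""
    num = {True: 0.0, False: 0.0}
    den = {True: 0.0, False: 0.0}
    for w, t, y in rows:
        k = weight(w, t)
        num[t] += k * y
        den[t] += k
    return num[True] / den[True] - num[False] / den[False]

# 2. Self-normalized (Hajek): same weights, normalized within each arm.
hajek = weighted_difference(lambda w, t: 1 / e[w] if t else 1 / (1 - e[w]))

# 3. Stabilized: weights are multiplied by the marginal probability of the arm.
stabilized = weighted_difference(
    lambda w, t: p_treated / e[w] if t else (1 - p_treated) / (1 - e[w]))

print(f"ips: {ips:.2f}, hajek: {hajek:.2f}, stabilized: {stabilized:.2f}")
```

With per-arm normalization, the stabilized weights give the same point estimate as the Hajek form, because the extra factor is constant within each arm. Their practical benefit is that the weights stay on a scale near 1, which matters when they are passed to a weighted model.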

Method 6: Instrumental Variable#

We will be using the Wald estimator for the provided instrumental variable.

[13]:
causal_estimate_iv = model.estimate_effect(identified_estimand,
        method_name="iv.instrumental_variable", method_params = {'iv_instrument_name': 'Z0'})
print(causal_estimate_iv)
print("Causal Estimate is " + str(causal_estimate_iv.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                              -1⎤
 ⎢    d        ⎛    d          ⎞  ⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟  ⎥
 ⎣d[Z₀  Z₁]    ⎝d[Z₀  Z₁]      ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)

## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
 ⎡ d    ⎤
E⎢───(y)⎥
 ⎣dZ₀   ⎦
──────────
 ⎡ d     ⎤
E⎢───(v₀)⎥
 ⎣dZ₀    ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['v0'] is affected in the same way by common causes of ['v0'] and ['y']
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome ['y'] is affected in the same way by common causes of ['v0'] and ['y']

Target units: ate

## Estimate
Mean value: 8.6012566409475

Causal Estimate is 8.6012566409475
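
For a binary instrument, the Wald estimator is the ratio of two differences in means: the effect of the instrument on the outcome divided by its effect on the treatment. The sketch below assumes a toy dataset with an unobserved common cause `u`, a binary instrument `z`, and a constant treatment effect; all coefficients are hypothetical.

```python
import random
from statistics import fmean

rng = random.Random(6)
rows = []
for _ in range(40000):
    u = rng.gauss(0, 1)                           # unobserved common cause
    z = rng.random() < 0.5                        # binary instrument, independent of u
    t = 1.5 * z + u + rng.gauss(0, 1) > 0.75      # z shifts treatment uptake
    y = 10.0 * t + 2.0 * u + rng.gauss(0, 1)      # z affects y only through t
    rows.append((z, t, y))

def mean_by_z(index, level):
    """Mean of column `index` among units whose instrument equals `level`."""
    return fmean([row[index] for row in rows if row[0] == level])

reduced_form = mean_by_z(2, True) - mean_by_z(2, False)   # effect of z on y
first_stage = mean_by_z(1, True) - mean_by_z(1, False)    # effect of z on t
wald = reduced_form / first_stage

# For comparison: the raw treated/control gap is biased by u.
naive = (fmean([y for _, t, y in rows if t])
         - fmean([y for _, t, y in rows if not t]))
print(f"naive: {naive:.2f}, Wald: {wald:.2f}")
```

The estimate is a ratio, so its noise grows as the first stage shrinks. A weak instrument can therefore give an estimate far from the true effect even when the assumptions hold.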

Method 7: Regression Discontinuity#

Internally, DoWhy converts this into an equivalent instrumental variables problem.

[14]:
causal_estimate_regdist = model.estimate_effect(identified_estimand,
        method_name="iv.regression_discontinuity",
        method_params={'rd_variable_name':'Z1',
                       'rd_threshold_value':0.5,
                       'rd_bandwidth': 0.15})
print(causal_estimate_regdist)
print("Causal Estimate is " + str(causal_estimate_regdist.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                              -1⎤
 ⎢    d        ⎛    d          ⎞  ⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟  ⎥
 ⎣d[Z₀  Z₁]    ⎝d[Z₀  Z₁]      ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)

## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
 ⎡        d            ⎤
E⎢──────────────────(y)⎥
 ⎣dlocal_rd_variable   ⎦
─────────────────────────
 ⎡        d             ⎤
E⎢──────────────────(v₀)⎥
 ⎣dlocal_rd_variable    ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['v0'] is affected in the same way by common causes of ['v0'] and ['y']
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome ['y'] is affected in the same way by common causes of ['v0'] and ['y']

Target units: ate

## Estimate
Mean value: 1.6581211399709523

Causal Estimate is 1.6581211399709523
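
One common way to treat a regression discontinuity as an instrumental variables problem is sketched below; it is an illustration under stated assumptions and not necessarily what DoWhy does internally. Only units within the bandwidth of the threshold are kept, and the side of the threshold on which a unit falls serves as a binary instrument for a Wald estimate. The toy data assumes that treatment uptake jumps at the threshold and that the outcome depends only mildly and smoothly on the running variable `r`.

```python
import random
from statistics import fmean

THRESHOLD, BANDWIDTH = 0.5, 0.1

rng = random.Random(7)
rows = []
for _ in range(60000):
    r = rng.random()                                          # running variable in [0, 1)
    u = rng.gauss(0, 1)                                       # unobserved common cause
    t = 1.5 * (r >= THRESHOLD) + u + rng.gauss(0, 1) > 0.75   # uptake jumps at the threshold
    y = 10.0 * t + 2.0 * u + 1.0 * r + rng.gauss(0, 1)        # smooth direct effect of r
    rows.append((r, t, y))

# Keep units near the threshold; the side they fall on is the local instrument.
local = [(r >= THRESHOLD, t, y) for r, t, y in rows if abs(r - THRESHOLD) <= BANDWIDTH]
above = [(t, y) for side, t, y in local if side]
below = [(t, y) for side, t, y in local if not side]

jump_in_outcome = fmean([y for _, y in above]) - fmean([y for _, y in below])
jump_in_treatment = fmean([t for t, _ in above]) - fmean([t for t, _ in below])
rd_estimate = jump_in_outcome / jump_in_treatment
print(f"units in window: {len(local)}, RD estimate: {rd_estimate:.2f}")
```

A narrower bandwidth reduces the bias from the smooth dependence of the outcome on `r`, but leaves fewer units and a noisier estimate. The method also only makes sense when the treatment actually changes at the chosen threshold, which is not how the dataset in this notebook was generated.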

Method 8: Doubly Robust Estimator#

This method combines a regression estimator and a propensity score estimator to give a doubly robust estimate.

[15]:
causal_estimate_doubly_robust = model.estimate_effect(identified_estimand,
        method_name="backdoor.doubly_robust",
        method_params={'propensity_score_column':'propensity_score_dr'}
    )
print(causal_estimate_doubly_robust)
print("Causal Estimate is " + str(causal_estimate_doubly_robust.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W3,W1,W4,W0,W2])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W3,W1,W4,W0,W2,U) = P(y|v0,W3,W1,W4,W0,W2)

## Realized estimand
b: y~v0+W3+W1+W4+W0+W2
Target units: ate

## Estimate
Mean value: 10.000181279858218

Causal Estimate is 10.000181279858218
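
The "doubly robust" property can be checked on a toy example. The sketch below uses the augmented inverse propensity weighting (AIPW) form, one standard doubly robust estimator, and is not a copy of DoWhy's implementation. The data has a single binary common cause `w`. Each nuisance model is deliberately broken in turn: the "bad" outcome model ignores `w`, and the "bad" propensity model is a constant 0.5.

```python
import random
from statistics import fmean

rng = random.Random(8)
rows = []
for _ in range(20000):
    w = rng.random() < 0.5
    t = rng.random() < (0.8 if w else 0.2)
    y = 10.0 * t + 5.0 * w + rng.gauss(0.0, 1.0)     # true effect of t is 10
    rows.append((w, t, y))

def aipw(rows, mu, e):
    """Outcome-model contrast plus a propensity-weighted residual correction."""
    terms = []
    for w, t, y in rows:
        if t:
            correction = (y - mu(True, w)) / e(w)
        else:
            correction = -(y - mu(False, w)) / (1 - e(w))
        terms.append(mu(True, w) - mu(False, w) + correction)
    return fmean(terms)

levels = (False, True)
cell_mean = {(a, b): fmean([y for w, t, y in rows if t == a and w == b])
             for a in levels for b in levels}
arm_mean = {a: fmean([y for _, t, y in rows if t == a]) for a in levels}
ps = {b: fmean([t for w, t, _ in rows if w == b]) for b in levels}

good_mu = lambda t, w: cell_mean[(t, w)]    # outcome model that uses w
bad_mu = lambda t, w: arm_mean[t]           # outcome model that ignores w
good_e = lambda w: ps[w]                    # propensity model that uses w
bad_e = lambda w: 0.5                       # propensity model that ignores w

wrong_outcome_model = aipw(rows, bad_mu, good_e)
wrong_propensity_model = aipw(rows, good_mu, bad_e)
both_wrong = aipw(rows, bad_mu, bad_e)
print(wrong_outcome_model, wrong_propensity_model, both_wrong)
```

The estimate stays close to the true effect as long as either the outcome model or the propensity model is right, and falls back to the confounded value only when both are wrong.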

Method 9: TabPFN Estimator#

We will use TabPFN (a Prior-Data Fitted Network) as the outcome model to estimate the causal effect via backdoor adjustment.
It is best suited for datasets with ≤10,000 samples and ≤500 features, and requires 'pip install tabpfn torch'.

Note: This example uses 10,000 samples and thus requires a GPU. The code below is left commented out. For a CPU-compatible walkthrough with smaller datasets, see dowhy_tabpfn_estimator.ipynb.

[16]:
# causal_estimate_tabpfn = model.estimate_effect(identified_estimand,
#         method_name="backdoor.tabpfn",
#         method_params={
#             "n_estimators": 8,
#             "model_type": "auto",
#             "max_num_classes": 10,
#             "use_multi_gpu": False,
#         },
# )
# print(causal_estimate_tabpfn)
# print("Causal Estimate is " + str(causal_estimate_tabpfn.value))