DoWhy: Different estimation methods for causal inference

This is a quick introduction to the DoWhy causal inference library. We will load a sample dataset and use different methods to estimate the causal effect of a (pre-specified) treatment variable on a (pre-specified) outcome variable.

We will see that not all estimators return the correct effect for this dataset.

First, let us load all required packages.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import numpy as np
import pandas as pd
import logging

import dowhy
dowhy.enable_notebook_rendering()
from dowhy import CausalModel
import dowhy.datasets

Now, let us load a dataset. For simplicity, we simulate a dataset with linear relationships between common causes and treatment, and common causes and outcome.

The beta parameter is the true causal effect (10 in this example).

[3]:
data = dowhy.datasets.linear_dataset(beta=10,
        num_common_causes=5,
        num_instruments = 2,
        num_treatments=1,
        num_samples=10000,
        treatment_is_binary=True,
        outcome_is_binary=False,
        stddev_treatment_noise=10)
df = data["df"]
df
[3]:
Z0 Z1 W0 W1 W2 W3 W4 v0 y
0 0.0 0.836922 -0.635739 -3.107120 0.565293 -0.870615 -0.045307 False -5.671485
1 1.0 0.680337 0.601540 1.893270 -0.084677 0.643179 0.877950 True 17.744425
2 0.0 0.300228 -0.286370 -1.649859 0.538017 -2.244456 1.040170 True 6.983368
3 0.0 0.215431 0.333290 -0.274047 -0.272167 -0.438043 1.368528 True 14.205900
4 0.0 0.847581 0.607704 -0.392435 0.031734 -1.500619 0.494786 True 9.841750
... ... ... ... ... ... ... ... ... ...
9995 0.0 0.177485 -1.776931 -1.320769 -0.568874 0.352540 -0.145678 False -6.902765
9996 0.0 0.808309 -0.234813 -3.226110 -0.799990 -0.742876 2.066283 True 11.335233
9997 0.0 0.543554 -1.103504 0.398692 0.440676 -0.332330 1.428742 False 1.736446
9998 0.0 0.574400 -2.005687 0.652051 -2.688344 -1.427622 0.209042 True -2.002048
9999 1.0 0.101969 -0.079424 0.243855 -0.449377 -1.922466 1.492573 False 0.028715

10000 rows × 9 columns

Note that the data is stored in a pandas DataFrame.
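
To make the role of the common causes concrete, here is a minimal sketch of a similar data-generating process written with the standard library only. It is not how dowhy.datasets.linear_dataset is implemented; the single common cause w, the logistic treatment assignment, the coefficient 5.0, and the helper names simulate and naive_difference are all assumptions made for illustration. It shows that a plain difference in means overstates the true effect when a common cause is ignored.

```python
import math
import random


def simulate(n=20000, beta=10.0, seed=0):
    """Return rows (w, t, y): one common cause, a binary treatment, an outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)                                    # common cause
        t = 1 if rng.random() < 1.0 / (1.0 + math.exp(-w)) else 0  # w raises P(t=1)
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)               # linear outcome
        rows.append((w, t, y))
    return rows


def naive_difference(rows):
    """Difference in mean outcome between treated and control, ignoring w."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


rows = simulate()
print("true effect:", 10.0)
print("naive difference in means:", round(naive_difference(rows), 2))
```

Because w pushes both the treatment and the outcome upward, the naive difference lands well above 10 in this simulation.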

Identifying the causal estimand

We now input a causal graph in the GML graph format.

[4]:
# With graph
model=CausalModel(
        data = df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        graph=data["gml_graph"],
        instruments=data["instrument_names"]
        )
[5]:
model.view_model()
[Figure: the causal graph rendered by model.view_model()]
[6]:
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
[Figure: the saved causal graph image, causal_model.png]

We get a causal graph. Next, we identify the causal estimand and then estimate the effect.

[7]:
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                              -1⎤
 ⎢    d        ⎛    d          ⎞  ⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟  ⎥
 ⎣d[Z₀  Z₁]    ⎝d[Z₀  Z₁]      ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

Method 1: Regression

Use linear regression.

[8]:
causal_estimate_reg = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True)
print(causal_estimate_reg)
print("Causal Estimate is " + str(causal_estimate_reg.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

## Realized estimand
b: y~v0+W2+W4+W0+W3+W1
Target units: ate

## Estimate
Mean value: 9.999936206606924
p-value: [0.]

Causal Estimate is 9.999936206606924
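
As a rough illustration of what regression adjustment does, the sketch below regresses the outcome on the treatment and a single common cause and reads the effect off the treatment coefficient. It uses simulated data and a small hand-written least-squares solver; the data-generating process and the helper names simulate and ols are hypothetical, and this is not DoWhy's implementation.

```python
import math
import random


def simulate(n=20000, beta=10.0, seed=1):
    """Return rows (w, t, y): one common cause, a binary treatment, an outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        t = 1 if rng.random() < 1.0 / (1.0 + math.exp(-w)) else 0
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, t, y))
    return rows


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rows = simulate()
X = [[1.0, float(t), w] for w, t, _ in rows]   # intercept, treatment, common cause
y = [yi for _, _, yi in rows]
intercept, effect, w_coef = ols(X, y)
print("estimated effect of t:", round(effect, 3))
```

Because the simulated outcome really is linear in t and w, the treatment coefficient should be close to the true effect; with a misspecified outcome model this would not be guaranteed.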

Method 2: Distance Matching

Define a distance metric and then use the metric to match closest points between treatment and control.

[9]:
causal_estimate_dmatch = model.estimate_effect(identified_estimand,
                                              method_name="backdoor.distance_matching",
                                              target_units="att",
                                              method_params={'distance_metric':"minkowski", 'p':2})
print(causal_estimate_dmatch)
print("Causal Estimate is " + str(causal_estimate_dmatch.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

## Realized estimand
b: y~v0+W2+W4+W0+W3+W1
Target units: att

## Estimate
Mean value: 10.455156667560296

Causal Estimate is 10.455156667560296
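
The matching idea can be sketched in a few lines. The example below is a simplified, hypothetical version with a single covariate w, where the Minkowski distance with p=2 reduces to an absolute difference: each treated unit is paired (with replacement) with the control unit whose w is closest, and the average outcome difference estimates the effect on the treated. The simulated data and the helper names are assumptions, not DoWhy internals.

```python
import bisect
import math
import random


def simulate(n=20000, beta=10.0, seed=2):
    """Return rows (w, t, y): one common cause, a binary treatment, an outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        t = 1 if rng.random() < 1.0 / (1.0 + math.exp(-w)) else 0
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, t, y))
    return rows


def att_by_matching(rows):
    """Match each treated unit to the control with the nearest w (with replacement)."""
    controls = sorted((w, y) for w, t, y in rows if t == 0)
    control_ws = [w for w, _ in controls]
    diffs = []
    for w, t, y in rows:
        if t != 1:
            continue
        i = bisect.bisect_left(control_ws, w)
        # The nearest control is one of the two neighbours of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda c: abs(control_ws[c] - w))
        diffs.append(y - controls[j][1])
    return sum(diffs) / len(diffs)


rows = simulate()
att = att_by_matching(rows)
print("matching estimate (ATT):", round(att, 2))
```

With several covariates the sorted lookup no longer applies, and a nearest-neighbour search over the chosen distance metric takes its place.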

Method 3: Propensity Score Stratification

We will be using propensity scores to stratify units in the data.

[10]:
causal_estimate_strat = model.estimate_effect(identified_estimand,
                                              method_name="backdoor.propensity_score_stratification",
                                              target_units="att")
print(causal_estimate_strat)
print("Causal Estimate is " + str(causal_estimate_strat.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

## Realized estimand
b: y~v0+W2+W4+W0+W3+W1
Target units: att

## Estimate
Mean value: 9.970363155065543

Causal Estimate is 9.970363155065543
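
Stratification can be illustrated as follows. This sketch sorts simulated units by their propensity score, cuts them into equal-sized strata, compares treated and control means inside each stratum, and weights the strata by their number of treated units to target the ATT. For simplicity it uses the known propensity function of the simulation instead of fitting a propensity model; the number of strata (50) and all helper names are hypothetical choices, and DoWhy's own defaults may differ.

```python
import math
import random


def propensity(w):
    """True P(t=1 | w) used by the simulation below (a logistic curve)."""
    return 1.0 / (1.0 + math.exp(-w))


def simulate(n=20000, beta=10.0, seed=3):
    """Return rows (w, t, y): one common cause, a binary treatment, an outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        t = 1 if rng.random() < propensity(w) else 0
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, t, y))
    return rows


def att_by_stratification(rows, num_strata=50):
    """Average within-stratum mean differences, weighted by treated counts."""
    ranked = sorted(rows, key=lambda r: propensity(r[0]))
    size = len(ranked) // num_strata
    weighted_sum, total_treated = 0.0, 0
    for s in range(num_strata):
        start = s * size
        end = (s + 1) * size if s < num_strata - 1 else len(ranked)
        stratum = ranked[start:end]
        treated = [y for _, t, y in stratum if t == 1]
        control = [y for _, t, y in stratum if t == 0]
        if not treated or not control:
            continue  # no within-stratum comparison is possible
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        weighted_sum += diff * len(treated)
        total_treated += len(treated)
    return weighted_sum / total_treated


rows = simulate()
att = att_by_stratification(rows)
print("stratification estimate (ATT):", round(att, 2))
```

Using fewer strata leaves more variation in w inside each stratum, which lets some confounding back into the estimate.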

Method 4: Propensity Score Matching

We will be using propensity scores to match units in the data.

[11]:
causal_estimate_match = model.estimate_effect(identified_estimand,
                                              method_name="backdoor.propensity_score_matching",
                                              target_units="atc")
print(causal_estimate_match)
print("Causal Estimate is " + str(causal_estimate_match.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

## Realized estimand
b: y~v0+W2+W4+W0+W3+W1
Target units: atc

## Estimate
Mean value: 10.00883155313543

Causal Estimate is 10.00883155313543
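
To show what target_units="atc" means in a matching context, the sketch below matches in the opposite direction: every control unit is paired with the treated unit whose propensity score is closest, so the average difference describes the effect on the controls. As before, the data is simulated, the propensity score is the known one from the simulation instead of a fitted model, and the helper names are hypothetical.

```python
import bisect
import math
import random


def propensity(w):
    """True P(t=1 | w) used by the simulation below (a logistic curve)."""
    return 1.0 / (1.0 + math.exp(-w))


def simulate(n=20000, beta=10.0, seed=4):
    """Return rows (w, t, y): one common cause, a binary treatment, an outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        t = 1 if rng.random() < propensity(w) else 0
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, t, y))
    return rows


def atc_by_propensity_matching(rows):
    """Match each control unit to the treated unit with the nearest propensity score."""
    treated = sorted((propensity(w), y) for w, t, y in rows if t == 1)
    treated_scores = [score for score, _ in treated]
    diffs = []
    for w, t, y in rows:
        if t != 0:
            continue
        score = propensity(w)
        i = bisect.bisect_left(treated_scores, score)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(treated)]
        j = min(candidates, key=lambda c: abs(treated_scores[c] - score))
        diffs.append(treated[j][1] - y)   # matched treated outcome minus control outcome
    return sum(diffs) / len(diffs)


rows = simulate()
atc = atc_by_propensity_matching(rows)
print("propensity matching estimate (ATC):", round(atc, 2))
```

In this simulation the effect is the same for every unit, so ATT, ATC, and ATE coincide; with heterogeneous effects they would generally differ.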

Method 5: Weighting

We will be using (inverse) propensity scores to assign weights to units in the data. DoWhy supports a few different weighting schemes:

  1. Vanilla Inverse Propensity Score weighting (IPS) (weighting_scheme="ips_weight")

  2. Self-normalized IPS weighting (also known as the Hajek estimator) (weighting_scheme="ips_normalized_weight")

  3. Stabilized IPS weighting (weighting_scheme="ips_stabilized_weight")

[12]:
causal_estimate_ipw = model.estimate_effect(identified_estimand,
                                            method_name="backdoor.propensity_score_weighting",
                                            target_units = "ate",
                                            method_params={"weighting_scheme":"ips_weight"})
print(causal_estimate_ipw)
print("Causal Estimate is " + str(causal_estimate_ipw.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

## Realized estimand
b: y~v0+W2+W4+W0+W3+W1
Target units: ate

## Estimate
Mean value: 10.282649493490295

Causal Estimate is 10.282649493490295
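
The difference between the first two schemes can be sketched with the textbook formulas. The plain form divides the weighted outcome sums by the sample size, while the self-normalized (Hajek) form divides by the sum of the weights in each arm. The example uses simulated data and the known propensity function; the helper names are hypothetical, the stabilized scheme is not shown, and DoWhy's internal formulas may differ in detail.

```python
import math
import random


def propensity(w):
    """True P(t=1 | w) used by the simulation below (a logistic curve)."""
    return 1.0 / (1.0 + math.exp(-w))


def simulate(n=20000, beta=10.0, seed=5):
    """Return rows (w, t, y): one common cause, a binary treatment, an outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        t = 1 if rng.random() < propensity(w) else 0
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, t, y))
    return rows


def ipw_estimates(rows):
    """Return (plain, self_normalized) inverse-propensity-weighted ATE estimates."""
    n = len(rows)
    treated_sum = control_sum = treated_weight = control_weight = 0.0
    for w, t, y in rows:
        e = propensity(w)
        if t == 1:
            treated_sum += y / e              # weight 1/e for treated units
            treated_weight += 1.0 / e
        else:
            control_sum += y / (1.0 - e)      # weight 1/(1-e) for control units
            control_weight += 1.0 / (1.0 - e)
    plain = treated_sum / n - control_sum / n
    self_normalized = treated_sum / treated_weight - control_sum / control_weight
    return plain, self_normalized


rows = simulate()
plain, self_normalized = ipw_estimates(rows)
print("plain IPS estimate:", round(plain, 2))
print("self-normalized IPS estimate:", round(self_normalized, 2))
```

Units whose propensity score is close to 0 or 1 receive very large weights, which is why weighting estimates tend to be noisier than regression on the same data.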

Method 6: Instrumental Variable

We will be using the Wald estimator for the provided instrumental variable.

[13]:
causal_estimate_iv = model.estimate_effect(identified_estimand,
        method_name="iv.instrumental_variable", method_params = {'iv_instrument_name': 'Z0'})
print(causal_estimate_iv)
print("Causal Estimate is " + str(causal_estimate_iv.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                              -1⎤
 ⎢    d        ⎛    d          ⎞  ⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟  ⎥
 ⎣d[Z₀  Z₁]    ⎝d[Z₀  Z₁]      ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)

## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
 ⎡ d    ⎤
E⎢───(y)⎥
 ⎣dZ₀   ⎦
──────────
 ⎡ d     ⎤
E⎢───(v₀)⎥
 ⎣dZ₀    ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['v0'] is affected in the same way by common causes of ['v0'] and ['y']
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome ['y'] is affected in the same way by common causes of ['v0'] and ['y']

Target units: ate

## Estimate
Mean value: 10.918935794115917

Causal Estimate is 10.918935794115917
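
For a binary instrument, the Wald estimator is the ratio of two differences in means: the instrument's effect on the outcome divided by its effect on the treatment. The sketch below applies it to simulated data with an unobserved common cause u, which a backdoor method could not adjust for. The data-generating process and the helper names simulate_iv and wald_estimate are hypothetical.

```python
import random


def simulate_iv(n=20000, beta=10.0, seed=6):
    """Return rows (z, t, y); the common cause u is deliberately not recorded."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.randint(0, 1)                  # binary instrument, assigned at random
        u = rng.gauss(0.0, 1.0)                # unobserved common cause of t and y
        t = 1 if 2.0 * z + u + rng.gauss(0.0, 1.0) > 1.0 else 0
        y = beta * t + 5.0 * u + rng.gauss(0.0, 1.0)   # z affects y only through t
        rows.append((z, t, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def wald_estimate(rows):
    """(E[y|z=1] - E[y|z=0]) / (E[t|z=1] - E[t|z=0])."""
    y1 = mean([y for z, _, y in rows if z == 1])
    y0 = mean([y for z, _, y in rows if z == 0])
    t1 = mean([t for z, t, _ in rows if z == 1])
    t0 = mean([t for z, t, _ in rows if z == 0])
    return (y1 - y0) / (t1 - t0)


rows = simulate_iv()
estimate = wald_estimate(rows)
print("Wald estimate:", round(estimate, 2))
```

The estimate is only as stable as the denominator: an instrument that barely moves the treatment makes the ratio very noisy.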

Method 7: Regression Discontinuity

Internally, DoWhy converts this to an equivalent instrumental variables problem.

[14]:
causal_estimate_regdist = model.estimate_effect(identified_estimand,
        method_name="iv.regression_discontinuity",
        method_params={'rd_variable_name':'Z1',
                       'rd_threshold_value':0.5,
                       'rd_bandwidth': 0.15})
print(causal_estimate_regdist)
print("Causal Estimate is " + str(causal_estimate_regdist.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                              -1⎤
 ⎢    d        ⎛    d          ⎞  ⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟  ⎥
 ⎣d[Z₀  Z₁]    ⎝d[Z₀  Z₁]      ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)

## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
 ⎡        d            ⎤
E⎢──────────────────(y)⎥
 ⎣dlocal_rd_variable   ⎦
─────────────────────────
 ⎡        d             ⎤
E⎢──────────────────(v₀)⎥
 ⎣dlocal_rd_variable    ⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})
Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['v0'] is affected in the same way by common causes of ['v0'] and ['y']
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome ['y'] is affected in the same way by common causes of ['v0'] and ['y']

Target units: ate

## Estimate
Mean value: 5.7632332979669165

Causal Estimate is 5.7632332979669165
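
One textbook way to read a regression discontinuity design as an instrumental variables problem is to keep only the units whose running variable lies within a bandwidth of the threshold and to use "above the threshold" as a binary instrument in a Wald ratio. The sketch below does this on simulated data in which crossing the threshold really does change the probability of treatment. This is an illustration of the general idea under stated assumptions; the data-generating process and helper names are hypothetical, and DoWhy's internal conversion may differ in its details.

```python
import random


def simulate_rd(n=40000, beta=10.0, threshold=0.5, seed=7):
    """Return rows (r, t, y): running variable, binary treatment, outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        r = rng.random()                        # running variable on [0, 1)
        u = rng.gauss(0.0, 1.0)                 # unobserved common cause
        push = 2.0 if r > threshold else 0.0    # crossing the threshold encourages treatment
        t = 1 if push + u + rng.gauss(0.0, 1.0) > 1.0 else 0
        y = beta * t + 5.0 * u + r + rng.gauss(0.0, 1.0)   # y also drifts slowly with r
        rows.append((r, t, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def local_wald(rows, threshold, bandwidth):
    """Wald ratio with 'r above threshold' as the instrument, inside the bandwidth."""
    local = [(r > threshold, t, y) for r, t, y in rows
             if abs(r - threshold) <= bandwidth]
    y_above = mean([y for above, _, y in local if above])
    y_below = mean([y for above, _, y in local if not above])
    t_above = mean([t for above, t, _ in local if above])
    t_below = mean([t for above, t, _ in local if not above])
    return (y_above - y_below) / (t_above - t_below)


rows = simulate_rd()
estimate = local_wald(rows, threshold=0.5, bandwidth=0.1)
print("local Wald estimate:", round(estimate, 2))
```

A wider bandwidth uses more data but lets the direct dependence of y on r leak into the estimate, so the bandwidth trades variance against bias. The approach also relies on treatment uptake actually jumping at the chosen threshold.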

Method 8: Doubly Robust Estimator

This method combines a regression estimator and a propensity score estimator to produce a doubly robust estimate.

[15]:
causal_estimate_doubly_robust = model.estimate_effect(identified_estimand,
        method_name="backdoor.doubly_robust",
        method_params={'propensity_score_column':'propensity_score_dr'}
    )
print(causal_estimate_doubly_robust)
print("Causal Estimate is " + str(causal_estimate_doubly_robust.value))
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W4,W0,W3,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W4,W0,W3,W1,U) = P(y|v0,W2,W4,W0,W3,W1)

## Realized estimand
b: y~v0+W2+W4+W0+W3+W1
Target units: ate

## Estimate
Mean value: 9.99995734126882

Causal Estimate is 9.99995734126882
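
The "doubly robust" property can be demonstrated with the augmented inverse propensity weighting (AIPW) formula, a standard doubly robust estimator: the estimate stays close to the truth if either the outcome model or the propensity model is correct, and fails only when both are wrong. The sketch below uses simulated data with one binary common cause and deliberately misspecified models that ignore it. All names and the data-generating process are hypothetical, and DoWhy's doubly robust estimator may be implemented differently.

```python
import random


def simulate(n=20000, beta=10.0, seed=8):
    """Return rows (w, t, y) with a binary common cause w."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.randint(0, 1)
        t = 1 if rng.random() < (0.8 if w else 0.2) else 0   # w raises P(t=1)
        y = beta * t + 5.0 * w + rng.gauss(0.0, 1.0)
        rows.append((w, t, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def aipw(rows, outcome_model, propensity_model):
    """Outcome-model contrast plus an inverse-propensity-weighted residual correction."""
    terms = []
    for w, t, y in rows:
        mu1, mu0 = outcome_model(1, w), outcome_model(0, w)
        e = propensity_model(w)
        correction = t * (y - mu1) / e - (1 - t) * (y - mu0) / (1.0 - e)
        terms.append(mu1 - mu0 + correction)
    return mean(terms)


rows = simulate()

# Correct models: outcome means and treatment frequencies per value of w.
cell_mean = {(t, w): mean([y for w2, t2, y in rows if (t2, w2) == (t, w)])
             for t in (0, 1) for w in (0, 1)}
cell_prop = {w: mean([t for w2, t, _ in rows if w2 == w]) for w in (0, 1)}
# Misspecified outcome model: one mean per treatment arm, ignoring w.
arm_mean = {t: mean([y for _, t2, y in rows if t2 == t]) for t in (0, 1)}


def good_outcome(t, w):
    return cell_mean[(t, w)]


def bad_outcome(t, w):
    return arm_mean[t]


def good_propensity(w):
    return cell_prop[w]


def bad_propensity(w):
    return 0.5   # misspecified: pretends treatment does not depend on w


print("outcome ok, propensity wrong:", round(aipw(rows, good_outcome, bad_propensity), 2))
print("outcome wrong, propensity ok:", round(aipw(rows, bad_outcome, good_propensity), 2))
print("both wrong:", round(aipw(rows, bad_outcome, bad_propensity), 2))
```

When both models ignore w, the correction terms vanish and the estimate collapses to the confounded difference in means.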