Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any pandas DataFrame: after importing dowhy.api, a DataFrame exposes a .causal accessor whose do() method draws samples from an interventional distribution.

[1]:
import os, sys
sys.path.append(os.path.abspath("../../../"))
[2]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[3]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Adding noise to data. Without noise, the variance in Y|X, Z is zero, and MCMC fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[3]:
            W0     v0         y
0     0.230046  False  0.328777
1    -0.625509   True  3.794958
2    -2.600417  False -5.214096
3     0.146942  False  1.361876
4    -0.919551  False -2.228864
..         ...    ...       ...
995  -0.063647  False  0.117779
996  -1.962547  False -4.034488
997  -0.089728   True  5.780041
998  -0.334302   True  3.872871
999  -2.410448  False -4.524930

1000 rows × 3 columns

[4]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause]).groupby(treatment).mean().plot(y=outcome, kind='bar')
WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['W0', 'U']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
WARN: Do you want to continue by ignoring any unobserved confounders? (use proceed_when_unidentifiable=True to disable this prompt) [y/n] y
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0f4a609e80>
../_images/example_notebooks_dowhy_causal_api_4_4.png
[5]:
df.causal.do(x={treatment: 1},
              variable_types={treatment:'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              method='weighting',
              common_causes=[common_cause],
              proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['W0', 'U']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0f48316358>
../_images/example_notebooks_dowhy_causal_api_5_2.png
[6]:
cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['W0', 'U']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['W0', 'U']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
[7]:
cdf_0
[7]:
            W0     v0         y  propensity_score    weight
0    -0.583286  False -2.456860          0.574597  1.740350
1    -1.853274  False -3.699267          0.754520  1.325346
2    -1.530711  False -4.384006          0.713824  1.400906
3    -0.036469  False -2.269524          0.486654  2.054847
4    -1.749115  False -3.311137          0.741816  1.348043
..         ...    ...       ...               ...       ...
995  -1.227516  False -0.875867          0.672107  1.487859
996  -0.405553  False -1.427872          0.546258  1.830637
997   0.230046  False  0.328777          0.443752  2.253509
998  -2.157752  False -4.830362          0.789181  1.267136
999  -0.453149  False -2.167928          0.553884  1.805431

1000 rows × 5 columns

[8]:
cdf_1
[8]:
            W0    v0         y  propensity_score    weight
0    -0.229733  True  4.754793          0.482075  2.074366
1    -0.238390  True  4.487643          0.480676  2.080404
2    -2.161507  True -1.700668          0.210415  4.752520
3    -1.639879  True  2.173829          0.271959  3.677028
4     0.030840  True  4.465655          0.524224  1.907580
..         ...   ...       ...               ...       ...
995  -1.688769  True  2.404017          0.265737  3.763121
996  -0.910527  True  2.528176          0.374608  2.669460
997  -2.838431  True -0.226373          0.146704  6.816457
998   0.429132  True  6.865883          0.587791  1.701285
999  -1.509348  True  2.153207          0.289010  3.460088

1000 rows × 5 columns
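
In both frames the weight column is exactly the reciprocal of the propensity_score column (e.g. 1/0.574597 ≈ 1.740350 in cdf_0), which is consistent with inverse-probability weighting. A quick sanity check, as a sketch reusing the cdf_0 and cdf_1 frames computed above:

import numpy as np

# Assumption, based on the output above: the sampler's weights are simply
# 1 / propensity_score (inverse-probability weights).
for cdf in (cdf_0, cdf_1):
    assert np.allclose(cdf['weight'], 1.0 / cdf['propensity_score'])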

Comparing the estimate to Linear Regression

First, we estimate the effect using the causal data frame, along with its 95% confidence interval.

[9]:
(cdf_1['y'] - cdf_0['y']).mean()
[9]:
$\displaystyle 5.006194579687894$
[10]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[10]:
$\displaystyle 0.2245958282589085$
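
The half-width above is 1.96 standard errors of the mean per-row difference between the two interventional samples (1.96 being the normal quantile for a two-sided 95% interval); cell [10] computes

$\displaystyle 1.96 \cdot \frac{\operatorname{sd}\!\left(y^{do(v_0=1)} - y^{do(v_0=0)}\right)}{\sqrt{n}}, \qquad n = 1000.$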

Next, we compare it with the estimate from OLS.

[11]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[11]:
                          OLS Regression Results
=========================================================================
Dep. Variable:                  y   R-squared (uncentered):        0.938
Model:                        OLS   Adj. R-squared (uncentered):   0.938
Method:             Least Squares   F-statistic:                   7544.
Date:            Tue, 07 Jan 2020   Prob (F-statistic):             0.00
Time:                    11:54:35   Log-Likelihood:              -1408.7
No. Observations:            1000   AIC:                           2821.
Df Residuals:                 998   BIC:                           2831.
Df Model:                       2
Covariance Type:        nonrobust
=========================================================================
            coef    std err          t      P>|t|      [0.025      0.975]
x1        2.2679      0.025     89.049      0.000       2.218       2.318
x2        5.0091      0.050    100.297      0.000       4.911       5.107
=========================================================================
Omnibus:                2.664   Durbin-Watson:               2.034
Prob(Omnibus):          0.264   Jarque-Bera (JB):            2.431
Skew:                  -0.049   Prob(JB):                    0.297
Kurtosis:               2.779   Cond. No.                     2.03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
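
Both estimates recover the simulated effect of beta=5: the do-sampling estimate above is about 5.006 ± 0.225, and the OLS coefficient on the treatment column (x2) is about 5.009. A minimal sketch putting the two side by side, assuming the result and cdf_0/cdf_1 objects from the cells above:

# Sketch: compare the do-sampling estimate with the OLS treatment coefficient.
ate_do = (cdf_1['y'] - cdf_0['y']).mean()   # ~5.006, from cell [9]
ate_ols = result.params[1]                  # ~5.009, coefficient on the treatment column (x2)
print(f"do-sampling estimate: {ate_do:.3f}   OLS estimate: {ate_ols:.3f}")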