Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any dataframe.

[1]:
import os, sys
sys.path.append(os.path.abspath("../../../"))
[2]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[3]:
data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments = 0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df)) # Adding noise to data. Without noise, the variance in Y|X, Z is zero, and mcmc fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment= data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[3]:
W0 v0 y
0 -1.289334 False -3.324194
1 -1.100803 False -2.035903
2 -3.564829 False -5.986921
3 -2.036167 False -2.869478
4 -2.781469 False -6.127631
... ... ... ...
995 -1.178597 True 2.965997
996 0.223498 True 4.161892
997 -2.582064 False -3.472824
998 -1.998788 False -0.725005
999 -0.789288 False -2.080137

1000 rows × 3 columns

[4]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     common_causes=[common_cause],
                     proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.causal_identifier:Frontdoor variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
[4]:
<AxesSubplot:xlabel='v0'>
../_images/example_notebooks_dowhy_causal_api_4_2.png
[5]:
df.causal.do(x={treatment: 1},
              variable_types={treatment:'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              method='weighting',
              common_causes=[common_cause],
              proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.causal_identifier:Frontdoor variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
[5]:
<AxesSubplot:xlabel='v0'>
../_images/example_notebooks_dowhy_causal_api_5_2.png
[6]:
cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.causal_identifier:Frontdoor variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.causal_identifier:Frontdoor variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
[7]:
cdf_0
[7]:
W0 v0 y propensity_score weight
0 -0.782669 False -0.792179 0.825287 1.211700
1 -1.386957 False -2.541074 0.943911 1.059422
2 -1.704419 False -3.922381 0.970418 1.030484
3 -0.814005 False -0.313702 0.834584 1.198202
4 0.029784 False -0.307668 0.461182 2.168343
... ... ... ... ... ...
995 -0.587235 False -0.086199 0.757990 1.319279
996 -0.518178 False -0.673428 0.730369 1.369171
997 -0.285695 False -2.356726 0.624268 1.601876
998 -1.146594 False -4.006878 0.910335 1.098497
999 -1.636781 False -4.698411 0.966051 1.035142

1000 rows × 5 columns

[8]:
cdf_1
[8]:
W0 v0 y propensity_score weight
0 -2.094548 True 1.099802 0.013245 75.499593
1 -1.584775 True 1.996253 0.037724 26.508238
2 -2.287651 True -0.454571 0.008865 112.808679
3 -2.287651 True -0.454571 0.008865 112.808679
4 -0.427584 True 2.698249 0.308738 3.238989
... ... ... ... ... ...
995 1.803262 True 8.249998 0.979852 1.020562
996 -0.335329 True 4.826016 0.351590 2.844225
997 0.656759 True 8.514054 0.813633 1.229055
998 0.946281 True 5.565004 0.889192 1.124617
999 0.192547 True 3.447829 0.621941 1.607870

1000 rows × 5 columns

Comparing the estimate to Linear Regression

First, estimating the effect using the causal data frame, and the 95% confidence interval.

[9]:
(cdf_1['y'] - cdf_0['y']).mean()
[9]:
$\displaystyle 4.869733164425347$
[10]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[10]:
$\displaystyle 0.18841744756332976$

Comparing to the estimate from OLS.

[11]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[11]:
OLS Regression Results
Dep. Variable: y R-squared (uncentered): 0.926
Model: OLS Adj. R-squared (uncentered): 0.926
Method: Least Squares F-statistic: 6276.
Date: Sat, 05 Dec 2020 Prob (F-statistic): 0.00
Time: 18:02:18 Log-Likelihood: -1436.6
No. Observations: 1000 AIC: 2877.
Df Residuals: 998 BIC: 2887.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 1.8467 0.022 82.505 0.000 1.803 1.891
x2 5.0277 0.067 74.507 0.000 4.895 5.160
Omnibus: 0.268 Durbin-Watson: 1.984
Prob(Omnibus): 0.874 Jarque-Bera (JB): 0.291
Skew: 0.040 Prob(JB): 0.864
Kurtosis: 2.973 Cond. No. 3.02


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.