Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any pandas DataFrame: importing dowhy.api registers a causal accessor on DataFrame objects.

[1]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise; without it, the variance of Y given X, Z is zero and MCMC fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
            W0     v0         y
0     0.326689  False  0.448556
1    -0.906599   True  2.325172
2     0.378685  False -1.157150
3    -0.299435  False  0.630886
4     1.042586   True  2.949594
...        ...    ...       ...
995   0.770774   True  5.430327
996  -0.586816   True  5.585402
997   0.722458  False  1.299977
998  -0.395364   True  4.492349
999  -0.586639   True  2.965871

1000 rows × 3 columns
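
For intuition, the generated data follows a simple linear structural model: the common cause W0 drives both the binary treatment v0 and the outcome y, and the treatment shifts y by beta=5. A minimal sketch of that data-generating process (an illustrative reconstruction, not linear_dataset's actual code):

rng = np.random.default_rng(0)
n = 1000
W0 = rng.normal(size=n)                   # common cause
p_treat = 1 / (1 + np.exp(-W0))           # treatment probability depends on W0
v0 = rng.uniform(size=n) < p_treat        # binary treatment, confounded by W0
y = 5 * v0 + W0 + rng.normal(size=n)      # true effect beta=5, plus confounding and noise
toy_df = pd.DataFrame({'W0': W0, 'v0': v0, 'y': y})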

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot:xlabel='v0'>
[Figure: bar chart of the interventional mean of y for each value of the treatment v0]
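
For contrast, the naive comparison that simply conditions on the observed treatment (and is therefore confounded by W0) can be computed directly from the raw dataframe; the do-operation above instead samples from the interventional distribution P(y | do(v0)), adjusting for the listed common causes:

df.groupby(treatment)[outcome].mean()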
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot:xlabel='v0'>
[Figure: bar chart of the mean of y from the weighting do-sample, grouped by v0]
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
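
These two calls identify the effect from the dataset's causal graph rather than an explicit common_causes list. To inspect the graph handed to DoWhy (the exact string is generated by linear_dataset; the shape in the comment is only what we expect):

print(data['dot_graph'])  # expected shape: digraph { W0 -> v0; W0 -> y; v0 -> y; }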

[6]:
cdf_0
[6]:
           W0     v0         y  propensity_score    weight
0   -0.114414  False  0.806617          0.521814  1.916392
1   -0.437425  False -1.844901          0.527280  1.896524
2    0.908197  False  1.566832          0.504479  1.982241
3    0.180203  False -1.473251          0.516823  1.934897
4   -0.919758  False -1.201705          0.535431  1.867655
..        ...    ...       ...               ...       ...
995 -0.924271  False -0.188714          0.535507  1.867390
996  1.820202  False  1.088118          0.489009  2.044950
997  0.039937  False -1.274895          0.519200  1.926041
998  0.035028  False  0.371365          0.519283  1.925732
999 -1.157552  False  0.901272          0.539442  1.853767

1000 rows × 5 columns

[7]:
cdf_1
[7]:
           W0    v0         y  propensity_score    weight
0    2.356333  True  6.847476          0.520077  1.922793
1    1.841864  True  6.748936          0.511358  1.955578
2   -0.758494  True  4.397192          0.467292  2.139988
3   -0.741826  True  5.669855          0.467574  2.138699
4   -2.428560  True  2.266709          0.439216  2.276786
..        ...   ...       ...               ...       ...
995 -1.083720  True  5.126413          0.461803  2.165426
996 -0.411489  True  3.131912          0.473158  2.113458
997 -2.217379  True  2.438822          0.442748  2.258620
998 -1.247205  True  4.836007          0.459047  2.178427
999 -0.586639  True  2.965871          0.470196  2.126771

1000 rows × 5 columns
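
The do-sampler augments the returned frame with propensity_score and weight columns. Judging from the values above, propensity_score holds the estimated probability of the intervened treatment value given W0, and weight is its inverse (a reading of the output, not documented API). A quick sanity check of that interpretation:

assert np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
assert np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])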

Comparing the estimate to Linear Regression

First, we estimate the effect using the causal dataframe, along with its 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.08778423567129
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.102090330103608
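
As a cross-check on the weighting approach, here is a hand-rolled inverse-propensity-weighted ATE computed directly from the observational data (a sketch that fits the propensity model with scikit-learn's LogisticRegression; DoWhy's internal propensity model may differ):

from sklearn.linear_model import LogisticRegression

t = df[treatment].astype(int).to_numpy()
W = df[[common_cause]].to_numpy()
y_obs = df[outcome].to_numpy()

p1 = LogisticRegression().fit(W, t).predict_proba(W)[:, 1]               # P(v0=1 | W0)
ate_ipw = np.mean(t * y_obs / p1) - np.mean((1 - t) * y_obs / (1 - p1))  # Horvitz-Thompson ATE
ate_ipw  # should land near the true effect of 5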

Comparing to the estimate from OLS. Because the data-generating process is linear, regressing y on both the common cause and the treatment should also recover the true effect of 5 in the treatment coefficient.

[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.934
Model:                            OLS   Adj. R-squared (uncentered):              0.934
Method:                 Least Squares   F-statistic:                              7076.
Date:                Wed, 14 Sep 2022   Prob (F-statistic):                        0.00
Time:                        18:55:14   Log-Likelihood:                         -1395.1
No. Observations:                1000   AIC:                                      2794.
Df Residuals:                     998   BIC:                                      2804.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.7558      0.031     24.243      0.000       0.695       0.817
x2             5.0264      0.045    111.891      0.000       4.938       5.115
==============================================================================
Omnibus:                        0.256   Durbin-Watson:                           2.046
Prob(Omnibus):                  0.880   Jarque-Bera (JB):                        0.322
Skew:                           0.031   Prob(JB):                                0.851
Kurtosis:                       2.938   Cond. No.                                 1.48
==============================================================================


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
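
Since the design matrix was passed as [common_cause, treatment], x2 is the treatment coefficient; its estimate of 5.0264 (95% CI 4.938 to 5.115) recovers the true effect of 5 and closely matches the do-operation estimate above. To pull it out programmatically:

result.params[1], result.conf_int()[1]  # treatment coefficient and its 95% CI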