Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any dataframe.

[1]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
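
Importing dowhy.api is what adds the extension: it registers a causal accessor on pandas.DataFrame, so every dataframe gains df.causal.do(...). A quick way to confirm the accessor is present (a sketch):

import pandas as pd
import dowhy.api  # registering the accessor is a side effect of this import

print(hasattr(pd.DataFrame, 'causal'))  # True once dowhy.api has been imported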
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to y. Without it, the variance of Y|X,Z is zero and MCMC-based sampling fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
           W0     v0         y
0    0.339059  False  1.523297
1    1.009676   True  4.361143
2    0.376290   True  7.623149
3    0.832195   True  6.504141
4    0.585708   True  4.155991
..        ...    ...       ...
995 -0.560865  False -2.036794
996 -0.439966  False -1.173392
997 -1.545073  False -2.369035
998  1.565063   True  5.896693
999 -1.197883   True  2.477574

1000 rows × 3 columns
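
Since the dataset was generated with beta=5, the true effect of v0 on y is 5. For contrast, the naive difference in group means, which ignores the confounder W0, can be read straight off the dataframe (a quick sketch):

# Naive (confounded) contrast: mean y among treated minus mean y among controls.
naive = df.loc[df[treatment], outcome].mean() - df.loc[~df[treatment], outcome].mean()
print(naive)  # biased away from 5 here, since W0 drives both v0 and y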

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot:xlabel='v0'>
[Figure: bar plot of mean y for each value of v0 (../_images/example_notebooks_dowhy_causal_api_3_1.png)]
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot:xlabel='v0'>
[Figure: bar plot of mean y for each value of v0, weighting estimator (../_images/example_notebooks_dowhy_causal_api_4_1.png)]
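
With method='weighting', rows are resampled with inverse-propensity weights, so each unit counts in proportion to 1/P(v0 = x | W0). As a rough cross-check of the idea (a sketch, not DoWhy's internals; the logistic-regression propensity model here is an assumption):

from sklearn.linear_model import LogisticRegression

# Hypothetical propensity model: P(v0 = True | W0).
ps_model = LogisticRegression().fit(df[[common_cause]], df[treatment])
p_treated = ps_model.predict_proba(df[[common_cause]])[:, 1]

# Inverse-propensity weight for each row's realized treatment.
treated = df[treatment].to_numpy()
ipw = np.where(treated, 1.0 / p_treated, 1.0 / (1.0 - p_treated))

# Weighted group means approximate E[y | do(v0=1)] and E[y | do(v0=0)].
y = df[outcome].to_numpy()
print(np.average(y[treated], weights=ipw[treated])
      - np.average(y[~treated], weights=ipw[~treated]))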
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
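
Here the adjustment set is identified from the graph itself rather than from an explicit common_causes list. For this dataset, data['dot_graph'] is a DAG of the same shape as the string commented out in cell [2], modulo variable names; printing it shows something like:

print(data['dot_graph'])
# e.g. digraph { v0 -> y; W0 -> v0; W0 -> y; }  (exact formatting may differ)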

[6]:
cdf_0
[6]:
           W0     v0         y  propensity_score     weight
0   -0.370839  False -1.426270          0.668533   1.495812
1    1.956810  False  1.876996          0.017484  57.195309
2   -0.382916  False -1.248201          0.673950   1.483791
3    0.924422  False  1.606464          0.126667   7.894687
4    0.789872  False  1.731581          0.160123   6.245181
..        ...    ...       ...               ...        ...
995  1.518796  False  2.492069          0.041540  24.073284
996  1.617724  False  3.848876          0.034233  29.211278
997  1.956810  False  1.876996          0.017484  57.195309
998 -0.185325  False -0.376236          0.580432   1.722855
999  0.177932  False  0.086486          0.398028   2.512383

1000 rows × 5 columns

[7]:
cdf_1
[7]:
           W0    v0         y  propensity_score    weight
0   -0.592847  True  3.291838          0.239989  4.166851
1    1.404890  True  4.027509          0.948201  1.054629
2    0.727954  True  5.849403          0.822222  1.216216
3   -0.582116  True  4.759021          0.243990  4.098536
4   -0.085492  True  4.919241          0.469623  2.129369
..        ...   ...       ...               ...       ...
995  0.064974  True  4.467355          0.545902  1.831831
996  0.821375  True  3.852592          0.848300  1.178828
997  1.791186  True  6.668863          0.975690  1.024916
998 -0.507771  True  3.795654          0.272923  3.664033
999  0.393533  True  5.167014          0.700954  1.426627

1000 rows × 5 columns
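
In both interventional frames, weight appears to be the reciprocal of propensity_score for the treatment value that was set (e.g. 1/0.668533 ≈ 1.495812 in the first row of cdf_0). A quick check of that relationship (a sketch, using the columns shown above):

assert np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
assert np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])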

Comparing the estimate to Linear Regression

First, we estimate the effect using the causal data frame, along with its 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
$\displaystyle 5.39681422096609$
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
$\displaystyle 0.134680784526509$
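
The half-width above is the usual normal-approximation interval for the mean of n = 1000 per-row differences:

$\displaystyle 1.96 \cdot \frac{\hat{\sigma}_{y_1 - y_0}}{\sqrt{n}}$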

Comparing to the estimate from OLS.

[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                            OLS Regression Results
=======================================================================
Dep. Variable:                    y   R-squared (uncentered):      0.966
Model:                          OLS   Adj. R-squared (uncentered): 0.966
Method:               Least Squares   F-statistic:             1.407e+04
Date:              Wed, 14 Sep 2022   Prob (F-statistic):           0.00
Time:                      19:25:36   Log-Likelihood:            -1439.7
No. Observations:              1000   AIC:                         2883.
Df Residuals:                   998   BIC:                         2893.
Df Model:                         2
Covariance Type:          nonrobust
=======================================================================
          coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------
x1      1.2791      0.039     32.390      0.000       1.202       1.357
x2      5.0288      0.056     89.202      0.000       4.918       5.139
=======================================================================
Omnibus:                      3.317   Durbin-Watson:              1.975
Prob(Omnibus):                0.190   Jarque-Bera (JB):           3.386
Skew:                         0.130   Prob(JB):                   0.184
Kurtosis:                     2.884   Cond. No.                    2.74
=======================================================================


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
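
For a side-by-side comparison, the treatment coefficient (x2) and its 95% interval can be pulled out of the fitted result programmatically; a small sketch using the result object from cell [10]:

coef = result.params[1]                 # x2: the treatment coefficient
ci_low, ci_high = result.conf_int()[1]  # 95% interval for x2
print(f"OLS treatment effect: {coef:.4f} [{ci_low:.4f}, {ci_high:.4f}]")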