Demo for the DoWhy causal API

We show a simple example of using the causal extension on a pandas dataframe. Importing dowhy.api registers the causal accessor on every DataFrame.

[1]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments=0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise: without it the variance of Y|X,Z is zero and MCMC fails.
# data['dot_graph'] = 'digraph { v -> y; X0 -> v; X0 -> y;}'

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
           W0     v0         y
0    0.615341  False -0.216582
1    1.547448   True  3.791882
2   -1.220144   True  4.319986
3   -0.167506   True  6.049082
4    2.950866   True  4.455041
..        ...    ...       ...
995  0.978907   True  6.964400
996  2.768105   True  4.851054
997  0.657292   True  7.053682
998  0.462598  False  0.075282
999  0.544589   True  3.206106

1000 rows × 3 columns
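
Before adjusting for anything, it is instructive to look at the naive difference in group means, which is confounded by the common cause W0. A minimal sketch (not part of the original demo), reusing df, treatment, and outcome from above:

[ ]:
# Naive (unadjusted) contrast of mean outcomes across treatment groups.
# Because W0 drives both v0 and y, this comparison is confounded; the
# do-sampler below adjusts for W0.
naive_means = df.groupby(treatment)[outcome].mean()
print(naive_means[True] - naive_means[False])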

[3]:
# data['df'] is just a regular pandas.DataFrame
# variable_types codes: 'b' = binary, 'c' = continuous
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot: xlabel='v0'>
[figure: bar plot of the mean of y by treatment group v0]
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot: xlabel='v0'>
[figure: bar plot of the mean of y by treatment group v0]
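
Cell [4] selects the 'weighting' do-sampler, which resamples units using inverse propensity weights. For intuition only, here is a hedged by-hand inverse-propensity-weighting estimate of the average treatment effect, assuming a logistic propensity model on the common cause (an illustrative choice, not necessarily what DoWhy uses internally):

[ ]:
# By-hand Horvitz-Thompson IPW estimate of the ATE, assuming a logistic
# model for the propensity P(v0 = 1 | W0). Illustrative sketch only.
from sklearn.linear_model import LogisticRegression

ps = LogisticRegression().fit(df[[common_cause]], df[treatment]).predict_proba(df[[common_cause]])[:, 1]
t = df[treatment].astype(float).to_numpy()
y = df[outcome].to_numpy()
ipw_ate = (t * y / ps).mean() - ((1 - t) * y / (1 - ps)).mean()
print(ipw_ate)  # should land near the true effect, beta=5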
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

[6]:
cdf_0
[6]:
           W0     v0         y  propensity_score     weight
0    0.067560  False  1.911379          0.466482   2.143707
1    0.823216  False -1.139482          0.298868   3.345962
2    1.662468  False  0.487500          0.161029   6.210075
3    1.219407  False -0.734138          0.226293   4.419058
4    3.170608  False  0.352740          0.043754  22.854844
..        ...    ...       ...               ...        ...
995  2.290390  False  0.137876          0.095559  10.464770
996  1.037646  False  0.709049          0.257967   3.876460
997 -0.134015  False  1.177118          0.514338   1.944246
998  0.438583  False -2.028024          0.380597   2.627451
999 -0.018765  False -1.847223          0.486952   2.053589

1000 rows × 5 columns

[7]:
cdf_1
[7]:
           W0    v0         y  propensity_score    weight
0    0.151752  True  5.144520          0.553375  1.807092
1   -0.361275  True  5.457729          0.432065  2.314466
2   -0.516995  True  4.685548          0.396163  2.524211
3   -0.151791  True  7.376667          0.481441  2.077096
4    0.533555  True  4.762663          0.640448  1.561408
..        ...   ...       ...               ...       ...
995  2.328699  True  5.015506          0.907543  1.101876
996  0.922553  True  6.789926          0.720539  1.387849
997  0.890339  True  5.813846          0.714331  1.399912
998 -0.380071  True  4.574668          0.427686  2.338166
999  1.219794  True  3.804645          0.773772  1.292371

1000 rows × 5 columns
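
As the two frames suggest, the weight column is just the reciprocal of the propensity score for the intervened treatment value (e.g. row 0 of cdf_0: 1 / 0.466482 ≈ 2.143707). A quick sanity check, assuming cdf_0 and cdf_1 from the cells above:

[ ]:
# Sanity check: inverse probability weights equal 1 / propensity_score.
assert np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
assert np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])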

Comparing the estimate to Linear Regression

First, we estimate the effect using the causal data frame, together with its 95% confidence interval.
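
Concretely, cells [8] and [9] compute

$\displaystyle \hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{(1)} - y_i^{(0)}\right), \qquad \hat{\tau} \pm 1.96\,\frac{s}{\sqrt{n}},$

where $y_i^{(1)}$ and $y_i^{(0)}$ are the outcomes sampled under do(v0 = 1) and do(v0 = 0), $s$ is the sample standard deviation of the differences, and $n = 1000$.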

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.18094679145088
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.0932881847772126

Comparing to the estimate from OLS.

[10]:
# Regress y on [W0, v0]; no intercept is included, so statsmodels reports uncentered R².
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.945
Model:                            OLS   Adj. R-squared (uncentered):              0.945
Method:                 Least Squares   F-statistic:                              8643.
Date:                Tue, 06 Dec 2022   Prob (F-statistic):                        0.00
Time:                        09:38:55   Log-Likelihood:                         -1450.1
No. Observations:                1000   AIC:                                      2904.
Df Residuals:                     998   BIC:                                      2914.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.2364      0.035      6.773      0.000       0.168       0.305
x2             5.0849      0.053     95.597      0.000       4.981       5.189
==============================================================================
Omnibus:                        0.226   Durbin-Watson:                   1.922
Prob(Omnibus):                  0.893   Jarque-Bera (JB):                0.264
Skew:                           0.035   Prob(JB):                        0.876
Kurtosis:                       2.962   Cond. No.                        2.46
==============================================================================


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
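
As a final cross-check (a minimal sketch, not part of the original notebook), the x2 coefficient from the OLS fit, the do-sampler estimate, and the true beta=5 used to generate the data can be compared side by side:

[ ]:
# result.params[1] is the coefficient on the treatment column (x2 above).
# All three numbers should agree up to sampling noise.
ols_effect = result.params[1]
do_effect = float((cdf_1['y'] - cdf_0['y']).mean())
print(f"OLS: {ols_effect:.3f}  do-sampler: {do_effect:.3f}  true beta: 5")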