Demo for the DoWhy causal API
We show a simple example of adding a causal extension to any pandas DataFrame.
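Importing `dowhy.api` is what makes the `causal` accessor used below available on every pandas DataFrame. As a minimal sketch of the pandas extension point involved (the `demo` namespace and `DemoAccessor` class are invented here for illustration and are not part of DoWhy):

import pandas as pd

# Register a custom namespace on all DataFrames; this is the kind of pandas
# extension point that dowhy.api relies on to expose df.causal.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj  # the DataFrame this accessor is attached to

    def n_rows(self):
        return len(self._obj)

pd.DataFrame({"a": [1, 2]}).demo.n_rows()  # -> 2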
[1]:
import dowhy.datasets
import dowhy.api
from dowhy.graph import build_graph_from_str
import numpy as np
import pandas as pd
from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to the data. Without noise, the variance of Y given X and Z is zero, and MCMC fails.
nx_graph = build_graph_from_str(data["dot_graph"])
treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
|     | W0        | v0    | y         |
|-----|-----------|-------|-----------|
| 0   | -0.781891 | False | 0.387390  |
| 1   | 0.870782  | False | -1.129183 |
| 2   | -0.093696 | True  | 7.159128  |
| 3   | 0.351093  | True  | 5.115276  |
| 4   | 1.265067  | False | 0.173113  |
| ... | ...       | ...   | ...       |
| 995 | -0.154945 | False | -1.722020 |
| 996 | 1.345329  | False | 0.914354  |
| 997 | 0.580256  | True  | 4.642386  |
| 998 | 0.952156  | False | 1.785742  |
| 999 | -0.004864 | False | 0.524789  |

1000 rows × 3 columns
[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
[Figure: bar plot of the mean of y for v0=False and v0=True]
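The plotted means can also be read off numerically; a small variation on the cell above (same names and arguments) that prints the interventional means instead of plotting them:

df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause]
             ).groupby(treatment)[outcome].mean()  # mean of y for v0=False and v0=True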
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause]
             ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
[Figure: bar plot of the mean of y under do(v0=1), weighting method]
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     graph=nx_graph
                     )
cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     graph=nx_graph
                     )
[6]:
cdf_0
[6]:
|     | W0        | v0    | y         | propensity_score | weight   |
|-----|-----------|-------|-----------|------------------|----------|
| 0   | -0.188278 | False | 1.290284  | 0.483839         | 2.066804 |
| 1   | 0.028908  | False | 1.607867  | 0.472148         | 2.117982 |
| 2   | 0.556427  | False | -0.796434 | 0.443903         | 2.252744 |
| 3   | 2.459039  | False | 0.225063  | 0.346192         | 2.888574 |
| 4   | -0.100968 | False | -0.653425 | 0.479136         | 2.087090 |
| ... | ...       | ...   | ...       | ...              | ...      |
| 995 | 0.910922  | False | 0.123965  | 0.425113         | 2.352315 |
| 996 | 1.041682  | False | -0.095624 | 0.418233         | 2.391009 |
| 997 | -0.322524 | False | -0.690124 | 0.491075         | 2.036349 |
| 998 | 0.303651  | False | 0.244013  | 0.457403         | 2.186254 |
| 999 | -2.134144 | False | -0.111381 | 0.587866         | 1.701069 |

1000 rows × 5 columns
[7]:
cdf_1
[7]:
|     | W0        | v0   | y        | propensity_score | weight   |
|-----|-----------|------|----------|------------------|----------|
| 0   | -0.218669 | True | 4.191087 | 0.514524         | 1.943545 |
| 1   | 2.934446  | True | 6.346822 | 0.676644         | 1.477883 |
| 2   | -0.634608 | True | 5.340087 | 0.492094         | 2.032133 |
| 3   | 0.352200  | True | 5.885853 | 0.545195         | 1.834206 |
| 4   | -0.523283 | True | 3.163376 | 0.498098         | 2.007638 |
| ... | ...       | ...  | ...      | ...              | ...      |
| 995 | 1.667235  | True | 5.699182 | 0.614198         | 1.628140 |
| 996 | 0.581155  | True | 3.079788 | 0.557414         | 1.794000 |
| 997 | -0.388253 | True | 4.392206 | 0.505381         | 1.978707 |
| 998 | 0.984084  | True | 6.050734 | 0.578740         | 1.727892 |
| 999 | -1.159907 | True | 6.541946 | 0.463823         | 2.155993 |

1000 rows × 5 columns
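In both samples the `weight` column is just the inverse of the propensity score of the observed treatment (e.g. 1/0.483839 ≈ 2.066804 in the first row of `cdf_0`). A quick sanity check along those lines (an added illustration, not part of the original notebook):

np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])  # expected: True
np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])  # expected: True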
Comparing the estimate to Linear Regression
First, we estimate the effect using the causal data frames, along with a 95% confidence interval.
[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.06624641268751
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.0927052409804712
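The interval above uses a normal approximation with the paired-row standard deviation. As a cross-check, one could bootstrap the difference in means directly; a sketch (the seed and the 2000 resamples are arbitrary choices for illustration):

diffs = cdf_1['y'].to_numpy() - cdf_0['y'].to_numpy()
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean()
              for _ in range(2000)]
np.percentile(boot_means, [2.5, 97.5])  # bootstrap 95% interval for the mean difference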
Next, we compare this to the estimate from OLS.
[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
| Dep. Variable:    | y                | R-squared (uncentered):      | 0.936   |
|-------------------|------------------|------------------------------|---------|
| Model:            | OLS              | Adj. R-squared (uncentered): | 0.936   |
| Method:           | Least Squares    | F-statistic:                 | 7359.   |
| Date:             | Tue, 10 Jun 2025 | Prob (F-statistic):          | 0.00    |
| Time:             | 16:55:58         | Log-Likelihood:              | -1415.9 |
| No. Observations: | 1000             | AIC:                         | 2836.   |
| Df Residuals:     | 998              | BIC:                         | 2846.   |
| Df Model:         | 2                |                              |         |
| Covariance Type:  | nonrobust        |                              |         |

|    | coef   | std err | t       | P>\|t\| | [0.025 | 0.975] |
|----|--------|---------|---------|---------|--------|--------|
| x1 | 0.3877 | 0.031   | 12.569  | 0.000   | 0.327  | 0.448  |
| x2 | 5.0039 | 0.045   | 111.778 | 0.000   | 4.916  | 5.092  |

| Omnibus:       | 2.260  | Durbin-Watson:    | 1.893 |
|----------------|--------|-------------------|-------|
| Prob(Omnibus): | 0.323  | Jarque-Bera (JB): | 2.211 |
| Skew:          | -0.070 | Prob(JB):         | 0.331 |
| Kurtosis:      | 2.818  | Cond. No.         | 1.62  |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
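For a programmatic comparison, the treatment coefficient and its confidence interval can be pulled out of the fitted model. Here `x2` is the treatment column, since the design matrix above was built as `[common_cause, treatment]`:

result.params[1]      # OLS point estimate of the treatment effect (≈ 5.0039)
result.conf_int()[1]  # its 95% confidence interval (≈ [4.916, 5.092])

Both are close to the true effect of 5 used to generate the data and to the do-sampling estimate of roughly 5.07 ± 0.09 computed above.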