Lalonde Pandas API Example

by Adam Kelleher

We’ll run through a quick example using the high-level Python API for the DoSampler. The DoSampler is different from most classic causal effect estimators. Instead of estimating statistics under interventions, it aims to provide the generality of Pearlian causal inference. In that context, the joint distribution of the variables under an intervention is the quantity of interest. It’s hard to represent a joint distribution nonparametrically, so instead we provide a sample from that distribution, which we call a “do” sample.
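To make the idea of a “do” sample concrete, here is a minimal sketch on simulated data. It assumes a single binary confounder z and draws the sample by inverse propensity weighting. The variable names, the data, and the choice of weighting are illustrative assumptions, not a description of how the DoSampler is implemented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Toy confounded data: z raises both the chance of treatment and the outcome.
z = rng.integers(0, 2, size=n)
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = 2.0 * t + 3.0 * z + rng.normal(size=n)  # simulated effect of t on y is 2.0
df = pd.DataFrame({"z": z, "t": t, "y": y})

# Propensity P(t | z), estimated by the treatment frequency within each z stratum.
p_treated = df.groupby("z")["t"].transform("mean")
propensity = np.where(df["t"] == 1, p_treated, 1.0 - p_treated)

# Resample rows with probability proportional to 1 / P(t | z).
# In the resampled data, t is (approximately) independent of z.
weights = 1.0 / propensity
do_sample = df.sample(n=n, replace=True, weights=weights, random_state=0)


def diff_in_means(frame):
    """Mean of y among t == 1 minus mean of y among t == 0."""
    means = frame.groupby("t")["y"].mean()
    return means[1] - means[0]


naive_effect = diff_in_means(df)        # confounded comparison
do_effect = diff_in_means(do_sample)    # comparison within the weighted sample
```

In this simulation naive_effect is pushed upward by the confounder (about 3.8 in expectation), while do_effect should land near the simulated effect of 2.0.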

Here, when you specify an outcome, that is the variable you’re sampling under an intervention. We still have to do the usual process of making sure the quantity (the conditional interventional distribution of the outcome) is identifiable. We leverage the familiar components of the rest of the package to do that “under the hood”. You’ll notice some similarity in the kwargs for the DoSampler.

Getting the Data

First, download the data from the LaLonde example.

from rpy2.robjects import r as R

%load_ext rpy2.ipython
# install.packages needs an explicit CRAN mirror when none is configured
%R install.packages("Matching", repos="https://cloud.r-project.org")
%R library(Matching)
%R data(lalonde)
%R -o lalonde
# The data was already loaded in the previous cell. We include the import
# here so you don't have to keep re-downloading it.

import pandas as pd


The causal Namespace

We’ve created a “namespace” for pandas.DataFrames containing causal inference methods. You can access it here with lalonde.causal, where lalonde is our pandas.DataFrame, and causal contains all our new methods! These methods are magically loaded into your existing (and future) dataframes when you import dowhy.api.

import dowhy.api
AttributeError                            Traceback (most recent call last)
<ipython-input-3-641fb1855e44> in <module>()
----> 1 import dowhy.api

/mnt/c/Users/amshar/code/dowhy/dowhy/api/ in <module>()
----> 1 import dowhy.api.causal_data_frame

/mnt/c/Users/amshar/code/dowhy/dowhy/api/ in <module>()
----> 7 @pd.api.extensions.register_dataframe_accessor("causal")
      8 class CausalAccessor(object):
      9     def __init__(self, pandas_obj):

AttributeError: module 'pandas.api' has no attribute 'extensions'
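The traceback above is raised at the line that registers the accessor, which suggests the installed pandas does not provide pandas.api.extensions; a more recent pandas should. As a rough sketch of the mechanism, the snippet below registers a hypothetical toy namespace the same way. The ToyAccessor class and its naive_effect method are made up for illustration and are not part of dowhy.

```python
import pandas as pd


# Hypothetical accessor name "toy"; the traceback shows dowhy using "causal".
@pd.api.extensions.register_dataframe_accessor("toy")
class ToyAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the dataframe the accessor is attached to.
        self._obj = pandas_obj

    def naive_effect(self, treatment, outcome):
        """Difference in mean outcome between rows with treatment 1 and 0."""
        means = self._obj.groupby(treatment)[outcome].mean()
        return means[1] - means[0]


# Illustrative values only, not LaLonde data.
df = pd.DataFrame({"treat": [0, 0, 1, 1], "re78": [1.0, 3.0, 4.0, 6.0]})
effect = df.toy.naive_effect("treat", "re78")
```

Once the decorator has run, every dataframe exposes the new attribute, which is why a single import is enough to make the methods available.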

Now that we have the causal namespace, let’s give it a try!

The do Operation

The key feature here is the do method, which produces a new dataframe replacing the treatment variable with values specified, and the outcome with a sample from the interventional distribution of the outcome. If you don’t specify a value for the treatment, it leaves the treatment untouched:

do_df = lalonde.causal.do(x='treat',
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'})

Notice you get the usual output and prompts about identifiability. This is all dowhy under the hood!

We now have an interventional sample in do_df. It looks very similar to the original dataframe. Compare them:


Treatment Effect Estimation

From the original data, we could get a naive estimate of the treatment effect by doing

(lalonde[lalonde['treat'] == 1].mean() - lalonde[lalonde['treat'] == 0].mean())['re78']

We can do the same with our new sample from the interventional distribution to get a causal effect estimate

(do_df[do_df['treat'] == 1].mean() - do_df[do_df['treat'] == 0].mean())['re78']

We could get some rough error bars on this estimate using the normal approximation for a 95% confidence interval, like

import numpy as np

treated = do_df[do_df['treat'] == 1]['re78']
control = do_df[do_df['treat'] == 0]['re78']
# 95% half-width for the difference in means: 1.96 * sqrt(var1/n1 + var0/n0)
1.96 * np.sqrt(treated.var() / len(treated) + control.var() / len(control))

but note that these DO NOT account for propensity score estimation error. For that, a bootstrapping procedure might be more appropriate.
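A minimal sketch of such a bootstrap follows, on simulated data rather than the LaLonde sample. The key assumption is that the statistic re-estimates the propensity on every resample, so its estimation error shows up in the interval. The functions ipw_effect and bootstrap_ci, the column names, and the weighting estimator are hypothetical choices for illustration, not part of dowhy.

```python
import numpy as np
import pandas as pd


def ipw_effect(frame):
    """Inverse-propensity-weighted difference in mean y between t=1 and t=0.

    P(t=1 | z) is re-estimated from `frame` on every call, so resampling
    `frame` also resamples the propensity estimate.
    """
    t = frame["t"].to_numpy()
    y = frame["y"].to_numpy()
    p = frame.groupby("z")["t"].transform("mean").to_numpy()
    w = np.where(t == 1, 1.0 / p, 1.0 / (1.0 - p))
    treated = t == 1
    return (np.average(y[treated], weights=w[treated])
            - np.average(y[~treated], weights=w[~treated]))


def bootstrap_ci(frame, statistic, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(frame)."""
    rng = np.random.default_rng(seed)
    n = len(frame)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)  # row positions drawn with replacement
        estimates[b] = statistic(frame.iloc[rows].reset_index(drop=True))
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])


# Simulated data with one binary confounder z; the simulated effect of t is 2.0.
rng = np.random.default_rng(1)
n = 2000
z = rng.integers(0, 2, size=n)
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = 2.0 * t + 3.0 * z + rng.normal(size=n)
toy = pd.DataFrame({"z": z, "t": t, "y": y})

lower, upper = bootstrap_ci(toy, ipw_effect)
```

The same pattern would apply to the real data if the statistic repeated the whole interventional sampling step on each resample, at a correspondingly higher computational cost.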

This is just one statistic we can compute from the interventional distribution of 're78'. We can get all of the interventional moments as well, including functions of 're78'. We can leverage the full power of pandas, like
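As a small illustration, the sketch below uses a hand-made stand-in for do_df (the values are invented, not LaLonde results) and computes per-arm summaries, a variance, and the mean of two functions of 're78'. The names toy_do_df, log_re78, and share_positive are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an interventional sample; values are illustrative only.
toy_do_df = pd.DataFrame({
    "treat": [0, 0, 0, 1, 1, 1],
    "re78": [0.0, 2000.0, 4000.0, 1000.0, 5000.0, 9000.0],
})

# Count, mean, std, and quantiles of the outcome within each treatment arm.
summary = toy_do_df.groupby("treat")["re78"].describe()

# Second moment: sample variance of the outcome per arm.
variance = toy_do_df.groupby("treat")["re78"].var()

# Functions of the outcome: mean log earnings and share with positive earnings.
mean_log = (toy_do_df.assign(log_re78=np.log1p(toy_do_df["re78"]))
            .groupby("treat")["log_re78"].mean())
share_positive = (toy_do_df["re78"] > 0).groupby(toy_do_df["treat"]).mean()
```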


and even plot aggregations, like

%matplotlib inline

import seaborn as sns

sns.barplot(data=lalonde, x='treat', y='re78')

sns.barplot(data=do_df, x='treat', y='re78')

Specifying Interventions

You can also find the distribution of the outcome under an intervention that sets the treatment to a specific value.

do_df = lalonde.causal.do(x={'treat': 1},
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'})

This new dataframe gives the distribution of 're78' when 'treat' is set to 1.

For much more detail on how the do method works, check the docstring:
