{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lalonde Pandas API Example\n",
    "by Adam Kelleher"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll run through a quick example using the high-level Python API for the DoSampler. The DoSampler is different from most classic causal effect estimators. Instead of estimating statistics under interventions, it aims to provide the generality of Pearlian causal inference. In that context, the joint distribution of the variables under an intervention is the quantity of interest. It's hard to represent a joint distribution nonparametrically, so instead we provide a sample from that distribution, which we call a \"do\" sample.\n",
    "\n",
    "Here, when you specify an outcome, that is the variable you're sampling under an intervention. We still have to do the usual process of making sure the quantity (the conditional interventional distribution of the outcome) is identifiable. We leverage the familiar components of the rest of the package to do that \"under the hood\". You'll notice some similarity in the kwargs for the DoSampler.\n",
    "\n",
    "## Getting the Data\n",
    "\n",
    "First, download the data from the LaLonde example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, sys\n",
    "sys.path.append(os.path.abspath(\"../../../\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rpy2.robjects import r as R\n",
    "\n",
    "%load_ext rpy2.ipython\n",
    "#%R install.packages(\"Matching\")\n",
    "%R library(Matching)\n",
    "%R data(lalonde)\n",
    "%R -o lalonde\n",
    "lalonde.to_csv(\"lalonde.csv\",index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the data already loaded in the previous cell. we include the import\n",
    "# here you so you don't have to keep re-downloading it.\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "lalonde=pd.read_csv(\"lalonde.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The `causal` Namespace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've created a \"namespace\" for `pandas.DataFrame`s containing causal inference methods. You can access it here with `lalonde.causal`, where `lalonde` is our `pandas.DataFrame`, and `causal` contains all our new methods! These methods are magically loaded into your existing (and future) dataframes when you `import dowhy.api`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dowhy.api"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the `causal` namespace, lets give it a try! \n",
    "\n",
    "## The `do` Operation\n",
    "\n",
    "The key feature here is the `do` method, which produces a new dataframe replacing the treatment variable with values specified, and the outcome with a sample from the interventional distribution of the outcome. If you don't specify a value for the treatment, it leaves the treatment untouched:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "do_df = lalonde.causal.do(x='treat',\n",
    "                          outcome='re78',\n",
    "                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],\n",
    "                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd', \n",
    "                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'},\n",
    "                         proceed_when_unidentifiable=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice you get the usual output and prompts about identifiability. This is all `dowhy` under the hood!\n",
    "\n",
    "We now have an interventional sample in `do_df`. It looks very similar to the original dataframe. Compare them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lalonde.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "do_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Treatment Effect Estimation\n",
    "\n",
    "We could get a naive estimate before for a treatment effect by doing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(lalonde[lalonde['treat'] == 1].mean() - lalonde[lalonde['treat'] == 0].mean())['re78']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do the same with our new sample from the interventional distribution to get a causal effect estimate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(do_df[do_df['treat'] == 1].mean() - do_df[do_df['treat'] == 0].mean())['re78']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could get some rough error bars on the outcome using the normal approximation for a 95% confidence interval, like\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "1.96*np.sqrt((do_df[do_df['treat'] == 1].var()/len(do_df[do_df['treat'] == 1])) + \n",
    "             (do_df[do_df['treat'] == 0].var()/len(do_df[do_df['treat'] == 0])))['re78']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "but note that these DO NOT contain propensity score estimation error. For that, a bootstrapping procedure might be more appropriate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is just one statistic we can compute from the interventional distribution of `'re78'`. We can get all of the interventional moments as well, including functions of `'re78'`. We can leverage the full power of pandas, like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "do_df['re78'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lalonde['re78'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and even plot aggregations, like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.barplot(data=lalonde, x='treat', y='re78')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(data=do_df, x='treat', y='re78')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Specifying Interventions\n",
    "\n",
    "You can find the distribution of the outcome under an intervention to set the value of the treatment. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "do_df = lalonde.causal.do(x={'treat': 1},\n",
    "                          outcome='re78',\n",
    "                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],\n",
    "                          variable_types={'age': 'c', 'educ':'c', 'black': 'd', 'hisp': 'd', \n",
    "                                          'married': 'd', 'nodegr': 'd','re78': 'c', 'treat': 'b'},\n",
    "                         proceed_when_unidentifiable=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "do_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This new dataframe gives the distribution of `'re78'` when `'treat'` is set to `1`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For much more detail on how the `do` method works, check the docstring:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(lalonde.causal.do)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}