{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DoWhy: Different estimation methods for causal inference\n", "This is a quick introduction to the DoWhy causal inference library.\n", "We will load a sample dataset and use different methods to estimate the causal effect of a (pre-specified) treatment variable on a (pre-specified) outcome variable.\n", "\n", "We will see that not all estimators return the correct effect for this dataset.\n", "\n", "First, let us load all required packages." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import logging\n", "\n", "import dowhy\n", "from dowhy import CausalModel\n", "import dowhy.datasets " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us load a dataset. For simplicity, we simulate a dataset with linear relationships between common causes and treatment, and between common causes and outcome.\n", "\n", "`beta` is the true causal effect, set to 10 in the next cell." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
" <thead>\n",
" <tr><th></th><th>Z0</th><th>Z1</th><th>W0</th><th>W1</th><th>W2</th><th>W3</th><th>W4</th><th>v0</th><th>y</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0.0</td><td>0.521013</td><td>-1.325391</td><td>2.116213</td><td>-0.557359</td><td>-1.865444</td><td>1.605934</td><td>False</td><td>10.352212</td></tr>\n",
" <tr><th>1</th><td>0.0</td><td>0.463769</td><td>1.069812</td><td>1.726722</td><td>-0.610115</td><td>0.350867</td><td>0.882315</td><td>True</td><td>19.499769</td></tr>\n",
" <tr><th>2</th><td>0.0</td><td>0.125765</td><td>-1.945924</td><td>1.484627</td><td>-1.575032</td><td>0.520069</td><td>0.559467</td><td>False</td><td>2.205429</td></tr>\n",
" <tr><th>3</th><td>0.0</td><td>0.861999</td><td>-0.221328</td><td>0.939332</td><td>0.065349</td><td>1.291141</td><td>0.056167</td><td>True</td><td>12.968877</td></tr>\n",
" <tr><th>4</th><td>0.0</td><td>0.893044</td><td>-0.533583</td><td>2.024348</td><td>-0.227786</td><td>-2.962155</td><td>0.057679</td><td>True</td><td>14.847690</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>9995</th><td>0.0</td><td>0.019696</td><td>-0.518280</td><td>2.516658</td><td>-3.063094</td><td>-1.035060</td><td>1.231945</td><td>False</td><td>7.816809</td></tr>\n",
" <tr><th>9996</th><td>0.0</td><td>0.692714</td><td>-0.510823</td><td>0.612004</td><td>-0.758462</td><td>-1.786669</td><td>1.217035</td><td>True</td><td>15.047879</td></tr>\n",
" <tr><th>9997</th><td>0.0</td><td>0.164993</td><td>0.190624</td><td>0.365532</td><td>-0.312707</td><td>0.794803</td><td>2.458445</td><td>True</td><td>21.472658</td></tr>\n",
" <tr><th>9998</th><td>0.0</td><td>0.382386</td><td>-2.654195</td><td>1.469548</td><td>-2.191149</td><td>-0.681429</td><td>1.197488</td><td>False</td><td>2.936543</td></tr>\n",
" <tr><th>9999</th><td>0.0</td><td>0.713687</td><td>-0.799016</td><td>0.810676</td><td>-0.969194</td><td>-0.467067</td><td>1.327113</td><td>True</td><td>15.621830</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10000 rows × 9 columns</p>\n",
"</div>" ], "text/plain": [ " Z0 Z1 W0 W1 W2 W3 W4 v0 \\\n", "0 0.0 0.521013 -1.325391 2.116213 -0.557359 -1.865444 1.605934 False \n", "1 0.0 0.463769 1.069812 1.726722 -0.610115 0.350867 0.882315 True \n", "2 0.0 0.125765 -1.945924 1.484627 -1.575032 0.520069 0.559467 False \n", "3 0.0 0.861999 -0.221328 0.939332 0.065349 1.291141 0.056167 True \n", "4 0.0 0.893044 -0.533583 2.024348 -0.227786 -2.962155 0.057679 True \n", "... ... ... ... ... ... ... ... ... \n", "9995 0.0 0.019696 -0.518280 2.516658 -3.063094 -1.035060 1.231945 False \n", "9996 0.0 0.692714 -0.510823 0.612004 -0.758462 -1.786669 1.217035 True \n", "9997 0.0 0.164993 0.190624 0.365532 -0.312707 0.794803 2.458445 True \n", "9998 0.0 0.382386 -2.654195 1.469548 -2.191149 -0.681429 1.197488 False \n", "9999 0.0 0.713687 -0.799016 0.810676 -0.969194 -0.467067 1.327113 True \n", "\n", " y \n", "0 10.352212 \n", "1 19.499769 \n", "2 2.205429 \n", "3 12.968877 \n", "4 14.847690 \n", "... ... \n", "9995 7.816809 \n", "9996 15.047879 \n", "9997 21.472658 \n", "9998 2.936543 \n", "9999 15.621830 \n", "\n", "[10000 rows x 9 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = dowhy.datasets.linear_dataset(beta=10,\n", " num_common_causes=5, \n", " num_instruments = 2,\n", " num_treatments=1,\n", " num_samples=10000,\n", " treatment_is_binary=True,\n", " outcome_is_binary=False)\n", "df = data[\"df\"]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the data is held in a pandas DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identifying the causal estimand" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now input a causal graph in the GML graph format."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# With graph\n", "model=CausalModel(\n", " data = df,\n", " treatment=data[\"treatment_name\"],\n", " outcome=data[\"outcome_name\"],\n", " graph=data[\"gml_graph\"],\n", " instruments=data[\"instrument_names\"]\n", " )" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "model.view_model()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import Image, display\n", "display(Image(filename=\"causal_model.png\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get a causal graph. Next, we identify the causal estimand; estimation follows in the sections below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor\n", "Estimand expression:\n", " d \n", "─────(Expectation(y|W0,W1,W3,W4,W2))\n", "d[v₀] \n", "Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W0,W1,W3,W4,W2,U) = P(y|v0,W0,W1,W3,W4,W2)\n", "\n", "### Estimand : 2\n", "Estimand name: iv\n", "Estimand expression:\n", "Expectation(Derivative(y, [Z0, Z1])*Derivative([v0], [Z0, Z1])**(-1))\n", "Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})\n", "Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)\n", "\n", "### Estimand : 3\n", "Estimand name: frontdoor\n", "No such variable found!\n", "\n" ] } ], "source": [ "identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)\n", "print(identified_estimand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 1: Regression\n", "\n", "Use linear regression."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor\n", "Estimand expression:\n", " d \n", "─────(Expectation(y|W0,W1,W3,W4,W2))\n", "d[v₀] \n", "Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W0,W1,W3,W4,W2,U) = P(y|v0,W0,W1,W3,W4,W2)\n", "\n", "## Realized estimand\n", "b: y~v0+W0+W1+W3+W4+W2\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 10.00020571565491\n", "p-value: [0.]\n", "\n", "Causal Estimate is 10.00020571565491\n" ] } ], "source": [ "causal_estimate_reg = model.estimate_effect(identified_estimand,\n", " method_name=\"backdoor.linear_regression\",\n", " test_significance=True)\n", "print(causal_estimate_reg)\n", "print(\"Causal Estimate is \" + str(causal_estimate_reg.value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 2: Stratification\n", "\n", "We will be using propensity scores to stratify units in the data." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor\n", "Estimand expression:\n", " d \n", "─────(Expectation(y|W0,W1,W3,W4,W2))\n", "d[v₀] \n", "Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W0,W1,W3,W4,W2,U) = P(y|v0,W0,W1,W3,W4,W2)\n", "\n", "## Realized estimand\n", "b: y~v0+W0+W1+W3+W4+W2\n", "Target units: att\n", "\n", "## Estimate\n", "Mean value: 10.019860140019505\n", "\n", "Causal Estimate is 10.019860140019505\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/amit/py-envs/env3.8/lib/python3.8/site-packages/sklearn/utils/validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " return f(**kwargs)\n" ] } ], "source": [ "causal_estimate_strat = model.estimate_effect(identified_estimand,\n", " method_name=\"backdoor.propensity_score_stratification\",\n", " target_units=\"att\")\n", "print(causal_estimate_strat)\n", "print(\"Causal Estimate is \" + str(causal_estimate_strat.value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 3: Matching\n", "\n", "We will be using propensity scores to match units in the data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/amit/py-envs/env3.8/lib/python3.8/site-packages/sklearn/utils/validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", " return f(**kwargs)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor\n", "Estimand expression:\n", " d \n", "─────(Expectation(y|W0,W1,W3,W4,W2))\n", "d[v₀] \n", "Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W0,W1,W3,W4,W2,U) = P(y|v0,W0,W1,W3,W4,W2)\n", "\n", "## Realized estimand\n", "b: y~v0+W0+W1+W3+W4+W2\n", "Target units: atc\n", "\n", "## Estimate\n", "Mean value: 9.752837406069007\n", "\n", "Causal Estimate is 9.752837406069007\n" ] } ], "source": [ "causal_estimate_match = model.estimate_effect(identified_estimand,\n", " method_name=\"backdoor.propensity_score_matching\",\n", " target_units=\"atc\")\n", "print(causal_estimate_match)\n", "print(\"Causal Estimate is \" + str(causal_estimate_match.value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 4: Weighting\n", "\n", "We will be using (inverse) propensity scores to assign weights to units in the data. DoWhy supports a few different weighting schemes:\n", "1. Vanilla Inverse Propensity Score weighting (IPS) (weighting_scheme=\"ips_weight\")\n", "2. Self-normalized IPS weighting (also known as the Hajek estimator) (weighting_scheme=\"ips_normalized_weight\")\n", "3. 
Stabilized IPS weighting (weighting_scheme = \"ips_stabilized_weight\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor\n", "Estimand expression:\n", " d \n", "─────(Expectation(y|W0,W1,W3,W4,W2))\n", "d[v₀] \n", "Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W0,W1,W3,W4,W2,U) = P(y|v0,W0,W1,W3,W4,W2)\n", "\n", "## Realized estimand\n", "b: y~v0+W0+W1+W3+W4+W2\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 12.690127182579836\n", "\n", "Causal Estimate is 12.690127182579836\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/amit/py-envs/env3.8/lib/python3.8/site-packages/sklearn/utils/validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " return f(**kwargs)\n" ] } ], "source": [ "causal_estimate_ipw = model.estimate_effect(identified_estimand,\n", " method_name=\"backdoor.propensity_score_weighting\",\n", " target_units = \"ate\",\n", " method_params={\"weighting_scheme\":\"ips_weight\"})\n", "print(causal_estimate_ipw)\n", "print(\"Causal Estimate is \" + str(causal_estimate_ipw.value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 5: Instrumental Variable\n", "\n", "We will be using the Wald estimator for the provided instrumental variable." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: iv\n", "Estimand expression:\n", "Expectation(Derivative(y, [Z0, Z1])*Derivative([v0], [Z0, Z1])**(-1))\n", "Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})\n", "Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)\n", "\n", "## Realized estimand\n", "Realized estimand: Wald Estimator\n", "Realized estimand type: nonparametric-ate\n", "Estimand expression:\n", " -1\n", "Expectation(Derivative(y, Z0))⋅Expectation(Derivative(v0, Z0)) \n", "Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})\n", "Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)\n", "Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['v0'] is affected in the same way by common causes of ['v0'] and y\n", "Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome y is affected in the same way by common causes of ['v0'] and y\n", "\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 11.085627630661348\n", "\n", "Causal Estimate is 11.085627630661348\n" ] } ], "source": [ "causal_estimate_iv = model.estimate_effect(identified_estimand,\n", " method_name=\"iv.instrumental_variable\", method_params = {'iv_instrument_name': 'Z0'})\n", "print(causal_estimate_iv)\n", "print(\"Causal Estimate is \" + str(causal_estimate_iv.value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 6: Regression Discontinuity\n", "\n", "We will be internally converting this to an equivalent instrumental variables problem." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " local_rd_variable local_treatment local_outcome\n", "0 0.521013 False 10.352212\n", "1 0.463769 True 19.499769\n", "9 0.477009 True 15.650173\n", "13 0.500243 False -0.607821\n", "16 0.596191 True 7.387162\n", "... ... ... ...\n", "9980 0.449123 False 15.319816\n", "9982 0.501416 True 15.874824\n", "9983 0.447196 True 8.519615\n", "9985 0.499184 False -0.135301\n", "9990 0.475909 True 19.444943\n", "\n", "[2005 rows x 3 columns]\n", "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: iv\n", "Estimand expression:\n", "Expectation(Derivative(y, [Z0, Z1])*Derivative([v0], [Z0, Z1])**(-1))\n", "Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})\n", "Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)\n", "\n", "## Realized estimand\n", "Realized estimand: Wald Estimator\n", "Realized estimand type: nonparametric-ate\n", "Estimand expression:\n", " \n", "Expectation(Derivative(y, local_rd_variable))⋅Expectation(Derivative(v0, local\n", "\n", " -1\n", "_rd_variable)) \n", "Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z0,Z1})\n", "Estimand assumption 2, Exclusion: If we remove {Z0,Z1}→{v0}, then ¬({Z0,Z1}→y)\n", "Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['local_treatment'] is affected in the same way by common causes of ['local_treatment'] and local_outcome\n", "Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome local_outcome is affected in the same way by common causes of ['local_treatment'] and local_outcome\n", "\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 17.24115839926478\n", "\n", "Causal Estimate is 17.24115839926478\n" ] } ], "source": [ "causal_estimate_regdist = model.estimate_effect(identified_estimand,\n", " 
method_name=\"iv.regression_discontinuity\", \n", " method_params={'rd_variable_name':'Z1',\n", " 'rd_threshold_value':0.5,\n", " 'rd_bandwidth': 0.1})\n", "print(causal_estimate_regdist)\n", "print(\"Causal Estimate is \" + str(causal_estimate_regdist.value))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }