{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mediation analysis with DoWhy: Direct and Indirect Effects" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from dowhy import CausalModel\n", "import dowhy.datasets\n", "\n", "# Warnings and logging\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "import logging\n", "logging.getLogger(\"dowhy\").setLevel(logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " FD0 W0 v0 y\n", "0 11.009406 0.521425 2.654356 33.373490\n", "1 -6.848972 -0.637748 -1.947278 -22.097380\n", "2 -0.710087 -0.192628 -0.120618 -2.832689\n", "3 -0.092251 -0.314083 -0.229222 -1.640303\n", "4 13.049299 1.138318 3.028420 41.798294\n" ] } ], "source": [ "# Creating a dataset with a single confounder and a single mediator (num_frontdoor_variables)\n", "data = dowhy.datasets.linear_dataset(10, num_common_causes=1, num_samples=10000,\n", " num_instruments=0, num_effect_modifiers=0,\n", " num_treatments=1,\n", " num_frontdoor_variables=1,\n", " treatment_is_binary=False,\n", " outcome_is_binary=False)\n", "df = data['df']\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Modeling the causal mechanism\n", "We create a dataset following a causal graph based on the frontdoor criterion. That is, there is no direct effect of the treatment on outcome; all effect is mediated through the frontdoor variable FD0." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = CausalModel(df,\n", " data[\"treatment_name\"],data[\"outcome_name\"],\n", " data[\"gml_graph\"],\n", " missing_nodes_as_confounders=True,\n", " logging_level=logging.INFO)\n", "\n", "model.view_model()\n", "from IPython.display import Image, display\n", "display(Image(filename=\"causal_model.png\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Identifying the natural direct and indirect effects\n", "We use the `estimand_type` argument to specify whether the target estimand is the **natural direct effect** or the **natural indirect effect**. For definitions, see [Interpretation and Identification of Causal Mediation](https://ftp.cs.ucla.edu/pub/stat_ser/r389-imai-etal-commentary-r421-reprint.pdf) by Judea Pearl.\n", "\n", "- Natural direct effect: the effect due to the path v0->y.\n", "- Natural indirect effect: the effect due to the path v0->FD0->y (mediated by FD0)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.\n", "INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.\n", "INFO:dowhy.causal_identifier:Mediators for treatment and outcome:['FD0']\n", "INFO:dowhy.causal_identifier:All common causes are observed. Causal effect can be identified.\n", "INFO:dowhy.causal_identifier:All common causes are observed. 
Causal effect can be identified.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Estimand type: nonparametric-nde\n", "\n", "### Estimand : 1\n", "Estimand name: mediation\n", "Estimand expression:\n", "Expectation(Derivative(y, [FD0])*Derivative([FD0], [v0]))\n", "Estimand assumption 1, Mediation: FD0 intercepts (blocks) all directed paths from v0 to y except the path {v0}→{y}.\n", "Estimand assumption 2, First-stage-unconfoundedness: If U→{v0} and U→{FD0} then P(FD0|v0,U) = P(FD0|v0)\n", "Estimand assumption 3, Second-stage-unconfoundedness: If U→{FD0} and U→{y} then P(y|FD0, v0, U) = P(y|FD0, v0)\n", "\n" ] } ], "source": [ "# Natural direct effect (nde)\n", "identified_estimand_nde = model.identify_effect(estimand_type=\"nonparametric-nde\", \n", " proceed_when_unidentifiable=True)\n", "print(identified_estimand_nde)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.\n", "INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.\n", "INFO:dowhy.causal_identifier:Mediators for treatment and outcome:['FD0']\n", "INFO:dowhy.causal_identifier:All common causes are observed. Causal effect can be identified.\n", "INFO:dowhy.causal_identifier:All common causes are observed. 
Causal effect can be identified.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Estimand type: nonparametric-nie\n", "\n", "### Estimand : 1\n", "Estimand name: mediation\n", "Estimand expression:\n", "\n", "Estimand assumption 1, Mediation: FD0 intercepts (blocks) all directed paths from v0 to y except the path {v0}→{y}.\n", "Estimand assumption 2, First-stage-unconfoundedness: If U→{v0} and U→{FD0} then P(FD0|v0,U) = P(FD0|v0)\n", "Estimand assumption 3, Second-stage-unconfoundedness: If U→{FD0} and U→{y} then P(y|FD0, v0, U) = P(y|FD0, v0)\n", "\n" ] } ], "source": [ "# Natural indirect effect (nie)\n", "identified_estimand_nie = model.identify_effect(estimand_type=\"nonparametric-nie\", \n", " proceed_when_unidentifiable=True)\n", "print(identified_estimand_nie)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Estimation of the effect\n", "Currently, only two-stage linear regression is supported for estimation. We plan to add a non-parametric Monte Carlo method, as described in [Imai, Keele and Yamamoto (2010)](https://projecteuclid.org/euclid.ss/1280841733), soon.\n", "\n", "The estimator converts the mediation effect estimation into a series of backdoor effect estimations.\n", "1. The first-stage model estimates the effect of the treatment (v0) on the mediator (FD0).\n", "2. The second-stage model estimates the effect of the mediator (FD0) on the outcome (y).\n", "\n", "For estimating the natural indirect effect, there is also an additional second-stage model that estimates the effect of treatment on the outcome, conditioned on the mediator. It uses the same model class as the one given for the `second_stage_model` parameter." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:dowhy.causal_estimator:INFO: Using Two Stage Regression Estimator\n", "INFO:dowhy.causal_estimator:b: FD0~v0+W0\n", "INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator\n", "INFO:dowhy.causal_estimator:b: y~FD0+v0+W0\n", "INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-nde\n", "\n", "### Estimand : 1\n", "Estimand name: mediation\n", "Estimand expression:\n", "Expectation(Derivative(y, [FD0])*Derivative([FD0], [v0]))\n", "Estimand assumption 1, Mediation: FD0 intercepts (blocks) all directed paths from v0 to y except the path {v0}→{y}.\n", "Estimand assumption 2, First-stage-unconfoundedness: If U→{v0} and U→{FD0} then P(FD0|v0,U) = P(FD0|v0)\n", "Estimand assumption 3, Second-stage-unconfoundedness: If U→{FD0} and U→{y} then P(y|FD0, v0, U) = P(y|FD0, v0)\n", "\n", "## Realized estimand\n", "(b: FD0~v0+W0) * (b: y~FD0+v0+W0)\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 11.704233518070275\n", "\n" ] } ], "source": [ "import dowhy.causal_estimators.linear_regression_estimator\n", "causal_estimate_nde = model.estimate_effect(identified_estimand_nde,\n", " method_name=\"mediation.two_stage_regression\",\n", " confidence_intervals=False,\n", " test_significance=False,\n", " method_params = {\n", " 'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,\n", " 'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator\n", " }\n", " )\n", "print(causal_estimate_nde)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the value equals the true value of the natural direct effect (up to random noise). 
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11.704233518070275 11.712215459759832\n" ] } ], "source": [ "print(causal_estimate_nde.value, data[\"ate\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter is called ate because in the simulated dataset, the indirect effect is set to be zero. \n", "Now let us check whether the indirect effect estimator returns the (correct) estimate of zero." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:dowhy.causal_estimator:INFO: Using Two Stage Regression Estimator\n", "INFO:dowhy.causal_estimator:b: FD0~v0+W0\n", "INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator\n", "INFO:dowhy.causal_estimator:b: y~FD0+v0+W0\n", "INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator\n", "INFO:dowhy.causal_estimator:b: y~v0+W0\n", "INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-nie\n", "\n", "### Estimand : 1\n", "Estimand name: mediation\n", "Estimand expression:\n", "\n", "Estimand assumption 1, Mediation: FD0 intercepts (blocks) all directed paths from v0 to y except the path {v0}→{y}.\n", "Estimand assumption 2, First-stage-unconfoundedness: If U→{v0} and U→{FD0} then P(FD0|v0,U) = P(FD0|v0)\n", "Estimand assumption 3, Second-stage-unconfoundedness: If U→{FD0} and U→{y} then P(y|FD0, v0, U) = P(y|FD0, v0)\n", "\n", "## Realized estimand\n", "b: y~v0+W0-(b: FD0~v0+W0) * (b: y~FD0+v0+W0)\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 0.000848600067884675\n", "\n" ] } ], "source": [ "causal_estimate_nie = model.estimate_effect(identified_estimand_nie,\n", " method_name=\"mediation.two_stage_regression\",\n", " 
confidence_intervals=False,\n", " test_significance=False,\n", " method_params = {\n", " 'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,\n", " 'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator\n", " }\n", " )\n", "print(causal_estimate_nie)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Refutations\n", "TODO" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 4 }