{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Estimating the effect of a Member Rewards program\n", "An example of how DoWhy can be used to estimate the effect of a subscription or a rewards program for customers. \n", "\n", "Suppose that a website has a membership rewards program where customers receive additional benefits if they sign up. How do we know if the program is effective? Here the relevant causal question is:\n", "> What is the impact of offering the membership rewards program on total sales?\n", "\n", "And the equivalent counterfactual question is, \n", "> If the current members had not signed up for the program, how much less would they have spent on the website?\n", "\n", "In formal language, we are interested in the Average Treatment Effect on the Treated (ATT). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I. Formulating the causal model\n", "Suppose that the rewards program was introduced in January 2019. The outcome variable is the total spend by the end of the year. \n", "We have data on all monthly transactions of every user and on the time of signup for those who chose to sign up for the rewards program. Here's what the data looks like." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>user_id</th><th>signup_month</th><th>month</th><th>spend</th><th>treatment</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>5</td><td>1</td><td>489</td><td>True</td></tr>\n", "    <tr><th>1</th><td>0</td><td>5</td><td>2</td><td>510</td><td>True</td></tr>\n", "    <tr><th>2</th><td>0</td><td>5</td><td>3</td><td>491</td><td>True</td></tr>\n", "    <tr><th>3</th><td>0</td><td>5</td><td>4</td><td>450</td><td>True</td></tr>\n", "    <tr><th>4</th><td>0</td><td>5</td><td>5</td><td>420</td><td>True</td></tr>\n", "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", "    <tr><th>119995</th><td>9999</td><td>10</td><td>8</td><td>429</td><td>True</td></tr>\n", "    <tr><th>119996</th><td>9999</td><td>10</td><td>9</td><td>408</td><td>True</td></tr>\n", "    <tr><th>119997</th><td>9999</td><td>10</td><td>10</td><td>405</td><td>True</td></tr>\n", "    <tr><th>119998</th><td>9999</td><td>10</td><td>11</td><td>476</td><td>True</td></tr>\n", "    <tr><th>119999</th><td>9999</td><td>10</td><td>12</td><td>473</td><td>True</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>120000 rows × 5 columns</p>\n", "</div>" ], "text/plain": [ " user_id signup_month month spend treatment\n", "0 0 5 1 489 True\n", "1 0 5 2 510 True\n", "2 0 5 3 491 True\n", "3 0 5 4 450 True\n", "4 0 5 5 420 True\n", "... ... ... ... ... ...\n", "119995 9999 10 8 429 True\n", "119996 9999 10 9 408 True\n", "119997 9999 10 10 405 True\n", "119998 9999 10 11 476 True\n", "119999 9999 10 12 473 True\n", "\n", "[120000 rows x 5 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating some simulated data for our example\n", "import pandas as pd\n", "import numpy as np\n", "num_users = 10000\n", "num_months = 12\n", "\n", "signup_months = np.random.choice(np.arange(1, num_months), num_users) * np.random.randint(0,2, size=num_users)\n", "df = pd.DataFrame({\n", " 'user_id': np.repeat(np.arange(num_users), num_months),\n", " 'signup_month': np.repeat(signup_months, num_months), # signup month == 0 means customer did not sign up\n", " 'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12\n", " 'spend': np.random.poisson(500, num_users*num_months) # centered at 500\n", "})\n", "# Assigning a treatment value based on the signup month (0 means never signed up)\n", "df[\"treatment\"] = df[\"signup_month\"] != 0\n", "# Simulating the effect of month (as coded, spend decreases by 10 with each month, so it is lowest in December)\n", "df[\"spend\"] = df[\"spend\"] - df[\"month\"]*10\n", "# The treatment effect (simulating a simple treatment effect of 100 in every month after signup)\n", "after_signup = (df[\"signup_month\"] < df[\"month\"]) & (df[\"signup_month\"] != 0)\n", "df.loc[after_signup, \"spend\"] += 100\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The importance of time\n", "Time plays a crucial role in modeling this problem. \n", "\n", "Rewards signup can affect future transactions, but not those that happened before it. 
In fact, the transactions prior to the rewards signup can be assumed to cause the rewards signup decision. Therefore, we can split up the variables for each user into: \n", "\n", "1) Activity prior to the treatment (causes the treatment)\n", "2) Activity after the treatment (is the outcome of applying treatment)\n", "\n", "Of course, many important variables that affect signup and total spend are missing (e.g., the type of products bought, the age of a user's account, geography, etc.). So we'll need a node denoting `Unobserved Confounders`. \n", "\n", "Below is the causal graph for a user who signed up in month `i=6`. The analysis will be similar for any `i`. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os, sys\n", "sys.path.append(os.path.abspath(\"../../../\"))\n", "import dowhy\n", "\n", "# Setting the signup month (for ease of analysis)\n", "i = 6" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:dowhy.causal_model:Model to find the causal effect of treatment ['treatment'] on outcome ['post_spends']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " user_id signup_month treatment pre_spends post_spends\n", "0 4 0 False 462.4 401.666667\n", "1 7 0 False 467.4 410.000000\n", "2 14 0 False 466.8 399.833333\n", "3 16 0 False 461.0 403.666667\n", "4 17 0 False 467.8 404.166667\n", "... ... ... ... ... 
...\n", "5394 9991 0 False 453.2 385.166667\n", "5395 9992 0 False 482.2 404.333333\n", "5396 9994 0 False 469.6 409.833333\n", "5397 9996 0 False 462.8 399.000000\n", "5398 9997 0 False 484.4 407.333333\n", "\n", "[5399 rows x 5 columns]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "\n", "causal_graph = \"\"\"digraph {\n", "treatment[label=\"Program Signup in month i\"];\n", "pre_spends;\n", "post_spends;\n", "Z->treatment;\n", "U[label=\"Unobserved Confounders\"]; \n", "pre_spends -> treatment;\n", "treatment->post_spends;\n", "signup_month->post_spends; signup_month->pre_spends;\n", "signup_month->treatment;\n", "U->treatment; U->pre_spends; U->post_spends;\n", "}\"\"\"\n", "\n", "# Post-process the data based on the graph and the month of the treatment (signup)\n", "# pre_spends: mean monthly spend before month i; post_spends: mean monthly spend after month i\n", "df_i_signupmonth = df[df.signup_month.isin([0,i])].groupby([\"user_id\", \"signup_month\", \"treatment\"]).apply(\n", " lambda x: pd.Series({'pre_spends': np.sum(np.where(x.month < i, x.spend, 0))/np.sum(np.where(x.month < i, 1, 0)),\n", " 'post_spends': np.sum(np.where(x.month > i, x.spend, 0))/np.sum(np.where(x.month > i, 1, 0))})\n", ").reset_index()\n", "print(df_i_signupmonth)\n", "model = dowhy.CausalModel(data=df_i_signupmonth,\n", " graph=causal_graph.replace(\"\\n\", \" \"),\n", " treatment=\"treatment\",\n", " outcome=\"post_spends\")\n", "model.view_model()\n", "from IPython.display import Image, display\n", "display(Image(filename=\"causal_model.png\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More generally, we can include any activity data for the customer in the above graph. All prior- and post-activity data will occupy the same place (and have the same edges) as the Amount spent node (prior and post respectively). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## II. Identifying the causal effect\n", "For the sake of this example, let us assume that unobserved confounding does not play a big part, so we ask DoWhy to proceed even though the effect is not strictly identifiable." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.\n", "INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.\n", "INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:['Z']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor1 (Default)\n", "Estimand expression:\n", " d \n", "────────────(Expectation(post_spends|signup_month,pre_spends))\n", "d[treatment] \n", "Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→post_spends then P(post_spends|treatment,signup_month,pre_spends,U) = P(post_spends|treatment,signup_month,pre_spends)\n", "\n", "### Estimand : 2\n", "Estimand name: iv\n", "Estimand expression:\n", "Expectation(Derivative(post_spends, [Z])*Derivative([treatment], [Z])**(-1))\n", "Estimand assumption 1, As-if-random: If U→→post_spends then ¬(U →→{Z})\n", "Estimand assumption 2, Exclusion: If we remove {Z}→{treatment}, then ¬({Z}→post_spends)\n", "\n" ] } ], "source": [ "identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)\n", "print(identified_estimand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the graph, DoWhy determines that the signup month and the amount spent in the pre-treatment months (`signup_month`, `pre_spends`) need to be conditioned on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## III. Estimating the effect\n", "We now estimate the effect based on the backdoor estimand, setting the target units to \"att\"." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:dowhy.causal_estimator:INFO: Using Propensity Score Matching Estimator\n", "INFO:dowhy.causal_estimator:b: post_spends~treatment+signup_month+pre_spends\n", "/home/amit/python-virtual-envs/env3.6/lib/python3.6/site-packages/sklearn/utils/validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " return f(**kwargs)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor1 (Default)\n", "Estimand expression:\n", " d \n", "────────────(Expectation(post_spends|signup_month,pre_spends))\n", "d[treatment] \n", "Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→post_spends then P(post_spends|treatment,signup_month,pre_spends,U) = P(post_spends|treatment,signup_month,pre_spends)\n", "\n", "## Realized estimand\n", "b: post_spends~treatment+signup_month+pre_spends\n", "Target units: att\n", "\n", "## Estimate\n", "Mean value: 95.22363945578238\n", "\n" ] } ], "source": [ "estimate = model.estimate_effect(identified_estimand, \n", " method_name=\"backdoor1.propensity_score_matching\",\n", " target_units=\"att\")\n", "print(estimate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The analysis tells us the Average Treatment Effect on the Treated (ATT). That is, the average effect on monthly post-signup spend for the customers who signed up for the Rewards Program in month `i=6` (compared to the case where they had not signed up). The estimate of about 95 is close to the simulated effect of 100. We can similarly calculate the effects for customers who signed up in any other month by changing the value of `i` (set in the code above) and then rerunning the analysis. 
\n", "\n", "Note that the estimation suffers from left- and right-censoring. \n", "1. **Left-censoring**: If a customer signs up in the first month, we do not have enough transaction history to match them to similar customers who did not sign up (and thus apply the backdoor identified estimand). \n", "2. **Right-censoring**: If a customer signs up in the last month, we do not have enough *future* (post-treatment) transactions to estimate the outcome after signup. \n", "\n", "Thus, even if the effect of signup was the same across all months, the *estimated effects* may differ by month of signup, due to lack of data (and thus high variance in estimated pre-treatment or post-treatment transaction activity)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IV. Refuting the estimate\n", "We refute the estimate using the placebo treatment refuter. This refuter replaces the treatment with an independent random variable and checks whether our estimate now goes to zero (it should!)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:dowhy.causal_refuters.placebo_treatment_refuter:Refutation over 2 simulated datasets of permute treatment\n", "INFO:dowhy.causal_estimator:INFO: Using Propensity Score Matching Estimator\n", "INFO:dowhy.causal_estimator:b: post_spends~placebo+signup_month+pre_spends\n", "/home/amit/python-virtual-envs/env3.6/lib/python3.6/site-packages/sklearn/utils/validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", " return f(**kwargs)\n", "INFO:dowhy.causal_estimator:INFO: Using Propensity Score Matching Estimator\n", "INFO:dowhy.causal_estimator:b: post_spends~placebo+signup_month+pre_spends\n", "/home/amit/python-virtual-envs/env3.6/lib/python3.6/site-packages/sklearn/utils/validation.py:73: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " return f(**kwargs)\n", "WARNING:dowhy.causal_refuters.placebo_treatment_refuter:We assume a Normal Distribution as the sample has less than 100 examples.\n", " Note: The underlying distribution may not be Normal. We assume that it approaches normal with the increase in sample size.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Refute: Use a Placebo Treatment\n", "Estimated effect:95.22363945578238\n", "New effect:0.30931122448979725\n", "p value:0.39121893285584125\n", "\n" ] } ], "source": [ "refutation = model.refute_estimate(identified_estimand, estimate, method_name=\"placebo_treatment_refuter\",\n", " placebo_type=\"permute\", num_simulations=2)\n", "print(refutation)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 4 }