{ "cells": [ { "cell_type": "markdown", "id": "29d3a3fd", "metadata": {}, "source": [ "# Time-decay Attribution and Causal Analysis\n", "\n", "**The problem**: Advertisers run campaigns across multiple channels (search, display, video) but struggle to know which ads actually drive sales. The standard approach, last-touch attribution (LTA), credits the final ad a customer saw before purchasing. This can systematically overcredit lower-funnel ads (search) that appear right before conversion, while undercrediting upper-funnel ads (display, video) that built awareness earlier. The result is misallocated budgets and underinvestment in channels with higher true ROI.\n", "\n", "**Two measurement approaches**:\n", "1. **Time Decay Attribution (TDA)**: Reallocates LTA credit by modeling ad carryover, i.e., how impressions accumulate and decay over time. If upper-funnel ads have longer carryover, they gain credit in the time-decay approach vs LTA.\n", "2. **Causal Attribution (CA)**: Uses DoWhy to decompose YoY sales growth into ad-driven contributions. Thus we know how much each spend change contributed to business outcome. \n", "\n", "These two approaches together tell you what types of ads influenced conversions beyond rule-based approaches and what drove growth." ] }, { "cell_type": "code", "execution_count": null, "id": "cae2f75d", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import networkx as nx\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import optimize\n", "from sklearn.linear_model import Ridge\n", "from sklearn.model_selection import TimeSeriesSplit\n", "from sklearn.metrics import mean_absolute_error\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "from dowhy import gcm\n", "from dowhy.gcm.util import plot as plot_graph\n", "from dowhy.gcm.falsify import falsify_graph, apply_suggestions" ] }, { "cell_type": "markdown", "id": "aa935ef2", "metadata": {}, "source": [ "## 1. Data Generation\n", "\n", "**Synthetic data with 4 ad channels: SP, SB, Display, Video**:\n", "1. SP/SB: high LTA share (75% combined), lower iROI (\\$2.20-2.50), fast decay\n", "2. 
Display/Video: low LTA share (25%), higher iROI (\\$3.80-4.50), slow decay" ] }, { "cell_type": "code", "execution_count": null, "id": "03ec3c4d-139b-4a93-8ac2-25ec44e3b11b", "metadata": {}, "outputs": [], "source": [ "def generate_advertising_data(start_date='2024-01-01', end_date='2025-12-31', seed=0):\n", " np.random.seed(seed)\n", " \n", " dates = pd.date_range(start=start_date, end=end_date, freq='D')\n", " n_days = len(dates)\n", " \n", " day_of_year = np.array([d.dayofyear for d in dates])\n", " year = np.array([d.year for d in dates])\n", " day_of_week = np.array([d.dayofweek for d in dates])\n", " \n", " # Seasonality: ±15% weekly, ±30% yearly (peaks ~June)\n", " weekly_pattern = 1 + 0.15 * np.sin(2 * np.pi * day_of_week / 7)\n", " yearly_pattern = 1 + 0.3 * np.sin(2 * np.pi * (day_of_year - 80) / 365)\n", " yoy_growth = np.where(year == 2025, 1.08, 1.0) # 8% YoY growth\n", " \n", " # Events: with year-over-year variation\n", " special_shopping_event = np.zeros(n_days)\n", " other_shopping_event = np.zeros(n_days)\n", " \n", " for i, d in enumerate(dates):\n", " # === SPECIAL SHOPPING EVENTS ===\n", " # Prime Day \n", " if d.year == 2024 and d.month == 7 and 16 <= d.day <= 17:\n", " special_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 7 and 8 <= d.day <= 9:\n", " special_shopping_event[i] = 1\n", " # Black Friday week \n", " if d.year == 2024 and d.month == 11 and 25 <= d.day <= 30: # Thanksgiving Nov 28\n", " special_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 11 and 24 <= d.day <= 29: # Thanksgiving Nov 27\n", " special_shopping_event[i] = 1\n", " # Holiday shopping season\n", " if d.month == 12 and d.day <= 24:\n", " special_shopping_event[i] = 0.7\n", " # Cyber Monday extended\n", " if d.year == 2024 and d.month == 12 and 1 <= d.day <= 2:\n", " special_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 11 and d.day == 30:\n", " special_shopping_event[i] = 1\n", " \n", " # === OTHER SHOPPING EVENTS ===\n", " # Valentine's Day\n", " if d.month == 2 and 7 <= d.day <= 14:\n", " other_shopping_event[i] = 1\n", " # Easter \n", " if d.year == 2024 and d.month == 3 and 25 <= d.day <= 31: # Easter Mar 31\n", " other_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 4 and 14 <= d.day <= 20: # Easter Apr 20\n", " other_shopping_event[i] = 1\n", " # Mother's Day week\n", " if d.year == 2024 and d.month == 5 and 6 <= d.day <= 12: # May 12\n", " other_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 5 and 5 <= d.day <= 11: # May 11\n", " other_shopping_event[i] = 1\n", " # Memorial Day \n", " if d.year == 2024 and d.month == 5 and 24 <= d.day <= 27:\n", " other_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 5 and 23 <= d.day <= 26:\n", " other_shopping_event[i] = 1\n", " # Father's Day week\n", " if d.year == 2024 and d.month == 6 and 10 <= d.day <= 16: # Jun 16\n", " other_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 6 and 9 <= d.day <= 15: # Jun 15\n", " other_shopping_event[i] = 1\n", " # Back to school \n", " if d.month == 8 and 1 <= d.day <= 31:\n", " other_shopping_event[i] = 0.7\n", " # Labor Day\n", " if d.year == 2024 and d.month == 9 and 1 <= d.day <= 2:\n", " other_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 9 and 1 <= d.day <= 1:\n", " other_shopping_event[i] = 1\n", " # Halloween\n", " if d.month == 10 and 20 <= d.day <= 31:\n", " other_shopping_event[i] = 0.7\n", " # Super Bowl week \n", " if d.year == 2024 and d.month == 2 and 5 <= d.day <= 11: # Feb 11\n", " 
other_shopping_event[i] = 1\n", " if d.year == 2025 and d.month == 2 and 3 <= d.day <= 9: # Feb 9\n", " other_shopping_event[i] = 1\n", " \n", " # Spend: SP/SB track demand closely; Display/Video more stable for brand building\n", " base_spend_sp = 15000 * yoy_growth * weekly_pattern * yearly_pattern\n", " base_spend_sb = 8000 * yoy_growth * weekly_pattern * yearly_pattern\n", " base_spend_display = 12000 * yoy_growth * weekly_pattern * (yearly_pattern ** 0.5)\n", " base_spend_video = 6000 * yoy_growth * weekly_pattern * (yearly_pattern ** 0.3)\n", " \n", " event_multiplier = 1 + 0.5 * special_shopping_event + 0.2 * other_shopping_event\n", " \n", " # Noise: upper-funnel has more variance\n", " sp_spend = base_spend_sp * event_multiplier * (1 + 0.15 * np.random.randn(n_days))\n", " sb_spend = base_spend_sb * event_multiplier * (1 + 0.12 * np.random.randn(n_days))\n", " display_spend = base_spend_display * event_multiplier * (1 + 0.18 * np.random.randn(n_days))\n", " video_spend = base_spend_video * event_multiplier * (1 + 0.20 * np.random.randn(n_days))\n", " \n", " sp_spend = np.maximum(sp_spend, 100)\n", " sb_spend = np.maximum(sb_spend, 50)\n", " display_spend = np.maximum(display_spend, 100)\n", " video_spend = np.maximum(video_spend, 50)\n", " \n", " # Impressions via CPM: SP=$2.50, SB=$3.00, Display=$1.50, Video=$8.00\n", " sp_impressions = sp_spend / 2.5 * 1000\n", " sb_impressions = sb_spend / 3.0 * 1000\n", " display_impressions = display_spend / 1.5 * 1000\n", " video_impressions = video_spend / 8.0 * 1000\n", " \n", " # Ground truth: upper-funnel has higher iROI but gets less LTA credit\n", " iroi_sp, iroi_sb, iroi_display, iroi_video = 2.5, 2.2, 3.8, 4.5\n", " carryover_sp, carryover_sb, carryover_display, carryover_video = 0.3, 0.4, 0.7, 0.85\n", " lta_rate_sp, lta_rate_sb, lta_rate_display, lta_rate_video = 0.45, 0.30, 0.18, 0.07\n", " \n", " # Adstock: effective_spend[t] = spend[t] + carryover * effective_spend[t-1]\n", " def apply_adstock(x, carryover):\n", " adstocked = np.zeros_like(x)\n", " adstocked[0] = x[0]\n", " for t in range(1, len(x)):\n", " adstocked[t] = x[t] + carryover * adstocked[t-1]\n", " return adstocked\n", " \n", " sp_adstocked = apply_adstock(sp_spend, carryover_sp)\n", " sb_adstocked = apply_adstock(sb_spend, carryover_sb)\n", " display_adstocked = apply_adstock(display_spend, carryover_display)\n", " video_adstocked = apply_adstock(video_spend, carryover_video)\n", " \n", " # Total sales = organic + (iROI × adstocked_spend) + event_lift + noise\n", " organic_baseline = 80000 * yoy_growth * weekly_pattern * yearly_pattern\n", " incremental_from_sp = iroi_sp * sp_adstocked\n", " incremental_from_sb = iroi_sb * sb_adstocked\n", " incremental_from_display = iroi_display * display_adstocked\n", " incremental_from_video = iroi_video * video_adstocked\n", " event_lift = 30000 * special_shopping_event + 10000 * other_shopping_event\n", " \n", " total_sales = (organic_baseline + \n", " incremental_from_sp + incremental_from_sb + \n", " incremental_from_display + incremental_from_video +\n", " event_lift + 5000 * np.random.randn(n_days))\n", " total_sales = np.maximum(total_sales, 10000)\n", " \n", " # LTA: 60% of sales are ad-touched, distributed by LTA contribution rates (independent of true iROI)\n", " ad_touched_sales = 0.6 * total_sales * (1 + 0.1 * np.random.randn(n_days))\n", " lta_sp = np.maximum(lta_rate_sp * ad_touched_sales * (1 + 0.08 * np.random.randn(n_days)), 0)\n", " lta_sb = np.maximum(lta_rate_sb * ad_touched_sales * (1 + 0.10 * 
np.random.randn(n_days)), 0)\n", " lta_display = np.maximum(lta_rate_display * ad_touched_sales * (1 + 0.12 * np.random.randn(n_days)), 0)\n", " lta_video = np.maximum(lta_rate_video * ad_touched_sales * (1 + 0.15 * np.random.randn(n_days)), 0)\n", " \n", " df = pd.DataFrame({\n", " 'activity_date': dates,\n", " 'year': year,\n", " 'quarter': [d.quarter for d in dates],\n", " 'month': [d.month for d in dates],\n", " 'day_of_week': day_of_week,\n", " 'sp_spend': sp_spend, 'sb_spend': sb_spend,\n", " 'display_spend': display_spend, 'video_spend': video_spend,\n", " 'sp_impressions': sp_impressions, 'sb_impressions': sb_impressions,\n", " 'display_impressions': display_impressions, 'video_impressions': video_impressions,\n", " 'special_shopping_event': special_shopping_event,\n", " 'other_shopping_event': other_shopping_event,\n", " 'total_sales': total_sales,\n", " 'ad_touched_sales': ad_touched_sales,\n", " 'lta_sales_sp': lta_sp, 'lta_sales_sb': lta_sb,\n", " 'lta_sales_display': lta_display, 'lta_sales_video': lta_video,\n", " })\n", " \n", " ground_truth = {\n", " 'iroi': {'SP': iroi_sp, 'SB': iroi_sb, 'Display': iroi_display, 'Video': iroi_video},\n", " 'carryover': {'SP': carryover_sp, 'SB': carryover_sb, 'Display': carryover_display, 'Video': carryover_video},\n", " 'lta_rate': {'SP': lta_rate_sp, 'SB': lta_rate_sb, 'Display': lta_rate_display, 'Video': lta_rate_video}\n", " }\n", " \n", " return df, ground_truth\n", "\n", "df, ground_truth = generate_advertising_data()\n", "print(f\"Data: {len(df)} days, {df['activity_date'].min().date()} to {df['activity_date'].max().date()}\")\n", "print(f\"Special event days: {(df['special_shopping_event'] > 0).sum()} ({(df['special_shopping_event'] > 0).mean()*100:.1f}%)\")\n", "print(f\"Other event days: {(df['other_shopping_event'] > 0).sum()} ({(df['other_shopping_event'] > 0).mean()*100:.1f}%)\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "fe7788d6", "metadata": {}, "outputs": [], "source": [ "print(\"Ground Truth:\")\n", "print(f\" iROI: {ground_truth['iroi']}\")\n", "print(f\" Carryover: {ground_truth['carryover']}\")\n", "print(f\" LTA rates: {ground_truth['lta_rate']}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "63c51830", "metadata": {}, "outputs": [], "source": [ "df_2024 = df[df['year'] == 2024]\n", "df_2025 = df[df['year'] == 2025]\n", "\n", "print(f\"YoY Change (2025 vs 2024):\")\n", "for col in ['total_sales', 'sp_spend', 'sb_spend', 'display_spend', 'video_spend']:\n", " pct = (df_2025[col].mean() - df_2024[col].mean()) / df_2024[col].mean() * 100\n", " print(f\" {col}: {pct:+.1f}%\")" ] }, { "cell_type": "markdown", "id": "c1d0d537", "metadata": {}, "source": [ "## 2. Time Decay MTA\n", "\n", "Reallocate LTA credit using adstock transformation. Channels with higher carryover (Display/Video) accumulate more effective impressions over time, gaining credit vs LTA." 
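 ] }, { "cell_type": "markdown", "id": "tda-adstock-toy-note", "metadata": {}, "source": [ "Before fitting anything, here is a minimal sketch of the geometric adstock idea on a toy series (the `toy_adstock` helper and `toy_impressions` values are illustrative assumptions, not part of the pipeline). A one-day impression pulse keeps contributing a fraction `carryover**k` of its weight `k` days later, so a channel's carryover rate maps to a half-life of `log(0.5)/log(carryover)` days, the same formula reported alongside the optimized rates below." ] }, { "cell_type": "code", "execution_count": null, "id": "tda-adstock-toy-code", "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only (not used downstream): geometric adstock applied to a one-day impression pulse\n", "import numpy as np\n", "\n", "def toy_adstock(x, carryover):\n", "    out = np.zeros_like(x, dtype=float)\n", "    out[0] = x[0]\n", "    for t in range(1, len(x)):\n", "        out[t] = x[t] + carryover * out[t-1]\n", "    return out\n", "\n", "toy_impressions = np.array([1000.0, 0, 0, 0, 0, 0, 0])  # single burst, then nothing\n", "for carry in (0.3, 0.85):\n", "    effective = toy_adstock(toy_impressions, carry)\n", "    half_life = np.log(0.5) / np.log(carry)  # days until the pulse decays to half its weight\n", "    print(f\"carryover={carry}: effective impressions {np.round(effective).astype(int)} | half-life {half_life:.1f}d\")"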
] }, { "cell_type": "code", "execution_count": null, "id": "74d4b5ba", "metadata": {}, "outputs": [], "source": [ "def apply_adstock(series, carryover):\n", " \"\"\"Geometric adstock: result[t] = x[t] + carryover * result[t-1]\"\"\"\n", " result = np.zeros_like(series, dtype=float)\n", " result[0] = series[0]\n", " for t in range(1, len(series)):\n", " result[t] = series[t] + carryover * result[t-1]\n", " return result\n", "\n", "def build_adstocked_features(X, carryover_map):\n", " \"\"\"Apply channel-specific carryover to build adstocked feature matrix.\"\"\"\n", " X_adstocked = pd.DataFrame(index=X.index)\n", " for col in X.columns:\n", " carry = carryover_map.get(col, 0.5)\n", " X_adstocked[col] = apply_adstock(X[col].values, carry)\n", " return X_adstocked\n", "\n", "def time_decay_attribution(X_adstocked, y, lam=10.0):\n", " \"\"\"Ridge regression on adstocked features, rescaled to sum to total LTA.\"\"\"\n", " X_arr = X_adstocked.values\n", " X_mean = X_arr.mean(axis=0)\n", " X_std = X_arr.std(axis=0) + 1e-8\n", " X_scaled = (X_arr - X_mean) / X_std\n", " \n", " model = Ridge(alpha=lam, fit_intercept=True)\n", " model.fit(X_scaled, y.values)\n", " coefs = model.coef_ / X_std\n", " y_pred = model.predict(X_scaled)\n", " \n", " contributions = {}\n", " for i, col in enumerate(X_adstocked.columns):\n", " contrib = coefs[i] * X_adstocked[col].values\n", " contributions[col] = np.maximum(contrib, 0).sum()\n", " \n", " total_contrib = sum(contributions.values())\n", " if total_contrib > 0:\n", " scale = y.sum() / total_contrib\n", " contributions = {k: v * scale for k, v in contributions.items()}\n", " \n", " return contributions, model, y_pred\n", "\n", "def cv_objective(params, X, y, channels, carryover_bounds, lambda_bounds, n_splits=5):\n", " \"\"\"Cross-validation objective: validation MAE + overfitting penalty + regularization on carryover.\"\"\"\n", " n_channels = len(channels)\n", " carryover_map = {channels[i]: params[i] for i in range(n_channels)}\n", " lam = params[n_channels]\n", " \n", " # Boundary check\n", " for carry in carryover_map.values():\n", " if carry < carryover_bounds[0] or carry > carryover_bounds[1]:\n", " return 1e10\n", " if lam < lambda_bounds[0] or lam > lambda_bounds[1]:\n", " return 1e10\n", " \n", " X_adstocked = build_adstocked_features(X, carryover_map)\n", " tscv = TimeSeriesSplit(n_splits=n_splits)\n", " \n", " train_maes, val_maes = [], []\n", " \n", " for train_idx, val_idx in tscv.split(X_adstocked):\n", " X_train, X_val = X_adstocked.iloc[train_idx], X_adstocked.iloc[val_idx]\n", " y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]\n", " \n", " # Standardize using train stats only\n", " X_train_arr = X_train.values\n", " X_mean = X_train_arr.mean(axis=0)\n", " X_std = X_train_arr.std(axis=0) + 1e-8\n", " X_train_scaled = (X_train_arr - X_mean) / X_std\n", " \n", " model = Ridge(alpha=lam, fit_intercept=True)\n", " model.fit(X_train_scaled, y_train.values)\n", " \n", " y_train_pred = model.predict(X_train_scaled)\n", " X_val_scaled = (X_val.values - X_mean) / X_std\n", " y_val_pred = model.predict(X_val_scaled)\n", " \n", " train_maes.append(mean_absolute_error(y_train, y_train_pred))\n", " val_maes.append(mean_absolute_error(y_val, y_val_pred))\n", " \n", " train_mae = np.mean(train_maes)\n", " val_mae = np.mean(val_maes)\n", " gap_penalty = max(0, val_mae - train_mae) / (train_mae + 1e-8)\n", " carryover_penalty = sum(c**2 for c in carryover_map.values()) / n_channels\n", " \n", " return val_mae + 0.1 * gap_penalty * val_mae + 0.05 * 
carryover_penalty * val_mae\n", "\n", "def optimize_carryover(X, y, channels, carryover_bounds=(0.1, 0.95), lambda_bounds=(1, 100), maxiter=50):\n", " \"\"\"Find optimal carryover rates via differential evolution.\"\"\"\n", " n_channels = len(channels)\n", " bounds = [carryover_bounds] * n_channels + [lambda_bounds]\n", " \n", " result = optimize.differential_evolution(\n", " cv_objective,\n", " bounds=bounds,\n", " args=(X, y, channels, carryover_bounds, lambda_bounds),\n", " seed=0,\n", " maxiter=maxiter,\n", " tol=1e-6,\n", " disp=False,\n", " workers=1\n", " )\n", " \n", " carryover_map = {channels[i]: result.x[i] for i in range(n_channels)}\n", " lam_opt = result.x[n_channels]\n", " \n", " return carryover_map, lam_opt" ] }, { "cell_type": "code", "execution_count": null, "id": "d8f2a3b1", "metadata": {}, "outputs": [], "source": [ "channels = ['SP', 'SB', 'Display', 'Video']\n", "X_impressions = df.set_index('activity_date')[['sp_impressions', 'sb_impressions', 'display_impressions', 'video_impressions']].copy()\n", "X_impressions.columns = channels\n", "y_lta_total = df.set_index('activity_date')[['lta_sales_sp', 'lta_sales_sb', 'lta_sales_display', 'lta_sales_video']].sum(axis=1)\n", "\n", "print(f\"X: {X_impressions.shape}, y: {y_lta_total.shape}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a1b2c3d4", "metadata": {}, "outputs": [], "source": [ "carryover_opt, lam_opt = optimize_carryover(\n", " X_impressions, y_lta_total, channels,\n", " carryover_bounds=(0.1, 0.95),\n", " lambda_bounds=(5, 100),\n", " maxiter=50\n", ")\n", "\n", "print(\"Optimized carryover rates:\")\n", "for ch, carry in carryover_opt.items():\n", " half_life = np.log(0.5) / np.log(carry) if carry > 0 else 0\n", " true_carry = ground_truth['carryover'].get(ch, 'N/A')\n", " print(f\" {ch}: {carry:.3f} (half-life: {half_life:.1f}d) | actual: {true_carry}\")\n", "print(f\"Lambda: {lam_opt:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e5f6g7h8", "metadata": {}, "outputs": [], "source": [ "X_adstocked = build_adstocked_features(X_impressions, carryover_opt)\n", "mta_contrib, _, _ = time_decay_attribution(X_adstocked, y_lta_total, lam_opt)\n", "\n", "lta_totals = {\n", " 'SP': df['lta_sales_sp'].sum(),\n", " 'SB': df['lta_sales_sb'].sum(),\n", " 'Display': df['lta_sales_display'].sum(),\n", " 'Video': df['lta_sales_video'].sum()\n", "}\n", "\n", "total_lta = sum(lta_totals.values())\n", "total_mta = sum(mta_contrib.values())" ] }, { "cell_type": "code", "execution_count": null, "id": "i9j0k1l2", "metadata": {}, "outputs": [], "source": [ "print(\"LTA vs MTA Attribution:\")\n", "print(f\"{'Channel':<10} {'LTA ($M)':<12} {'LTA %':<10} {'MTA ($M)':<12} {'MTA %':<10} {'Shift':<10}\")\n", "print(\"-\" * 64)\n", "for ch in channels:\n", " lta_val = lta_totals[ch] / 1e6\n", " lta_pct = lta_totals[ch] / total_lta * 100\n", " mta_val = mta_contrib[ch] / 1e6\n", " mta_pct = mta_contrib[ch] / total_mta * 100\n", " shift = mta_pct - lta_pct\n", " print(f\"{ch:<10} {lta_val:<12.1f} {lta_pct:<10.1f} {mta_val:<12.1f} {mta_pct:<10.1f} {shift:+.1f}pp\")" ] }, { "cell_type": "code", "execution_count": null, "id": "m3n4o5p6", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "x = np.arange(len(channels))\n", "width = 0.35\n", "\n", "ax1 = axes[0]\n", "lta_pcts = [lta_totals[c]/total_lta*100 for c in channels]\n", "mta_pcts = [mta_contrib[c]/total_mta*100 for c in channels]\n", "ax1.bar(x - width/2, lta_pcts, width, label='Last Touch', 
color='tab:red')\n", "ax1.bar(x + width/2, mta_pcts, width, label='Time Decay MTA', color='tab:blue')\n", "ax1.set_xticks(x)\n", "ax1.set_xticklabels(channels)\n", "ax1.set_ylabel('Share (%)')\n", "ax1.set_title('LTA vs MTA Attribution')\n", "ax1.legend()\n", "ax1.set_ylim(0, 55)\n", "\n", "ax2 = axes[1]\n", "shifts = [mta_pcts[i] - lta_pcts[i] for i in range(len(channels))]\n", "colors = ['tab:green' if s > 0 else 'tab:red' for s in shifts]\n", "ax2.bar(channels, shifts, color=colors)\n", "ax2.axhline(y=0, color='black', linewidth=0.5)\n", "ax2.set_ylabel('Credit Shift (pp)')\n", "ax2.set_title('MTA - LTA Shift')\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "section3-header", "metadata": {}, "source": [ "## 3. Causal Attribution (DoWhy)\n", "\n", "Decompose YoY sales growth into driver contributions using DoWhy GCM:\n", "1. Build DAG: spend variables, other drivers → sale\n", "2. Fit causal model\n", "3. Attribute mean change via Shapley values\n", "\n", "### 3.1 With Raw Spend" ] }, { "cell_type": "markdown", "id": "a795195f-1952-47b7-8bb7-5175b4927498", "metadata": {}, "source": [ "This is the default approach by fitting the model directly on observed spend without any time-series transformation." ] }, { "cell_type": "code", "execution_count": null, "id": "u1v2w3x4", "metadata": {}, "outputs": [], "source": [ "df_causal = df[['activity_date', 'year', 'quarter',\n", " 'sp_spend', 'sb_spend', 'display_spend', 'video_spend',\n", " 'special_shopping_event', 'other_shopping_event',\n", " 'total_sales']].copy()\n", "df_causal.columns = ['activity_date', 'year', 'quarter',\n", " 'sp_spend', 'sb_spend', 'display_spend', 'video_spend',\n", " 'special_shopping_event', 'other_shopping_event', 'sale']\n", "\n", "print(f\"Causal data: {len(df_causal)} rows\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c1f3a132-0502-4649-90e8-bfab6c1b4b02", "metadata": {}, "outputs": [], "source": [ "# Build graph matching DGP structure\n", "day_of_year = df['activity_date'].dt.dayofyear\n", "df_causal['yearly_seasonality'] = 1 + 0.3 * np.sin(2 * np.pi * (day_of_year - 80) / 365)\n", "df_causal['weekly_seasonality'] = 1 + 0.15 * np.sin(2 * np.pi * df['activity_date'].dt.dayofweek / 7)\n", "\n", "edges = []\n", "\n", "# Spend → sale\n", "for col in ['sp_spend', 'sb_spend', 'display_spend', 'video_spend']:\n", " edges.append((col, 'sale'))\n", "\n", "# Seasonality → sale \n", "edges.append(('yearly_seasonality', 'sale'))\n", "edges.append(('weekly_seasonality', 'sale'))\n", "\n", "# Seasonality → all spend\n", "for col in ['sp_spend', 'sb_spend', 'display_spend', 'video_spend']:\n", " edges.append(('yearly_seasonality', col))\n", " edges.append(('weekly_seasonality', col))\n", "\n", "# Events → sale\n", "edges.append(('special_shopping_event', 'sale'))\n", "edges.append(('other_shopping_event', 'sale'))\n", "\n", "# Events → all spend \n", "for col in ['sp_spend', 'sb_spend', 'display_spend', 'video_spend']:\n", " edges.append(('special_shopping_event', col))\n", " edges.append(('other_shopping_event', col))\n", "\n", "causal_graph = nx.DiGraph(edges)\n", "plot_graph(causal_graph)" ] }, { "cell_type": "code", "execution_count": null, "id": "2cc4eb04-11b2-465f-aa55-b6b56a525205", "metadata": {}, "outputs": [], "source": [ "# Graph validation\n", "RUN_GRAPH_VALIDATION = True\n", "\n", "if RUN_GRAPH_VALIDATION:\n", " validation_data = df_causal[['sp_spend','sb_spend','display_spend','video_spend',\n", " 'yearly_seasonality', 
'weekly_seasonality','special_shopping_event', 'other_shopping_event','sale']]\n", " result = falsify_graph(causal_graph, validation_data, suggestions=True, n_permutations=20)\n", " print(result)\n", " causal_graph = apply_suggestions(causal_graph, result)\n", " print(f\"\\nRevised graph edges: {list(causal_graph.edges())}\")\n", "else:\n", " print(\"Graph validation skipped\")" ] }, { "cell_type": "code", "execution_count": null, "id": "g3h4i5j6", "metadata": {}, "outputs": [], "source": [ "causal_model = gcm.StructuralCausalModel(causal_graph)\n", "gcm.auto.assign_causal_mechanisms(causal_model, df_causal, quality=gcm.auto.AssignmentQuality.BETTER)\n", "gcm.fit(causal_model, df_causal)" ] }, { "cell_type": "code", "execution_count": null, "id": "k7l8m9n0", "metadata": {}, "outputs": [], "source": [ "# Define growth period for comparison \n", "\n", "df_new = df_causal[(df_causal['year'] == 2025) & (df_causal['quarter'].isin([1, 2]))].copy()\n", "df_old = df_causal[(df_causal['year'] == 2024) & (df_causal['quarter'].isin([1, 2]))].copy()\n", "\n", "mean_old = df_old['sale'].mean()\n", "mean_new = df_new['sale'].mean()\n", "total_change = mean_new - mean_old\n", "\n", "print(f\"Comparing: 2025 H1 ({len(df_new)}d) vs 2024 H1 ({len(df_old)}d)\")\n", "print(f\"Mean daily sales: 2024=${mean_old:,.0f}, 2025=${mean_new:,.0f}\")\n", "print(f\"Daily change: ${total_change:,.0f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "o1p2q3r4", "metadata": {}, "outputs": [], "source": [ "# Decompose mean change via Shapley values\n", "# Can swap np.mean for np.median or np.var to decompose those instead\n", "\n", "median_contribs, uncertainty = gcm.confidence_intervals(\n", " lambda: gcm.distribution_change(\n", " causal_model, df_old, df_new, 'sale',\n", " num_samples=2000,\n", " difference_estimation_func=lambda x, y: np.mean(y) - np.mean(x)\n", " ),\n", " confidence_level=0.95,\n", " num_bootstrap_resamples=20\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "s5t6u7v8", "metadata": {}, "outputs": [], "source": [ "growth_rate = (mean_new - mean_old) / mean_old * 100\n", "\n", "print(\"Causal Attribution to YoY Sales Change:\")\n", "print(f\"{'Driver':<20} {'Contrib $':<12} {'% of Total':<12} {'% of Growth':<12} {'iROAS':<10} {'95% CI':<22}\")\n", "print(\"-\" * 98)\n", "\n", "sum_contribs = 0\n", "total_abs_contrib = sum(abs(v) for v in median_contribs.values())\n", "\n", "spend_changes = {}\n", "for col in ['sp_spend', 'sb_spend', 'display_spend', 'video_spend']:\n", " spend_changes[col] = df_new[col].mean() - df_old[col].mean()\n", "\n", "for node in sorted(median_contribs.keys(), key=lambda x: -abs(median_contribs[x])):\n", " val = median_contribs[node]\n", " sum_contribs += val\n", " pct_total = (val / total_abs_contrib * 100) if total_abs_contrib > 0 else 0\n", " pct_growth = (val / mean_old * 100)\n", " lb, ub = uncertainty[node]\n", " sig = \"*\" if (lb > 0 or ub < 0) else \"\"\n", " \n", " if node in spend_changes:\n", " spend_chg = spend_changes[node]\n", " iroas = val / spend_chg if spend_chg != 0 else 0\n", " print(f\"{node:<20} ${val:>10,.0f} {pct_total:>10.1f}% {pct_growth:>10.2f}% ${iroas:>7.2f} {sig:<2} [{lb:>8,.0f}, {ub:>8,.0f}]\")\n", " else:\n", " print(f\"{node:<20} ${val:>10,.0f} {pct_total:>10.1f}% {pct_growth:>10.2f}% {'N/A':>8} {sig:<2} [{lb:>8,.0f}, {ub:>8,.0f}]\")\n", "\n", "print(\"-\" * 98)\n", "print(f\"{'Sum of contributions':<20} ${sum_contribs:>10,.0f} {'':>11} {sum_contribs/mean_old*100:>10.2f}%\")\n", "print(f\"{'Actual mean change':<20} 
${total_change:>10,.0f} {'':>11} {growth_rate:>10.2f}%\")\n", "\n", "print(f\"\\nKPI summary (daily sales):\")\n", "print(f\" 2024 H1: mean=${mean_old:,.0f}, median=${df_old['sale'].median():,.0f}\")\n", "print(f\" 2025 H1: mean=${mean_new:,.0f}, median=${df_new['sale'].median():,.0f}\")\n", "print(f\" YoY growth: {growth_rate:.2f}% (mean)\")\n", "\n", "print(f\"\\n* = CI excludes 0\")\n", "print(f\"% of Growth = contribution / baseline mean\")\n", "print(f\"iROAS = sales contribution / spend change\")\n", "print(f\"'sale' row = unexplained variance (organic growth, noise)\")\n", "print(f\"Note: Sum ≠ actual due to Monte Carlo sampling in Shapley estimation. Increase num_samples/num_bootstrap_resamples to reduce gap.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "w9x0y1z2", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10, 6))\n", "\n", "drivers = list(median_contribs.keys())\n", "contribs = [median_contribs[d] for d in drivers]\n", "sorted_idx = np.argsort(contribs)[::-1]\n", "drivers = [drivers[i] for i in sorted_idx]\n", "contribs = [contribs[i] for i in sorted_idx]\n", "\n", "colors = ['tab:green' if c > 0 else 'tab:red' for c in contribs]\n", "ax.barh(drivers, contribs, color=colors)\n", "ax.axvline(x=0, color='black', linewidth=0.5)\n", "ax.set_xlabel('Contribution to Daily Sales Change ($)')\n", "ax.set_title('Causal Attribution: H1 2025 vs H1 2024')\n", "ax.invert_yaxis()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "a3b4c5d6", "metadata": {}, "outputs": [], "source": [ "print(\"iROI Estimation (via intervention):\")\n", "print(f\"{'Channel':<15} {'Est. iROI':<15} {'Actual iROI':<15}\")\n", "print(\"-\" * 45)\n", "\n", "for spend_var, ch in [('sp_spend', 'SP'), ('sb_spend', 'SB'), \n", " ('display_spend', 'Display'), ('video_spend', 'Video')]:\n", " step = 1000\n", " effect = gcm.average_causal_effect(\n", " causal_model=causal_model,\n", " target_node='sale',\n", " interventions_alternative={spend_var: lambda x, s=step: x + s},\n", " interventions_reference={spend_var: lambda x: x},\n", " num_samples_to_draw=10000\n", " )\n", " iroi = effect / step\n", " actual_iroi = ground_truth['iroi'][ch]\n", " print(f\"{ch:<15} ${iroi:<14.2f} ${actual_iroi:<14.2f}\")" ] }, { "cell_type": "markdown", "id": "transition-32", "metadata": {}, "source": [ "The iROI estimates differ from ground truth. Possible reasons:\n", "- DAG uses raw spend, but DGP applies adstock before computing sales\n", "- Auto-assigned mechanism may not capture the adstock transform\n", "- Unmeasured confounders (competitor activity, macro factors, etc.)\n", "\n", "### 3.2 With Adstocked Spend\n", "\n", "Section 3.1 used raw spend as inputs, but our DGP generates sales from adstocked spend, where each day's effect accumulates from prior days' spending. 
This mismatch could explain the iROI gaps above.\n", "\n", "To test this, we transform spend using ground truth carryover rates and refit:" ] }, { "cell_type": "code", "execution_count": null, "id": "bf94685b-e9b5-4b5a-b44f-a7b8de6ce5b1", "metadata": {}, "outputs": [], "source": [ "# Create adstocked spend columns using ground truth carryover rates\n", "def apply_adstock(x, carryover):\n", " adstocked = np.zeros_like(x)\n", " adstocked[0] = x[0]\n", " for t in range(1, len(x)):\n", " adstocked[t] = x[t] + carryover * adstocked[t-1]\n", " return adstocked\n", "\n", "df_causal_adstocked = df_causal.copy()\n", "df_causal_adstocked['sp_spend'] = apply_adstock(df_causal['sp_spend'].values, 0.3)\n", "df_causal_adstocked['sb_spend'] = apply_adstock(df_causal['sb_spend'].values, 0.4)\n", "df_causal_adstocked['display_spend'] = apply_adstock(df_causal['display_spend'].values, 0.7)\n", "df_causal_adstocked['video_spend'] = apply_adstock(df_causal['video_spend'].values, 0.85)\n", "\n", "# Rebuild causal model with adstocked data\n", "causal_model_adstocked = gcm.StructuralCausalModel(causal_graph)\n", "gcm.auto.assign_causal_mechanisms(causal_model_adstocked, df_causal_adstocked, quality=gcm.auto.AssignmentQuality.BETTER)\n", "gcm.fit(causal_model_adstocked, df_causal_adstocked)\n", "\n", "# Test iROI\n", "print(\"iROI with adstocked spend:\")\n", "print(f\"{'Channel':<15} {'Est. iROI':<15} {'Actual iROI':<15}\")\n", "print(\"-\" * 45)\n", "\n", "for spend_var, ch in [('sp_spend', 'SP'), ('sb_spend', 'SB'), \n", " ('display_spend', 'Display'), ('video_spend', 'Video')]:\n", " step = 1000\n", " effect = gcm.average_causal_effect(\n", " causal_model=causal_model_adstocked,\n", " target_node='sale',\n", " interventions_alternative={spend_var: lambda x, s=step: x + s},\n", " interventions_reference={spend_var: lambda x: x},\n", " num_samples_to_draw=10000\n", " )\n", " iroi = effect / step\n", " actual_iroi = ground_truth['iroi'][ch]\n", " print(f\"{ch:<15} ${iroi:<14.2f} ${actual_iroi:<14.2f}\")" ] }, { "cell_type": "markdown", "id": "adstock-explanation", "metadata": {}, "source": [ "SP and SB improved drastically because these channels have highly correlated spend in the DGP: both follow demand seasonality and spike during events. We generated the data that way to mimic real-world investment preferences.\n", "\n", "When inputs are highly correlated, the model struggles to separate individual effects (SB showed the biggest gap before because it was absorbing variance from the other channels). Applying different carryover rates transforms each channel differently, reducing correlation and letting the model distinguish their contributions.\n", "\n", "In contrast, Display and Video were already close because their spend is more stable and less correlated with the other channels (dampened seasonality in the DGP). The model already works reasonably well for them without the transform.\n", "\n", "Real data doesn't come with known carryover rates, so this is more of a sanity check on the DGP than a practical approach." ] }, { "cell_type": "markdown", "id": "i1j2k3l4", "metadata": {}, "source": [ "## 4. Summary" ] }, { "cell_type": "markdown", "id": "563caf9e-7fcf-49d5-bb22-951ff8a4f624", "metadata": {}, "source": [ "For reporting: LTA reflects what most attribution systems currently credit, MTA shows what credit would look like with carryover modeled, and causal attribution estimates actual business impact. 
When presenting to stakeholders, you can show all three: the gap between LTA and the MTA/causal estimates highlights where budget reallocation opportunities exist." ] }, { "cell_type": "code", "execution_count": null, "id": "m5n6o7p8", "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(\"=\" * 70)\n", "print(\"Summary: LTA vs MTA vs Incrementality\")\n", "print(\"=\" * 70)\n", "print(f\"{'Channel':<10} {'LTA %':<12} {'MTA %':<12} {'Shift':<12} {'Actual iROI':<12}\")\n", "print(\"-\" * 58)\n", "\n", "for ch in channels:\n", " lta_pct = lta_totals[ch] / total_lta * 100\n", " mta_pct = mta_contrib[ch] / total_mta * 100\n", " shift = mta_pct - lta_pct\n", " iroi = ground_truth['iroi'][ch]\n", " print(f\"{ch:<10} {lta_pct:<12.1f} {mta_pct:<12.1f} {shift:+11.1f}pp ${iroi:<11.2f}\")\n", "\n", "print(\"-\" * 58)\n", "print(f\"\\nSP+SB LTA share: {(lta_totals['SP']+lta_totals['SB'])/total_lta*100:.0f}%\")\n", "print(f\"SP+SB MTA share: {(mta_contrib['SP']+mta_contrib['SB'])/total_mta*100:.0f}%\")\n", "print(f\"Display+Video gain: {(mta_contrib['Display']+mta_contrib['Video'])/total_mta*100 - (lta_totals['Display']+lta_totals['Video'])/total_lta*100:+.0f}pp\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }