Causal Attributions and Root-Cause Analysis in an Online Shop

This notebook is an extended and updated version of the corresponding blog post: Root Cause Analysis with DoWhy, an Open Source Python Library for Causal Machine Learning

In this example, we look at an online store and analyze how different factors influence our profit. In particular, we want to analyze an unexpected drop in profit and identify the potential root cause of it. For this, we can make use of Graphical Causal Models (GCM).

The scenario

Suppose we are selling a smartphone in an online shop with a retail price of $999. The overall profit from the product depends on several factors, such as the number of sold units, operational costs or ad spending. On the other hand, the number of sold units, for instance, depends on the number of visitors on the product page, the price itself and potential ongoing promotions. Suppose we observe a steady profit of our product over the year 2021, but suddenly, there is a significant drop in profit at the beginning of 2022. Why?

In the following scenario, we will use DoWhy to get a better understanding of the causal impacts of factors influencing the profit and to identify the causes for the profit drop. To analyze our problem at hand, we first need to define our belief about the causal relationships. For this, we collect daily records of the different factors affecting profit. These factors are:

  • Shopping Event?: A binary value indicating whether a special shopping event took place, such as Black Friday or Cyber Monday sales.

  • Ad Spend: Spending on ad campaigns.

  • Page Views: Number of visits on the product detail page.

  • Unit Price: Price of the device, which could vary due to temporary discounts.

  • Sold Units: Number of sold phones.

  • Revenue: Daily revenue.

  • Operational Cost: Daily operational expenses which includes production costs, spending on ads, administrative expenses, etc.

  • Profit: Daily profit.

Looking at these attributes, we can use our domain knowledge to describe the cause-effect relationships in the form of a directed acyclic graph, which represents our causal graph in the following. The graph is shown here:

[1]:
from IPython.display import Image
Image('online-shop-graph.png')
[1]:
../_images/example_notebooks_gcm_online_shop_4_0.png

In this scenario we know the following:

Shopping Event? impacts:
→ Ad Spend: To promote the product on special shopping events, we require additional ad spending.
→ Page Views: Shopping events typically attract a large number of visitors to an online retailer due to discounts and various offers.
→ Unit Price: Typically, retailers offer some discount on the usual retail price on days with a shopping event.
→ Sold Units: Shopping events often take place during annual celebrations like Christmas, Father’s day, etc, when people often buy more than usual.
Ad Spend impacts:
→ Page Views: The more we spend on ads, the more likely people will visit the product page.
→ Operational Cost: Ad spending is part of the operational cost.
Page Views impacts:
→ Sold Units: The more people visiting the product page, the more likely the product is bought. This is quite obvious seeing that if no one would visit the page, there wouldn’t be any sale.
Unit Price impacts:
→ Sold Units: The higher/lower the price, the less/more units are sold.
→ Revenue: The daily revenue typically consist of the product of the number of sold units and unit price.
Sold Units impacts:
→ Sold Units: Same argument as before, the number of sold units heavily influences the revenue.
→ Operational Cost: There is a manufacturing cost for each unit we produce and sell. The more units we well the higher the revenue, but also the higher the manufacturing costs.
Operational Cost impacts:
→ Profit: The profit is based on the generated revenue minus the operational cost.
Revenue impacts:
→ Profit: Same reason as for the operational cost.

Step 1: Define causal model

Now, let us model these causal relationships. In the first step, we need to define a so-called structural causal model (SCM), which is a combination of the causal graph and the underlying generative models describing the data generation process.

The causal graph can be defined via:

[2]:
import networkx as nx

causal_graph = nx.DiGraph([('Page Views', 'Sold Units'),
                           ('Revenue', 'Profit'),
                           ('Unit Price', 'Sold Units'),
                           ('Unit Price', 'Revenue'),
                           ('Shopping Event?', 'Page Views'),
                           ('Shopping Event?', 'Sold Units'),
                           ('Shopping Event?', 'Unit Price'),
                           ('Shopping Event?', 'Ad Spend'),
                           ('Ad Spend', 'Page Views'),
                           ('Ad Spend', 'Operational Cost'),
                           ('Sold Units', 'Revenue'),
                           ('Sold Units', 'Operational Cost'),
                           ('Operational Cost', 'Profit')])

To verify that we did not forget an edge, we can plot this graph:

[3]:
from dowhy.utils import plot

plot(causal_graph)
../_images/example_notebooks_gcm_online_shop_10_0.png

Next, we look at the data from 2021:

[4]:
import pandas as pd
import numpy as np

pd.options.display.float_format = '${:,.2f}'.format  # Format dollar columns

data_2021 = pd.read_csv('2021 Data.csv', index_col='Date')
data_2021.head()
[4]:
Shopping Event? Ad Spend Page Views Unit Price Sold Units Revenue Operational Cost Profit
Date
2021-01-01 False $1,490.49 11861 $999.00 2317 $2,314,683.00 $1,659,999.89 $654,683.11
2021-01-02 False $1,455.92 11776 $999.00 2355 $2,352,645.00 $1,678,959.08 $673,685.92
2021-01-03 False $1,405.82 11861 $999.00 2391 $2,388,609.00 $1,696,906.14 $691,702.86
2021-01-04 False $1,379.30 11677 $999.00 2344 $2,341,656.00 $1,673,380.64 $668,275.36
2021-01-05 False $1,234.20 11871 $999.00 2412 $2,409,588.00 $1,707,252.61 $702,335.39

As we see, we have one sample for each day in 2021 with all the variables in the causal graph. Note that in the synthetic data we consider here, shopping events were also generated randomly.

We defined the causal graph, but we still need to assign generative models to the nodes. We can either manually specify those models, and configure them if needed, or automatically infer “appropriate” models using heuristics from data. We will leverage the latter here:

[5]:
from dowhy import gcm

# Create the structural causal model object
scm = gcm.StructuralCausalModel(causal_graph)

# Automatically assign generative models to each node based on the given data
auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD)

Whenever available, we recommend assigning models based on prior knowledge as then models would closely mimic the physics of the domain, and not rely on nuances of the data. However, here we asked DoWhy to do this for us instead.

After automatically assign the models, we can print a summary to obtain some insights into the selected models:

[6]:
print(auto_assignment_summary)
When using this auto assignment function, the given data is used to automatically assign a causal mechanism to each node. Note that causal mechanisms can also be customized and assigned manually.
The following types of causal mechanisms are considered for the automatic selection:

If root node:
An empirical distribution, i.e., the distribution is represented by randomly sampling from the provided data. This provides a flexible and non-parametric way to model the marginal distribution and is valid for all types of data modalities.

If non-root node and the data is continuous:
Additive Noise Models (ANM) of the form X_i = f(PA_i) + N_i, where PA_i are the parents of X_i and the unobserved noise N_i is assumed to be independent of PA_i.To select the best model for f, different regression models are evaluated and the model with the smallest mean squared error is selected.Note that minimizing the mean squared error here is equivalent to selecting the best choice of an ANM.

If non-root node and the data is discrete:
Discrete Additive Noise Models have almost the same definition as non-discrete ANMs, but come with an additional constraint for f to only return discrete values.
Note that 'discrete' here refers to numerical values with an order. If the data is categorical, consider representing them as strings to ensure proper model selection.

If non-root node and the data is categorical:
A functional causal model based on a classifier, i.e., X_i = f(PA_i, N_i).
Here, N_i follows a uniform distribution on [0, 1] and is used to randomly sample a class (category) using the conditional probability distribution produced by a classification model.Here, different model classes are evaluated using the (negative) F1 score and the best performing model class is selected.

In total, 8 nodes were analyzed:

--- Node: Shopping Event?
Node Shopping Event? is a root node. Therefore, assigning 'Empirical Distribution' to the node representing the marginal distribution.

--- Node: Unit Price
Node Unit Price is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
This represents the causal relationship as Unit Price := f(Shopping Event?) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
LinearRegression: 157.79163619768184
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 157.792175065806
HistGradientBoostingRegressor: 423.9719537935019

--- Node: Ad Spend
Node Ad Spend is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
This represents the causal relationship as Ad Spend := f(Shopping Event?) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
LinearRegression: 16319.500330115121
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 16595.23940944529
HistGradientBoostingRegressor: 80714.44787173529

--- Node: Page Views
Node Page Views is a non-root node with discrete data. Assigning 'Discrete AdditiveNoiseModel using LinearRegression' to the node.
This represents the discrete causal relationship as Page Views := f(Ad Spend,Shopping Event?) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
LinearRegression: 76312.30613366146
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 82110.52324357697
HistGradientBoostingRegressor: 1385356.6867034372

--- Node: Sold Units
Node Sold Units is a non-root node with discrete data. Assigning 'Discrete AdditiveNoiseModel using LinearRegression' to the node.
This represents the discrete causal relationship as Sold Units := f(Page Views,Shopping Event?,Unit Price) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
LinearRegression: 8895.211650261243
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 17913.812554860324
HistGradientBoostingRegressor: 248984.11844653316

--- Node: Revenue
Node Revenue is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using Pipeline' to the node.
This represents the causal relationship as Revenue := f(Sold Units,Unit Price) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 3.1367602579306645e-19
LinearRegression: 75887386.6304224
HistGradientBoostingRegressor: 139007431109.02496

--- Node: Operational Cost
Node Operational Cost is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using Pipeline' to the node.
This represents the causal relationship as Operational Cost := f(Ad Spend,Sold Units) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 38.93918389063613
LinearRegression: 39.03186306098712
HistGradientBoostingRegressor: 16790847483.6117

--- Node: Profit
Node Profit is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
This represents the causal relationship as Profit := f(Operational Cost,Revenue) + N.
For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
LinearRegression: 1.3096568160242712e-18
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('linearregression', LinearRegression)]): 1.1839842279223088e-06
HistGradientBoostingRegressor: 20316936213.927013

===Note===
Note, based on the selected auto assignment quality, the set of evaluated models changes.
For more insights toward the quality of the fitted graphical causal model, consider using the evaluate_causal_model function after fitting the causal mechanisms.

As we see, while the auto assignment also considered non-linear models, a linear model is sufficient for most relationships, except for Revenue, which is the product of Sold Units and Unit Price.

Step 2: Fit causal models to data

After assigning a model to each node, we need to learn the parameters of the model:

[7]:
gcm.fit(scm, data_2021)
Fitting causal mechanism of node Operational Cost: 100%|██████████| 8/8 [00:00<00:00, 366.59it/s]

The fit method learns the parameters of the generative models in each node. Before we continue, let’s have a quick look into the performance of the causal mechanisms and how well they capture the distribution:

[8]:
print(gcm.evaluate_causal_model(scm, data_2021, compare_mechanism_baselines=True, evaluate_invertibility_assumptions=False))
Evaluating causal mechanisms...: 100%|██████████| 8/8 [00:00<00:00, 1019.86it/s]
Test permutations of given graph: 100%|██████████| 50/50 [00:15<00:00,  3.23it/s]
../_images/example_notebooks_gcm_online_shop_22_1.png
Evaluated the performance of the causal mechanisms and the overall average KL divergence between generated and observed distribution and the graph structure. The results are as follows:

==== Evaluation of Causal Mechanisms ====
The used evaluation metrics are:
- KL divergence (only for root-nodes): Evaluates the divergence between the generated and the observed distribution.
- Mean Squared Error (MSE): Evaluates the average squared differences between the observed values and the conditional expectation of the causal mechanisms.
- Normalized MSE (NMSE): The MSE normalized by the standard deviation for better comparison.
- R2 coefficient: Indicates how much variance is explained by the conditional expectations of the mechanisms. Note, however, that this can be misleading for nonlinear relationships.
- F1 score (only for categorical non-root nodes): The harmonic mean of the precision and recall indicating the goodness of the underlying classifier model.
- (normalized) Continuous Ranked Probability Score (CRPS): The CRPS generalizes the Mean Absolute Percentage Error to probabilistic predictions. This gives insights into the accuracy and calibration of the causal mechanisms.
NOTE: Every metric focuses on different aspects and they might not consistently indicate a good or bad performance.
We will mostly utilize the CRPS for comparing and interpreting the performance of the mechanisms, since this captures the most important properties for the causal model.

--- Node Shopping Event?
- The KL divergence between generated and observed distribution is 0.31816246480702554.
The estimated KL divergence indicates an overall very good representation of the data distribution.

--- Node Unit Price
- The MSE is 142.68654632246577.
- The NMSE is 1.6347283705531683.
- The R2 coefficient is -7.204640164205758.
- The normalized CRPS is 0.2440986264900173.
The estimated CRPS indicates a good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

--- Node Ad Spend
- The MSE is 16253.505906786368.
- The NMSE is 0.5969671127486269.
- The R2 coefficient is 0.5787175531672434.
- The normalized CRPS is 0.3484040351794088.
The estimated CRPS indicates a good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

--- Node Page Views
- The MSE is 99568.83835616439.
- The NMSE is 0.3336113809492044.
- The R2 coefficient is 0.7805477123354272.
- The normalized CRPS is 0.16539902924581723.
The estimated CRPS indicates a very good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

--- Node Sold Units
- The MSE is 9065.813698630136.
- The NMSE is 0.2864724678087952.
- The R2 coefficient is 0.8001271533017116.
- The normalized CRPS is 0.12039085993581684.
The estimated CRPS indicates a very good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

--- Node Revenue
- The MSE is 3.8556011503731087e-19.
- The NMSE is 1.6546116938603654e-15.
- The R2 coefficient is 1.0.
- The normalized CRPS is 3.0839647426347093e-16.
The estimated CRPS indicates a very good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

--- Node Operational Cost
- The MSE is 39.10243995700817.
- The NMSE is 1.8974258718893184e-05.
- The R2 coefficient is 0.9999999996135077.
- The normalized CRPS is 1.0437794352052223e-05.
The estimated CRPS indicates a very good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

--- Node Profit
- The MSE is 5.937863404927461e-19.
- The NMSE is 3.604851662130177e-15.
- The R2 coefficient is 1.0.
- The normalized CRPS is 3.418333500582639e-16.
The estimated CRPS indicates a very good model performance.
The mechanism is better or equally good than all 7 baseline mechanisms.

==== Evaluation of Generated Distribution ====
The overall average KL divergence between the generated and observed distribution is 1.0358152874483946
The estimated KL divergence indicates some mismatches between the distributions.

==== Evaluation of the Causal Graph Structure ====
+-------------------------------------------------------------------------------------------------------+
|                                         Falsificaton Summary                                          |
+-------------------------------------------------------------------------------------------------------+
| The given DAG is informative because 0 / 50 of the permutations lie in the Markov                     |
| equivalence class of the given DAG (p-value: 0.00).                                                   |
| The given DAG violates 6/18 LMCs and is better than 70.0% of the permuted DAGs (p-value: 0.30).       |
| Based on the provided significance level (0.2) and because the DAG is informative,                    |
| we reject the DAG.                                                                                    |
+-------------------------------------------------------------------------------------------------------+

==== NOTE ====
Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms.
All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms.

The fitted causal mechanisms are fairly good representations of the data generation process, with some minor inaccuracies. However, this is to be expected given the small sample size and relatively small signal-to-noise ratio for many nodes. Most importantly, all the baseline mechanisms did not perform better, which is a good indicator that our model selection is appropriate. Based on the evaluation, we also do not reject the given causal graph.

The selection of baseline models or the p-value for graph falsification can be configured as well. For more details, take a look at the corresponding evaluate_causal_model documentation.

Step 3: Answer causal questions

Generate new samples

Since we learned about the data generation process, we can also generate new samples:

[9]:
gcm.draw_samples(scm, num_samples=10)
[9]:
Shopping Event? Unit Price Ad Spend Page Views Sold Units Revenue Operational Cost Profit
0 False $999.00 $1,494.94 11961 2443 $2,440,557.00 $1,723,000.93 $717,556.07
1 False $999.00 $1,102.06 11698 2337 $2,334,663.00 $1,669,611.92 $665,051.08
2 False $999.00 $1,243.60 11780 2459 $2,456,541.00 $1,730,750.87 $725,790.13
3 False $999.00 $1,348.47 11595 2325 $2,322,675.00 $1,663,857.27 $658,817.73
4 True $913.46 $2,557.68 20052 5887 $5,377,521.54 $3,446,071.78 $1,931,449.77
5 False $999.00 $1,217.63 11556 2311 $2,308,689.00 $1,656,721.84 $651,967.16
6 False $961.47 $1,169.87 11579 2605 $2,504,626.57 $1,803,671.87 $700,954.70
7 False $999.00 $1,259.66 11801 2485 $2,482,515.00 $1,743,762.13 $738,752.87
8 False $999.00 $1,257.03 11577 2301 $2,298,699.00 $1,651,756.66 $646,942.34
9 False $999.00 $1,385.51 11776 2338 $2,335,662.00 $1,670,401.68 $665,260.32

We have drawn 10 samples from the joint distribution following the learned causal relationships.

What are the key factors influencing the variance in profit?

At this point, we want to understand which factors drive changes in the Profit. Let us first have a closer look at the Profit over time. For this, we plot the Profit over time for 2021, where the produced plot shows the Profit in dollars on the Y-axis and the time on the X-axis.

[10]:
data_2021['Profit'].plot(ylabel='Profit in $', figsize=(15,5), rot=45)
[10]:
<Axes: xlabel='Date', ylabel='Profit in $'>
../_images/example_notebooks_gcm_online_shop_30_1.png

We see some significant spikes in the Profit across the year. We can further quantify this by looking at the standard deviation:

[11]:
data_2021['Profit'].std()
[11]:
$\displaystyle 259247.66010978$

The estimated standard deviation of ~259247 dollars is quite significant. Looking at the causal graph, we see that Revenue and Operational Cost have a direct impact on the Profit, but which of them contribute the most to the variance? To find this out, we can make use of the direct arrow strength algorithm that quantifies the causal influence of a specific arrow in the graph:

[12]:
import numpy as np

# Note: The percentage conversion only makes sense for purely positive attributions.
def convert_to_percentage(value_dictionary):
    total_absolute_sum = np.sum([abs(v) for v in value_dictionary.values()])
    return {k: abs(v) / total_absolute_sum * 100 for k, v in value_dictionary.items()}


arrow_strengths = gcm.arrow_strength(scm, target_node='Profit')

plot(causal_graph,
     causal_strengths=convert_to_percentage(arrow_strengths),
     figure_size=[15, 10])
../_images/example_notebooks_gcm_online_shop_34_0.png

In this causal graph, we see how much each node contributes to the variance in Profit. For simplicity, the contributions are converted to percentages. Since Profit itself is only the difference between Revenue and Operational Cost, we do not expect further factors influencing the variance. As we see, Revenue has more impact than Operational Cost. This makes sense seeing that Revenue typically varies more than Operational Cost due to the stronger dependency on the number of sold units. Note that the direct arrow strength method also supports the use of other kinds of measures, for instance, KL divergence.

While the direct influences are helpful in understanding which direct parents influence the most on the variance in Profit, this mostly confirms our prior belief. The question of which factor is ultimately responsible for this high variance is, however, still unclear. For instance, Revenue itself is based on Sold Units and the Unit Price. Although we could recursively apply the direct arrow strength to all nodes, we would not get a correctly weighted insight into the influence of upstream nodes on the variance.

What are the important causal factors contributing to the variance in Profit? To find this out, we can use the intrinsic causal contribution method that attributes the variance in Profit to the upstream nodes in the causal graph by only considering information that is newly added by a node and not just inherited from its parents. For instance, a node that is simply a rescaled version of its parent would not have any intrinsic contribution. See the corresponding research paper for more details.

Let’s apply the method to the data:

[13]:
iccs = gcm.intrinsic_causal_influence(scm, target_node='Profit', num_samples_randomization=500)
Estimating Shapley Values. Average change of Shapley values in run 22 (110 evaluated permutations): 1.3304253352208222%: 100%|██████████| 1/1 [12:09<00:00, 729.53s/it]
[14]:
from dowhy.utils import bar_plot

bar_plot(convert_to_percentage(iccs), ylabel='Variance attribution in %')
../_images/example_notebooks_gcm_online_shop_37_0.png

The scores shown in this bar chart are percentages indicating how much variance each node is contributing to Profit — without inheriting the variance from its parents in the causal graph. As we see quite clearly, the Shopping Event has by far the biggest influence on the variance in our Profit. This makes sense, seeing that the sales are heavily impacted during promotion periods like Black Friday or Prime Day and, thus, impact the overall profit. Surprisingly, we also see that factors such as the number of sold units or number of page views have a rather small influence, i.e., the large variance in profit can be almost completely explained by the shopping events. Let’s check this visually by marking the days where we had a shopping event. To do so, we use the pandas plot function again, but additionally mark all points in the plot with a vertical red bar where a shopping event occured:

[15]:
import matplotlib.pyplot as plt

data_2021['Profit'].plot(ylabel='Profit in $', figsize=(15,5), rot=45)
plt.vlines(np.arange(0, data_2021.shape[0])[data_2021['Shopping Event?']], data_2021['Profit'].min(), data_2021['Profit'].max(), linewidth=10, alpha=0.3, color='r')
[15]:
<matplotlib.collections.LineCollection at 0x7f6c87dcde50>
../_images/example_notebooks_gcm_online_shop_39_1.png

We clearly see that the shopping events coincide with the high peaks in profit. While we could have investigated this manually by looking at all kinds of different relationships or using domain knowledge, the tasks gets much more difficult as the complexity of the system increases. With a few lines of code, we obtained these insights from DoWhy.

What are the key factors explaining the Profit drop on a particular day?

After a successful year in terms of profit, newer technologies come to the market and, thus, we want to keep the profit up and get rid of excess inventory by selling more devices. In order to increase the demand, we therefore lower the retail price by 10% at the beginning of 2022. Based on a prior analysis, we know that a decrease of 10% in the price would roughly increase the demand by 13.75%, a slight surplus. Following the price elasticity of demand model, we expect an increase of around 37.5% in number of Sold Units. Let us take a look if this is true by loading the data for the first day in 2022 and taking the fraction between the numbers of Sold Units from both years for that day:

[16]:
first_day_2022 = pd.read_csv('2022 First Day.csv', index_col='Date')
(first_day_2022['Sold Units'][0] / data_2021['Sold Units'][0] - 1) * 100
[16]:
$\displaystyle 18.9469141130773$

Surprisingly, we only increased the number of sold units by ~19%. This will certainly impact the profit given that the revenue is much smaller than expected. Let us compare it with the previous year at the same time:

[17]:
(1 - first_day_2022['Profit'][0] / data_2021['Profit'][0]) * 100
[17]:
$\displaystyle 8.57891513840979$

Indeed, the profit dropped by ~8.5%. Why is this the case seeing that we would expect a much higher demand due to the decreased price? Let us investigate what is going on here.

In order to figure out what contributed to the Profit drop, we can make use of DoWhy’s anomaly attribution feature. Here, we only need to specify the target node we are interested in (the Profit) and the anomaly sample we want to analyze (the first day of 2022). These results are then plotted in a bar chart indicating the attribution scores of each node for the given anomaly sample:

[18]:
attributions = gcm.attribute_anomalies(scm, target_node='Profit', anomaly_samples=first_day_2022)

bar_plot({k: v[0] for k, v in attributions.items()}, ylabel='Anomaly attribution score')
Estimating Shapley Values. Average change of Shapley values in run 33 (165 evaluated permutations): 1.420391402084745%: 100%|██████████| 1/1 [00:03<00:00,  3.19s/it]
../_images/example_notebooks_gcm_online_shop_47_1.png

A positive attribution score means that the corresponding node contributed to the observed anomaly, which is in our case the drop in Profit. A negative score of a node indicates that the observed value for the node is actually reducing the likelihood of the anomaly (e.g., a higher demand due to the decreased price should increase the profit). More details about the interpretation of the score can be found in the corresponding reserach paper. Interestingly, the Page Views stand out as a factor explaining the Profit drop that day as indicated in the bar chart shown here.

While this method gives us a point estimate of the attributions for the particular models and parameters we learned, we can also use DoWhy’s confidence interval feature, which incorporates uncertainties about the fitted model parameters and algorithmic approximations:

[19]:
gcm.config.disable_progress_bars()  # We turn off the progress bars here to reduce the number of outputs.

median_attributions, confidence_intervals, = gcm.confidence_intervals(
    gcm.fit_and_compute(gcm.attribute_anomalies,
                        scm,
                        bootstrap_training_data=data_2021,
                        target_node='Profit',
                        anomaly_samples=first_day_2022),
    num_bootstrap_resamples=10)
[20]:
bar_plot(median_attributions, confidence_intervals, 'Anomaly attribution score')
../_images/example_notebooks_gcm_online_shop_50_0.png

Note, in this bar chart we see the median attributions over multiple runs on smaller data sets, where each run re-fits the models and re-evaluates the attributions. We get a similar picture as before, but the confidence interval of the attribution to Sold Units also contains zero, meaning its contribution is insignificant. But some important questions still remain: Was this only a coincidence and, if not, which part in our system has changed? To find this out, we need to collect some more data.

Note that the results differ depending on the selected data, since they are sample specific. On other days, other factors could be relevant. Furthermore, note that the analysis (including the confidence intervals) always relies on the modeling assumptions made. In other words, if the models change or have a poor fit, one would also expect different results.

What caused the profit drop in Q1 2022?

While the previous analysis is based on a single observation, let us see if this was just coincidence or if this is a persistent issue. When preparing the quarterly business report, we have some more data available from the first three months. We first check if the profit dropped on average in the first quarter of 2022 as compared to 2021. Similar as before, we can do this by taking the fraction between the average Profit of 2022 and 2021 for the first quarter:

[21]:
data_first_quarter_2021 = data_2021[data_2021.index <= '2021-03-31']
data_first_quarter_2022 = pd.read_csv("2022 First Quarter.csv", index_col='Date')

(1 - data_first_quarter_2022['Profit'].mean() / data_first_quarter_2021['Profit'].mean()) * 100
[21]:
$\displaystyle 13.0494881794224$

Indeed, the profit drop is persistent in the first quarter of 2022. Now, what is the root cause of this? Let us apply the distribution change method to identify the part in the system that has changed:

[22]:
median_attributions, confidence_intervals = gcm.confidence_intervals(
    lambda: gcm.distribution_change(scm,
                                    data_first_quarter_2021,
                                    data_first_quarter_2022,
                                    target_node='Profit',
                                    # Here, we are intersted in explaining the differences in the mean.
                                    difference_estimation_func=lambda x, y: np.mean(y) - np.mean(x))
)
[23]:
bar_plot(median_attributions, confidence_intervals, 'Profit change attribution in $')
../_images/example_notebooks_gcm_online_shop_58_0.png

In our case, the distribution change method explains the change in the mean of Profit, i.e., a negative value indicates that a node contributes to a decrease and a positive value to an increase of the mean. Using the bar chart, we get now a very clear picture that the change in Unit Price has actually a slightly positive contribution to the expected Profit due to the increase of Sold Units, but it seems that the issue is coming from the Page Views which has a negative value. While we already understood this as a main driver of the drop at the beginning of 2022, we have now isolated and confirmed that something changed for the Page Views as well. Let’s compare the average Page Views with the previous year.

[24]:
(1 - data_first_quarter_2022['Page Views'].mean() / data_first_quarter_2021['Page Views'].mean()) * 100
[24]:
$\displaystyle 14.347627108364$

Indeed, the number of Page Views dropped by ~14%. Since we eliminated all other potential factors, we can now dive deeper into the Page Views and see what is going on there. This is a hypothetical scenario, but we could imagine it could be due to a change in the search algorithm which ranks this product lower in the results and therefore drives fewer customers to the product page. Knowing this, we could now start mitigating the issue.

Data generation process

While the exact same data cannot be reproduced, the following dataset generator should provide quite similar types of data and has various parameters to adjust:

[25]:
from dowhy.datasets import sales_dataset

data_2021 = sales_dataset(start_date="2021-01-01", end_date="2021-12-31")
data_2022 = sales_dataset(start_date="2022-01-01", end_date="2022-12-31", change_of_price=0.9)