Attributing Distributional Changes
==================================

When attributing distribution changes, we answer the question:

    What mechanism in my system changed between two sets of data?

For example, in a distributed computing system, we want to know why an important system metric changed for the worse.

How to use it
^^^^^^^^^^^^^^

To see how the method works, let's take the example from above and assume we have a system of three services X, Y, Z, each producing latency numbers. The first dataset ``data_old`` was collected before a deployment, and ``data_new`` was collected after it:

>>> import networkx as nx, numpy as np, pandas as pd
>>> from dowhy import gcm
>>> from scipy.stats import halfnorm

>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000)
>>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

The change here simulates an accidental conversion of multi-threaded code into sequential code (waiting for X and Y in parallel vs. waiting for them one after the other). Only the mechanism that produces Z differs between the two datasets; X and Y are drawn from the same distributions in both.

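As a quick sanity check on the data-generating process above, the following standard-library sketch reproduces the two regimes and measures how far the mean of Z moves. It is only an illustration: the helper names ``half_normal`` and ``simulate_z`` and the fixed seed are assumptions made for this example, not part of ``dowhy``.

```python
import random
import statistics


def half_normal(rng, loc, scale):
    # Same parameterisation as scipy.stats.halfnorm: loc + scale * |N(0, 1)|.
    return loc + scale * abs(rng.gauss(0.0, 1.0))


def simulate_z(rng, combine, n=1000):
    # Draw X and Y as above, combine them, and add unit-variance Gaussian noise.
    samples = []
    for _ in range(n):
        x = half_normal(rng, 0.5, 0.2)
        y = half_normal(rng, 1.0, 0.2)
        samples.append(combine(x, y) + rng.gauss(0.0, 1.0))
    return samples


rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable
z_old = simulate_z(rng, max)                  # parallel: wait for the slower service
z_new = simulate_z(rng, lambda x, y: x + y)   # sequential: wait for both in turn

mean_shift = statistics.mean(z_new) - statistics.mean(z_old)
print(f"mean latency shift in Z: {mean_shift:.2f}")
```

Because Y is almost always larger than X under these parameters, ``max(X, Y)`` is close to Y, so the shift should land near the mean of X (roughly 0.66), up to sampling noise.
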
Next, we'll model cause-effect relationships as a probabilistic causal model:

>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')]))  # X -> Z <- Y
>>> causal_model.set_causal_mechanism('X', gcm.EmpiricalDistribution())
>>> causal_model.set_causal_mechanism('Y', gcm.EmpiricalDistribution())
>>> causal_model.set_causal_mechanism('Z', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))

Finally, we attribute changes in distributions to changes in causal mechanisms:

>>> attributions = gcm.distribution_change(causal_model, data_old, data_new, 'Z')
>>> attributions
{'X': -0.0066425020480165905, 'Y': 0.009816959724738061, 'Z': 0.21957816956354193}

As we can see, :math:`Z` got the highest attribution score here, which matches what we would expect, given that we changed the mechanism for variable :math:`Z` in our data generation. Since the data is sampled randomly, the exact scores vary from run to run.

As the reader may have noticed, there is no fitting step involved when using this method. The reason is that this function calls ``fit`` internally. To be precise, it makes two copies of the causal graph and fits one copy to the first dataset and the other copy to the second dataset.
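
To build intuition for why fitting one model per dataset exposes a changed mechanism, here is a rough standard-library sketch. It is not the attribution algorithm used by ``gcm.distribution_change``; it simply fits a least-squares slope of Z on each parent separately for the old and the new data and compares them. The helper names, the seed, and the larger sample size of 20000 (chosen to reduce noise in the slopes) are all assumptions made for this illustration.

```python
import random


def half_normal(rng, loc, scale):
    # loc + scale * |N(0, 1)|, the parameterisation used by scipy.stats.halfnorm.
    return loc + scale * abs(rng.gauss(0.0, 1.0))


def sample_rows(rng, combine, n):
    # Each row is (x, y, z) with z = combine(x, y) + unit-variance Gaussian noise.
    rows = []
    for _ in range(n):
        x = half_normal(rng, 0.5, 0.2)
        y = half_normal(rng, 1.0, 0.2)
        rows.append((x, y, combine(x, y) + rng.gauss(0.0, 1.0)))
    return rows


def slope(inputs, outputs):
    # Ordinary least-squares slope of outputs on a single input: cov / var.
    mean_in = sum(inputs) / len(inputs)
    mean_out = sum(outputs) / len(outputs)
    cov = sum((a - mean_in) * (b - mean_out) for a, b in zip(inputs, outputs))
    var = sum((a - mean_in) ** 2 for a in inputs)
    return cov / var


def fit_slopes(rows):
    # "Fit" step for one dataset: one slope per parent of Z.
    xs, ys, zs = zip(*rows)
    return {"X": slope(xs, zs), "Y": slope(ys, zs)}


rng = random.Random(0)  # fixed seed (an assumption) for repeatability
n = 20000               # larger than the 1000 samples above, to tame noise

slopes_old = fit_slopes(sample_rows(rng, max, n))                 # fitted on old data
slopes_new = fit_slopes(sample_rows(rng, lambda x, y: x + y, n))  # fitted on new data

slope_change = {name: slopes_new[name] - slopes_old[name] for name in slopes_old}
print(slope_change)
```

In this sketch the dependence of Z on Y should stay roughly the same across the two fits, while its dependence on X should grow noticeably, which is consistent with the mechanism of Z having changed while the distributions of X and Y did not.
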