Estimating Confidence Intervals ================================ Our trained generative causal models and specific sample set might not accurately represent the ground truth and there are typically numerical approximations in our algorithms. Confidence intervals help us quantify the uncertainty of our results introduced by these inaccuracies and approximations. We therefore recommend to use confidence intervals as a default for all causal queries. This avoids a false sense of confidence in a specific value that might be due to a skewed data set or models that are not fitted optimally. To compute confidence intervals, we use a method called `bootstrapping `_ and it's implemented in the function :py:func:`dowhy.gcm.confidence_intervals`. How to use it ------------- Let's say we have a function that `computes Pi using the Monte Carlo method `_ with 1000 trials: >>> import numpy as np >>> >>> def compute_pi_monte_carlo(): >>> trials = 1000 >>> return 4*(np.random.default_rng().uniform(-1, 1, (trials,))**2+ >>> np.random.default_rng().uniform(-1, 1, (trials,))**2 <= 1).sum() / trials Executing this function will give us a result: >>> compute_pi_monte_carlo() 3.188 This looks like a reasonable number, but we can't tell how reliable that result is, given that it was computed using a stochastic method. Now, let's use :py:func:`dowhy.gcm.confidence_intervals` to compute the confidence interval: >>> from dowhy import gcm >>> >>> median, intervals = gcm.confidence_intervals(compute_pi_monte_carlo, >>> num_bootstrap_resamples=1000) >>> median, intervals (array([3.136]), array([[3.0518, 3.224 ]])) The first item in the result is the median of all results calculated. The second is the interval in which we are 95% confident that computed values will fall into. By increasing the number of ``trials`` we can improve the reliability of ``compute_pi_monte_carlo`` and hence, narrow the confidence interval. E.g. with ``trials = 10_000`` we get an interval of [3.1132 , 3.16602], with ``trials = 100_000`` we get [3.13256 , 3.150016]. Computing confidence intervals for causal queries ------------------------------------------------- With this understanding, we can now apply ``confidence_intervals`` to one of our causal queries, e.g. ``distribution_change``. Let's generate some data first: >>> import pandas as pd >>> from scipy.stats import halfnorm >>> >>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2) >>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2) >>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000) >>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z)) >>> >>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2) >>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2) >>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000) >>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z)) Here, the causal mechanism for Z changes between the old and new data. As always, next, we'll model cause-effect relationships as an SCM and assign causal mechanisms: >>> import networkx as nx >>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')])) # X -> Z <- Y >>> gcm.auto.assign_causal_mechanisms(causal_model, data_old) And now, instead of calling ``distribution_change`` directly we call it using ``confidence_intervals``. However, since ``confidence_intervals`` takes a function that takes no parameters, we must first define a function without parameters: >>> def f(): >>> return gcm.distribution_change(causal_model, data_old, data_new, 'Z') And now: >>> gcm.confidence_intervals(f) ({'X': 0.0005804787010800414, 'Y': 0.0030143299612704804, 'Z': 0.1724206687905231}, {'X': array([-0.00427554, 0.00891714]), 'Y': array([-0.00605089, 0.01782684]), 'Z': array([0.15441771, 0.19062329])}) The result again contains a first item, which is the median values of the contribution scores. The second item contains the intervals of the contribution scores for each variable. To avoid defining a new function, we can streamline that call by using a lambda: >>> gcm.confidence_intervals(lambda: gcm.distribution_change(causal_model, >>> data_old, data_new, >>> target_node='Z')) Conveniently bootstrapping graph training on random subsets of training data ---------------------------------------------------------------------------- Many of the causal queries in the GCM package require a trained causal graph as a first argument. To compute confidence intervals for these methods, we need to explicitly re-train our causal graph multiple times with different random subsets of data and also run our causal query with each newly trained graph. To do this conveniently, the GCM package provides a function ``bootstrap_training_and_sampling``. Assuming that we have ``data`` and a causal graph: >>> Z = np.random.normal(loc=0, scale=1, size=1000) >>> X = 2*Z + np.random.normal(loc=0, scale=1, size=1000) >>> Y = 3*X + 4*Z + np.random.normal(loc=0, scale=1, size=1000) >>> data = pd.DataFrame(dict(X=X, Y=Y, Z=Z)) >>> >>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('Z', 'Y'), ('Z', 'X'), ('X', 'Y')])) >>> gcm.auto.assign_causal_mechanisms(causal_model, data_old) we can now use ``bootstrap_training_and_sampling`` as follows: >>> strength_median, strength_intervals = gcm.confidence_intervals( >>> gcm.bootstrap_training_and_sampling(gcm.arrow_strength, >>> causal_model, >>> bootstrap_training_data=data, >>> target_node='Y')) >>> strength_median, strength_intervals ({('X', 'Y'): 45.90886398636573, ('Z', 'Y'): 15.47129383737619}, {('X', 'Y'): array([42.88319632, 50.43890079]), ('Z', 'Y'): array([13.44202416, 17.74266107])}) Runtime cost versus confidence ------------------------------ In certain scenarios it is prohibitely expensive to re-train causal graphs multiple times. E.g. when using AutoGluon as prediction models for the additive noise model, a single ``fit`` execution can quickly go into hours. Bootstrapping with 20 resamples will then quickly go into days depending on how much we can parallelize. For that reason, sometimes the tradeoff is to only bootstrap on the causal query, not on the training. To make this analogous to the approach we used above, there is the function :py:func:`bootstrap_sampling`. This function assumes that the causal graph is already trained. Then it can be used as follows: >>> gcm.fit(causal_model, data) >>> >>> strength_median, strength_intervals = gcm.confidence_intervals( >>> gcm.bootstrap_sampling(gcm.arrow_strength, >>> causal_model, >>> target_node='Y')) >>> >>> strength_median, strength_intervals ({('X', 'Y'): 46.07299374572871, ('Z', 'Y'): 15.358850280972195}, {('X', 'Y'): array([44.95914495, 47.63918151]), ('Z', 'Y'): array([15.04323069, 15.72570547])}) :py:func:`dowhy.gcm.bootstrap_sampling` accomplishes the same thing as the lambda we've seen earlier. However, using ``bootstrap_sampling`` is more expressive. We therefore recommend its use over a lambda for all causal queries that take a causal graph as the first argument. Understanding the need for confidence intervals ----------------------------------------------- As explained in the beginnging, the models and specific sample set might not accurately represent the ground truth and there are typically numerical approximations in our algorithms. The error comes from three sources: - Variance of the "optimal" parameters for a model, i.e. running ``fit`` twice with the same/slightly different data can yield two different models (becomes even worse when there is a stochastic element in the fit process of the prediction models as well). For instance, training an AutoGluon model twice on even exactly the same data would return two different models with slightly different performance. - Approximations in our algorithms. For instance, when estimating distribution change with 6+ upstream nodes, we will only approximate the Shapley values (the approximation has a stochastic factor). Therefore, running distribution change twice with exactly the same input and generated samples will return two different results. - Even if we do not approximate the Shapley values, the algorithms are typically based on some specific samples from the generative models, i.e. variance can also come from the variance between two sets of drawn samples. Therefore, even if the algorithm itself is deterministic (e.g. evaluate all possible subsets for the Shapley values), we would use randomly generated samples which yields different results when calling an algorithm twice.