# Estimating Confidence Intervals

Our trained generative causal models and specific sample set might not accurately represent the ground truth and there are typically numerical approximations in our algorithms. When we answer a causal query without computing its confidence intervals, we essentially obtain point estimates. However, these are not very useful when assessing the confidence in our results. Confidence intervals assist us in quantifying the uncertainty in our results introduced by modeling inaccuracies and numerical approximations.

We therefore recommend to use confidence intervals as a default for all causal queries. This avoids a false sense of confidence in a specific value that might be due to a skewed data set or models that are not fitted optimally.

To compute confidence intervals, we use a method called bootstrapping and it’s implemented in the
function `dowhy.gcm.confidence_intervals()`

.

## How to use it

Let’s say we have a function that computes Pi using the Monte Carlo method with 1000 trials:

```
>>> import numpy as np
>>>
>>> def compute_pi_monte_carlo():
>>> trials = 1000
>>> return 4*(np.random.default_rng().uniform(-1, 1, (trials,))**2+
>>> np.random.default_rng().uniform(-1, 1, (trials,))**2 <= 1).sum() / trials
```

Executing this function will give us a result:

```
>>> compute_pi_monte_carlo()
3.188
```

This looks like a reasonable number, but we can’t tell how reliable that result is, given that it
was computed using a stochastic method. Now, let’s use
`dowhy.gcm.confidence_intervals()`

to compute the confidence interval:

```
>>> from dowhy import gcm
>>>
>>> median, intervals = gcm.confidence_intervals(compute_pi_monte_carlo,
>>> num_bootstrap_resamples=1000)
>>> median, intervals
(array([3.136]), array([[3.0518, 3.224 ]]))
```

The first item in the result is the median of all results calculated. The second is the interval
in which we are 95% confident that computed values will fall into. By increasing the number of
`trials`

we can improve the reliability of `compute_pi_monte_carlo`

and hence, narrow the
confidence interval. E.g. with `trials = 10_000`

we get an interval of [3.1132 , 3.16602], with
`trials = 100_000`

we get [3.13256 , 3.150016].

## Computing confidence intervals for causal queries

With this understanding, we can now apply `confidence_intervals`

to one of our causal queries,
e.g. `distribution_change`

. Let’s generate some data first:

```
>>> import pandas as pd
>>> from scipy.stats import halfnorm
>>>
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000)
>>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
>>>
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
```

Here, the causal mechanism for Z changes between the old and new data. As always, next, we’ll model cause-effect relationships as an SCM and assign causal mechanisms:

```
>>> import networkx as nx
>>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')])) # X -> Z <- Y
>>> gcm.auto.assign_causal_mechanisms(causal_model, data_old)
```

And now, instead of calling `distribution_change`

directly we call it using
`confidence_intervals`

. However, since `confidence_intervals`

takes a function that takes no
parameters, we must first define a function without parameters:

```
>>> def f():
>>> return gcm.distribution_change(causal_model, data_old, data_new, 'Z')
```

And now:

```
>>> gcm.confidence_intervals(f)
({'X': 0.0005804787010800414, 'Y': 0.0030143299612704804, 'Z': 0.1724206687905231},
{'X': array([-0.00427554, 0.00891714]),
'Y': array([-0.00605089, 0.01782684]),
'Z': array([0.15441771, 0.19062329])})
```

The result again contains a first item, which is the median values of the contribution scores. The second item contains the intervals of the contribution scores for each variable.

To avoid defining a new function, we can also streamline that call by using a lambda:

```
>>> gcm.confidence_intervals(lambda: gcm.distribution_change(causal_model,
>>> data_old, data_new,
>>> target_node='Z'))
```

## Conveniently bootstrapping graph training on random subsets of training data

Many of the causal queries in the GCM package require a trained causal graph as a first argument. To
compute confidence intervals for these methods, we need to explicitly re-train our causal graph
multiple times with different random subsets of data and also run our causal query with each newly
trained graph. To do this conveniently, the GCM package provides a function
`fit_and_compute`

. Assuming that we have `data`

and a causal graph:

```
>>> Z = np.random.normal(loc=0, scale=1, size=1000)
>>> X = 2*Z + np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 3*X + 4*Z + np.random.normal(loc=0, scale=1, size=1000)
>>> data = pd.DataFrame(dict(X=X, Y=Y, Z=Z))
>>>
>>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('Z', 'Y'), ('Z', 'X'), ('X', 'Y')]))
>>> gcm.auto.assign_causal_mechanisms(causal_model, data_old)
```

we can now use `fit_and_compute`

as follows:

```
>>> strength_median, strength_intervals = gcm.confidence_intervals(
>>> gcm.fit_and_compute(gcm.arrow_strength,
>>> causal_model,
>>> bootstrap_training_data=data,
>>> target_node='Y'))
>>> strength_median, strength_intervals
({('X', 'Y'): 45.90886398636573, ('Z', 'Y'): 15.47129383737619},
{('X', 'Y'): array([42.88319632, 50.43890079]), ('Z', 'Y'): array([13.44202416, 17.74266107])})
```

## Runtime cost versus confidence

In certain scenarios it is prohibitely expensive to re-train causal graphs multiple times. E.g. when
using AutoGluon as prediction models for the additive noise model, a single `fit`

execution can
quickly go into hours. Bootstrapping with 20 resamples will then quickly go into days depending
on how much we can parallelize.

For that reason, sometimes the tradeoff is to only bootstrap on the causal query, not on the
training. To make this analogous to the approach we used above, there is the function
`bootstrap_sampling()`

. This function assumes that the causal graph is already trained. Then
it can be used as follows:

```
>>> gcm.fit(causal_model, data)
>>>
>>> strength_median, strength_intervals = gcm.confidence_intervals(
>>> gcm.bootstrap_sampling(gcm.arrow_strength,
>>> causal_model,
>>> target_node='Y'))
>>>
>>> strength_median, strength_intervals
({('X', 'Y'): 46.07299374572871, ('Z', 'Y'): 15.358850280972195},
{('X', 'Y'): array([44.95914495, 47.63918151]), ('Z', 'Y'): array([15.04323069, 15.72570547])})
```

`dowhy.gcm.bootstrap_sampling()`

accomplishes the same thing as the lambda we’ve seen earlier. However,
using `bootstrap_sampling`

is more expressive. We therefore recommend its use over a lambda for
all causal queries that take a causal graph as the first argument.

## Understanding the need for confidence intervals

As explained in the beginnging, the models and specific sample set might not accurately represent the ground truth and there are typically numerical approximations in our algorithms. The error comes from three sources:

Variance of the “optimal” parameters for a model, i.e. running

`fit`

twice with the same/slightly different data can yield two different models (becomes even worse when there is a stochastic element in the fit process of the prediction models as well). For instance, training an AutoGluon model twice on even exactly the same data would return two different models with slightly different performance.Approximations in our algorithms. For instance, when estimating distribution change with 6+ upstream nodes, we will only approximate the Shapley values (the approximation has a stochastic factor). Therefore, running distribution change twice with exactly the same input and generated samples will return two different results.

Even if we do not approximate the Shapley values, the algorithms are typically based on some specific samples from the generative models, i.e. variance can also come from the variance between two sets of drawn samples. Therefore, even if the algorithm itself is deterministic (e.g. evaluate all possible subsets for the Shapley values), we would use randomly generated samples which yields different results when calling an algorithm twice.