dowhy.causal_refuters package
Subpackages
Submodules
dowhy.causal_refuters.add_unobserved_common_cause module
- class dowhy.causal_refuters.add_unobserved_common_cause.AddUnobservedCommonCause(*args, **kwargs)[source]
Bases:
CausalRefuter
Add an unobserved confounder for refutation.
- AddUnobservedCommonCause class supports three methods:
Simulation of an unobserved confounder
Linear partial R2 : Sensitivity Analysis for linear models.
Non-Parametric partial R2 based : Sensitivity Analysis for non-parametric models.
Supports additional parameters that can be specified in the refute_estimate() method.
Initialize the parameters required for the refuter.
For direct_simulation, if effect_strength_on_treatment or effect_strength_on_outcome is not given, it is calculated automatically as a range between the minimum and maximum effect strength of observed confounders on treatment and outcome respectively.
- Parameters:
simulation_method – The method to use for simulating effect of unobserved confounder. Possible values are [“direct-simulation”, “linear-partial-R2”, “non-parametric-partial-R2”, “e-value”].
confounders_effect_on_treatment – str : The type of effect on the treatment due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
confounders_effect_on_outcome – str : The type of effect on the outcome due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
effect_strength_on_treatment – float, numpy.ndarray: [Used when simulation_method=”direct-simulation”] Strength of the confounder’s effect on treatment. When confounders_effect_on_treatment is linear, it is the regression coefficient. When the confounders_effect_on_treatment is binary flip, it is the probability with which effect of unobserved confounder can invert the value of the treatment.
effect_strength_on_outcome – float, numpy.ndarray: Strength of the confounder’s effect on outcome. Its interpretation depends on confounders_effect_on_outcome and the simulation_method. When simulation_method is direct-simulation, for a linear effect it behaves like the regression coefficient and for a binary flip, it is the probability with which it can invert the value of the outcome.
partial_r2_confounder_treatment – float, numpy.ndarray: [Used when simulation_method is linear-partial-R2 or non-parametric-partial-R2] Partial R2 of the unobserved confounder with respect to the treatment, conditioned on the observed confounders. Only in the case of general non-parametric-partial-R2, it is the fraction of variance in the Riesz representer that is explained by the unobserved confounder; specifically (1-r), where r is the ratio of the variance of the Riesz representer, alpha^2, based on observed confounders to that based on all confounders.
partial_r2_confounder_outcome – float, numpy.ndarray: [Used when simulation_method is linear-partial-R2 or non-parametric-partial-R2] Partial R2 of the unobserved confounder with respect to the outcome, conditioned on the treatment and observed confounders.
frac_strength_treatment – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on treatment. Defaults to 1.
frac_strength_outcome – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on outcome. Defaults to 1.
plotmethod – string: Type of plot to be shown. If None, no plot is generated. This parameter is used only when more than one treatment confounder effect value or outcome confounder effect value is provided. Default is “colormesh”. Supported values are “contour” and “colormesh” when more than one value is provided for both confounder effect value parameters; “line” when more than one value is provided for only one of them.
percent_change_estimate – The percentage reduction of the treatment estimate that could alter the results (default = 1). If percent_change_estimate = 1, the robustness value describes the strength of association that confounders need with treatment and outcome in order to reduce the estimate by 100%, i.e., bring it down to 0. (relevant only for Linear Sensitivity Analysis, ignore for rest)
confounder_increases_estimate – True implies that confounder increases the absolute value of estimate and vice versa. (Default = False). (relevant only for Linear Sensitivity Analysis, ignore for rest)
benchmark_common_causes – names of variables for bounding strength of confounders. (relevant only for partial-r2 based simulation methods)
significance_level – significance level used for the confidence intervals in statistical inference (default = 0.05). (relevant only for partial-r2 based simulation methods)
null_hypothesis_effect – assumed effect under the null hypothesis. (relevant only for linear-partial-R2, ignore for rest)
plot_estimate – Generate contour plot for estimate while performing sensitivity analysis. (default = True). (relevant only for partial-r2 based simulation methods)
num_splits – number of splits for cross validation. (default = 5). (relevant only for non-parametric-partial-R2 simulation method)
shuffle_data – shuffle data or not before splitting into folds (default = False). (relevant only for non-parametric-partial-R2 simulation method)
shuffle_random_seed – seed for randomly shuffling data. (relevant only for non-parametric-partial-R2 simulation method)
alpha_s_estimator_param_list – list of dictionaries with parameters for finding alpha_s. (relevant only for non-parametric-partial-R2 simulation method)
g_s_estimator_list – list of estimator objects for finding g_s. These objects should have fit() and predict() functions implemented. (relevant only for non-parametric-partial-R2 simulation method)
g_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “g_s_estimator_list”. The order of the dictionaries in the list should be consistent with the estimator objects order in “g_s_estimator_list”. (relevant only for non-parametric-partial-R2 simulation method)
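As a rough illustration of how these options fit together, the sketch below collects keyword arguments for a direct-simulation run and checks them against the values listed above. The helper `build_refuter_kwargs` is hypothetical and not part of DoWhy, and the numeric strengths are made up; the resulting dictionary is the kind of argument set that would be passed on to refute_estimate().

```python
# Allowed values, copied from the parameter descriptions above.
SIMULATION_METHODS = {
    "direct-simulation",
    "linear-partial-R2",
    "non-parametric-partial-R2",
    "e-value",
}
EFFECT_TYPES = {"binary_flip", "linear"}


def build_refuter_kwargs(simulation_method, **params):
    """Hypothetical helper: validate and collect refuter keyword arguments."""
    if simulation_method not in SIMULATION_METHODS:
        raise ValueError(f"unknown simulation_method: {simulation_method!r}")
    for key in ("confounders_effect_on_treatment", "confounders_effect_on_outcome"):
        if key in params and params[key] not in EFFECT_TYPES:
            raise ValueError(f"{key} must be one of {sorted(EFFECT_TYPES)}")
    return {"simulation_method": simulation_method, **params}


kwargs = build_refuter_kwargs(
    "direct-simulation",
    confounders_effect_on_treatment="binary_flip",
    confounders_effect_on_outcome="linear",
    effect_strength_on_treatment=0.01,  # flip probability (binary_flip)
    effect_strength_on_outcome=0.02,    # regression coefficient (linear)
)
```

Omitting both effect strengths would, per the description above, leave them to be derived from the observed confounders.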
- dowhy.causal_refuters.add_unobserved_common_cause.include_simulated_confounder(data: DataFrame, treatment_name: str, outcome_name: str, kappa_t: float, kappa_y: float, variables_of_interest: List, convergence_threshold: float = 0.1, c_star_max: int = 1000)[source]
This function simulates an unobserved confounder based on the data using the following steps:
1. It calculates the “residuals” from the treatment and outcome models:
i.) The outcome model has outcome as the dependent variable and all the observed variables including treatment as independent variables
ii.) The treatment model has treatment as the dependent variable and all the observed variables as independent variables.
2. U is an intermediate random variable drawn from the normal distribution with the weighted average of residuals as mean and a unit variance: U ~ N(c1*d_y + c2*d_t, 1), where
* d_y and d_t are the residuals from the outcome and treatment models, respectively
* c1 and c2 are coefficients to the residuals
3. The final U, which is the simulated unobserved confounder, is obtained by debiasing the intermediate variable U by residualising it with X.
Choosing the coefficients c1 and c2: The coefficients are chosen based on these basic assumptions:
1. There is a hyperbolic relationship satisfying c1*c2 = c_star
2. c_star is chosen from a range of possible values based on the correlation of the obtained simulated variable with outcome and treatment.
3. The product of correlations with treatment and outcome should be at a minimum distance to the maximum correlations with treatment and outcome in any of the observed confounders
4. The ratio of the weights should be such that they maintain the ratio of the maximum possible observed coefficients within some confidence interval
- Parameters:
c_star_max – int: The maximum possible value for the hyperbolic curve on which the coefficients to the residuals lie. It defaults to 1000 if not specified by the user.
convergence_threshold – float: The threshold to check the plateauing of the correlation while selecting a c_star. It defaults to 0.1 if not specified by the user.
- Returns:
pandas.core.series.Series: The simulated values of the unobserved confounder based on the data
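The intermediate draw U ~ N(c1*d_y + c2*d_t, 1) and the constraint c1*c2 = c_star can be sketched in plain Python. This is a simplified illustration, not DoWhy's implementation: the residuals are made-up numbers, the function names are hypothetical, the coefficient ratio is supplied by hand rather than derived from the observed confounders, and the final debiasing step is omitted.

```python
import math
import random


def coefficients_on_hyperbola(c_star, ratio):
    """Return (c1, c2) with c1 * c2 == c_star and c1 / c2 == ratio (ratio > 0)."""
    c1 = math.sqrt(c_star * ratio)
    c2 = math.sqrt(c_star / ratio)
    return c1, c2


def draw_intermediate_u(d_y, d_t, c1, c2, seed=None):
    """Draw U_i ~ N(c1 * d_y[i] + c2 * d_t[i], 1) for each unit."""
    rng = random.Random(seed)
    return [rng.gauss(c1 * dy + c2 * dt, 1.0) for dy, dt in zip(d_y, d_t)]


# Toy residuals from the outcome (d_y) and treatment (d_t) models, made up.
d_y = [0.5, -1.2, 0.3, 0.9]
d_t = [-0.1, 0.4, 0.0, -0.6]

c1, c2 = coefficients_on_hyperbola(c_star=4.0, ratio=1.0)
u = draw_intermediate_u(d_y, d_t, c1, c2, seed=0)
```

Fixing the seed makes the draw repeatable, which helps when comparing candidate values of c_star.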
- dowhy.causal_refuters.add_unobserved_common_cause.sensitivity_e_value(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: List[str], outcome_name: List[str], plot_estimate: bool = True) EValueSensitivityAnalyzer [source]
- dowhy.causal_refuters.add_unobserved_common_cause.sensitivity_linear_partial_r2(data: DataFrame, estimate: CausalEstimate, treatment_name: str, frac_strength_treatment: float = 1.0, frac_strength_outcome: float = 1.0, percent_change_estimate: float = 1.0, benchmark_common_causes: Optional[List[str]] = None, significance_level: Optional[float] = None, null_hypothesis_effect: Optional[float] = None, plot_estimate: bool = True) LinearSensitivityAnalyzer [source]
Add an unobserved confounder for refutation using the Linear partial R2 method (Sensitivity Analysis for linear models).
- Parameters:
data – pd.DataFrame: Data to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment
frac_strength_treatment – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on treatment. Defaults to 1.
frac_strength_outcome – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on outcome. Defaults to 1.
percent_change_estimate – The percentage reduction of the treatment estimate that could alter the results (default = 1). If percent_change_estimate = 1, the robustness value describes the strength of association that confounders need with treatment and outcome in order to reduce the estimate by 100%, i.e., bring it down to 0. (relevant only for Linear Sensitivity Analysis, ignore for rest)
benchmark_common_causes – names of variables for bounding strength of confounders. (relevant only for partial-r2 based simulation methods)
significance_level – significance level used for the confidence intervals in statistical inference (default = 0.05). (relevant only for partial-r2 based simulation methods)
null_hypothesis_effect – assumed effect under the null hypothesis. (relevant only for linear-partial-R2, ignore for rest)
plot_estimate – Generate contour plot for estimate while performing sensitivity analysis. (default = True). (relevant only for partial-r2 based simulation methods)
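To make the role of percent_change_estimate concrete, the sketch below computes a robustness value using the closed form from Cinelli and Hazlett (2020), "Making sense of sensitivity". It is an assumption that this function uses the same definition; the source does not state the formula. The function name, t-statistic, and degrees of freedom are hypothetical.

```python
import math


def robustness_value(t_statistic, dof, percent_change_estimate=1.0):
    """Robustness value RV_q, assuming the Cinelli-Hazlett definition.

    f_q scales the partial Cohen's f of the treatment by q, the fraction
    by which the estimate would have to shrink.
    """
    f_q = percent_change_estimate * abs(t_statistic) / math.sqrt(dof)
    return 0.5 * (math.sqrt(f_q**4 + 4 * f_q**2) - f_q**2)


# Made-up regression summary: t = 4.0 on 400 residual degrees of freedom.
rv_full = robustness_value(t_statistic=4.0, dof=400)
rv_half = robustness_value(t_statistic=4.0, dof=400, percent_change_estimate=0.5)
```

Under this definition, a smaller percent_change_estimate gives a smaller robustness value, since a weaker confounder suffices to remove only part of the estimate.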
- dowhy.causal_refuters.add_unobserved_common_cause.sensitivity_non_parametric_partial_r2(estimate: CausalEstimate, kappa_t: Optional[Union[float, ndarray]] = None, kappa_y: Optional[Union[float, ndarray]] = None, frac_strength_treatment: float = 1.0, frac_strength_outcome: float = 1.0, benchmark_common_causes: Optional[List[str]] = None, plot_estimate: bool = True, alpha_s_estimator_list: Optional[List] = None, alpha_s_estimator_param_list: Optional[List[Dict]] = None, g_s_estimator_list: Optional[List] = None, g_s_estimator_param_list: Optional[List[Dict]] = None, plugin_reisz: bool = False)[source]
Add an unobserved confounder for refutation using the Non-parametric partial R2 method (Sensitivity Analysis for non-parametric models).
- Parameters:
estimate – CausalEstimate: Estimate to run the refutation
kappa_t – float, numpy.ndarray: Partial R2 of the unobserved confounder with respect to the treatment, conditioned on the observed confounders. Only in the case of general non-parametric-partial-R2, it is the fraction of variance in the Riesz representer that is explained by the unobserved confounder; specifically (1-r), where r is the ratio of the variance of the Riesz representer, alpha^2, based on observed confounders to that based on all confounders.
kappa_y – float, numpy.ndarray: Partial R2 of the unobserved confounder with respect to the outcome, conditioned on the treatment and observed confounders.
frac_strength_treatment – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on treatment. Defaults to 1.
frac_strength_outcome – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on outcome. Defaults to 1.
benchmark_common_causes – names of variables for bounding strength of confounders. (relevant only for partial-r2 based simulation methods)
plot_estimate – Generate contour plot for estimate while performing sensitivity analysis. (default = True). (relevant only for partial-r2 based simulation methods)
alpha_s_estimator_list – list of estimator objects for estimating alpha_s. These objects should have fit() and predict() methods (relevant only for non-parametric-partial-R2 method)
alpha_s_estimator_param_list – list of dictionaries with parameters for finding alpha_s. (relevant only for non-parametric-partial-R2 simulation method)
g_s_estimator_list – list of estimator objects for finding g_s. These objects should have fit() and predict() functions implemented. (relevant only for non-parametric-partial-R2 simulation method)
g_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “g_s_estimator_list”. The order of the dictionaries in the list should be consistent with the estimator objects order in “g_s_estimator_list”. (relevant only for non-parametric-partial-R2 simulation method)
plugin_reisz – bool: Flag on whether to use the plugin estimator or the nonparametric estimator for the Riesz representer function (alpha_s).
- dowhy.causal_refuters.add_unobserved_common_cause.sensitivity_simulation(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: str, outcome_name: str, kappa_t: Optional[Union[float, ndarray]] = None, kappa_y: Optional[Union[float, ndarray]] = None, confounders_effect_on_treatment: str = 'binary_flip', confounders_effect_on_outcome: str = 'linear', frac_strength_treatment: float = 1.0, frac_strength_outcome: float = 1.0, plotmethod: Optional[str] = None, show_progress_bar=False, **_) CausalRefutation [source]
This function attempts to add an unobserved common cause to the outcome and the treatment. At present, the behavior is implemented for one-dimensional continuous and binary variables. The function accepts either single-valued inputs or a range of inputs; it inspects the data type of the input and decides on the course of action accordingly.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment
outcome_name – str: Name of the outcome
kappa_t – float, numpy.ndarray: Strength of the confounder’s effect on treatment. When confounders_effect_on_treatment is linear, it is the regression coefficient. When the confounders_effect_on_treatment is binary flip, it is the probability with which effect of unobserved confounder can invert the value of the treatment.
kappa_y – float, numpy.ndarray: Strength of the confounder’s effect on outcome. Its interpretation depends on confounders_effect_on_outcome and the simulation_method. When simulation_method is direct-simulation, for a linear effect it behaves like the regression coefficient and for a binary flip, it is the probability with which it can invert the value of the outcome.
confounders_effect_on_treatment – str : The type of effect on the treatment due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
confounders_effect_on_outcome – str : The type of effect on the outcome due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
frac_strength_treatment – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on treatment. Defaults to 1.
frac_strength_outcome – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on outcome. Defaults to 1.
plotmethod – string: Type of plot to be shown. If None, no plot is generated. This parameter is used only when more than one treatment confounder effect value or outcome confounder effect value is provided. Default is “colormesh”. Supported values are “contour” and “colormesh” when more than one value is provided for both confounder effect value parameters; “line” when more than one value is provided for only one of them.
- Returns:
CausalRefutation: An object that contains the estimated effect, the new effect, and the name of the refutation used.
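The two effect types can be pictured with a small stdlib sketch. It is a simplification under stated assumptions, not DoWhy's code: the function names are hypothetical, the data are made up, and the binary flip here inverts each value independently with probability kappa instead of tying the flip to the simulated confounder's value.

```python
import random


def apply_linear_effect(values, u, kappa):
    """Shift each value by kappa * u, treating kappa as a regression coefficient."""
    return [v + kappa * ui for v, ui in zip(values, u)]


def apply_binary_flip(values, kappa, seed=None):
    """Invert each binary value independently with probability kappa."""
    rng = random.Random(seed)
    return [1 - v if rng.random() < kappa else v for v in values]


treatment = [0, 1, 1, 0, 1]
outcome = [2.0, 3.5, 4.1, 1.8, 3.9]
u = [0.3, -0.5, 1.2, 0.0, -0.7]  # simulated confounder values (made up)

new_treatment = apply_binary_flip(treatment, kappa=0.1, seed=1)
new_outcome = apply_linear_effect(outcome, u, kappa=0.5)
```

Passing an array of kappa values instead of a single float would correspond to repeating this perturbation once per strength and re-estimating the effect each time.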
dowhy.causal_refuters.assess_overlap module
- class dowhy.causal_refuters.assess_overlap.AssessOverlap(*args, **kwargs)[source]
Bases:
CausalRefuter
Assess Overlap
This class implements the OverRule algorithm for assessing support and overlap via Boolean Rulesets, from [1].
[1] Oberst, M., Johansson, F., Wei, D., Gao, T., Brat, G., Sontag, D., & Varshney, K. (2020). Characterization of Overlap in Observational Studies. In S. Chiappa & R. Calandra (Eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (Vol. 108, pp. 788–798). PMLR. https://arxiv.org/abs/1907.04138
Initialize the parameters required for the refuter.
Arguments are passed through to the refute_estimate method. See dowhy.causal_refuters.assess_overlap_overrule for the definition of the SupportConfig and OverlapConfig dataclasses that define optimization hyperparameters.
Warning
This method is only compatible with estimators that use backdoor adjustment, and will attempt to acquire the set of backdoor variables via self._target_estimand.get_backdoor_variables().
- Parameters:
cat_feats – List[str]: List of categorical features, all others will be discretized
support_config – SupportConfig: DataClass with configuration options for learning support rules
overlap_config – OverlapConfig: DataClass with configuration options for learning overlap rules
overlap_eps – float: Defines the range of propensity scores for a point to be considered in the overlap region, with the range defined as (overlap_eps, 1 - overlap_eps), defaults to 0.1
overrule_verbose – bool: Enable verbose logging of optimization output, defaults to False
support_only – bool: Only fit rules to describe the support region (do not fit overlap rules), defaults to False
overlap_only – bool: Only fit rules to describe the overlap region (do not fit support rules), defaults to False
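The overlap_eps rule can be read as a simple membership test on the propensity score. The sketch below assumes an open interval, as the notation (overlap_eps, 1 - overlap_eps) suggests; the function name and the scores are made up for illustration.

```python
def in_overlap_region(propensity, overlap_eps=0.1):
    """True when overlap_eps < propensity < 1 - overlap_eps."""
    return overlap_eps < propensity < 1 - overlap_eps


propensities = [0.02, 0.10, 0.45, 0.89, 0.97]
labels = [in_overlap_region(p) for p in propensities]
```

A larger overlap_eps shrinks the region, so fewer points are labelled as overlapping before the rules are learned.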
- refute_estimate(show_progress_bar=False)[source]
Learn overlap and support rules.
- Parameters:
show_progress_bar (bool) – Not implemented, will raise error if set to True, defaults to False
- Raises:
NotImplementedError – Will raise this error if show_progress_bar=True
- Returns:
object of class OverruleAnalyzer
- dowhy.causal_refuters.assess_overlap.assess_support_and_overlap_overrule(data, backdoor_vars: List[str], treatment_name: str, cat_feats: List[str] = [], overlap_config: Optional[OverlapConfig] = None, support_config: Optional[SupportConfig] = None, overlap_eps: float = 0.1, support_only: bool = False, overlap_only: bool = False, verbose: bool = False)[source]
Learn support and overlap rules using OverRule.
- Parameters:
data – Data containing backdoor variables and treatment name
backdoor_vars (List[str]) – List of backdoor variables. Support and overlap rules will only be learned with respect to these variables
treatment_name (str) – Treatment name
cat_feats (List[str]) – Categorical features
overlap_config (OverlapConfig) – Configuration for learning overlap rules
support_config (SupportConfig) – Configuration for learning support rules
overlap_eps (float) – Defines the range of propensity scores for a point to be considered in the overlap region, with the range defined as (overlap_eps, 1 - overlap_eps), defaults to 0.1
support_only (bool) – Only fit the support region
overlap_only (bool) – Only fit the overlap region
verbose (bool) – Enable verbose logging of optimization output, defaults to False
dowhy.causal_refuters.assess_overlap_overrule module
- class dowhy.causal_refuters.assess_overlap_overrule.OverlapConfig(alpha: float = 0.95, lambda0: float = 0.001, lambda1: float = 0.001, K: int = 20, D: int = 20, B: int = 10, iterMax: int = 10, num_thresh: int = 9, thresh_override: Optional[Dict] = None, solver: str = 'ECOS', rounding: str = 'greedy_sweep')[source]
Bases:
object
Configuration for learning overlap rules.
- Parameters:
alpha (float, optional) – Fraction of the overlap samples to ensure are included in the rules, defaults to 0.95
lambda0 (float, optional) – Regularization on the # of rules, defaults to 1e-3
lambda1 (float, optional) – Regularization on the # of literals, defaults to 1e-3
K (int, optional) – Maximum results returned during beam search, defaults to 20
D (int, optional) – Maximum extra rules per beam search iteration, defaults to 20
B (int, optional) – Width of beam search, defaults to 10
iterMax (int, optional) – Maximum number of iterations of column generation, defaults to 10
num_thresh (int, optional) – Number of bins to discretize continuous variables, defaults to 9 (for deciles)
thresh_override (Optional[Dict], optional) – Manual override of the thresholds for continuous features, given as a dictionary such as thresh_override = {column_name: np.linspace(0, 100, 10)}. It is only applied to continuous features with more than num_thresh unique values
solver (str, optional) – Linear programming solver used by CVXPY to solve the LP relaxation, defaults to ‘ECOS’
rounding (str, optional) – Strategy to perform rounding, either ‘greedy’ or ‘greedy_sweep’, defaults to ‘greedy_sweep’
- B: int = 10
- D: int = 20
- K: int = 20
- alpha: float = 0.95
- iterMax: int = 10
- lambda0: float = 0.001
- lambda1: float = 0.001
- num_thresh: int = 9
- rounding: str = 'greedy_sweep'
- solver: str = 'ECOS'
- thresh_override: Optional[Dict] = None
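One way to read the note "defaults to 9 (for deciles)" on num_thresh is that nine cut points split a continuous feature into ten equal-probability groups. The sketch below illustrates that reading with the standard library; it is not OverRule's discretization code, and the feature values are made up.

```python
import statistics

num_thresh = 9
values = list(range(1, 101))  # toy continuous feature

# Asking for num_thresh + 1 equal-probability groups yields num_thresh cut points.
cut_points = statistics.quantiles(values, n=num_thresh + 1)
```

A thresh_override entry for a column would replace such data-driven cut points with user-chosen ones.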
- class dowhy.causal_refuters.assess_overlap_overrule.OverruleAnalyzer(backdoor_vars: List[str], treatment_name: str, cat_feats: Optional[List[str]] = None, support_config: Optional[SupportConfig] = None, overlap_config: Optional[OverlapConfig] = None, prop_estimator: Optional[Union[BaseEstimator, GridSearchCV]] = None, overlap_eps: float = 0.1, support_only: bool = False, overlap_only: bool = False, verbose: bool = False)[source]
Bases:
object
Learn support and overlap rules.
- Parameters:
backdoor_vars (List[str]) – List of backdoor variables. Support and overlap rules will only be learned with respect to these variables
treatment_name (str) – Treatment name
cat_feats (List[str]) – List of categorical features, all others will be discretized
support_config (SupportConfig) – DataClass with configuration options for learning support rules
overlap_config (OverlapConfig) – DataClass with configuration options for learning overlap rules
prop_estimator (Optional[Union[BaseEstimator, GridSearchCV]], optional) – Propensity score estimator, defaults to RandomForestClassifier learned via GridSearchCV
overlap_eps (float) – Defines the range of propensity scores for a point to be considered in the overlap region, with the range defined as (overlap_eps, 1 - overlap_eps), defaults to 0.1
support_only (bool, optional) – Only fit the support region, not the overlap, defaults to False
overlap_only (bool, optional) – Only fit the overlap region, not the support, defaults to False
verbose (bool, optional) – Enable verbose logging of optimization output, defaults to False
- class dowhy.causal_refuters.assess_overlap_overrule.SupportConfig(n_ref_multiplier: float = 1.0, seed: Optional[int] = None, alpha: float = 0.98, lambda0: float = 0.01, lambda1: float = 0.001, K: int = 20, D: int = 20, B: int = 10, iterMax: int = 10, num_thresh: int = 9, thresh_override: Optional[Dict] = None, solver: str = 'ECOS', rounding: str = 'greedy_sweep')[source]
Bases:
object
Configuration for learning support rules.
- Parameters:
n_ref_multiplier (float, optional) – Reference sample count multiplier, defaults to 1.0
seed (int, optional) – Random seed for reference samples, only used for estimating support, defaults to None
alpha (float, optional) – Fraction of the existing examples to ensure are included in the rules, defaults to 0.98
lambda0 (float, optional) – Regularization on the # of rules, defaults to 1e-2
lambda1 (float, optional) – Regularization on the # of literals, defaults to 1e-3
K (int, optional) – Maximum results returned during beam search, defaults to 20
D (int, optional) – Maximum extra rules per beam search iteration, defaults to 20
B (int, optional) – Width of beam search, defaults to 10
iterMax (int, optional) – Maximum number of iterations of column generation, defaults to 10
num_thresh (int, optional) – Number of bins to discretize continuous variables, defaults to 9 (for deciles)
thresh_override (Optional[Dict], optional) – Manual override of the thresholds for continuous features, given as a dictionary such as thresh_override = {column_name: np.linspace(0, 100, 10)}. It is only applied to continuous features with more than num_thresh unique values
solver (str, optional) – Linear programming solver used by CVXPY to solve the LP relaxation, defaults to ‘ECOS’
rounding (str, optional) – Strategy to perform rounding, either ‘greedy’ or ‘greedy_sweep’, defaults to ‘greedy_sweep’
- B: int = 10
- D: int = 20
- K: int = 20
- alpha: float = 0.98
- iterMax: int = 10
- lambda0: float = 0.01
- lambda1: float = 0.001
- n_ref_multiplier: float = 1.0
- num_thresh: int = 9
- rounding: str = 'greedy_sweep'
- seed: Optional[int] = None
- solver: str = 'ECOS'
- thresh_override: Optional[Dict] = None
dowhy.causal_refuters.bootstrap_refuter module
- class dowhy.causal_refuters.bootstrap_refuter.BootstrapRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by running it on a random sample of the data containing measurement error in the confounders. This tests how well the estimator recovers the effect of the treatment on the outcome when the confounders are measured with error.
It supports additional parameters that can be specified in the refute_estimate() method.
- Parameters:
num_simulations (int, optional) – The number of simulations to be run, CausalRefuter.DEFAULT_NUM_SIMULATIONS by default
sample_size (int, optional) – The size of each bootstrap sample, which is the size of the original data by default
required_variables (int, list, bool, optional) – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables leaving the treatment and the outcome
1. An integer argument refers to how many variables will be used for estimating the value of the outcome
2. A list explicitly refers to which variables will be used to estimate the outcome. Furthermore, it gives the ability to explicitly select or deselect the covariates present in the estimation of the outcome. This is done by either adding or explicitly removing variables from the list as shown below:
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
3. If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
noise (float, optional) – The standard deviation of the noise to be added to the data, which is BootstrapRefuter.DEFAULT_STD_DEV by default
probability_of_change (float, optional) – It specifies the probability with which we change the data for a boolean or categorical variable. It is noise by default, only if the value of noise is less than 1.
random_state (int, RandomState, optional) – The seed value to be added if we wish to repeat the same random behavior. For this purpose, we repeat the same seed in the pseudo-random generator.
- DEFAULT_NUMBER_OF_TRIALS = 1
- DEFAULT_STD_DEV = 0.1
- DEFAULT_SUCCESS_PROBABILITY = 0.5
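The selection rules for required_variables can be sketched as a small function. This is a hypothetical re-implementation for illustration only: it assumes variable names are strings with a leading "-" marking exclusion, and that an integer n selects the first n variables (the source does not say which ones are chosen).

```python
def select_variables(all_variables, required_variables=True):
    """Hypothetical sketch of the required_variables rules described above."""
    if required_variables is True:
        return list(all_variables)
    if required_variables is False:
        raise ValueError("required_variables=False is invalid")
    if isinstance(required_variables, int):
        # Assumption: take the first n variables; the source does not say which.
        return list(all_variables)[:required_variables]
    excluded = [name[1:] for name in required_variables if name.startswith("-")]
    if excluded:
        return [name for name in all_variables if name not in excluded]
    return list(required_variables)


confounders = ["W0", "W1", "W2"]
kept = select_variables(confounders, ["-W0", "-W1"])
```

The boolean checks come before the integer check because Python treats True and False as integers.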
- dowhy.causal_refuters.bootstrap_refuter.refute_bootstrap(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, num_simulations: int = 100, random_state: Optional[Union[int, RandomState]] = None, sample_size: Optional[int] = None, required_variables: bool = True, noise: float = 0.1, probability_of_change: Optional[float] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by running it on a random sample of the data containing measurement error in the confounders. This tests how well the estimator recovers the effect of the treatment on the outcome when the confounders are measured with error.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
num_simulations – The number of simulations to be run, CausalRefuter.DEFAULT_NUM_SIMULATIONS by default
random_state – The seed value to be added if we wish to repeat the same random behavior. For this purpose, we repeat the same seed in the pseudo-random generator.
sample_size – The size of each bootstrap sample, which is the size of the original data by default
required_variables – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables leaving the treatment and the outcome
1. An integer argument refers to how many variables will be used for estimating the value of the outcome
2. A list explicitly refers to which variables will be used to estimate the outcome. Furthermore, it gives the ability to explicitly select or deselect the covariates present in the estimation of the outcome. This is done by either adding or explicitly removing variables from the list as shown below:
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
3. If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
noise – The standard deviation of the noise to be added to the data, which is BootstrapRefuter.DEFAULT_STD_DEV by default
probability_of_change – It specifies the probability with which we change the data for a boolean or categorical variable. It is noise by default, only if the value of noise is less than 1.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
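A single bootstrap draw with measurement error might look like the following stdlib sketch: rows are resampled with replacement and Gaussian noise with standard deviation noise is added to one continuous confounder. The function name, row layout, and column names are assumptions made for illustration; DoWhy's own routine also handles boolean and categorical variables via probability_of_change, which is omitted here.

```python
import random


def noisy_bootstrap_sample(rows, confounder, sample_size=None, noise=0.1, seed=None):
    """Resample rows with replacement and add Gaussian noise to one confounder."""
    rng = random.Random(seed)
    size = len(rows) if sample_size is None else sample_size
    sample = [dict(rng.choice(rows)) for _ in range(size)]  # copy each drawn row
    for row in sample:
        row[confounder] += rng.gauss(0.0, noise)
    return sample


# Toy data: one continuous confounder W0, binary treatment T, outcome Y.
rows = [{"W0": float(i), "T": i % 2, "Y": 2.0 * i} for i in range(10)]
sample = noisy_bootstrap_sample(rows, confounder="W0", noise=0.1, seed=42)
```

Repeating the draw num_simulations times and re-estimating the effect on each sample gives the distribution against which the original estimate is compared.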
dowhy.causal_refuters.data_subset_refuter module
- class dowhy.causal_refuters.data_subset_refuter.DataSubsetRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by rerunning it on a random subset of the original data.
Supports additional parameters that can be specified in the refute_estimate() method. For joblib-related parameters (n_jobs, verbose), please refer to the joblib documentation for more details (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- Parameters:
subset_fraction (float, optional) – Fraction of the data to be used for re-estimation, which is DataSubsetRefuter.DEFAULT_SUBSET_FRACTION by default.
num_simulations (int, optional) – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default.
random_state (int, RandomState, optional) – The seed value to use if we wish to repeat the same random behavior. To reproduce a run, we pass the same seed to the pseudo-random generator.
n_jobs (int, optional) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose (int, optional) – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
- DEFAULT_SUBSET_FRACTION = 0.8
- dowhy.causal_refuters.data_subset_refuter.refute_data_subset(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, subset_fraction: float = 0.8, num_simulations: int = 100, random_state: Optional[Union[int, RandomState]] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by rerunning it on a random subset of the original data.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
subset_fraction – Fraction of the data to be used for re-estimation, which is DataSubsetRefuter.DEFAULT_SUBSET_FRACTION by default.
num_simulations – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default.
random_state – The seed value to use if we wish to repeat the same random behavior. To reproduce a run, we pass the same seed to the pseudo-random generator.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
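The idea behind the subset refutation can be sketched without DoWhy: re-run an estimator on random subsets and compare the new estimates with the original one. The helper names refute_on_subsets and difference_in_means, the synthetic data, and the toy estimator are all assumptions made for this illustration; the library's own implementation works on its CausalEstimate objects.

```python
import random
import statistics


def refute_on_subsets(rows, estimate_fn, subset_fraction=0.8,
                      num_simulations=100, seed=None):
    """Re-run estimate_fn on random subsets of rows and collect the results."""
    rng = random.Random(seed)
    subset_size = int(len(rows) * subset_fraction)
    return [estimate_fn(rng.sample(rows, subset_size))
            for _ in range(num_simulations)]


def difference_in_means(rows):
    """Toy estimator: mean outcome of treated minus mean outcome of control."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return statistics.mean(treated) - statistics.mean(control)


# Synthetic data with a true effect of 2.0; each row is (treatment, outcome).
data_rng = random.Random(0)
rows = [(t, 2.0 * t + data_rng.gauss(0, 1)) for t in (0, 1) for _ in range(500)]

original = difference_in_means(rows)
new_effects = refute_on_subsets(rows, difference_in_means, seed=1)
print(round(original, 2), round(statistics.mean(new_effects), 2))
```

If the estimate is stable, the mean of the subset estimates should stay close to the original estimate.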
dowhy.causal_refuters.dummy_outcome_refuter module
- dowhy.causal_refuters.dummy_outcome_refuter.DEFAULT_TRUE_CAUSAL_EFFECT(x)
- class dowhy.causal_refuters.dummy_outcome_refuter.DummyOutcomeRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by replacing the outcome with a simulated variable for which the true causal effect is known.
In the simplest case, the dummy outcome is an independent, randomly generated variable. By definition, the true causal effect should be zero.
More generally, the dummy outcome uses the observed relationship between confounders and outcome (conditional on treatment) to create a more realistic outcome for which the treatment effect is known to be zero. If the goal is to simulate a dummy outcome with a non-zero true causal effect, then we can add an arbitrary function h(t) to the dummy outcome’s generation process and then the causal effect becomes h(t=1)-h(t=0).
Note that this general procedure only works for the backdoor criterion.
1. We find f(W) for each value of treatment. That is, keeping the treatment constant, we fit a predictor to estimate the effect of confounders W on outcome y. Note that since f(W) simply defines a new DGP for the simulated outcome, it need not be the correct structural equation from W to y.
2. We obtain the value of dummy outcome as:
y_dummy = h(t) + f(W)
To prevent overfitting, we fit f(W) for one value of T and then use it to generate data for other values of t. Support for identification based on instrumental variables and mediation is planned for the future.
If we originally started out with
    W
  /   \
 t --->y
On estimating the following with constant t, y_dummy = f(W)
    W
  /   \
 t --|->y
This ensures that we try to capture as much of W--->Y as possible.
On adding h(t)
    W
  /   \
 t --->y
  h(t)
Supports additional parameters that can be specified in the refute_estimate() method.
- Parameters:
num_simulations (int, optional) – The number of simulations to be run, which defaults to CausalRefuter.DEFAULT_NUM_SIMULATIONS
transformation_list (list, optional) –
It is a list of actions to be performed to obtain the outcome, which defaults to DEFAULT_TRANSFORMATION. The default transformation is as follows:
[("zero",""),("noise", {'std_dev':1} )]
Each of the actions within a transformation is one of the following types:
function argument: function pd.Dataframe -> np.ndarray
It takes a function that receives the input data frame and outputs the outcome variable. This allows us to create an output variable that only depends on the covariates and does not depend on the treatment variable.
string argument
Currently it supports some common estimators like
Linear Regression
K Nearest Neighbours
Support Vector Machine
Neural Network
Random Forest
Or functions such as:
Permute: This permutes the rows of the outcome, disassociating any effect of the treatment on the outcome.
Noise: This adds white noise to the outcome, reducing any causal relationship with the treatment.
Zero: It replaces all the values in the outcome with zero.
- Examples:
The transformation_list is of the following form:
If the function pd.Dataframe -> np.ndarray is already defined:
[(func,func_params),('permute',{'permute_fraction':val}),('noise',{'std_dev':val})]
Every function should support a minimum of two arguments, X_train and outcome_train, which correspond to the training data and the outcome that we want to predict. Additional parameters, such as the learning rate or the momentum constant, can be set with the help of func_args.
[(neural_network,{'alpha': 0.0001, 'beta': 0.9}),('permute',{'permute_fraction': 0.2}),('noise',{'std_dev': 0.1})]
The neural network is invoked as neural_network(X_train, outcome_train, **args).
If a function from the above list is used:
[('knn',{'n_neighbors':5}), ('permute', {'permute_fraction': val} ), ('noise', {'std_dev': val} )]
- Parameters:
true_causal_effect – A function that is used to get the True Causal Effect for the modelled dummy outcome. It defaults to DEFAULT_TRUE_CAUSAL_EFFECT, which means that there is no relationship between the treatment and outcome in the dummy data.
Note
The true causal effect should take an input of the same shape as the treatment and the output should match the shape of the outcome
- Parameters:
required_variables – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables except the treatment and the outcome.
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
Note
These inputs are fed to the function for estimating the outcome variable. The same set of required_variables is used for each instance of an internal estimation function.
- Parameters:
bucket_size_scale_factor – For continuous data, the scale factor helps us scale the size of the bucket used on the data. The default scale factor is DEFAULT_BUCKET_SCALE_FACTOR.
min_data_point_threshold (int, optional) – The minimum number of data points for an estimator to run. This defaults to MIN_DATA_POINT_THRESHOLD. If the number of data points is too few for a certain category, we make use of the DEFAULT_TRANSFORMATION for generating the dummy outcome.
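A transformation_list is applied action by action, each one receiving the outcome produced by the previous one. The sketch below illustrates that chaining for the three built-in string actions only. The function apply_transformations is hypothetical, it operates on plain Python lists rather than arrays, and its "permute" branch always shuffles every row, which is a simplification.

```python
import random


def apply_transformations(outcome, transformation_list, seed=None):
    """Apply "zero", "noise" and "permute" actions to the outcome, in order.

    Hypothetical helper for illustration; not part of the DoWhy API.
    """
    rng = random.Random(seed)
    values = list(outcome)
    for action, params in transformation_list:
        if action == "zero":
            values = [0.0] * len(values)
        elif action == "noise":
            values = [v + rng.gauss(0, params["std_dev"]) for v in values]
        elif action == "permute":
            # Simplification: shuffle every row (permute_fraction = 1).
            rng.shuffle(values)
        else:
            raise ValueError(f"unsupported action: {action}")
    return values


default_transformation = [("zero", ""), ("noise", {"std_dev": 1})]
dummy_outcome = apply_transformations([3.2, 1.5, 4.8, 2.1],
                                      default_transformation, seed=0)
print(dummy_outcome)  # four noise draws, unrelated to the original outcome
```

With the default transformation, the dummy outcome is pure noise, so the true causal effect of any treatment on it is zero.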
- class dowhy.causal_refuters.dummy_outcome_refuter.TestFraction(base, other)
Bases:
tuple
Create new instance of TestFraction(base, other)
- base
Alias for field number 0
- other
Alias for field number 1
- dowhy.causal_refuters.dummy_outcome_refuter.noise(outcome: ndarray, std_dev: float)[source]
Add white noise with mean 0 and standard deviation = std_dev
- Parameters:
'outcome' – np.ndarray The outcome variable, to which the white noise is added.
'std_dev' – float The standard deviation of the white noise.
- Returns:
outcome with added noise
- dowhy.causal_refuters.dummy_outcome_refuter.permute(outcome_name: str, outcome: ndarray, permute_fraction: float)[source]
If the permute_fraction is 1, we permute all the values in the outcome. Otherwise we make use of the Fisher Yates shuffle. Refer to https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle for more details.
- Parameters:
'outcome' – np.ndarray The outcome variable to be permuted.
'permute_fraction' – float [0, 1] The fraction of rows permuted.
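One way to permute only a fraction of the rows is a partial Fisher-Yates pass, in which a position is swapped with a randomly chosen later position only with some probability. The sketch below is an assumption about how such a shuffle could look, not a copy of the library's code. The name partial_permute and the halving of the probability (because every swap moves two rows) are choices made for this example.

```python
import random


def partial_permute(outcome, permute_fraction, seed=None):
    """Return a copy of outcome with roughly permute_fraction of its rows moved."""
    rng = random.Random(seed)
    values = list(outcome)
    if permute_fraction == 1:
        rng.shuffle(values)  # full shuffle of all rows
        return values
    # Each swap displaces two rows, so swap sources are picked at half the rate.
    swap_probability = permute_fraction / 2
    n = len(values)
    for i in range(n - 1):
        if rng.random() < swap_probability:
            j = rng.randrange(i + 1, n)  # swap with a later position
            values[i], values[j] = values[j], values[i]
    return values


shuffled = partial_permute(list(range(10)), 0.2, seed=0)
print(shuffled)
```

The result always contains the same values as the input; only their positions change.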
- dowhy.causal_refuters.dummy_outcome_refuter.preprocess_data_by_treatment(data: DataFrame, treatment_name: List[str], unobserved_confounder_values: Optional[ndarray], bucket_size_scale_factor: float, chosen_variables: List[str])[source]
This function groups data based on the data type of the treatment.
Expected variable types supported for the treatment:
bool
pd.categorical
float
int
- Returns:
pandas.core.groupby.generic.DataFrameGroupBy
- dowhy.causal_refuters.dummy_outcome_refuter.process_data(outcome_name: str, X_train: ndarray, outcome_train: ndarray, X_validation: ndarray, outcome_validation: ndarray, transformation_list: List)[source]
We process the data by first training the estimators in the transformation_list on X_train and outcome_train. We then apply the estimators on X_validation to get the value of the dummy outcome, which we store in outcome_validation.
- Parameters:
X_train (np.ndarray) – The data of the covariates which is used to train an estimator. It corresponds to the data of a single category of the treatment
outcome_train (np.ndarray) – This is used to hold the intermediate values of the outcome variable in the transformation list
For Example:
[ ('permute', {'permute_fraction': val} ), (func,func_params)]
The value obtained from permutation is used as an input for the custom estimator.
- Parameters:
X_validation (np.ndarray) – The data of the covariates that is fed to a trained estimator to generate a dummy outcome
outcome_validation (np.ndarray) – This variable stores the dummy_outcome generated by the transformations
transformation_list (np.ndarray) – The list of transformations on the outcome data required to produce a dummy outcome
- dowhy.causal_refuters.dummy_outcome_refuter.refute_dummy_outcome(data: ~pandas.core.frame.DataFrame, target_estimand: ~dowhy.causal_identifier.identified_estimand.IdentifiedEstimand, estimate: ~dowhy.causal_estimator.CausalEstimate, treatment_name: str, outcome_name: str, required_variables: ~typing.Optional[~typing.Union[int, list, bool]] = None, min_data_point_threshold: float = 30, bucket_size_scale_factor: float = 0.5, num_simulations: int = 100, transformation_list: ~typing.List = [('zero', ''), ('noise', {'std_dev': 1})], test_fraction: ~typing.List[~dowhy.causal_refuters.dummy_outcome_refuter.TestFraction] = [TestFraction(base=0.5, other=0.5)], unobserved_confounder_values: ~typing.Optional[~typing.List] = None, true_causal_effect: ~typing.Callable = <function <lambda>>, show_progress_bar=False, **_) List[CausalRefutation] [source]
Refute an estimate by replacing the outcome with a simulated variable for which the true causal effect is known.
In the simplest case, the dummy outcome is an independent, randomly generated variable. By definition, the true causal effect should be zero.
More generally, the dummy outcome uses the observed relationship between confounders and outcome (conditional on treatment) to create a more realistic outcome for which the treatment effect is known to be zero. If the goal is to simulate a dummy outcome with a non-zero true causal effect, then we can add an arbitrary function h(t) to the dummy outcome’s generation process and then the causal effect becomes h(t=1)-h(t=0).
Note that this general procedure only works for the backdoor criterion.
1. We find f(W) for each value of treatment. That is, keeping the treatment constant, we fit a predictor to estimate the effect of confounders W on outcome y. Note that since f(W) simply defines a new DGP for the simulated outcome, it need not be the correct structural equation from W to y.
2. We obtain the value of dummy outcome as:
y_dummy = h(t) + f(W)
To prevent overfitting, we fit f(W) for one value of T and then use it to generate data for other values of t. Support for identification based on instrumental variables and mediation is planned for the future.
If we originally started out with
    W
  /   \
 t --->y
On estimating the following with constant t, y_dummy = f(W)
    W
  /   \
 t --|->y
This ensures that we try to capture as much of W--->Y as possible.
On adding h(t)
    W
  /   \
 t --->y
  h(t)
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment
num_simulations (int, optional) – The number of simulations to be run, which defaults to CausalRefuter.DEFAULT_NUM_SIMULATIONS
transformation_list (list, optional) –
It is a list of actions to be performed to obtain the outcome, which defaults to DEFAULT_TRANSFORMATION. The default transformation is as follows:
[("zero",""),("noise", {'std_dev':1} )]
Each of the actions within a transformation is one of the following types:
function argument: function pd.Dataframe -> np.ndarray
It takes a function that receives the input data frame and outputs the outcome variable. This allows us to create an output variable that only depends on the covariates and does not depend on the treatment variable.
string argument
Currently it supports some common estimators like
Linear Regression
K Nearest Neighbours
Support Vector Machine
Neural Network
Random Forest
Or functions such as:
Permute: This permutes the rows of the outcome, disassociating any effect of the treatment on the outcome.
Noise: This adds white noise to the outcome, reducing any causal relationship with the treatment.
Zero: It replaces all the values in the outcome with zero.
- Examples:
The transformation_list is of the following form:
If the function pd.Dataframe -> np.ndarray is already defined:
[(func,func_params),('permute',{'permute_fraction':val}),('noise',{'std_dev':val})]
Every function should support a minimum of two arguments, X_train and outcome_train, which correspond to the training data and the outcome that we want to predict. Additional parameters, such as the learning rate or the momentum constant, can be set with the help of func_args.
[(neural_network,{'alpha': 0.0001, 'beta': 0.9}),('permute',{'permute_fraction': 0.2}),('noise',{'std_dev': 0.1})]
The neural network is invoked as neural_network(X_train, outcome_train, **args).
If a function from the above list is used:
[('knn',{'n_neighbors':5}), ('permute', {'permute_fraction': val} ), ('noise', {'std_dev': val} )]
- Parameters:
true_causal_effect – A function that is used to get the True Causal Effect for the modelled dummy outcome. It defaults to DEFAULT_TRUE_CAUSAL_EFFECT, which means that there is no relationship between the treatment and outcome in the dummy data.
Note
The true causal effect should take an input of the same shape as the treatment and the output should match the shape of the outcome
- Parameters:
required_variables – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables except the treatment and the outcome.
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
Note
These inputs are fed to the function for estimating the outcome variable. The same set of required_variables is used for each instance of an internal estimation function.
- Parameters:
bucket_size_scale_factor – For continuous data, the scale factor helps us scale the size of the bucket used on the data. The default scale factor is DEFAULT_BUCKET_SCALE_FACTOR.
min_data_point_threshold (int, optional) – The minimum number of data points for an estimator to run. This defaults to MIN_DATA_POINT_THRESHOLD. If the number of data points is too few for a certain category, we make use of the DEFAULT_TRANSFORMATION for generating the dummy outcome.
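The two-step construction y_dummy = h(t) + f(W) can be illustrated on synthetic data. The sketch below is a simplified stand-in, not the refuter's implementation: it uses a single covariate, fits f(W) by ordinary least squares on the t = 0 rows only, and then applies that one fitted f to every row. The names fit_line and h, and the data-generating process, are assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single covariate."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


rng = random.Random(0)
# Synthetic data: W confounds both the treatment t and the outcome y.
W = [rng.gauss(0, 1) for _ in range(1000)]
t = [1 if w + rng.gauss(0, 1) > 0 else 0 for w in W]
y = [3.0 * w + 5.0 * ti + rng.gauss(0, 1) for w, ti in zip(W, t)]

# Step 1: fit f(W) while holding the treatment constant (here, t = 0).
f = fit_line([w for w, ti in zip(W, t) if ti == 0],
             [yi for yi, ti in zip(y, t) if ti == 0])


# Step 2: y_dummy = h(t) + f(W), with a known effect h(1) - h(0).
def h(treatment):
    return 2.0 * treatment


y_dummy = [h(ti) + f(w) for w, ti in zip(W, t)]

true_effect = h(1) - h(0)
print(true_effect)  # 2.0
```

An estimator that adjusts for W correctly should recover an effect close to true_effect when run on y_dummy; a large deviation would count as evidence against the estimator.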
dowhy.causal_refuters.evalue_sensitivity_analyzer module
- class dowhy.causal_refuters.evalue_sensitivity_analyzer.EValueSensitivityAnalyzer(estimate: CausalEstimate, estimand: IdentifiedEstimand, data: DataFrame, treatment_name: str, outcome_name: str, no_effect_baseline=None)[source]
Bases:
object
This class computes Ding & VanderWeele’s E-value for unmeasured confounding. The E-value is the minimum strength of association on the risk ratio scale that an unmeasured confounder would need to have with both the treatment and the outcome, conditional on the measured covariates, to fully explain away a specific treatment-outcome association.
It benchmarks the E-value against measured confounders using McGowan and Greevy Jr.’s Observed Covariate E-value. This approach drops measured confounders and re-fits the estimator, measuring how much the limiting bound of the confidence interval changes on the E-value scale. This benchmarks hypothetical unmeasured confounding against each of the measured confounders.
See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4820664/, https://dash.harvard.edu/bitstream/handle/1/36874927/EValue_FinalSubmission.pdf, and https://arxiv.org/pdf/2011.07030.pdf. The implementation is based on the R packages https://github.com/cran/EValue and https://github.com/LucyMcGowan/tipr.
- Parameters:
estimate – CausalEstimate
estimand – IdentifiedEstimand
data – pd.DataFrame
outcome_name – Outcome variable name
no_effect_baseline – A number to which to shift the observed estimate. Defaults to 1 for ratio measures (RR, OR, HR) and 0 for additive measures (OLS, MD). (Default = None)
- benchmark(data: DataFrame)[source]
Benchmarks E-values against the measured confounders using McGowan and Greevy Jr.’s Observed Covariate E-value. This approach drops measured confounders and re-fits the estimator, measuring how much the limiting bound of the confidence interval changes on the E-value scale. This benchmarks hypothetical unmeasured confounding against each of the measured confounders.
See: https://arxiv.org/pdf/2011.07030.pdf and https://github.com/LucyMcGowan/tipr
- check_sensitivity(data: DataFrame, plot=True)[source]
Computes E-value for point estimate and confidence limits. Benchmarks E-values against measured confounders using Observed Covariate E-values. Plots E-values and Observed Covariate E-values.
- Parameters:
plot – plots E-value for point estimate and confidence limit. (Default = True)
- get_evalue(coef_est, coef_se)[source]
Computes E-value for point estimate and confidence limits. The estimate and confidence limits are converted to the risk ratio scale before the E-value is calculated.
- Parameters:
coef_est – coefficient estimate
coef_se – coefficient standard error
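Once an estimate is on the risk ratio scale, the E-value has a closed form: for RR >= 1 it is RR + sqrt(RR * (RR - 1)), and a protective estimate (RR < 1) is inverted first. The sketch below shows only this final step. The function name e_value is chosen for the example, and the conversion of other effect measures to the risk ratio scale is not shown.

```python
import math


def e_value(risk_ratio):
    """E-value for an estimate that is already on the risk ratio scale."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))


print(round(e_value(2.0), 3))  # 3.414
print(round(e_value(0.5), 3))  # 3.414
print(e_value(1.0))            # 1.0
```

An E-value of 1 means that no unmeasured confounding is needed to explain the estimate, since there is no association to explain away.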
- plot(num_points_per_contour=200, plot_size=(6.4, 4.8), contour_colors=['blue', 'red'], benchmarking_color='green', xy_limit=None)[source]
Plots contours showing the combinations of treatment-confounder and confounder-outcome risk ratios that would tip the point estimate and confidence limit. The X-axis shows the treatment-confounder risk ratio and the Y-axis shows the confounder-outcome risk ratio.
- Parameters:
num_points_per_contour – number of points to calculate and plot for each contour (Default = 200)
plot_size – size of the plot (Default = (6.4,4.8))
contour_colors – colors for point estimate and confidence limit contour (Default = [“blue”, “red”])
benchmarking_color – color for observed covariate E-values. (Default = “green”)
xy_limit – plot’s maximum x and y value. Default is 2 x E-value. (Default = None)
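The contour for a given risk ratio can be derived from the joint bounding factor B = RR_TU * RR_UY / (RR_TU + RR_UY - 1): the estimate is tipped when B equals the observed risk ratio, which gives RR_UY = RR_obs * (RR_TU - 1) / (RR_TU - RR_obs). The sketch below computes such points for an observed risk ratio above 1. The function name tipping_contour and the even spacing of the points are assumptions made for the example, and the actual plotting code may differ.

```python
import math


def tipping_contour(observed_rr, num_points=200, xy_limit=None):
    """Points (RR_TU, RR_UY) whose joint bounding factor equals observed_rr."""
    if observed_rr <= 1:
        raise ValueError("this sketch expects a risk ratio above 1")
    e_val = observed_rr + math.sqrt(observed_rr * (observed_rr - 1))
    # Following the documented default, the axis limit is twice the E-value.
    limit = xy_limit if xy_limit is not None else 2 * e_val
    # The curve is only defined for RR_TU strictly greater than observed_rr.
    step = (limit - observed_rr) / num_points
    points = []
    for k in range(1, num_points + 1):
        rr_tu = observed_rr + k * step
        rr_uy = observed_rr * (rr_tu - 1) / (rr_tu - observed_rr)
        points.append((rr_tu, rr_uy))
    return points


contour = tipping_contour(2.0)
print(len(contour))  # 200
```

The E-value is the point where this contour crosses the diagonal RR_TU = RR_UY.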
dowhy.causal_refuters.graph_refuter module
- class dowhy.causal_refuters.graph_refuter.GraphRefutation(method_name_discrete, method_name_continuous)[source]
Bases:
CausalRefutation
Class for storing the result of a refutation method.
- class dowhy.causal_refuters.graph_refuter.GraphRefuter(data, method_name_discrete='conditional_mutual_information', method_name_continuous='partial_correlation')[source]
Bases:
CausalRefuter
Class for performing refutations on graph and storing the results
Initialize data for graph refutation
- Parameters:
data – input dataset
method_name_discrete – name of method for testing conditional independence in discrete data
method_name_continuous – name of method for testing conditional independence in continuous data
- Returns:
instance of GraphRefutation class
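The default continuous test is named partial_correlation. As an illustration of the underlying idea only, the sketch below computes the first-order partial correlation of x and y given a single variable z from the three pairwise Pearson correlations. The helper names pearson and partial_correlation and the synthetic data are assumptions, and the library's own test may handle multiple conditioning variables and significance testing differently.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)
# z is a common cause of x and y, so x and y are independent given z.
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

print(round(pearson(x, y), 2), round(partial_correlation(x, y, z), 2))
```

The marginal correlation is clearly positive while the partial correlation is close to zero, which is what a graph with x <- z -> y implies.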
dowhy.causal_refuters.linear_sensitivity_analyzer module
- class dowhy.causal_refuters.linear_sensitivity_analyzer.LinearSensitivityAnalyzer(estimator=None, data=None, treatment_name=None, percent_change_estimate=1.0, significance_level=0.05, confounder_increases_estimate=True, benchmark_common_causes=None, null_hypothesis_effect=0, frac_strength_treatment=None, frac_strength_outcome=None, common_causes_order=None)[source]
Bases:
object
Class to perform sensitivity analysis. See: https://carloscinelli.com/files/Cinelli%20and%20Hazlett%20(2020)%20-%20Making%20Sense%20of%20Sensitivity.pdf
- Parameters:
estimator – linear estimator of the causal model
data – Pandas dataframe
treatment_name – name of treatment
percent_change_estimate – It is the percentage of reduction of treatment estimate that could alter the results (default = 1). If percent_change_estimate = 1, the robustness value describes the strength of association of confounders with treatment and outcome in order to reduce the estimate by 100%, i.e. bring it down to 0.
null_hypothesis_effect – assumed effect under the null hypothesis
confounder_increases_estimate – True implies that confounder increases the absolute value of estimate and vice versa. (Default = True)
benchmark_common_causes – names of variables for bounding strength of confounders
significance_level – significance level for statistical inference (default = 0.05)
frac_strength_treatment – strength of association between unobserved confounder and treatment compared to benchmark covariate
frac_strength_outcome – strength of association between unobserved confounder and outcome compared to benchmark covariate
common_causes_order – The order of column names in OLS regression data
- check_sensitivity(plot=True)[source]
Function to perform sensitivity analysis.
- Parameters:
plot – plot = True generates a plot of point estimate and the variations with respect to unobserved confounding. plot = False overrides the setting
- Returns:
instance of LinearSensitivityAnalyzer class
- compute_bias_adjusted(r2tu_w, r2yu_tw)[source]
Computes the bias adjusted estimate, standard error, t-value, partial R2, confidence intervals
- Parameters:
r2tu_w – partial r^2 from regressing unobserved confounder u on treatment t after conditioning on observed covariates w
r2yu_tw – partial r^2 from regressing unobserved confounder u on outcome y after conditioning on observed covariates w and treatment t
- Returns:
Python dictionary with information about partial R^2 of confounders with treatment and outcome and bias adjusted variables
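The omitted variable bias formulas from Cinelli and Hazlett (2020) can be written down directly. The sketch below is a standalone illustration with plain numbers as inputs, not the method's actual code: the function name bias_adjusted, its argument list and the keys of the returned dictionary are assumptions.

```python
import math


def bias_adjusted(estimate, se, dof, r2tu_w, r2yu_tw,
                  confounder_increases_estimate=True):
    """Bias-adjusted estimate and standard error for hypothetical partial R^2 values."""
    # |bias| = se * sqrt(dof) * sqrt(R2_yu.tw * R2_tu.w / (1 - R2_tu.w))
    bias = se * math.sqrt(dof) * math.sqrt(r2yu_tw * r2tu_w / (1 - r2tu_w))
    sign = math.copysign(1.0, estimate)
    if confounder_increases_estimate:
        # The confounder inflated the estimate, so shrink it towards zero.
        adjusted_estimate = sign * (abs(estimate) - bias)
    else:
        adjusted_estimate = sign * (abs(estimate) + bias)
    adjusted_se = (se * math.sqrt((1 - r2yu_tw) / (1 - r2tu_w))
                   * math.sqrt(dof / (dof - 1)))
    return {
        "bias": bias,
        "adjusted_estimate": adjusted_estimate,
        "adjusted_se": adjusted_se,
        "adjusted_t": adjusted_estimate / adjusted_se,
    }


result = bias_adjusted(estimate=1.0, se=0.1, dof=100, r2tu_w=0.5, r2yu_tw=0.04)
print(round(result["bias"], 3), round(result["adjusted_estimate"], 3))
```

When both partial R^2 values are zero, the bias is zero and the adjusted estimate equals the original one.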
- partial_r2_func(estimator_model=None, treatment=None)[source]
Computes the partial R^2 of regression model
- Parameters:
estimator_model – Linear regression model
treatment – treatment name
- Returns:
partial R^2 value
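For a linear regression, the partial R^2 of a regressor with the outcome can be recovered from its t-statistic and the residual degrees of freedom through the identity t^2 / (t^2 + dof). The one-line sketch below shows that identity. It takes plain numbers rather than a fitted model, so the function name partial_r2 and its signature are assumptions, and the method itself may compute the value differently.

```python
def partial_r2(t_statistic, dof):
    """Partial R^2 of a regressor from its t-statistic and residual dof."""
    return t_statistic ** 2 / (t_statistic ** 2 + dof)


print(partial_r2(2.0, 96))  # 0.04
```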
- plot(plot_type='estimate', critical_value=None, x_limit=0.8, y_limit=0.8, num_points_per_contour=200, plot_size=(7, 7), contours_color='blue', critical_contour_color='red', label_fontsize=9, contour_linewidths=0.75, contour_linestyles='solid', contours_label_color='black', critical_label_color='red', unadjusted_estimate_marker='D', unadjusted_estimate_color='black', adjusted_estimate_marker='^', adjusted_estimate_color='red', legend_position=(1.6, 0.6))[source]
Plots and summarizes the sensitivity bounds as a contour plot, as they vary with the partial R^2 of the unobserved confounder(s) with the treatment and the outcome. Two types of plots can be generated, based on adjusted estimates or adjusted t-values.
X-axis: Partial R^2 of treatment and unobserved confounder(s)
Y-axis: Partial R^2 of outcome and unobserved confounder(s)
We also plot bounds on the partial R^2 of the unobserved confounders obtained from observed covariates.
- Parameters:
plot_type – “estimate” or “t-value”
critical_value – special reference value of the estimate or t-value that will be highlighted in the plot
x_limit – plot’s maximum x_axis value (default = 0.8)
y_limit – plot’s maximum y_axis value (default = 0.8)
num_points_per_contour – number of points to calculate and plot each contour line (default = 200)
plot_size – tuple denoting the size of the plot (default = (7,7))
contours_color – color of contour line (default = blue) String or array. If array, lines will be plotted with the specific color in ascending order.
critical_contour_color – color of threshold line (default = red)
label_fontsize – fontsize for labelling contours (default = 9)
contour_linewidths – linewidths for contours (default = 0.75)
contour_linestyles – linestyles for contours (default = “solid”) See : https://matplotlib.org/3.5.0/gallery/lines_bars_and_markers/linestyles.html for more examples
contours_label_color – color of contour line label (default = black)
critical_label_color – color of threshold line label (default = red)
unadjusted_estimate_marker – marker type for unadjusted estimate in the plot (default = ‘D’) See: https://matplotlib.org/stable/api/markers_api.html
adjusted_estimate_marker – marker type for bias adjusted estimates in the plot (default = ‘^’)
unadjusted_estimate_color – marker color for unadjusted estimate in the plot (default = “black”)
adjusted_estimate_color – marker color for bias adjusted estimates in the plot (default = “red”)
legend_position – tuple denoting the position of the legend (default = (1.6, 0.6))
- plot_estimate(r2tu_w, r2yu_tw)[source]
Computes the contours, threshold line and bounds for plotting estimates. Contour lines (z - axis) correspond to the adjusted estimate values for different values of r2tu_w (x) and r2yu_tw (y).
- Parameters:
r2tu_w – hypothetical partial R^2 of confounder with treatment (x - axis)
r2yu_tw – hypothetical partial R^2 of confounder with outcome (y - axis)
- Returns:
contour_values : values of contour lines for the plot
critical_estimate : threshold point
estimate_bounds : estimate values for unobserved confounders (bias adjusted estimates)
- plot_t(r2tu_w, r2yu_tw)[source]
Computes the contours, threshold line and bounds for plotting t. Contour lines (z - axis) correspond to the adjusted t values for different values of r2tu_w (x) and r2yu_tw (y).
- Parameters:
r2tu_w – hypothetical partial R^2 of confounder with treatment (x - axis)
r2yu_tw – hypothetical partial R^2 of confounder with outcome (y - axis)
- Returns:
contour_values : values of contour lines for the plot
critical_t : threshold point
t_bounds : t-value for unobserved confounders (bias adjusted t values)
- robustness_value_func(alpha=1.0)[source]
Function to calculate the robustness value. It is the minimum strength of association that confounders must have with treatment and outcome to change conclusions. Robustness value describes how strong the association must be in order to reduce the estimated effect by (100 * percent_change_estimate)%. Robustness value close to 1 means the treatment effect can handle strong confounders explaining almost all residual variation of the treatment and the outcome. Robustness value close to 0 means that even very weak confounders can also change the results.
- Parameters:
alpha – confidence interval (default = 1)
- Returns:
robustness value
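For the point estimate, the robustness value of Cinelli and Hazlett (2020) has the closed form RV_q = 0.5 * (sqrt(f_q^4 + 4 * f_q^2) - f_q^2), where f_q = q * |t| / sqrt(dof) and q plays the role of percent_change_estimate. The sketch below covers only this case and does not model the alpha argument. The function name robustness_value and its signature are assumptions made for the example.

```python
import math


def robustness_value(t_statistic, dof, q=1.0):
    """Robustness value for reducing the point estimate by a fraction q."""
    # Partial Cohen's f of the treatment with the outcome, scaled by q.
    f_q = q * abs(t_statistic) / math.sqrt(dof)
    return 0.5 * (math.sqrt(f_q ** 4 + 4 * f_q ** 2) - f_q ** 2)


rv = robustness_value(t_statistic=4.0, dof=100)
print(round(rv, 3))
```

A confounder whose partial R^2 with both the treatment and the outcome equals rv satisfies rv^2 / (1 - rv) = f_q^2, which is the condition under which the bias matches q times the estimate.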
dowhy.causal_refuters.non_parametric_sensitivity_analyzer module
- class dowhy.causal_refuters.non_parametric_sensitivity_analyzer.NonParametricSensitivityAnalyzer(*args, theta_s, plugin_reisz=False, **kwargs)[source]
Bases:
PartialLinearSensitivityAnalyzer
Non-parametric sensitivity analysis for causal estimators.
- Two important quantities used to estimate the bias are alpha and g.
g := E[Y | T, W, Z] denotes the long regression function
g_s := E[Y | T, W] denotes the short regression function
α := (T - E[T | W, Z]) / E[(T - E[T | W, Z])^2] denotes the long Riesz representer
α_s := (T - E[T | W]) / E[(T - E[T | W])^2] denotes the short Riesz representer
Bias = E[(g_s - g)(α_s - α)]
Thus, the bound is the product of additional variations that omitted confounders generate in the regression function and in the Riesz representer for partially linear models. It can be written as
Bias = S * Cg * Calpha
where Cg and Calpha are explanatory powers of the confounder and S^2 = E[(Y - g_s)^2] * E[α_s^2]
- Based on this work:
Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., & Syrgkanis, V. (2022). Long Story Short: Omitted Variable Bias in Causal Machine Learning (No. w30302). National Bureau of Economic Research.
- Parameters:
estimator – estimator of the causal model
num_splits – number of splits for cross validation. (default = 5)
shuffle_data – shuffle data or not before splitting into folds (default = False)
shuffle_random_seed – seed for randomly shuffling data
benchmark_common_causes – names of variables for bounding strength of confounders
significance_level – significance level for statistical inference (default = 0.05)
frac_strength_treatment – strength of association between unobserved confounder and treatment compared to benchmark covariate
frac_strength_outcome – strength of association between unobserved confounder and outcome compared to benchmark covariate
g_s_estimator_list – list of estimator objects for finding g_s. These objects should have fit() and predict() functions.
g_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “g_s_estimator_list”.
alpha_s_estimator_list – list of estimator objects for finding the treatment predictor which is used for alpha_s estimation. These objects should have fit() and predict_proba() functions.
alpha_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “alpha_s_estimator_list”. The order of the dictionaries in the list should be consistent with the estimator objects order in “g_s_estimator_list”
- Parameters:
observed_common_causes – common causes dataframe
outcome – outcome dataframe
treatment – treatment dataframe
theta_s – point estimate for the estimator
plugin_reisz – whether to use the plug-in Riesz estimator. False by default. The plug-in estimator works only for single-dimensional, binary treatment.
- check_sensitivity(plot=True)[source]
Function to perform sensitivity analysis. The following formulae are used to obtain the upper and lower bound respectively.
θ+ = θ_s + S * C_g * C_α
θ- = θ_s - S * C_g * C_α
where θ_s is the obtained estimate and S^2 = E[(Y - g_s)^2] * E[α_s^2]. S is obtained by debiased machine learning.
θ_s = E[m(W, g_s) + (Y - g_s) * α_s]
σ² = E[(Y - g_s)^2]
ν^2 = 2 * E[m(W, α_s)] - E[α_s^2]
- Parameters:
plot – plot = True generates a plot of lower confidence bound of the estimate for different variations of unobserved confounding. plot = False overrides the setting
- Returns:
instance of NonParametricSensitivityAnalyzer class
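The bounds θ_s ± S * C_g * C_α can be illustrated with sample averages. The sketch below is a deliberately simplified plug-in version: it estimates E[α_s^2] by the sample mean of the squared representer values instead of the debiased ν^2 expression, and it takes the residuals Y - g_s and the values of α_s as given. The function name bias_bounds and its arguments are assumptions made for the example.

```python
import math


def bias_bounds(theta_s, residuals, alpha_s, c_g, c_alpha):
    """Return (lower, upper) = theta_s -/+ S * C_g * C_alpha."""
    n = len(residuals)
    sigma2 = sum(r ** 2 for r in residuals) / n  # sample E[(Y - g_s)^2]
    nu2 = sum(a ** 2 for a in alpha_s) / n       # plug-in E[alpha_s^2]
    s = math.sqrt(sigma2 * nu2)
    bound = s * c_g * c_alpha
    return theta_s - bound, theta_s + bound


lower, upper = bias_bounds(theta_s=1.5,
                           residuals=[1.0, -1.0],
                           alpha_s=[2.0, -2.0],
                           c_g=0.5, c_alpha=0.5)
print(lower, upper)  # 1.0 2.0
```

Setting either C_g or C_α to zero collapses the interval to the point estimate θ_s, which corresponds to assuming no unobserved confounding.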
- get_alpharegression_var(X, numeric_features, split_indices, reisz_model=None)[source]
Calculates the variance of the Riesz function
- Parameters:
X – numpy array containing set of regressors
split_indices – training and testing data indices obtained after cross folding
- Returns:
variance of reisz function
- get_phi_lower_upper(Cg, Calpha)[source]
Calculate lower and upper influence function (phi)
- Parameters:
Cg – measure of strength of confounding that omitted variables generate in outcome regression
Calpha – measure of strength of confounding that omitted variables generate in treatment regression
- Returns:
lower bound of phi, upper bound of phi
dowhy.causal_refuters.partial_linear_sensitivity_analyzer module
- class dowhy.causal_refuters.partial_linear_sensitivity_analyzer.PartialLinearSensitivityAnalyzer(estimator=None, num_splits=5, shuffle_data=False, shuffle_random_seed=None, reisz_polynomial_max_degree=3, significance_level=0.05, effect_strength_treatment=None, effect_strength_outcome=None, benchmark_common_causes=None, frac_strength_treatment=None, frac_strength_outcome=None, observed_common_causes=None, treatment=None, outcome=None, g_s_estimator_list=None, alpha_s_estimator_list=None, g_s_estimator_param_list=None, alpha_s_estimator_param_list=None, **kwargs)[source]
Bases:
object
Class to perform sensitivity analysis for partially linear model.
An efficient version of the non parametric sensitivity analyzer that works for estimators that return residuals of regression from confounders on treatment and outcome, such as the DML method. For all other methods (or when the partially linear assumption is not guaranteed to be satisfied), use the non-parametric sensitivity analysis.
- Based on this work:
Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., & Syrgkanis, V. (2022). Long Story Short: Omitted Variable Bias in Causal Machine Learning (No. w30302). National Bureau of Economic Research.
- Parameters:
estimator – estimator of the causal model
num_splits – number of splits for cross validation. (default = 5)
shuffle_data – shuffle data or not before splitting into folds (default = False)
shuffle_random_seed – seed for randomly shuffling data
effect_strength_treatment – C^2_T, list of plausible sensitivity parameters for effect of confounder on treatment
effect_strength_outcome – C^2_Y, list of plausible sensitivity parameters for effect of confounder on outcome
benchmark_common_causes – names of variables for bounding strength of confounders
significance_level – significance level for statistical inference (default = 0.05)
frac_strength_treatment – strength of association between unobserved confounder and treatment compared to benchmark covariate
frac_strength_outcome – strength of association between unobserved confounder and outcome compared to benchmark covariate
g_s_estimator_list – list of estimator objects for finding g_s. These objects should have fit() and predict() functions.
g_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “g_s_estimator_list”.
alpha_s_estimator_list – list of estimator objects for finding the treatment predictor which is used for alpha_s estimation. These objects should have fit() and predict_proba() functions.
alpha_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “alpha_s_estimator_list”.
The order of the dictionaries in the list should be consistent with the estimator objects order in “g_s_estimator_list”
- Parameters:
observed_common_causes – common causes dataframe
outcome – outcome dataframe
treatment – treatment dataframe
- calculate_robustness_value(alpha, is_partial_linear)[source]
Function to compute the robustness value of estimate against the confounders
- Parameters:
alpha – significance level for statistical inference
- Returns:
robustness value
- check_sensitivity(plot=True)[source]
Function to perform sensitivity analysis.
- Parameters:
plot – if True, generates a plot of the lower confidence bound of the estimate for different variations of unobserved confounding; if False, no plot is generated
- Returns:
instance of PartialLinearSensitivityAnalyzer class
- compute_r2diff_benchmarking_covariates(treatment_df, features, T, Y, W, benchmark_common_causes, split_indices=None, second_stage_linear=False, is_partial_linear=True)[source]
Computes the change in partial R^2 due to presence of unobserved confounders
- Parameters:
split_indices – training and testing data indices obtained after cross folding
second_stage_linear – True if second stage regression is linear else False (default = False)
is_partial_linear – True if the data-generating process is assumed to be partially linear
- Returns delta_r2_y_wj:
observed additive gains in explanatory power with outcome when including benchmark covariate on regression equation
- Returns delta_r2t_wj:
observed additive gains in explanatory power with treatment when including benchmark covariate on regression equation
- get_confidence_levels(r2yu_tw, r2tu_w, significance_level, is_partial_linear)[source]
Returns lower and upper bounds for the effect estimate, given different explanatory powers of unobserved confounders. It uses the following definitions.
Y_residual = Y - E[Y | X, T] (residualized outcome)
T_residual = T - E[T | X] (residualized treatment)
theta = E[(Y - E[Y | X, T])(T - E[T | X])] / E[(T - E[T | X])^2]
σ^2 = E[(Y - E[Y | X, T])^2] (expected value of residual outcome)
ν^2 = E[(T - E[T | X])^2] (expected value of residual treatment)
ψ_θ = m(Ws, g) + (Y - g(Ws)) * α(Ws) - θ
ψ_σ^2 = (Y - g(Ws))^2 - σ^2
ψ_ν^2 = (2 * m(Ws, α) - α^2) - ν^2
- Parameters:
r2yu_tw – proportion of residual variance in the outcome explained by confounders
r2tu_w – proportion of residual variance in the treatment explained by confounders
significance_level – significance level for statistical inference (default = 0.05)
is_partial_linear – whether the data-generating process is assumed to be partially linear
- Returns lower_confidence_bound:
lower limit of confidence bound of the estimate
- Returns upper_confidence_bound:
upper limit of confidence bound of the estimate
- Returns bias:
omitted variable bias for the confounding scenario
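The moment definitions used by get_confidence_levels can be computed directly once residuals are available. The sketch below is illustrative only: partial_linear_moments is a hypothetical helper, the data are simulated, and it assumes the usual partialling-out form in which the outcome in the numerator of theta is residualized on X alone. That is an assumption of this sketch, not a statement about the library's internals.

```python
import random


def partial_linear_moments(y_res_x, t_res):
    """Return (theta, sigma2, nu2) from outcome and treatment residuals.

    y_res_x: Y - E[Y | X]  (assumption: residualized on X only)
    t_res:   T - E[T | X]
    """
    n = len(t_res)
    nu2 = sum(t * t for t in t_res) / n
    theta = sum(y * t for y, t in zip(y_res_x, t_res)) / n / nu2
    # Residual of the long regression: what is left after removing theta * T_residual.
    eps = [y - theta * t for y, t in zip(y_res_x, t_res)]
    sigma2 = sum(e * e for e in eps) / n
    return theta, sigma2, nu2


# Simulated residuals with a known slope of 1.5 and noise variance 0.25.
rng = random.Random(0)
t_res = [rng.gauss(0, 1) for _ in range(5000)]
y_res_x = [1.5 * t + rng.gauss(0, 0.5) for t in t_res]
theta, sigma2, nu2 = partial_linear_moments(y_res_x, t_res)
print(theta, sigma2, nu2)
```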
- get_phi_lower_upper(Cg, Calpha)[source]
Calculate lower and upper influence function (phi)
- Parameters:
Cg – measure of strength of confounding that omitted variables generate in outcome regression
Calpha – measure of strength of confounding that omitted variables generate in treatment regression
- Returns:
lower bound of phi, upper bound of phi
- get_regression_r2(X, Y, numeric_features, split_indices, regression_model=None)[source]
Calculates the Pearson non-parametric partial R^2 from a regression function.
- Parameters:
X – numpy array containing set of regressors
Y – outcome variable in regression
numeric_features – list of indices of columns with numeric features
split_indices – training and testing data indices obtained after cross folding
- Returns:
partial R^2 value
- perform_benchmarking(r2yu_tw, r2tu_w, significance_level, is_partial_linear=True)[source]
- Parameters:
r2yu_tw – proportion of residual variance in the outcome explained by confounders
r2tu_w – proportion of residual variance in the treatment explained by confounders
significance_level – the desired significance level for the bounds
is_partial_linear – whether we assume a partially linear data-generating process
- Returns:
python dictionary storing values of r2tu_w, r2yu_tw, short estimate, bias, lower_ate_bound, upper_ate_bound, lower_confidence_bound, upper_confidence_bound
- plot(plot_type='lower_confidence_bound', plot_size=(7, 7), contours_color='blue', critical_contour_color='red', label_fontsize=9, contour_linewidths=0.75, contour_linestyles='solid', contours_label_color='black', critical_label_color='red', unadjusted_estimate_marker='D', unadjusted_estimate_color='black', adjusted_estimate_marker='^', adjusted_estimate_color='red', legend_position=(1.05, 1))[source]
Plots and summarizes the sensitivity bounds as a contour plot, as they vary with the partial R^2 of the unobserved confounder(s) with the treatment and the outcome. Two types of plots can be generated, based on adjusted estimates or adjusted t-values.
X-axis: Partial R^2 of treatment and unobserved confounder(s)
Y-axis: Partial R^2 of outcome and unobserved confounder(s)
We also plot bounds on the partial R^2 of the unobserved confounders obtained from observed covariates.
- Parameters:
plot_type – possible values are ‘bias’, ‘lower_ate_bound’, ‘upper_ate_bound’, ‘lower_confidence_bound’, ‘upper_confidence_bound’
plot_size – tuple denoting the size of the plot (default = (7,7))
contours_color – color of contour line (default = blue) String or array. If array, lines will be plotted with the specific color in ascending order.
critical_contour_color – color of threshold line (default = red)
label_fontsize – fontsize for labelling contours (default = 9)
contour_linewidths – linewidths for contours (default = 0.75)
contour_linestyles – linestyles for contours (default = “solid”) See : https://matplotlib.org/3.5.0/gallery/lines_bars_and_markers/linestyles.html for more examples
contours_label_color – color of contour line label (default = black)
critical_label_color – color of threshold line label (default = red)
unadjusted_estimate_marker – marker type for unadjusted estimate in the plot (default = ‘D’) See: https://matplotlib.org/stable/api/markers_api.html
unadjusted_estimate_color – marker color for unadjusted estimate in the plot (default = “black”)
adjusted_estimate_marker – marker type for bias adjusted estimates in the plot (default = ‘^’)
adjusted_estimate_color – marker color for bias adjusted estimates in the plot (default = “red”)
legend_position – tuple denoting the position of the legend (default = (1.05, 1))
dowhy.causal_refuters.placebo_treatment_refuter module
- class dowhy.causal_refuters.placebo_treatment_refuter.PlaceboTreatmentRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by replacing treatment with a randomly-generated placebo variable.
Supports additional parameters that can be specified in the refute_estimate() method. For joblib-related parameters (n_jobs, verbose), please refer to the joblib documentation for more details (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- Parameters:
placebo_type (str, optional) – Default is to generate random values for the treatment. If placebo_type is “permute”, then the original treatment values are permuted by row.
num_simulations (int, optional) – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default
random_state (int, RandomState, optional) – The seed value to be added if we wish to repeat the same random behavior. If we want to repeat the same behavior we push the same seed in the pseudo-random generator.
n_jobs (int, optional) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose (int, optional) – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
- class dowhy.causal_refuters.placebo_treatment_refuter.PlaceboType(value)[source]
Bases:
Enum
An enumeration.
- DEFAULT = 'Random Data'
- PERMUTE = 'permute'
- dowhy.causal_refuters.placebo_treatment_refuter.refute_placebo_treatment(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_names: List, num_simulations: int = 100, placebo_type: PlaceboType = PlaceboType.DEFAULT, random_state: Optional[Union[int, RandomState]] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by replacing treatment with a randomly-generated placebo variable.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_names – list: List of treatments
num_simulations – The number of simulations to be run, which defaults to
CausalRefuter.DEFAULT_NUM_SIMULATIONS
placebo_type – Default is to generate random values for the treatment. If placebo_type is “permute”, then the original treatment values are permuted by row.
random_state – The seed value to be added if we wish to repeat the same random behavior. If we want to repeat the same behavior we push the same seed in the pseudo-random generator.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
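The idea behind the “permute” placebo can be shown without the library. The following is a minimal sketch on simulated data: diff_in_means and placebo_permute_effects are hypothetical helpers, and a plain difference in means stands in for whatever estimator produced the original estimate. If the estimate reflects a real effect of the treatment, the placebo effects should centre on zero.

```python
import random
import statistics


def diff_in_means(t, y):
    """Effect estimate for a binary treatment: mean(y | t=1) - mean(y | t=0)."""
    treated = [yi for ti, yi in zip(t, y) if ti == 1]
    control = [yi for ti, yi in zip(t, y) if ti == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def placebo_permute_effects(t, y, num_simulations=100, seed=0):
    """Re-estimate the effect after permuting the treatment column by row."""
    rng = random.Random(seed)
    effects = []
    for _ in range(num_simulations):
        placebo = t[:]          # copy, so the real treatment is untouched
        rng.shuffle(placebo)    # breaks any link between treatment and outcome
        effects.append(diff_in_means(placebo, y))
    return effects


# Simulated data with a true effect of 2.0.
rng = random.Random(42)
t = [rng.randint(0, 1) for _ in range(2000)]
y = [2.0 * ti + rng.gauss(0, 1) for ti in t]

original = diff_in_means(t, y)
placebo_mean = statistics.fmean(placebo_permute_effects(t, y))
print(original, placebo_mean)
```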
dowhy.causal_refuters.random_common_cause module
- class dowhy.causal_refuters.random_common_cause.RandomCommonCause(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by introducing a randomly generated confounder (that may have been unobserved).
Supports additional parameters that can be specified in the refute_estimate() method. For joblib-related parameters (n_jobs, verbose), please refer to the joblib documentation for more details (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- Parameters:
num_simulations (int, optional) – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default
random_state (int, RandomState, optional) – The seed value to be added if we wish to repeat the same random behavior. If we wish to repeat the same behavior we push the same seed in the pseudo-random generator
n_jobs (int, optional) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose (int, optional) – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
- dowhy.causal_refuters.random_common_cause.refute_random_common_cause(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, num_simulations: int = 100, random_state: Optional[Union[int, RandomState]] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by introducing a randomly generated confounder (that may have been unobserved).
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
num_simulations – The number of simulations to be run, which defaults to
CausalRefuter.DEFAULT_NUM_SIMULATIONS
random_state – The seed value to be added if we wish to repeat the same random behavior. If we want to repeat the same behavior we push the same seed in the pseudo-random generator.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
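To see why a randomly generated common cause should leave a sound estimate almost unchanged, consider the sketch below. It is not the library's code: slope, residualize and adjusted_effect are hypothetical helpers, the data are simulated, and the adjustment uses the Frisch-Waugh-Lovell partialling-out identity for a single extra covariate.

```python
import random
import statistics


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var


def residualize(v, w):
    """Residuals of v after a simple linear regression on w."""
    b = slope(w, v)
    mv, mw = statistics.fmean(v), statistics.fmean(w)
    return [vi - mv - b * (wi - mw) for vi, wi in zip(v, w)]


def adjusted_effect(t, y, w):
    """Coefficient on t in the regression y ~ t + w, via partialling out w."""
    return slope(residualize(t, w), residualize(y, w))


rng = random.Random(7)
t = [rng.gauss(0, 1) for _ in range(2000)]
y = [2.0 * ti + rng.gauss(0, 1) for ti in t]
w_random = [rng.gauss(0, 1) for _ in t]  # independent of both t and y

original = slope(t, y)
with_random_cause = adjusted_effect(t, y, w_random)
print(original, with_random_cause)
```

A large gap between the two numbers would suggest that the estimator is unstable rather than that the random variable is a real confounder.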
dowhy.causal_refuters.refute_estimate module
- dowhy.causal_refuters.refute_estimate.refute_estimate(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: Optional[str] = None, outcome_name: Optional[str] = None, refuters: List[Callable[[...], Union[CausalRefutation, List[CausalRefutation]]]] = [<function sensitivity_simulation>, <function refute_bootstrap>, <function refute_data_subset>, <function refute_dummy_outcome>, <function refute_placebo_treatment>, <function refute_random_common_cause>], **kwargs) List[CausalRefutation] [source]
Executes a list of refuters using the default parameters.
Only refuters that return a CausalRefutation or a list of CausalRefutation objects are supported.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment (Optional)
outcome_name – str: Name of the outcome (Optional)
refuters – list: List of refuters to execute
**kwargs – Replace any default for the provided list of refuters
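The dispatch pattern described for refute_estimate (one set of keyword arguments shared by several refuter functions, each returning one result or a list of results) can be sketched as follows. The names run_refuters, toy_placebo and toy_subset are hypothetical stand-ins, not DoWhy functions; the only detail taken from this page is that the refuter signatures end in **_ so that unused keyword arguments are ignored.

```python
from typing import Any, Callable, List


def run_refuters(refuters: List[Callable[..., Any]], **kwargs: Any) -> List[Any]:
    """Call every refuter with the same keyword arguments and flatten the results."""
    results: List[Any] = []
    for refuter in refuters:
        outcome = refuter(**kwargs)
        if isinstance(outcome, list):
            results.extend(outcome)   # a refuter may return several refutations
        else:
            results.append(outcome)
    return results


# Toy refuters: each picks the arguments it understands and swallows the rest via **_.
def toy_placebo(data, num_simulations=100, **_):
    return f"placebo:{len(data)}:{num_simulations}"


def toy_subset(data, subset_fraction=0.8, **_):
    return [f"subset:{subset_fraction}"]


results = run_refuters(
    [toy_placebo, toy_subset],
    data=[1, 2, 3],
    num_simulations=10,
    subset_fraction=0.5,
)
print(results)
```

Passing a keyword such as num_simulations once overrides the default of every refuter that accepts it, which matches the description of **kwargs above.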
dowhy.causal_refuters.reisz module
- class dowhy.causal_refuters.reisz.PluginReisz(propensity_model)[source]
Bases:
object
Plugin reisz function for average treatment effect
- class dowhy.causal_refuters.reisz.ReiszRepresenter(*args: Any, **kwargs: Any)[source]
Bases:
BaseGRF
Generalized Random Forest to estimate Reisz Representer (RR). See: https://github.com/microsoft/EconML/blob/main/econml/grf/_base_grf.py
- Parameters:
reisz_functions – List of polynomial functions of n degree to approximate reisz representer created using create_polynomial_function
moment_function – moment function m(W,g) whose expected value is used to calculate estimate
l2_regularizer – l2 penalty while modeling (default = 1e-3)
For tuning other parameters see https://econml.azurewebsites.net/_autosummary/econml.grf.CausalForest.html
- dowhy.causal_refuters.reisz.get_alpha_estimator(cv, X, max_degree=None, estimator_list=None, estimator_param_list=None, numeric_features=None, plugin_reisz=True)[source]
Finds the best estimator for reisz representer (alpha_s )
- Parameters:
cv – training and testing data indices obtained after K-folding the dataset
X – treatment+confounders
max_degree – degree of the polynomial function used to approximate alpha_s
param_grid_dict – python dictionary with parameters to tune the ReiszRepresenter estimator
- Returns:
estimator for alpha_s
This method assumes a binary T.
- dowhy.causal_refuters.reisz.get_generic_regressor(cv, X, Y, max_degree=3, estimator_list=None, estimator_param_list=None, numeric_features=None)[source]
Finds the best estimator for regression function (g_s)
- Parameters:
cv – training and testing data indices obtained after K-folding the dataset
X – regressors data for training the regression model
Y – outcome data for training the regression model
max_degree – degree of the polynomial function used to approximate the regression function
estimator_list – list of estimator objects for finding the regression function
estimator_param_list – list of dictionaries with parameters for tuning respective estimators in estimator_list
numeric_features – list of indices of numeric features in the dataset
- Returns:
estimator for Reisz Regression function
Module contents
- class dowhy.causal_refuters.AddUnobservedCommonCause(*args, **kwargs)[source]
Bases:
CausalRefuter
Add an unobserved confounder for refutation.
- AddUnobservedCommonCause class supports three methods:
Simulation of an unobserved confounder
Linear partial R2 : Sensitivity Analysis for linear models.
Non-Parametric partial R2 based : Sensitivity Analysis for non-parametric models.
Supports additional parameters that can be specified in the refute_estimate() method.
Initialize the parameters required for the refuter.
For direct_simulation, if effect_strength_on_treatment or effect_strength_on_outcome is not given, it is calculated automatically as a range between the minimum and maximum effect strength of observed confounders on treatment and outcome respectively.
- Parameters:
simulation_method – The method to use for simulating effect of unobserved confounder. Possible values are [“direct-simulation”, “linear-partial-R2”, “non-parametric-partial-R2”, “e-value”].
confounders_effect_on_treatment – str : The type of effect on the treatment due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
confounders_effect_on_outcome – str : The type of effect on the outcome due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
effect_strength_on_treatment – float, numpy.ndarray: [Used when simulation_method=”direct-simulation”] Strength of the confounder’s effect on treatment. When confounders_effect_on_treatment is linear, it is the regression coefficient. When the confounders_effect_on_treatment is binary flip, it is the probability with which effect of unobserved confounder can invert the value of the treatment.
effect_strength_on_outcome – float, numpy.ndarray: Strength of the confounder’s effect on outcome. Its interpretation depends on confounders_effect_on_outcome and the simulation_method. When simulation_method is direct-simulation, for a linear effect it behaves like the regression coefficient and for a binary flip, it is the probability with which it can invert the value of the outcome.
partial_r2_confounder_treatment – float, numpy.ndarray: [Used when simulation_method is linear-partial-R2 or non-parametric-partial-R2] Partial R2 of the unobserved confounder wrt the treatment conditioned on the observed confounders. Only in the case of general non-parametric-partial-R2, it is the fraction of variance in the reisz representer that is explained by the unobserved confounder; specifically (1-r), where r is the ratio of variance of reisz representer, alpha^2, based on observed confounders and that based on all confounders.
partial_r2_confounder_outcome – float, numpy.ndarray: [Used when simulation_method is linear-partial-R2 or non-parametric-partial-R2] Partial R2 of the unobserved confounder wrt the outcome conditioned on the treatment and observed confounders.
frac_strength_treatment – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on treatment. Defaults to 1.
frac_strength_outcome – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on outcome. Defaults to 1.
plotmethod – string: Type of plot to be shown. If None, no plot is generated. This parameter is used only when more than one treatment confounder effect values or outcome confounder effect values are provided. Default is “colormesh”. Supported values are “contour”, “colormesh” when more than one value is provided for both confounder effect value parameters; “line” when provided for only one of them.
percent_change_estimate – It is the percentage of reduction of treatment estimate that could alter the results (default = 1). If percent_change_estimate = 1, the robustness value describes the strength of association of confounders with treatment and outcome in order to reduce the estimate by 100%, i.e. bring it down to 0. (relevant only for Linear Sensitivity Analysis, ignore for rest)
confounder_increases_estimate – True implies that confounder increases the absolute value of estimate and vice versa. (Default = False). (relevant only for Linear Sensitivity Analysis, ignore for rest)
benchmark_common_causes – names of variables for bounding strength of confounders. (relevant only for partial-r2 based simulation methods)
significance_level – significance level for statistical inference (default = 0.05). (relevant only for partial-r2 based simulation methods)
null_hypothesis_effect – assumed effect under the null hypothesis. (relevant only for linear-partial-R2, ignore for rest)
plot_estimate – Generate contour plot for estimate while performing sensitivity analysis. (default = True). (relevant only for partial-r2 based simulation methods)
num_splits – number of splits for cross validation. (default = 5). (relevant only for non-parametric-partial-R2 simulation method)
shuffle_data – shuffle data or not before splitting into folds (default = False). (relevant only for non-parametric-partial-R2 simulation method)
shuffle_random_seed – seed for randomly shuffling data. (relevant only for non-parametric-partial-R2 simulation method)
alpha_s_estimator_param_list – list of dictionaries with parameters for finding alpha_s. (relevant only for non-parametric-partial-R2 simulation method)
g_s_estimator_list – list of estimator objects for finding g_s. These objects should have fit() and predict() functions implemented. (relevant only for non-parametric-partial-R2 simulation method)
g_s_estimator_param_list – list of dictionaries with parameters for tuning respective estimators in “g_s_estimator_list”. The order of the dictionaries in the list should be consistent with the estimator objects order in “g_s_estimator_list”. (relevant only for non-parametric-partial-R2 simulation method)
- class dowhy.causal_refuters.BootstrapRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by running it on a random sample of the data containing measurement error in the confounders. This allows us to find the ability of the estimator to find the effect of the treatment on the outcome.
It supports additional parameters that can be specified in the refute_estimate() method.
- Parameters:
num_simulations (int, optional) – The number of simulations to be run, CausalRefuter.DEFAULT_NUM_SIMULATIONS by default
sample_size (int, optional) – The size of each bootstrap sample, which is the size of the original data by default
required_variables (int, list, bool, optional) – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables leaving the treatment and the outcome
An integer argument refers to how many variables will be used for estimating the value of the outcome
A list explicitly refers to which variables will be used to estimate the outcome. Furthermore, it gives the ability to explicitly select or deselect the covariates present in the estimation of the outcome. This is done by either adding or explicitly removing variables from the list as shown below:
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
- Parameters:
noise (float, optional) – The standard deviation of the noise to be added to the data, which is BootstrapRefuter.DEFAULT_STD_DEV by default
probability_of_change (float, optional) – It specifies the probability with which we change the data for a boolean or categorical variable. It is noise by default, only if the value of noise is less than 1.
random_state (int, RandomState, optional) – The seed value to be added if we wish to repeat the same random behavior. For this purpose, we repeat the same seed in the pseudo-random generator.
- DEFAULT_NUMBER_OF_TRIALS = 1
- DEFAULT_STD_DEV = 0.1
- DEFAULT_SUCCESS_PROBABILITY = 0.5
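The bootstrap refutation described above (resample the data, perturb the confounders with measurement noise, re-estimate) can be sketched on simulated data. Everything here is illustrative: the helper names are hypothetical, a single continuous confounder is assumed, and the noise standard deviation of 0.1 simply mirrors DEFAULT_STD_DEV.

```python
import random
import statistics


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / sum((xi - mx) ** 2 for xi in x)


def residualize(v, w):
    """Residuals of v after a simple linear regression on w."""
    b = slope(w, v)
    mv, mw = statistics.fmean(v), statistics.fmean(w)
    return [vi - mv - b * (wi - mw) for vi, wi in zip(v, w)]


def adjusted_effect(t, y, w):
    """Coefficient on t in y ~ t + w, via partialling out w."""
    return slope(residualize(t, w), residualize(y, w))


def bootstrap_with_noise(t, y, w, num_simulations=50, noise=0.1, seed=0):
    """Re-estimate on bootstrap samples whose confounder carries Gaussian noise."""
    rng = random.Random(seed)
    n = len(t)
    effects = []
    for _ in range(num_simulations):
        idx = rng.choices(range(n), k=n)                     # sample rows with replacement
        w_noisy = [w[i] + rng.gauss(0, noise) for i in idx]  # measurement error in w
        effects.append(adjusted_effect([t[i] for i in idx], [y[i] for i in idx], w_noisy))
    return effects


# Simulated data: w confounds t and y, and the true effect of t on y is 2.0.
rng = random.Random(1)
w = [rng.gauss(0, 1) for _ in range(1000)]
t = [wi + rng.gauss(0, 1) for wi in w]
y = [2.0 * ti + 3.0 * wi + rng.gauss(0, 1) for ti, wi in zip(t, w)]

original = adjusted_effect(t, y, w)
bootstrap_mean = statistics.fmean(bootstrap_with_noise(t, y, w))
print(original, bootstrap_mean)
```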
- class dowhy.causal_refuters.DataSubsetRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by rerunning it on a random subset of the original data.
Supports additional parameters that can be specified in the refute_estimate() method. For joblib-related parameters (n_jobs, verbose), please refer to the joblib documentation for more details (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- Parameters:
subset_fraction (float, optional) – Fraction of the data to be used for re-estimation, which is DataSubsetRefuter.DEFAULT_SUBSET_FRACTION by default.
num_simulations (int, optional) – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default
random_state (int, RandomState, optional) – The seed value to be added if we wish to repeat the same random behavior. If we wish to repeat the same behavior we push the same seed in the pseudo-random generator
n_jobs (int, optional) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose (int, optional) – The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
- DEFAULT_SUBSET_FRACTION = 0.8
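A data-subset refutation amounts to re-running the estimator on random fractions of the rows and checking that the estimate stays close to the original. The sketch below assumes a simple OLS slope as the estimator and simulated data; slope and subset_effects are hypothetical helpers, and the fraction 0.8 mirrors DEFAULT_SUBSET_FRACTION.

```python
import random
import statistics


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / sum((xi - mx) ** 2 for xi in x)


def subset_effects(t, y, subset_fraction=0.8, num_simulations=100, seed=0):
    """Re-estimate the effect on random subsets drawn without replacement."""
    rng = random.Random(seed)
    n = len(t)
    k = int(subset_fraction * n)
    effects = []
    for _ in range(num_simulations):
        idx = rng.sample(range(n), k)
        effects.append(slope([t[i] for i in idx], [y[i] for i in idx]))
    return effects


rng = random.Random(3)
t = [rng.gauss(0, 1) for _ in range(1000)]
y = [2.0 * ti + rng.gauss(0, 1) for ti in t]

original = slope(t, y)
subset_mean = statistics.fmean(subset_effects(t, y))
print(original, subset_mean)
```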
- class dowhy.causal_refuters.DummyOutcomeRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by replacing the outcome with a simulated variable for which the true causal effect is known.
In the simplest case, the dummy outcome is an independent, randomly generated variable. By definition, the true causal effect should be zero.
More generally, the dummy outcome uses the observed relationship between confounders and outcome (conditional on treatment) to create a more realistic outcome for which the treatment effect is known to be zero. If the goal is to simulate a dummy outcome with a non-zero true causal effect, then we can add an arbitrary function h(t) to the dummy outcome’s generation process and then the causal effect becomes h(t=1)-h(t=0).
Note that this general procedure only works for the backdoor criterion.
1. We find f(W) for each value of treatment. That is, keeping the treatment constant, we fit a predictor to estimate the effect of confounders W on outcome y. Note that since f(W) simply defines a new DGP for the simulated outcome, it need not be the correct structural equation from W to y.
2. We obtain the value of dummy outcome as:
y_dummy = h(t) + f(W)
To prevent overfitting, we fit f(W) for one value of T and then use it to generate data for other values of t. Support for identification based on instrumental variables and mediation is planned for the future.
If we originally started out with
     W
    / \
   t --->y
On estimating the following with constant t, y_dummy = f(W)
     W
    / \
   t --|->y
This ensures that we try to capture as much of W--->Y as possible.
On adding h(t)
     W
    / \
   t --->y
     h(t)
Supports additional parameters that can be specified in the refute_estimate() method.
- Parameters:
num_simulations (int, optional) – The number of simulations to be run, which defaults to CausalRefuter.DEFAULT_NUM_SIMULATIONS
transformation_list (list, optional) – It is a list of actions to be performed to obtain the outcome, which defaults to DEFAULT_TRANSFORMATION. The default transformation is as follows:
[("zero",""),("noise", {'std_dev':1} )]
Each of the actions within a transformation is one of the following types:
function argument: function pd.Dataframe -> np.ndarray
It takes in a function that takes the input data frame as the input and outputs the outcome variable. This allows us to create an output variable that only depends on the covariates and does not depend on the treatment variable.
string argument
Currently it supports some common estimators like
Linear Regression
K Nearest Neighbours
Support Vector Machine
Neural Network
Random Forest
Or functions such as:
Permute: This permutes the rows of the outcome, disassociating any effect of the treatment on the outcome.
Noise: This adds white noise to the outcome, reducing any causal relationship with the treatment.
Zero: It replaces all the values in the outcome by zero
- Examples:
The transformation_list is of the following form:
If the function pd.Dataframe -> np.ndarray is already defined.
[(func,func_params),('permute',{'permute_fraction':val}),('noise',{'std_dev':val})]
Every function should be able to support a minimum of two arguments X_train and outcome_train, which correspond to the training data and the outcome that we want to predict. Additional parameters such as the learning rate or the momentum constant can be set with the help of func_args.
[(neural_network,{'alpha': 0.0001, 'beta': 0.9}),('permute',{'permute_fraction': 0.2}),('noise',{'std_dev': 0.1})]
The neural network is invoked as neural_network(X_train, outcome_train, **args).
If a function from the above list is used
[('knn',{'n_neighbors':5}), ('permute', {'permute_fraction': val} ), ('noise', {'std_dev': val} )]
- Parameters:
true_causal_effect – A function that is used to get the true causal effect for the modelled dummy outcome. It defaults to DEFAULT_TRUE_CAUSAL_EFFECT, which means that there is no relationship between the treatment and outcome in the dummy data.
Note
The true causal effect should take an input of the same shape as the treatment, and the output should match the shape of the outcome.
- Parameters:
required_variables – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables except the treatment and the outcome.
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
Note
These inputs are fed to the function for estimating the outcome variable. The same set of required_variables is used for each instance of an internal estimation function.
- Parameters:
bucket_size_scale_factor – For continuous data, the scale factor helps us scale the size of the bucket used on the data. The default scale factor is DEFAULT_BUCKET_SCALE_FACTOR.
min_data_point_threshold (int, optional) – The minimum number of data points for an estimator to run. This defaults to MIN_DATA_POINT_THRESHOLD. If a certain category has too few data points, we use the DEFAULT_TRANSFORMATION for generating the dummy outcome.
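The following is a minimal sketch of how a transformation_list can be read as a pipeline of (action, parameters) pairs applied in order. It is an illustration only, not DoWhy's implementation: the helper apply_transformations is hypothetical, only the "zero", "noise" and "permute" actions are covered, and "permute" here shuffles every row rather than honouring permute_fraction.

```python
import numpy as np


def apply_transformations(outcome, transformation_list, rng):
    """Apply (action, params) pairs in order. Hypothetical helper for illustration."""
    y = np.asarray(outcome, dtype=float).copy()
    for action, params in transformation_list:
        params = params or {}  # the default list passes "" as the params of "zero"
        if action == "zero":
            y = np.zeros_like(y)
        elif action == "noise":
            y = y + rng.normal(0.0, params.get("std_dev", 1.0), size=y.shape)
        elif action == "permute":
            # Simplification: shuffle all rows and ignore permute_fraction.
            y = rng.permutation(y)
        else:
            raise ValueError(f"unsupported action: {action!r}")
    return y


rng = np.random.default_rng(0)
outcome = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

zeroed = apply_transformations(outcome, [("zero", "")], rng)
# Same shape as the default transformation: zero the outcome, then add white noise.
dummy = apply_transformations(outcome, [("zero", ""), ("noise", {"std_dev": 1})], rng)
shuffled = apply_transformations(outcome, [("permute", {})], rng)
```

With the default list, the dummy outcome is pure noise, so any effect an estimator reports on it cannot come from the treatment.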
- class dowhy.causal_refuters.PlaceboTreatmentRefuter(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by replacing treatment with a randomly-generated placebo variable.
Supports additional parameters that can be specified in the refute_estimate() method. For joblib-related parameters (n_jobs, verbose), please refer to the joblib documentation for more details (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- Parameters:
placebo_type (str, optional) – Default is to generate random values for the treatment. If placebo_type is “permute”, then the original treatment values are permuted by row.
num_simulations (int, optional) – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default.
random_state (int, RandomState, optional) – The seed value to use if we wish to repeat the same random behavior. To reproduce the same behavior, pass the same seed to the pseudo-random generator.
n_jobs (int, optional) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose (int, optional) – The verbosity level: if nonzero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
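The idea behind the "permute" placebo can be sketched with NumPy alone. This is a hedged illustration on synthetic data, not the refuter's actual code: the data-generating process, the helper effect_of, and the use of ordinary least squares as the estimator are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
w = rng.normal(size=n)                           # observed confounder
t = (w + rng.normal(size=n) > 0).astype(float)   # binary treatment
y = 2.0 * t + w + rng.normal(size=n)             # true effect of t on y is 2


def effect_of(treatment):
    """OLS coefficient on the treatment column, adjusting for w."""
    X = np.column_stack([np.ones(n), treatment, w])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


original_effect = effect_of(t)

# Placebo: permute the treatment by row, which breaks its link to the outcome.
placebo_effects = [effect_of(rng.permutation(t)) for _ in range(50)]
mean_placebo_effect = float(np.mean(placebo_effects))
```

If the estimator is sound, the placebo effect should sit near zero while the original effect stays near its true value.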
- class dowhy.causal_refuters.RandomCommonCause(*args, **kwargs)[source]
Bases:
CausalRefuter
Refute an estimate by introducing a randomly generated confounder (that may have been unobserved).
Supports additional parameters that can be specified in the refute_estimate() method. For joblib-related parameters (n_jobs, verbose), please refer to the joblib documentation for more details (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- Parameters:
num_simulations (int, optional) – The number of simulations to be run, which is CausalRefuter.DEFAULT_NUM_SIMULATIONS by default.
random_state (int, RandomState, optional) – The seed value to use if we wish to repeat the same random behavior. To reproduce the same behavior, pass the same seed to the pseudo-random generator.
n_jobs (int, optional) – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose (int, optional) – The verbosity level: if nonzero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
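As a rough sketch of the random-common-cause check, the example below adds an independent random column to the adjustment set and compares the re-estimated effect with the original. The synthetic data, the helper effect_with, and the least-squares estimator are assumptions for illustration; the library's own procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
w = rng.normal(size=n)                           # observed confounder
t = (w + rng.normal(size=n) > 0).astype(float)   # binary treatment
y = 2.0 * t + w + rng.normal(size=n)             # true effect of t on y is 2


def effect_with(extra_columns):
    """OLS coefficient on t, adjusting for w plus any extra columns."""
    X = np.column_stack([np.ones(n), t, w, *extra_columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


original_effect = effect_with([])

# Each simulation adds one randomly generated, independent "common cause".
new_effects = [effect_with([rng.normal(size=n)]) for _ in range(50)]
max_shift = float(np.max(np.abs(np.array(new_effects) - original_effect)))
```

Because the added variable is independent of both treatment and outcome, a robust estimate should barely move.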
- dowhy.causal_refuters.refute_bootstrap(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, num_simulations: int = 100, random_state: Optional[Union[int, RandomState]] = None, sample_size: Optional[int] = None, required_variables: bool = True, noise: float = 0.1, probability_of_change: Optional[float] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by running it on a random sample of the data containing measurement error in the confounders. This tests how well the estimator recovers the effect of the treatment on the outcome when the confounders are measured with noise.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
num_simulations – The number of simulations to be run, CausalRefuter.DEFAULT_NUM_SIMULATIONS by default.
random_state – The seed value to use if we wish to repeat the same random behavior. For this purpose, we reuse the same seed in the pseudo-random generator.
sample_size – The size of each bootstrap sample, which is the size of the original data by default.
required_variables – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables except the treatment and the outcome.
1. An integer argument refers to how many variables will be used for estimating the value of the outcome.
2. A list explicitly refers to which variables will be used to estimate the outcome. Furthermore, it gives the ability to explicitly select or deselect the covariates present in the estimation of the outcome. This is done by either adding or explicitly removing variables from the list as shown below:
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
3. If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
noise – The standard deviation of the noise to be added to the data, which is BootstrapRefuter.DEFAULT_STD_DEV by default.
probability_of_change – The probability with which we change the data for a boolean or categorical variable. It is noise by default, only if the value of noise is less than 1.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if nonzero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
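A minimal sketch of the bootstrap idea, assuming synthetic data and a least-squares estimator: rows are resampled with replacement, Gaussian noise with standard deviation noise is added to the continuous confounder, and the effect is re-estimated. The helper estimate and the data-generating process are hypothetical; handling of boolean or categorical variables via probability_of_change is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
w = rng.normal(size=n)                           # continuous confounder
t = (w + rng.normal(size=n) > 0).astype(float)   # binary treatment
y = 2.0 * t + w + rng.normal(size=n)             # true effect of t on y is 2


def estimate(t_s, w_s, y_s):
    """OLS coefficient on the treatment, adjusting for the confounder."""
    X = np.column_stack([np.ones(len(t_s)), t_s, w_s])
    coef, *_ = np.linalg.lstsq(X, y_s, rcond=None)
    return coef[1]


original_effect = estimate(t, w, y)

noise = 0.1          # standard deviation of the added measurement error
sample_size = n      # same size as the original data
bootstrap_effects = []
for _ in range(100):
    idx = rng.integers(0, n, size=sample_size)               # sample with replacement
    w_noisy = w[idx] + rng.normal(0.0, noise, size=sample_size)
    bootstrap_effects.append(estimate(t[idx], w_noisy, y[idx]))

mean_bootstrap_effect = float(np.mean(bootstrap_effects))
```

A stable estimate keeps the bootstrap mean close to the original effect despite the perturbed confounder.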
- dowhy.causal_refuters.refute_data_subset(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, subset_fraction: float = 0.8, num_simulations: int = 100, random_state: Optional[Union[int, RandomState]] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by rerunning it on a random subset of the original data.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
subset_fraction – Fraction of the data to be used for re-estimation, which is DataSubsetRefuter.DEFAULT_SUBSET_FRACTION by default.
num_simulations – The number of simulations to be run, CausalRefuter.DEFAULT_NUM_SIMULATIONS by default.
random_state – The seed value to use if we wish to repeat the same random behavior. For this purpose, we reuse the same seed in the pseudo-random generator.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if nonzero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
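The subset check can be sketched in the same spirit: draw a random fraction of the rows without replacement and re-estimate. The data, the helper estimate, and the least-squares estimator are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
w = rng.normal(size=n)
t = (w + rng.normal(size=n) > 0).astype(float)
y = 2.0 * t + w + rng.normal(size=n)             # true effect of t on y is 2


def estimate(rows):
    """OLS coefficient on the treatment, fitted on the selected rows only."""
    X = np.column_stack([np.ones(len(rows)), t[rows], w[rows]])
    coef, *_ = np.linalg.lstsq(X, y[rows], rcond=None)
    return coef[1]


original_effect = estimate(np.arange(n))

subset_fraction = 0.8
subset_size = int(subset_fraction * n)
subset_effects = [
    estimate(rng.choice(n, size=subset_size, replace=False)) for _ in range(100)
]
mean_subset_effect = float(np.mean(subset_effects))
```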
- dowhy.causal_refuters.refute_dummy_outcome(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: str, outcome_name: str, required_variables: Optional[Union[int, list, bool]] = None, min_data_point_threshold: float = 30, bucket_size_scale_factor: float = 0.5, num_simulations: int = 100, transformation_list: List = [('zero', ''), ('noise', {'std_dev': 1})], test_fraction: List[TestFraction] = [TestFraction(base=0.5, other=0.5)], unobserved_confounder_values: Optional[List] = None, true_causal_effect: Callable = <function <lambda>>, show_progress_bar=False, **_) List[CausalRefutation] [source]
Refute an estimate by replacing the outcome with a simulated variable for which the true causal effect is known.
In the simplest case, the dummy outcome is an independent, randomly generated variable. By definition, the true causal effect should be zero.
More generally, the dummy outcome uses the observed relationship between confounders and outcome (conditional on treatment) to create a more realistic outcome for which the treatment effect is known to be zero. If the goal is to simulate a dummy outcome with a non-zero true causal effect, then we can add an arbitrary function h(t) to the dummy outcome’s generation process and then the causal effect becomes h(t=1)-h(t=0).
Note that this general procedure only works for the backdoor criterion.
1. We find f(W) for each value of treatment. That is, keeping the treatment constant, we fit a predictor to estimate the effect of confounders W on outcome y. Note that since f(W) simply defines a new DGP for the simulated outcome, it need not be the correct structural equation from W to y.
2. We obtain the value of dummy outcome as:
y_dummy = h(t) + f(W)
To prevent overfitting, we fit f(W) for one value of t and then use it to generate data for other values of t. Support for identification based on instrumental variables and mediation may be added in the future.
If we originally started out with

      W
     / \
    t --->y

On estimating the following with constant t, y_dummy = f(W)

      W
     / \
    t --|->y

This ensures that we try to capture as much of W--->Y as possible.

On adding h(t)

      W
     / \
    t --->y
     h(t)
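The two-step construction y_dummy = h(t) + f(W) can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: f(W) is taken to be a linear fit on the rows with t = 0, h(t) = 3t is a hypothetical choice that makes the known effect h(t=1) - h(t=0) equal to 3, and a least-squares regression stands in for the estimator being tested.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
W = rng.normal(size=n)                           # observed confounder
t = (W + rng.normal(size=n) > 0).astype(float)   # binary treatment
y = 1.5 * t + 2.0 * W + rng.normal(size=n)       # original outcome

# Step 1: fit f(W) while holding the treatment constant (rows with t == 0).
base = t == 0
X_base = np.column_stack([np.ones(int(base.sum())), W[base]])
f_coef, *_ = np.linalg.lstsq(X_base, y[base], rcond=None)
f_W = f_coef[0] + f_coef[1] * W                  # evaluate f on every row


def h(treatment):
    """Hypothetical treatment function; the true effect is h(1) - h(0) = 3."""
    return 3.0 * treatment


# Step 2: build the dummy outcome with a known causal effect.
y_dummy = h(t) + f_W

# Re-estimate on the dummy outcome; a sound estimator should recover 3.
X = np.column_stack([np.ones(n), t, W])
coef, *_ = np.linalg.lstsq(X, y_dummy, rcond=None)
estimated_effect = float(coef[1])
```

Setting h to return zeros gives the default case, where the known effect on the dummy outcome is zero.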
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment
num_simulations (int, optional) – The number of simulations to be run, which defaults to CausalRefuter.DEFAULT_NUM_SIMULATIONS.
transformation_list (list, optional) – A list of actions to be performed to obtain the outcome, which defaults to DEFAULT_TRANSFORMATION. The default transformation is as follows: [("zero",""),("noise", {'std_dev':1} )]
Each of the actions within a transformation is one of the following types:
function argument: function pd.DataFrame -> np.ndarray
It takes a function that receives the input data frame and outputs the outcome variable. This allows us to create an output variable that depends only on the covariates and not on the treatment variable.
string argument
Currently it supports some common estimators like
Linear Regression
K Nearest Neighbours
Support Vector Machine
Neural Network
Random Forest
Or functions such as:
Permute: This permutes the rows of the outcome, disassociating any effect of the treatment on the outcome.
Noise: This adds white noise to the outcome, reducing any causal relationship with the treatment.
Zero: It replaces all the values in the outcome with zero.
- Examples:
The transformation_list is of the following form:
If the function pd.DataFrame -> np.ndarray is already defined:
[(func,func_params),('permute',{'permute_fraction':val}),('noise',{'std_dev':val})]
Every function should support a minimum of two arguments, X_train and outcome_train, which correspond to the training data and the outcome that we want to predict. Additional parameters, such as the learning rate or the momentum constant, can be set with the help of func_args.
[(neural_network,{'alpha': 0.0001, 'beta': 0.9}),('permute',{'permute_fraction': 0.2}),('noise',{'std_dev': 0.1})]
The neural network is invoked as neural_network(X_train, outcome_train, **args).
If a function from the above list is used:
[('knn',{'n_neighbors':5}), ('permute', {'permute_fraction': val} ), ('noise', {'std_dev': val} )]
- Parameters:
true_causal_effect – A function that is used to get the true causal effect for the modelled dummy outcome. It defaults to DEFAULT_TRUE_CAUSAL_EFFECT, which means that there is no relationship between the treatment and outcome in the dummy data.
Note
The true causal effect should take an input of the same shape as the treatment, and the output should match the shape of the outcome.
- Parameters:
required_variables – The list of variables to be used as the input for y~f(W). This is True by default, which in turn selects all variables except the treatment and the outcome.
Note
We need to pass required_variables = [W0,W1] if we want W0 and W1.
We need to pass required_variables = [-W0,-W1] if we want all variables excluding W0 and W1.
If the value is True, we wish to include all variables to estimate the value of the outcome.
Warning
A False value is INVALID and will result in an error.
Note
These inputs are fed to the function for estimating the outcome variable. The same set of required_variables is used for each instance of an internal estimation function.
- Parameters:
bucket_size_scale_factor – For continuous data, the scale factor helps us scale the size of the bucket used on the data. The default scale factor is DEFAULT_BUCKET_SCALE_FACTOR.
min_data_point_threshold (int, optional) – The minimum number of data points for an estimator to run. This defaults to MIN_DATA_POINT_THRESHOLD. If a certain category has too few data points, we use the DEFAULT_TRANSFORMATION for generating the dummy outcome.
- dowhy.causal_refuters.refute_estimate(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: Optional[str] = None, outcome_name: Optional[str] = None, refuters: List[Callable[[...], Union[CausalRefutation, List[CausalRefutation]]]] = [<function sensitivity_simulation>, <function refute_bootstrap>, <function refute_data_subset>, <function refute_dummy_outcome>, <function refute_placebo_treatment>, <function refute_random_common_cause>], **kwargs) List[CausalRefutation] [source]
Executes a list of refuters using the default parameters.
Only refuters that return a CausalRefutation or a list of CausalRefutation are supported.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment (Optional)
outcome_name – str: Name of the outcome (Optional)
refuters – list: List of refuters to execute
**kwargs – Replace any default for the provided list of refuters.
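The dispatch pattern described above can be sketched without DoWhy itself. Everything below is hypothetical: run_refuters and the two stand-in refuters are illustrative names, and plain strings take the place of CausalRefutation objects. The sketch shows two points from the text: shared keyword arguments override each refuter's defaults, and results that are lists are flattened into one list.

```python
from typing import Any, Callable, List


def run_refuters(refuters: List[Callable[..., Any]], **kwargs: Any) -> List[Any]:
    """Call each refuter with the shared kwargs and flatten list results."""
    results: List[Any] = []
    for refuter in refuters:
        outcome = refuter(**kwargs)
        if isinstance(outcome, list):
            results.extend(outcome)
        else:
            results.append(outcome)
    return results


# Stand-in refuters. The trailing **_ absorbs keyword arguments a refuter does
# not use, mirroring the **_ at the end of the signatures documented here.
def single_result_refuter(num_simulations: int = 100, **_: Any) -> str:
    return f"single(num_simulations={num_simulations})"


def multi_result_refuter(num_simulations: int = 100, **_: Any) -> List[str]:
    return [f"multi-{i}(num_simulations={num_simulations})" for i in range(2)]


results = run_refuters(
    [single_result_refuter, multi_result_refuter],
    num_simulations=10,   # replaces the default of 100 in both refuters
    unused_option=True,   # silently ignored by both
)
```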
- dowhy.causal_refuters.refute_placebo_treatment(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_names: List, num_simulations: int = 100, placebo_type: PlaceboType = PlaceboType.DEFAULT, random_state: Optional[Union[int, RandomState]] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by replacing treatment with a randomly-generated placebo variable.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_names – list: List of treatments
num_simulations – The number of simulations to be run, which defaults to CausalRefuter.DEFAULT_NUM_SIMULATIONS.
placebo_type – Default is to generate random values for the treatment. If placebo_type is “permute”, then the original treatment values are permuted by row.
random_state – The seed value to use if we wish to repeat the same random behavior. To reproduce the same behavior, pass the same seed to the pseudo-random generator.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if nonzero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
- dowhy.causal_refuters.refute_random_common_cause(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, num_simulations: int = 100, random_state: Optional[Union[int, RandomState]] = None, show_progress_bar: bool = False, n_jobs: int = 1, verbose: int = 0, **_) CausalRefutation [source]
Refute an estimate by introducing a randomly generated confounder (that may have been unobserved).
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
num_simulations – The number of simulations to be run, which defaults to CausalRefuter.DEFAULT_NUM_SIMULATIONS.
random_state – The seed value to use if we wish to repeat the same random behavior. To reproduce the same behavior, pass the same seed to the pseudo-random generator.
n_jobs – The maximum number of concurrently running jobs. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all (this is the default).
verbose – The verbosity level: if nonzero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. The default is 0.
- dowhy.causal_refuters.sensitivity_e_value(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: List[str], outcome_name: List[str], plot_estimate: bool = True) EValueSensitivityAnalyzer [source]
- dowhy.causal_refuters.sensitivity_simulation(data: DataFrame, target_estimand: IdentifiedEstimand, estimate: CausalEstimate, treatment_name: str, outcome_name: str, kappa_t: Optional[Union[float, ndarray]] = None, kappa_y: Optional[Union[float, ndarray]] = None, confounders_effect_on_treatment: str = 'binary_flip', confounders_effect_on_outcome: str = 'linear', frac_strength_treatment: float = 1.0, frac_strength_outcome: float = 1.0, plotmethod: Optional[str] = None, show_progress_bar=False, **_) CausalRefutation [source]
This function attempts to add an unobserved common cause to the outcome and the treatment. At present, the behavior is implemented for one-dimensional continuous and binary variables. The function accepts either single-valued inputs or a range of inputs, and chooses its course of action based on the data type of the input.
- Parameters:
data – pd.DataFrame: Data to run the refutation
target_estimand – IdentifiedEstimand: Identified estimand to run the refutation
estimate – CausalEstimate: Estimate to run the refutation
treatment_name – str: Name of the treatment
outcome_name – str: Name of the outcome
kappa_t – float, numpy.ndarray: Strength of the confounder’s effect on treatment. When confounders_effect_on_treatment is linear, it is the regression coefficient. When confounders_effect_on_treatment is binary flip, it is the probability with which the effect of the unobserved confounder can invert the value of the treatment.
kappa_y – float, numpy.ndarray: Strength of the confounder’s effect on outcome. Its interpretation depends on confounders_effect_on_outcome and the simulation_method. When simulation_method is direct-simulation, for a linear effect it behaves like the regression coefficient, and for a binary flip it is the probability with which it can invert the value of the outcome.
confounders_effect_on_treatment – str : The type of effect on the treatment due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
confounders_effect_on_outcome – str : The type of effect on the outcome due to the unobserved confounder. Possible values are [‘binary_flip’, ‘linear’]
frac_strength_treatment – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on treatment. Defaults to 1.
frac_strength_outcome – float: This parameter decides the effect strength of the simulated confounder as a fraction of the effect strength of observed confounders on outcome. Defaults to 1.
plotmethod – string: Type of plot to be shown. If None, no plot is generated. This parameter is used only when more than one treatment confounder effect value or outcome confounder effect value is provided. Default is “colormesh”. Supported values are “contour” and “colormesh” when more than one value is provided for both confounder effect value parameters, and “line” when more than one value is provided for only one of them.
- Returns:
CausalRefutation: An object that contains the estimated effect, the new effect, and the name of the refutation used.
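To make the two effect types concrete, here is a hedged sketch of how a simulated confounder u could act on the data. The rules below are simplifying assumptions, not the library's exact procedure: for "linear" on the outcome, kappa_y * u is added to y; for "binary_flip" on the treatment, the rows with the largest u are flipped so that a fraction of about kappa_t of the treatment values is inverted.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
t = rng.integers(0, 2, size=n)          # binary treatment
y = 2.0 * t + rng.normal(size=n)        # continuous outcome
u = rng.normal(size=n)                  # simulated unobserved confounder

kappa_t = 0.1   # flip probability for the binary treatment
kappa_y = 0.5   # regression-style coefficient for the continuous outcome

# binary_flip on treatment (assumed rule): invert the treatment for the rows
# whose confounder value lies in the top kappa_t fraction.
flip = u > np.quantile(u, 1.0 - kappa_t)
t_new = np.where(flip, 1 - t, t)

# linear on outcome: shift the outcome in proportion to the confounder.
y_new = y + kappa_y * u

flipped_fraction = float(flip.mean())
```

Re-estimating the effect on (t_new, y_new) for a grid of kappa_t and kappa_y values is what produces the surface that the plotmethod options visualize.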