dowhy.utils package#
Submodules#
dowhy.utils.api module#
dowhy.utils.cit module#
- dowhy.utils.cit.compute_ci(r=None, nx=None, ny=None, confidence=0.95)[source]#
Compute parametric confidence intervals around a correlation coefficient. See: https://online.stat.psu.edu/stat505/lesson/6/6.3
This is done by applying Fisher’s r-to-z transform: z = 0.5 * ln((1 + r) / (1 - r)) = arctanh(r)
The standard error is 1/sqrt(N - 3), where N is the sample size.
The critical value of the normal distribution for a given confidence level is calculated from stats.norm.ppf((1 - confidence)/2) for a two-tailed test.
The lower and upper confidence limits in z space are calculated with the formula z ± critical value * standard error.
The confidence interval is then converted back to r space.
- Parameters:
r – correlation coefficient
nx – length of vector x
ny – length of vector y
confidence – confidence level (0.95 = 95%)
- Returns:
array containing the confidence interval
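The steps described for compute_ci can be sketched with the standard library alone. This is a minimal illustration, not the dowhy implementation: the helper name fisher_ci and its single sample-size argument n are assumptions made for the example, and statistics.NormalDist stands in for stats.norm.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for a correlation r from n samples (hypothetical helper)."""
    z = math.atanh(r)                      # Fisher r-to-z transform
    se = 1.0 / math.sqrt(n - 3)            # standard error in z space
    # Two-tailed critical value; the lower-tail quantile is negative, so take abs().
    crit = abs(NormalDist().inv_cdf((1 - confidence) / 2))
    lower_z, upper_z = z - crit * se, z + crit * se
    return math.tanh(lower_z), math.tanh(upper_z)   # back to r space


lower, upper = fisher_ci(r=0.5, n=50)
print(f"95% CI for r=0.5, n=50: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is symmetric in z space but not in r space, the limits are not equidistant from r once converted back.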
- dowhy.utils.cit.conditional_MI(data=None, x=None, y=None, z=None)[source]#
Returns the conditional mutual information between X and Y given Z:
I(X, Y | Z) = H(X|Z) - H(X|Y,Z)
= H(X,Z) - H(Z) - H(X,Y,Z) + H(Y,Z)
= H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
- Parameters:
data – dataset
x, y, z – column names from the dataset
- Returns:
conditional mutual information between X and Y given Z
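The entropy identity for conditional_MI can be checked on small discrete data. The sketch below is an assumption-laden illustration rather than the library code: the helper names are hypothetical, the data is a list of dicts instead of a DataFrame, and entropies are measured in bits (log base 2), which the documentation does not specify.

```python
import math
from collections import Counter


def joint_entropy(rows, cols):
    """Empirical entropy (in bits) of the joint distribution of the given columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = sum(counts.values())
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


def conditional_mi(rows, x, y, z):
    """I(X, Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (
        joint_entropy(rows, [x, z])
        + joint_entropy(rows, [y, z])
        - joint_entropy(rows, [x, y, z])
        - joint_entropy(rows, [z])
    )


# Y copies X and Z is constant: X's one bit of entropy is fully shared with Y.
copied = [{"x": v, "y": v, "z": 0} for v in (0, 1, 0, 1)]
# X and Y take every combination equally often: they share no information.
independent = [{"x": a, "y": b, "z": 0} for a in (0, 1) for b in (0, 1)]

print(conditional_mi(copied, "x", "y", "z"))
print(conditional_mi(independent, "x", "y", "z"))
```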
- dowhy.utils.cit.entropy(x)[source]#
Returns the entropy of a random variable x: H(x) = - Σ p(x) log(p(x))
- Parameters:
x – random variable to calculate entropy for
- Returns:
entropy of the random variable
- dowhy.utils.cit.partial_corr(data=None, x=None, y=None, z=None, method='pearson')[source]#
Calculates the partial correlation, which is the degree of association between x and y after removing the effect of z. This is done by calculating the correlation coefficient between the residuals of two linear regressions: x ~ z and y ~ z. See: https://en.wikipedia.org/wiki/Partial_correlation
- Parameters:
data – pandas DataFrame
x – column name in data
y – column name in data
z – column name (string) or list of column names in data
method – string denoting the correlation type: “pearson” or “spearman”
- Returns:
A Python dictionary with the keys:
n – sample size
r – partial correlation coefficient
CI95 – 95% parametric confidence intervals
p-val – p-value
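The residual-based definition of partial correlation can be illustrated for a single covariate with plain Python. This is a sketch under stated assumptions: the helper names are hypothetical, only one z variable and the Pearson correlation are handled, and no confidence interval or p-value is computed.

```python
import math


def residuals(v, z):
    """Residuals of the least-squares fit v ~ a + b*z (single covariate)."""
    n = len(v)
    mean_v, mean_z = sum(v) / n, sum(z) / n
    slope = sum((zi - mean_z) * (vi - mean_v) for zi, vi in zip(z, v)) / sum(
        (zi - mean_z) ** 2 for zi in z
    )
    intercept = mean_v - slope * mean_z
    return [vi - (intercept + slope * zi) for vi, zi in zip(v, z)]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr_sketch(x, y, z):
    """Correlate what is left of x and y after regressing each on z."""
    return pearson(residuals(x, z), residuals(y, z))


z = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [1, -1, 1, -1, 1, -1, 1, -1]
x = [2 * zi + e for zi, e in zip(z, noise)]   # x rises with z
y = [-zi + e for zi, e in zip(z, noise)]      # y falls with z
print(pearson(x, y))                 # raw correlation, dominated by z
print(partial_corr_sketch(x, y, z))  # association left after removing z
```

In this constructed data x and y move in opposite directions because of z, yet they share the same noise term, so the raw and partial correlations have opposite signs.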
dowhy.utils.cli_helpers module#
- dowhy.utils.cli_helpers.query_yes_no(question, default=True)[source]#
Ask a yes/no question via standard input and return the answer.
Source: https://stackoverflow.com/questions/3041986/apt-command-line-interface-like-yes-no-input
If invalid input is given, the user will be asked until they actually give valid input.
Side Effects: Blocks program execution until valid input(y/n) is given.
- Parameters:
question(str) – A question that is presented to the user.
default(bool|None) – The default value when enter is pressed with no value. When None, there is no default value and the query will loop.
- Returns:
A bool indicating whether user has entered yes or no.
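The looping behaviour described for query_yes_no can be sketched as follows. This is not the dowhy source: the function name ask_yes_no, the accepted answer strings, and the input_fn parameter (added so the loop can be driven without a console) are all assumptions for illustration.

```python
def ask_yes_no(question, default=True, input_fn=input):
    """Keep asking until the reply is a recognizable yes or no."""
    answers = {"y": True, "yes": True, "n": False, "no": False}
    hint = {True: "[Y/n]", False: "[y/N]", None: "[y/n]"}[default]
    while True:
        reply = input_fn(f"{question} {hint} ").strip().lower()
        if reply == "" and default is not None:
            return default      # plain Enter selects the default
        if reply in answers:
            return answers[reply]
        # Anything else (including Enter when default is None) asks again.


# Scripted replies stand in for a user typing at the console.
scripted = iter(["maybe", "", "no"])
print(ask_yes_no("Continue?", default=None, input_fn=lambda prompt: next(scripted)))
```

With default=None the empty reply is rejected just like any other invalid input, which matches the documented "the query will loop" behaviour.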
dowhy.utils.dgp module#
- class dowhy.utils.dgp.DataGeneratingProcess(**kwargs)[source]#
Bases:
object
Base class for implementation of data generating process.
Subclasses implement functions that create various data generating processes. All data generating processes are in the package “dowhy.utils.dgps”.
- DEFAULT_PERCENTILE = 0.9#
dowhy.utils.encoding module#
- class dowhy.utils.encoding.Encoders(drop_first=True)[source]#
Bases:
object
Categorical data One-Hot encoding helper object.
Initializes a factory object which manages a set of sklearn.preprocessing.OneHotEncoder instances, although the encode() method can be overridden to replace these with your preferred encoder.
Each Encoder instance is given a name to retrieve it in future, and is used to encode a different set of variables.
Initializes an instance and calls reset_encoders().
- Parameters:
drop_first – If true, will not encode the first category value with a bit in 1-hot encoding. It will be implicit instead, by the absence of any bit representing this value in the relevant columns. Set to False to include a bit for each value of every categorical variable.
- encode(data: DataFrame, encoder_name: str)[source]#
Encodes categorical columns in the given data, returning a new dataframe containing all original data and the encoded columns. Numerical data is unchanged, categorical types are one-hot encoded. encoder_name identifies a specific encoder to be used if available, or created if not. The encoder can be reused in subsequent calls.
- Parameters:
data – Data to encode.
encoder_name – The name for the encoder to be used.
- Returns:
The encoded data.
- dowhy.utils.encoding.one_hot_encode(data: DataFrame, columns=None, drop_first: bool = False, encoder: OneHotEncoder | None = None)[source]#
Replaces pandas’ get_dummies with an implementation of sklearn.preprocessing.OneHotEncoder.
The purpose of replacement is to allow encoding of new data using the same encoder, which ensures that the resulting encodings are consistent.
If encoder is None, a new instance of sklearn.preprocessing.OneHotEncoder will be created using fit_transform(). Otherwise, the existing encoder is used with fit().
For compatibility with get_dummies, the encoded data will be transformed into a DataFrame.
In all cases, the return value will be the encoded data and the encoder object (even if passed in). If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.
- Parameters:
data – Data of which to get dummy indicators.
columns – List-like structure containing specific columns to encode.
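drop_first – Whether to get k-1 dummies out of k categorical levels by removing the first level.
encoder – An existing sklearn.preprocessing.OneHotEncoder to reuse. If None, a new encoder is created.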
- Returns:
DataFrame, OneHotEncoder
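Why reusing one encoder keeps encodings consistent can be shown with a toy stand-in. The class below is purely hypothetical and is neither sklearn's OneHotEncoder nor dowhy's helper; it only mimics the idea of learning the category list once and applying the same column layout to later data, including the drop_first behaviour.

```python
class TinyOneHot:
    """Toy one-hot encoder that remembers the categories seen when fitted."""

    def __init__(self, drop_first=False):
        self.drop_first = drop_first
        self.categories = None

    def fit(self, values):
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        # With drop_first, the first category is implied by an all-zero row.
        kept = self.categories[1:] if self.drop_first else self.categories
        return [[1 if v == c else 0 for c in kept] for v in values]


encoder = TinyOneHot(drop_first=True).fit(["red", "green", "blue"])
print(encoder.transform(["blue", "green", "red"]))
# New data that lacks some categories still gets the same column layout.
print(encoder.transform(["red", "red"]))
```

Encoding the second batch with a freshly fitted encoder would produce a different number of columns, which is the inconsistency that passing the returned encoder back in avoids.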
dowhy.utils.graph_operations module#
- dowhy.utils.graph_operations.add_edge(i, j, g)[source]#
Adds an edge i –> j to the graph, g. The edge is only added if this addition does NOT cause the graph to have cycles.
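The acyclicity guard described for add_edge can be sketched without any graph library. This is an illustration under assumptions: the graph is a plain dict mapping each node to a set of successors, and the function names and argument order are hypothetical rather than those of dowhy.

```python
def reachable(graph, start, target):
    """Depth-first search: is target reachable from start along directed edges?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


def add_edge_if_acyclic(graph, i, j):
    """Add i -> j unless j already reaches i (which would close a cycle)."""
    if reachable(graph, j, i):
        return False
    graph.setdefault(i, set()).add(j)
    graph.setdefault(j, set())
    return True


graph = {}
print(add_edge_if_acyclic(graph, 0, 1))
print(add_edge_if_acyclic(graph, 1, 2))
print(add_edge_if_acyclic(graph, 2, 0))  # rejected: it would close 0 -> 1 -> 2 -> 0
```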
- dowhy.utils.graph_operations.adjacency_matrix_to_adjacency_list(adjacency_matrix, labels=None)[source]#
Convert the adjacency matrix of a graph to an adjacency list.
- Parameters:
adjacency_matrix – A numpy array representing the graph adjacency matrix.
labels – List of labels.
- Returns:
Adjacency list as a dictionary.
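A conversion of this kind might look like the sketch below. It assumes that a nonzero entry at row i, column j denotes an edge i –> j, and it uses nested lists in place of a numpy array; the function name is hypothetical.

```python
def to_adjacency_list(matrix, labels=None):
    """Map each node label to the labels of the nodes it points to."""
    if labels is None:
        labels = list(range(len(matrix)))   # fall back to integer node ids
    return {
        labels[i]: [labels[j] for j, entry in enumerate(row) if entry != 0]
        for i, row in enumerate(matrix)
    }


matrix = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
print(to_adjacency_list(matrix, labels=["X", "Y", "Z"]))
```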
- dowhy.utils.graph_operations.adjacency_matrix_to_graph(adjacency_matrix, labels=None)[source]#
Convert a given graph adjacency matrix to DOT format.
- Parameters:
adjacency_matrix – A numpy array representing the graph adjacency matrix.
labels – List of labels.
- Returns:
Graph in DOT format.
- dowhy.utils.graph_operations.daggity_to_dot(daggity_string)[source]#
Converts the input daggity_string to valid DOT graph format.
- Parameters:
daggity_string – Output graph from Daggity site
- Returns:
DOT string
- dowhy.utils.graph_operations.del_edge(i, j, g)[source]#
Deletes the edge i –> j in the graph, g. The edge is only deleted if this removal does NOT cause the graph to be disconnected.
- dowhy.utils.graph_operations.find_ancestor(node_set, node_names, adjacency_matrix, node2idx, idx2node)[source]#
Finds ancestors of a given set of nodes in a given graph.
- Parameters:
node_set – Set of nodes whose ancestors must be obtained.
node_names – Name of all nodes in the graph.
adjacency_matrix – Graph adjacency matrix.
node2idx – A dictionary mapping node names to their row or column index in the adjacency matrix.
idx2node – A dictionary mapping the row or column indices in the adjacency matrix to the corresponding node names.
- Returns:
OrderedSet containing ancestors of all nodes in the node_set.
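An ancestor search over an adjacency matrix can be sketched as a backwards traversal. The example assumes that a nonzero entry at row i, column j means i –> j and that the queried nodes are included in the result; neither convention is confirmed by the documentation. A plain list stands in for OrderedSet, and the node_names argument is omitted.

```python
def ancestors(node_set, adjacency_matrix, node2idx, idx2node):
    """Collect every node with a directed path into node_set (members included)."""
    found = list(node_set)              # a list keeps discovery order
    frontier = list(node_set)
    while frontier:
        child = node2idx[frontier.pop()]
        for parent_idx, row in enumerate(adjacency_matrix):
            parent = idx2node[parent_idx]
            if row[child] != 0 and parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


names = ["A", "B", "C", "D"]
node2idx = {name: i for i, name in enumerate(names)}
idx2node = {i: name for i, name in enumerate(names)}
# Edges: A -> B, B -> C; D is isolated.
matrix = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(ancestors(["C"], matrix, node2idx, idx2node))
```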
- dowhy.utils.graph_operations.find_c_components(adjacency_matrix, node_set, idx2node)[source]#
Obtain C-components in a graph.
- Parameters:
adjacency_matrix – Graph adjacency matrix.
node_set – Set of nodes whose ancestors must be obtained.
idx2node – A dictionary mapping the row or column indices in the adjacency matrix to the corresponding node names.
- Returns:
List of C-components in the graph.
- dowhy.utils.graph_operations.find_predecessor(i, j, g)[source]#
Finds a predecessor, k, in the path between two nodes, i and j, in the graph, g.
- dowhy.utils.graph_operations.get_simple_ordered_tree(n)[source]#
Generates a simple-ordered tree. The tree is just a directed acyclic graph of n nodes with the structure 0 –> 1 –> … –> n - 1.
- dowhy.utils.graph_operations.induced_graph(node_set, adjacency_matrix, node2idx)[source]#
Obtains the induced graph corresponding to a subset of nodes.
- Parameters:
node_set – Subset of nodes that defines the induced graph.
adjacency_matrix – Graph adjacency matrix.
node2idx – A dictionary mapping node names to their row or column index in the adjacency matrix.
- Returns:
Numpy array representing the adjacency matrix of the induced graph.
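Taking an induced graph amounts to keeping only the rows and columns of the selected nodes. The sketch below uses nested lists instead of a numpy array and a hypothetical function name; the row and column order follows the order in which the nodes are passed.

```python
def induced_submatrix(node_set, adjacency_matrix, node2idx):
    """Keep only the rows and columns that belong to the chosen nodes."""
    idx = [node2idx[node] for node in node_set]
    return [[adjacency_matrix[i][j] for j in idx] for i in idx]


node2idx = {"A": 0, "B": 1, "C": 2}
# Edges: A -> B, B -> C, A -> C.
matrix = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
print(induced_submatrix(["A", "C"], matrix, node2idx))
```

Only the direct edge A –> C survives here; the path through B disappears because B is not part of the subset.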
dowhy.utils.graphviz_plotting module#
- dowhy.utils.graphviz_plotting.plot_causal_graph_graphviz(causal_graph: Graph, layout_prog: str | None = None, display_causal_strengths: bool = True, causal_strengths: Dict[Tuple[Any, Any], float] | None = None, colors: Dict[Any | Tuple[Any, Any], str] | None = None, filename: str | None = None, display_plot: bool = True, figure_size: Tuple[int, int] | None = None) None [source]#
dowhy.utils.networkx_plotting module#
- dowhy.utils.networkx_plotting.plot_causal_graph_networkx(causal_graph: Graph, layout_prog: str | None = None, causal_strengths: Dict[Tuple[Any, Any], float] | None = None, colors: Dict[Any | Tuple[Any, Any], str] | None = None, filename: str | None = None, display_plot: bool = True, label_wrap_length: int = 3, figure_size: Tuple[int, int] | None = None) None [source]#
dowhy.utils.ordered_set module#
- class dowhy.utils.ordered_set.OrderedSet(elements=None)[source]#
Bases:
object
Python class for ordered set. Code taken from buyalsky/ordered-hash-set.
- add(element)[source]#
Function to add an element to the set if it does not exist.
- Parameters:
element – element to be added.
- difference(other_set)[source]#
Function to remove elements in self._set which are also present in other_set.
- Parameters:
other_set – The set to obtain difference with. Can be a list, set or OrderedSet.
- Returns:
New OrderedSet representing the difference of elements in the self._set and other_set.
- get_all()[source]#
Function to return list of all elements in the set.
- Returns:
List of all items in the set.
- intersection(other_set)[source]#
Function to compute the intersection of self._set and other_set.
- Parameters:
other_set – The set to obtain intersection with. Can be a list, set or OrderedSet.
- Returns:
New OrderedSet representing the set with elements common to the OrderedSet object and other_set.
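The behaviour of these methods can be approximated with a dict, which preserves insertion order in Python 3.7+. The class below is a hypothetical stand-in written for illustration; it is not the code taken from buyalsky/ordered-hash-set.

```python
class TinyOrderedSet:
    """Ordered set backed by a dict, which preserves insertion order."""

    def __init__(self, elements=None):
        self._items = dict.fromkeys(elements or [])

    def add(self, element):
        self._items.setdefault(element, None)   # no-op if already present

    def get_all(self):
        return list(self._items)

    def difference(self, other):
        other = set(other.get_all() if isinstance(other, TinyOrderedSet) else other)
        return TinyOrderedSet(e for e in self._items if e not in other)

    def intersection(self, other):
        other = set(other.get_all() if isinstance(other, TinyOrderedSet) else other)
        return TinyOrderedSet(e for e in self._items if e in other)


s = TinyOrderedSet(["c", "a", "b"])
s.add("a")   # already present, order unchanged
s.add("d")
print(s.get_all())
print(s.difference(["a"]).get_all())
print(s.intersection({"d", "c"}).get_all())
```

Both difference and intersection return new objects and keep the order of the left-hand set, regardless of the order of other_set.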
dowhy.utils.plotting module#
- dowhy.utils.plotting.bar_plot(values: Dict[str, float], uncertainties: Dict[str, Tuple[float, float]] | None = None, ylabel: str = '', filename: str | None = None, display_plot: bool = True, figure_size: List[int] | None = None, bar_width: float = 0.8, xticks: List[str] | None = None, xticks_rotation: int = 90, sort_names: bool = False) None [source]#
Convenience function to make a bar plot of the given values with uncertainty bars, if provided. Useful for all kinds of attribution results (including confidence intervals).
- Parameters:
values – A dictionary where the keys are the labels and the values are the values to be plotted.
uncertainties – An optional dictionary mapping each label to a pair of floats that defines its error bar (e.g. the bounds of a confidence interval).
ylabel – The label for the y-axis.
filename – An optional filename if the output should be plotted into a file.
display_plot – Optionally specify if the plot should be displayed or not (default to True).
figure_size – The size of the figure to be plotted.
bar_width – The width of the bars.
xticks – Explicitly specify the labels for the bars on the x-axis.
xticks_rotation – Specify the rotation of the labels on the x-axis.
sort_names – If True, the names in the plot are sorted alphabetically. If False, the order as given in values are used.
- dowhy.utils.plotting.plot(causal_graph: Graph, layout_prog: str | None = None, causal_strengths: Dict[Tuple[Any, Any], float] | None = None, colors: Dict[Any | Tuple[Any, Any], str] | None = None, filename: str | None = None, display_plot: bool = True, figure_size: Tuple[int, int] | None = None, **kwargs) None [source]#
Convenience function to plot causal graphs. This function uses different backends based on what’s available on the system. The best result is achieved when using Graphviz as the backend. This requires both the shared system library (e.g. brew install graphviz or apt-get install graphviz) and the Python pygraphviz package (pip install pygraphviz). When Graphviz is not available, it will fall back to the networkx backend.
- Parameters:
causal_graph – The graph to be plotted
layout_prog – Defines the layout type. If None is given, the ‘dot’ layout is used for graphviz plots and a customized layout for networkx plots.
causal_strengths – An optional dictionary with Edge -> float entries.
colors – An optional dictionary with color specifications for edges or nodes.
filename – An optional filename if the output should be plotted into a file.
display_plot – Optionally specify if the plot should be displayed or not (default to True).
figure_size – A tuple defining the width and height of the pyplot figure. This is used to modify pyplot’s ‘figure.figsize’ parameter. If None is given, the current/default value is used.
kwargs – Remaining parameters will be passed through to the backend verbatim.
Example usage:
>>> plot(nx.DiGraph([('X', 'Y')])) # plots X -> Y
>>> plot(nx.DiGraph([('X', 'Y')]), causal_strengths={('X', 'Y'): 0.43}) # annotates arrow with 0.43
>>> plot(nx.DiGraph([('X', 'Y')]), colors={('X', 'Y'): 'red', 'X': 'green'}) # colors X -> Y red and X green
dowhy.utils.propensity_score module#
- dowhy.utils.propensity_score.binary_treatment_model(data, covariates, treatment, variable_types)[source]#
- dowhy.utils.propensity_score.categorical_treatment_model(data, covariates, treatment, variable_types)[source]#
- dowhy.utils.propensity_score.continuous_treatment_model(data, covariates, treatment, variable_types)[source]#
dowhy.utils.regression module#
- dowhy.utils.regression.create_polynomial_function(max_degree)[source]#
Creates a list of polynomial functions
- Parameters:
max_degree – degree of the polynomial function to be created
- Returns:
list of lambda functions
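Building a list of lambdas in a loop has a well-known pitfall, which the sketch below avoids. The function name and the choice to start at degree 1 are assumptions; the documentation does not state which degrees the library's list covers.

```python
def polynomial_functions(max_degree):
    """Return [x -> x**1, x -> x**2, ..., x -> x**max_degree]."""
    # Binding degree as a default argument freezes its value per lambda;
    # a bare `lambda x: x ** degree` would see only the final loop value.
    return [lambda x, degree=degree: x ** degree for degree in range(1, max_degree + 1)]


funcs = polynomial_functions(3)
print([f(2) for f in funcs])
```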
dowhy.utils.timeseries module#
- dowhy.utils.timeseries.create_graph_from_csv(file_path: str) DiGraph [source]#
Creates a directed graph from a CSV file.
The time_lag parameter of the networkx graph represents the exact causal lag of an edge between any 2 nodes in the graph. Each edge can contain multiple time lags, therefore each combination of (node1,node2,time_lag) must be input individually in the CSV file.
The CSV file should have at least three columns: ‘node1’, ‘node2’, and ‘time_lag’. Each row represents an edge from ‘node1’ to ‘node2’ with a ‘time_lag’ attribute.
- Parameters:
file_path (str) – The path to the CSV file.
- Returns:
A directed graph created from the CSV file.
- Return type:
nx.DiGraph
- Example:
Example CSV content:
node1,node2,time_lag
A,B,5
B,C,2
A,C,7
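How rows of this CSV layout accumulate into per-edge lists of time lags can be sketched with the standard csv module. The helper is hypothetical: it returns a plain dictionary rather than an nx.DiGraph, reads from a file-like object rather than a path, and assumes integer lags.

```python
import csv
import io


def read_lagged_edges(csv_file):
    """Return {(node1, node2): [time_lag, ...]} from rows of node1,node2,time_lag."""
    edges = {}
    for row in csv.DictReader(csv_file):
        key = (row["node1"], row["node2"])
        edges.setdefault(key, []).append(int(row["time_lag"]))
    return edges


# The edge A -> B appears twice, once for each of its time lags.
sample = io.StringIO("node1,node2,time_lag\nA,B,5\nB,C,2\nA,C,7\nA,B,1\n")
print(read_lagged_edges(sample))
```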
- dowhy.utils.timeseries.create_graph_from_dot_format(file_path: str) DiGraph [source]#
Creates a directed graph from a DOT file and ensures it is a DiGraph.
The time_lag parameter of the networkx graph represents the exact causal lag of an edge between any 2 nodes in the graph. Each edge can contain multiple valid time lags.
The DOT file should contain a graph in DOT format.
- Parameters:
file_path (str) – The path to the DOT file.
- Returns:
A directed graph (DiGraph) created from the DOT file.
- Return type:
nx.DiGraph
- dowhy.utils.timeseries.create_graph_from_networkx_array(array: ndarray, var_names: list) DiGraph [source]#
Create a NetworkX directed graph from a numpy array with time lag information.
The time_lag parameter of the networkx graph represents the exact causal lag of an edge between any 2 nodes in the graph. Each edge can contain multiple valid time lags.
The resulting graph will be a directed graph with edge attributes indicating the type of link based on the array values.
- Parameters:
array (np.ndarray) – A numpy array of shape (n, n, tau) representing the causal links.
var_names (list) – A list of variable names.
- Returns:
A directed graph with edge attributes based on the array values.
- Return type:
nx.DiGraph
- dowhy.utils.timeseries.create_graph_from_user() DiGraph [source]#
Creates a directed graph based on user input from the console.
The time_lag parameter of the networkx graph represents the exact causal lag of an edge between any 2 nodes in the graph. Each edge can contain multiple time lags, therefore each combination of (node1,node2,time_lag) must be input individually by the user.
The user is prompted to enter edges one by one in the format ‘node1 node2 time_lag’, where ‘node1’ and ‘node2’ are the nodes connected by the edge, and ‘time_lag’ is a numerical value representing the weight of the edge. The user should enter ‘done’ to finish inputting edges.
- Returns:
A directed graph created from the user’s input.
- Return type:
nx.DiGraph
- Example user input:
Enter an edge: A B 4
Enter an edge: B C 2
Enter an edge: done