causalexplain.explainability package#

Submodules#

Hierarchy of links

Can I use the information above to decide whether to connect groups of variables linked together?

class Hierarchies(method='spearman', mic_alpha=0.6, mic_c=15, linkage_method='complete', correlation_th=None, prog_bar=False, verbose=False, silent=False)[source]#

Bases: object

Class representing the hierarchy of links between variables.

Parameters:
  • method (str or Callable, optional) – Method to use to compute the correlation. Default is ‘spearman’, but can also be ‘pearson’, ‘kendall’ or ‘mic’.

  • mic_alpha (float, optional) – Alpha parameter passed to the MIC estimator; only used when method is ‘mic’. Default is 0.6.

  • mic_c (int, optional) – c parameter passed to the MIC estimator; only used when method is ‘mic’. Default is 15.

  • linkage_method (str, optional) – Method to use to compute the linkage. Default is ‘complete’.

  • correlation_th (float, optional) – Threshold for the correlation. Default is None.

  • prog_bar (bool, optional) – Whether to show a progress bar during computation. Default is False.

  • verbose (bool, optional) – Whether to print additional information. Default is False.

  • silent (bool, optional) – Whether to suppress all output. Default is False.

Attributes:
correlations
linkage_mat

Methods

compute_correlated_features(correlations, ...)

Compute the list of correlated features for each target.

compute_correlation_matrix(data[, method, ...])

Compute the correlation matrix.

expand_clusters_perm_importance(pi[, ...])

Expand the clusters of the linkage matrix to include the features that are in the same cluster in the permutation importance matrix.

fit(X)

Compute the hierarchy of links between variables using the correlation method specified in corr_method.

hierarchical_dissimilarities()

Compute the dissimilarities between features in a hierarchical clustering.

correlations = None#
__init__(method='spearman', mic_alpha=0.6, mic_c=15, linkage_method='complete', correlation_th=None, prog_bar=False, verbose=False, silent=False)[source]#

Initialize the Hierarchies object.

Parameters:
  • method (str or Callable, optional) – Method to use to compute the correlation. Default is ‘spearman’, but can also be ‘pearson’, ‘kendall’ or ‘mic’.

  • mic_alpha (float, optional) – Alpha parameter passed to the MIC estimator; only used when method is ‘mic’. Default is 0.6.

  • mic_c (int, optional) – c parameter passed to the MIC estimator; only used when method is ‘mic’. Default is 15.

  • linkage_method (str, optional) – Method to use to compute the linkage. Default is ‘complete’.

  • correlation_th (float, optional) – Threshold for the correlation. Default is None.

  • prog_bar (bool, optional) – Whether to show a progress bar during computation. Default is False.

  • verbose (bool, optional) – Whether to print additional information. Default is False.

  • silent (bool, optional) – Whether to suppress all output. Default is False.

linkage_mat: ndarray = None#
fit(X)[source]#

Compute the hierarchy of links between variables using the correlation method specified in corr_method.

Parameters:
  • X (pd.DataFrame) – The input data.

  • y (None) – Ignored.

Returns:

self – The fitted Hierarchies object.

Return type:

Hierarchies
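A minimal usage sketch based on the signatures documented above. The import path causalexplain.explainability.hierarchies is an assumption; fit returns self per the docs, so the call can be chained:

    import numpy as np
    import pandas as pd

    # Assumed import path for the Hierarchies class documented above.
    from causalexplain.explainability.hierarchies import Hierarchies

    # Small synthetic dataset with one pair of strongly related columns.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({"x0": rng.normal(size=200), "x1": rng.normal(size=200)})
    X["x2"] = 0.8 * X["x0"] + rng.normal(scale=0.2, size=200)

    # Build the hierarchy of links with Spearman correlation and complete linkage.
    hier = Hierarchies(method="spearman", linkage_method="complete").fit(X)

    print(hier.correlations)                    # correlation matrix (pd.DataFrame)
    print(hier.linkage_mat)                     # linkage matrix (ndarray)
    print(hier.hierarchical_dissimilarities())  # pairwise dissimilarities (pd.DataFrame)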

static compute_correlation_matrix(data, method='spearman', mic_alpha=0.6, mic_c=15, prog_bar=False)[source]#

Compute the correlation matrix.

Parameters:
  • data (pd.DataFrame) – The input data.

  • method (str or Callable, optional) – Method to use to compute the correlation. Default is ‘spearman’, but can also be ‘pearson’, ‘kendall’ or ‘mic’.

  • mic_alpha (float, optional) – Alpha parameter passed to the MIC estimator; only used when method is ‘mic’. Default is 0.6.

  • mic_c (int, optional) – c parameter passed to the MIC estimator; only used when method is ‘mic’. Default is 15.

  • prog_bar (bool, optional) – Whether to show a progress bar during computation. Default is False.

Returns:

correlations – The correlation matrix.

Return type:

pd.DataFrame

static compute_correlated_features(correlations, correlation_th, feature_names, verbose=False)[source]#

Compute the list of correlated features for each target.

Parameters:
  • correlations (pd.DataFrame) – The correlation matrix.

  • correlation_th (float) – Threshold for the correlation.

  • feature_names (List[str]) – The list of feature names.

  • verbose (bool, optional) – Whether to print additional information. Default is False.

Returns:

correlated_features – The list of correlated features for each target.

Return type:

defaultdict(list)
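The two static helpers above can be used on their own, without fitting a Hierarchies instance. A short sketch, where the import path and the 0.8 threshold are illustrative assumptions:

    import numpy as np
    import pandas as pd

    from causalexplain.explainability.hierarchies import Hierarchies  # assumed import path

    rng = np.random.default_rng(1)
    data = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
    data["d"] = data["a"] + rng.normal(scale=0.1, size=300)  # make 'a' and 'd' correlated

    # Correlation matrix with the default Spearman method.
    corr = Hierarchies.compute_correlation_matrix(data, method="spearman")

    # Features correlated above the threshold, grouped per target feature.
    correlated = Hierarchies.compute_correlated_features(
        corr, correlation_th=0.8, feature_names=list(data.columns))
    print(dict(correlated))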

expand_clusters_perm_importance(pi, ground_truth=None)[source]#

Expand the clusters of the linkage matrix to include the features that are in the same cluster in the permutation importance matrix. For each cluster, it adds the metrics related to correlation, deltas, backward PI, etc. Used to determine whether some criteria can be extracted.

Parameters:
  • pi (pd.DataFrame) – Permutation importance matrix.

  • ground_truth (pd.DataFrame, optional) – Ground truth matrix.

Return type:

None

hierarchical_dissimilarities()[source]#

Compute the dissimilarities between features in a hierarchical clustering.

Returns:

hierarchical_dissimilarity – Dissimilarities between features.

Return type:

pd.DataFrame

connect_isolated_nodes(G, linkage_mat, feature_names, verbose=False)[source]#

Connect isolated nodes in the graph, based on their relationship in the hierarchical clustering provided through the linkage_mat.

connect_hierarchies(G, linkage_mat, feature_names, verbose=False)[source]#
plot_dendogram_correlations(correlations, feature_names, **kwargs)[source]#

Plot the dendrogram of the correlation matrix.

Parameters:
  • correlations (pd.DataFrame) – Correlation matrix.

  • feature_names (List[str]) – List of feature names.

  • kwargs – Keyword arguments to be passed to the plot_dendogram function.
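A hedged sketch of calling this plotting helper on a correlation matrix produced by Hierarchies.compute_correlation_matrix; the import path and the module-level availability of the function are assumptions:

    import numpy as np
    import pandas as pd

    from causalexplain.explainability.hierarchies import (  # assumed import path
        Hierarchies, plot_dendogram_correlations)

    rng = np.random.default_rng(2)
    data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

    corr = Hierarchies.compute_correlation_matrix(data)
    plot_dendogram_correlations(corr, list(data.columns))  # extra kwargs go to plot_dendogram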

Permutation Importance for feature selection. Wrapper over SKLearn’s permutation importance and a custom implementation of the vanilla version of the algorithm, to run over models trained with PyTorch.

    J. Renero, 2022, 2023

Parameters:#

models: dict

A dictionary of models, where the keys are the target variables and the values are the models trained to predict the target variables.

n_repeats: int

The number of times to repeat the permutation importance algorithm.

mean_pi_percentile: float

The percentile of the mean permutation importance to use as a threshold for feature selection.

random_state: int

The random state to use for the permutation importance algorithm.

prog_bar: bool

Whether to display a progress bar or not.

verbose: bool

Whether to display explanations on the process or not.

silent: bool

Whether to display anything or not.
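To make the models argument concrete, here is a hypothetical sketch of the expected dictionary shape: one regressor per target variable, each trained to predict that variable from the remaining columns. Using plain scikit-learn regressors is an illustrative assumption; the library may expect its own model wrappers:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

    # Hypothetical models dict: key = target variable, value = model trained to predict it.
    models = {
        target: LinearRegression().fit(X.drop(columns=[target]), X[target])
        for target in X.columns
    }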

class PermutationImportance(models, discrepancies=None, correlation_th=None, n_repeats=10, mean_pi_percentile=0.8, exhaustive=False, threshold=None, random_state=42, prog_bar=True, verbose=False, silent=False)[source]#

Bases: BaseEstimator

Permutation Importance for feature selection. Wrapper over SKLearn’s permutation importance and a custom implementation of the vanilla version of the algorithm, to run over models trained with PyTorch.

Methods

fit(X)

Implementation of the fit method for the PermutationImportance class.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

plot(**kwargs)

Plot the permutation importance for each feature, by calling the internal _plot_perm_imp method.

predict([X, root_causes, prior])

Implementation of the predict method for the PermutationImportance class.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, prior, root_causes])

Request metadata passed to the predict method.

fit_predict

device = 'cpu'#
__init__(models, discrepancies=None, correlation_th=None, n_repeats=10, mean_pi_percentile=0.8, exhaustive=False, threshold=None, random_state=42, prog_bar=True, verbose=False, silent=False)[source]#
fit(X)[source]#

Implementation of the fit method for the PermutationImportance class. If the model is a PyTorch model, the fit method will compute the base loss for each feature. If the model is a SKLearn model, the fit method will compute the permutation importance for each feature.

predict(X=None, root_causes=None, prior=None)[source]#

Implementation of the predict method for the PermutationImportance class.

fit_predict(X, root_causes)[source]#
plot(**kwargs)[source]#

Plot the permutation importance for each feature, by calling the internal _plot_perm_imp method.
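A minimal end-to-end sketch based on the signatures above. The import path and the plain scikit-learn models dict are assumptions, and the return value of predict is not documented here, so it is only captured:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Assumed import path for the class documented above.
    from causalexplain.explainability.perm_importance import PermutationImportance

    rng = np.random.default_rng(4)
    X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
    X["c"] = 0.6 * X["a"] + rng.normal(scale=0.2, size=200)

    # Hypothetical models dict: one regressor per target variable.
    models = {t: LinearRegression().fit(X.drop(columns=[t]), X[t]) for t in X.columns}

    pi = PermutationImportance(models, n_repeats=10, mean_pi_percentile=0.8,
                               prog_bar=False, verbose=False)
    pi.fit(X)               # permutation importances per feature and target
    result = pi.predict(X)  # selection step; root_causes and prior are optional
    pi.plot()               # bar plot via the internal _plot_perm_imp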

set_predict_request(*, prior='$UNCHANGED$', root_causes='$UNCHANGED$')#

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • prior (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prior parameter in predict.

  • root_causes (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for root_causes parameter in predict.

Returns:

self – The updated object.

Return type:

object

class RegQuality[source]#

Bases: BaseEstimator

Methods

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(scores[, gamma_shape, gamma_scale, ...])

Returns the indices of features that are both gamma and outliers.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, gamma_scale, ...])

Request metadata passed to the predict method.

__init__()[source]#
static predict(scores, gamma_shape=1, gamma_scale=1, threshold=0.9, verbose=False)[source]#

Returns the indices of features that are both gamma and outliers. Both criteria are applied to the given scores to determine whether the MSE obtained from a feature's regression is poor compared with the regressions for the other features in the dataset, in which case the feature should be considered a parent node.

Parameters:

scores (List[float]) – List of scores

Returns:

List of indices of features that are both gamma and outliers

Return type:

Set[int]
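A short sketch of the static call on a list of per-feature regression scores; the import path causalexplain.explainability.regression_quality and the score values are assumptions:

    # Assumed import path for the class documented above.
    from causalexplain.explainability.regression_quality import RegQuality

    # Hypothetical MSEs from regressing each feature on the others; index 3 stands out.
    scores = [0.12, 0.10, 0.11, 0.95, 0.13]

    # Indices of features whose regression error is anomalously high and which
    # should therefore be treated as candidate parent nodes.
    parents = RegQuality.predict(scores, gamma_shape=1, gamma_scale=1, threshold=0.9)
    print(parents)  # Set[int]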

set_predict_request(*, gamma_scale='$UNCHANGED$', gamma_shape='$UNCHANGED$', threshold='$UNCHANGED$', verbose='$UNCHANGED$')#

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • gamma_scale (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for gamma_scale parameter in predict.

  • gamma_shape (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for gamma_shape parameter in predict.

  • threshold (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for threshold parameter in predict.

  • verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in predict.

Returns:

self – The updated object.

Return type:

object

This module builds the causal graph based on the information derived from the SHAP values. The main idea is to use the SHAP values to compute the discrepancy between the SHAP values and the target values; this discrepancy is then used to build the graph.

class ShapDiscrepancy(target, parent, shap_heteroskedasticity, parent_heteroskedasticity, shap_p_value, parent_p_value, shap_model, parent_model, shap_discrepancy, shap_correlation, shap_gof, ks_pvalue, ks_result)[source]#

Bases: object

A class representing the discrepancy between the SHAP value and the parent value for a given feature.

Attributes:
  • target (str) – The name of the target feature.

  • parent (str) – The name of the parent feature.

  • shap_heteroskedasticity (bool) – Whether the SHAP value exhibits heteroskedasticity.

  • parent_heteroskedasticity (bool) – Whether the parent value exhibits heteroskedasticity.

  • shap_p_value (float) – The p-value for the SHAP value.

  • parent_p_value (float) – The p-value for the parent value.

  • shap_model (sm.regression.linear_model.RegressionResultsWrapper) – The regression model for the SHAP value.

  • parent_model (sm.regression.linear_model.RegressionResultsWrapper) – The regression model for the parent value.

  • shap_discrepancy (float) – The discrepancy between the SHAP value and the parent value.

  • shap_correlation (float) – The correlation between the SHAP value and the parent value.

  • shap_gof (float) – The goodness of fit for the SHAP value.

  • ks_pvalue (float) – The p-value for the Kolmogorov-Smirnov test.

  • ks_result (str) – The result of the Kolmogorov-Smirnov test.

target: str#
parent: str#
shap_heteroskedasticity: bool#
parent_heteroskedasticity: bool#
shap_p_value: float#
parent_p_value: float#
shap_model: RegressionResultsWrapper#
parent_model: RegressionResultsWrapper#
shap_discrepancy: float#
shap_correlation: float#
shap_gof: float#
ks_pvalue: float#
ks_result: str#
__init__(target, parent, shap_heteroskedasticity, parent_heteroskedasticity, shap_p_value, parent_p_value, shap_model, parent_model, shap_discrepancy, shap_correlation, shap_gof, ks_pvalue, ks_result)#
class ShapEstimator(explainer='explainer', models=None, correlation_th=None, mean_shap_percentile=0.8, iters=20, reciprocity=False, min_impact=1e-06, exhaustive=False, parallel_jobs=0, on_gpu=False, verbose=False, prog_bar=True, silent=False)[source]#

Bases: BaseEstimator

A class for computing SHAP values and building a causal graph from them.

Parameters:
  • explainer (str, default="explainer") – The SHAP explainer to use. Possible values are “kernel”, “gradient”, and “explainer”.

  • models (BaseEstimator, default=None) – The models to use for computing SHAP values. If None, a linear regression model is used for each feature.

  • correlation_th (float, default=None) – The correlation threshold to use for removing highly correlated features.

  • mean_shap_percentile (float, default=0.8) – The percentile threshold for selecting features based on their mean SHAP value.

  • iters (int, default=20) – The number of iterations to use for the feature selection method.

  • reciprocity (bool, default=False) – Whether to enforce reciprocity in the causal graph.

  • min_impact (float, default=1e-06) – The minimum impact threshold for selecting features.

  • exhaustive (bool, default=False) – Whether to use the exhaustive (recursive) method for selecting features. If True, the threshold parameter must be provided, and the clustering is performed until the remaining values to be clustered fall below the given threshold.

  • threshold (float, default=None) – The threshold to use when exhaustive is True. If None, an exception is raised.

  • on_gpu (bool, default=False) – Whether to use the GPU for computing SHAP values.

  • verbose (bool, default=False) – Whether to print verbose output.

  • prog_bar (bool, default=True) – Whether to show a progress bar.

  • silent (bool, default=False) – Whether to suppress all output.

Attributes:
correlation_th
models
shap_discrepancies

Methods

compute_error_contribution()

Computes the error contribution of each feature for each target.

fit(X)

Fit the ShapleyExplainer model to the given dataset.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(X[, root_causes, prior])

Builds a causal graph from the shap values using a selection mechanism based on clustering, knee or abrupt methods.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, prior, root_causes])

Request metadata passed to the predict method.

adjust

device = 'cpu'#
shap_discrepancies = None#
__init__(explainer='explainer', models=None, correlation_th=None, mean_shap_percentile=0.8, iters=20, reciprocity=False, min_impact=1e-06, exhaustive=False, parallel_jobs=0, on_gpu=False, verbose=False, prog_bar=True, silent=False)[source]#

Initialize the ShapEstimator object.

Parameters:
  • explainer (str, default="explainer") – The SHAP explainer to use. Possible values are “kernel”, “gradient”, and “explainer”.

  • models (BaseEstimator, default=None) – The models to use for computing SHAP values. If None, a linear regression model is used for each feature.

  • correlation_th (float, default=None) – The correlation threshold to use for removing highly correlated features.

  • mean_shap_percentile (float, default=0.8) – The percentile threshold for selecting features based on their mean SHAP value.

  • iters (int, default=20) – The number of iterations to use for the feature selection method.

  • reciprocity (bool, default=False) – Whether to enforce reciprocity in the causal graph.

  • min_impact (float, default=1e-06) – The minimum impact threshold for selecting features.

  • exhaustive (bool, default=False) – Whether to use the exhaustive (recursive) method for selecting features. If True, the threshold parameter must be provided, and the clustering is performed until the remaining values to be clustered fall below the given threshold.

  • threshold (float, default=None) – The threshold to use when exhaustive is True. If None, an exception is raised.

  • on_gpu (bool, default=False) – Whether to use the GPU for computing SHAP values.

  • verbose (bool, default=False) – Whether to print verbose output.

  • prog_bar (bool, default=True) – Whether to show a progress bar.

  • silent (bool, default=False) – Whether to suppress all output.

explainer = 'explainer'#
models = None#
correlation_th = None#
mean_shap_percentile = 0.8#
iters = 20#
reciprocity = False#
min_impact = 1e-06#
exhaustive = False#
parallel_jobs = 0#
on_gpu = False#
verbose = False#
prog_bar = True#
silent = False#
fit(X)[source]#

Fit the ShapleyExplainer model to the given dataset.

Parameters:
  • X – The input dataset.

Returns:

self – The fitted ShapleyExplainer model.

predict(X, root_causes=None, prior=None)[source]#

Builds a causal graph from the shap values using a selection mechanism based on clustering, knee or abrupt methods.

Parameters:
  • X (pd.DataFrame) – The input data. Consists of all the features in a pandas DataFrame.

  • root_causes (List[str], optional) – The root causes of the graph. If None, all features are considered as root causes, by default None.

  • prior (List[List[str]], optional) – The prior knowledge about the connections between the features. If None, all features are considered as valid candidates for the connections, by default None.

Returns:

The causal graph.

Return type:

nx.DiGraph
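A minimal usage sketch based on the signatures above. The import path causalexplain.explainability.shapley is an assumption; per the parameter docs, leaving models=None falls back to a linear regression per feature:

    import numpy as np
    import pandas as pd

    # Assumed import path for the class documented above.
    from causalexplain.explainability.shapley import ShapEstimator

    rng = np.random.default_rng(5)
    X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
    X["c"] = 0.7 * X["a"] - 0.3 * X["b"] + rng.normal(scale=0.1, size=300)

    shap_est = ShapEstimator(explainer="explainer", iters=20, prog_bar=False)
    shap_est.fit(X)

    # predict() builds and returns the causal graph as an nx.DiGraph.
    g = shap_est.predict(X)
    print(list(g.edges()))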

adjust(graph, increase_tolerance=0.0, sd_upper=0.1)[source]#
set_predict_request(*, prior='$UNCHANGED$', root_causes='$UNCHANGED$')#

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • prior (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prior parameter in predict.

  • root_causes (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for root_causes parameter in predict.

Returns:

self – The updated object.

Return type:

object

compute_error_contribution()[source]#

Computes the error contribution of each feature for each target. If this value is positive, then, on average, the presence of the feature in the model leads to a higher error; without that feature, the prediction would generally have been better. In other words, the feature is doing more harm than good. Conversely, the more negative this value, the more beneficial the feature is for the predictions, since its presence leads to smaller errors.
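Continuing from the ShapEstimator sketch above (shap_est already fitted on X), a brief illustration; the exact return structure is not documented here, so it is simply printed:

    # Positive entries: the feature's presence increases the prediction error on average;
    # negative entries: the feature reduces the error.
    err_contrib = shap_est.compute_error_contribution()
    print(err_contrib)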

custom_main(exp_name, path='/Users/renero/phd/data/RC4/', output_path='/Users/renero/phd/output/RC4/', scale=False)[source]#

Runs a custom main function for the given experiment name.

Parameters:
  • exp_name (str) – The name of the experiment to run.

  • path (str) – The path to the data files.

  • output_path (str) – The path to the output files.

Returns:

None

sachs_main()[source]#

Module contents#

Explainability techniques used for causal discovery.

This module contains various techniques and tools for explaining and interpreting causal discovery results:

  • shapley: Implements Shapley value-based methods for attributing importance to features in causal models.

  • regression_quality: Provides metrics and tools for assessing the quality of regression models used in causal discovery.

  • perm_importance: Implements permutation importance methods for feature importance in causal models.

  • hierarchies: Contains tools for analyzing and visualizing hierarchical structures in causal relationships.

These submodules offer a range of approaches to enhance the interpretability and understanding of causal discovery results, aiding in the validation and refinement of causal models.