causalexplain.estimators.cam package#

Submodules#

Causal Additive Model (CAM) estimator.

This module provides a Python translation of the original CAM implementation:

Bühlmann, P., Peters, J., & Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6), 2526–2556.

The implementation depends on helper modules in this package such as computeScoreMat, updateScoreMat, pruning, selGamBoost, and selGam.

class CAM(name, scoreName='SEMGAM', parsScore=None, numCores=1, maxNumParents=None, verbose=False, variableSel=False, variableSelMethod=<function selGamBoost>, variableSelMethodPars=None, pruning=True, pruneMethod=<function selGam>, pruneMethodPars={'cutOffPVal': 0.05, 'numBasisFcts': 10}, intervData=False, intervMat=None)[source]#

Bases: object

Causal Additive Model (CAM) estimator.

__init__(name, scoreName='SEMGAM', parsScore=None, numCores=1, maxNumParents=None, verbose=False, variableSel=False, variableSelMethod=<function selGamBoost>, variableSelMethodPars=None, pruning=True, pruneMethod=<function selGam>, pruneMethodPars={'cutOffPVal': 0.05, 'numBasisFcts': 10}, intervData=False, intervMat=None)[source]#
fit(X)[source]#

This method implements the entire CAM algorithm, translated from the original R code (see the usage sketch below).

Parameters:

X (np.ndarray) – Observational data

Returns:

  • edgeList (list) – List of edges

  • scoreVec (list) – List of scores
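
A minimal usage sketch, assuming the import path below and that fit() returns the (edgeList, scoreVec) pair documented above:

```python
import numpy as np
from causalexplain.estimators.cam import CAM  # import path assumed

# Hypothetical observational data: 500 samples, 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

# Instantiate with defaults; 'name' is a label for this estimator instance.
cam = CAM(name="cam")

# fit() runs the full CAM order search (plus optional variable selection
# and pruning) and returns the edge list with the corresponding scores.
edge_list, score_vec = cam.fit(X)
print(edge_list)
```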

predict(ref_graph=None)[source]#
fit_predict(train_data, test_data=None, ref_graph=None)[source]#
main(dataset_name, input_path='/Users/renero/phd/data/sachs', output_path='/Users/renero/phd/output/', save=False)[source]#

Compute the CAM score matrix for candidate parent sets.

computeScoreMat(X, score_name, num_parents, verbose, num_cores, sel_mat, pars_score, interv_mat, interv_data)[source]#

Calculate score entries for all parent combinations.

  • R’s cat is replaced with Python’s print.

  • R’s stop is replaced with Python’s raise for exceptions.

  • R’s ! is replaced with Python’s ~ for logical negation.

  • R’s prod is replaced with np.prod from NumPy.

  • R’s var is replaced with np.var from NumPy (these equivalents are illustrated in the sketch below).
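
For reference, a runnable illustration of these NumPy equivalents (note that np.var defaults to the population variance, ddof=0, whereas R’s var divides by n − 1):

```python
import numpy as np

mask = np.array([True, False, True])
print(~mask)                            # R: !mask (logical negation)
print(np.prod([2, 3, 4]))               # R: prod(c(2, 3, 4)) -> 24
print(np.var([1.0, 2.0, 3.0]))          # ddof=0 (population variance)
print(np.var([1.0, 2.0, 3.0], ddof=1))  # matches R's var()
```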

computeScoreMatParallel(row_parents, score_name, X, sel_mat, verbose, node2, i, pars_score, interv_mat, interv_data)[source]#
  • The pruning function is translated to Python.

  • dim(G)[1] is replaced with G.shape[0] to get the number of rows.

  • matrix(0, p, p) is replaced with np.zeros((p, p)) to create a zero matrix.

  • which(G[,i]==1) is replaced with np.where(G[:, i] == 1)[0] to find the indices where the condition is true.

  • cbind(X[,parents], X[,i]) is replaced with np.hstack((X[:, parents], X[:, [i]])) to concatenate arrays horizontally.

  • The cat function is replaced with print for output.

  • The pruneMethod function is passed as prune_method and called accordingly (see the sketch after this list).
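
A short sketch of these indexing translations, using a hypothetical G and X for illustration only:

```python
import numpy as np

p = 4
G = np.zeros((p, p))                    # R: matrix(0, p, p)
G[0, 2] = G[1, 2] = 1                   # edges 0 -> 2 and 1 -> 2
i = 2
parents = np.where(G[:, i] == 1)[0]     # R: which(G[, i] == 1)

X = np.random.default_rng(0).normal(size=(10, p))
design = np.hstack((X[:, parents], X[:, [i]]))  # R: cbind(X[, parents], X[, i])
print(G.shape[0], parents, design.shape)        # R: dim(G)[1]
```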

pruning(X, G, verbose=False, prune_method=None, prune_method_pars={'cutOffPVal': 0.001, 'numBasisFcts': 10})[source]#

Prune the edges of a candidate DAG using the supplied selection method.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • G (np.ndarray) – Adjacency matrix representing a DAG.

  • verbose (bool, optional) – Whether to print debug messages. Defaults to False.

  • prune_method (callable, optional) – Feature-selection function used for pruning (e.g. selGam). Defaults to None.

  • prune_method_pars (dict, optional) – Parameters passed to the selection function. Defaults to {‘cutOffPVal’: 0.001, ‘numBasisFcts’: 10}.

Returns:

The pruned adjacency matrix.

Return type:

np.ndarray

This Python version aims to replicate the functionality of the R function. Here are some key points about the translation:

  1. We use NumPy for array operations.

  2. Instead of R’s gam, we use the pygam library, which provides similar functionality in Python.

  3. The p-values are extracted from the fitted GAM model’s statistics.

  4. The logic for creating and updating selVec is adjusted to work with Python’s 0-based indexing.

Note that this translation assumes that the pygam library is installed and imported. You may need to install it using pip install pygam.

Also, be aware that there might be some differences in the exact implementation details between R’s gam and Python’s pygam. You may need to fine-tune the GAM model creation and fitting process to match the exact behavior of the R version.
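
As an illustration of the p-value extraction described above, a minimal pygam sketch (the 0.001 threshold mirrors the pruning default for cutOffPVal; this is not the package’s actual code):

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# One smooth term per predictor.
gam = LinearGAM(s(0) + s(1)).fit(X, y)

# pygam stores per-term p-values (the last entry is the intercept).
p_values = gam.statistics_['p_values']
selected = [j for j, p in enumerate(p_values[:-1]) if p < 0.001]
print(selected)  # expected: [0]
```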

selGam(X, pars=None, verbose=False, k=None)[source]#

This method selects features based on GAM p-values. It returns a vector of selected features whose p-values are less than the cutOffPVal.

Parameters:
  • X (np.ndarray) – Input data matrix.

  • pars (dict, optional) – Selection parameters such as cutOffPVal and numBasisFcts. Defaults to None.

  • verbose (bool, optional) – Whether to print debug messages. Defaults to False.

  • k (int, optional) – Index of the target variable. Defaults to None.

Returns:

Vector of the selected features whose p-values are below cutOffPVal.

Return type:

list

GAM boost selection helper translated from the original R code.

selGamBoost(X, pars=None, output=False, k=None)[source]#

Select candidate parents using boosted GAMs.

Lasso selection helper translated from the original R code.

selLasso(X, pars=None, output=False, k=None)[source]#

Select candidate parents using Lasso regression.

Here are the main changes and explanations:

  • We use numpy for array operations and scipy.stats for linear regression.

  • The default parameter pars is set to None and then initialized if not provided.

  • We use f-strings for string formatting in the print statement.

  • The input X is converted to a numpy array.

  • Instead of using lm, we use scipy.stats.linregress for linear regression.

  • We manually add a constant term to X for the intercept in the regression.

  • The p-values are extracted directly from the linregress result.

  • The selection vector is updated using list slicing to exclude the k-th element.

Note that this Python version assumes that the input X is a 2D array-like object. The function will work similarly to the R version, but there might be slight differences in the exact numerical results due to different underlying implementations of the linear regression.
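
A minimal sketch of this per-variable selection with scipy.stats.linregress (hypothetical data and threshold; not the package’s actual code):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

k = 1  # index of the target variable
y = X[:, k]
predictors = [j for j in range(X.shape[1]) if j != k]

# Keep the predictors whose regression p-value falls below the cutoff.
p_values = {j: linregress(X[:, j], y).pvalue for j in predictors}
selected = [j for j, p in p_values.items() if p < 0.05]
print(selected)  # expected: [0]
```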

selLm(X, pars=None, output=False, k=None)[source]#

Select candidate parents using linear-regression p-values.

Parameters:
  • X (np.ndarray) – A 2D numpy array with the variables.

  • pars (dict, optional) – Parameters. Defaults to None.

  • output (bool, optional) – Whether to print debug messages. Defaults to False.

  • k (int, optional) – The index of the target variable.

Returns:

Selection vector marking the candidate parents of variable k.

Return type:

list

Linear-model boost selection helper translated from the original R code.

selLmBoost(X, pars=None, output=False, k=None)[source]#

Select candidate parents using boosted linear models.

This Python version attempts to replicate the functionality of the R function. Here are some key points:

  1. We use pandas for data manipulation and sklearn for machine learning components.

  2. The bbs function in R is approximated using SplineTransformer from scikit-learn.

  3. Instead of mboost_fit, we use GradientBoostingRegressor from scikit-learn.

  4. The function returns a dictionary with the same keys as the R version.

Note that this is an approximation, as the exact behavior of bbs and mboost_fit in R might differ from the Python implementations. You may need to fine-tune parameters or use different libraries for a more exact replication of the R function’s behavior.
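
A sketch of the scikit-learn approximation described above, combining SplineTransformer (standing in for bbs) with GradientBoostingRegressor (standing in for mboost_fit), on hypothetical data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# B-spline basis expansion followed by gradient boosting.
model = make_pipeline(
    SplineTransformer(degree=3, n_knots=10),
    GradientBoostingRegressor(random_state=0),
)
model.fit(X, y)
print(model.predict(X[:5]))
```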

train_GAMboost(X, y, pars=None)[source]#

Train a gradient-boosted model, the Python analogue of R’s glmboost.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Model parameters. Defaults to None.

Returns:

Dictionary with the fitted values, the residuals, and the fitted model.

Return type:

dict

Here’s an explanation of the changes:

  1. We import numpy for array operations and GradientBoostingRegressor from scikit-learn as an equivalent to R’s glmboost.

  2. The function signature is similar, but we use None as the default for pars instead of an empty list.

  3. We convert inputs to numpy arrays to ensure compatibility.

  4. We center y by subtracting its mean.

  5. We create and fit a GradientBoostingRegressor, which is similar to glmboost in R.

  6. We create a dictionary result with the fitted values, residuals, and the model itself.

  7. The center=TRUE parameter in the R version is not needed, as scikit-learn’s GradientBoostingRegressor handles feature centering internally.

Note that this Python version might not be exactly equivalent to the R version, as there could be differences in the underlying algorithms and default parameters. You may need to adjust the GradientBoostingRegressor parameters to match the behavior of glmboost more closely if needed.
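
A sketch of the train_GAMboost steps listed above (a hypothetical re-implementation, not the package’s actual code; the return keys mirror those documented for train_linear below):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_gamboost_sketch(X, y, pars=None):
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    y_centered = y - y.mean()                  # step 4: center y
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X, y_centered)                   # step 5: glmboost analogue
    y_fit = model.predict(X)
    return {                                   # step 6: R-style result dict
        'Yfit': y_fit,
        'residuals': y_centered - y_fit,
        'model': model,
    }
```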

train_LMboost(X, y, pars=None)[source]#

Train a boosted linear model.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Parameters

Returns:

result – Dictionary with the model results.

Return type:

dict

Key differences and notes:

  • We use NumPy arrays instead of R matrices.

  • The pygam library is used instead of R’s gam function.

  • The formula creation is different. In pygam, we create a list of smooth terms.

  • Error handling is done with a try-except block instead of R’s try().

  • The df, edf, and edf1 calculations are approximations, as pygam doesn’t provide exact equivalents to R’s GAM implementation.

  • The function signature includes type hints for better code clarity.

To use this function, you’ll need to install the required library: pygam.

This Python version should provide similar functionality to the R version, but there might be some differences in the exact numerical results due to the different implementations of GAM in R and Python.
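
A sketch of the pygam-based construction described above, as it might look inside train_gam (a hypothetical helper; mapping numBasisFcts to n_splines is an assumption):

```python
import numpy as np
from pygam import LinearGAM, s

def train_gam_sketch(X, y, num_basis_fcts=10):
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    # Build one smooth term per column (pygam's analogue of the R formula).
    terms = s(0, n_splines=num_basis_fcts)
    for j in range(1, X.shape[1]):
        terms += s(j, n_splines=num_basis_fcts)
    try:
        gam = LinearGAM(terms).fit(X, y)       # analogue of R's try(gam(...))
    except Exception as exc:
        raise RuntimeError("GAM fit failed") from exc
    y_fit = gam.predict(X)
    return {'Yfit': y_fit, 'residuals': y - y_fit, 'model': gam}
```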

train_gam(X, y, pars=None, verbose=False)[source]#

Train a Generalized Additive Model using pyGAM.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (Dict[str, Any], optional) – Model parameters. Defaults to None.

Returns:

result – Model results.

Return type:

Dict[str, Any]

train_gam_sm(X, y, pars=None)[source]#

Train a Generalized Additive Model using statsmodels.gam.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (Dict[str, Any], optional) – Model parameters. Defaults to None.

Returns:

Model results.

Return type:

Dict[str, Any]

This Python function maintains the same structure and functionality as the original R function:

  1. The function is named train_gp and takes three parameters: X, y, and pars (with a default empty dictionary).

  2. It raises a NotImplementedError with the message “GP regression not implemented.”

  3. It returns None (which is equivalent to R’s NULL).

Note that in Python, we use raise instead of stop() to throw exceptions, and we use NotImplementedError as it’s the most appropriate built-in exception for this case. Also, the default value for pars is set to None and then initialized as an empty dictionary inside the function, which is a common Python idiom to avoid mutable default arguments.
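
A sketch matching the behavior described above (effectively the whole function, since it only raises):

```python
def train_gp_sketch(X, y, pars=None):
    """Placeholder for GP regression, mirroring the documented behavior."""
    if pars is None:
        pars = {}  # avoid a mutable default argument
    raise NotImplementedError("GP regression not implemented.")
```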

train_gp(X, y, pars=None)[source]#

Train a Gaussian process regression model (not implemented).

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Model parameters. Defaults to None.

Raises:

NotImplementedError – GP regression is not implemented.

Returns:

None

Return type:

None

This Python version accomplishes the same task as the R function:

  1. It uses LassoCV from scikit-learn to perform cross-validation and find the optimal regularization parameter (lambda in R, alpha in Python).

  2. It then trains a final Lasso model using the optimal alpha.

  3. The function returns a dictionary containing the fitted values, residuals, and the trained model.

Note that:

  • The cv.glmnet in R is replaced by LassoCV in Python.

  • The glmnet in R is replaced by Lasso in Python.

  • In scikit-learn, the regularization parameter is called alpha instead of lambda.

  • The pars parameter is kept for consistency, but it’s not used in this implementation. You can extend the function to use additional parameters if needed.

  • The cross-validation is set to 5-fold (you can adjust this if needed).

  • A random state is set for reproducibility.

This Python version should provide equivalent functionality to the original R function.
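
A sketch of the LassoCV workflow described above (a hypothetical re-implementation, not the package’s actual code):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def train_lasso_sketch(X, y, pars=None):
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    cv_model = LassoCV(cv=5, random_state=0).fit(X, y)  # choose alpha by 5-fold CV
    model = Lasso(alpha=cv_model.alpha_).fit(X, y)      # final fit at the best alpha
    y_fit = model.predict(X)
    return {'Yfit': y_fit, 'residuals': y - y_fit, 'model': model}
```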

train_lasso(X, y, pars=None)[source]#

Train a Lasso regression model with a cross-validated regularization parameter.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Model parameters (currently unused). Defaults to None.

Returns:

Dictionary containing the fitted values, residuals, and the trained model.

Return type:

dict

train_linear(X, y, pars=None)[source]#

Train a linear regression model.

Parameters:
  • X (numpy.ndarray) – Input features, shape (n_samples, n_features).

  • y (numpy.ndarray) – Target values, shape (n_samples, 1).

  • pars (dict, optional) – Additional parameters for the model. Defaults to None.

Returns:

A dictionary containing:
  • ’Yfit’ (numpy.ndarray): Predicted values, shape (n_samples, 1).

  • ’residuals’ (numpy.ndarray): Residuals (y - y_pred), shape (n_samples, 1).

  • ’model’ (LinearRegression): Fitted sklearn LinearRegression model.

Return type:

dict

Note

The coefficients of the model can be accessed via the ‘model’ key in the returned dictionary, specifically using result[‘model’].coef_.
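
A short usage sketch (the import path is assumed; shapes follow the documentation above):

```python
import numpy as np
from causalexplain.estimators.cam import train_linear  # import path assumed

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([[1.0], [0.0], [-2.0]])

result = train_linear(X, y)
print(result['model'].coef_)      # access the coefficients, as noted above
print(result['residuals'].shape)  # documented as (n_samples, 1)
```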

  1. Imports: Imported the necessary modules and the compute_score_mat_parallel function.

  2. Function Definition: Translated the R function to Python, maintaining the same logic and structure.

  3. Matrix Operations: Used NumPy for matrix operations.

  4. Parallel Processing: Used Python’s multiprocessing.Pool for parallel processing, similar to R’s mcmapply.

Make sure to have the compute_score_mat_parallel function defined in a file named compute_score_mat_parallel.py.
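
A minimal illustration of the multiprocessing.Pool pattern used as the mcmapply analogue (hypothetical scorer and candidate pairs, for illustration only):

```python
from functools import partial
from multiprocessing import Pool

def score_candidate(pair, X):
    """Hypothetical stand-in for the per-candidate score computation."""
    node, parent = pair
    return node, parent, float(node + parent)  # placeholder score

def score_all(candidates, X, num_cores):
    # R's mcmapply analogue: map the scorer over the candidate pairs.
    with Pool(processes=num_cores) as pool:
        return pool.map(partial(score_candidate, X=X), candidates)

if __name__ == "__main__":
    print(score_all([(0, 1), (0, 2), (1, 2)], X=None, num_cores=2))
```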

updateScoreMat(score_mat, X, score_name, i, j, score_nodes, adj, verbose, num_cores, max_num_parents, pars_score, interv_mat, interv_data)[source]#

Update the score matrix after a new edge has been added to the graph.

Parameters:
  • score_mat (np.ndarray) – Current score matrix.

  • X (np.ndarray) – Observational data.

  • score_name (str) – Name of the score to use (e.g. ‘SEMGAM’).

  • i (int) – Source node of the newly added edge.

  • j (int) – Target node of the newly added edge.

  • score_nodes (np.ndarray) – Current scores of the nodes.

  • adj (np.ndarray) – Current adjacency matrix.

  • verbose (bool) – Whether to print debug messages.

  • num_cores (int) – Number of cores used for parallel computation.

  • max_num_parents (int) – Maximum number of parents allowed per node.

  • pars_score (dict) – Parameters for the score function.

  • interv_mat (np.ndarray) – Matrix describing which samples are intervened (used when interv_data is True).

  • interv_data (bool) – Whether the data contains interventions.

Returns:

The updated score matrix.

Return type:

np.ndarray

Module contents#