causalexplain.estimators.cam package#

Submodules#

Original code from the R implementation of the Causal Additive Model (CAM):

@article{buhlmann2014cam,
  title={CAM: Causal additive models, high-dimensional order search and penalized regression},
  author={B{\"u}hlmann, Peter and Peters, Jonas and Ernest, Jan},
  journal={The Annals of Statistics},
  volume={42},
  number={6},
  pages={2526--2556},
  year={2014},
  publisher={Institute of Mathematical Statistics}
}

  • Imports: Imported the necessary modules and functions. It is assumed that computeScoreMat, updateScoreMat, pruning, selGamBoost, and selGam are defined in separate Python files in the same directory.

  • Function Definition: Translated the R function CAM to Python.

  • Variable Initialization: Initialized variables and handled default values.

  • Variable Selection: Used numpy and multiprocessing for parallel processing.

  • Edge Inclusion: Translated the logic for including edges and updating the score matrix.

  • Pruning: Translated the pruning step.

  • Output and Return: Collected and printed the results.

Make sure the corresponding Python files (computeScoreMat.py, updateScoreMat.py, pruning.py, selGamBoost.py, selGam.py) are present in the same directory and contain the necessary functions.

class CAM(name, scoreName='SEMGAM', parsScore=None, numCores=1, maxNumParents=None, verbose=False, variableSel=False, variableSelMethod=<function selGamBoost>, variableSelMethodPars=None, pruning=True, pruneMethod=<function selGam>, pruneMethodPars={'cutOffPVal': 0.05, 'numBasisFcts': 10}, intervData=False, intervMat=None)[source]#

Bases: object

Methods

fit(X) – This method implements the entire CAM algorithm.

fit_predict(train_data[, test_data, ref_graph])

predict([ref_graph])

__init__(name, scoreName='SEMGAM', parsScore=None, numCores=1, maxNumParents=None, verbose=False, variableSel=False, variableSelMethod=<function selGamBoost>, variableSelMethodPars=None, pruning=True, pruneMethod=<function selGam>, pruneMethodPars={'cutOffPVal': 0.05, 'numBasisFcts': 10}, intervData=False, intervMat=None)[source]#
fit(X)[source]#

This method implements the entire CAM algorithm. Translated from the R code.

Parameters:

X (np.array) – Observational data

Returns:

  • edgeList (list) – List of edges

  • scoreVec (list) – List of scores

predict(ref_graph=None)[source]#
fit_predict(train_data, test_data=None, ref_graph=None)[source]#
main(dataset_name, input_path='/Users/renero/phd/data/sachs', output_path='/Users/renero/phd/output/', save=False)[source]#
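For orientation, here is a minimal usage sketch of the CAM class based on the signatures above. The data and run name are illustrative, and fit() is assumed to return the (edgeList, scoreVec) pair documented for it:

```python
import numpy as np
from causalexplain.estimators.cam import CAM

# Illustrative observational data: 500 samples, 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

# 'example' is a placeholder run name; other arguments keep their defaults.
cam = CAM(name="example", numCores=1, pruning=True)
edge_list, score_vec = cam.fit(X)   # assumed to return (edgeList, scoreVec)
```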
  • Function Definition: Translated the R function computeScoreMat to Python as compute_score_mat.

  • Combinations: Used itertools.combinations to generate parent combinations.

  • DataFrame: Used pandas.DataFrame to create the grid of indices.

  • Parallel Processing: Used multiprocessing.Pool for parallel processing.

  • Variance Calculation: Used numpy.var to calculate variance and adjusted the score matrix accordingly.

Make sure the compute_score_mat_parallel function is correctly defined in computeScoreMatParallel.py.
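For illustration, a minimal sketch of the combinations-plus-pool pattern described above. The score_pair function is a hypothetical stand-in for compute_score_mat_parallel, and the variance-based score is a dummy, not the package's actual scoring logic:

```python
import itertools
from multiprocessing import Pool

import numpy as np

def score_pair(args):
    # Hypothetical stand-in for compute_score_mat_parallel: score one
    # (parents, node) pair with a dummy variance-based value.
    X, parents, node = args
    residual = X[:, node] - X[:, list(parents)].mean(axis=1)
    return -np.var(residual)

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(200, 4))
    num_parents = 2

    # Grid of all candidate parent sets of fixed size, for every node.
    tasks = [(X, parents, node)
             for node in range(X.shape[1])
             for parents in itertools.combinations(
                 [p for p in range(X.shape[1]) if p != node], num_parents)]

    with Pool(processes=2) as pool:   # parallel map, as with mcmapply in R
        scores = pool.map(score_pair, tasks)
```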

computeScoreMat(X, score_name, num_parents, verbose, num_cores, sel_mat, pars_score, interv_mat, interv_data)[source]#

Compute the score matrix over all admissible (parent set, node) pairs.

Parameters:
  • X (np.ndarray) – Observational data.

  • score_name (str) – Name of the score function (e.g. 'SEMGAM').

  • num_parents (int) – Number of parents per candidate parent set.

  • verbose (bool) – Whether to print progress messages.

  • num_cores (int) – Number of cores used for parallel processing.

  • sel_mat (np.ndarray) – Selection matrix marking admissible edges.

  • pars_score (dict) – Parameters for the score function.

  • interv_mat (np.ndarray) – Intervention matrix.

  • interv_data (bool) – Whether the data contains interventions.

Returns:

The computed score matrix.

Return type:

np.ndarray

  • R’s cat is replaced with Python’s print.

  • R’s stop is replaced with Python’s raise for exceptions.

  • R’s ! is replaced with Python’s ~ for logical negation.

  • R’s prod is replaced with np.prod from NumPy.

  • R’s var is replaced with np.var from NumPy.
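The substitutions above in miniature:

```python
import numpy as np

mask = np.array([True, False, True])
print(~mask)                    # R: !mask; cat(...) becomes print(...)
print(np.prod([2, 3, 4]))       # R: prod(c(2, 3, 4)) -> 24
print(np.var([1.0, 2.0, 3.0]))  # R: var(...); note np.var divides by n by
                                # default, while R's var divides by n - 1
if np.var([1.0, 1.0]) == 0:
    raise ValueError("zero variance")  # R: stop("zero variance")
```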

computeScoreMatParallel(row_parents, score_name, X, sel_mat, verbose, node2, i, pars_score, interv_mat, interv_data)[source]#
  • The pruning function is translated to Python.

  • dim(G)[1] is replaced with G.shape[0] to get the number of rows.

  • matrix(0,p,p) is replaced with np.zeros((p, p)) to create a zero matrix.

  • which(G[,i]==1) is replaced with np.where(G[:, i] == 1)[0] to find the indices where the condition is true.

  • cbind(X[,parents],X[,i]) is replaced with np.hstack((X[:, parents], X[:, [i]])) to concatenate arrays horizontally.

  • The cat function is replaced with print for output.

  • The pruneMethod function is passed as prune_method and called accordingly.
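The same index-translation idioms side by side, with illustrative shapes:

```python
import numpy as np

p = 4
G = np.zeros((p, p))                     # R: matrix(0, p, p)
G[0, 2] = G[1, 2] = 1
X = np.random.default_rng(0).normal(size=(100, p))

n_rows = G.shape[0]                      # R: dim(G)[1]
i = 2
parents = np.where(G[:, i] == 1)[0]      # R: which(G[, i] == 1)

# R: cbind(X[, parents], X[, i]); X[:, [i]] keeps the column 2-D for hstack.
design = np.hstack((X[:, parents], X[:, [i]]))
```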

pruning(X, G, verbose=False, prune_method=None, prune_method_pars={'cutOffPVal': 0.001, 'numBasisFcts': 10})[source]#

Prune a candidate DAG by re-testing each node’s incoming edges with the given selection method.

Parameters:
  • X (np.ndarray) – Input vectors.

  • G (np.ndarray) – Adjacency matrix representing a DAG.

  • verbose (bool, optional) – Whether to print debug messages. Defaults to False.

  • prune_method (callable, optional) – Feature-selection function used to test each set of parents (e.g. selGam). Defaults to None.

  • prune_method_pars (dict, optional) – Parameters passed to prune_method. Defaults to {'cutOffPVal': 0.001, 'numBasisFcts': 10}.

Returns:

The pruned adjacency matrix.

Return type:

np.ndarray

This Python version aims to replicate the functionality of the R function. Here are some key points about the translation:

  1. We use NumPy for array operations.

  2. Instead of gam from R, we use the pygam library, which provides similar functionality in Python.

  3. The p-values are extracted from the fitted GAM model’s statistics.

  4. The logic for creating and updating selVec is adjusted to work with Python’s 0-based indexing.

Note that this translation assumes that the pygam library is installed and imported. You may need to install it using pip install pygam.

Also, be aware that there might be some differences in the exact implementation details between R’s gam and Python’s pygam. You may need to fine-tune the GAM model creation and fitting process to match the exact behavior of the R version.
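A minimal sketch of this pygam pattern, assuming the p-values are read from the fitted model’s statistics_ dictionary (the 0.001 cutoff mirrors the default cutOffPVal used by pruning above):

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# One smooth term per candidate parent, mirroring the R gam formula.
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)

# pygam reports one p-value per term; the last entry is the intercept.
p_values = np.asarray(gam.statistics_['p_values'][:-1])
sel_vec = p_values < 0.001   # analogue of selVec with cutOffPVal
```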

selGam(X, pars=None, verbose=False, k=None)[source]#

This method selects features based on GAM p-values. It returns a vector of selected features whose p-values are less than the cutOffPVal.

Parameters:
  • X (np.ndarray) – Data matrix whose last column is the target variable and whose remaining columns are the candidate predictors.

  • pars (dict, optional) – Selection parameters such as cutOffPVal and numBasisFcts. Defaults to None.

  • verbose (bool, optional) – Whether to print debug messages. Defaults to False.

  • k (int, optional) – Index of the target column within X. Defaults to None.

Returns:

Vector of the selected features (those whose p-values fall below cutOffPVal).

Return type:

np.ndarray

This Python version of selGamBoost follows the structure and logic of the original R function. Here are some key points about the translation:

  1. We import numpy for array operations and assume that train_GAMboost is imported from a separate file.

  2. The function signature remains similar, with default values for pars and output.

  3. R’s matrix indexing is replaced with NumPy array indexing.

  4. The cat function for output is replaced with Python’s print function.

  5. List comprehensions and NumPy functions are used to replace some R-specific operations.

  6. The xselect() method is assumed to exist in the model returned by train_GAMboost. You may need to adjust this based on the actual implementation; a possible Python analogue is sketched below.

  7. The boolean indexing and selection logic is adapted to work with NumPy arrays.

Note that this translation assumes that the train_GAMboost function in Python returns an object with similar properties to its R counterpart. You may need to adjust the code further based on the exact implementation of train_GAMboost in Python.
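Since scikit-learn estimators expose no xselect(), one plausible Python analogue, assuming train_GAMboost returns a fitted GradientBoostingRegressor, is to treat features with non-zero importance as selected. This is a sketch of that assumption, not the package’s confirmed logic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

model = GradientBoostingRegressor().fit(X, y)

# Rough analogue of mboost's xselect(): the ensemble only assigns
# non-zero importance to features it actually split on.
selected = np.where(model.feature_importances_ > 0)[0]
```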

selGamBoost(X, pars=None, output=False, k=None)[source]#

Select features for the target variable using a boosted GAM fitted with train_GAMboost.

Parameters:
  • X (np.ndarray) – Data matrix with the candidate predictors and the target variable.

  • pars (dict, optional) – Selection parameters. Defaults to None.

  • output (bool, optional) – Whether to print debug messages. Defaults to False.

  • k (int, optional) – Index of the target column within X. Defaults to None.

Returns:

Vector of the selected features.

Return type:

np.ndarray

Key changes and explanations:

  1. Imported numpy for array operations and train_lasso from train_lasso.py.

  2. Changed the function signature to use Python conventions (e.g., None as the default for pars).

  3. Used f-strings for formatted output.

  4. Converted X to a numpy array for easier indexing and shape retrieval.

  5. Adjusted indexing to account for Python’s 0-based indexing (e.g., X[:, :k] instead of X[,-k]).

  6. Implemented the selection vector creation using list comprehensions and boolean operations.

  7. Adjusted the selVec assignment to account for Python’s slicing behavior.

Note that this translation assumes that the train_lasso function in train_lasso.py returns a dictionary with a nested ‘model’ dictionary containing a ‘beta’ list. You may need to adjust the train_lasso call and result handling if its implementation differs from this assumption.
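Under that assumption about train_lasso’s return value, the selection vector could be derived as follows (the result dictionary here is hypothetical):

```python
import numpy as np

# Hypothetical train_lasso output: a non-zero coefficient in 'beta'
# marks the corresponding candidate parent as selected.
result = {'model': {'beta': [0.0, 1.7, -0.4]}}

sel_vec = np.asarray(result['model']['beta']) != 0   # [False, True, True]
```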

selLasso(X, pars=None, output=False, k=None)[source]#

Here are the main changes and explanations:

  • We use numpy for array operations and scipy.stats for linear regression.

  • The default parameter pars is set to None and then initialized if not provided.

  • We use f-strings for string formatting in the print statement.

  • The input X is converted to a numpy array.

  • Instead of using lm, we use scipy.stats.linregress for linear regression.

  • We manually add a constant term to X for the intercept in the regression.

  • The p-values are extracted directly from the linregress result.

  • The selection vector is updated using list slicing to exclude the k-th element.

Note that this Python version assumes that the input X is a 2D array-like object. The function will work similarly to the R version, but there might be slight differences in the exact numerical results due to different underlying implementations of the linear regression.
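A sketch of the per-variable p-value computation with scipy.stats.linregress, which fits one simple regression (with intercept) per candidate column; the 0.001 cutoff is illustrative:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 0.8 * X[:, 2] + rng.normal(scale=0.1, size=200)
k = 1  # index of the variable to exclude, as in selLm's k argument

p_values = np.array([linregress(X[:, j], y).pvalue
                     for j in range(X.shape[1]) if j != k])
sel_vec = p_values < 0.001
```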

selLm(X, pars=None, output=False, k=None)[source]#

Select features based on linear-regression p-values.

Parameters:
  • X (np.ndarray) – A 2D numpy array with the variables.

  • pars (dict) – Parameters.

  • output (bool, optional) – Whether to print debug messages. Defaults to False.

  • k (int, optional) – The index of the target variable (excluded from the predictors).

Returns:

Vector of the selected features.

Return type:

np.ndarray

This Python version attempts to replicate the functionality of the R function. Here are some key points:

  1. We use pandas for data manipulation and sklearn for machine learning components.

  2. The bbs function in R is approximated using SplineTransformer from scikit-learn.

  3. Instead of mboost_fit, we use GradientBoostingRegressor from scikit-learn.

  4. The function returns a dictionary with the same keys as the R version.

Note that this is an approximation, as the exact behavior of bbs and mboost_fit in R might differ from the Python implementations. You may need to fine-tune parameters or use different libraries for a more exact replication of the R function’s behavior.
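A self-contained sketch of that approximation, pairing SplineTransformer (standing in for R’s bbs base learner) with GradientBoostingRegressor (standing in for mboost_fit); the spline and boosting parameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# B-spline expansion followed by boosting, loosely mirroring bbs + mboost_fit.
model = make_pipeline(
    SplineTransformer(degree=3, n_knots=10),
    GradientBoostingRegressor(),
).fit(X, y)

fitted = model.predict(X)
residuals = y - fitted
```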

train_GAMboost(X, y, pars=None)[source]#

Train a boosted generalized additive model.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Model parameters. Defaults to None.

Returns:

Dictionary with the fitted values, residuals, and the fitted model.

Return type:

dict

Here’s an explanation of the changes:

  1. We import numpy for array operations and GradientBoostingRegressor from scikit-learn as an equivalent to R’s glmboost.

  2. The function signature is similar, but we use None as the default for pars instead of an empty list.

  3. We convert inputs to numpy arrays to ensure compatibility.

  4. We center y by subtracting its mean.

  5. We create and fit a GradientBoostingRegressor, which is similar to glmboost in R.

  6. We create a dictionary result with the fitted values, residuals, and the model itself.

  7. The center=TRUE parameter in the R version is not needed, as scikit-learn’s GradientBoostingRegressor handles feature centering internally.

Note that this Python version might not be exactly equivalent to the R version, as there could be differences in the underlying algorithms and default parameters. You may need to adjust the GradientBoostingRegressor parameters to match the behavior of glmboost more closely if needed.
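Putting those steps together, a hedged reconstruction of what such a function might look like (train_lmboost_sketch is illustrative, not the module’s exact code):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_lmboost_sketch(X, y, pars=None):
    # pars is accepted for interface consistency but unused in this sketch.
    pars = pars or {}
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    y_centered = y - y.mean()            # analogue of center=TRUE in glmboost

    model = GradientBoostingRegressor().fit(X, y_centered)
    fitted = model.predict(X) + y.mean()
    return {'Yfit': fitted, 'residuals': y - fitted, 'model': model}
```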

train_LMboost(X, y, pars=None)[source]#

Train a boosted linear model (approximated here with gradient boosting).

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Parameters

Returns:

result

Return type:

dict

Key differences and notes:

  • We use NumPy arrays instead of R matrices.

  • The pygam library is used instead of R’s gam function.

  • The formula creation is different. In pygam, we create a list of smooth terms.

  • Error handling is done with a try-except block instead of R’s try().

  • The df, edf, and edf1 calculations are approximations, as pygam doesn’t provide exact equivalents to R’s GAM implementation.

  • The function signature includes type hints for better code clarity.

To use this function, you’ll need to install the required library: pygam (pip install pygam).

This Python version should provide similar functionality to the R version, but there might be some differences in the exact numerical results due to the different implementations of GAM in R and Python.
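A sketch of building the pygam term list dynamically and guarding the fit with try-except, as described in the notes above:

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

# One smooth term per column: the pygam analogue of an R gam formula.
terms = s(0)
for i in range(1, X.shape[1]):
    terms += s(i)

try:
    gam = LinearGAM(terms).fit(X, y)
except Exception as exc:                 # counterpart of R's try()
    gam = None
    print(f"GAM fit failed: {exc}")
```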

train_gam(X, y, pars=None, verbose=False)[source]#

Train a Generalized Additive Model using pyGAM.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (Dict[str, Any], optional) – Model parameters. Defaults to None.

  • verbose (bool, optional) – Whether to print progress messages. Defaults to False.

Returns:

result – Model results.

Return type:

Dict[str, Any]

train_gam_sm(X, y, pars=None)[source]#

Train a Generalized Additive Model using statsmodels.gam.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (Dict[str, Any], optional) – Model parameters. Defaults to None.

Returns:

Model results.

Return type:

Dict[str, Any]
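For reference, a minimal statsmodels GAM fit of the kind this function presumably wraps; the smoother configuration is illustrative:

```python
import numpy as np
from statsmodels.gam.api import BSplines, GLMGam

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# One B-spline smoother per input column.
smoother = BSplines(X, df=[10, 10], degree=[3, 3])
res = GLMGam(y, smoother=smoother).fit()
fitted = res.fittedvalues
```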

This Python function maintains the same structure and functionality as the original R function:

  1. The function is named train_gp and takes three parameters: X, y, and pars (with a default empty dictionary).

  2. It raises a NotImplementedError with the message “GP regression not implemented.”

  3. It returns None (which is equivalent to R’s NULL).

Note that in Python, we use raise instead of stop() to throw exceptions, and we use NotImplementedError as it’s the most appropriate built-in exception for this case. Also, the default value for pars is set to None and then initialized as an empty dictionary inside the function, which is a common Python idiom to avoid mutable default arguments.
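A sketch matching the three points above, including the None-default idiom:

```python
def train_gp(X, y, pars=None):
    # None as default avoids Python's mutable-default-argument pitfall.
    if pars is None:
        pars = {}
    # Counterpart of R's stop(); nothing is returned, matching R's NULL.
    raise NotImplementedError("GP regression not implemented.")
```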

train_gp(X, y, pars=None)[source]#

Placeholder for Gaussian process regression (not implemented).

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Model parameters. Defaults to None.

Raises:

NotImplementedError – GP regression is not implemented.

Returns:

None

Return type:

None

This Python version accomplishes the same task as the R function:

  1. It uses LassoCV from scikit-learn to perform cross-validation and find the optimal regularization parameter (lambda in R, alpha in Python).

  2. It then trains a final Lasso model using the optimal alpha.

  3. The function returns a dictionary containing the fitted values, residuals, and the trained model.

Note that:

  • The cv.glmnet in R is replaced by LassoCV in Python.

  • The glmnet in R is replaced by Lasso in Python.

  • In scikit-learn, the regularization parameter is called alpha instead of lambda.

  • The pars parameter is kept for consistency, but it’s not used in this implementation. You can extend the function to use additional parameters if needed.

  • The cross-validation is set to 5-fold (you can adjust this if needed).

  • A random state is set for reproducibility.

This Python version should provide equivalent functionality to the original R function.
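A hedged reconstruction following those steps (train_lasso_sketch is illustrative; the real train_lasso may differ in details):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def train_lasso_sketch(X, y, pars=None):
    # pars is kept for interface consistency but unused, as noted above.
    X = np.asarray(X)
    y = np.asarray(y).ravel()

    # Cross-validate the regularization strength (R's lambda, sklearn's alpha).
    cv_model = LassoCV(cv=5, random_state=0).fit(X, y)

    # Refit a plain Lasso at the selected alpha.
    model = Lasso(alpha=cv_model.alpha_).fit(X, y)
    fitted = model.predict(X)
    return {'Yfit': fitted, 'residuals': y - fitted, 'model': model}
```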

train_lasso(X, y, pars=None)[source]#

Train a Lasso regression model with a cross-validated regularization strength.

Parameters:
  • X (np.ndarray) – Input features.

  • y (np.ndarray) – Target variable.

  • pars (dict, optional) – Kept for interface consistency; currently unused. Defaults to None.

Returns:

Dictionary with the fitted values, residuals, and the trained model.

Return type:

dict

train_linear(X, y, pars=None)[source]#

Train a linear regression model.

Parameters:
  • X (numpy.ndarray) – Input features, shape (n_samples, n_features).

  • y (numpy.ndarray) – Target values, shape (n_samples, 1).

  • pars (dict, optional) – Additional parameters for the model. Defaults to None.

Returns:

A dictionary containing:
  • ’Yfit’ (numpy.ndarray): Predicted values, shape (n_samples, 1).

  • ’residuals’ (numpy.ndarray): Residuals (y - y_pred), shape (n_samples, 1).

  • ’model’ (LinearRegression): Fitted sklearn LinearRegression model.

Return type:

dict

Note

The coefficients of the model can be accessed via the ‘model’ key in the returned dictionary, specifically using result[‘model’].coef_.
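A short usage sketch of the documented return value (the import path is assumed):

```python
import numpy as np
from causalexplain.estimators.cam.train_linear import train_linear  # assumed path

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X + 1.0                       # exactly linear data, shape (10, 1)

result = train_linear(X, y)
print(result['model'].coef_)            # slope, ~3.0
print(result['residuals'].ravel())      # ~zeros for this synthetic data
```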

  1. Imports: Imported necessary modules and the compute_score_mat_parallel function.

  2. Function Definition: Translated the R function to Python, maintaining the same logic and structure.

  3. Matrix Operations: Used NumPy for matrix operations.

  4. Parallel Processing: Used Python’s multiprocessing.Pool for parallel processing, similar to R’s mcmapply (see the sketch below).

Make sure to have the compute_score_mat_parallel function defined in a file named compute_score_mat_parallel.py.
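The mcmapply-to-Pool correspondence from point 4, in miniature; rescore is a hypothetical stand-in for the real scoring call:

```python
from multiprocessing import Pool

def rescore(node, parents):
    # Hypothetical stand-in for compute_score_mat_parallel.
    return float(node) - 0.1 * len(parents)

if __name__ == "__main__":
    jobs = [(2, (0, 1)), (3, (0,)), (4, ())]
    # R: mcmapply(fun, nodes, parents, mc.cores = ...) -> Pool.starmap.
    with Pool(processes=2) as pool:
        updated = pool.starmap(rescore, jobs)
```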

updateScoreMat(score_mat, X, score_name, i, j, score_nodes, adj, verbose, num_cores, max_num_parents, pars_score, interv_mat, interv_data)[source]#

Update the score matrix after a new edge has been added to the graph.

Parameters:
  • score_mat (np.ndarray) – Current score matrix.

  • X (np.ndarray) – Observational data.

  • score_name (str) – Name of the score function (e.g. 'SEMGAM').

  • i (int) – Index of the parent node of the newly added edge.

  • j (int) – Index of the child node whose candidate scores are recomputed.

  • score_nodes (np.ndarray) – Current score of each node.

  • adj (np.ndarray) – Current adjacency matrix.

  • verbose (bool) – Whether to print progress messages.

  • num_cores (int) – Number of cores used for parallel processing.

  • max_num_parents (int) – Maximum number of parents allowed per node.

  • pars_score (dict) – Parameters for the score function.

  • interv_mat (np.ndarray) – Intervention matrix.

  • interv_data (bool) – Whether the data contains interventions.

Returns:

The updated score matrix.

Return type:

np.ndarray

Module contents#