causalexplain.estimators.cam package#
Submodules#
Causal Additive Model (CAM) estimator.
This module provides a Python translation of the original CAM implementation:
Bühlmann, P., Peters, J., and Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6), 2526–2556.
The implementation depends on helper modules in this package such as
computeScoreMat, updateScoreMat, pruning, selGamBoost, and
selGam.
- class CAM(name, scoreName='SEMGAM', parsScore=None, numCores=1, maxNumParents=None, verbose=False, variableSel=False, variableSelMethod=<function selGamBoost>, variableSelMethodPars=None, pruning=True, pruneMethod=<function selGam>, pruneMethodPars={'cutOffPVal': 0.05, 'numBasisFcts': 10}, intervData=False, intervMat=None)[source]#
Bases: object
Causal Additive Model (CAM) estimator.
- __init__(name, scoreName='SEMGAM', parsScore=None, numCores=1, maxNumParents=None, verbose=False, variableSel=False, variableSelMethod=<function selGamBoost>, variableSelMethodPars=None, pruning=True, pruneMethod=<function selGam>, pruneMethodPars={'cutOffPVal': 0.05, 'numBasisFcts': 10}, intervData=False, intervMat=None)[source]#
- main(dataset_name, input_path='/Users/renero/phd/data/sachs', output_path='/Users/renero/phd/output/', save=False)[source]#
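A minimal instantiation sketch, assuming the CAM class is importable from this package as shown; the run name "toy" is hypothetical and only parameters listed in the signature above are used:

```python
from causalexplain.estimators.cam import CAM

# Instantiation sketch using only parameters from the signature above;
# the run name "toy" is hypothetical.
cam = CAM(
    name="toy",
    scoreName="SEMGAM",
    numCores=1,
    pruning=True,
    pruneMethodPars={"cutOffPVal": 0.05, "numBasisFcts": 10},
    verbose=False,
)
```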
Compute the CAM score matrix for candidate parent sets.
- computeScoreMat(X, score_name, num_parents, verbose, num_cores, sel_mat, pars_score, interv_mat, interv_data)[source]#
Calculate score entries for all parent combinations.
- R’s cat is replaced with Python’s print.
- R’s stop is replaced with Python’s raise for exceptions.
- R’s ! is replaced with Python’s ~ for logical negation.
- R’s prod is replaced with np.prod from NumPy.
- R’s var is replaced with np.var from NumPy.
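A side-by-side sketch of these substitutions, with hypothetical data (the variable names are illustrative, not taken from the package):

```python
import numpy as np

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
sel_mat = np.array([[True, False], [False, True]])

not_selected = ~sel_mat          # R: !selMat
n_entries = np.prod(X.shape)     # R: prod(dim(X))
col_var = np.var(X[:, 0])        # R: var(X[, 1])

if n_entries == 0:               # R: stop(...) becomes raise
    raise ValueError("empty data matrix")
print(f"variance of first column: {col_var:.3f}")  # R: cat(...)
```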
- computeScoreMatParallel(row_parents, score_name, X, sel_mat, verbose, node2, i, pars_score, interv_mat, interv_data)[source]#
The pruning function is translated to Python.
- dim(G)[1] is replaced with G.shape[0] to get the number of rows.
- matrix(0,p,p) is replaced with np.zeros((p, p)) to create a zero matrix.
- which(G[,i]==1) is replaced with np.where(G[:, i] == 1)[0] to find the indices where the condition is true.
- cbind(X[,parents],X[,i]) is replaced with np.hstack((X[:, parents], X[:, [i]])) to concatenate arrays horizontally.
- The cat function is replaced with print for output.
- The pruneMethod function is passed as prune_method and called accordingly.
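A short sketch of these array idioms, with hypothetical G and X:

```python
import numpy as np

# Illustrative data; p, G and X are hypothetical stand-ins.
rng = np.random.default_rng(0)
p = 4
G = (rng.random((p, p)) > 0.7).astype(int)
X = rng.normal(size=(50, p))
i = 2

n_rows = G.shape[0]                              # R: dim(G)[1]
final_G = np.zeros((p, p))                       # R: matrix(0, p, p)
parents = np.where(G[:, i] == 1)[0]              # R: which(G[, i] == 1)
design = np.hstack((X[:, parents], X[:, [i]]))   # R: cbind(X[, parents], X[, i])
print(design.shape)
```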
- pruning(X, G, verbose=False, prune_method=None, prune_method_pars={'cutOffPVal': 0.001, 'numBasisFcts': 10})[source]#
Prune the candidate DAG by re-testing each node's parent set with the supplied selection method.
- Parameters:
X (numpy.ndarray) – Input vectors (one column per variable).
G (numpy.ndarray) – Adjacency matrix representing a DAG.
verbose (bool, optional) – Whether to print debug messages. Defaults to False.
prune_method (callable, optional) – Selection function used for pruning, e.g. selGam. Defaults to None.
prune_method_pars (dict, optional) – Parameters for the pruning method. Defaults to {'cutOffPVal': 0.001, 'numBasisFcts': 10}.
- Returns:
The pruned adjacency matrix.
- Return type:
numpy.ndarray
This Python version aims to replicate the functionality of the R function. Here are some key points about the translation:
- We use NumPy for array operations.
- Instead of gam from R, we use the pygam library, which provides similar functionality in Python.
- The p-values are extracted from the fitted GAM model’s statistics.
- The logic for creating and updating selVec is adjusted to work with Python’s 0-based indexing.
Note that this translation assumes that the pygam library is installed and imported. You may need to install it using pip install pygam.
Also, be aware that there might be some differences in the exact implementation details between R’s gam and Python’s pygam. You may need to fine-tune the GAM model creation and fitting process to match the exact behavior of the R version.
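A minimal sketch of the p-value-based selection, assuming pygam is installed; the data and the 0.05 cutoff are illustrative:

```python
import numpy as np
from pygam import LinearGAM, s

# Hypothetical data; the 0.05 cutoff mirrors the default cutOffPVal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)
p_values = gam.statistics_["p_values"][:-1]   # drop the intercept term
sel_vec = np.array(p_values) < 0.05           # keep features below the cutoff
print(sel_vec)
```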
- selGam(X, pars=None, verbose=False, k=None)[source]#
This method selects features based on GAM p-values. It returns a vector of selected features whose p-values are less than the cutOffPVal threshold.
- Parameters:
X (numpy.ndarray) – Data matrix; column k is the response and the remaining columns are candidate parents.
pars (dict, optional) – Selection parameters such as cutOffPVal and numBasisFcts. Defaults to None.
verbose (bool, optional) – Whether to print debug messages. Defaults to False.
k (int, optional) – Index of the target (response) variable. Defaults to None.
- Returns:
Selection vector marking the features whose p-values are below cutOffPVal.
- Return type:
numpy.ndarray
GAM boost selection helper translated from the original R code.
- selGamBoost(X, pars=None, output=False, k=None)[source]#
Select candidate parents using boosted GAMs.
Lasso selection helper translated from the original R code.
- selLasso(X, pars=None, output=False, k=None)[source]#
Select candidate parents using Lasso regression.
Here are the main changes and explanations:
- We use numpy for array operations and scipy.stats for linear regression.
- The default parameter pars is set to None and then initialized if not provided.
- We use f-strings for string formatting in the print statements.
- The input X is converted to a numpy array.
- Instead of using lm, we use scipy.stats.linregress for linear regression.
- We manually add a constant term to X for the intercept in the regression.
- The p-values are extracted directly from the linregress result.
- The selection vector is updated using list slicing to exclude the k-th element.
Note that this Python version assumes that the input X is a 2D array-like object. The function will work similarly to the R version, but there might be slight differences in the exact numerical results due to different underlying implementations of the linear regression.
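A hedged sketch of per-feature p-value selection with scipy.stats.linregress; the data and the per-column loop are illustrative, not the package's exact logic:

```python
import numpy as np
from scipy import stats

# Hypothetical data: only the middle column drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# One univariate regression per candidate feature.
p_values = [stats.linregress(X[:, j], y).pvalue for j in range(X.shape[1])]
sel_vec = np.array(p_values) < 0.05
print(sel_vec)   # only the informative column should be selected
```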
Linear-model boost selection helper translated from the original R code.
- selLmBoost(X, pars=None, output=False, k=None)[source]#
Select candidate parents using boosted linear models.
This Python version attempts to replicate the functionality of the R function. Here are some key points:
- We use pandas for data manipulation and sklearn for machine learning components.
- The bbs function in R is approximated using SplineTransformer from scikit-learn.
- Instead of mboost_fit, we use GradientBoostingRegressor from scikit-learn.
- The function returns a dictionary with the same keys as the R version.
Note that this is an approximation, as the exact behavior of bbs and mboost_fit in R might differ from the Python implementations. You may need to fine-tune parameters or use different libraries for a more exact replication of the R function’s behavior.
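A rough sketch of that approximation, with illustrative hyperparameters:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Hypothetical data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

model = make_pipeline(
    SplineTransformer(n_knots=10, degree=3),      # rough stand-in for R's bbs
    GradientBoostingRegressor(n_estimators=100),  # rough stand-in for mboost_fit
).fit(X, y)

result = {"Yfit": model.predict(X), "model": model}
result["residuals"] = y - result["Yfit"]
```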
- train_GAMboost(X, y, pars=None)[source]#
Train a boosted additive model, an approximation of R's glmboost.
- Parameters:
X (numpy.ndarray) – Training features.
y (numpy.ndarray) – Target values.
pars (dict, optional) – Additional parameters for the model. Defaults to None.
- Returns:
A dictionary containing the fitted values, the residuals, and the fitted model.
- Return type:
dict
Here’s an explanation of the changes:
1. We import numpy for array operations and GradientBoostingRegressor from scikit-learn as an equivalent to R’s glmboost.
2. The function signature is similar, but we use None as the default for pars instead of an empty list.
3. We convert inputs to numpy arrays to ensure compatibility.
4. We center y by subtracting its mean.
5. We create and fit a GradientBoostingRegressor, which is similar to glmboost in R.
6. We create a dictionary result with the fitted values, residuals, and the model itself.
7. The center=TRUE parameter in the R version is not needed, as scikit-learn’s GradientBoostingRegressor handles feature centering internally.
Note that this Python version might not be exactly equivalent to the R version, as there could be differences in the underlying algorithms and default parameters. You may need to adjust the GradientBoostingRegressor parameters to match the behavior of glmboost more closely if needed.
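A minimal sketch of those steps; the function name train_gamboost_sketch and the default hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_gamboost_sketch(X, y, pars=None):
    """Hedged sketch of the glmboost-style training described above."""
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    y_mean = y.mean()
    model = GradientBoostingRegressor().fit(X, y - y_mean)  # center y first
    y_fit = model.predict(X) + y_mean
    return {"Yfit": y_fit, "residuals": y - y_fit, "model": model}
```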
Key differences and notes:
- We use NumPy arrays instead of R matrices.
- The pygam library is used instead of R’s gam function.
- The formula creation is different: in pygam, we build a list of smooth terms.
- Error handling is done with a try-except block instead of R’s try().
- The df, edf, and edf1 calculations are approximations, as pygam doesn’t provide exact equivalents to R’s GAM implementation.
- The function signature includes type hints for better code clarity.
To use this function, you’ll need to install the required library pygam (pip install pygam).
This Python version should provide similar functionality to the R version, but there might be some differences in the exact numerical results due to the different implementations of GAM in R and Python.
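A hedged sketch of the term construction and error handling described above, assuming pygam is installed:

```python
import numpy as np
from pygam import LinearGAM, s

# Hypothetical data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = np.sin(X[:, 0]) + X[:, 2] + 0.1 * rng.normal(size=150)

terms = s(0)
for j in range(1, X.shape[1]):
    terms += s(j)                    # one smooth term per feature

try:                                 # R's try() becomes try/except
    gam = LinearGAM(terms).fit(X, y)
except Exception as exc:
    raise RuntimeError("GAM fit failed") from exc

print(gam.statistics_["edof"])       # approximate analogue of R's edf
```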
This Python function maintains the same structure and functionality as the original R function:
1. The function is named train_gp and takes three parameters: X, y, and pars (with a default empty dictionary).
2. It raises a NotImplementedError with the message “GP regression not implemented.”
3. It returns None (which is equivalent to R’s NULL).
Note that in Python, we use raise instead of stop() to throw exceptions, and we use NotImplementedError as it’s the most appropriate built-in exception for this case. Also, the default value for pars is set to None and then initialized as an empty dictionary inside the function, which is a common Python idiom to avoid mutable default arguments.
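A sketch of the placeholder described above:

```python
def train_gp(X, y, pars=None):
    """Sketch of the placeholder behaviour described above."""
    if pars is None:
        pars = {}   # avoid a mutable default argument
    raise NotImplementedError("GP regression not implemented.")
```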
- train_gp(X, y, pars=None)[source]#
Placeholder for Gaussian-process regression (not implemented).
- Parameters:
X (numpy.ndarray) – Training features.
y (numpy.ndarray) – Target values.
pars (dict, optional) – Additional parameters. Defaults to None.
- Raises:
NotImplementedError – Always; GP regression is not implemented.
- Returns:
None.
- Return type:
None
This Python version accomplishes the same task as the R function:
1. It uses LassoCV from scikit-learn to perform cross-validation and find the optimal regularization parameter (lambda in R, alpha in Python).
2. It then trains a final Lasso model using the optimal alpha.
3. The function returns a dictionary containing the fitted values, residuals, and the trained model.
Note that:
- The cv.glmnet in R is replaced by LassoCV in Python.
- The glmnet in R is replaced by Lasso in Python.
- In scikit-learn, the regularization parameter is called alpha instead of lambda.
- The pars parameter is kept for consistency, but it is not used in this implementation. You can extend the function to use additional parameters if needed.
- The cross-validation is set to 5-fold (you can adjust this if needed).
- A random state is set for reproducibility.
This Python version should provide equivalent functionality to the original R function.
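A minimal sketch of the two-step procedure; the function name, cv=5, and random_state=42 are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def train_lasso_sketch(X, y, pars=None):
    """Hedged sketch of the two-step procedure described above."""
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    cv_model = LassoCV(cv=5, random_state=42).fit(X, y)  # pick alpha by CV
    model = Lasso(alpha=cv_model.alpha_).fit(X, y)       # refit at best alpha
    y_fit = model.predict(X)
    return {"Yfit": y_fit, "residuals": y - y_fit, "model": model}
```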
- train_lasso(X, y, pars=None)[source]#
Train a Lasso regression model, selecting the regularization strength by cross-validation.
- Parameters:
X (numpy.ndarray) – Training features.
y (numpy.ndarray) – Target values.
pars (dict, optional) – Additional parameters (currently unused). Defaults to None.
- Returns:
A dictionary containing the fitted values, the residuals, and the trained model.
- Return type:
dict
- train_linear(X, y, pars=None)[source]#
Train a linear regression model.
- Parameters:
X (numpy.ndarray) – Input features, shape (n_samples, n_features).
y (numpy.ndarray) – Target values, shape (n_samples, 1).
pars (dict, optional) – Additional parameters for the model. Defaults to None.
- Returns:
- A dictionary containing:
’Yfit’ (numpy.ndarray): Predicted values, shape (n_samples, 1).
’residuals’ (numpy.ndarray): Residuals (y - y_pred), shape (n_samples, 1).
’model’ (LinearRegression): Fitted sklearn LinearRegression model.
- Return type:
dict
Note
The coefficients of the model can be accessed via the ‘model’ key in the returned dictionary, specifically using result[‘model’].coef_.
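A usage sketch matching the documented return structure, with hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data of shape (n_samples, n_features) and (n_samples, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([[1.5], [-0.5]]) + 0.1 * rng.normal(size=(100, 1))

model = LinearRegression().fit(X, y)
result = {"Yfit": model.predict(X), "model": model}
result["residuals"] = y - result["Yfit"]
print(result["model"].coef_)   # access the coefficients, as noted above
```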
1. Imports: Imported necessary modules and the compute_score_mat_parallel function.
2. Function Definition: Translated the R function to Python, maintaining the same logic and structure.
3. Matrix Operations: Used NumPy for matrix operations.
4. Parallel Processing: Used Python’s multiprocessing.Pool for parallel processing, similar to R’s mcmapply.
Make sure to have the compute_score_mat_parallel function defined in a file named compute_score_mat_parallel.py.
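A hedged sketch of replacing R's mcmapply with multiprocessing.Pool; score_one_candidate is a hypothetical stand-in for the real scoring worker:

```python
import numpy as np
from multiprocessing import Pool

def score_one_candidate(args):
    """Hypothetical worker standing in for the real scoring routine."""
    parents, node = args
    return float(len(parents) + node)   # dummy score, for illustration

if __name__ == "__main__":
    candidates = [((0, 1), 2), ((1,), 3), ((), 4)]
    with Pool(processes=2) as pool:     # R: mcmapply(..., mc.cores = ...)
        scores = pool.map(score_one_candidate, candidates)
    print(np.array(scores))
```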
- updateScoreMat(score_mat, X, score_name, i, j, score_nodes, adj, verbose, num_cores, max_num_parents, pars_score, interv_mat, interv_data)[source]#
Update the score matrix after a new edge is added to the current graph.
- Parameters:
score_mat (numpy.ndarray) – Current score matrix.
X (numpy.ndarray) – Data matrix.
score_name (str) – Name of the scoring method, e.g. 'SEMGAM'.
i (int) – Source node of the newly added edge.
j (int) – Target node of the newly added edge.
score_nodes (numpy.ndarray) – Current scores of the nodes.
adj (numpy.ndarray) – Adjacency matrix of the current graph.
verbose (bool) – Whether to print debug messages.
num_cores (int) – Number of cores to use for parallel processing.
max_num_parents (int) – Maximum number of parents allowed per node.
pars_score (dict) – Parameters for the scoring method.
interv_mat (numpy.ndarray) – Intervention matrix, when interventional data is used.
interv_data (bool) – Whether the data contains interventions.
- Returns:
The updated score matrix.
- Return type:
numpy.ndarray