alphapepttools.tl.bpca
- alphapepttools.tl.bpca(adata, layer=None, dim_space='obs', embeddings_name=None, n_comps=50, meta_data_mask_column_name=None, **bpca_kwargs)
Bayesian Principal Component Analysis
Bayesian implementation of PCA that explicitly supports missing values. Computes latent space coordinates, loadings and variance decomposition.
The dimensionality-reduced representation can be computed either for samples (dim_space="obs") or for features (dim_space="var"). Depending on the chosen dim_space, the BPCA results are stored in different AnnData containers.
For BPCA computed in sample space (dim_space='obs'), the low-dimensional coordinates are stored in adata.obsm, the feature loadings in adata.varm, and the variance decomposition in adata.uns.
For BPCA computed in feature space (dim_space='var'), the coordinates are stored in adata.varm, the loadings in adata.obsm, and the variance decomposition in adata.uns.
- Parameters:
  - adata (AnnData) – The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to samples and columns to features.
  - layer (Optional[str], default: None) – If provided, which element of .layers to use for PCA. If None, the .X attribute of adata is used.
  - dim_space (str, default: 'obs') – The dimension to project PCA on. Can be either "obs" (default) for sample projection or "var" for feature projection.
  - embeddings_name (Optional[str], default: None) – If provided, the key under which to store the PCA results in adata.obsm, adata.varm, and adata.uns (see Returns). If None, the default key "BPCA" is used.
  - n_comps (int, default: 50) – Number of principal components to compute. Defaults to min(50, n_obs, n_vars).
  - meta_data_mask_column_name (Optional[str], default: None) – If provided, the column name in adata.var to use as a mask selecting the features used in PCA. This is useful, for example, for running PCA on the core proteome only, thereby removing NaN values. Must be of boolean dtype.
  - **bpca_kwargs – Additional keyword arguments passed to bpca.BPCA. By default None.
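For illustration, the boolean mask expected by meta_data_mask_column_name can be derived from the data matrix itself. The sketch below uses plain NumPy; the column name "core_proteome" is a hypothetical example, not part of the API:

```python
import numpy as np

# Mock intensity matrix: 4 samples x 5 features, missing values as NaN
X = np.array([
    [1.0, 2.0, np.nan, 4.0, np.nan],
    [1.5, 2.5, 3.0,    4.5, np.nan],
    [0.9, np.nan, 2.8, 4.1, 5.0],
    [1.2, 2.2, 3.1,    4.2, np.nan],
])

# Boolean mask: True for features quantified in every sample ("core proteome")
core_mask = ~np.isnan(X).any(axis=0)
print(core_mask)  # True only for the fully observed features 0 and 3

# In practice, the mask would be stored as a boolean column, e.g.
# adata.var["core_proteome"] = core_mask  (hypothetical column name),
# and passed via meta_data_mask_column_name="core_proteome".
```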
- Return type:
  None
- Returns:
  Sets the following fields:
  For dim_space='obs' (sample projection):
  - .obsm['BPCA' | embeddings_name] : ndarray (shape (adata.n_obs, n_comps)) – BPCA representation of the data.
  - .varm['BPCA' | embeddings_name] : ndarray (shape (adata.n_vars, n_comps)) – The principal components containing the loadings.
  - .uns['BPCA' | embeddings_name]['variance_ratio'] : ndarray (shape (n_comps,)) – Ratio of explained variance.
  For dim_space='var' (feature projection):
  - .varm['BPCA' | embeddings_name] : ndarray (shape (adata.n_vars, n_comps)) – BPCA representation of the data.
  - .obsm['BPCA' | embeddings_name] : ndarray (shape (adata.n_obs, n_comps)) – The principal components containing the loadings.
  - .uns['BPCA' | embeddings_name]['variance_ratio'] : ndarray (shape (n_comps,)) – Ratio of explained variance.
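To make the stored shapes concrete, the following sketch mimics the dim_space='obs' layout using standard PCA via SVD on a complete toy matrix. This is plain NumPy with no alphapepttools involved; the variable names only mirror the AnnData slots:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, n_comps = 6, 4, 2
X = rng.normal(size=(n_obs, n_vars))

# Standard PCA via SVD of the mean-centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U[:, :n_comps] * S[:n_comps]             # analogue of .obsm["BPCA"]
loadings = Vt[:n_comps].T                         # analogue of .varm["BPCA"]
variance_ratio = (S**2 / (S**2).sum())[:n_comps]  # analogue of .uns["BPCA"]["variance_ratio"]

print(scores.shape, loadings.shape, variance_ratio.shape)  # (6, 2) (4, 2) (2,)
```

On complete data like this, BPCA would converge to the same decomposition (see Notes).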
Notes
For complete data, BPCA converges to the standard PCA solution, but typically requires substantially more computation due to iterative model fitting.
BPCA assumes additive, homoscedastic Gaussian noise in the observed data. After appropriate normalization and log-transformation, this assumption is often a reasonable approximation for quantitative proteomics data, but may still be violated for features with extreme missingness or low signal intensity. In practice, filtering features with very high missingness prior to BPCA can improve numerical stability and interpretability.
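The filtering heuristic described above can be sketched in plain NumPy. The 25% threshold is only an example (alphapepttools exposes this step via filter_data_completeness); note that the log transform leaves remaining NaNs in place, which BPCA then handles:

```python
import numpy as np

# Mock intensity matrix: 4 samples x 4 features, missing values as NaN
X = np.array([
    [100.0, 200.0, np.nan, np.nan],
    [110.0, np.nan, np.nan, 420.0],
    [ 90.0, 210.0, np.nan, 400.0],
    [105.0, 205.0, np.nan, np.nan],
])

# Fraction of missing values per feature: 0, 0.25, 1.0, 0.5
missing_frac = np.isnan(X).mean(axis=0)

# Drop features exceeding the allowed missingness (here: max 25%)
keep = missing_frac <= 0.25
X_filtered = X[:, keep]
print(X_filtered.shape)  # (4, 2)

# Log-transform; remaining NaNs simply propagate
X_log = np.log2(X_filtered)
```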
Example
As the BPCA method supports missing values, you can directly run the dimensionality reduction on the log-transformed dataset.
```python
import alphapepttools as at

path = at.data.get_data("bader2020_full_diann")
adata = at.io.read_psm_table(path, search_engine="diann")

# Optional: Remove features with little data support
at.pp.filter_data_completeness(
    adata=adata,
    max_missing=0.25,
    action="drop",
)

# BPCA expects a normal noise model, thus the data should be log-transformed
at.pp.nanlog(adata)

at.tl.bpca(adata)
```