alphapepttools.tl.bpca
- alphapepttools.tl.bpca(adata, layer=None, dim_space='obs', embeddings_name=None, n_comps=50, meta_data_mask_column_name=None, **bpca_kwargs)
Bayesian Principal Component Analysis
Bayesian implementation of PCA that explicitly supports missing values. Computes latent space coordinates, loadings and variance decomposition.
The dimensionality-reduced representation can be computed either for samples (dim_space="obs") or for features (dim_space="var"). Depending on the chosen dim_space, the BPCA results are stored in different AnnData containers.
For BPCA computed in sample space (dim_space='obs'), the low-dimensional coordinates are stored in adata.obsm, the feature loadings in adata.varm, and the variance decomposition in adata.uns.
For BPCA computed in feature space (dim_space='var'), the coordinates are stored in adata.varm, the loadings in adata.obsm, and the variance decomposition in adata.uns.
- Parameters:
  - adata (AnnData) – The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to samples and columns to features.
  - layer (Optional[str], default: None) – If provided, which element of .layers to use for PCA. If None, the .X attribute of adata is used.
  - dim_space (str, default: 'obs') – The dimension to project PCA on. Can be either "obs" (default) for sample projection or "var" for feature projection.
  - embeddings_name (Optional[str], default: None) – If provided, the key under which to store the PCA results in adata.obsm, adata.varm, and adata.uns (see Returns). If None, the default key "BPCA" is used.
  - n_comps (int, default: 50) – Number of principal components to compute. Defaults to min(50, n_obs, n_vars).
  - meta_data_mask_column_name (Optional[str], default: None) – If provided, the column name in adata.var to use as a mask selecting the features used in PCA. This is useful, for example, for running PCA on the core proteome only, thereby removing NaN values. Must be of boolean dtype.
  - **bpca_kwargs – Additional keyword arguments passed to bpca.BPCA. By default None.
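For illustration, the boolean mask expected by meta_data_mask_column_name can be derived from the data matrix itself. The sketch below uses plain NumPy; the column name "core_proteome" is a hypothetical example, not part of the API:

```python
import numpy as np

# Mock intensity matrix: 4 samples x 5 features, missing values as NaN
X = np.array([
    [1.0, 2.0, np.nan, 4.0, np.nan],
    [1.5, 2.5, 3.0,    4.5, np.nan],
    [0.9, np.nan, 2.8, 4.1, 5.0],
    [1.2, 2.2, 3.1,    4.2, np.nan],
])

# Boolean mask: True for features quantified in every sample ("core proteome")
core_mask = ~np.isnan(X).any(axis=0)
print(core_mask)  # True only for the fully observed features 0 and 3

# In practice, the mask would be stored as a boolean column, e.g.
# adata.var["core_proteome"] = core_mask  (hypothetical column name),
# and passed via meta_data_mask_column_name="core_proteome".
```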
- Return type:
  None
- Returns:
  Sets the following fields:
  For dim_space='obs' (sample projection):
  - .obsm['BPCA' | embeddings_name] : ndarray (shape (adata.n_obs, n_comps)) – BPCA representation of the data.
  - .varm['BPCA' | embeddings_name] : ndarray (shape (adata.n_vars, n_comps)) – The principal components containing the loadings.
  - .uns['BPCA' | embeddings_name]['variance_ratio'] : ndarray (shape (n_comps,)) – Ratio of explained variance.
  For dim_space='var' (feature projection):
  - .varm['BPCA' | embeddings_name] : ndarray (shape (adata.n_vars, n_comps)) – BPCA representation of the data.
  - .obsm['BPCA' | embeddings_name] : ndarray (shape (adata.n_obs, n_comps)) – The principal components containing the loadings.
  - .uns['BPCA' | embeddings_name]['variance_ratio'] : ndarray (shape (n_comps,)) – Ratio of explained variance.
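To make the stored shapes concrete, the following sketch mimics the dim_space='obs' layout using standard PCA via SVD on a complete toy matrix. This is plain NumPy with no alphapepttools involved; the variable names only mirror the AnnData slots:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, n_comps = 6, 4, 2
X = rng.normal(size=(n_obs, n_vars))

# Standard PCA via SVD of the mean-centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U[:, :n_comps] * S[:n_comps]             # analogue of .obsm["BPCA"]
loadings = Vt[:n_comps].T                         # analogue of .varm["BPCA"]
variance_ratio = (S**2 / (S**2).sum())[:n_comps]  # analogue of .uns["BPCA"]["variance_ratio"]

print(scores.shape, loadings.shape, variance_ratio.shape)  # (6, 2) (4, 2) (2,)
```

On complete data like this, BPCA would converge to the same decomposition (see Notes).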
Notes
For complete data, BPCA converges to the standard PCA solution, but typically requires substantially more computation due to iterative model fitting.
BPCA assumes additive, homoscedastic Gaussian noise in the observed data. After appropriate normalization and log-transformation, this assumption is often a reasonable approximation for quantitative proteomics data, but may still be violated for features with extreme missingness or low signal intensity. In practice, filtering features with very high missingness prior to BPCA can improve numerical stability and interpretability.
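The filtering heuristic described above can be sketched in plain NumPy. The 25% threshold is only an example (alphapepttools exposes this step via filter_data_completeness); note that the log transform leaves remaining NaNs in place, which BPCA then handles:

```python
import numpy as np

# Mock intensity matrix: 4 samples x 4 features, missing values as NaN
X = np.array([
    [100.0, 200.0, np.nan, np.nan],
    [110.0, np.nan, np.nan, 420.0],
    [ 90.0, 210.0, np.nan, 400.0],
    [105.0, 205.0, np.nan, np.nan],
])

# Fraction of missing values per feature: 0, 0.25, 1.0, 0.5
missing_frac = np.isnan(X).mean(axis=0)

# Drop features exceeding the allowed missingness (here: max 25%)
keep = missing_frac <= 0.25
X_filtered = X[:, keep]
print(X_filtered.shape)  # (4, 2)

# Log-transform; remaining NaNs simply propagate
X_log = np.log2(X_filtered)
```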
Example
As the BPCA method supports missing values, you can directly run the dimensionality reduction on the log-transformed dataset.
```python
import alphapepttools as at

path = at.data.get_data("bader2020_full_diann")
adata = at.io.read_psm_table(path, search_engine="diann")

# Optional: Remove features with little data support
at.pp.filter_data_completeness(
    adata=adata,
    max_missing=0.25,
    action="drop",
)

# BPCA expects a normal noise model, thus the data should be log-transformed
at.pp.nanlog(adata)

at.tl.bpca(adata)
```