alphapepttools.tl.pca


alphapepttools.tl.pca(adata, layer=None, dim_space='obs', embeddings_name=None, n_comps=None, meta_data_mask_column_name=None, *, copy=False, **pca_kwargs)

Principal component analysis.

Computes PCA coordinates, loadings, and the variance decomposition, and stores them in the passed adata object. Depending on the dim_space parameter, the result is a dimensionality-reduced projection of samples (obs) or of features (var). For PCA on the sample space (dim_space='obs'), the PCA coordinates are stored in adata.obsm, the feature loadings in adata.varm, and the variance decomposition in adata.uns. For PCA on the feature space (dim_space='var'), the PCA coordinates are stored in adata.varm, the sample loadings in adata.obsm, and the variance decomposition in adata.uns. Uses the Scanpy implementation, which in turn uses the scikit-learn implementation.
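The relationship between the two projections can be illustrated with a plain NumPy sketch. It is not the library's code: the helper pca_svd is hypothetical, and it assumes that a feature projection amounts to running PCA on the transposed matrix, which the text above does not confirm.

```python
import numpy as np


def pca_svd(matrix, n_comps):
    """Hypothetical helper: PCA via SVD; rows are the items being projected."""
    centered = matrix - matrix.mean(axis=0)  # center each column
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_comps] * s[:n_comps]  # one row per projected item
    loadings = vt[:n_comps].T  # one row per input column
    return coords, loadings


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 samples x 4 features

# Sample projection, analogous to dim_space='obs'
obs_coords, obs_loadings = pca_svd(X, n_comps=2)

# Feature projection, analogous to dim_space='var' (assumed: PCA on X.T)
var_coords, var_loadings = pca_svd(X.T, n_comps=2)

print(obs_coords.shape, obs_loadings.shape)  # (6, 2) (4, 2)
print(var_coords.shape, var_loadings.shape)  # (4, 2) (6, 2)
```

In both cases the coordinates have one row per projected item and the loadings one row per item of the other axis, which is why the obsm and varm slots swap roles between the two modes.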

Parameters:
  • adata (AnnData) – The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to samples and columns to features.

  • layer (str | None (default: None)) – If provided, which element of layers to use for PCA. If None, the .X attribute of adata is used.

  • dim_space (str (default: 'obs')) – The dimension to project PCA on. Can be either “obs” (default) for sample projection or “var” for feature projection.

  • embeddings_name (str | None (default: None)) – Key under which to store the PCA results in adata.obsm, adata.varm, and adata.uns (see Returns). If provided, embeddings_name is used as the key in all three. If None, the default keys are used:
      - For dim_space='obs': X_pca_obs for the PC coordinates, PCs_obs for the feature loadings, variance_pca_obs for the variance.
      - For dim_space='var': X_pca_var for the PC coordinates, PCs_var for the sample loadings, variance_pca_var for the variance.

  • n_comps (int | None (default: None)) – Number of principal components to compute. Defaults to 50, or to one less than the smallest dimension of the selected representation if that is smaller.

  • meta_data_mask_column_name (str | None (default: None)) – If provided, the name of the column in adata.var to use as a mask selecting the features used in PCA. Must be of boolean dtype. This is useful for running PCA with the core proteome as the mask (“mask_var”), which excludes features containing NaN values. If None, all features are used, in which case the data must not contain NaNs.

  • copy (bool (default: False)) – If False (default), modifies adata in place and returns None. If True, returns an updated copy of the adata object.

  • **pca_kwargs (dict | None) – Additional keyword arguments passed to scanpy.pp.pca(). By default None.
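A boolean mask of the kind expected by meta_data_mask_column_name can be derived from the data itself. The sketch below is one possible approach, assuming "core" simply means "no missing value in any sample"; the column name is_core is illustrative only.

```python
import numpy as np

# 3 samples x 3 proteins; the last protein has missing values
X = np.array(
    [
        [10.5, 12.3, np.nan],
        [11.2, 13.1, 7.5],
        [9.8, 11.9, np.nan],
    ]
)

# True for every protein (column) that is quantified in all samples
is_core = ~np.isnan(X).any(axis=0)
print(is_core)  # [ True  True False]

# With an AnnData object, the mask would be stored as a column of adata.var:
# adata.var["is_core"] = is_core
```

The resulting array already has boolean dtype, as the parameter requires.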

Return type:

None | AnnData

Returns:

If copy=True, returns an updated copy of the adata object; otherwise modifies the adata object in place and returns None.

Sets the following fields.

For dim_space='obs' (sample projection):

.obsm['X_pca_obs' | embeddings_name] : csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps))

PCA representation of data.

.varm['PCs_obs' | embeddings_name] : ndarray (shape (adata.n_vars, n_comps))

The principal components containing the loadings.

.uns['variance_pca_obs' | embeddings_name]['variance_ratio'] : ndarray (shape (n_comps,))

Ratio of explained variance.

.uns['variance_pca_obs' | embeddings_name]['variance'] : ndarray (shape (n_comps,))

Explained variance, equivalent to the eigenvalues of the covariance matrix.

For dim_space='var' (feature projection):

.varm['X_pca_var' | embeddings_name] : csr_matrix | csc_matrix | ndarray (shape (adata.n_vars, n_comps))

PCA representation of data.

.obsm['PCs_var' | embeddings_name] : ndarray (shape (adata.n_obs, n_comps))

The principal components containing the loadings.

.uns['variance_pca_var' | embeddings_name]['variance_ratio'] : ndarray (shape (n_comps,))

Ratio of explained variance.

.uns['variance_pca_var' | embeddings_name]['variance'] : ndarray (shape (n_comps,))

Explained variance, equivalent to the eigenvalues of the covariance matrix.
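The two variance entries can be related with a short NumPy sketch. It illustrates the definitions only, assuming the sample covariance (n - 1 in the denominator), and is not taken from the library's code.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))  # 8 samples x 3 features

# Eigenvalues of the feature covariance matrix, sorted in descending order
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

n_comps = 2
variance = eigvals[:n_comps]  # explained variance per component
variance_ratio = variance / eigvals.sum()  # fraction of the total variance

print(variance.shape, variance_ratio.shape)  # (2,) (2,)
```

Because only n_comps components are kept, the ratios generally sum to less than 1; the remainder is the variance carried by the discarded components.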

Examples

Run PCA using a metadata mask to select core proteins:

import anndata as ad
import pandas as pd
import numpy as np
import alphapepttools as at

# Create a 5x5 dataset where 4 proteins are core (no missing values)
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2, np.nan],  # Sample 1
        [11.2, 13.1, 12.5, 10.1, 7.5],  # Sample 2
        [9.8, 11.9, 10.2, 8.9, np.nan],  # Sample 3
        [12.1, 14.2, 13.3, 11.3, 8.2],  # Sample 4
        [10.9, 12.7, 11.5, 9.8, np.nan],  # Sample 5
    ]
)

adata = ad.AnnData(
    X=X,
    obs=pd.DataFrame({"sample": ["S1", "S2", "S3", "S4", "S5"]}),
    var=pd.DataFrame(
        {
            "protein": ["P1", "P2", "P3", "P4", "P5"],
            "is_core": [True, True, True, True, False],  # First 4 are core proteins
        }
    ),
)

# Run PCA on feature space using only core proteins
at.tl.pca(adata, meta_data_mask_column_name="is_core", n_comps=2, dim_space="var")

# The PCA results are now stored in the AnnData object:
# adata.varm['X_pca_var'] - PCA coordinates for each protein (5 x 2)
# adata.obsm['PCs_var'] - Sample loadings (5 x 2)
# adata.uns['variance_pca_var'] - Variance explained by each PC

# To get the PCA embedding of proteins in the reduced space:
protein_pca_coords = adata.varm["X_pca_var"]
# First 4 proteins have coordinates, P5 has NaN (not used in PCA)

# To get the sample loadings (contribution of each sample to each PC):
sample_loadings = adata.obsm["PCs_var"]

# To see variance explained by each component:
variance_ratio = adata.uns["variance_pca_var"]["variance_ratio"]