alphapepttools.tl.pca
- alphapepttools.tl.pca(adata, layer=None, dim_space='obs', embeddings_name=None, n_comps=None, meta_data_mask_column_name=None, *, copy=False, **pca_kwargs)
Principal component analysis.
Computes PCA coordinates, loadings, and the variance decomposition, and adds them to the passed adata. Depending on the dim_space parameter, the PCA result is a dimensionality-reduction projection of samples (obs) or of features (var).
For dim_space='obs' (sample projection), the PCA coordinates are stored in adata.obsm, the feature loadings in adata.varm, and the variance decomposition in adata.uns. For dim_space='var' (feature projection), the PCA coordinates are stored in adata.varm, the sample loadings in adata.obsm, and the variance decomposition in adata.uns.
Uses the Scanpy implementation, which in turn uses the scikit-learn implementation.
- Parameters:
adata (AnnData) – The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to samples and columns to features.
layer (str | None (default: None)) – If provided, which element of layers to use for PCA. If None, the .X attribute of adata is used.
dim_space (str (default: 'obs')) – The dimension to project PCA on. Can be either "obs" (default) for sample projection or "var" for feature projection.
embeddings_name (str | None (default: None)) – If provided, this name is used as the key for all three results stored in adata.obsm, adata.varm, and adata.uns (see Returns). If None, the default keys are used:
  - For dim_space='obs': X_pca_obs for the PC coordinates, PCs_obs for the feature loadings, variance_pca_obs for the variance.
  - For dim_space='var': X_pca_var for the PC coordinates, PCs_var for the sample loadings, variance_pca_var for the variance.
n_comps (int | None (default: None)) – Number of principal components to compute. Defaults to 50, or to the smallest dimension of the selected representation minus 1 if that is smaller.
meta_data_mask_column_name (str | None (default: None)) – If provided, the name of the column in adata.var to use as a mask for the features used in PCA. This is useful for running PCA with the core proteome as "mask_var", which excludes features with NaN values. The column must be of boolean dtype. If None, all features are used (the data should then not contain NaNs).
copy (bool (default: False)) – If False (default), modifies adata in place and returns None. If True, returns a copy of the adata object.
**pca_kwargs (dict | None) – Additional keyword arguments passed to scanpy.pp.pca(). By default None.
- Return type:
- Returns:
If copy=True, returns an updated adata object; otherwise modifies the AnnData object in place.
Sets the following fields.
For dim_space='obs' (sample projection):
.obsm['X_pca_obs' | embeddings_name]: csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps)). PCA representation of data.
.varm['PCs_obs' | embeddings_name]: ndarray (shape (adata.n_vars, n_comps)). The principal components containing the loadings.
.uns['variance_pca_obs' | embeddings_name]['variance_ratio']: ndarray (shape (n_comps,)). Ratio of explained variance.
.uns['variance_pca_obs' | embeddings_name]['variance']: ndarray (shape (n_comps,)). Explained variance, equivalent to the eigenvalues of the covariance matrix.
For dim_space='var' (feature projection):
.varm['X_pca_var' | embeddings_name]: csr_matrix | csc_matrix | ndarray (shape (adata.n_vars, n_comps)). PCA representation of data.
.obsm['PCs_var' | embeddings_name]: ndarray (shape (adata.n_obs, n_comps)). The principal components containing the loadings.
.uns['variance_pca_var' | embeddings_name]['variance_ratio']: ndarray (shape (n_comps,)). Ratio of explained variance.
.uns['variance_pca_var' | embeddings_name]['variance']: ndarray (shape (n_comps,)). Explained variance, equivalent to the eigenvalues of the covariance matrix.
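How coordinates, loadings, variance, and variance ratio relate to one another can be illustrated with plain NumPy. The following is a minimal sketch for the sample projection case, assuming a small complete matrix and a thin SVD of the centered data. It does not call alphapepttools or Scanpy and is not their implementation; the variable names are illustrative only.

```python
import numpy as np

# Toy matrix: 5 samples (rows) x 4 features (columns), no missing values.
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2],
        [11.2, 13.1, 12.5, 10.1],
        [9.8, 11.9, 10.2, 8.9],
        [12.1, 14.2, 13.3, 11.3],
        [10.9, 12.7, 11.5, 9.8],
    ]
)
n_comps = 2

# Center each feature, then decompose with a thin SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coordinates: one row per sample, shape (n_obs, n_comps).
coords = U[:, :n_comps] * S[:n_comps]
# Loadings: one row per feature, shape (n_vars, n_comps).
loadings = Vt[:n_comps].T

# Explained variance per component and its share of the total variance.
all_variance = S**2 / (X.shape[0] - 1)
variance = all_variance[:n_comps]
variance_ratio = variance / all_variance.sum()

# The explained variance matches the largest eigenvalues of the covariance matrix.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
print(np.allclose(variance, eigvals[:n_comps]))
```

In this sketch, coords plays the role of the array stored in .obsm, loadings the one stored in .varm, and variance and variance_ratio the two entries stored in .uns.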
Examples
Run PCA using a metadata mask to select core proteins:
    import anndata as ad
    import pandas as pd
    import numpy as np
    import alphapepttools as at

    # Create a 5x5 dataset where 4 proteins are core (no missing values)
    X = np.array(
        [
            [10.5, 12.3, 11.8, 9.2, np.nan],  # Sample 1
            [11.2, 13.1, 12.5, 10.1, 7.5],  # Sample 2
            [9.8, 11.9, 10.2, 8.9, np.nan],  # Sample 3
            [12.1, 14.2, 13.3, 11.3, 8.2],  # Sample 4
            [10.9, 12.7, 11.5, 9.8, np.nan],  # Sample 5
        ]
    )
    adata = ad.AnnData(
        X=X,
        obs=pd.DataFrame({"sample": ["S1", "S2", "S3", "S4", "S5"]}),
        var=pd.DataFrame(
            {
                "protein": ["P1", "P2", "P3", "P4", "P5"],
                "is_core": [True, True, True, True, False],  # First 4 are core proteins
            }
        ),
    )

    # Run PCA on feature space using only core proteins
    at.tl.pca(adata, meta_data_mask_column_name="is_core", n_comps=2, dim_space="var")

    # The PCA results are now stored in the AnnData object:
    # adata.varm['X_pca_var'] - PCA coordinates for each protein (5 x 2)
    # adata.obsm['PCs_var'] - Sample loadings (5 x 2)
    # adata.uns['variance_pca_var'] - Variance explained by each PC

    # To get the PCA embedding of proteins in the reduced space:
    protein_pca_coords = adata.varm["X_pca_var"]
    # First 4 proteins have coordinates, P5 has NaN (not used in PCA)

    # To project samples into the PC space:
    sample_loadings = adata.obsm["PCs_var"]

    # To see variance explained by each component:
    variance_ratio = adata.uns["variance_pca_var"]["variance_ratio"]
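The shapes reported in the example can be reproduced with NumPy alone. The sketch below assumes that masked-out features are simply excluded from the decomposition and then padded with NaN in the coordinate array, which is what the comment about P5 in the example suggests. It is an illustration under that assumption, not the library's actual code path, and all variable names are hypothetical.

```python
import numpy as np

# Same 5 samples x 5 proteins matrix as in the example; P5 has missing values.
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2, np.nan],
        [11.2, 13.1, 12.5, 10.1, 7.5],
        [9.8, 11.9, 10.2, 8.9, np.nan],
        [12.1, 14.2, 13.3, 11.3, 8.2],
        [10.9, 12.7, 11.5, 9.8, np.nan],
    ]
)
is_core = np.array([True, True, True, True, False])
n_comps = 2

# Keep only the core proteins and put features on the rows (feature projection).
features_by_samples = X[:, is_core].T  # shape (4, 5)
if np.isnan(features_by_samples).any():
    raise ValueError("masked data still contains NaN values")

# Center across features and decompose with a thin SVD.
centered = features_by_samples - features_by_samples.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

core_coords = U[:, :n_comps] * S[:n_comps]  # one row per core protein, shape (4, 2)
sample_loadings = Vt[:n_comps].T  # one row per sample, shape (5, 2)

# Assumption: features outside the mask get NaN coordinates.
protein_coords = np.full((X.shape[1], n_comps), np.nan)
protein_coords[is_core] = core_coords

print(protein_coords.shape, sample_loadings.shape)
```

Padding to the full number of features keeps the coordinate array aligned with adata.var, which is required for anything stored in .varm.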