alphapepttools.tl.extract_pca_anndata
- alphapepttools.tl.extract_pca_anndata(adata, dim_space='obs', embeddings_name=None, expression_columns=None)
Extract PCA data required for PCA plotting from an AnnData object.
- Parameters:
  - adata (AnnData) – AnnData object containing PCA results.
  - dim_space (str (default: 'obs')) – Either "obs" or "var", indicating the PCA projection space.
  - embeddings_name (str | None (default: None)) – Custom embeddings name, or None to use the default naming scheme.
  - expression_columns (list[str] | None (default: None)) – List of var_names to include as additional numerical column(s) in the returned AnnData's .obs for coloring PCA plots by expression. This is only applicable when dim_space="obs", as there is no equivalent in observations when projecting in feature space (dim_space="var").
- Return type:
  ad.AnnData
- Returns:
  An AnnData object containing the PCA results.
  - .X stores the PCA embeddings:
    - shape (observations x components) if dim_space="obs"
    - shape (variables x components) if dim_space="var"
  - .var contains the PCA variance information.
  - .obs contains the corresponding metadata and, if specified, additional expression values for coloring plots.
  - PCA dimensions in .var_names are named pc_1, pc_2, etc.
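To make the returned layout concrete, the following is a minimal NumPy sketch of the shape convention described above. It is not the alphapepttools implementation: the helper `pca_scores` is hypothetical, the data are random, and treating feature-space PCA as PCA of the transposed matrix is an assumption made purely for illustration.

```python
import numpy as np


def pca_scores(matrix: np.ndarray, n_comps: int) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: project the rows of `matrix` onto its top principal axes."""
    centered = matrix - matrix.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_comps].T  # (rows x components)
    variance = s**2 / (matrix.shape[0] - 1)
    variance_ratio = variance[:n_comps] / variance.sum()
    return scores, variance_ratio


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 observations x 4 variables (made-up data)

obs_scores, obs_ratio = pca_scores(X, n_comps=2)    # rows are observations
var_scores, var_ratio = pca_scores(X.T, n_comps=2)  # rows are variables (assumed convention)

pc_names = [f"pc_{i + 1}" for i in range(2)]  # naming scheme from the text above

print(obs_scores.shape)  # (6, 2)
print(var_scores.shape)  # (4, 2)
print(pc_names)          # ['pc_1', 'pc_2']
```

In both cases the rows of the result are the items being projected and the columns are the components, which matches the `.X` layout described for the two `dim_space` settings.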
Examples
Extract PCA projections after running PCA on sample space:
    import anndata as ad
    import pandas as pd
    import numpy as np
    import alphapepttools as at

    # Create a 5x5 dataset where 4 proteins are core (no missing values)
    X = np.array(
        [
            [10.5, 12.3, 11.8, 9.2, np.nan],  # Sample 1
            [11.2, 13.1, 12.5, 10.1, 7.5],  # Sample 2
            [9.8, 11.9, 10.2, 8.9, np.nan],  # Sample 3
            [12.1, 14.2, 13.3, 11.3, 8.2],  # Sample 4
            [10.9, 12.7, 11.5, 9.8, np.nan],  # Sample 5
        ]
    )
    protein_names = ["P1", "P2", "P3", "P4", "P5"]
    adata = ad.AnnData(
        X=X,
        obs=pd.DataFrame({"sample": ["S1", "S2", "S3", "S4", "S5"], "condition": ["A", "B", "A", "B", "A"]}),
        # Index by protein name so that var_names are "P1".."P5" rather than "0".."4"
        var=pd.DataFrame(
            {"protein": protein_names, "is_core": [True, True, True, True, False]},
            index=protein_names,
        ),
    )

    # First run PCA on observation space (samples)
    at.tl.pca(adata, meta_data_mask_column_name="is_core", n_comps=2, dim_space="obs")

    # Extract PCA data for plotting/analysis
    pca_adata = at.tl.extract_pca_anndata(adata, dim_space="obs")
    print(pca_adata.to_df())  # DataFrame with PC1 and PC2 coordinates for each sample

    # The PCA projections are now in pca_adata.X (5 samples x 2 PCs)
    print(pca_adata.X.shape)  # (5, 2)
    print(pca_adata.var_names.tolist())  # ['pc_1', 'pc_2']

    # Access PC1 and PC2 coordinates for all samples
    pc1_coords = pca_adata[:, "pc_1"].X.flatten()
    pc2_coords = pca_adata[:, "pc_2"].X.flatten()

    # The original metadata is preserved in pca_adata.obs
    print(pca_adata.obs["condition"].tolist())  # ['A', 'B', 'A', 'B', 'A']

    # Variance explained is in pca_adata.var
    print(pca_adata.var["variance_ratio"])  # Proportion of variance per PC
Extract PCA projections from feature space:
    # Run PCA on feature space (proteins)
    at.tl.pca(adata, meta_data_mask_column_name="is_core", n_comps=2, dim_space="var")

    # Extract PCA data - now proteins are the "observations"
    pca_adata = at.tl.extract_pca_anndata(adata, dim_space="var")

    # The PCA projections are now in pca_adata.X (5 proteins x 2 PCs)
    print(pca_adata.X.shape)  # (5, 2)

    # Protein metadata is in pca_adata.obs (proteins are "observations" when dim_space="var")
    print(pca_adata.obs["protein"].tolist())  # ['P1', 'P2', 'P3', 'P4', 'P5']

    # Note: P5 will have NaN coordinates since it wasn't included in PCA (is_core=False)
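Because features excluded from the PCA are reported with NaN coordinates, it can help to drop those rows before plotting. A minimal sketch with plain NumPy follows; the `coords` array and `names` array are invented stand-ins for `pca_adata.X` and the protein names, not output from alphapepttools.

```python
import numpy as np

# Stand-in for pca_adata.X after a feature-space PCA (values are invented)
coords = np.array(
    [
        [1.2, -0.3],
        [-0.8, 0.5],
        [0.4, 0.1],
        [-0.8, -0.3],
        [np.nan, np.nan],  # feature that was masked out of the PCA
    ]
)
names = np.array(["P1", "P2", "P3", "P4", "P5"])

# Keep only rows where every PC coordinate is finite
has_coords = np.isfinite(coords).all(axis=1)
plot_coords = coords[has_coords]
plot_names = names[has_coords]

print(plot_names.tolist())  # ['P1', 'P2', 'P3', 'P4']
print(plot_coords.shape)    # (4, 2)
```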
Include expression values for plotting:
    # When extracting sample PCA, include protein expression for coloring
    pca_adata = at.tl.extract_pca_anndata(
        adata,
        dim_space="obs",
        expression_columns=["P1", "P2"],  # Include P1 and P2 expression values
    )

    # Now pca_adata.obs contains the original metadata plus expression values
    print(pca_adata.obs.columns)  # Contains 'sample', 'condition', 'P1', 'P2'

    # This allows coloring PCA plots by protein expression
    p1_expression = pca_adata.obs["P1"]  # Expression of protein P1 across samples
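One way to use these columns downstream is to combine the PC coordinates and `.obs` into a single table for a plotting library. The sketch below uses invented stand-ins for `pca_adata.X` and `pca_adata.obs` so that it runs without alphapepttools; the names `coords`, `obs`, and `plot_df` are illustrative only.

```python
import numpy as np
import pandas as pd

sample_ids = ["S1", "S2", "S3"]

# Stand-in for pca_adata.X (invented values) with the documented pc_* column names
coords = pd.DataFrame(
    np.array([[0.5, -0.1], [-1.0, 0.2], [0.5, -0.1]]),
    index=sample_ids,
    columns=["pc_1", "pc_2"],
)

# Stand-in for pca_adata.obs: metadata plus one expression column
obs = pd.DataFrame(
    {"condition": ["A", "B", "A"], "P1": [10.5, 11.2, 9.8]},
    index=sample_ids,
)

# Align on the sample index so each point carries its metadata and expression
plot_df = coords.join(obs)

print(plot_df.columns.tolist())  # ['pc_1', 'pc_2', 'condition', 'P1']
```

A table like `plot_df` can then be handed to a plotting library, with `pc_1` and `pc_2` as the axes and `P1` as the color variable.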