alphapepttools.tl.extract_pca_anndata

alphapepttools.tl.extract_pca_anndata(adata, dim_space='obs', embeddings_name=None, expression_columns=None)

Extract PCA data required for PCA plotting from an AnnData object.

Parameters:
  • adata (AnnData) – AnnData object containing PCA results.

  • dim_space (str (default: 'obs')) – Either “obs” or “var”, indicating the PCA projection space.

  • embeddings_name (str | None (default: None)) – Custom embeddings name or None to use the default naming scheme.

  • expression_columns (list[str] | None (default: None)) – List of var_names to add as numerical column(s) to the returned AnnData’s .obs, for coloring PCA plots by expression. Only applicable when dim_space="obs"; a feature-space projection (dim_space="var") has no equivalent.

Return type:

AnnData

Returns:

An AnnData object (ad.AnnData) containing the PCA results:

  • .X stores the PCA embeddings, with shape (observations x components) if dim_space="obs" or (variables x components) if dim_space="var"

  • .var contains the PCA variance information

  • .obs contains the corresponding metadata and, if specified, additional expression values for coloring plots

  • PCA dimensions in .var_names are named pc_1, pc_2, etc.
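
The shapes and naming listed above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper pca_embeddings and the toy values are hypothetical, and PCA is done here with a plain SVD on centered data.

```python
import numpy as np

# Toy matrix: 5 samples x 4 complete features (hypothetical values).
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2],
        [11.2, 13.1, 12.5, 10.1],
        [9.8, 11.9, 10.2, 8.9],
        [12.1, 14.2, 13.3, 11.3],
        [10.9, 12.7, 11.5, 9.8],
    ]
)
n_comps = 2


def pca_embeddings(data, n_comps):
    """Hypothetical helper: embed the rows of `data` via SVD-based PCA."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    embeddings = u[:, :n_comps] * s[:n_comps]
    variance_ratio = (s**2 / np.sum(s**2))[:n_comps]
    return embeddings, variance_ratio


obs_emb, obs_ratio = pca_embeddings(X, n_comps)  # rows are samples
var_emb, var_ratio = pca_embeddings(X.T, n_comps)  # rows are features
pc_names = [f"pc_{i + 1}" for i in range(n_comps)]

print(obs_emb.shape)  # (5, 2)
print(var_emb.shape)  # (4, 2)
print(pc_names)  # ['pc_1', 'pc_2']
```

In the returned AnnData, the embedding matrix would correspond to .X, the component names to .var_names, and the per-component variance information to .var.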

Examples

Extract PCA projections after running PCA on sample space:

import anndata as ad
import pandas as pd
import numpy as np
import alphapepttools as at

# Create a 5x5 dataset where 4 proteins are core (no missing values)
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2, np.nan],  # Sample 1
        [11.2, 13.1, 12.5, 10.1, 7.5],  # Sample 2
        [9.8, 11.9, 10.2, 8.9, np.nan],  # Sample 3
        [12.1, 14.2, 13.3, 11.3, 8.2],  # Sample 4
        [10.9, 12.7, 11.5, 9.8, np.nan],  # Sample 5
    ]
)

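adata = ad.AnnData(
    X=X,
    obs=pd.DataFrame({"sample": ["S1", "S2", "S3", "S4", "S5"], "condition": ["A", "B", "A", "B", "A"]}),
    # Index var by protein name so that var_names are "P1".."P5" (needed for expression_columns below)
    var=pd.DataFrame(
        {"protein": ["P1", "P2", "P3", "P4", "P5"], "is_core": [True, True, True, True, False]},
        index=["P1", "P2", "P3", "P4", "P5"],
    ),
)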

# First run PCA on observation space (samples)
at.tl.pca(adata, meta_data_mask_column_name="is_core", n_comps=2, dim_space="obs")

# Extract PCA data for plotting/analysis
pca_adata = at.tl.extract_pca_anndata(adata, dim_space="obs")
print(pca_adata.to_df())  # DataFrame with pc_1 and pc_2 coordinates for each sample

# The PCA projections are now in pca_adata.X (5 samples x 2 PCs)
print(pca_adata.X.shape)  # (5, 2)
print(pca_adata.var_names.tolist())  # ['pc_1', 'pc_2']

# Access PC1 and PC2 coordinates for all samples
pc1_coords = pca_adata[:, "pc_1"].X.flatten()
pc2_coords = pca_adata[:, "pc_2"].X.flatten()

# The original metadata is preserved in pca_adata.obs
print(pca_adata.obs["condition"].tolist())  # ['A', 'B', 'A', 'B', 'A']

# Variance explained is in pca_adata.var
print(pca_adata.var["variance_ratio"])  # Proportion of variance per PC

Extract PCA projections from feature space:

# Run PCA on feature space (proteins)
at.tl.pca(adata, meta_data_mask_column_name="is_core", n_comps=2, dim_space="var")

# Extract PCA data - now proteins are the "observations"
pca_adata = at.tl.extract_pca_anndata(adata, dim_space="var")

# The PCA projections are now in pca_adata.X (5 proteins x 2 PCs)
print(pca_adata.X.shape)  # (5, 2)

# Protein metadata is in pca_adata.obs (proteins are "observations" when dim_space="var")
print(pca_adata.obs["protein"].tolist())  # ['P1', 'P2', 'P3', 'P4', 'P5']

# Note: P5 will have NaN coordinates since it wasn't included in PCA (is_core=False)
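
The note about P5 can be pictured with a small NumPy sketch. It assumes that features excluded by the mask are kept as rows and padded with NaN, as the note and the (5, 2) shape above suggest. The library's internals may differ, and the SVD-based PCA here is only a stand-in.

```python
import numpy as np

# Same toy data as above: the fifth protein has missing values.
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2, np.nan],
        [11.2, 13.1, 12.5, 10.1, 7.5],
        [9.8, 11.9, 10.2, 8.9, np.nan],
        [12.1, 14.2, 13.3, 11.3, 8.2],
        [10.9, 12.7, 11.5, 9.8, np.nan],
    ]
)
is_core = np.array([True, True, True, True, False])
n_comps = 2

# Feature-space PCA: the core proteins become the rows being embedded.
core = X[:, is_core].T  # 4 core proteins x 5 samples
centered = core - core.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)

# Assumption: one row per protein, NaN for proteins left out of the PCA.
coords = np.full((X.shape[1], n_comps), np.nan)
coords[is_core] = u[:, :n_comps] * s[:n_comps]

print(coords.shape)  # (5, 2)
print(np.isnan(coords[~is_core]).all())  # True
```

When plotting such coordinates, rows containing NaN typically need to be dropped or masked first.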

Include expression values for plotting:

# When extracting sample PCA, include protein expression for coloring
pca_adata = at.tl.extract_pca_anndata(
    adata,
    dim_space="obs",
    expression_columns=["P1", "P2"],  # Include P1 and P2 expression values
)

# Now pca_adata.obs contains the original metadata plus expression values
print(pca_adata.obs.columns)  # Contains 'sample', 'condition', 'P1', 'P2'

# This allows coloring PCA plots by protein expression
p1_expression = pca_adata.obs["P1"]  # Expression of protein P1 across samples
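
Conceptually, adding expression columns amounts to joining selected columns of the data matrix onto the sample metadata. The pandas sketch below illustrates that idea under the assumption that var_names are the protein names. It is an illustration only, not the library's code, and obs_with_expr is a hypothetical name.

```python
import numpy as np
import pandas as pd

var_names = ["P1", "P2", "P3", "P4", "P5"]
X = np.array(
    [
        [10.5, 12.3, 11.8, 9.2, np.nan],
        [11.2, 13.1, 12.5, 10.1, 7.5],
        [9.8, 11.9, 10.2, 8.9, np.nan],
        [12.1, 14.2, 13.3, 11.3, 8.2],
        [10.9, 12.7, 11.5, 9.8, np.nan],
    ]
)
obs = pd.DataFrame(
    {"sample": ["S1", "S2", "S3", "S4", "S5"], "condition": ["A", "B", "A", "B", "A"]}
)
expression_columns = ["P1", "P2"]

# Select the requested proteins by name and align them with the samples.
expr = pd.DataFrame(X, index=obs.index, columns=var_names)[expression_columns]
obs_with_expr = pd.concat([obs, expr], axis=1)

print(obs_with_expr.columns.tolist())  # ['sample', 'condition', 'P1', 'P2']
```

The resulting numerical columns can then be mapped to a color scale, one value per sample.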