Tutorial: Reading PSM tables into AnnData format

This notebook demonstrates how to use the AnnDataFactory class to convert proteomics PSM (peptide-spectrum match) tables into the AnnData format, which is widely used in single-cell analysis pipelines.

import pandas as pd
from alphabase.psm_reader.keys import PsmDfCols

import alphapepttools as at
from alphapepttools.io.anndata_factory import AnnDataFactory

1. Creating an AnnDataFactory from a DataFrame

First, let’s create a sample PSM DataFrame with the required columns and pass it to the AnnDataFactory constructor.

The resulting AnnData object has:

  • Rows (obs) representing samples (raw names)

  • Columns (var) representing proteins

  • X matrix containing intensity values

# Create sample PSM data
sample_psm_data = {
    PsmDfCols.RAW_NAME: ["sample1", "sample1", "sample2", "sample2"],
    PsmDfCols.PROTEINS: ["proteinA", "proteinB", "proteinA", "proteinB"],
    PsmDfCols.INTENSITY: [100, 200, 150, 250],
}

psm_df = pd.DataFrame(sample_psm_data)

# Create AnnDataFactory instance
factory = AnnDataFactory(
    psm_df,
    intensity_column=PsmDfCols.INTENSITY,
    sample_id_column=PsmDfCols.RAW_NAME,
    feature_id_column=PsmDfCols.PROTEINS,
)

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)
print("\nObservations (samples):", adata.obs_names)
print("\nVariables (proteins):", adata.var_names)
print("\nIntensity matrix:\n", adata.X)
AnnData shape: (2, 2)

Observations (samples): Index(['sample1', 'sample2'], dtype='object', name='raw_name')

Variables (proteins): Index(['proteinA', 'proteinB'], dtype='object', name='proteins')

Intensity matrix:
 [[100 200]
 [150 250]]
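
To make the reshaping concrete, the long-to-wide conversion can be pictured as a pandas pivot. The sketch below is an illustration under assumptions: it uses the plain strings "raw_name", "proteins", and "intensity" as column names (the first two match the index names printed above; "intensity" is assumed), and it is not necessarily how AnnDataFactory builds the matrix internally.

```python
import pandas as pd

# Long-format PSM table: one row per (sample, protein) measurement.
# Column names are plain strings chosen for illustration (assumption).
long_df = pd.DataFrame(
    {
        "raw_name": ["sample1", "sample1", "sample2", "sample2"],
        "proteins": ["proteinA", "proteinB", "proteinA", "proteinB"],
        "intensity": [100, 200, 150, 250],
    }
)

# Pivot to wide format: samples become rows, proteins become columns.
# DataFrame.pivot raises a ValueError if a (sample, protein) pair occurs
# more than once, so no aggregation rule has to be assumed here.
wide_df = long_df.pivot(index="raw_name", columns="proteins", values="intensity")

print(wide_df)
```

Real PSM tables can contain several rows per sample and protein, in which case an aggregation rule is needed before or during the pivot; which rule the factory applies is not covered in this tutorial.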

2. Loading Data from Files (AlphaDIA Example)

The AnnDataFactory can also read data directly from PSM files. Here’s how to use it with AlphaDIA output:

# Download the AlphaDIA v1.8.1 PSM report (.tsv)
alphadia_path = at.data.get_data("alphadia_1.8.1_psm_report")

# Load the AlphaDIA report
factory = AnnDataFactory.from_files(file_paths=alphadia_path, reader_type="alphadia")

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)

adata.to_df()
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/alphadia_1.8.1_report_head.tsv already exists (0.0773305892944336 MB)
AnnData shape: (1, 95)
proteins A6ZKI3;Q17RB0 O00410 O14744 O43143 O43592 O43823 O60341 O60664 O60716 O60841 ... Q9UHB9 Q9UHD8 Q9UMR2 Q9UPN9 Q9UQE7 Q9Y312 Q9Y3A4 Q9Y5K6 Q9Y5L0 Q9Y5X1
raw_name
20240408_OA1_Evo12_31min_TiHe_SA_H032_E32_F-40_B1 1.215493e+06 9.855868e+07 3.198485e+07 7.974717e+07 1.501879e+07 7.759782e+06 8.533623e+06 1.242373e+08 2.398779e+07 9.689934e+07 ... 3.666914e+07 2.845033e+07 6.295127e+06 2.432301e+06 3.500413e+07 2.069527e+06 2.532437e+06 3.714149e+06 3.025387e+06 9.251197e+06

1 rows × 95 columns

3. Customizing Column Names

If the column names in your input files differ from those preconfigured in AnnDataFactory, you can specify them explicitly:

# Download the DIA-NN v1.9.0 PSM report (.parquet)
diann_file = at.data.get_data("diann_1.9.0_psm_report")

factory = AnnDataFactory.from_files(
    file_paths=diann_file,
    reader_type="diann",
    raw_name_column="File.Name",
    protein_id_column="Protein.Group",
    # intensity_column="PG.MaxLFQ",
)

adata = factory.create_anndata()

print("AnnData shape:", adata.shape)

adata.to_df()
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/diann_1.9.0_report_head.parquet already exists (0.07235431671142578 MB)
AnnData shape: (6, 4)
proteins             P36578   Q96L58    Q9BQG0     Q9P258
raw_name
CPD_NE_000011_01  1020730.0  20720.0  540623.0        NaN
CPD_NE_000011_02   909317.0  17365.7  336248.0  1687790.0
CPD_NE_000011_03   777209.0  23797.3  424641.0  2119210.0
CPD_NE_000011_04   777209.0  23797.3  424641.0  2119210.0
CPD_NE_000011_05  1066430.0  30596.0  452681.0        NaN
CPD_NE_000011_06  1102520.0  29851.1  410367.0   681368.0
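
The NaN entries in the table presumably mark proteins without a reported intensity in a given run. As a small illustration of a typical follow-up check, the sketch below rebuilds two of the displayed columns by hand (the values are copied from the table above, not reloaded from the file) and counts missing values per protein with standard pandas calls.

```python
import numpy as np
import pandas as pd

# Two columns copied by hand from the table above (illustration only).
df = pd.DataFrame(
    {
        "P36578": [1020730.0, 909317.0, 777209.0, 777209.0, 1066430.0, 1102520.0],
        "Q9P258": [np.nan, 1687790.0, 2119210.0, 2119210.0, np.nan, 681368.0],
    },
    index=[f"CPD_NE_000011_{i:02d}" for i in range(1, 7)],
)

# Number of samples in which each protein has no intensity value.
missing_per_protein = df.isna().sum()

# Fraction of samples with a value, per protein.
completeness = df.notna().mean()

print(missing_per_protein)
print(completeness)
```

The same two calls should work on the DataFrame returned by adata.to_df(), since that is an ordinary pandas DataFrame.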