Tutorial: Reading PSM tables into AnnData format

This notebook demonstrates how to use the AnnDataFactory class to convert proteomics PSM (peptide-spectrum match) tables into the AnnData format, which is widely used in single-cell analysis pipelines.

import tempfile
import pandas as pd
from alphabase.psm_reader.keys import PsmDfCols

import alphapepttools as at
from alphapepttools.io.anndata_factory import AnnDataFactory

1. Creating an AnnDataFactory from a DataFrame

First, let’s create a sample PSM DataFrame with the required columns and pass it to the AnnDataFactory constructor.

The resulting AnnData object has:

Rows (obs) representing samples (raw names)
Columns (var) representing proteins
X matrix containing intensity values

# Create sample PSM data
sample_psm_data = {
    PsmDfCols.RAW_NAME: ["sample1", "sample1", "sample2", "sample2"],
    PsmDfCols.PROTEINS: ["proteinA", "proteinB", "proteinA", "proteinB"],
    PsmDfCols.INTENSITY: [100, 200, 150, 250],
}

psm_df = pd.DataFrame(sample_psm_data)

# Create AnnDataFactory instance
factory = AnnDataFactory(
    psm_df,
    intensity_column=PsmDfCols.INTENSITY,
    sample_id_column=PsmDfCols.RAW_NAME,
    feature_id_column=PsmDfCols.PROTEINS,
)

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)
print("\nObservations (samples):", adata.obs_names)
print("\nVariables (proteins):", adata.var_names)
print("\nIntensity matrix:\n", adata.X)

AnnData shape: (2, 2)

Observations (samples): Index(['sample1', 'sample2'], dtype='object', name='raw_name')

Variables (proteins): Index(['proteinA', 'proteinB'], dtype='object', name='proteins')

Intensity matrix:
 [[100 200]
 [150 250]]
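
Conceptually, the conversion above reshapes a long table (one row per sample and protein) into a wide samples × proteins matrix. The following is a minimal pandas-only sketch of that reshaping, not AnnDataFactory's actual implementation; the plain string column names stand in for the PsmDfCols constants and are assumptions.

```python
import pandas as pd

# Long-format PSM-like table: one row per (sample, protein) pair.
# The column names are assumed stand-ins for the PsmDfCols constants.
long_df = pd.DataFrame(
    {
        "raw_name": ["sample1", "sample1", "sample2", "sample2"],
        "proteins": ["proteinA", "proteinB", "proteinA", "proteinB"],
        "intensity": [100, 200, 150, 250],
    }
)

# Reshape to a wide matrix: samples as rows, proteins as columns.
wide_df = long_df.pivot(index="raw_name", columns="proteins", values="intensity")

print(wide_df.shape)
print(wide_df.to_numpy())
```

Note that DataFrame.pivot raises a ValueError when a (sample, protein) pair occurs more than once, so in this sketch a table with several rows per pair would need an explicit aggregation step first.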

2. Loading Data from Files (AlphaDIA Example)

The AnnDataFactory can also read data directly from PSM files. Here’s how to use it with AlphaDIA output:

# Download alphadia v1.8.1 .tsv PSM report
output_dir = "./datasets/data_for_A01_load_psm_tables"
alphadia_path = at.data.get_data(
    "alphadia_1.8.1_psm_report", output_dir=output_dir if output_dir else tempfile.mkdtemp()
)

# Load the AlphaDIA report
factory = AnnDataFactory.from_files(file_paths=alphadia_path, reader_type="alphadia")

# Convert to AnnData
adata = factory.create_anndata()

print("AnnData shape:", adata.shape)

adata.to_df()

./datasets/data_for_A01_load_psm_tables/alphadia_1.8.1_report_head.tsv already exists (0.0773305892944336 MB)
AnnData shape: (1, 95)

proteins	A6ZKI3;Q17RB0	O00410	O14744	O43143	O43592	O43823	O60341	O60664	O60716	O60841	...	Q9UHB9	Q9UHD8	Q9UMR2	Q9UPN9	Q9UQE7	Q9Y312	Q9Y3A4	Q9Y5K6	Q9Y5L0	Q9Y5X1
raw_name
20240408_OA1_Evo12_31min_TiHe_SA_H032_E32_F-40_B1	1.215493e+06	9.855868e+07	3.198485e+07	7.974717e+07	1.501879e+07	7.759782e+06	8.533623e+06	1.242373e+08	2.398779e+07	9.689934e+07	...	3.666914e+07	2.845033e+07	6.295127e+06	2.432301e+06	3.500413e+07	2.069527e+06	2.532437e+06	3.714149e+06	3.025387e+06	9.251197e+06

1 rows × 95 columns

3. Customizing Column Names

If your input files use column names that differ from those preconfigured in AnnDataFactory, you can specify them explicitly:

# Get the DIA-NN v1.9.0 PSM report (.parquet) from the same dataset folder
output_dir = "./datasets/data_for_A01_load_psm_tables"
diann_file = at.data.get_data("diann_1.9.0_psm_report", output_dir=output_dir if output_dir else tempfile.mkdtemp())

factory = AnnDataFactory.from_files(
    file_paths=diann_file,
    reader_type="diann",
    raw_name_column="File.Name",
    protein_id_column="Protein.Group",
    # intensity_column="PG.MaxLFQ",
)

adata = factory.create_anndata()

print("AnnData shape:", adata.shape)

adata.to_df()

./datasets/data_for_A01_load_psm_tables/diann_1.9.0_report_head.parquet already exists (0.07235431671142578 MB)
AnnData shape: (6, 4)

proteins	P36578	Q96L58	Q9BQG0	Q9P258
raw_name
CPD_NE_000011_01	1020730.0	20720.0	540623.0	NaN
CPD_NE_000011_02	909317.0	17365.7	336248.0	1687790.0
CPD_NE_000011_03	777209.0	23797.3	424641.0	2119210.0
CPD_NE_000011_04	777209.0	23797.3	424641.0	2119210.0
CPD_NE_000011_05	1066430.0	30596.0	452681.0	NaN
CPD_NE_000011_06	1102520.0	29851.1	410367.0	681368.0
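
The NaN entries in the table above can be inspected with ordinary pandas operations once the matrix is exported via adata.to_df(). The sketch below rebuilds the printed table by hand as a stand-in for that export (so it runs without the downloaded file) and counts missing values per protein; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-copied stand-in for adata.to_df() from the DIA-NN example above.
intensity_df = pd.DataFrame(
    {
        "P36578": [1020730.0, 909317.0, 777209.0, 777209.0, 1066430.0, 1102520.0],
        "Q96L58": [20720.0, 17365.7, 23797.3, 23797.3, 30596.0, 29851.1],
        "Q9BQG0": [540623.0, 336248.0, 424641.0, 424641.0, 452681.0, 410367.0],
        "Q9P258": [np.nan, 1687790.0, 2119210.0, 2119210.0, np.nan, 681368.0],
    },
    index=pd.Index([f"CPD_NE_000011_0{i}" for i in range(1, 7)], name="raw_name"),
)

# Number of samples in which each protein has no intensity value.
missing_per_protein = intensity_df.isna().sum()
print(missing_per_protein)

# Keep only proteins that have a value in every sample.
complete_df = intensity_df.dropna(axis="columns")
print(complete_df.shape)
```

Whether to drop, impute, or keep such proteins depends on the downstream analysis; this sketch only makes the missingness visible.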
