Tutorial: Reading PSM tables into AnnData format
This notebook demonstrates how to use the AnnDataFactory class to convert proteomics PSM (peptide-spectrum match) tables into the AnnData format, which is widely used in single-cell analysis pipelines.
import pandas as pd
from alphabase.psm_reader.keys import PsmDfCols
import alphapepttools as at
from alphapepttools.io.anndata_factory import AnnDataFactory
1. Creating an AnnDataFactory from a DataFrame
First, let’s create a sample PSM DataFrame with the required columns and pass it to the AnnDataFactory constructor.
The resulting AnnData object has:
Rows (obs) representing samples (raw names)
Columns (var) representing proteins
X matrix containing intensity values
# Create sample PSM data
sample_psm_data = {
    PsmDfCols.RAW_NAME: ["sample1", "sample1", "sample2", "sample2"],
    PsmDfCols.PROTEINS: ["proteinA", "proteinB", "proteinA", "proteinB"],
    PsmDfCols.INTENSITY: [100, 200, 150, 250],
}
psm_df = pd.DataFrame(sample_psm_data)
# Create AnnDataFactory instance
factory = AnnDataFactory(
    psm_df,
    intensity_column=PsmDfCols.INTENSITY,
    sample_id_column=PsmDfCols.RAW_NAME,
    feature_id_column=PsmDfCols.PROTEINS,
)
# Convert to AnnData
adata = factory.create_anndata()
print("AnnData shape:", adata.shape)
print("\nObservations (samples):", adata.obs_names)
print("\nVariables (proteins):", adata.var_names)
print("\nIntensity matrix:\n", adata.X)
AnnData shape: (2, 2)
Observations (samples): Index(['sample1', 'sample2'], dtype='object', name='raw_name')
Variables (proteins): Index(['proteinA', 'proteinB'], dtype='object', name='proteins')
Intensity matrix:
[[100 200]
 [150 250]]
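Conceptually, the conversion above resembles a long-to-wide pivot: each (sample, protein) pair in the PSM table becomes one cell of the intensity matrix. The following is a minimal pandas-only sketch of that reshaping, not AnnDataFactory's actual implementation. The column names "raw_name" and "proteins" are taken from the index names in the output above; "intensity" is an assumed name chosen for illustration.

```python
import pandas as pd

# Long-format PSM table: one row per (sample, protein) measurement.
# "intensity" is an illustrative column name, not necessarily the value
# of PsmDfCols.INTENSITY.
psm_long = pd.DataFrame(
    {
        "raw_name": ["sample1", "sample1", "sample2", "sample2"],
        "proteins": ["proteinA", "proteinB", "proteinA", "proteinB"],
        "intensity": [100, 200, 150, 250],
    }
)

# Pivot to wide format: samples become rows, proteins become columns.
wide = psm_long.pivot(index="raw_name", columns="proteins", values="intensity")

print(wide.shape)
print(wide.to_numpy())
```

The resulting frame has the same layout as adata.to_df(): one row per sample and one column per protein.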
2. Loading Data from Files (AlphaDIA Example)
The AnnDataFactory can also read data directly from PSM files. Here’s how to use it with AlphaDIA output:
# Download the AlphaDIA v1.8.1 .tsv PSM report
alphadia_path = at.data.get_data("alphadia_1.8.1_psm_report")
# Load the AlphaDIA report
factory = AnnDataFactory.from_files(file_paths=alphadia_path, reader_type="alphadia")
# Convert to AnnData
adata = factory.create_anndata()
print("AnnData shape:", adata.shape)
adata.to_df()
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/alphadia_1.8.1_report_head.tsv already exists (0.0773305892944336 MB)
AnnData shape: (1, 95)
| raw_name (rows) / proteins (columns) | A6ZKI3;Q17RB0 | O00410 | O14744 | O43143 | O43592 | O43823 | O60341 | O60664 | O60716 | O60841 | ... | Q9UHB9 | Q9UHD8 | Q9UMR2 | Q9UPN9 | Q9UQE7 | Q9Y312 | Q9Y3A4 | Q9Y5K6 | Q9Y5L0 | Q9Y5X1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20240408_OA1_Evo12_31min_TiHe_SA_H032_E32_F-40_B1 | 1.215493e+06 | 9.855868e+07 | 3.198485e+07 | 7.974717e+07 | 1.501879e+07 | 7.759782e+06 | 8.533623e+06 | 1.242373e+08 | 2.398779e+07 | 9.689934e+07 | ... | 3.666914e+07 | 2.845033e+07 | 6.295127e+06 | 2.432301e+06 | 3.500413e+07 | 2.069527e+06 | 2.532437e+06 | 3.714149e+06 | 3.025387e+06 | 9.251197e+06 |
1 rows × 95 columns
3. Customizing Column Names
If your input files use column names that differ from those preconfigured in AnnDataFactory, you can specify them explicitly:
# Load the DIA-NN report from the same download folder
diann_file = at.data.get_data("diann_1.9.0_psm_report")
factory = AnnDataFactory.from_files(
    file_paths=diann_file,
    reader_type="diann",
    raw_name_column="File.Name",
    protein_id_column="Protein.Group",
    # intensity_column="PG.MaxLFQ",
)
adata = factory.create_anndata()
print("AnnData shape:", adata.shape)
adata.to_df()
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/diann_1.9.0_report_head.parquet already exists (0.07235431671142578 MB)
AnnData shape: (6, 4)
| raw_name (rows) / proteins (columns) | P36578 | Q96L58 | Q9BQG0 | Q9P258 |
|---|---|---|---|---|
| CPD_NE_000011_01 | 1020730.0 | 20720.0 | 540623.0 | NaN |
| CPD_NE_000011_02 | 909317.0 | 17365.7 | 336248.0 | 1687790.0 |
| CPD_NE_000011_03 | 777209.0 | 23797.3 | 424641.0 | 2119210.0 |
| CPD_NE_000011_04 | 777209.0 | 23797.3 | 424641.0 | 2119210.0 |
| CPD_NE_000011_05 | 1066430.0 | 30596.0 | 452681.0 | NaN |
| CPD_NE_000011_06 | 1102520.0 | 29851.1 | 410367.0 | 681368.0 |
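To illustrate what a column-name mapping amounts to, here is a hedged, pandas-only sketch that renames tool-specific columns to one shared set of names before pivoting. It is not how AnnDataFactory or its readers are implemented. The report contents (run_01, run_02, P1, P2 and the intensities) are made-up toy values, and the target names "raw_name", "proteins" and "intensity" are assumptions for illustration. The sketch also shows one plausible source of NaN entries in a wide table: a protein with no row for a given sample.

```python
import pandas as pd

# Hypothetical long-format report using DIA-NN-style column names.
# All values below are toy data, not taken from a real report.
report = pd.DataFrame(
    {
        "File.Name": ["run_01", "run_01", "run_02"],
        "Protein.Group": ["P1", "P2", "P1"],
        "PG.MaxLFQ": [10.0, 20.0, 30.0],
    }
)

# Map tool-specific column names onto one shared set of names.
column_mapping = {
    "File.Name": "raw_name",
    "Protein.Group": "proteins",
    "PG.MaxLFQ": "intensity",
}
standardized = report.rename(columns=column_mapping)

# Pivot to samples x proteins; combinations without a row become NaN.
wide = standardized.pivot(index="raw_name", columns="proteins", values="intensity")

print(wide)
```

Here run_02 has no entry for P2, so that cell is NaN after the pivot.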