Introduction to proteomics data analysis with anndata#

About anndata#

What is AnnData and why is it the main data structure of alphapepttools?

AnnData is a Python data structure for high-dimensional biological data analysis. Originally developed for single-cell genomics, it is increasingly adopted across the scientific community for the analysis of omics data. It is the central data structure of the scverse, an open-source software ecosystem for omics analysis.

AnnData solves a key challenge of modern, high-throughput proteomics: managing thousands of protein measurements alongside complex sample and protein metadata. Unlike pandas DataFrames where mixing numeric and categorical data complicates matrix operations, AnnData keeps measurement and metadata aligned but separate. It can store

  • Quantitative measurements (protein abundances, intensities) for thousands of proteins

  • Sample metadata (experimental conditions, patient demographics, batch information)

  • Protein annotations (gene names, functional categories, pathway memberships)

  • and analysis results (statistical tests, clustering assignments, quality metrics)

in a single place, which makes these large amounts of data easier to handle.

alphapepttools is built around anndata and provides a suite of filtering and processing functions, making your analyses simpler, more robust and, importantly, compatible with downstream scanpy/scverse packages.

In this tutorial, we explore how AnnData is actually organized to understand how alphapepttools enables the search engine-agnostic ingestion of proteomics data into the scverse.

Core components of anndata#

Here is a schematic of an AnnData object, which is created by the AnnData class of the anndata package [1]:

For most alphapepttools applications, you’ll primarily work with three key components:

  • X: The Numeric Expression Matrix
    A numpy array where rows represent samples and columns represent features (e.g., proteins, precursors, genes)

  • obs: Sample Metadata
    A DataFrame where rows are samples and columns contain metadata properties (e.g., age, disease state, cohort, batch)

  • var: Feature Metadata
    A DataFrame where rows are features and columns contain feature properties (e.g., for proteins: Gene names, GO terms, functional annotations)

This structure ensures that when you filter samples or features, all associated metadata automatically stays synchronized, preventing common annotation misalignment issues in analysis workflows.

[1] AnnData Documentation

Basic AnnData syntax#

We will now explore how users can create and interact with AnnData objects, and then show how alphapepttools enables the search engine-agnostic creation of AnnData objects from raw search engine outputs.

Import the relevant modules

import anndata as ad
import numpy as np
import pandas as pd
import alphapepttools as at

We’ll start by manually creating a small synthetic anndata object. Under the hood, anndata uses very familiar data structures such as numpy arrays and pandas DataFrames.

numerical_data = np.array([[1, 0, 0], [0, 2, 3]])

sample_metadata = pd.DataFrame(
    {
        "obs_names": ["cell1", "cell2"],
        "age": [28, 29],
    }
).set_index("obs_names")

feature_metadata = pd.DataFrame({"var_names": ["gene1", "gene2", "gene3"]}).set_index("var_names")

# Generate AnnData object
adata = ad.AnnData(
    X=numerical_data,
    obs=sample_metadata,
    var=feature_metadata,
)

# We can get a dataframe back
df = adata.to_df()
display(df)

# And also look at the sample and feature metadata
display(adata.obs)
display(adata.var)
var_names  gene1  gene2  gene3
obs_names
cell1          1      0      0
cell2          0      2      3

           age
obs_names
cell1       28
cell2       29

var_names
gene1
gene2
gene3

Conversion to other data structures#

To get out of AnnData and back to the more familiar world of dataframes, call the built-in .to_df() method:

df = adata.to_df()
display(df)

# Get your sample metadata added as columns
df = df.join(adata.obs)
display(df)
var_names  gene1  gene2  gene3
obs_names
cell1          1      0      0
cell2          0      2      3

           gene1  gene2  gene3  age
obs_names
cell1          1      0      0   28
cell2          0      2      3   29

How to load search engine data into AnnData with alphapepttools?#

The functionality relies on the alphabase backbone of PSM and PG readers, which allows for loading and parsing of common search engine output formats. In this notebook we will look at reading data for three common search engines: DIANN, AlphaDIA and Spectronaut. Search engines output data in two main ways: either as a long table, where each row is an individual precursor in a given sample (PSM tables), or as a wide table that maps protein groups to samples (protein group tables).

PSM tables#

Peptide spectrum match (PSM) tables are the primary output of proteomics search engines and are typically returned in a long format. In a PSM table, each row represents a single peptide-spectrum match, i.e. a peptide sequence that the search engine identified as compatible with an observed mass spectrum in a given sample. PSM tables contain information about 1) the peptide sequence, 2) the spectrum, and 3) the score assigned to the PSM by the search engine.

A PSM table might look like this:

Precursor    Run         Stripped.Sequence
PEPTIDEK1    file_1.raw  PEPTIDEK
PEPTIDERK2   file_1.raw  PEPTIDERK
PEPTIDR2     file_1.raw  PEPTIDR
PEPTIDEK1    file_2.raw  PEPTIDEK
PEPTIDERK2   file_2.raw  PEPTIDERK
PEPTIDR2     file_2.raw  PEPTIDR
PEPTIDEK1    file_3.raw  PEPTIDEK
PEPTIDERK2   file_3.raw  PEPTIDERK
PEPTIDR2     file_3.raw  PEPTIDR
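To make the long format concrete, here is a minimal pandas sketch that pivots such a table into a samples × precursors matrix. The "Intensity" column and its values are hypothetical additions for illustration; the example table above does not include a quantity column.

```python
import pandas as pd

# Long-format PSM table mirroring the example above
psm = pd.DataFrame(
    {
        "Precursor": ["PEPTIDEK1", "PEPTIDERK2", "PEPTIDR2"] * 3,
        "Run": ["file_1.raw"] * 3 + ["file_2.raw"] * 3 + ["file_3.raw"] * 3,
        "Stripped.Sequence": ["PEPTIDEK", "PEPTIDERK", "PEPTIDR"] * 3,
        # Hypothetical intensities, added only so there is something to pivot
        "Intensity": [10.0, 20.0, 30.0, 11.0, 21.0, 31.0, 12.0, 22.0, 32.0],
    }
)

# One row per run, one column per precursor
wide = psm.pivot(index="Run", columns="Precursor", values="Intensity")
print(wide)
```

DataFrame.pivot raises an error if a (Run, Precursor) pair occurs more than once, which is a useful sanity check before any aggregation.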

Protein group tables#

Protein group tables are the primary output for protein-level quantification in proteomics workflows. After search engines identify peptide spectrum matches, they aggregate peptide-level evidence to infer protein-level abundances. These protein group tables represent a structured matrix in the wide format that maps protein groups (features) to samples (observations), with estimated intensity values as entries.

Protein  file_1.raw  file_2.raw  file_3.raw
PROT1          1000        1200        1100
PROT2          2000        2200        2100
PROT3          1500        1600        1550

Typically, downstream analysis tasks are performed on a sample level. Therefore, we are usually interested in a transposed version of the wide format, where rows correspond to samples and columns to features. alphapepttools allows users to easily generate compliant anndata objects from raw search engine outputs with its I/O functionalities.

Getting the example data#

Small versions of larger real-world datasets are stored in a public repository. Users can access them via the utility function alphapepttools.data.get_data.

We can inspect the available datasets with:

at.data.available_data()
name url search_engine data_type citation description
0 bader2020_pg_alphadia https://datashare.biochem.mpg.de/s/yLpjkoQzMHpdDsB alphadia study_pg Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al.
1 bader2020_pg_diann https://datashare.biochem.mpg.de/s/3oZsya2L5bGnmtQ DiaNN study_pg Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al.
2 spectronaut_pg https://datashare.biochem.mpg.de/s/Ai9TiBTeaPHK5by spectronaut pg None An example spectronaut report
3 bader2020_psm_alphadia https://datashare.biochem.mpg.de/s/awYyxod4ksz86kk alphadia study_psm Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al.
4 bader2020_psm_diann https://datashare.biochem.mpg.de/s/c4Z5Yg6srKQyDym DiaNN study_psm Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al.
5 spectronaut_psm https://datashare.biochem.mpg.de/s/GtfJL49Rf9w78EE spectronaut study_psm None An example spectronaut report
6 bader2020_full_diann https://datashare.biochem.mpg.de/s/zNtxEQJTdqQYLd4 DiaNN psm Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. Full DiaNN report.parquet for the Bader et al. Alzheimer CSF study
7 bader2020_metadata https://datashare.biochem.mpg.de/s/iSgdPnHgczbcktF none metadata Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. Sample metadata for the Bader et al. Alzheimer CSF study
8 hela_pg_diann https://datashare.biochem.mpg.de/s/XDi34GJpGHBgm5x DiaNN pg None HeLa QC protein group matrix from DiaNN (report.pg_matrix.tsv)
9 hela_metadata https://datashare.biochem.mpg.de/s/aHpXSLX9RxbYPzp none metadata None HeLa QC sample metadata (simple_metadata.csv)
10 pelsa_report_diann https://datashare.biochem.mpg.de/s/6piDQGm2yAEdtKQ DiaNN psm Li, Kejia, et al. A peptide-centric local stability assay enables proteome-scale identification of the protein targets and binding regions of diverse ligands. Nature Methods 22.2 (2025): 278-282. PELSA Staurosporine study DiaNN report for differential expression analysis
11 pelsa_metadata https://datashare.biochem.mpg.de/s/NGof744gWw66Mc8 none metadata Li, Kejia, et al. A peptide-centric local stability assay enables proteome-scale identification of the protein targets and binding regions of diverse ligands. Nature Methods 22.2 (2025): 278-282. Sample metadata for the PELSA Staurosporine study
12 kinase_table https://datashare.biochem.mpg.de/s/DnNe8EFSQyqy5pb none annotation Manning, Gerard, et al. The protein kinase complement of the human genome. Science 298.5600 (2002): 1912-1934. Human kinase annotation table from kinhub.org
13 alphadia_1.8.1_psm_report https://datashare.biochem.mpg.de/s/ZBAGvQb4j2Lh0fo alphadia psm None Small example reports for testing PSM readers (AlphaDIA 1.8.1)
14 alphadia_2.0.0_psm_report https://datashare.biochem.mpg.de/s/FoTMb5Gmk9qgZtr alphadia psm None Small example reports for testing PSM readers (AlphaDIA 2.0.0, parquet)
15 diann_1.8.1_psm_report https://datashare.biochem.mpg.de/s/DQeP6FYDFYcXadd diann psm None Small example reports for testing PSM readers (DiaNN 1.8.1)
16 diann_1.9.0_psm_report https://datashare.biochem.mpg.de/s/N65ZsaaX74eesx7 diann psm None Small example reports for testing PSM readers (DiaNN 1.9.0, parquet)
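If the overview is a pandas DataFrame with the columns printed above (an assumption based on the output, not confirmed here), it can be filtered like any other table. The sketch below uses a hand-typed excerpt of the listing rather than calling alphapepttools, and normalizes the search engine spelling because the listing mixes "DiaNN" and "diann".

```python
import pandas as pd

# Hand-typed excerpt of the listing above (name, search_engine, data_type only)
catalog = pd.DataFrame(
    {
        "name": [
            "bader2020_pg_alphadia",
            "bader2020_pg_diann",
            "spectronaut_pg",
            "hela_pg_diann",
            "diann_1.8.1_psm_report",
        ],
        "search_engine": ["alphadia", "DiaNN", "spectronaut", "DiaNN", "diann"],
        "data_type": ["study_pg", "study_pg", "pg", "pg", "psm"],
    }
)

# Case-insensitive match on the search engine, protein group data only
is_diann = catalog["search_engine"].str.lower() == "diann"
is_pg = catalog["data_type"].str.endswith("pg")
diann_pg_names = catalog.loc[is_diann & is_pg, "name"].tolist()
print(diann_pg_names)
```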

Protein Group Tables#

Let’s first take a look at the protein group tables.

When you inspect the raw files you will notice that protein group table outputs vary significantly between search engines with respect to the features and quantities they contain. alphapepttools provides a single function alphapepttools.io.read_pg_table() to read any of the widely used search engine reports into the streamlined anndata format.

You only need to specify the path to the raw output and the type of search engine that generated the output.

alphadia#

data_path = at.data.get_data("bader2020_pg_alphadia")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_alphadia_pg does not yet exist
bader2020_alphadia_pg.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_alphadia_pg successfully downloaded (0.27982044219970703 MB)
alphadia_pg_anndata = at.io.read_pg_table(
    path=data_path / "pg.matrix_top100.tsv",
    search_engine="alphadia",
)

alphadia_pg_anndata
AnnData object with n_obs × n_vars = 61 × 100
# Inspect the data
display(alphadia_pg_anndata.to_df().iloc[:5, :5])
display(alphadia_pg_anndata.obs.head())
display(alphadia_pg_anndata.var.head())
uniprot_ids A0A024R6N5;A0A0G2JRN3 A0A075B6H7 A0A075B6H9 A0A075B6I0 A0A075B6I1
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01 0.000000e+00 2.098375e+07 1.603786e+09 1.234308e+09 9.981811e+07
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05 5.248854e+07 1.558676e+07 1.264443e+09 1.017132e+09 1.438419e+08
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11 3.063311e+08 7.633061e+06 9.859006e+08 6.482135e+08 5.423701e+07
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12 1.558016e+08 1.442227e+07 3.558411e+08 1.124133e+09 8.751671e+07
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03 2.585067e+08 4.627953e+07 1.224894e+09 8.738715e+08 8.460091e+07
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03
uniprot_ids
A0A024R6N5;A0A0G2JRN3
A0A075B6H7
A0A075B6H9
A0A075B6I0
A0A075B6I1

DIANN#

Similarly, you can read other search engine outputs, such as DIANN’s protein group table. Note that, compared to alphadia, DIANN outputs additional feature-level metadata, which is stored in the anndata.AnnData.var attribute.

data_path = at.data.get_data("bader2020_pg_diann")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_diann_pg does not yet exist
bader2020_diann_pg.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_diann_pg successfully downloaded (0.08965206146240234 MB)
# Using the pg reader
diann_pg_anndata = at.io.read_pg_table(
    path=data_path / "pg.matrix_top100.tsv",
    search_engine="diann",
)

# Inspect the data
display(diann_pg_anndata.to_df().iloc[:5, :5])
display(diann_pg_anndata.obs.head())
display(diann_pg_anndata.var.head())
proteins LV469_HUMAN LV861_HUMAN LV460_HUMAN LV548_HUMAN LV746_HUMAN
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw 77220500.0 103374000.0 10956700.0 NaN 125708000.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw 39566800.0 71580200.0 12655900.0 NaN 55909900.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11.raw 27392100.0 42695000.0 4543130.0 NaN 89816500.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12.raw 12829700.0 55057200.0 7905720.0 202590.0 53537600.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03.raw 33979400.0 66161200.0 4035670.0 NaN 112146000.0
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12.raw
/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03.raw
uniprot_ids genes description peptide_count proteotypic_peptide_count
proteins
LV469_HUMAN A0A075B6H9 IGLV4-69 NaN 2 2
LV861_HUMAN A0A075B6I0 IGLV8-61 NaN 3 3
LV460_HUMAN A0A075B6I1 IGLV4-60 NaN 2 2
LV548_HUMAN A0A075B6I7 IGLV5-48 NaN 1 1
LV746_HUMAN A0A075B6I9 IGLV7-46 NaN 3 1

Spectronaut#

# data_path = at.data.get_data("spectronaut_pg")
# sn_pg_anndata = at.io.read_pg_table(
#     path=data_path / "spectronaut_protein_table.tsv",
#     search_engine="spectronaut",
# )

# # Inspect the data
# display(sn_pg_anndata.to_df().iloc[:5, :5])
# display(sn_pg_anndata.obs.head())
# display(sn_pg_anndata.var.head())

Protein data from peptide spectrum match tables#

You can similarly extract protein intensity information from PSM tables with the alphapepttools.io.read_psm_table function. Note that the protein intensities derived from PG tables and PSM tables might differ.
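One simple way to see how two such tables differ is to compare them on their shared samples and proteins. The sketch below uses two tiny hand-made dataframes as stand-ins for the .to_df() output of a PG-derived and a PSM-derived object; all names and values are illustrative. With real data, sample and protein identifiers may first need to be harmonized, since the outputs in this notebook use different naming schemes.

```python
import pandas as pd

# Stand-ins for pg_adata.to_df() and psm_adata.to_df() (illustrative values)
pg_df = pd.DataFrame([[10.0, 20.0], [30.0, 40.0]], index=["s1", "s2"], columns=["P1", "P2"])
psm_df = pd.DataFrame([[12.0, 5.0], [33.0, 7.0]], index=["s1", "s3"], columns=["P1", "P3"])

# Restrict both tables to the samples and proteins they have in common
shared_samples = pg_df.index.intersection(psm_df.index)
shared_proteins = pg_df.columns.intersection(psm_df.columns)

# Ratio of PSM-derived to PG-derived intensities on the overlap
ratio = psm_df.loc[shared_samples, shared_proteins] / pg_df.loc[shared_samples, shared_proteins]
print(ratio)
```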

When might you use PSM tables instead of PG tables?

  • You might choose PSM tables when you need custom protein quantification strategies that differ from the search engine’s default approach,

  • when you want to perform quality control at the precursor level to identify problematic peptides,

  • when you’re integrating protein data with peptide-level analyses,

  • or when protein group tables aren’t available from your search engine.
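As an example of the first point, a custom quantification strategy can be as simple as summing precursor intensities per protein and run. The sketch below is one possible strategy under made-up column names ("run", "protein", "precursor", "intensity"); it does not reproduce any search engine's or alphapepttools' own quantification.

```python
import pandas as pd

# Toy precursor-level table (column names and values are hypothetical)
psm = pd.DataFrame(
    {
        "run": ["file_1", "file_1", "file_1", "file_2", "file_2", "file_2"],
        "protein": ["PROT1", "PROT1", "PROT2", "PROT1", "PROT1", "PROT2"],
        "precursor": ["PEPTIDEK1", "PEPTIDERK2", "PEPTIDR2"] * 2,
        "intensity": [100.0, 300.0, 50.0, 200.0, 400.0, 70.0],
    }
)

# Sum precursor intensities per (run, protein), then spread proteins into columns
protein_matrix = psm.groupby(["run", "protein"])["intensity"].sum().unstack("protein")
print(protein_matrix)
```

Replacing .sum() with .median() or another aggregation changes the strategy without changing the shape of the result, which is again samples × proteins.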

We again explore examples of PSM reports from different search engines.

AlphaDIA#

data_path = at.data.get_data("bader2020_psm_alphadia")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/alphadia does not yet exist
alphadia.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/alphadia successfully downloaded (8.177705764770508 MB)
ad_psm_andata = at.io.read_psm_table(
    file_paths=data_path / "top20_precursors.tsv",
    search_engine="alphadia",
)
ad_psm_andata
AnnData object with n_obs × n_vars = 60 × 19
# Inspect the data
display(ad_psm_andata.to_df().iloc[:5, :5])
display(ad_psm_andata.obs.head())
display(ad_psm_andata.var.head())
proteins P01009;A0A024R6N5 P01011;A0A087WY93;G3V3A0;G3V595 P01024;A0A8Q3SI05;A0A8Q3SI34;A0A8Q3WM02 P02675;D6REL8 P02787
raw_name
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01 NaN NaN 5.122159e+09 NaN 1.383408e+11
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02 3.942457e+09 1.895097e+09 7.004725e+09 5.359053e+09 1.166750e+11
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03 3.670290e+09 1.466623e+09 1.070513e+10 4.963886e+09 1.096607e+11
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04 5.022584e+09 1.653896e+09 6.425221e+09 5.095784e+09 1.073282e+11
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05 3.314550e+09 1.044103e+09 5.690721e+09 4.237927e+09 8.503823e+10
raw_name
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05
proteins
P01009;A0A024R6N5
P01011;A0A087WY93;G3V3A0;G3V595
P01024;A0A8Q3SI05;A0A8Q3SI34;A0A8Q3WM02
P02675;D6REL8
P02787

DIANN#

data_path = at.data.get_data("bader2020_psm_diann")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/diann does not yet exist
diann.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/diann successfully downloaded (1.0407352447509766 MB)
# Read the PSM report (named *_psm_* to distinguish it from the PG table above)
diann_psm_anndata = at.io.read_psm_table(
    file_paths=data_path / "top20_report.parquet",
    search_engine="diann",
)

# Inspect the data
display(diann_psm_anndata.to_df().iloc[:5, :5])
display(diann_psm_anndata.obs.head())
display(diann_psm_anndata.var.head())
proteins P01011 P02649 P02671 P02765 P02768
raw_name
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01 9.798148e+08 6.036072e+09 129011936.0 1.455730e+09 1.929707e+10
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02 1.396344e+09 7.632371e+09 204724768.0 1.515461e+09 2.287499e+10
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03 1.025414e+09 4.464353e+09 204250576.0 9.584196e+08 1.264076e+10
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04 1.272798e+09 6.688002e+09 192892832.0 1.185967e+09 1.557733e+10
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05 8.905130e+08 4.973349e+09 206050384.0 1.481705e+09 1.579670e+10
raw_name
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04
20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05
proteins
P01011
P02649
P02671
P02765
P02768

Spectronaut#

# data_path = at.data.get_data("spectronaut_psm")
# sn_pg_andata = at.io.read_psm_table(
#     file_paths=data_path / "example_dataset_mouse_sn_top20peptides.tsv",
#     search_engine="spectronaut",
#     intensity_column = "F.PeakArea",
#     feature_id_column = "FG.Id"
# )

# # Inspect the data
# display(sn_pg_andata.to_df().iloc[:5, :5])
# display(sn_pg_andata.obs.head())
# display(sn_pg_andata.var.head())

In Summary, we learned…#

  • How AnnData can help us keep our data and metadata aligned

  • How to generate AnnData objects from our input dataframes

  • How to get dataframes back from AnnData objects

  • How to load protein tables using alphapepttools readers

  • How to load and pivot PSM tables using alphapepttools readers

From here on, proteomics data can be used for all further downstream analyses. Continue with the basic workflow notebook to learn how to merge your metadata into AnnData objects and perform filtering and exploratory data analysis!