Introduction to proteomics data analysis with anndata#
About anndata#
What is AnnData and why is it the main data structure of alphapepttools?
AnnData is a Python data structure for high-dimensional biological data analysis. Originally developed for single-cell genomics, it is increasingly adopted across the scientific ecosystem for the analysis of omics data. It forms the central data structure of scverse, an open source software ecosystem for the analysis of omics data.
AnnData solves a key challenge of modern, high-throughput proteomics: managing thousands of protein measurements alongside complex sample and protein metadata. Unlike pandas DataFrames where mixing numeric and categorical data complicates matrix operations, AnnData keeps measurement and metadata aligned but separate. It can store
Quantitative measurements (protein abundances, intensities) for thousands of proteins
Sample metadata (experimental conditions, patient demographics, batch information)
Protein annotations (gene names, functional categories, pathway memberships)
and Analysis results (statistical tests, clustering assignments, quality metrics)
in a single place, which makes these large amounts of data easier to handle.
alphapepttools is built around anndata and has a suite of useful filtering and processing functions, making your analyses simpler, more robust and, importantly, compatible with downstream scanpy/scverse packages.
In this tutorial, we explore how AnnData is actually organized to understand how alphapepttools enables the search engine-agnostic ingestion of proteomics data into the scverse.
Core components of anndata#
Here is a schematic of an AnnData object, which is created by the AnnData class of the anndata package [1]:
For most alphapepttools applications, you’ll primarily work with three key components:
X: The Numeric Expression Matrix
A numpy array where rows represent samples and columns represent features (e.g., proteins, precursors, genes)
obs: Sample Metadata
A DataFrame where rows are samples and columns contain metadata properties (e.g., age, disease state, cohort, batch)
var: Feature Metadata
A DataFrame where rows are features and columns contain feature properties (e.g., for proteins: gene names, GO terms, functional annotations)
This structure ensures that when you filter samples or features, all associated metadata automatically stays synchronized, preventing common annotation misalignment issues in analysis workflows.
Basic AnnData syntax#
We will now explore how users can create and interact with anndata and finally show how alphapepttools enables the search-engine agnostic creation of anndata objects from the raw outputs.
Import the relevant modules
import anndata as ad
import numpy as np
import pandas as pd
import alphapepttools as at
We’ll start by manually creating a small synthetic anndata object. Under the hood, anndata uses very familiar data structures such as numpy arrays and pandas DataFrames.
numerical_data = np.array([[1, 0, 0], [0, 2, 3]])
sample_metadata = pd.DataFrame(
{
"obs_names": ["cell1", "cell2"],
"age": [28, 29],
}
).set_index("obs_names")
feature_metadata = pd.DataFrame({"var_names": ["gene1", "gene2", "gene3"]}).set_index("var_names")
# Generate AnnData object
adata = ad.AnnData(
X=numerical_data,
obs=sample_metadata,
var=feature_metadata,
)
# We can get a dataframe back
df = adata.to_df()
display(df)
# And also look at the sample and feature metadata
display(adata.obs)
display(adata.var)
| var_names | gene1 | gene2 | gene3 |
|---|---|---|---|
| obs_names | |||
| cell1 | 1 | 0 | 0 |
| cell2 | 0 | 2 | 3 |
| obs_names | age |
|---|---|
| cell1 | 28 |
| cell2 | 29 |
| var_names |
|---|
| gene1 |
| gene2 |
| gene3 |
Conversion to other data structures#
To get out of AnnData and back to the more familiar world of dataframes, just call the built-in .to_df() method:
df = adata.to_df()
display(df)
# Get your sample metadata added as columns
df = df.join(adata.obs)
display(df)
| var_names | gene1 | gene2 | gene3 |
|---|---|---|---|
| obs_names | |||
| cell1 | 1 | 0 | 0 |
| cell2 | 0 | 2 | 3 |
| obs_names | gene1 | gene2 | gene3 | age |
|---|---|---|---|---|
| cell1 | 1 | 0 | 0 | 28 |
| cell2 | 0 | 2 | 3 | 29 |
How to load search engine data into AnnData with alphapepttools?#
The functionality relies on the alphabase backbone of PSM and PG readers, which allows for loading and parsing of common search engine output formats. In this notebook we will look at reading data from three common search engines: DIANN, AlphaDIA and Spectronaut. Search engines output data in two main ways: either as a long table (PSM table), where rows are individual precursors in their respective samples, or as a wide table (protein group table), where rows are protein groups and columns are samples.
PSM tables#
Peptide spectrum match (PSM) tables are the primary output of proteomics search engines and are typically returned in a long format. In a PSM table, each row represents a single peptide-spectrum match, i.e. a peptide sequence that the proteomics search engine identified to be compatible with an observed mass spectrum in a given sample. PSM tables contain information about 1) the peptide sequence, 2) the spectrum, and 3) the score assigned to the PSM by the search engine.
An example of a PSM table could look something like this:
| Precursor | Run | Stripped.Sequence | … |
|---|---|---|---|
| PEPTIDEK1 | file_1.raw | PEPTIDEK | … |
| PEPTIDERK2 | file_1.raw | PEPTIDERK | … |
| PEPTIDR2 | file_1.raw | PEPTIDR | … |
| PEPTIDEK1 | file_2.raw | PEPTIDEK | … |
| PEPTIDERK2 | file_2.raw | PEPTIDERK | … |
| PEPTIDR2 | file_2.raw | PEPTIDR | … |
| PEPTIDEK1 | file_3.raw | PEPTIDEK | … |
| PEPTIDERK2 | file_3.raw | PEPTIDERK | … |
| PEPTIDR2 | file_3.raw | PEPTIDR | … |
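To make the long format more concrete, here is a minimal pandas sketch that pivots a PSM-like table into a samples × precursors matrix. The `Intensity` column and all of its values are invented for illustration; real reports use search-engine-specific column names, and this is not necessarily how alphapepttools performs the conversion internally.

```python
import pandas as pd

# Hypothetical long-format PSM table. The "Intensity" column and its values
# are made up for this sketch and do not come from a real report.
psm = pd.DataFrame(
    {
        "Precursor": ["PEPTIDEK1", "PEPTIDERK2", "PEPTIDR2"] * 3,
        "Run": ["file_1.raw"] * 3 + ["file_2.raw"] * 3 + ["file_3.raw"] * 3,
        "Intensity": [10.0, 20.0, 30.0, 11.0, 21.0, 31.0, 12.0, 22.0, 32.0],
    }
)

# One row per run (sample), one column per precursor (feature)
wide = psm.pivot(index="Run", columns="Precursor", values="Intensity")
print(wide)
```

`DataFrame.pivot` raises an error if a run/precursor pair occurs more than once, which is a useful safeguard against silently merging duplicated rows.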
Protein group tables#
Protein group tables are the primary output for protein-level quantification in proteomics workflows. After search engines identify peptide spectrum matches, they aggregate peptide-level evidence to infer protein-level abundances. These protein group tables represent a structured matrix in the wide format that maps protein groups (features) to samples (observations), with estimated intensity values as entries.
| Protein | file_1.raw | file_2.raw | file_3.raw | … |
|---|---|---|---|---|
| PROT1 | 1000 | 1200 | 1100 | … |
| PROT2 | 2000 | 2200 | 2100 | … |
| PROT3 | 1500 | 1600 | 1550 | … |
| … | … | … | … | … |
Typically, downstream analysis tasks are performed on a sample level. Therefore, we are usually interested in a transposed version of the wide format, where rows correspond to samples and columns to features. alphapepttools allows users to easily generate compliant anndata objects from raw search engine outputs with its I/O functionalities.
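The transposition mentioned here can be sketched with plain pandas. The snippet below rebuilds the toy protein group table from this section and flips it so that rows are samples and columns are protein groups; the index name `Run` is chosen for illustration only.

```python
import pandas as pd

# Toy protein group table in the wide format (features as rows)
pg = pd.DataFrame(
    {
        "Protein": ["PROT1", "PROT2", "PROT3"],
        "file_1.raw": [1000, 2000, 1500],
        "file_2.raw": [1200, 2200, 1600],
        "file_3.raw": [1100, 2100, 1550],
    }
)

# Use the protein identifier as index, then transpose:
# rows become samples, columns become protein groups
samples_by_features = pg.set_index("Protein").T
samples_by_features.index.name = "Run"  # illustrative name for the sample axis
print(samples_by_features)
```

This samples × features orientation is the one AnnData expects for its X matrix.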
Getting the example data#
Small versions of larger real-world datasets are stored in a public repository. Users can access them via the utility function alphapepttools.data.get_data.
We can inspect the available datasets with:
at.data.available_data()
| | name | url | search_engine | data_type | citation | description |
|---|---|---|---|---|---|---|
| 0 | bader2020_pg_alphadia | https://datashare.biochem.mpg.de/s/yLpjkoQzMHpdDsB | alphadia | study_pg | Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. | Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al. |
| 1 | bader2020_pg_diann | https://datashare.biochem.mpg.de/s/3oZsya2L5bGnmtQ | DiaNN | study_pg | Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. | Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al. |
| 2 | spectronaut_pg | https://datashare.biochem.mpg.de/s/Ai9TiBTeaPHK5by | spectronaut | pg | None | An example spectronaut report |
| 3 | bader2020_psm_alphadia | https://datashare.biochem.mpg.de/s/awYyxod4ksz86kk | alphadia | study_psm | Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. | Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al. |
| 4 | bader2020_psm_diann | https://datashare.biochem.mpg.de/s/c4Z5Yg6srKQyDym | DiaNN | study_psm | Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. | Study on Cerebral Spinal Fluid of Alzheimer patients by Bader et al. |
| 5 | spectronaut_psm | https://datashare.biochem.mpg.de/s/GtfJL49Rf9w78EE | spectronaut | study_psm | None | An example spectronaut report |
| 6 | bader2020_full_diann | https://datashare.biochem.mpg.de/s/zNtxEQJTdqQYLd4 | DiaNN | psm | Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. | Full DiaNN report.parquet for the Bader et al. Alzheimer CSF study |
| 7 | bader2020_metadata | https://datashare.biochem.mpg.de/s/iSgdPnHgczbcktF | none | metadata | Bader JM, Geyer PE, Müller JB, Strauss MT, Koch M, Leypoldt F, Koertvelyessy P, Bittner D, Schipke CG, Incesoy EI, Peters O, Deigendesch N, Simons M, Jensen MK, Zetterberg H, Mann M. Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Mol Syst Biol. 2020 Jun;16(6):e9356. doi: 10.15252/msb.20199356. PMID: 32485097; PMCID: PMC7266499. | Sample metadata for the Bader et al. Alzheimer CSF study |
| 8 | hela_pg_diann | https://datashare.biochem.mpg.de/s/XDi34GJpGHBgm5x | DiaNN | pg | None | HeLa QC protein group matrix from DiaNN (report.pg_matrix.tsv) |
| 9 | hela_metadata | https://datashare.biochem.mpg.de/s/aHpXSLX9RxbYPzp | none | metadata | None | HeLa QC sample metadata (simple_metadata.csv) |
| 10 | pelsa_report_diann | https://datashare.biochem.mpg.de/s/6piDQGm2yAEdtKQ | DiaNN | psm | Li, Kejia, et al. A peptide-centric local stability assay enables proteome-scale identification of the protein targets and binding regions of diverse ligands. Nature Methods 22.2 (2025): 278-282. | PELSA Staurosporine study DiaNN report for differential expression analysis |
| 11 | pelsa_metadata | https://datashare.biochem.mpg.de/s/NGof744gWw66Mc8 | none | metadata | Li, Kejia, et al. A peptide-centric local stability assay enables proteome-scale identification of the protein targets and binding regions of diverse ligands. Nature Methods 22.2 (2025): 278-282. | Sample metadata for the PELSA Staurosporine study |
| 12 | kinase_table | https://datashare.biochem.mpg.de/s/DnNe8EFSQyqy5pb | none | annotation | Manning, Gerard, et al. The protein kinase complement of the human genome. Science 298.5600 (2002): 1912-1934. | Human kinase annotation table from kinhub.org |
| 13 | alphadia_1.8.1_psm_report | https://datashare.biochem.mpg.de/s/ZBAGvQb4j2Lh0fo | alphadia | psm | None | Small example reports for testing PSM readers (AlphaDIA 1.8.1) |
| 14 | alphadia_2.0.0_psm_report | https://datashare.biochem.mpg.de/s/FoTMb5Gmk9qgZtr | alphadia | psm | None | Small example reports for testing PSM readers (AlphaDIA 2.0.0, parquet) |
| 15 | diann_1.8.1_psm_report | https://datashare.biochem.mpg.de/s/DQeP6FYDFYcXadd | diann | psm | None | Small example reports for testing PSM readers (DiaNN 1.8.1) |
| 16 | diann_1.9.0_psm_report | https://datashare.biochem.mpg.de/s/N65ZsaaX74eesx7 | diann | psm | None | Small example reports for testing PSM readers (DiaNN 1.9.0, parquet) |
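The overview is displayed like a pandas DataFrame, so ordinary pandas filtering should help to find a suitable dataset. The sketch below assumes that this is the case and works on a small, hand-copied excerpt of the table rather than calling the library. Note that the search_engine column mixes spellings such as "DiaNN" and "diann", so the comparison is done on lowercased values.

```python
import pandas as pd

# Hand-copied excerpt of the overview table (a stand-in for the real output)
catalogue = pd.DataFrame(
    {
        "name": [
            "bader2020_pg_alphadia",
            "bader2020_pg_diann",
            "hela_pg_diann",
            "diann_1.9.0_psm_report",
        ],
        "search_engine": ["alphadia", "DiaNN", "DiaNN", "diann"],
        "data_type": ["study_pg", "study_pg", "pg", "psm"],
    }
)

# Case-insensitive match on the search engine, then keep protein group tables
is_diann = catalogue["search_engine"].str.lower() == "diann"
is_pg = catalogue["data_type"].str.endswith("pg")
diann_pg_names = catalogue.loc[is_diann & is_pg, "name"].tolist()
print(diann_pg_names)
```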
Protein Group Tables#
Let’s first take a look at the protein group tables.
When you inspect the raw files you will notice that protein group table outputs vary significantly between search engines with respect to the features and quantities they contain. alphapepttools provides a single function alphapepttools.io.read_pg_table() to read any of the widely used search engine reports into the streamlined anndata format.
You only need to specify the path to the raw output and the type of search engine that generated the output.
alphadia#
data_path = at.data.get_data("bader2020_pg_alphadia")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_alphadia_pg does not yet exist
bader2020_alphadia_pg.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_alphadia_pg successfully downloaded (0.27982044219970703 MB)
alphadia_pg_anndata = at.io.read_pg_table(
path=data_path / "pg.matrix_top100.tsv",
search_engine="alphadia",
)
alphadia_pg_anndata
AnnData object with n_obs × n_vars = 61 × 100
# Inspect the data
display(alphadia_pg_anndata.to_df().iloc[:5, :5])
display(alphadia_pg_anndata.obs.head())
display(alphadia_pg_anndata.var.head())
| uniprot_ids | A0A024R6N5;A0A0G2JRN3 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 |
|---|---|---|---|---|---|
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01 | 0.000000e+00 | 2.098375e+07 | 1.603786e+09 | 1.234308e+09 | 9.981811e+07 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05 | 5.248854e+07 | 1.558676e+07 | 1.264443e+09 | 1.017132e+09 | 1.438419e+08 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11 | 3.063311e+08 | 7.633061e+06 | 9.859006e+08 | 6.482135e+08 | 5.423701e+07 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12 | 1.558016e+08 | 1.442227e+07 | 3.558411e+08 | 1.124133e+09 | 8.751671e+07 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03 | 2.585067e+08 | 4.627953e+07 | 1.224894e+09 | 8.738715e+08 | 8.460091e+07 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01 |
|---|
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03 |
| uniprot_ids |
|---|
| A0A024R6N5;A0A0G2JRN3 |
| A0A075B6H7 |
| A0A075B6H9 |
| A0A075B6I0 |
| A0A075B6I1 |
DIANN#
Similarly you can proceed with alternative search engine outputs like DIANN’s protein group table. Note that DIANN outputs additional feature-level metadata compared to alphadia that is stored in the anndata.AnnData.var attribute.
data_path = at.data.get_data("bader2020_pg_diann")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_diann_pg does not yet exist
bader2020_diann_pg.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/bader2020_diann_pg successfully downloaded (0.08965206146240234 MB)
# Using the pg reader
diann_pg_anndata = at.io.read_pg_table(
path=data_path / "pg.matrix_top100.tsv",
search_engine="diann",
)
# Inspect the data
display(diann_pg_anndata.to_df().iloc[:5, :5])
display(diann_pg_anndata.obs.head())
display(diann_pg_anndata.var.head())
| proteins | LV469_HUMAN | LV861_HUMAN | LV460_HUMAN | LV548_HUMAN | LV746_HUMAN |
|---|---|---|---|---|---|
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw | 77220500.0 | 103374000.0 | 10956700.0 | NaN | 125708000.0 |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw | 39566800.0 | 71580200.0 | 12655900.0 | NaN | 55909900.0 |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11.raw | 27392100.0 | 42695000.0 | 4543130.0 | NaN | 89816500.0 |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12.raw | 12829700.0 | 55057200.0 | 7905720.0 | 202590.0 | 53537600.0 |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03.raw | 33979400.0 | 66161200.0 | 4035670.0 | NaN | 112146000.0 |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw |
|---|
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB11.raw |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA12.raw |
| /fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/alzheimers_dataset/RAWDATA/20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB03.raw |
| proteins | uniprot_ids | genes | description | peptide_count | proteotypic_peptide_count |
|---|---|---|---|---|---|
| LV469_HUMAN | A0A075B6H9 | IGLV4-69 | NaN | 2 | 2 |
| LV861_HUMAN | A0A075B6I0 | IGLV8-61 | NaN | 3 | 3 |
| LV460_HUMAN | A0A075B6I1 | IGLV4-60 | NaN | 2 | 2 |
| LV548_HUMAN | A0A075B6I7 | IGLV5-48 | NaN | 1 | 1 |
| LV746_HUMAN | A0A075B6I9 | IGLV7-46 | NaN | 3 | 1 |
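In the output above, the DIANN sample names are full file paths ending in .raw, whereas the alphadia table uses the bare run name. If both tables should later be matched to the same sample metadata, one option is to reduce the paths to their file stem. This is a minimal sketch using only pathlib and pandas; it is not a built-in alphapepttools feature, and the assignment in the final comment is just one possible way to apply the result.

```python
from pathlib import PurePosixPath

import pandas as pd

# Two sample names copied from the DIANN output above
prefix = (
    "/fs/gpfs41/lv12/fileset02/pool/pool-mann-projects/alphatools/"
    "alzheimers_dataset/RAWDATA/"
)
diann_names = pd.Index(
    [
        prefix + "20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleE01.raw",
        prefix + "20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleB05.raw",
    ]
)

# Drop the directory and the ".raw" extension
short_names = diann_names.map(lambda p: PurePosixPath(p).stem)
print(list(short_names))

# Possible application: diann_pg_anndata.obs_names = short_names
```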
Spectronaut#
# data_path = at.data.get_data("spectronaut_pg")
# sn_pg_anndata = at.io.read_pg_table(
# path=data_path / "spectronaut_protein_table.tsv",
# search_engine="spectronaut",
# )
# # Inspect the data
# display(sn_pg_anndata.to_df().iloc[:5, :5])
# display(sn_pg_anndata.obs.head())
# display(sn_pg_anndata.var.head())
Protein data from peptide spectrum match tables#
You can similarly extract the protein intensity information from PSM tables with the alphapepttools.io.read_psm_table function. Note that the information from the pg tables and psm tables might differ.
When might you use PSM tables instead of PG tables?
You might choose PSM tables when you need custom protein quantification strategies that differ from the search engine’s default approach,
when you want to perform quality control at the precursor level to identify problematic peptides,
when you’re integrating protein data with peptide-level analyses,
or when protein group tables aren’t available from your search engine.
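As an illustration of the first point, a custom protein quantification can be as simple as summing precursor intensities per protein and sample. The sketch below uses invented column names (`Run`, `Protein`, `Precursor`, `Intensity`) and invented values; summing is a deliberately simplistic stand-in and not the algorithm used by any particular search engine or by alphapepttools.

```python
import pandas as pd

# Hypothetical precursor-level table; all names and values are made up
psm = pd.DataFrame(
    {
        "Run": ["file_1.raw"] * 3 + ["file_2.raw"] * 3,
        "Protein": ["PROT1", "PROT1", "PROT2"] * 2,
        "Precursor": ["PEPTIDEK1", "PEPTIDERK2", "PEPTIDR2"] * 2,
        "Intensity": [10.0, 20.0, 30.0, 11.0, 21.0, 31.0],
    }
)

# Sum precursor intensities per run and protein, then spread the proteins
# over columns to obtain a samples x proteins matrix
protein_matrix = (
    psm.groupby(["Run", "Protein"])["Intensity"].sum().unstack("Protein")
)
print(protein_matrix)
```

Replacing `.sum()` with another aggregation such as `.median()` changes the quantification strategy without touching the rest of the code.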
We again explore examples of PSM reports from different search engines.
AlphaDIA#
data_path = at.data.get_data("bader2020_psm_alphadia")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/alphadia does not yet exist
alphadia.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/alphadia successfully downloaded (8.177705764770508 MB)
ad_psm_anndata = at.io.read_psm_table(
file_paths=data_path / "top20_precursors.tsv",
search_engine="alphadia",
)
ad_psm_anndata
AnnData object with n_obs × n_vars = 60 × 19
# Inspect the data
display(ad_psm_anndata.to_df().iloc[:5, :5])
display(ad_psm_anndata.obs.head())
display(ad_psm_anndata.var.head())
| proteins | P01009;A0A024R6N5 | P01011;A0A087WY93;G3V3A0;G3V595 | P01024;A0A8Q3SI05;A0A8Q3SI34;A0A8Q3WM02 | P02675;D6REL8 | P02787 |
|---|---|---|---|---|---|
| raw_name | |||||
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01 | NaN | NaN | 5.122159e+09 | NaN | 1.383408e+11 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02 | 3.942457e+09 | 1.895097e+09 | 7.004725e+09 | 5.359053e+09 | 1.166750e+11 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03 | 3.670290e+09 | 1.466623e+09 | 1.070513e+10 | 4.963886e+09 | 1.096607e+11 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04 | 5.022584e+09 | 1.653896e+09 | 6.425221e+09 | 5.095784e+09 | 1.073282e+11 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05 | 3.314550e+09 | 1.044103e+09 | 5.690721e+09 | 4.237927e+09 | 8.503823e+10 |
| raw_name |
|---|
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05 |
| proteins |
|---|
| P01009;A0A024R6N5 |
| P01011;A0A087WY93;G3V3A0;G3V595 |
| P01024;A0A8Q3SI05;A0A8Q3SI34;A0A8Q3WM02 |
| P02675;D6REL8 |
| P02787 |
DIANN#
data_path = at.data.get_data("bader2020_psm_diann")
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/diann does not yet exist
diann.zip successfully unzipped
/Users/lucas-diedrich/Documents/Projects/scverse/alphatools/programming/alphatools/docs/notebooks/diann successfully downloaded (1.0407352447509766 MB)
diann_psm_anndata = at.io.read_psm_table(
file_paths=data_path / "top20_report.parquet",
search_engine="diann",
)
# Inspect the data
display(diann_psm_anndata.to_df().iloc[:5, :5])
display(diann_psm_anndata.obs.head())
display(diann_psm_anndata.var.head())
| proteins | P01011 | P02649 | P02671 | P02765 | P02768 |
|---|---|---|---|---|---|
| raw_name | |||||
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01 | 9.798148e+08 | 6.036072e+09 | 129011936.0 | 1.455730e+09 | 1.929707e+10 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02 | 1.396344e+09 | 7.632371e+09 | 204724768.0 | 1.515461e+09 | 2.287499e+10 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03 | 1.025414e+09 | 4.464353e+09 | 204250576.0 | 9.584196e+08 | 1.264076e+10 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04 | 1.272798e+09 | 6.688002e+09 | 192892832.0 | 1.185967e+09 | 1.557733e+10 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05 | 8.905130e+08 | 4.973349e+09 | 206050384.0 | 1.481705e+09 | 1.579670e+10 |
| raw_name |
|---|
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA01 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA02 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA03 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA04 |
| 20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM1_sampleA05 |
| proteins |
|---|
| P01011 |
| P02649 |
| P02671 |
| P02765 |
| P02768 |
Spectronaut#
# data_path = at.data.get_data("spectronaut_psm")
# sn_psm_anndata = at.io.read_psm_table(
# file_paths=data_path / "example_dataset_mouse_sn_top20peptides.tsv",
# search_engine="spectronaut",
# intensity_column="F.PeakArea",
# feature_id_column="FG.Id",
# )
# # Inspect the data
# display(sn_psm_anndata.to_df().iloc[:5, :5])
# display(sn_psm_anndata.obs.head())
# display(sn_psm_anndata.var.head())
In Summary, we learned…#
How AnnData can help us keep our data and metadata aligned
How to generate AnnData objects from our input dataframes
How to get dataframes back from AnnData objects
How to load protein tables using alphapepttools readers
How to load and pivot PSM tables using alphapepttools readers
From here on, proteomics data can be used for all further downstream analyses. Continue with the basic workflow notebook to learn how to merge your metadata into AnnData objects and perform filtering and exploratory data analysis!