Example of PELSA differential expression with `alphapepttools`#

An example dataset: The original PELSA publication#

PELSA [1] is a novel method to investigate protein-ligand interactions through limited proteolysis. Cell lysate is treated with a short pulse of trypsin at extremely high (1:2) enzyme - substrate ratio, which allows for the digestion of surface exposed peptides. If a ligand (such as a small molecule binder) is bound to the protein surface, it stablizes the surrounding protein region and digestion is momentarily slowed. When compared against a control without the ligand, the PELSA-stabilized peptides appear downregulated. We replicate the original publication’s analysis of Staurosporine, a pan-kinase binder, and visualize the regulation of kinase targets.

[1]: Li, Kejia, et al. “A peptide-centric local stability assay enables proteome-scale identification of the protein targets and binding regions of diverse ligands.” Nature Methods 22.2 (2025): 278-282.

import tempfile
import pandas as pd
import numpy as np
import logging
import alphapepttools as at  # import io, pp, pl, tl  # input/output, preprocessing, plotting and tools modules
from matplotlib.patches import Rectangle

logger = logging.getLogger(__name__)

Preparing the dataset using `alphapepttools` loaders and AnnData factory.#

After downloading the relevant files from the study’s PRIDE-repository (https://www.ebi.ac.uk/pride/archive/projects/PXD034606):

LKJ_20211007_480_Hela_stau_0uM_1.raw
LKJ_20211007_480_Hela_stau_0uM_2.raw
LKJ_20211007_480_Hela_stau_0uM_3.raw
LKJ_20211007_480_Hela_stau_0uM_4.raw
LKJ_20211007_480_Hela_stau_20uM_1.raw
LKJ_20211007_480_Hela_stau_20uM_2.raw
LKJ_20211007_480_Hela_stau_20uM_3.raw
LKJ_20211007_480_Hela_stau_20uM_4.raw

And processing them with DIANN 2.1.0, the report.parquet file was saved in our datashare alongside sample metadata. We have to extract protein-level intensities and precursor-level intensities from this dataset using the alphapepttools.io.AnnDataFactory class. The resulting AnnData object contains protein-group or precursor quantities and any number of feature-metadata columns (for example, protein groups may have genes as secondary annotation, precursors may have protein groups and genes as secondary annotation).

# Download the data using the alphapepttools data module
report_path = at.data.get_data("pelsa_report_diann", output_dir=tempfile.mkdtemp())
sample_metadata_path = at.data.get_data("pelsa_metadata", output_dir=tempfile.mkdtemp())

# Get the full report
full_report = pd.read_parquet(report_path)

# AnnDataFactory instance containing protein level data
adata_protein = at.io.read_psm_table(
    file_paths=report_path,
    search_engine="diann",
    level="proteins",
    var_columns=["genes"],
)

# AnnDataFactory instance containing gene level data
adata_gene = at.io.read_psm_table(
    file_paths=report_path,
    search_engine="diann",
    level="genes",
)

# AnnDataFactory instance containing precursor level data
adata_precursor = at.io.read_psm_table(
    file_paths=report_path,
    search_engine="diann",
    intensity_column="Precursor.Normalised",
    feature_id_column="Precursor.Id",
    sample_id_column="Run",
    var_columns=["proteins", "genes", "sequence"],
)

/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmp5j3sii7k/report.parquet does not yet exist
/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmp5j3sii7k/report.parquet successfully downloaded (70.8467435836792 MB)
/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpr1853dya/iteration_L_literature_reprocessing_PELSA_samplemap.csv does not yet exist
/var/folders/2l/hhd_z4hx3070zw8rlj4c3l940000gn/T/tmpr1853dya/iteration_L_literature_reprocessing_PELSA_samplemap.csv successfully downloaded (0.000179290771484375 MB)

We focus on Protein Groups and Precursors, and use `pp.add_metadata()` to merge the metadata into our AnnData instances#

# Transfer the run names to the sample metadata to account for lack of rawfile names in the metadata
sample_metadata = pd.read_csv(sample_metadata_path)
sample_metadata["sample"] = sample_metadata["plate"] + "_" + sample_metadata["running_count"].astype(str)

run_name_map = {"_".join(k.split("_")[-2:]): k for k in adata_protein.obs_names}
sample_metadata.index = sample_metadata["sample"].map(run_name_map)

# Add the metadata to all AnnData instances
adata_protein = at.pp.add_metadata(adata_protein, sample_metadata, axis=0)
adata_precursor = at.pp.add_metadata(adata_precursor, sample_metadata, axis=0)

We log2-transform our data, but keep the untransformed values in a separate layer#

adata_protein.layers["raw"] = adata_protein.X.copy()
adata_precursor.layers["raw"] = adata_precursor.X.copy()

at.pp.nanlog(adata_protein)
at.pp.nanlog(adata_precursor)

# Inspect the log2-transformed values:
display(adata_protein.to_df().iloc[:5, :5])

# Raw values:
display(adata_protein.to_df(layer="raw").iloc[:5, :5])

proteins	A0A096LP55;P07919	A0A0B4J2F0	A0A1B0GUY1	A0A3B3IS91	A0A3B3IU46;Q9BTL3
raw_name
LKJ_20211007_480_Hela_stau_0uM_1	21.790016	23.018673	24.692411	17.029541	23.745449
LKJ_20211007_480_Hela_stau_0uM_2	21.902254	23.051476	24.746101	17.110256	23.861660
LKJ_20211007_480_Hela_stau_0uM_3	21.277990	23.003662	24.555399	17.807335	23.645226
LKJ_20211007_480_Hela_stau_0uM_4	21.540596	23.048594	24.787121	16.149979	23.457727
LKJ_20211007_480_Hela_stau_20uM_1	21.918283	23.059902	24.633326	NaN	23.821888

proteins	A0A096LP55;P07919	A0A0B4J2F0	A0A1B0GUY1	A0A3B3IS91	A0A3B3IU46;Q9BTL3
raw_name
LKJ_20211007_480_Hela_stau_0uM_1	3626171.50	8497893.0	27111686.0	133783.515625	14063468.0
LKJ_20211007_480_Hela_stau_0uM_2	3919543.50	8693320.0	28139656.0	141481.781250	15243168.0
LKJ_20211007_480_Hela_stau_0uM_3	2542808.25	8409926.0	24655366.0	229372.937500	13119653.0
LKJ_20211007_480_Hela_stau_0uM_4	3050461.00	8675965.0	28951204.0	72715.656250	11520718.0
LKJ_20211007_480_Hela_stau_20uM_1	3963336.50	8744247.0	26023742.0	NaN	14828694.0

Perform basic EDA on protein and precursor data#

Starting with a panel for feature counts and CVs in each sample

# Small custom function to generate CVs
# In the future, this could become an alphapepttools core functionality
import anndata as ad


def make_group_cvs(
    adata: ad.AnnData,
    group_column: str,
) -> None:
    """Calculate CVs for each group in group_columns and store them in adata.var"""
    levels = adata.obs[group_column].unique()
    for level in levels:
        group_adata = at.pp.filter_by_metadata(adata, {group_column: level}, axis=0)
        stds = group_adata.to_df(layer="raw").std(axis=0)
        means = group_adata.to_df(layer="raw").mean(axis=0)
        adata.var[f"{level}_CV"] = stds / means


# Calculate CVs for the different groups in the condition column
make_group_cvs(adata_protein, "treatment")
make_group_cvs(adata_precursor, "treatment")

We use alphapepttools’ plotting syntax to generate a stylized panel with violin plots for sample group feature CVs, and to assess the median intensity of all features across samples.

fig, axm = at.pl.create_figure(2, 2, figsize=(12, 6), gridspec_kwargs={"width_ratios": [0.5, 3]})

# Define color dictionary for the different treatments
color_dict = {
    "CTRL": at.pl.BaseColors.get("red"),
    "Stauro": at.pl.BaseColors.get("blue"),
}

# Map the sample names to their respective colors based on treatment
sample_color_dict = {
    sample: color_dict[treatment]
    for sample, treatment in zip(adata_protein.obs_names, adata_protein.obs["treatment"], strict=False)
}

# Iterative plotting
plot_data = {
    "protein": adata_protein,
    "precursor": adata_precursor,
}
for readout, adata in plot_data.items():
    # start with the CV plots
    ax = axm.next()
    at.pl.Plots.violinplot(
        ax=ax,
        data=adata.T,
        direct_columns=["CTRL_CV", "Stauro_CV"],
        color_dict={f"{k}_CV": v for k, v in color_dict.items()},
    )
    ax.set_ylim(0, 1)

    at.pl.label_axes(
        ax=ax,
        xlabel="Treatment",
        ylabel=f"{readout.capitalize()} CV",
        title=f"{readout.capitalize()} CVs by treatment",
    )

    # Then do the log2 abundance plots
    ax = axm.next()
    at.pl.Plots.boxplot(
        ax=ax,
        data=adata.T,
        direct_columns=adata.obs_names.tolist(),
        color_dict=sample_color_dict,
    )

    # process x ticks to fit the plot better
    xtick_labels = [item.get_text() for item in ax.get_xticklabels()]  # Get xtick labels directly from axes
    xtick_labels = [
        "_".join(x.split("_")[-2:]) for x in xtick_labels
    ]  # Shorten the xtick labels to make them more readable
    _ = ax.set_xticklabels(xtick_labels)  # Set the new xtick labels

    at.pl.label_axes(
        ax=ax,
        xlabel="Sample",
        ylabel=f"{readout.capitalize()} log2 intensity",
        title=f"{readout.capitalize()} log2 intensities by sample",
    )

    # save figure
    at.pl.save_figure(
        fig=fig,  # Note that here we're using the whole figure to whose axes we added our plots
        filename="pelsa_quantities_overview.svg",
        output_dir="./example_outputs",
    )

../_images/e26803134cd3c4aff3434922060cc5d9e6263ac51b8ffa1dff5ac7ce84464f4c.png

PELSA analysis depends on differential expression#

Peptides from the stablized protein regions are expected to appear downregulated. We perform comparative differential expression analysis:

On the one hand, we evaluate “vanilla” independent two-sample t-test.
On the other, we employ alphapepttools’ wrapper of AlphaQuant [1], a differential expression method specially adapted for mass-spectrometry proteomics data.
Additionally, we examine a Python-implementation of the popular Limma package’s Empirical Bayes moderated ttest, which is more sophisticated than a simple ttest but considerably faster and more lightweight than alphaquant.

We want to retain the original AnnData object’s feature-level metadata in our differential expression results, which can be accomplished by joining the var object to the results.

[1]: Ammar, Constantin, et al. “Tree-based quantification infers proteoform regulation in bottom-up proteomics data.” bioRxiv (2025): 2025-03.

comparison = ("Stauro", "CTRL")

# Vanilla ttest on peptide data
ttest_peptide_results = at.tl.diff_exp_ttest(
    adata=adata_precursor,
    between_column="treatment",
    comparison=comparison,
    min_valid_values=3,
    equal_var=False,
)
ttest_peptide_results = ttest_peptide_results.join(adata_precursor.var, how="left")

Comparing groups: Stauro vs CTRL

# Run Limma Empirical Bayes moderated ttest on peptides: alphapepttools inmoose
ebayes_peptide_results = at.tl.diff_exp_ebayes(
    adata=adata_precursor,
    between_column="treatment",
    comparison=comparison,
)[1]
ebayes_peptide_results = ebayes_peptide_results.join(adata_precursor.var, how="left")

# AlphaQuant ttest on full search engine output (takes approx. 3-4 minutes on a M2 MacBook Pro. Set "REPROCESS" flag to True to re-run, otherwise cached data will be used)
REPROCESS = False

if REPROCESS:
    comparison_key, alphaquant_results = at.tl.diff_exp_alphaquant(
        adata=adata_precursor,
        report=full_report,
        between_column="treatment",
        comparison=comparison,
        valid_values_filter_mode="either",
        min_valid_values=3,
        plots="hide",
    )
    alphaquant_peptide_results = alphaquant_results["peptide"]
    alphaquant_peptide_results.to_pickle("./example_data/notebook_04_alphaquant_peptide_results.pkl")
else:
    logger.info(" --> Using cached AlphaQuant results")
    alphaquant_peptide_results = pd.read_pickle("./example_data/notebook_04_alphaquant_peptide_results.pkl")

Alphaquant results are saved into a dictionary based on the quantification level:

results
- protein level results
- proteoform level results
- peptide level results

We can use `alphapepttools` functionality to construct a volcanoplot function#

alphapepttools does not have a dedicated volcanoplotting function. However, it is straightforward to assemble one from the tools already implemented. In its basic form, a volcanoplot consists of:

figure canvas (pl.create_figure())
scatterplot (pl.Plots.scatter())
lines (pl.add_lines())
labels (pl.label_plot())
axis labels (pl.label_axes())

In the publication, peptides are aggregated into proteins in a nonstandard way: For each protein, the peptide with the highest significance is assigned its protein ID. This is how a dataset of peptides is converted into a dataset of proteins. We perform the same aggregation in our preprocessing.

ebayes_peptide_results

	condition_pair	protein	log2fc	p_value	-log10(p_value)	fdr	-log10(fdr)	method	max_level_1_samples	max_level_2_samples	stat	B	AveExpr	proteins	genes	sequence	CTRL_CV	Stauro_CV
(UniMod:1)AAAAAAAGDSDSWDADAFSVEDPVR3	Stauro_VS_CTRL	(UniMod:1)AAAAAAAGDSDSWDADAFSVEDPVR3	-0.779220	0.009479	2.023230	0.560184	0.251669	limma_ebayes_inmoose	4	4	-3.326301	-2.842216	24.208256	O75822	EIF3J	AAAAAAAGDSDSWDADAFSVEDPVR	0.348759	0.168788
(UniMod:1)AAAAAAAGDSDSWDADAFSVEDPVRK3	Stauro_VS_CTRL	(UniMod:1)AAAAAAAGDSDSWDADAFSVEDPVRK3	-0.163618	0.465420	0.332155	0.960454	0.017523	limma_ebayes_inmoose	4	4	-0.763940	-6.411124	20.255520	O75822	EIF3J	AAAAAAAGDSDSWDADAFSVEDPVRK	0.255562	0.243771
(UniMod:1)AAAAAAAPSGGGGGGEEERLEEK3	Stauro_VS_CTRL	(UniMod:1)AAAAAAAPSGGGGGGEEERLEEK3	0.131008	0.226355	0.645211	0.933111	0.030067	limma_ebayes_inmoose	4	4	1.303456	-5.869777	20.313845	P51608-2	MECP2	AAAAAAAPSGGGGGGEEERLEEK	0.116230	0.060557
(UniMod:1)AAAAAAGAASGLPGPVAQGLK3	Stauro_VS_CTRL	(UniMod:1)AAAAAAGAASGLPGPVAQGLK3	-0.001897	0.985646	0.006279	0.999811	0.000082	limma_ebayes_inmoose	4	4	-0.018523	-6.721547	22.979843	Q96P70	IPO9	AAAAAAGAASGLPGPVAQGLK	0.064267	0.109800
(UniMod:1)AAAAAAGEAR2	Stauro_VS_CTRL	(UniMod:1)AAAAAAGEAR2	0.120329	0.590239	0.228972	0.972137	0.012272	limma_ebayes_inmoose	4	4	0.559355	-6.552634	21.665287	P09417;P09417-2	QDPR	AAAAAAGEAR	0.314222	0.103197
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
YYVTIIDAPGHRDFIK3	Stauro_VS_CTRL	YYVTIIDAPGHRDFIK3	0.296327	0.030165	1.520504	0.743737	0.128581	limma_ebayes_inmoose	4	4	2.594342	-4.001476	21.721304	P68104	EEF1A1	YYVTIIDAPGHRDFIK	0.093635	0.124795
YYVTIIDAPGHRDFIK4	Stauro_VS_CTRL	YYVTIIDAPGHRDFIK4	0.047830	0.727330	0.138268	0.984444	0.006809	limma_ebayes_inmoose	4	4	0.360320	-6.650818	24.960939	P68104	EEF1A1	YYVTIIDAPGHRDFIK	0.138696	0.122280
YYYQLNSK2	Stauro_VS_CTRL	YYYQLNSK2	0.039897	0.719846	0.142760	0.983339	0.007297	limma_ebayes_inmoose	4	4	0.370719	-6.646698	23.494913	O00257	CBX4	YYYQLNSK	0.115134	0.081185
YYYSDNFFDGQR2	Stauro_VS_CTRL	YYYSDNFFDGQR2	-0.014498	0.928952	0.032007	0.997290	0.001179	limma_ebayes_inmoose	4	4	-0.091820	-6.717097	22.013996	Q9H4B6	SAV1	YYYSDNFFDGQR	0.152211	0.177059
YYYVQNVYTPVDEHVYPDHR3	Stauro_VS_CTRL	YYYVQNVYTPVDEHVYPDHR3	0.020215	0.833917	0.078877	0.990898	0.003971	limma_ebayes_inmoose	4	4	0.216166	-6.696086	23.499073	Q86TG7-4	PEG10	YYYVQNVYTPVDEHVYPDHR	0.040443	0.107898

68670 rows × 18 columns

def top_peptides_to_proteins(
    ttest_df: pd.DataFrame,
    y_score_column: str = "-log10(p_value)",
    id_column: str = "protein",
) -> pd.DataFrame:
    """Convert peptide level t-test results to protein level by taking the top peptide per protein."""
    ttest_df = ttest_df.copy()
    return ttest_df.sort_values(y_score_column, ascending=False).groupby(id_column).first().reset_index()


aq_protein_table = top_peptides_to_proteins(alphaquant_peptide_results, id_column="protein")
eb_protein_table = top_peptides_to_proteins(ebayes_peptide_results, id_column="genes")
tt_protein_table = top_peptides_to_proteins(ttest_peptide_results, id_column="genes")

Additionally, we have to annotate all kinases in the data in order to visualize them and perform kinase enrichment scoring. To do this, we load a table from Gerard et al. [1] (http://www.kinhub.org/kinases.html) and transfer the annotations to our data.

[1]: Manning, Gerard, et al. “The protein kinase complement of the human genome.” Science 298.5600 (2002): 1912-1934.

def annotate_kinases(
    ttest_df: pd.DataFrame,
    kinase_table_path: str,
    gene_id_column: str = "gene",
) -> list:
    """Assign str value for kinase/no kinase based on publication data:

    Manning, Gerard, et al. "The protein kinase complement of the human genome." Science 298.5600 (2002): 1912-1934.,
    table from: http://www.kinhub.org/kinases.html

    """
    ttest_df = ttest_df.copy()
    kinase_table = pd.read_excel(kinase_table_path)
    kinase_list = kinase_table["HGNC Name"].tolist()
    kinase_dict = dict(zip(kinase_list, ["kinase"] * len(kinase_list), strict=False))

    ttest_df["kinase_status"] = ttest_df[gene_id_column].map(kinase_dict).fillna("other")
    return ttest_df


# Download the kinase table using the alphapepttools data module
kinase_path = at.data.get_data("kinase_table")

tt_protein_table = annotate_kinases(tt_protein_table, kinase_path, "genes")
eb_protein_table = annotate_kinases(eb_protein_table, kinase_path, "genes")
aq_protein_table = annotate_kinases(aq_protein_table, kinase_path, "protein")

/Users/vincenthbrennsteiner/Documents/mann_labs/_git_repositories/alphapepttools/docs/notebooks/kinome_table.xlsx already exists (0.06418609619140625 MB)

The main readout is a custom Kinase Enrichment Score:#

In the main publication [1] and its follow-up [2], a kinase score is implemented to assess the sensitivity and specificity of the assay. This score derives from the fact that, in the ideal case we should only see kinase peptides downregulated. In practice, there are off-targets, i.e. peptides/proteins that appear as downregulated even though they are not kinases. The score is the ratio of downregulated (log2fc < 0) kinases/non-kinases at 80 % specificity when ranking by -log10(p_value).

Imagine a boundary moving down from the highest point in the left half of the volcanoplot. Each time it passes a point, we count if the point is a kinase or a non-kinase. When the ratio of passed kinases / passed non-kinases becomes smaller than 0.8 we state that 80 % specificity is reached. However many kinases we have up to this point is our score.

[1]: Li, Kejia, et al. “A peptide-centric local stability assay to unveil protein targets of diverse ligands.” bioRxiv (2023): 2023-10.

[2]: Li, Kejia, et al. “High-throughput peptide-centric local stability assay extends protein-ligand identification to membrane proteins, tissues, and bacteria.” bioRxiv (2025): 2025-04.

def kinase_score(
    ttest_df: pd.DataFrame,
    id_col: str,
    score_column: str = "-log10(p_value)",
    max_candidates: int = 500,
    threshold: float = 0.8,
) -> dict | None:
    """Perform kinase scoring analysis per HT-Pelsa.

    1. Sort peptides by -log10(significance) (score_column) descending
    2. Consider only regulated peptides, i.e. log2fc < 0
    3. Iterate through ranks, calculating the cumulative kinase fraction
         (number of kinases / total number of peptides)
    4. Stop when kinase fraction drops below min_kinase_pct or max_candidates is reached

    """
    ttest_df = ttest_df.copy()
    print(
        f"Total unique kinases in the dataset: {ttest_df.loc[ttest_df['kinase_status'] == 'kinase', id_col].nunique()}"
    )

    # Restrict to downregulated peptides
    ttest_df = ttest_df[ttest_df["log2fc"] < 0]
    print(
        f"Total unique kinases in the downregulated dataset: {ttest_df.loc[ttest_df['kinase_status'] == 'kinase', id_col].nunique()}"
    )

    # Rank peptides by score
    ttest_df = ttest_df.sort_values(score_column, ascending=False).reset_index(drop=True)
    ttest_df["rank"] = ttest_df.index + 1

    # Iterate through ranks and record the kinase fraction
    def _kinase_fraction(df: pd.DataFrame) -> float:
        kinase_count = df[df["kinase_status"] == "kinase"].shape[0]
        total_count = df.shape[0]
        return kinase_count / total_count if total_count > 0 else np.nan

    kinase_fractions = []
    for n in range(1, len(ttest_df) + 1):
        subset = ttest_df.head(n)
        kinase_fraction = _kinase_fraction(subset)
        score_threshold = subset[score_column].iloc[-1]
        kinase_fractions.append(
            {
                "rank": n,
                "kinase_fraction": kinase_fraction,
                "score_threshold": score_threshold,
                "n_kinases": subset[subset["kinase_status"] == "kinase"].shape[0],
                "n_total": subset.shape[0],
            }
        )

        if n >= max_candidates and kinase_fraction < threshold:
            break

    kinase_fraction_df = pd.DataFrame(kinase_fractions)

    # Determine score based on specificity threshold
    above_threshold = kinase_fraction_df[kinase_fraction_df["kinase_fraction"] >= threshold]
    optimal_row = above_threshold.iloc[-1] if not above_threshold.empty else None
    if optimal_row is None:
        print("No suitable threshold found.")
        return None

    return {
        "threshold": optimal_row["score_threshold"],
        "n_candidates": optimal_row["rank"],
        "n_kinases": optimal_row["n_kinases"],
        "n_non_kinases": optimal_row["n_total"] - optimal_row["n_kinases"],
        "kinase_percentage": optimal_row["kinase_fraction"],
    }

# uniform coloring
color_dict = {
    "kinase": at.pl.BaseColors.get("red"),
    "other": at.pl.BaseColors.get("grey"),
}

First, inspect the aggregated protein results for the standard t-test#

fig, axm = at.pl.create_figure(1, 1, figsize=(4, 4))
ax = axm.next()

at.pl.Plots.scatter(
    ax=ax,
    data=tt_protein_table,
    x_column="log2fc",
    y_column="-log10(p_value)",
    xlim=(-7, 7),
    color_map_column="kinase_status",
    color_dict=color_dict,
    legend="auto",
)

at.pl.add_lines(
    ax=ax,
    intercepts=[0],
    linetype="vline",
    linestyle="--",
)

at.pl.add_lines(
    ax=ax,
    intercepts=[-np.log10(0.05)],
    linetype="hline",
    linestyle="--",
)

at.pl.label_axes(
    ax=ax,
    xlabel="log2 fold change (Stauro vs CTRL)",
    ylabel="-log10 p-value",
    title="Literature data: Protein level t-test results",
)

# We draw a rectangle to indicate the kinase scoring threshold

ttest_ks_result = kinase_score(tt_protein_table, id_col="genes", threshold=0.8)
if ttest_ks_result:
    rect = Rectangle(
        xy=(-7, ttest_ks_result["threshold"]),  # Bottom-left corner
        width=7,  # Width to cover from -7 to 0 (downregulated region)
        height=tt_protein_table["-log10(p_value)"].max()
        + 0.5
        - ttest_ks_result["threshold"],  # Height with small padding
        linewidth=2,
        facecolor=at.pl.BaseColors.get("green", alpha=0.2),
    )
    print(
        f"Kinase scoring threshold at -log10(p_value) = {ttest_ks_result['threshold']:.2f}, covering {ttest_ks_result['n_candidates']} proteins with {ttest_ks_result['n_kinases']} kinases ({ttest_ks_result['kinase_percentage'] * 100:.1f}%)"
    )
    ax.add_patch(rect)

# save figure
at.pl.save_figure(
    fig=fig,  # Note that here we're using the whole figure to whose axes we added our plots
    filename="Literature_PELSA_TTEST_AggProt_Stauro.svg",
    output_dir="./example_outputs",
)

Total unique kinases in the dataset: 276
Total unique kinases in the downregulated dataset: 215
Kinase scoring threshold at -log10(p_value) = 3.34, covering 108.0 proteins with 87.0 kinases (80.6%)

../_images/4ba5b042f6c266e41aa49ce06968a7201d77ba64c1ef750435f83f850a96d505.png

Next, inspect the protein-aggregated results from the Empirical Bayes moderated ttest (Limma-like)#

fig, axm = at.pl.create_figure(1, 1, figsize=(4, 4))
ax = axm.next()

at.pl.Plots.scatter(
    ax=ax,
    data=eb_protein_table,
    x_column="log2fc",
    y_column="-log10(p_value)",
    xlim=(-7, 7),
    color_map_column="kinase_status",
    color_dict=color_dict,
    legend="auto",
)

at.pl.add_lines(
    ax=ax,
    intercepts=[0],
    linetype="vline",
    linestyle="--",
)

at.pl.add_lines(
    ax=ax,
    intercepts=[-np.log10(0.05)],
    linetype="hline",
    linestyle="--",
)

at.pl.label_axes(
    ax=ax,
    xlabel="log2 fold change (Stauro vs CTRL)",
    ylabel="-log10 p-value",
    title="Literature data: Protein level Empirical Bayes t-test results",
)

# We draw a rectangle to indicate the kinase scoring threshold

ebayes_ttest_ks_results = kinase_score(eb_protein_table, id_col="genes", threshold=0.8)
if ebayes_ttest_ks_results:
    rect = Rectangle(
        xy=(-7, ebayes_ttest_ks_results["threshold"]),  # Bottom-left corner
        width=7,  # Width to cover from -7 to 0 (downregulated region)
        height=eb_protein_table["-log10(p_value)"].max()
        + 0.5
        - ebayes_ttest_ks_results["threshold"],  # Height with small padding
        linewidth=2,
        facecolor=at.pl.BaseColors.get("green", alpha=0.2),
    )
    print(
        f"Kinase scoring threshold at -log10(p_value) = {ebayes_ttest_ks_results['threshold']:.2f}, covering {ebayes_ttest_ks_results['n_candidates']} proteins with {ebayes_ttest_ks_results['n_kinases']} kinases ({ebayes_ttest_ks_results['kinase_percentage'] * 100:.1f}%)"
    )
    ax.add_patch(rect)

# save figure
at.pl.save_figure(
    fig=fig,  # Note that here we're using the whole figure to whose axes we added our plots
    filename="Literature_PELSA_EBAYES_TTEST_AggProt_Stauro.svg",
    output_dir="./example_outputs",
)

Total unique kinases in the dataset: 253
Total unique kinases in the downregulated dataset: 197
Kinase scoring threshold at -log10(p_value) = 3.57, covering 117.0 proteins with 94.0 kinases (80.3%)

../_images/8f074c2dcf417db4454df9bc95a4f5c215279ce5cc27e2f72b37d1d0b11f9a99.png

Next, inspect the protein-aggregated results from Alphaquant#

import numpy as np

fig, axm = at.pl.create_figure(1, 1, figsize=(4, 4))
ax = axm.next()

at.pl.Plots.scatter(
    ax=ax,
    data=aq_protein_table,
    x_column="log2fc",
    y_column="-log10(p_value)",
    xlim=(-7, 7),
    color_map_column="kinase_status",
    color_dict=color_dict,
    legend="auto",
)

at.pl.add_lines(
    ax=ax,
    intercepts=[0],
    linetype="vline",
    linestyle="--",
)

at.pl.add_lines(
    ax=ax,
    intercepts=[-np.log10(0.05)],
    linetype="hline",
    linestyle="--",
)

at.pl.label_axes(
    ax=ax,
    xlabel="log2 fold change (Stauro vs CTRL)",
    ylabel="-log10 p-value",
    title="Literature data: Protein level AlphaQuant results",
)

# We draw a rectangle to indicate the kinase scoring threshold
aq_ks_result = kinase_score(aq_protein_table, id_col="protein", threshold=0.8)
if aq_ks_result:
    rect = Rectangle(
        xy=(-7, aq_ks_result["threshold"]),  # Bottom-left corner
        width=7,  # Width to cover from -7 to 0 (downregulated region)
        height=aq_protein_table["-log10(p_value)"].max() + 0.5 - aq_ks_result["threshold"],  # Height with small padding
        linewidth=2,
        facecolor=at.pl.BaseColors.get("green", alpha=0.2),
    )
    print(
        f"Kinase scoring threshold at -log10(p_value) = {aq_ks_result['threshold']:.2f}, covering {aq_ks_result['n_candidates']} proteins with {aq_ks_result['n_kinases']} kinases ({aq_ks_result['kinase_percentage'] * 100:.1f}%)"
    )
    ax.add_patch(rect)

# save figure
at.pl.save_figure(
    fig=fig,  # Note that here we're using the whole figure to whose axes we added our plots
    filename="Literature_PELSA_AlphaQuant_AggProt_Stauro.svg",
    output_dir="./example_outputs",
)

Total unique kinases in the dataset: 276
Total unique kinases in the downregulated dataset: 215
Kinase scoring threshold at -log10(p_value) = 3.40, covering 137.0 proteins with 110.0 kinases (80.3%)

../_images/54f9ed25b249d0fb5d92b0271f60c38572b6afbdecf843d64bdb2e04fc4cdae9.png

Visualize the kinase enrichment results compared to literature data#

In Figure 2 b of Kejia, Li et al., the number of downregulated vs. kinase proteins at 80 % specificity (as computed above) is given at 135 downregulated candidates and 108 kinase targets. Note that their result were computed using an Empirical Bayes test. We can visualize this data and compare to our alphapepttools differential expression data:

total_candidates = pd.DataFrame(
    {
        "group": ["EB [Lit]", "alphapepttools t-test", "alphapepttools EBayes t-test", "alphapepttools\nAlphaQuant"],
        "value": [
            135,
            ttest_ks_result["n_candidates"],
            ebayes_ttest_ks_results["n_candidates"],
            aq_ks_result["n_candidates"],
        ],
    }
)

total_kinases = pd.DataFrame(
    {
        "group": ["EB [Lit]", "alphapepttools t-test", "alphapepttools EBayes t-test", "alphapepttools\nAlphaQuant"],
        "value": [108, ttest_ks_result["n_kinases"], ebayes_ttest_ks_results["n_kinases"], aq_ks_result["n_kinases"]],
    }
)

color_dict = {
    "candidates": at.pl.BaseColors.get("green"),
    "kinases": at.pl.BaseColors.get("red", lighten=0.1, alpha=0.8),
}

fig, axm = at.pl.create_figure(1, 1, figsize=(6, 4))

# First, plot the total candidates
ax = axm.next()
ax.set_ylim(0, 175)

at.pl.Plots.barplot(
    ax=ax,
    data=total_candidates,
    grouping_column="group",
    value_column="value",
    color=at.pl.BaseColors.get("green"),
)

at.pl.label_plot(
    ax=ax,
    x_values=range(1, len(total_candidates) + 1),
    y_values=total_candidates["value"].tolist(),
    labels=total_candidates["value"].astype(str).tolist(),
    label_kwargs={"va": "bottom", "ha": "center"},
)

# Then, add the kinases
at.pl.Plots.barplot(
    ax=ax,
    data=total_kinases,
    grouping_column="group",
    value_column="value",
    color=at.pl.BaseColors.get("red", lighten=0.3),
)

at.pl.label_plot(
    ax=ax,
    x_values=range(1, len(total_kinases) + 1),
    y_values=total_kinases["value"].tolist(),
    labels=total_kinases["value"].astype(str).tolist(),
    label_kwargs={"va": "bottom", "ha": "center"},
)

at.pl.label_axes(
    ax=ax,
    xlabel="",
    ylabel="Number of candidates",
    title="Kinase enrichment candidates",
)

at.pl.add_legend_to_axes(
    ax=ax,
    levels=color_dict,
)

# save figure
at.pl.save_figure(
    fig=fig,  # Note that here we're using the whole figure to whose axes we added our plots
    filename="Literature_PELSA_Kinase_Enrichment_Candidates_Comparison.svg",
    output_dir="./example_outputs",
)

../_images/bfefae5358740cdac32ad267eb21de6182b59a623ecd1841394d4ff894a143c7.png

In summary, we saw that `alphapepttools`…#

can swiftly load precursor and protein data into AnnData instances
contains standardized, ‘plug and play’ modules for proteomics differential expression
Equals/outperforms the empirical Bayes approach from literature, demonstrating that we have a strong framework for analysing PELSA-data in-house using AlphaQuant.

Example of PELSA differential expression with alphapepttools

Contents

Example of PELSA differential expression with `alphapepttools`#

An example dataset: The original PELSA publication#

Preparing the dataset using `alphapepttools` loaders and AnnData factory.#

We focus on Protein Groups and Precursors, and use `pp.add_metadata()` to merge the metadata into our AnnData instances#

We log2-transform our data, but keep the untransformed values in a separate layer#

Perform basic EDA on protein and precursor data#

PELSA analysis depends on differential expression#

We can use `alphapepttools` functionality to construct a volcanoplot function#

The main readout is a custom Kinase Enrichment Score:#

First, inspect the aggregated protein results for the standard t-test#

Next, inspect the protein-aggregated results from the Empirical Bayes moderated ttest (Limma-like)#

Next, inspect the protein-aggregated results from Alphaquant#

Visualize the kinase enrichment results compared to literature data#

In summary, we saw that `alphapepttools`…#

Example of PELSA differential expression with alphapepttools

Contents

Example of PELSA differential expression with alphapepttools#

An example dataset: The original PELSA publication#

Preparing the dataset using alphapepttools loaders and AnnData factory.#

We focus on Protein Groups and Precursors, and use pp.add_metadata() to merge the metadata into our AnnData instances#

We log2-transform our data, but keep the untransformed values in a separate layer#

Perform basic EDA on protein and precursor data#

PELSA analysis depends on differential expression#

We can use alphapepttools functionality to construct a volcanoplot function#

The main readout is a custom Kinase Enrichment Score:#

First, inspect the aggregated protein results for the standard t-test#

Next, inspect the protein-aggregated results from the Empirical Bayes moderated ttest (Limma-like)#

Next, inspect the protein-aggregated results from Alphaquant#

Visualize the kinase enrichment results compared to literature data#

In summary, we saw that alphapepttools…#

Example of PELSA differential expression with `alphapepttools`#

Preparing the dataset using `alphapepttools` loaders and AnnData factory.#

We focus on Protein Groups and Precursors, and use `pp.add_metadata()` to merge the metadata into our AnnData instances#

We can use `alphapepttools` functionality to construct a volcanoplot function#

In summary, we saw that `alphapepttools`…#