The algorithms for label-free quantification are motivated by the MaxLFQ paper. The main goal is to derive relative protein intensities that can be used for downstream analyses. In a first step, constant normalization coefficients are derived for each run. In a second step, pseudointensities are derived for each protein so that different conditions can be compared.
Delayed Normalization
Delayed normalization describes the process of normalizing away the differences that arise from prefractionation as well as from sample handling. For each sample, a constant scaling factor is derived by minimizing the term \[H(\vec{N}) = \sum_{P \in \mathrm{peptides}} \; \sum_{(A,B) \in \text{sample pairs}} \left|\log\frac{I(N_A, P, A)}{I(N_B, P, B)}\right|, \] with peptide intensities \(I\), which are determined by the peptide \(P\), the sample \(A\) or \(B\), and the normalization factors \(N_A\) and \(N_B\). In principle, \(H(\vec{N})\) quantifies the variation of peptides over the samples. Minimizing this variation gives appropriate scaling factors under the assumption that most peptides do not change between the samples. Peptide intensities for fractionated samples are described as the sum of the intensities over the fractions, with fraction-specific normalization factors. Therefore, the calculation of the summed intensities is delayed until the normalization is finished.
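As a minimal sketch of this objective (an illustration, not the actual implementation; the log-ratio form and the handling of missing values are assumptions), the unfractionated case can be written as a function of the per-sample normalization factors:

```python
import numpy as np

def normalization_error(norm_factors: np.ndarray, intensities: np.ndarray) -> float:
    """Sum of absolute log-ratios of normalized peptide intensities over all
    sample pairs, i.e. a simple H(N) for unfractionated data.

    intensities: array of shape (n_peptides, n_samples); zeros mark missing values.
    norm_factors: one scaling factor per sample.
    """
    normalized = intensities * norm_factors  # scale each sample (column)
    n_samples = intensities.shape[1]
    total = 0.0
    for a in range(n_samples):
        for b in range(a + 1, n_samples):
            shared = (normalized[:, a] > 0) & (normalized[:, b] > 0)
            total += np.sum(np.abs(np.log(normalized[shared, a] / normalized[shared, b])))
    return total

# Two peptides measured in two samples, where the second sample is systematically 2x higher:
intensities = np.array([[1.0e6, 2.0e6],
                        [3.0e5, 6.0e5]])
print(normalization_error(np.array([1.0, 1.0]), intensities))  # ~1.39, systematic shift present
print(normalization_error(np.array([1.0, 0.5]), intensities))  # ~0.0, shift removed
```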
In Silico Test Data
To test the delayed normalization approach, we create an in silico test dataset with a known ground truth. We therefore know which systematic changes exist between the samples, and we can employ different solvers to recover the normalization parameters.
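A possible way to construct such a dataset is sketched below (shapes, parameter ranges, and the noise model are assumptions for illustration; the functions documented next additionally model Gaussian elution profiles across runs, which this sketch omits):

```python
import numpy as np

def simulate_peptide_profiles(n_peptides: int, n_samples: int, keep: float = 0.8, seed: int = 42):
    """Simulate a peptide x sample intensity matrix with a known ground truth.

    Returns the distorted intensities together with the true per-sample shift
    factors that a normalization procedure should recover (up to a global scale).
    """
    rng = np.random.default_rng(seed)
    base = 10 ** rng.uniform(5, 9, size=n_peptides)           # peptide base intensities
    true_factors = 2 ** rng.uniform(-1, 1, size=n_samples)    # systematic per-sample shifts
    noise = 2 ** rng.normal(0, 0.05, size=(n_peptides, n_samples))
    intensities = base[:, None] * true_factors[None, :] * noise
    keep_mask = rng.random((n_peptides, n_samples)) < keep    # simulate missing values
    intensities[~keep_mask] = 0.0
    return intensities, true_factors

profiles, true_factors = simulate_peptide_profiles(n_peptides=500, n_samples=4)
```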
Calculates normally distributed probability densities along an input array.
Args: mu (float): mean of the normal distribution. sigma (float): standard deviation of the normal distribution. grid (np.ndarray): input array np.int[:]. For each element of the array, the probability density is calculated.
Returns: np.ndarray: probability density array, np.float[:].
Args: timepoint (float): coordinate of the peak apex. sigma (float): standard deviation of the gaussian. n_runs (int): number of points along which the density is calculated.
Returns: np.ndarray: probability density array, np.float[:].
Args: n_peptides (int): number of peptides to be simulated. n_runs (int): number of runs to be simulated. n_samples (int): number of samples to be simulated. threshold (float, optional): threshold below which a simulated intensity will be discarded. Defaults to 0.2. use_noise (bool, optional): add simulated noise to the profile values. Defaults to True.
Returns: Tuple[np.ndarray, np.ndarray]: profiles: np.float[:,:,:] array containing the simulated profiles, true_normalization: np.float[:,:,:] array containing the ground truth.
Computes the summed peptide errors over the whole dataset.
Args: normalization (np.ndarray): per sample normalization factors. profiles (np.ndarray): peptide intensity profiles over the dataset.
Returns: float: summed peptide error.
Benchmarking different optimizers
The normalization step is in principle a quadratic minimization over the normalization factors. Such minimization problems can be solved in various ways, and a variety of approaches are available in community Python packages. We compare different solvers on our benchmarking set and find substantial differences in precision and runtime. In our benchmarking, the Sequential Least Squares Programming (SLSQP) approach is a robust solution that also gives substantial speed improvements.
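Such a comparison can be reproduced along the following lines with SciPy (a self-contained sketch; the benchmark dataset, objective, and solver settings here are simplified assumptions and not the actual benchmark):

```python
import time
import numpy as np
from scipy.optimize import minimize

# Simulated data with a known per-sample shift (no missing values, for brevity).
rng = np.random.default_rng(0)
n_peptides, n_samples = 500, 6
base = 10 ** rng.uniform(5, 9, n_peptides)
true_factors = 2 ** rng.uniform(-1, 1, n_samples)
intensities = base[:, None] * true_factors[None, :] * 2 ** rng.normal(0, 0.05, (n_peptides, n_samples))

def objective(norm_factors):
    """Sum of absolute log-ratios over all sample pairs (cf. H(N) above)."""
    normalized = intensities * norm_factors
    total = 0.0
    for a in range(n_samples):
        for b in range(a + 1, n_samples):
            total += np.sum(np.abs(np.log(normalized[:, a] / normalized[:, b])))
    return total

x0 = np.ones(n_samples)
bounds = [(0.01, 100.0)] * n_samples
# The objective is invariant under a global scale, so fix it for SLSQP via a constraint.
scale_constraint = {"type": "eq", "fun": lambda x: np.sum(x) - n_samples}

for method in ["SLSQP", "L-BFGS-B", "Nelder-Mead"]:
    kwargs = {"constraints": [scale_constraint]} if method == "SLSQP" else {}
    start = time.time()
    res = minimize(objective, x0, method=method, bounds=bounds, **kwargs)
    balance = res.x * true_factors  # ~constant across samples if the shift was removed
    print(f"{method:12s} {time.time() - start:6.3f} s   H = {res.fun:10.1f}   "
          f"max/min(N*true) = {balance.max() / balance.min():.3f}")
```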
Returns normalization factors for given peptide intensities. If the solver does not converge, the unnormalized data will be used.
Args: df (pd.DataFrame): alphapept quantified features table. field (str, optional): The column in df containing the quantitative peptide information (i.e. precursor intensities). minimum_occurence (int, optional): minimum number of replicates the peptide must be observed in. Defaults to None.
Returns: [pd.DataFrame, np.ndarray]: pd.DataFrame: alphapept quantified features table extended with the normalized intensities, np.ndarray: normalized intensities
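A usage sketch of this wrapper (hedged: the function name delayed_normalization, the import path, and the column names of the toy table are assumptions based on the description above):

```python
import pandas as pd
# Assumed name and import path for the wrapper described above.
from alphapept.quantification import delayed_normalization

# Minimal toy features table; real tables contain many more columns, and the
# column names used here ('filename', 'precursor', 'int_sum') are assumptions.
df = pd.DataFrame({
    "filename":  ["A", "A", "B", "B"],
    "precursor": ["PEPTIDEK_2", "SEQWENCER_2", "PEPTIDEK_2", "SEQWENCER_2"],
    "int_sum":   [1.0e6, 2.0e6, 2.1e6, 4.2e6],
})

df_norm, normalized = delayed_normalization(df, field="int_sum")
# df_norm is the input table extended with the normalized intensities;
# 'normalized' holds the corresponding intensity array.
```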
Protein intensity profiles are constructed for each protein individually. All possible protein fold changes between the samples are derived from the median peptide fold changes. Subsequently, pseudointensities are chosen such that the fold changes between the pseudointensities optimally reconstruct the observed fold changes. Similar to the delayed normalization, this is formulated as a quadratic minimization, which we solve with the SLSQP solver.
Codewise, we start by simulating in-silico test data to serve as a ground truth for assessing solvers for the optimization problem. For the algorithmic optimization, we define the function get_protein_ratios, which allows us to quickly calculate the protein ratios. Next, we define an error function, triangle_error, that we use for the optimization problem. Lastly, we provide several wrapper functions to access these functions.
In-silico test data
Create a simulated input dataset of peptide intensities.
Args: n_sequences (int): number of peptides to simulate. n_samples (int): number of samples to simulate. noise (bool, optional): add random signal to distort the simulated intensity levels. Defaults to True. remove (bool, optional): remove intensities (i.e. add missing values). Defaults to True. peptide_ratio (bool, optional): simulate different peptide intensities. Defaults to True. abundance (bool, optional): simulate different abundances for each sample (i.e. systematic shifts). Defaults to True. signal_level (int, optional): signal level for simulated intensity. Defaults to 100. noise_divider (int, optional): the factor through which the noise is divided (higher factor -> higher signal to noise). Defaults to 10. keep (float, optional): aimed-at fraction of non-missing values, applies if ‘remove’ is set. Defaults to 0.8.
Returns: [pd.DataFrame, list, np.ndarray]: pd.DataFrame: simulated dataset with peptide intensities, list: sample names, np.ndarray: shift factors of each sample
Determine pair-wise intensity ratios
The pair-wise protein ratios are determined from the median peptide ratio.
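The following sketch illustrates the idea behind get_protein_ratios (a simplified illustration; the array layout and the handling of missing values are assumptions, and the actual implementation may differ):

```python
import numpy as np
from itertools import combinations

def median_ratio_matrix(signal: np.ndarray, minimum_ratios: int = 1) -> np.ndarray:
    """Pairwise protein fold changes from median peptide ratios (sketch).

    signal: array of shape (n_samples, n_peptides) with the peptide intensities
    of one protein; zeros mark missing values.
    Returns a matrix R where R[a, b] is the median of signal[b] / signal[a] over
    peptides quantified in both samples (NaN if fewer than `minimum_ratios`
    shared peptides are available).
    """
    n_samples = signal.shape[0]
    ratios = np.full((n_samples, n_samples), np.nan)
    for a, b in combinations(range(n_samples), 2):
        shared = (signal[a] > 0) & (signal[b] > 0)
        if shared.sum() >= minimum_ratios:
            ratios[a, b] = np.median(signal[b, shared] / signal[a, shared])
    return ratios

# Three peptides of one protein in three samples (rows = samples):
signal = np.array([[1.0e6, 2.0e6, 0.0],
                   [2.0e6, 4.2e6, 3.0e5],
                   [4.0e6, 8.0e6, 6.3e5]])
print(median_ratio_matrix(signal))
```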
Calculates the protein ratios between samples for one protein.
Args: signal (np.ndarray): np.array[:,:] containing peptide intensities for each sample. column_combinations (list): list of all index combinations to compare (usually all sample combinations). minimum_ratios (int, optional): minimum number of peptide ratios necessary to calculate a protein ratio. Defaults to 1.
Returns: np.ndarray: np.array[:,:] matrix comparing the ratios for all column combinations.
Error Function
The error function evaluates the difference between the actual observed fold change and the fold change that is derived from the pseudointensities.
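A sketch of this idea (the exact error definition and solver settings of the actual triangle_error and solver wrapper may differ): given a pairwise ratio matrix as sketched above, candidate pseudointensities imply their own ratio matrix, and the squared deviation between the two is minimized with SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

def ratio_error(pseudo: np.ndarray, observed_ratios: np.ndarray) -> float:
    """Squared deviation between the observed pairwise ratios and the ratios
    implied by the candidate pseudointensities (NaN entries are ignored)."""
    implied = pseudo[None, :] / pseudo[:, None]   # implied[a, b] = pseudo[b] / pseudo[a]
    return np.nansum((observed_ratios - implied) ** 2)

# Toy ratio matrix for three samples: the second sample is ~2x the first, the third ~4x.
observed = np.array([
    [np.nan, 2.0,    4.0],
    [np.nan, np.nan, 2.1],
    [np.nan, np.nan, np.nan],
])

res = minimize(ratio_error, x0=np.ones(3), args=(observed,),
               method="SLSQP", bounds=[(1e-6, None)] * 3)
pseudo = res.x / res.x[0]   # remove the arbitrary overall scale
print(pseudo)               # approximately [1, 2, 4]
```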
Calculates protein pseudointensities with a specified solver.
Args: ratios (np.ndarray): np.array[:,:] matrix containing all estimated protein ratios between samples. method (str): string specifying which solver to use.
Raises: NotImplementedError: if the solver is not implemented.
Returns: [np.ndarray, bool]: np.ndarray: the protein pseudointensities, bool: whether the solver was successful.
Function to extract optimal protein ratios for a given input of peptides.
Note for the chunk argument: This construction is needed to call this function from a parallel pool.
Args: files (list): A list of files for which the profile shall be extracted. minimum_ratios (int): A minimum number of peptide ratios to be considered for optimization. chunk (tuple[pd.DataFrame, str]): A pandas dataframe with the peptide information and a string to identify the protein.
Args: df (pd.DataFrame): Feature table by alphapept. minimum_ratios (int): Minimum number of peptide ratios necessary to derive a protein ratio. field (str): The field containing the quantitative peptide information (i.e. precursor intensities). callback ([type], optional): Callback function. Defaults to None.
Returns: pd.DataFrame: table containing the LFQ intensities of each protein in each sample.
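A usage sketch for this dataframe-level wrapper (hedged: the function name protein_profile_parallel, the import path, and the column names of the toy table are assumptions; only the arguments follow the description above):

```python
import pandas as pd
# Assumed name and import path for the wrapper described above.
from alphapept.quantification import protein_profile_parallel

# Toy feature table; the column names ('protein_group', 'filename', 'precursor',
# 'int_sum') are assumptions for illustration.
df = pd.DataFrame({
    "protein_group": ["P1"] * 4 + ["P2"] * 4,
    "filename":      ["A", "B", "A", "B"] * 2,
    "precursor":     ["pep1_2", "pep1_2", "pep2_2", "pep2_2",
                      "pep3_2", "pep3_2", "pep4_2", "pep4_2"],
    "int_sum":       [1e6, 2e6, 3e6, 6e6, 5e5, 5e5, 7e5, 7e5],
})

protein_lfq = protein_profile_parallel(df, minimum_ratios=1, field="int_sum")
print(protein_lfq)  # one row per protein, one LFQ intensity column per sample
```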