The algorithms for label-free quantification are motivated by the MaxLFQ paper. The main goal is to derive relative protein intensities that can be used for downstream analyses. In a first step, constant normalization coefficients are derived for each run. In a second step, pseudointensities are derived for each protein so that different conditions can be compared.
Delayed Normalization
Delayed normalization describes the process of normalizing away the differences that arise from prefractionation as well as from sample handling. For each sample, a constant scaling factor is derived by minimizing the term \[H(\vec{N}) = \sum_{P \in \mathrm{peptides}} \; \sum_{(A,B) \in \text{sample pairs}} \left|\log\frac{I(N_A, P, A)}{I(N_B, P, B)}\right|, \] with peptide intensities \(I\), which are determined by the peptide \(P\), the sample \(A\) or \(B\), and the normalization factors \(N_A\) and \(N_B\). In principle, \(H(\vec{N})\) quantifies the variation of peptides over the samples. Minimizing this variation gives appropriate scaling factors under the assumption that most peptides do not change between the samples. Peptide intensities for fractionated samples are described as the sum of the intensities over the fractions, with fraction-specific normalization factors. Therefore, the calculation of the summed intensities is delayed until the normalization is finished.
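As a minimal sketch of this objective (an illustration, not the actual implementation; the log-ratio form and the handling of missing values are assumptions), the unfractionated case can be written as a function of the per-sample normalization factors:

```python
import numpy as np

def normalization_error(norm_factors: np.ndarray, intensities: np.ndarray) -> float:
    """Sum of absolute log-ratios of normalized peptide intensities over all
    sample pairs, i.e. a simple H(N) for unfractionated data.

    intensities: array of shape (n_peptides, n_samples); zeros mark missing values.
    norm_factors: one scaling factor per sample.
    """
    normalized = intensities * norm_factors  # scale each sample (column)
    n_samples = intensities.shape[1]
    total = 0.0
    for a in range(n_samples):
        for b in range(a + 1, n_samples):
            shared = (normalized[:, a] > 0) & (normalized[:, b] > 0)
            total += np.sum(np.abs(np.log(normalized[shared, a] / normalized[shared, b])))
    return total

# Two peptides measured in two samples, where the second sample is systematically 2x higher:
intensities = np.array([[1.0e6, 2.0e6],
                        [3.0e5, 6.0e5]])
print(normalization_error(np.array([1.0, 1.0]), intensities))  # ~1.39, systematic shift present
print(normalization_error(np.array([1.0, 0.5]), intensities))  # ~0.0, shift removed
```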
In Silico Test Data
To test the delayed normalization approach, we create an in silico test dataset with a known ground truth. We therefore know which systematic changes exist between the samples, and we can employ different solvers to recover the normalization parameters.
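A possible way to construct such a dataset is sketched below (shapes, parameter ranges, and the noise model are assumptions for illustration; the functions documented next additionally model Gaussian elution profiles across runs, which this sketch omits):

```python
import numpy as np

def simulate_peptide_profiles(n_peptides: int, n_samples: int, keep: float = 0.8, seed: int = 42):
    """Simulate a peptide x sample intensity matrix with a known ground truth.

    Returns the distorted intensities together with the true per-sample shift
    factors that a normalization procedure should recover (up to a global scale).
    """
    rng = np.random.default_rng(seed)
    base = 10 ** rng.uniform(5, 9, size=n_peptides)           # peptide base intensities
    true_factors = 2 ** rng.uniform(-1, 1, size=n_samples)    # systematic per-sample shifts
    noise = 2 ** rng.normal(0, 0.05, size=(n_peptides, n_samples))
    intensities = base[:, None] * true_factors[None, :] * noise
    keep_mask = rng.random((n_peptides, n_samples)) < keep    # simulate missing values
    intensities[~keep_mask] = 0.0
    return intensities, true_factors

profiles, true_factors = simulate_peptide_profiles(n_peptides=500, n_samples=4)
```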
Calculates normally distributed probability densities along an input array.
Args: mu (float): mean of the normal distribution. sigma (float): standard deviation of the normal distribution. grid (np.ndarray): input array np.int[:]. For each element of the array, the probability density is calculated.
Returns: np.ndarray: probability density array, np.float[:].
Args: timepoint (float): coordinate of the peak apex. sigma (float): standard deviation of the gaussian. n_runs (int): number of points along which the density is calculated.
Returns: np.ndarray: probability density array, np.float[:].
Args: n_peptides (int): number of peptides to be simulated. n_runs (int): number of runs to be simulated. n_samples (int): number of samples to be simulated. threshold (float, optional): threshold below which a simulated intensity will be discarded. Defaults to 0.2. use_noise (bool, optional): add simulated noise to the profile values. Defaults to True.
Returns: Tuple[np.ndarray, np.ndarray]: profiles: np.float[:,:,:] array containing the simulated profiles, true_normalization: np.float[:,:,:] array containing the ground truth.
Computes the summed peptide errors over the whole dataset.
Args: normalization (np.ndarray): per sample normalization factors. profiles (np.ndarray): peptide intensity profiles over the dataset.
Returns: float: summed peptide error.
Benchmarking different optimizers
The normalization step is in principle a quadratic minimization over the normalization factors. Such minimization problems can be solved in various ways, and a variety of approaches are available in community Python packages. We compare different solvers on our benchmarking set and find substantial differences in precision and runtime. In our benchmarking, the Sequential Least Squares Programming (SLSQP) approach is a robust solution that also gives substantial speed improvements.
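Such a comparison can be reproduced along the following lines with SciPy (a self-contained sketch; the benchmark dataset, objective, and solver settings here are simplified assumptions and not the actual benchmark):

```python
import time
import numpy as np
from scipy.optimize import minimize

# Simulated data with a known per-sample shift (no missing values, for brevity).
rng = np.random.default_rng(0)
n_peptides, n_samples = 500, 6
base = 10 ** rng.uniform(5, 9, n_peptides)
true_factors = 2 ** rng.uniform(-1, 1, n_samples)
intensities = base[:, None] * true_factors[None, :] * 2 ** rng.normal(0, 0.05, (n_peptides, n_samples))

def objective(norm_factors):
    """Sum of absolute log-ratios over all sample pairs (cf. H(N) above)."""
    normalized = intensities * norm_factors
    total = 0.0
    for a in range(n_samples):
        for b in range(a + 1, n_samples):
            total += np.sum(np.abs(np.log(normalized[:, a] / normalized[:, b])))
    return total

x0 = np.ones(n_samples)
bounds = [(0.01, 100.0)] * n_samples
# The objective is invariant under a global scale, so fix it for SLSQP via a constraint.
scale_constraint = {"type": "eq", "fun": lambda x: np.sum(x) - n_samples}

for method in ["SLSQP", "L-BFGS-B", "Nelder-Mead"]:
    kwargs = {"constraints": [scale_constraint]} if method == "SLSQP" else {}
    start = time.time()
    res = minimize(objective, x0, method=method, bounds=bounds, **kwargs)
    balance = res.x * true_factors  # ~constant across samples if the shift was removed
    print(f"{method:12s} {time.time() - start:6.3f} s   H = {res.fun:10.1f}   "
          f"max/min(N*true) = {balance.max() / balance.min():.3f}")
```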
Returns normalization factors for given peptide intensities. If the solver does not converge, the unnormalized data will be used.
Args: df (pd.DataFrame): alphapept quantified features table. field (str, optional): The column in df containing the quantitative peptide information (i.e. precursor intensities). minimum_occurence (int, optional): minimum number of replicates the peptide must be observed in. Defaults to None.
Returns: [pd.DataFrame, np.ndarray]: pd.DataFrame: alphapept quantified features table extended with the normalized intensities, np.ndarray: normalized intensities
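A usage sketch of this wrapper (hedged: the function name delayed_normalization, the import path, and the column names of the toy table are assumptions based on the description above):

```python
import pandas as pd
# Assumed name and import path for the wrapper described above.
from alphapept.quantification import delayed_normalization

# Minimal toy features table; real tables contain many more columns, and the
# column names used here ('filename', 'precursor', 'int_sum') are assumptions.
df = pd.DataFrame({
    "filename":  ["A", "A", "B", "B"],
    "precursor": ["PEPTIDEK_2", "SEQWENCER_2", "PEPTIDEK_2", "SEQWENCER_2"],
    "int_sum":   [1.0e6, 2.0e6, 2.1e6, 4.2e6],
})

df_norm, normalized = delayed_normalization(df, field="int_sum")
# df_norm is the input table extended with the normalized intensities;
# 'normalized' holds the corresponding intensity array.
```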
Protein intensity profiles are constructed for each protein individually. All possible protein fold changes between the samples are derived from the median peptide fold changes. Subsequently, pseudointensities are chosen such that the fold changes between the pseudointensities optimally reconstruct the observed fold changes. Similar to the delayed normalization, this is formulated as a quadratic minimization, which we solve with the SLSQP solver.
Codewise, we start by simulating in-silico test data to serve as a ground truth for assessing solvers for the optimization problem. For the algorithmic optimization, we define the function get_protein_ratios, which allows us to quickly calculate the protein ratios. Next, we define an error function, triangle_error, that we use for the optimization problem. Lastly, we provide several wrapper functions to access these functions.
In-silico test data
Create a simulated input dataset of peptide intensities.
Args: n_sequences (int): number of peptides to simulate. n_samples (int): number of samples to simulate. noise (bool, optional): add random signal to distort the simulated intensity levels. Defaults to True. remove (bool, optional): remove intensities (i.e. add missing values). Defaults to True. peptide_ratio (bool, optional): simulate different peptide intensities. Defaults to True. abundance (bool, optional): simulate different abundances for each sample (i.e. systematic shifts). Defaults to True. signal_level (int, optional): signal level for simulated intensity. Defaults to 100. noise_divider (int, optional): the factor through which the noise is divided (higher factor -> higher signal to noise). Defaults to 10. keep (float, optional): aimed-at fraction of non-missing values, applies if ‘remove’ is set. Defaults to 0.8.
Returns: [pd.DataFrame, list, np.ndarray]: pd.DataFrame: simulated dataset with peptide intensities, list: sample names, np.ndarray: shift factors of each sample
Determine pair-wise intensity ratios
The pair-wise protein ratios are determined from the median peptide ratio.
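The following sketch illustrates the idea behind get_protein_ratios (a simplified illustration; the array layout and the handling of missing values are assumptions, and the actual implementation may differ):

```python
import numpy as np
from itertools import combinations

def median_ratio_matrix(signal: np.ndarray, minimum_ratios: int = 1) -> np.ndarray:
    """Pairwise protein fold changes from median peptide ratios (sketch).

    signal: array of shape (n_samples, n_peptides) with the peptide intensities
    of one protein; zeros mark missing values.
    Returns a matrix R where R[a, b] is the median of signal[b] / signal[a] over
    peptides quantified in both samples (NaN if fewer than `minimum_ratios`
    shared peptides are available).
    """
    n_samples = signal.shape[0]
    ratios = np.full((n_samples, n_samples), np.nan)
    for a, b in combinations(range(n_samples), 2):
        shared = (signal[a] > 0) & (signal[b] > 0)
        if shared.sum() >= minimum_ratios:
            ratios[a, b] = np.median(signal[b, shared] / signal[a, shared])
    return ratios

# Three peptides of one protein in three samples (rows = samples):
signal = np.array([[1.0e6, 2.0e6, 0.0],
                   [2.0e6, 4.2e6, 3.0e5],
                   [4.0e6, 8.0e6, 6.3e5]])
print(median_ratio_matrix(signal))
```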
Calculates the protein ratios between samples for one protein.
Args: signal (np.ndarray): np.array[:,:] containing peptide intensities for each sample. column_combinations (list): list of all index combinations to compare (usually all sample combinations). minimum_ratios (int, optional): minimum number of peptide ratios necessary to calculate a protein ratio. Defaults to 1.
Returns: np.ndarray: np.array[:,:] matrix comparing the ratios for all column combinations.
Error Function
The error function evaluates the difference between the actual observed fold change and the fold change that is derived from the pseudointensities.
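A sketch of this idea (the exact error definition and solver settings of the actual triangle_error and solver wrapper may differ): given a pairwise ratio matrix as sketched above, candidate pseudointensities imply their own ratio matrix, and the squared deviation between the two is minimized with SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

def ratio_error(pseudo: np.ndarray, observed_ratios: np.ndarray) -> float:
    """Squared deviation between the observed pairwise ratios and the ratios
    implied by the candidate pseudointensities (NaN entries are ignored)."""
    implied = pseudo[None, :] / pseudo[:, None]   # implied[a, b] = pseudo[b] / pseudo[a]
    return np.nansum((observed_ratios - implied) ** 2)

# Toy ratio matrix for three samples: the second sample is ~2x the first, the third ~4x.
observed = np.array([
    [np.nan, 2.0,    4.0],
    [np.nan, np.nan, 2.1],
    [np.nan, np.nan, np.nan],
])

res = minimize(ratio_error, x0=np.ones(3), args=(observed,),
               method="SLSQP", bounds=[(1e-6, None)] * 3)
pseudo = res.x / res.x[0]   # remove the arbitrary overall scale
print(pseudo)               # approximately [1, 2, 4]
```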
Calculates protein pseudointensities with a specified solver.
Args: ratios (np.ndarray): np.array[:,:] matrix containing all estimated protein ratios between samples. method (str): string specifying which solver to use.
Raises: NotImplementedError: if the solver is not implemented.
Returns: [np.ndarray, bool]: np.ndarray: the protein pseudointensities, bool: whether the solver was successful.
Function to extract optimal protein ratios for a given input of peptides.
Note for the chunk argument: This construction is needed to call this function from a parallel pool.
Args: files (list): A list of files for which the profile shall be extracted. minimum_ratios (int): A minimum number of peptide ratios to be considered for optimization. chunk (tuple[pd.DataFrame, str]): A pandas dataframe with the peptide information and a string to identify the protein.
Args: df (pd.DataFrame): Feature table by alphapept. minimum_ratios (int): Minimum number of peptide ratios necessary to derive a protein ratio. field (str): The field containing the quantitative peptide information (i.e. precursor intensities). callback ([type], optional): Callback function. Defaults to None.
Returns: pd.DataFrame: table containing the LFQ intensities of each protein in each sample.
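A usage sketch for this dataframe-level wrapper (hedged: the function name protein_profile_parallel, the import path, and the column names of the toy table are assumptions; only the arguments follow the description above):

```python
import pandas as pd
# Assumed name and import path for the wrapper described above.
from alphapept.quantification import protein_profile_parallel

# Toy feature table; the column names ('protein_group', 'filename', 'precursor',
# 'int_sum') are assumptions for illustration.
df = pd.DataFrame({
    "protein_group": ["P1"] * 4 + ["P2"] * 4,
    "filename":      ["A", "B", "A", "B"] * 2,
    "precursor":     ["pep1_2", "pep1_2", "pep2_2", "pep2_2",
                      "pep3_2", "pep3_2", "pep4_2", "pep4_2"],
    "int_sum":       [1e6, 2e6, 3e6, 6e6, 5e5, 5e5, 7e5, 7e5],
})

protein_lfq = protein_profile_parallel(df, minimum_ratios=1, field="int_sum")
print(protein_lfq)  # one row per protein, one LFQ intensity column per sample
```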