Recalibration

Functions related to recalibrating

This notebook contains everything related to the recalibration of data.

Recalibration after search

Precursor mass calibration

Recalibration refers to the computational step where masses are recalibrated after a first search. The identified peptides are used to calculate the deviations of the experimental masses from their theoretical masses. After recalibration, a second search with a decreased precursor tolerance is performed.

The recalibration is largely motivated by the software lock mass paper:

Cox J, Michalski A, Mann M. Software lock mass by two-dimensional minimization of peptide mass errors. J Am Soc Mass Spectrom. 2011;22(8):1373-1380. doi:10.1007/s13361-011-0142-8

Here, mass offsets are piecewise linearly approximated. The positions for approximation need to fulfill a number of criteria (e.g., a minimum number of samples and a minimum distance). The AlphaPept implementation is slightly modified by employing a more general KNeighborsRegressor-approach. In brief, the calibration is calculated for each point individually by estimating the deviation from its identified neighbors in n-dimensional space (e.g., retention time, mass, mobility).

More specifically, the algorithm consists of the following steps:

1. Outlier removal: We remove outliers from the identified peptides by only accepting identifications with a mass offset within n (default 3) standard deviations of the median.
2. Neighbor lookup: For each point, we look up the nearest n (default 100) neighbors. For this lookup, the axes need to be scaled, which is done with a transform function that is either absolute or relative.
3. Regression: We perform a regression based on the neighbors to determine the mass offset. The contribution of each neighbor is weighted by its distance.
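
The neighbor lookup and the distance-weighted regression described above can be illustrated for a single point with a minimal NumPy sketch. It is a hand-written stand-in for what a distance-weighted nearest-neighbor regression computes (scikit-learn's KNeighborsRegressor offers this behavior via weights="distance"), not the AlphaPept implementation. The function name `weighted_neighbor_offset`, the inverse-distance weights, and the toy data are assumptions made for illustration.

```python
import numpy as np

def weighted_neighbor_offset(query, points, offsets, k=3):
    """Estimate the offset at `query` from its k nearest identified points.

    query:   (n_dims,) scaled coordinates of the point to calibrate
    points:  (n_points, n_dims) scaled coordinates of identified peptides
    offsets: (n_points,) mass offsets (ppm) of the identified peptides
    """
    distances = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(distances)[:k]
    d = distances[nearest]
    if np.any(d == 0):
        # Exact matches dominate: average the offsets of coinciding points.
        return float(offsets[nearest][d == 0].mean())
    weights = 1.0 / d  # closer neighbors contribute more
    return float(np.sum(weights * offsets[nearest]) / np.sum(weights))

# Toy data (hypothetical): four identified points in a scaled 2D space.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
offsets = np.array([2.0, 4.0, 6.0, 100.0])
estimate = weighted_neighbor_offset(np.array([0.5, 0.0]), points, offsets, k=2)
```

Here the query lies halfway between the first two points, so both receive the same weight and the estimate is the average of their offsets; the distant point with the large offset is ignored.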

Fragment mass calibration

The fragment mass calibration is based on the identified fragment_ions (i.e., b-hits and y-hits). For each hit, we calculate the offset to its theoretical mass. The correction is then applied by taking the median offset in ppm and applying it globally.
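
A global median correction of this kind can be sketched as follows. The function name `apply_median_ppm_correction`, the sign convention of the offset (observed minus theoretical), and the example values are assumptions for illustration; the actual AlphaPept code may differ in detail.

```python
import numpy as np

def apply_median_ppm_correction(observed_mz, theoretical_mz, all_mz):
    """Shift all m/z values by the median ppm offset of the identified hits."""
    offsets_ppm = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
    median_ppm = float(np.median(offsets_ppm))
    # Apply the same relative correction to every m/z value.
    corrected = all_mz * (1 - median_ppm / 1e6)
    return corrected, median_ppm

# Hypothetical hits measured 9, 10, and 11 ppm too high.
theoretical_mz = np.array([200.0, 400.0, 800.0])
observed_mz = theoretical_mz * (1 + np.array([9e-6, 10e-6, 11e-6]))
corrected, median_ppm = apply_median_ppm_correction(
    observed_mz, theoretical_mz, observed_mz
)
```

In this toy case the median offset is about 10 ppm, so the middle hit is moved back almost exactly onto its theoretical value while the other two keep a residual of roughly 1 ppm.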

remove_outliers

 remove_outliers (df:pandas.core.frame.DataFrame, outlier_std:float)

Helper function to remove outliers from a dataframe. Outliers are removed based on the precursor mass offset (prec_offset_ppm). All values within outlier_std standard deviations of the median are kept.

Args: df (pd.DataFrame): Input dataframe that contains a prec_offset_ppm-column. outlier_std (float): Range of standard deviations to filter outliers

Raises: ValueError: An error if the column is not present in the dataframe.

Returns: pd.DataFrame: A dataframe w/o outliers.
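
The filter described in this docstring could look roughly like the sketch below. The name `remove_outliers_sketch` and the toy dataframe are hypothetical, and the code is written from the description above rather than copied from the library.

```python
import pandas as pd

def remove_outliers_sketch(df: pd.DataFrame, outlier_std: float) -> pd.DataFrame:
    """Keep rows whose prec_offset_ppm is within outlier_std std of the median."""
    if "prec_offset_ppm" not in df.columns:
        raise ValueError("Column prec_offset_ppm not found in dataframe.")
    offsets = df["prec_offset_ppm"]
    center = offsets.median()
    spread = offsets.std()  # sample standard deviation (pandas default)
    keep = (offsets - center).abs() < outlier_std * spread
    return df[keep].copy()

# Toy data (hypothetical): six consistent offsets and one gross outlier.
df = pd.DataFrame({"prec_offset_ppm": [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 50.0]})
filtered = remove_outliers_sketch(df, outlier_std=2)
```

Note that a single extreme value inflates the standard deviation itself, so on very small inputs like this one a threshold of 3 would not remove the outlier while a threshold of 2 does.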

transform

 transform (x:numpy.ndarray, column:str, scaling_dict:dict)

Helper function to transform an input array for neighbors lookup used for calibration

Note: The scaling_dict stores information about how scaling is applied and is defined in get_calibration

Relative transformation: Distances are compared relatively; for mz this is in ppm, for mobility in %.
Absolute transformation: Distances are compared absolutely; for RT this is the time delta.

An example definition is below:

    scaling_dict = {}
    scaling_dict['mz'] = ('relative', calib_mz_range/1e6)
    scaling_dict['rt'] = ('absolute', calib_rt_range)
    scaling_dict['mobility'] = ('relative', calib_mob_range)

Args: x (np.ndarray): Input array. column (str): String to lookup what scaling should be applied. scaling_dict (dict): Lookup dict to retrieve the scaling operation and factor for the column.

Raises: KeyError: An error if the column is not present in the dict. NotImplementedError: An error if the scaling type is not implemented.

Returns: np.ndarray: A scaled array.
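
To make the two scaling modes concrete, here is a minimal sketch of such a transform. It assumes that the relative mode is realized by taking logarithms (differences of logarithms approximate relative differences) and that the absolute mode simply divides by the scale; the name `transform_sketch` and the example values are hypothetical.

```python
import numpy as np

def transform_sketch(x: np.ndarray, column: str, scaling_dict: dict) -> np.ndarray:
    """Scale an axis so that one unit corresponds to the configured range."""
    if column not in scaling_dict:
        raise KeyError(f"Column {column} not in scaling_dict")
    kind, scale = scaling_dict[column]
    if kind == "relative":
        # Assumption: log-transform so that distances become relative.
        return np.log(x) / scale
    if kind == "absolute":
        return x / scale
    raise NotImplementedError(f"Scaling type {kind} not known.")

scaling_dict = {"mz": ("relative", 100 / 1e6), "rt": ("absolute", 0.5)}
mz = transform_sketch(np.array([1000.0, 1000.1]), "mz", scaling_dict)
rt = transform_sketch(np.array([10.0, 10.5]), "rt", scaling_dict)
```

With these settings, two m/z values that are 100 ppm apart and two retention times that are 0.5 apart both end up roughly one unit apart, which is what makes the axes comparable in a neighbor lookup.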

kneighbors_calibration

 kneighbors_calibration (df:pandas.core.frame.DataFrame,
                         features:pandas.core.frame.DataFrame, cols:list,
                         target:str, scaling_dict:dict,
                         calib_n_neighbors:int)

Calibration using a KNeighborsRegressor. Input arrays are transformed to be used with a nearest-neighbor approach. Based on neighboring points, a calibration is calculated for each input point.

Args: df (pd.DataFrame): Input dataframe that contains identified peptides (w/o outliers). features (pd.DataFrame): Features dataframe for which the masses are calibrated. cols (list): List of input columns for the calibration. target (str): Target column on which offset is calculated. scaling_dict (dict): A dictionary that contains how scaling operations are applied. calib_n_neighbors (int): Number of neighbors for calibration.

Returns: np.ndarray: A numpy array with calibrated masses.

get_calibration

 get_calibration (df:pandas.core.frame.DataFrame,
                  features:pandas.core.frame.DataFrame, file_name='',
                  settings=None, outlier_std:float=3,
                  calib_n_neighbors:int=100, calib_mz_range:int=100,
                  calib_rt_range:float=0.5, calib_mob_range:float=0.3,
                  **kwargs)

Wrapper function to get calibrated values for the precursor mass.

Args: df (pd.DataFrame): Input dataframe that contains identified peptides. features (pd.DataFrame): Features dataframe for which the masses are calibrated. outlier_std (float, optional): Range in standard deviations for outlier removal. Defaults to 3. calib_n_neighbors (int, optional): Number of neighbors used for regression. Defaults to 100. calib_mz_range (int, optional): Scaling factor for mz range. Defaults to 100. calib_rt_range (float, optional): Scaling factor for rt_range. Defaults to 0.5. calib_mob_range (float, optional): Scaling factor for mobility range. Defaults to 0.3. **kwargs: Arbitrary keyword arguments so that settings can be passed as a whole.

Returns: corrected_mass (np.ndarray): The calibrated mass. y_hat_std (float): The standard deviation of the precursor offset after calibration.

calibrate_fragments_nn

 calibrate_fragments_nn (ms_file_, file_name, settings)

save_precursor_calibration

 save_precursor_calibration (df, corrected, std_offset, file_name,
                             settings)

save_fragment_calibration

 save_fragment_calibration (fragment_ions, corrected, std_offset,
                            file_name, settings)

density_scatter

 density_scatter (x, y, ax=None, sort=True, bins=20, **kwargs)

Scatter plot colored by 2d histogram. Adapted from https://stackoverflow.com/questions/20105364/how-can-i-make-a-scatter-plot-colored-by-density-in-matplotlib
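
The idea of coloring points by a 2D histogram can be sketched without any plotting code. The simplified version below assigns each point the count of the histogram bin it falls into; the helper name `point_density` and the toy data are assumptions, and the linked answer uses a smoother interpolation than this plain bin lookup.

```python
import numpy as np

def point_density(x, y, bins=20):
    """Return, for each point, the count of the 2D histogram bin it falls in."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    # Map each point to its bin index; clip so that points on the upper
    # edge are assigned to the last bin, as np.histogram2d does.
    ix = np.clip(np.digitize(x, x_edges) - 1, 0, counts.shape[0] - 1)
    iy = np.clip(np.digitize(y, y_edges) - 1, 0, counts.shape[1] - 1)
    return counts[ix, iy]

x = np.array([0.0, 0.1, 0.1, 0.2, 5.0])
y = np.array([0.0, 0.1, 0.1, 0.2, 5.0])
density = point_density(x, y, bins=2)
order = np.argsort(density)  # draw sparse points first, dense points last
```

The resulting values could then be passed as the `c` argument of a matplotlib scatter call, with the points reordered by `order` so that dense regions are not hidden behind sparse ones.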

chunks

 chunks (lst, n)

Yield successive n-sized chunks from lst.
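
A generator with this behavior is commonly written as shown below; the name `chunks_sketch` is used to make clear that this is an illustration rather than the library function itself.

```python
def chunks_sketch(lst, n):
    """Yield successive n-sized chunks from lst; the last one may be shorter."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

parts = list(chunks_sketch([1, 2, 3, 4, 5, 6, 7], 3))
# parts == [[1, 2, 3], [4, 5, 6], [7]]
```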

calibrate_hdf

 calibrate_hdf (to_process:tuple, callback=None, parallel=True)

Wrapper function to calibrate an HDF file when using the parallel executor. The function loads the respective dataframes from the HDF file, calls the calibration function, and applies the offset.

Args: to_process (tuple): Tuple that contains the file index and the settings dictionary. callback ([type], optional): Placeholder for callback (unused). parallel (bool, optional): Placeholder for parallel usage (unused).

Returns: Union[str,bool]: Either True as a boolean when the calibration is successful, or the error message as a string.
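
A return type of this kind needs some care on the calling side, because a non-empty error string is truthy in Python. The sketch below shows the general pattern with a hypothetical wrapper `run_safely`; whether calibrate_hdf is implemented with a try/except in this way is an assumption.

```python
from typing import Callable, Union

def run_safely(task: Callable[[], None]) -> Union[str, bool]:
    """Return True on success, otherwise the error message as a string."""
    try:
        task()
        return True
    except Exception as e:
        return str(e)

def failing_task():
    # Hypothetical failure used only for this example.
    raise ValueError("no identified peptides")

ok = run_safely(lambda: None)
failed = run_safely(failing_task)

# A non-empty error string is truthy, so compare against True explicitly.
succeeded = [result is True for result in (ok, failed)]
```

Checking `result is True` instead of `if result:` avoids treating an error message as a success.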

Database calibration

Another way to calibrate the fragment and precursor masses is by directly comparing them to a previously generated theoretical mass database. Here, peaks in the distribution of the database masses are used to align the experimental masses.
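
As a rough illustration of comparing experimental masses against database targets, the sketch below matches each observed mass to its nearest target and reports the median ppm error of the matches. It is a simplified stand-in that ignores any retention-time dependence; the function name `median_ppm_error`, the matching rule, and the toy values are assumptions and not taken from the AlphaPept code.

```python
import numpy as np

def median_ppm_error(observed, targets, max_ppm):
    """Median ppm error of observed masses against their nearest targets.

    `targets` must be sorted in ascending order.
    """
    idx = np.searchsorted(targets, observed)
    left = targets[np.clip(idx - 1, 0, len(targets) - 1)]
    right = targets[np.clip(idx, 0, len(targets) - 1)]
    # Pick whichever neighboring target is closer to the observed mass.
    nearest = np.where(
        np.abs(observed - left) <= np.abs(right - observed), left, right
    )
    ppm = (observed - nearest) / nearest * 1e6
    matched = ppm[np.abs(ppm) <= max_ppm]  # discard masses far from any target
    return float(np.median(matched)) if matched.size else float("nan")

targets = np.array([100.0, 200.0, 300.0])
observed = np.array([100.001, 200.002, 300.003, 250.0])
error = median_ppm_error(observed, targets, max_ppm=100)
```

In this toy example three masses sit about 10 ppm above a target and one mass is far from every target, so the unmatched mass is dropped and the estimated error is about 10 ppm.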

get_db_targets

 get_db_targets (db_file_name:str, max_ppm:int=100,
                 min_distance:float=0.5, ms_level:int=2)

Function to extract database targets for database-calibration. Based on the FASTA database it finds masses that occur often. These will be used for calibration.

Args: db_file_name (str): Path to the database. max_ppm (int, optional): Maximum distance in ppm between two peaks. Defaults to 100. min_distance (float, optional): Minimum distance between two calibration peaks. Defaults to 0.5. ms_level (int, optional): MS-Level used for calibration, either precursors (1) or fragmasses (2). Defaults to 2.

Raises: ValueError: When ms_level is not valid.

Returns: np.ndarray: Numpy array with calibration masses.

align_run_to_db

 align_run_to_db (ms_data_file_name:str, db_array:numpy.ndarray,
                  max_ppm_distance:int=1000000, rt_step_size:float=0.1,
                  plot_ppms:bool=False, ms_level:int=2)

Function to align a run to its theoretical FASTA database.

Args: ms_data_file_name (str): Path to the run. db_array (np.ndarray): Numpy array containing the database targets. max_ppm_distance (int, optional): Maximum distance in ppm. Defaults to 1000000. rt_step_size (float, optional): Stepsize for rt calibration. Defaults to 0.1. plot_ppms (bool, optional): Flag to indicate plotting. Defaults to False. ms_level (int, optional): ms_level for calibration. Defaults to 2.

Raises: ValueError: When ms_level is not valid.

Returns: np.ndarray: Estimated errors

calibrate_fragments

 calibrate_fragments (db_file_name:str, ms_data_file_name:str,
                      ms_level:int=2, write=True, plot_ppms=False)

Wrapper function to calibrate fragments. Calibrated values are saved to corrected_fragment_mzs.

Args: db_file_name (str): Path to database. ms_data_file_name (str): Path to ms_data file. ms_level (int, optional): MS-level for calibration. Defaults to 2. write (bool, optional): Boolean flag for test purposes to avoid writing to testfile. Defaults to True. plot_ppms (bool, optional): Boolean flag to plot the calibration. Defaults to False.