Recalibration
This notebook contains everything related to the recalibration of data.
Recalibration after search
Precursor mass calibration
Recalibration refers to the computational step where masses are recalibrated after a first search. The identified peptides are used to calculate the deviations of the experimental masses from their theoretical masses. After recalibration, a second search with a decreased precursor tolerance is performed.
The recalibration is largely motivated by the software lock mass paper:
Here, mass offsets are piecewise linearly approximated. The positions for approximation need to fulfill a number of criteria (e.g., a minimum number of samples and a minimum distance). The AlphaPept implementation is slightly modified by employing a more general KNeighborsRegressor approach. In brief, the calibration is calculated for each point individually by estimating the deviation from its identified neighbors in n-dimensional space (e.g., retention time, mass, mobility).
More specifically, the algorithm consists of the following steps:
- Outlier removal: We remove outliers from the identified peptides by only accepting identifications with a mass offset that is within n (default 3) standard deviations of the median.
- Neighbor lookup: For each point, we look up its n (default 100) nearest neighbors. For this lookup, the axes need to be scaled, which is done with a transform function that is either absolute or relative.
- Regression: Next, we perform a regression based on the neighbors to determine the mass offset. The contribution of each neighbor is weighted by its distance.
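The neighbor lookup and the distance-weighted regression can be sketched with scikit-learn's KNeighborsRegressor. This is a minimal illustration on synthetic data, not the AlphaPept code: the drift model, the inline scale helper, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical identified peptides: m/z, retention time (min) and a synthetic
# offset in ppm, defined as (observed - theoretical) / theoretical * 1e6,
# that drifts linearly with retention time.
n_ident = 20000
mz = rng.uniform(400, 1200, n_ident)
rt = rng.uniform(0, 120, n_ident)
offset_ppm = 0.05 * rt - 3.0 + rng.normal(0, 0.5, n_ident)


def scale(mz, rt, mz_range_ppm=100, rt_range=0.5):
    # Assumed scaling: log(m/z) makes distances relative, RT is divided by a window.
    return np.column_stack([np.log(mz) / (mz_range_ppm / 1e6), rt / rt_range])


# Each neighbor contributes with a weight inversely proportional to its distance.
model = KNeighborsRegressor(n_neighbors=100, weights="distance")
model.fit(scale(mz, rt), offset_ppm)

# Features to calibrate, whether they were identified or not.
feat_mz = rng.uniform(400, 1200, 500)
feat_rt = rng.uniform(0, 120, 500)
predicted_offset_ppm = model.predict(scale(feat_mz, feat_rt))

# Remove the predicted offset from the measured m/z.
corrected_mz = feat_mz / (1 + predicted_offset_ppm / 1e6)
```

With few identifications, the 100 nearest neighbors cover a wide region of the scaled space, so the local estimate becomes coarser; the scaling factors control how much each axis contributes to the notion of "near".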
Fragment mass calibration
The fragment mass calibration is based on the identified fragment_ions (i.e., b-hits and y-hits). For each hit, we calculate the offset to its theoretical mass. The correction is then applied by taking the median offset in ppm and applying it globally.
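The global correction described above can be illustrated as follows. This is a small sketch on synthetic hits; the array names and the simulated +5 ppm bias are assumptions for the example and do not come from AlphaPept.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical matched fragment ions: theoretical m/z and observed m/z
# carrying a systematic +5 ppm bias plus random noise.
theoretical_mz = rng.uniform(100, 1500, 1000)
observed_mz = theoretical_mz * (1 + (5.0 + rng.normal(0, 2.0, 1000)) / 1e6)

# Offset of every hit in ppm relative to its theoretical mass.
offset_ppm = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
median_offset_ppm = np.median(offset_ppm)

# Apply one global correction to all fragment masses.
corrected_mz = observed_mz / (1 + median_offset_ppm / 1e6)
residual_ppm = (corrected_mz - theoretical_mz) / theoretical_mz * 1e6
```

The median is robust against a few wrongly matched fragments, which is one reason to prefer it over the mean for a single global shift.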
remove_outliers
remove_outliers (df:pandas.core.frame.DataFrame, outlier_std:float)
Helper function to remove outliers from a dataframe. Outliers are removed based on the precursor mass offset (prec_offset_ppm). All values within outlier_std standard deviations of the median are kept.
Args: df (pd.DataFrame): Input dataframe that contains a prec_offset_ppm-column. outlier_std (float): Range of standard deviations to filter outliers
Raises: ValueError: An error if the column is not present in the dataframe.
Returns: pd.DataFrame: A dataframe w/o outliers.
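A filter that matches this description could look like the sketch below. The function name remove_outliers_sketch and the test data are hypothetical; only the column name prec_offset_ppm and the median-plus-standard-deviation rule are taken from the docstring.

```python
import numpy as np
import pandas as pd


def remove_outliers_sketch(df: pd.DataFrame, outlier_std: float) -> pd.DataFrame:
    """Keep rows whose offset lies within outlier_std standard deviations of the median."""
    if "prec_offset_ppm" not in df.columns:
        raise ValueError("Column 'prec_offset_ppm' not found in dataframe.")
    offsets = df["prec_offset_ppm"]
    center = offsets.median()
    spread = offsets.std()
    keep = (offsets - center).abs() < outlier_std * spread
    return df[keep]


# Synthetic offsets: 1000 well-behaved values plus two gross outliers.
rng = np.random.default_rng(2)
values = np.concatenate([rng.normal(0, 1, 1000), [50.0, -60.0]])
df = pd.DataFrame({"prec_offset_ppm": values})
filtered = remove_outliers_sketch(df, outlier_std=3)
```

Note that the standard deviation is itself inflated by the outliers, so the cut is more permissive than three standard deviations of the clean distribution.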
transform
transform (x:numpy.ndarray, column:str, scaling_dict:dict)
Helper function to transform an input array for neighbors lookup used for calibration
Note: The scaling_dict stores information about how scaling is applied and is defined in get_calibration
Relative transformation: Distances are compared relatively; for mz this is in ppm, for mobility in %.
Absolute transformation: Distances are compared absolutely; for RT this is the time difference.
An example definition is below:
scaling_dict = {}
scaling_dict['mz'] = ('relative', calib_mz_range/1e6)
scaling_dict['rt'] = ('absolute', calib_rt_range)
scaling_dict['mobility'] = ('relative', calib_mob_range)
Args: x (np.ndarray): Input array. column (str): String to lookup what scaling should be applied. scaling_dict (dict): Lookup dict to retrieve the scaling operation and factor for the column.
Raises: KeyError: An error if the column is not present in the dict. NotImplementedError: An error if the scaling type is not implemented.
Returns: np.ndarray: A scaled array.
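One way such a transform could work is sketched below. Using the logarithm for the relative case is an assumption made here so that equal ratios map to equal distances; the function name transform_sketch and the example values are hypothetical.

```python
import numpy as np


def transform_sketch(x: np.ndarray, column: str, scaling_dict: dict) -> np.ndarray:
    """Scale an array so that one unit corresponds to the configured range."""
    if column not in scaling_dict:
        raise KeyError(f"Column {column} not in scaling_dict")
    kind, scale = scaling_dict[column]
    if kind == "relative":
        # Assumption: log-transform so that a fixed ratio (e.g. 100 ppm) is a fixed distance.
        return np.log(x) / scale
    if kind == "absolute":
        return x / scale
    raise NotImplementedError(f"Scaling type {kind} is not implemented")


scaling_dict = {"mz": ("relative", 100 / 1e6), "rt": ("absolute", 0.5)}

# Two masses that differ by 100 ppm and two retention times that differ by 0.5.
mz_scaled = transform_sketch(np.array([500.0, 500.05]), "mz", scaling_dict)
rt_scaled = transform_sketch(np.array([10.0, 10.5]), "rt", scaling_dict)
```

In this sketch, a 100 ppm difference in mz and a 0.5 difference in RT both end up roughly one unit apart, so the neighbor lookup treats them as equally distant.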
kneighbors_calibration
kneighbors_calibration (df:pandas.core.frame.DataFrame, features:pandas.core.frame.DataFrame, cols:list, target:str, scaling_dict:dict, calib_n_neighbors:int)
Calibration using a KNeighborsRegressor. Input arrays are transformed so that they can be used with a nearest-neighbor approach. Based on neighboring points, a calibration is calculated for each input point.
Args: df (pd.DataFrame): Input dataframe that contains identified peptides (w/o outliers). features (pd.DataFrame): Features dataframe for which the masses are calibrated. cols (list): List of input columns for the calibration. target (str): Target column on which offset is calculated. scaling_dict (dict): A dictionary that contains how scaling operations are applied. calib_n_neighbors (int): Number of neighbors for calibration.
Returns: np.ndarray: A numpy array with calibrated masses.
get_calibration
get_calibration (df:pandas.core.frame.DataFrame, features:pandas.core.frame.DataFrame, file_name='', settings=None, outlier_std:float=3, calib_n_neighbors:int=100, calib_mz_range:int=100, calib_rt_range:float=0.5, calib_mob_range:float=0.3, **kwargs)
Wrapper function to get calibrated values for the precursor mass.
Args: df (pd.DataFrame): Input dataframe that contains identified peptides. features (pd.DataFrame): Features dataframe for which the masses are calibrated. outlier_std (float, optional): Range in standard deviations for outlier removal. Defaults to 3. calib_n_neighbors (int, optional): Number of neighbors used for regression. Defaults to 100. calib_mz_range (int, optional): Scaling factor for mz range. Defaults to 100. calib_rt_range (float, optional): Scaling factor for rt_range. Defaults to 0.5. calib_mob_range (float, optional): Scaling factor for mobility range. Defaults to 0.3. **kwargs: Arbitrary keyword arguments so that settings can be passed as a whole.
Returns: corrected_mass (np.ndarray): The calibrated mass. y_hat_std (float): The standard deviation of the precursor offset after calibration.
calibrate_fragments_nn
calibrate_fragments_nn (ms_file_, file_name, settings)
save_precursor_calibration
save_precursor_calibration (df, corrected, std_offset, file_name, settings)
save_fragment_calibration
save_fragment_calibration (fragment_ions, corrected, std_offset, file_name, settings)
density_scatter
density_scatter (x, y, ax=None, sort=True, bins=20, **kwargs)
Scatter plot colored by 2d histogram. Adapted from https://stackoverflow.com/questions/20105364/how-can-i-make-a-scatter-plot-colored-by-density-in-matplotlib
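The coloring idea can be sketched with NumPy alone: each point receives the count of the 2D histogram bin it falls into. This is a simplified variant for illustration (a plain bin lookup instead of any interpolation), and point_density is a hypothetical helper name.

```python
import numpy as np


def point_density(x, y, bins=20):
    """Return, for every point, the count of the 2D histogram bin it falls into."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    # Bin index of every point; clip so that points on the last edge land in the last bin.
    xi = np.clip(np.digitize(x, x_edges) - 1, 0, counts.shape[0] - 1)
    yi = np.clip(np.digitize(y, y_edges) - 1, 0, counts.shape[1] - 1)
    return counts[xi, yi]


rng = np.random.default_rng(3)
x = rng.normal(size=5000)
y = x + rng.normal(scale=0.5, size=5000)

density = point_density(x, y)
# Sort so that the densest points would be drawn last and stay visible.
order = np.argsort(density)
x_sorted, y_sorted, density_sorted = x[order], y[order], density[order]
```

The sorted arrays could then be handed to a scatter plot, with density_sorted as the color values.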
chunks
chunks (lst, n)
Yield successive n-sized chunks from lst.
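A generator with this behavior can be written in a few lines; the name chunks_sketch is used here only to keep the example separate from the documented function.

```python
def chunks_sketch(lst, n):
    """Yield successive n-sized chunks from lst; the last chunk may be shorter."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


batches = list(chunks_sketch(list(range(7)), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```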
calibrate_hdf
calibrate_hdf (to_process:tuple, callback=None, parallel=True)
Wrapper function to calibrate an HDF file when using the parallel executor. The function loads the respective dataframes from the HDF file, calls the calibration function, and applies the offset.
Args: to_process (tuple): Tuple that contains the file index and the settings dictionary. callback ([type], optional): Placeholder for callback (unused). parallel (bool, optional): Placeholder for parallel usage (unused).
Returns: Union[str,bool]: Either True as boolean when calibration is successful or the error message as string.
Database calibration
Another way to calibrate the fragment and precursor masses is to compare them directly to a previously generated theoretical mass database. Here, peaks in the mass distribution of the database, i.e., masses that occur often, are used to align the experimental masses.
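The principle can be illustrated by matching each experimental mass to its nearest database target and taking the median deviation in ppm. This is a simplified sketch under stated assumptions, not the AlphaPept procedure: the helper nearest_ppm_offsets, the four target masses, and the simulated 8 ppm shift are all made up for the example.

```python
import numpy as np


def nearest_ppm_offsets(exp_mz, db_mz, max_ppm=100):
    """Deviation in ppm of each experimental mass from its nearest database mass."""
    db_mz = np.sort(db_mz)
    # Index of the first database mass that is >= the experimental mass.
    idx = np.clip(np.searchsorted(db_mz, exp_mz), 1, len(db_mz) - 1)
    left, right = db_mz[idx - 1], db_mz[idx]
    nearest = np.where(exp_mz - left < right - exp_mz, left, right)
    ppm = (exp_mz - nearest) / nearest * 1e6
    # Discard matches that are too far away to be the same peak.
    return ppm[np.abs(ppm) <= max_ppm]


# Hypothetical calibration targets and a run that is shifted by 8 ppm.
db_targets = np.array([200.0, 400.0, 600.0, 800.0])
exp_mz = db_targets * (1 + 8.0 / 1e6)

estimated_shift_ppm = np.median(nearest_ppm_offsets(exp_mz, db_targets))
corrected_mz = exp_mz / (1 + estimated_shift_ppm / 1e6)
```

The max_ppm cut keeps unrelated masses from contributing, which is why the calibration targets need to be sufficiently far apart from each other.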
get_db_targets
get_db_targets (db_file_name:str, max_ppm:int=100, min_distance:float=0.5, ms_level:int=2)
Function to extract database targets for database-calibration. Based on the FASTA database it finds masses that occur often. These will be used for calibration.
Args: db_file_name (str): Path to the database. max_ppm (int, optional): Maximum distance in ppm between two peaks. Defaults to 100. min_distance (float, optional): Minimum distance between two calibration peaks. Defaults to 0.5. ms_level (int, optional): MS-Level used for calibration, either precursors (1) or fragment masses (2). Defaults to 2.
Raises: ValueError: When ms_level is not valid.
Returns: np.ndarray: Numpy array with calibration masses.
align_run_to_db
align_run_to_db (ms_data_file_name:str, db_array:numpy.ndarray, max_ppm_distance:int=1000000, rt_step_size:float=0.1, plot_ppms:bool=False, ms_level:int=2)
Function to align a run to its theoretical FASTA database.
Args: ms_data_file_name (str): Path to the run. db_array (np.ndarray): Numpy array containing the database targets. max_ppm_distance (int, optional): Maximum distance in ppm. Defaults to 1000000. rt_step_size (float, optional): Stepsize for rt calibration. Defaults to 0.1. plot_ppms (bool, optional): Flag to indicate plotting. Defaults to False. ms_level (int, optional): ms_level for calibration. Defaults to 2.
Raises: ValueError: When ms_level is not valid.
Returns: np.ndarray: Estimated errors
calibrate_fragments
calibrate_fragments (db_file_name:str, ms_data_file_name:str, ms_level:int=2, write=True, plot_ppms=False)
Wrapper function to calibrate fragments. Calibrated values are saved to corrected_fragment_mzs.
Args: db_file_name (str): Path to database ms_data_file_name (str): Path to ms_data file ms_level (int, optional): MS-level for calibration. Defaults to 2. write (bool, optional): Boolean flag for test purposes to avoid writing to testfile. Defaults to True. plot_ppms (bool, optional): Boolean flag to plot the calibration. Defaults to False.