To match MS2 identifications to MS1 features, we first need to align the datasets with each other so that identifications can be transferred correctly. Datasets are aligned by comparing shared precursors and calculating the median offset. Comparing all files to each other yields an overdetermined linear equation system. By solving it, we find offset parameters that minimize the shift of all files relative to each other. The offset is applied either relatively (mz, mobility) or absolutely (rt).
Relative offset
For some parameters, we would like to have a relative correction of values. Consider the case of different mz-values, e.g. 300 and 600. If we assume that the offset is larger for larger m/z values, we would not want an absolute correction of e.g. +0.5 Da (300.5 and 600.5) but rather a relative correction of e.g. +0.1% (300.3 and 600.6).
Absolute correction
In contrast to the relative correction, an absolute correction is sometimes more applicable. Consider the case of retention time. Here one would expect an absolute rather than a relative offset. As an example, consider a lag time of 0.5 minutes. This would be constant for all retention times and would not grow, e.g., for later retention times.
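The difference between the two correction types can be illustrated with a minimal sketch. The helper names `apply_relative` and `apply_absolute` are hypothetical and chosen only for illustration, and the sign convention (adding the offset) is an assumption.

```python
import numpy as np

def apply_relative(values, offset):
    """Scale each value by a fractional offset (0.001 corresponds to +0.1 %)."""
    return values * (1 + offset)

def apply_absolute(values, offset):
    """Shift each value by the same constant offset."""
    return values + offset

mz = np.array([300.0, 600.0])   # m/z values
rt = np.array([10.0, 50.0])     # retention times in minutes

# Relative correction: larger m/z values receive a larger shift.
mz_corrected = apply_relative(mz, 0.001)   # approximately [300.3, 600.6]

# Absolute correction: every retention time is shifted by the same lag.
rt_corrected = apply_absolute(rt, 0.5)     # [10.5, 50.5]
```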
Calculate the distance between two precursors for different columns. The distance can either be relative or absolute.
An example of a minimal offset_dict is: offset_dict = {'mass': 'absolute'}
Args:
    table_1 (pd.DataFrame): Dataframe with precursor data.
    table_2 (pd.DataFrame): Dataframe with precursor data.
    offset_dict (dict): Dictionary with column names and how the distance should be calculated.
    calib (bool): Flag to indicate that distances should be calculated on calibrated columns. Defaults to False.
Raises:
    KeyError: If either table_1 or table_2 is not indexed by precursor.
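A rough sketch of such a distance calculation follows. The function name `median_offsets` is hypothetical, and defining the relative distance as the difference divided by the mean of the two values is an assumption made for this example. Both tables are assumed to be indexed by precursor already.

```python
import pandas as pd

def median_offsets(table_1, table_2, offset_dict):
    """Median per-column offset over the precursors shared by both tables."""
    shared = table_1.index.intersection(table_2.index)
    deltas = {}
    for col, mode in offset_dict.items():
        a = table_1.loc[shared, col]
        b = table_2.loc[shared, col]
        if mode == "absolute":
            diff = a - b
        elif mode == "relative":
            # Assumption: normalize by the mean of both values.
            diff = (a - b) / ((a + b) / 2)
        else:
            raise NotImplementedError(f"Unknown offset type: {mode}")
        deltas[col] = diff.median()
    # The number of shared precursors can later serve as a weight.
    return pd.Series(deltas), len(shared)

table_1 = pd.DataFrame(
    {"rt": [10.0, 20.0, 30.0], "mz": [300.0, 400.0, 500.0]},
    index=["A", "B", "C"],
)
table_2 = pd.DataFrame(
    {"rt": [20.5, 30.5, 40.5], "mz": [400.0, 500.0, 600.0]},
    index=["B", "C", "D"],
)

delta, n_shared = median_offsets(
    table_1, table_2, {"rt": "absolute", "mz": "relative"}
)
```

In this toy data only precursors B and C are shared, so the retention time offset is the median of two differences of -0.5 minutes.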
Apply offset to a table. Different operations for offsets exist. Results are saved in a column with a '_calib' suffix. If this column does not already exist, it will be created.
Args:
    table_1 (pd.DataFrame): Dataframe with data.
    delta (pd.Series): Series containing the offset.
    offset_dict (dict): Dictionary with column names and how the distance should be calculated.
Raises:
    NotImplementedError: If the type of conversion is not implemented.
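Applying an offset could look roughly like the following sketch. The name `apply_offsets` is hypothetical, and the sign convention (subtracting the measured offset) is an assumption. Unlike an in-place update, this version works on a copy of the table.

```python
import pandas as pd

def apply_offsets(table, delta, offset_dict):
    """Write corrected values to '<column>_calib' columns of a copy of the table."""
    table = table.copy()
    for col, mode in offset_dict.items():
        calib_col = f"{col}_calib"
        # Create the calibrated column on first use.
        if calib_col not in table.columns:
            table[calib_col] = table[col]
        if mode == "absolute":
            table[calib_col] = table[calib_col] - delta[col]
        elif mode == "relative":
            table[calib_col] = (1 - delta[col]) * table[calib_col]
        else:
            raise NotImplementedError(f"Unknown offset type: {mode}")
    return table

table = pd.DataFrame({"rt": [10.0, 20.0], "mz": [300.0, 600.0]})
delta = pd.Series({"rt": 0.5, "mz": 0.001})
offset_dict = {"rt": "absolute", "mz": "relative"}

corrected = apply_offsets(table, delta, offset_dict)
```

Because the corrected values live in separate '_calib' columns, the original measurements stay available and the correction can be repeated on already calibrated data.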
Align multiple datasets. This function creates a matrix to represent the shifts from each dataset to another. This effectively is an overdetermined equation system and is solved with a linear regression.
Args:
    deltas (pd.DataFrame): Distances from each dataset to another.
    filenames (list): The filenames of the datasets that were compared.
    weights (np.ndarray, optional): Distances can be weighted by their number of shared elements. Defaults to None.
    n_jobs (optional): Number of processes to be used. Defaults to None (=1).
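The equation system can be sketched as follows. Each pairwise comparison of files i and j contributes one row with +1 in column i and -1 in column j, and the measured delta is the right-hand side. This sketch uses `numpy.linalg.lstsq` rather than a regression library, and the name `solve_offsets` is hypothetical. Since only differences are observed, the system is rank-deficient; `lstsq` returns the minimum-norm solution, whose offsets sum to zero.

```python
import itertools
import numpy as np

def solve_offsets(pair_deltas, n_files, weights=None):
    """Least-squares offset per file from pairwise deltas d_ij = o_i - o_j."""
    pairs = list(itertools.combinations(range(n_files), 2))
    A = np.zeros((len(pairs), n_files))
    for row, (i, j) in enumerate(pairs):
        A[row, i] = 1.0
        A[row, j] = -1.0
    b = np.asarray(pair_deltas, dtype=float)
    if weights is not None:
        # Weighted least squares: scale rows by the square root of the weights.
        w = np.sqrt(np.asarray(weights, dtype=float))
        A = A * w[:, None]
        b = b * w
    offsets, *_ = np.linalg.lstsq(A, b, rcond=None)
    return offsets

# Three files with pairs ordered (0, 1), (0, 2), (1, 2).
pair_deltas = [-0.5, 0.5, 1.0]
offsets = solve_offsets(pair_deltas, n_files=3)
weighted = solve_offsets(pair_deltas, n_files=3, weights=[10, 5, 1])
```

With three files there are three pairwise comparisons, and the number of comparisons grows quadratically with the number of files, which is what makes the system overdetermined.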
Wrapper function to calculate the distances of multiple files.
In here, we define the offset_dict to make a relative comparison for mz and mobility and absolute for rt.
TODO: This function could be sped up by parallelization.
Args:
    combos (list): A list containing tuples of filenames that should be compared.
    calib (bool): Boolean flag to indicate distance should be calculated on calibrated data.
    callback (Callable): A callback function to track progress.
Returns:
    pd.DataFrame: Dataframe containing the deltas of the files.
    np.ndarray: Numpy array containing the weights of each comparison (i.e. number of shared elements).
    dict: Offset dictionary which was used for comparing.
Args:
    filenames (list): A list with raw file names.
    alignment (pd.DataFrame): A pandas dataframe containing the alignment information.
    offset_dict (dict): Dictionary with column names and how the distance should be calculated.
Matching
Transfer MS2 identifications to similar MS1 features.
For “match-between-runs” we start by aligning the datasets. To create a reference for matching, we combine all datasets of a matching group. With the default settings, the matching group consists of all files. We then group the combined dataset by precursor and calculate each precursor's average properties (rt, mz, mobility). Because several files are combined, we can also calculate a standard deviation. This tells us where, and with what deviation, we would expect an MS1 feature for a given identification. This is our matching reference. In the matching step, we go through each dataset individually and check whether the reference contains precursors that were not identified in this dataset. We then perform a nearest-neighbor lookup to find out whether any MS1 features exist in close proximity to the reference. The distance metric we use is normalized by the median of the standard deviations. Lastly, we assess the confidence in a transferred identification by using the Mahalanobis distance.
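The reference building and lookup steps can be sketched with toy data. All column names and values here are made up for illustration, and the lookup uses a brute-force distance matrix for brevity; a k-d tree would be the more usual choice for large datasets.

```python
import numpy as np
import pandas as pd

cols = ["rt", "mz", "mobility"]

# Two runs in which precursors A and B were both identified.
run_1 = pd.DataFrame({"precursor": ["A", "B"], "rt": [10.0, 20.0],
                      "mz": [300.0, 400.0], "mobility": [0.80, 0.90]})
run_2 = pd.DataFrame({"precursor": ["A", "B"], "rt": [10.2, 20.2],
                      "mz": [300.002, 400.002], "mobility": [0.82, 0.92]})

# Reference: mean and standard deviation per precursor across the group.
grouped = pd.concat([run_1, run_2]).groupby("precursor")[cols]
ref_mean = grouped.mean()
ref_std = grouped.std()

# Normalize each dimension by the median standard deviation.
scale = ref_std.median()

# Third run: only A was identified, plus two unidentified MS1 features.
identified_in_run_3 = ["A"]
features = pd.DataFrame({"rt": [20.1, 35.0], "mz": [400.001, 400.001],
                         "mobility": [0.91, 0.91]})

# Precursors present in the reference but not identified in this run.
missing = ref_mean.index.difference(identified_in_run_3)

# Scaled Euclidean distance between each missing precursor and each feature.
query = (ref_mean.loc[missing] / scale).to_numpy()
candidates = (features[cols] / scale).to_numpy()
dist = np.linalg.norm(query[:, None, :] - candidates[None, :, :], axis=2)

nearest = dist.argmin(axis=1)       # index of the closest feature per precursor
nearest_dist = dist.min(axis=1)     # its scaled distance
```

Here precursor B is missing from the third run, and the first unidentified feature lies at B's reference position, so it is the transfer candidate. A distance cutoff would normally decide whether the candidate is accepted.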
Probability estimate of a transferred identification using the Mahalanobis distance.
The function calculates the probability that a feature is a reference feature. The reference features contain standard deviations so that a probability can be estimated.
It is required that the data frames are matched, meaning that the first entry in df matches the first entry in ref.
Args:
    df (pd.DataFrame): Dataset containing transferred features.
    ref (pd.DataFrame): Dataset containing reference features.
    sigma (pd.DataFrame): Dataset containing the standard deviations of the reference features.
    index (int): Index to the dataframes that should be compared.
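One possible way to turn a Mahalanobis distance into a probability-like score is sketched below. It assumes independent, normally distributed dimensions (a diagonal covariance built from the standard deviations) and uses the chi-squared survival function, so a feature at the exact reference position scores 1. The function name `mahalanobis_probability` is hypothetical, and whether the survival function or the cumulative distribution function is the appropriate choice depends on the convention in use.

```python
import numpy as np
from scipy import stats

def mahalanobis_probability(x, mu, sigma):
    """Score how well feature x agrees with reference mean mu and std sigma."""
    x, mu, sigma = (np.asarray(v, dtype=float) for v in (x, mu, sigma))
    # Squared Mahalanobis distance for a diagonal covariance matrix.
    d2 = np.sum(((x - mu) / sigma) ** 2)
    # Under the stated assumptions, d2 follows a chi-squared distribution
    # with one degree of freedom per dimension.
    return stats.chi2.sf(d2, df=len(mu))

mu = [20.1, 400.001, 0.91]       # reference rt, mz, mobility
sigma = [0.14, 0.0014, 0.014]    # reference standard deviations

p_exact = mahalanobis_probability(mu, mu, sigma)
p_far = mahalanobis_probability([20.8, 400.001, 0.91], mu, sigma)
```

The second feature deviates by five standard deviations in retention time, so its score is close to zero.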