Matching

Functions related to matching

Dataset Alignment

To match MS2 identifications to MS1 features, we first need to align the datasets to each other so that identifications can be transferred correctly. Datasets are aligned pairwise by comparing shared precursors and calculating the median offset. Comparing all files to each other yields an overdetermined linear equation system. Solving it gives offset parameters that minimize the shifts between all files. The offset is applied either relatively (mz, mobility) or absolutely (rt).
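As a sketch of this equation system: the symbols below are introduced for illustration only and are not taken from the implementation, and the sign convention is an assumption. Let d_ij be the median offset measured between files i and j on their shared precursors, x_i the unknown correction for file i, and w_ij an optional weight such as the number of shared precursors.

```latex
\begin{align}
d_{ij} &\approx x_j - x_i \qquad \text{for every compared pair } (i, j),\ i < j \\
\hat{x} &= \arg\min_{x} \sum_{i<j} w_{ij}\,\bigl(d_{ij} - (x_j - x_i)\bigr)^2
\end{align}
```

With n files there are up to n(n-1)/2 pairwise equations but only n unknowns, and only differences between the unknowns are constrained, so one of them can be fixed (for example x_1 = 0). The system is therefore overdetermined and is solved in the least-squares sense.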

Relative offset

For some parameters, we would like to have a relative correction of values. Consider the case of different mz-values, e.g. 300 and 600. If we assume that the offset is larger for larger m/z values, we would not want an absolute correction of e.g. +0.5 Da (300.5 and 600.5) but rather a relative correction of e.g. +0.1% (300.3 and 600.6).
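The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch; `apply_relative_offset` is a hypothetical helper name and not part of the documented API.

```python
import numpy as np

def apply_relative_offset(values, relative_offset):
    """Scale every value by (1 + relative_offset). Hypothetical helper."""
    return np.asarray(values, dtype=float) * (1.0 + relative_offset)

mz = [300.0, 600.0]
# A relative correction of +0.1 % shifts larger m/z values by a larger amount.
corrected_mz = apply_relative_offset(mz, 0.001)
print(corrected_mz)
```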

Absolute correction

In contrast to the relative correction, an absolute correction is sometimes more applicable. Consider retention time: here one would expect an absolute rather than a relative offset. As an example, consider a lag time of 0.5 minutes. This lag would be constant across all retention times and would not, for example, grow for later retention times.
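The lag-time example corresponds to adding the same constant to every value. This is a minimal sketch; `apply_absolute_offset` is a hypothetical helper name and not part of the documented API.

```python
import numpy as np

def apply_absolute_offset(values, offset):
    """Add the same constant offset to every value. Hypothetical helper."""
    return np.asarray(values, dtype=float) + offset

rt = [10.0, 50.0, 90.0]  # retention times in minutes
# A lag of 0.5 minutes shifts early and late retention times equally.
shifted_rt = apply_absolute_offset(rt, 0.5)
print(shifted_rt)
```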

calculate_distance

 calculate_distance (table_1:pandas.core.frame.DataFrame,
                     table_2:pandas.core.frame.DataFrame,
                     offset_dict:dict, calib:bool=False)

Calculate the distance between two precursors for different columns. The distance can be either relative or absolute.

An example of a minimal offset_dict is: offset_dict = {'mass':'absolute'}

Args: table_1 (pd.DataFrame): Dataframe with precursor data. table_2 (pd.DataFrame): Dataframe with precursor data. offset_dict (dict): Dictionary with column names and how the distance should be calculated. calib (bool): Flag to indicate that distances should be calculated on calibrated columns. Defaults to False.

Raises: KeyError: If either table_1 or table_2 is not indexed by precursor
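The idea of a per-column median distance over shared precursors can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `median_distance` is a hypothetical name, both tables are assumed to carry a unique precursor index, and the choice to normalize the relative distance by the mean of both values is an assumption.

```python
import pandas as pd

def median_distance(table_1, table_2, offset_dict):
    """Median distance per column over precursors present in both tables.

    Hypothetical sketch: returns the distances and the number of shared
    precursors, which could serve as a weight for the comparison.
    """
    shared = table_1.index.intersection(table_2.index)
    distances = {}
    for column, mode in offset_dict.items():
        a = table_1.loc[shared, column]
        b = table_2.loc[shared, column]
        if mode == "absolute":
            diff = a - b
        elif mode == "relative":
            # Assumption: normalize by the mean of both values.
            diff = (a - b) / ((a + b) / 2)
        else:
            raise NotImplementedError(f"Unknown distance mode: {mode}")
        distances[column] = diff.median()
    return pd.Series(distances), len(shared)

table_1 = pd.DataFrame(
    {"mass": [100.0, 200.0, 300.0], "rt": [1.0, 2.0, 3.0]},
    index=pd.Index(["A", "B", "C"], name="precursor"),
)
table_2 = pd.DataFrame(
    {"mass": [100.1, 200.2, 400.0], "rt": [1.5, 2.5, 9.0]},
    index=pd.Index(["A", "B", "D"], name="precursor"),
)

# Only precursors A and B are shared, so C and D do not influence the result.
dist, n_shared = median_distance(
    table_1, table_2, {"mass": "relative", "rt": "absolute"}
)
print(dist, n_shared)
```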

calib_table

 calib_table (table:pandas.core.frame.DataFrame,
              delta:pandas.core.series.Series, offset_dict:dict)

Apply an offset to a table. Different operations exist depending on the offset type. Results are saved in columns with a '_calib' suffix, which are created if they do not already exist.

Args: table (pd.DataFrame): Dataframe with data. delta (pd.Series): Series containing the offset. offset_dict (dict): Dictionary with column names and how the distance should be calculated.

Raises: NotImplementedError: If the type of conversion is not implemented.
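A minimal sketch of how such an offset could be applied is shown below. `apply_offsets` is a hypothetical name, and the sign convention (subtracting an absolute delta, scaling by `1 - delta` for a relative one) is an assumption made for illustration.

```python
import pandas as pd

def apply_offsets(table, delta, offset_dict):
    """Write corrected values to '<column>_calib' columns. Hypothetical sketch."""
    table = table.copy()  # leave the caller's table untouched
    for column, mode in offset_dict.items():
        if mode == "absolute":
            table[column + "_calib"] = table[column] - delta[column]
        elif mode == "relative":
            table[column + "_calib"] = table[column] * (1 - delta[column])
        else:
            raise NotImplementedError(f"Unknown conversion type: {mode}")
    return table

features = pd.DataFrame({"mz": [300.0, 600.0], "rt": [10.0, 20.0]})
delta = pd.Series({"mz": 0.001, "rt": 0.5})
calibrated = apply_offsets(features, delta, {"mz": "relative", "rt": "absolute"})
print(calibrated)
```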

align

 align (deltas:pandas.core.frame.DataFrame, filenames:list,
        weights:numpy.ndarray=None, n_jobs=None)

Align multiple datasets. This function creates a matrix to represent the shifts from each dataset to another. This effectively is an overdetermined equation system and is solved with a linear regression.

Args: deltas (pd.DataFrame): Distances from each dataset to another. filenames (list): The filenames of the datasets that were compared. weights (np.ndarray, optional): Distances can be weighted by their number of shared elements. Defaults to None. n_jobs (optional): Number of processes to be used. Defaults to None (=1).

Returns: np.ndarray: alignment values.
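To illustrate how pairwise shifts form an overdetermined system, here is a small least-squares sketch using NumPy. It is not the function's actual implementation: `solve_shifts` is a hypothetical name, files are referred to by integer position, and fixing the first file's shift at zero is an assumption made to obtain a unique solution.

```python
import numpy as np

def solve_shifts(pairs, deltas, n_files, weights=None):
    """Least-squares shifts x with x[0] = 0 such that x[j] - x[i] ~ delta.

    Hypothetical sketch. `pairs` holds (i, j) file positions, `deltas` the
    measured shift for each pair.
    """
    pairs = list(pairs)
    matrix = np.zeros((len(pairs), n_files - 1))
    for row, (i, j) in enumerate(pairs):
        # The first file is the fixed reference and gets no column.
        if j > 0:
            matrix[row, j - 1] += 1.0
        if i > 0:
            matrix[row, i - 1] -= 1.0
    target = np.asarray(deltas, dtype=float)
    if weights is not None:
        # Weighted least squares: scale rows by the square root of the weight.
        w = np.sqrt(np.asarray(weights, dtype=float))
        matrix = matrix * w[:, None]
        target = target * w
    solution, *_ = np.linalg.lstsq(matrix, target, rcond=None)
    return np.concatenate([[0.0], solution])

# Three files give three pairwise comparisons but only two free unknowns.
pairs = [(0, 1), (0, 2), (1, 2)]
deltas = [1.0, 3.0, 2.0]
shifts = solve_shifts(pairs, deltas, n_files=3, weights=[10, 5, 8])
print(shifts)
```

Because the three measured shifts in this toy example are mutually consistent, the solution reproduces them exactly. With noisy measurements the result is the best compromise in the least-squares sense.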

calculate_deltas

 calculate_deltas (combos:list, calib:bool=False, callback:Callable=None)

Wrapper function to calculate the distances of multiple files.

In here, we define the offset_dict to make a relative comparison for mz and mobility and absolute for rt.

TODO: This function could be sped up by parallelization.

Args: combos (list): A list containing tuples of filenames that should be compared. calib (bool): Boolean flag to indicate distance should be calculated on calibrated data. callback (Callable): A callback function to track progress.

Returns: pd.DataFrame: Dataframe containing the deltas of the files. np.ndarray: Numpy array containing the weights of each comparison (i.e. number of shared elements). dict: Offset dictionary which was used for comparing.
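The inputs described above can be sketched as follows. The file names, the exact column names in `offset_dict`, and the convention that the callback receives a fraction between 0 and 1 are all assumptions made for illustration.

```python
from itertools import combinations

filenames = ["file_a", "file_b", "file_c"]  # hypothetical file names

# Every unordered pair of files is compared once.
combos = list(combinations(filenames, 2))

# Relative comparison for mz and mobility, absolute for rt (column names assumed).
offset_dict = {"mz": "relative", "mobility": "relative", "rt": "absolute"}

progress = []

def callback(fraction):
    """Record progress. The fraction convention is an assumption."""
    progress.append(fraction)

for i, (first, second) in enumerate(combos):
    # A real implementation would load and compare the two files here.
    callback((i + 1) / len(combos))

print(combos)
```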

align_datasets

 align_datasets (settings:dict, callback:callable=None)

Wrapper function that aligns all experimental files specified in a settings file.

Args: settings (dict): Settings dictionary that specifies the raw files to align. callback (Callable): Callback function to indicate progress.

align_files

 align_files (filenames:list, alignment:pandas.core.frame.DataFrame,
              offset_dict:dict)

Wrapper function that aligns a list of files.

Args: filenames (list): A list with raw file names. alignment (pd.DataFrame): A pandas dataframe containing the alignment information. offset_dict (dict): Dictionary with column names and how the distance should be calculated.

Matching

Transfer MS2 identifications to similar MS1 features.

For “match-between-runs”, we start by aligning the datasets. To create the reference used for matching, we combine all datasets of a matching group. With the default settings, the matching group consists of all files. We then group the combined dataset by precursor and calculate each precursor's average properties (rt, mz, mobility). Because several files are combined, we can also calculate a standard deviation. This tells us where, and with what deviation, we would expect an MS1 feature, together with the corresponding identification. This is our matching reference.

In the matching step, we go through each dataset individually and check whether the reference contains precursors that were not identified in this dataset. We then perform a nearest-neighbor lookup to find MS1 features in close proximity to the reference. The distance metric is normalized by the median standard deviation. Lastly, we assess the confidence in a transferred identification using the Mahalanobis distance.
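The reference construction and lookup described above can be sketched with pandas and NumPy. All data, column names and the helper `nearest_feature` are hypothetical. The brute-force distance search stands in for whatever nearest-neighbor structure the implementation uses, since a tree-based lookup would be more typical for large feature tables.

```python
import numpy as np
import pandas as pd

# Hypothetical identifications pooled from the aligned files of one matching group.
ids = pd.DataFrame({
    "precursor": ["A", "B", "A", "B", "C", "C"],
    "rt": [10.0, 20.0, 10.2, 20.4, 30.0, 30.2],
    "mz": [500.0, 600.0, 500.002, 600.004, 700.0, 700.002],
})

cols = ["rt", "mz"]
grouped = ids.groupby("precursor")[cols]
ref_mean = grouped.mean()   # where each precursor is expected
ref_std = grouped.std()     # how much each precursor varies between files
scale = ref_std.median()    # median standard deviation per column

# Hypothetical MS1 features of a dataset in which precursor "C" was not identified.
features = pd.DataFrame({"rt": [30.1, 55.0], "mz": [700.001, 650.0]})

def nearest_feature(precursor):
    """Return (row position, scaled distance) of the closest MS1 feature."""
    target = ref_mean.loc[precursor, cols].to_numpy() / scale.to_numpy()
    scaled = features[cols].to_numpy() / scale.to_numpy()
    dist = np.linalg.norm(scaled - target, axis=1)
    best = int(np.argmin(dist))
    return best, float(dist[best])

best, distance = nearest_feature("C")
print(best, distance)
```

Dividing by the median standard deviation puts rt and mz on a comparable scale, so that a single distance threshold can be applied across columns.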

get_probability

 get_probability (df:pandas.core.frame.DataFrame,
                  ref:pandas.core.frame.DataFrame,
                  sigma:pandas.core.frame.DataFrame, index:int)

Probability estimate of a transferred identification using the Mahalanobis distance.

The function calculates the probability that a feature is a reference feature. The reference features contain standard deviations so that a probability can be estimated.

It is required that the data frames are matched, meaning that the first entry in df matches to the first entry in ref.

Args: df (pd.DataFrame): Dataset containing transferred features. ref (pd.DataFrame): Dataset containing reference features. sigma (pd.DataFrame): Dataset containing the standard deviations of the reference features. index (int): Index to the dataframes that should be compared.

Returns: float: Probability estimate based on the Mahalanobis distance.

# Example usage (assumes get_probability is already available in the session)
import pandas as pd

a = pd.DataFrame({'mass':[100,200,300],'rt':[1,2,3]})
b = pd.DataFrame({'mass':[100,200,302],'rt':[1,2.5,3]})
std = pd.DataFrame({'mass':[0.1,0.1,0.1],'rt':[1,1,1]})

print(f"First element: (ideal match): {get_probability(a, b, std, 0):.2f}")
print(f"Second element: (rt slightly off): {get_probability(a, b, std, 1):.2f}")
print(f"Third element: (mass completely off): {get_probability(a, b, std, 2):.2f}")

First element: (ideal match): 0.00
Second element: (rt slightly off): 0.12
Third element: (mass completely off): 1.00
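One way to turn a Mahalanobis distance into a value between 0 and 1 is the chi-squared cumulative distribution function. The sketch below assumes independent columns (a diagonal covariance built from the standard deviations) and uses the closed form for exactly two dimensions. Both helper names are hypothetical, and whether the library computes the value in exactly this way is not stated here. For more than two columns, `scipy.stats.chi2.cdf` could be used instead.

```python
import math
import numpy as np

def mahalanobis_squared(x, mu, std):
    """Squared Mahalanobis distance for independent columns. Hypothetical helper."""
    z = (np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)) / np.asarray(std, dtype=float)
    return float(np.dot(z, z))

def chi2_cdf_two_dof(d_squared):
    """Chi-squared CDF for two degrees of freedom (closed form)."""
    return 1.0 - math.exp(-d_squared / 2.0)

# Columns are (mass, rt), mirroring the three cases of the example above.
p_ideal = chi2_cdf_two_dof(mahalanobis_squared([100, 1], [100, 1], [0.1, 1]))
p_rt_off = chi2_cdf_two_dof(mahalanobis_squared([200, 2], [200, 2.5], [0.1, 1]))
p_mass_off = chi2_cdf_two_dof(mahalanobis_squared([300, 3], [302, 3], [0.1, 1]))
print(f"{p_ideal:.2f} {p_rt_off:.2f} {p_mass_off:.2f}")
```

Under these assumptions, a value near 0 means the feature lies very close to the reference, and a value near 1 means it lies far outside the expected deviation.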

match_datasets

 match_datasets (settings:dict, callback:Callable=None)

Match datasets: Wrapper function to match datasets based on a settings file. This implementation uses matching groups but not fractions.

Args: settings (dict): Dictionary containing specifications of the run. callback (Callable): Callback function to indicate progress.

convert_decoy

 convert_decoy (float_)

Utility function to convert type for decoy after grouping.