This notebook provides functions to format input data for sequence plotting.

The formatted input data is returned as a pandas DataFrame containing the following columns:

  • unique_protein_id: a single UniProt accession
  • modified_sequence: modified peptide sequence
  • naked_sequence: naked peptide sequence
  • all_protein_ids: all UniProt IDs the peptide maps to, separated by ';'
  • start: peptide start position on protein sequence
  • end: peptide end position on protein sequence
  • PTMsites: list with PTM positions
  • PTMtypes: list with PTM types

Split protein group into unique protein accessions

The all_protein_ids column in df is split on ';' so that each UniProt accession gets its own row. The exploded dataframe has a new column, unique_protein_id.

extract_uniprot_id[source]

extract_uniprot_id(protein_id:str)

Extract the unique UniProt entry ID from an unusually formatted protein_id.
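The source does not say what the unusual format looks like. As a minimal sketch, assume it is a pipe-delimited FASTA-style identifier such as `sp|P12345|NAME_HUMAN`, with the accession in the second field. The function name `extract_accession_sketch` and the example identifiers are hypothetical and are not part of the library.

```python
def extract_accession_sketch(protein_id: str) -> str:
    """Return the accession from 'db|ACCESSION|ENTRY_NAME', else the input unchanged.

    Assumption: the 'unusual' format is a pipe-delimited identifier.
    """
    parts = protein_id.split("|")
    # A pipe-delimited identifier carries the accession in its second field.
    if len(parts) >= 2:
        return parts[1]
    # Anything else is treated as an already clean accession.
    return protein_id


print(extract_accession_sketch("sp|P12345|NAME_HUMAN"))  # P12345
print(extract_accession_sketch("Q67890"))                # Q67890
```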

expand_protein_ids[source]

expand_protein_ids(df:DataFrame)

Function to split protein groups in 'all_protein_ids' by ';' into separate rows. The resulting dataframe has a new column 'unique_protein_id'.

Args:
    df (pd.DataFrame): Experimental data that was imported by the 'import_data' function.
Returns:
    pd.DataFrame: Exploded dataframe with a new column 'unique_protein_id'.
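The split-and-explode step can be sketched with plain pandas. This is an illustration of the idea under stated assumptions, not the library's implementation: the name `expand_protein_ids_sketch`, the peptide sequences, and the accessions are made up for the example.

```python
import pandas as pd


def expand_protein_ids_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Give each accession in 'all_protein_ids' its own row."""
    out = df.copy()
    # Split 'A;B' into ['A', 'B'] ...
    out["unique_protein_id"] = out["all_protein_ids"].str.split(";")
    # ... then turn each list element into a separate row.
    return out.explode("unique_protein_id").reset_index(drop=True)


df = pd.DataFrame({
    "naked_sequence": ["PEPTIDEK", "SEQVENCER"],
    "all_protein_ids": ["P12345;Q67890", "P12345"],
})
expanded = expand_protein_ids_sketch(df)
print(expanded[["naked_sequence", "all_protein_ids", "unique_protein_id"]])
```

The first peptide maps to two accessions and therefore appears twice in the result, once per accession. The original 'all_protein_ids' value is kept on every row.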

Annotate peptides with start and end position

The get_peptide_position function annotates a peptide's start and end position in the given protein sequence.

pep_position_helper[source]

pep_position_helper(seq:str, prot:str, fasta:pyteomics.fasta, verbose:bool=True)

Helper function for 'get_peptide_position'.

Args:
    seq (str): Naked peptide sequence.
    prot (str): UniProt protein accession.
    fasta (fasta): Fasta file imported by pyteomics 'fasta.IndexedUniProt'.
    verbose (bool, optional): Flag to print warnings if no matching sequence is found for a protein in the provided fasta. Defaults to 'True'.
Returns:
    [int, int]: Peptide start position and peptide end position.

get_peptide_position[source]

get_peptide_position(df:DataFrame, fasta:pyteomics.fasta, verbose:bool=True)

Function to get start and end position of each peptide in the given protein.

Args:
    df (pd.DataFrame): Experimental data that was imported by the 'import_data' function and processed by 'expand_protein_ids'.
    fasta (fasta): Fasta file imported by pyteomics 'fasta.IndexedUniProt'.
    verbose (bool, optional): Flag to print warnings if no matching sequence is found for a protein in the provided fasta. Defaults to 'True'.
Returns:
    pd.DataFrame: Dataframe with new columns 'start' and 'end', indicating the start and end index position of the peptide sequence.
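The position lookup amounts to finding the naked peptide as a substring of the protein sequence. The sketch below makes several assumptions that the source does not confirm: a plain dict stands in for the indexed FASTA file, positions are zero-based with an inclusive end, and only the first occurrence is reported. The name `peptide_position_sketch` and the protein sequence are hypothetical.

```python
from typing import Dict, Optional, Tuple


def peptide_position_sketch(
    seq: str, prot: str, proteome: Dict[str, str]
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of seq in the protein, or None if it cannot be placed.

    Assumptions: zero-based indices, inclusive end, first occurrence only.
    """
    protein_seq = proteome.get(prot)
    if protein_seq is None:
        return None  # accession is missing from the sequence collection
    start = protein_seq.find(seq)
    if start == -1:
        return None  # peptide does not occur in this protein
    return start, start + len(seq) - 1


# Stand-in for an indexed FASTA file: accession -> protein sequence.
proteome = {"P12345": "MKTAYIAKQRQISFVK"}
print(peptide_position_sketch("AYIAK", "P12345", proteome))  # (3, 7)
print(peptide_position_sketch("AYIAK", "Q67890", proteome))  # None
```

Returning None for an unmatched peptide or a missing accession is one possible convention; the documented functions instead offer a 'verbose' flag to print a warning in that situation.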

Annotate each peptide with PTM site indices and modification types

The get_modifications function annotates sequence positions and modification types of all PTMs on a peptide.

get_ptm_sites[source]

get_ptm_sites(peptide:str, modification_reg:str)

Function to get sequence positions of all PTMs of a peptide in the given protein.

Args:
    peptide (str): Modified peptide sequence.
    modification_reg (str): Regular expression for the modifications.
Returns:
    list: List of integers with PTM site location indices on the peptide.
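One way to derive site indices from a regular expression is to walk over the matches and subtract the annotation text seen so far. The sketch below assumes that each modification tag directly follows the residue it modifies, as in `PEPS[Phospho]IDEK`, and that indices are zero-based on the naked sequence. Neither assumption is confirmed by the source; `ptm_sites_sketch` and the bracket notation are illustrative only.

```python
import re
from typing import List


def ptm_sites_sketch(peptide: str, modification_reg: str) -> List[int]:
    """Return zero-based indices of modified residues on the naked sequence.

    Assumption: every tag matched by modification_reg follows its residue.
    """
    sites = []
    removed = 0  # number of annotation characters seen before the current match
    for match in re.finditer(modification_reg, peptide):
        # Residue just before the tag, shifted back by the annotation text so far.
        sites.append(match.start() - removed - 1)
        removed += match.end() - match.start()
    return sites


print(ptm_sites_sketch("PEPS[Phospho]IDET[Phospho]K", r"\[.*?\]"))  # [3, 7]
```

Under this convention, a tag placed before the first residue would yield -1, so N-terminal modifications would need separate handling.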

get_modifications[source]

get_modifications(df:DataFrame, mod_reg:str)

Function to get sequence positions and modification types of all PTMs of a peptide in the given protein.

Args:
    df (pd.DataFrame): Experimental data that was imported by the 'import_data' function and processed by 'expand_protein_ids'.
    mod_reg (str): Regular expression for the modifications.
Returns:
    pd.DataFrame: Dataframe with new columns 'PTMsites' and 'PTMtypes' containing lists of PTM site indices and modification types, respectively.
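At the dataframe level, the annotation can be sketched as two column-wise applications over 'modified_sequence'. This is a hedged illustration, not the library's code: it assumes bracketed tags that follow the modified residue, zero-based indices on the naked sequence, and the matched tag text as the modification type. The name `get_modifications_sketch` and the example sequences are hypothetical.

```python
import re
from typing import List

import pandas as pd


def get_modifications_sketch(df: pd.DataFrame, mod_reg: str) -> pd.DataFrame:
    """Add 'PTMsites' and 'PTMtypes' list columns derived from 'modified_sequence'."""

    def sites(peptide: str) -> List[int]:
        out, removed = [], 0
        for m in re.finditer(mod_reg, peptide):
            out.append(m.start() - removed - 1)  # residue preceding the tag
            removed += len(m.group())            # skip the tag text afterwards
        return out

    def types(peptide: str) -> List[str]:
        # Full matched text of each tag, in order of appearance.
        return [m.group() for m in re.finditer(mod_reg, peptide)]

    out = df.copy()
    out["PTMsites"] = out["modified_sequence"].apply(sites)
    out["PTMtypes"] = out["modified_sequence"].apply(types)
    return out


df = pd.DataFrame({"modified_sequence": ["PEPS[Phospho]IDEK", "SEQVENCER"]})
annotated = get_modifications_sketch(df, r"\[.*?\]")
print(annotated)
```

Unmodified peptides receive empty lists in both columns, which keeps the two columns aligned element by element for every row.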

Preprocessing wrapper

The format_input_data function is a wrapper for all previous functions described in this notebook. It calls the following functions:

  • expand_protein_ids: to split protein groups into unique UniProt accessions
  • get_peptide_position: to annotate each peptide with its start and end position on its respective protein sequence
  • get_modifications: to get the sequence positions and modification types of all PTMs of a peptide

format_input_data[source]

format_input_data(df:DataFrame, fasta:pyteomics.fasta, modification_exp:str, verbose:bool=True)

Function to format input data and to annotate sequence start and end positions plus PTM sites.

Args:
    df (pd.DataFrame): Experimental data that was imported by the 'import_data' function.
    fasta (fasta): Fasta file imported by pyteomics 'fasta.IndexedUniProt'.
    modification_exp (str): Regular expression for the modifications.
    verbose (bool, optional): Flag to print warnings if no matching sequence is found for a protein in the provided fasta. Defaults to 'True'.
Returns:
    pd.DataFrame: Dataframe with unique UniProt accessions, sequence start and end positions, and PTM site information.