Data preprocessing

This notebook provides functions to format input data for sequence plotting.

The formatted input data is returned as a pandas dataframe containing the following columns:

unique_protein_id: a single UniProt accession
modified_sequence: modified peptide sequence
naked_sequence: naked peptide sequence
all_protein_ids: all UniProt IDs the peptide maps to, separated by ';'
start: peptide start position on protein sequence
end: peptide end position on protein sequence
PTMsites: list with PTM positions
PTMtypes: list with PTM types

Split protein group into unique protein accessions

The all_protein_ids column in df is split by ';' so that each UniProt accession gets its own row. The exploded dataframe has a new column unique_protein_id.

`extract_uniprot_id`[source]

extract_uniprot_id(protein_id:str)

Extract the unique UniProt entry ID from an unusually formatted protein_id.
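
As a minimal sketch of what such an extraction could look like: the page does not specify which ID formats are handled, so the example below assumes the common pipe-delimited FASTA header style ('sp|ACCESSION|ENTRY_NAME'). The function name accession_from_header_id and the example IDs are hypothetical and are not part of this module.

```python
def accession_from_header_id(protein_id: str) -> str:
    """Return the accession from a pipe-delimited ID, or the ID unchanged.

    Assumption: IDs look like 'sp|ACCESSION|ENTRY_NAME'. Plain accessions
    without a pipe are passed through as they are.
    """
    parts = protein_id.split("|")
    if len(parts) >= 2:
        # The accession is the second pipe-delimited field.
        return parts[1]
    return protein_id


# Hypothetical placeholder IDs, for illustration only.
from_header = accession_from_header_id("sp|P11111|TEST_HUMAN")
from_plain = accession_from_header_id("P11111")
```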

`expand_protein_ids`[source]

expand_protein_ids(df:DataFrame)

Function to split protein groups in 'all_protein_ids' by ';' into separate rows. The resulting dataframe has a new column 'unique_protein_id'.

Args:
df (pd.DataFrame): Experimental data that was imported by the 'import_data' function.

Returns:
pd.DataFrame: Exploded dataframe with a new column 'unique_protein_id'.
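
The split-and-explode step described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the implementation of expand_protein_ids: the helper name explode_protein_groups and the accessions in the toy dataframe are hypothetical.

```python
import pandas as pd


def explode_protein_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Split 'all_protein_ids' on ';' and emit one row per accession."""
    out = df.copy()
    # Turn 'A;B' into ['A', 'B'] so that explode can create one row each.
    out["unique_protein_id"] = out["all_protein_ids"].str.split(";")
    # The original 'all_protein_ids' value is kept on every resulting row.
    return out.explode("unique_protein_id").reset_index(drop=True)


# Toy input with placeholder accessions.
df = pd.DataFrame({
    "naked_sequence": ["PEPTIDEK", "SEQK"],
    "all_protein_ids": ["P11111;P22222", "P33333"],
})
expanded = explode_protein_groups(df)
```

In this sketch the peptide that maps to two accessions appears twice in the result, once per accession, while the single-accession peptide is left as one row.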

Annotate peptides with start and end position

The get_peptide_position function annotates a peptide's start and end position in the given protein sequence.

`pep_position_helper`[source]

pep_position_helper(seq:str, prot:str, fasta:pyteomics.fasta, verbose:bool=True)

Helper function for 'get_peptide_position'.

Args:
seq (str): Naked peptide sequence.
prot (str): UniProt protein accession.
fasta (fasta): Fasta file imported by pyteomics 'fasta.IndexedUniProt'.
verbose (bool, optional): Flag to print warnings if no matching sequence is found for a protein in the provided fasta. Defaults to 'True'.

Returns:
[int, int]: Peptide start position and peptide end position.

`get_peptide_position`[source]

get_peptide_position(df:DataFrame, fasta:pyteomics.fasta, verbose:bool=True)

Function to get start and end position of each peptide in the given protein.

Args:
df (pd.DataFrame): Experimental data that was imported by the 'import_data' function and processed by 'expand_protein_ids'.
fasta (fasta): Fasta file imported by pyteomics 'fasta.IndexedUniProt'.
verbose (bool, optional): Flag to print warnings if no matching sequence is found for a protein in the provided fasta. Defaults to 'True'.

Returns:
pd.DataFrame: Dataframe with new columns 'start' and 'end', indicating the start and end index position of the peptide sequence.
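
To make the position annotation concrete, here is a small sketch of locating a naked peptide inside a protein sequence. It makes several assumptions that this page does not confirm: positions are 0-based, the end position is inclusive, and only the first occurrence is reported. A plain dictionary stands in for the indexed fasta, and the function name, accession and sequence are hypothetical.

```python
from typing import Optional, Tuple


def find_peptide_position(
    naked_sequence: str, protein_sequence: str
) -> Tuple[Optional[int], Optional[int]]:
    """Return (start, end) of the first match, or (None, None) if absent.

    Assumption: 0-based indexing with an inclusive end position.
    """
    start = protein_sequence.find(naked_sequence)
    if start == -1:
        # The peptide does not occur in this protein sequence.
        return None, None
    end = start + len(naked_sequence) - 1
    return start, end


# A dict as a stand-in for the fasta lookup (placeholder accession and sequence).
protein_sequences = {"P11111": "MKTAYIAKQR"}
start, end = find_peptide_position("AYIAK", protein_sequences["P11111"])
missing = find_peptide_position("WWW", protein_sequences["P11111"])
```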

Annotate each peptide with PTM site indices and modification types

The get_modifications function annotates sequence positions and modification types of all PTMs on a peptide.

`get_ptm_sites`[source]

get_ptm_sites(peptide:str, modification_reg:str)

Function to get sequence positions of all PTMs of a peptide in the given protein.

Args:
peptide (str): Modified peptide sequence.
modification_reg (str): Regular expression for the modifications.

Returns:
list: List of integers with PTM site location indices on the peptide.

`get_modifications`[source]

get_modifications(df:DataFrame, mod_reg:str)

Function to get sequence positions and modification types of all PTMs of a peptide in the given protein.

Args:
df (pd.DataFrame): Experimental data that was imported by the 'import_data' function and processed by 'expand_protein_ids'.
mod_reg (str): Regular expression for the modifications.

Returns:
pd.DataFrame: Dataframe with new columns 'PTMsites' and 'PTMtypes' containing lists of PTM site indices and modification types, respectively.
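
The following sketch shows one way PTM sites and types could be recovered from a modified sequence with a regular expression. It is not the implementation of get_ptm_sites or get_modifications. It assumes that each modification is written in square brackets directly after the modified residue, and that sites are 0-based indices on the naked sequence. The function name, example peptide and regular expression are hypothetical.

```python
import re
from typing import List, Tuple


def ptm_sites_and_types(
    modified_sequence: str, modification_reg: str
) -> Tuple[List[int], List[str]]:
    """Collect PTM positions (on the naked sequence) and annotation strings."""
    sites, types = [], []
    removed = 0  # annotation characters seen before the current match
    for match in re.finditer(modification_reg, modified_sequence):
        # The modified residue sits directly before the annotation; subtract
        # the annotation characters already passed to index the naked sequence.
        # Note: an annotation at the very start of the string would yield -1.
        sites.append(match.start() - removed - 1)
        types.append(match.group())
        removed += len(match.group())
    return sites, types


# Hypothetical peptide and regex matching '[...]' annotations.
sites, types = ptm_sites_and_types("AAS[Phospho]PEM[Oxidation]K", r"\[.*?\]")
```

Here the naked sequence is AASPEMK, so the sketch reports the serine at index 2 and the methionine at index 5.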

Preprocessing wrapper

The format_input_data function is a wrapper for all previous functions described in this notebook. It calls the following functions:

expand_protein_ids: to split protein groups into unique UniProt accessions
get_peptide_position: to annotate each peptide with its start and end position on its respective protein sequence
get_modifications: to get the sequence positions and modification types of all PTMs of a peptide

`format_input_data`[source]

format_input_data(df:DataFrame, fasta:pyteomics.fasta, modification_exp:str, verbose:bool=True)

Function to format input data and to annotate sequence start and end positions plus PTM sites.

Args:
df (pd.DataFrame): Experimental data that was imported by the 'import_data' function.
fasta (fasta): Fasta file imported by pyteomics 'fasta.IndexedUniProt'.
modification_exp (str): Regular expression for the modifications.
verbose (bool, optional): Flag to print warnings if no matching sequence is found for a protein in the provided fasta. Defaults to 'True'.

Returns:
pd.DataFrame: Dataframe with unique UniProt accessions, sequence start and end positions, and PTM site information.
