This notebook provides functions to format input data for sequence plotting.
The formatted input data is returned as a pandas dataframe containing the following columns:
- unique_protein_id: a single UniProt accession
- modified_sequence: modified peptide sequence
- naked_sequence: naked peptide sequence
- all_protein_ids: all UniProt IDs the peptide maps to, separated by ';'
- start: peptide start position on protein sequence
- end: peptide end position on protein sequence
- PTMsites: list with PTM positions
- PTMtypes: list with PTM types
The expand_protein_ids function splits the all_protein_ids column of df by ';' so that each UniProt accession gets its own row. The exploded dataframe has a new column, unique_protein_id.
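
The splitting step can be sketched with pandas' `str.split` and `explode`. This is a minimal illustration and not necessarily how expand_protein_ids is implemented; the function name `expand_protein_ids_sketch` and the accessions in the example are placeholders.

```python
import pandas as pd


def expand_protein_ids_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: return one row per UniProt accession."""
    out = df.copy()
    # Split the ';'-separated protein group into a list of accessions.
    out["unique_protein_id"] = out["all_protein_ids"].str.split(";")
    # explode() repeats all other columns once per list element.
    return out.explode("unique_protein_id").reset_index(drop=True)


# Illustrative input: the first peptide maps to two accessions.
example = pd.DataFrame({
    "all_protein_ids": ["P12345;Q67890", "P11111"],
    "naked_sequence": ["PEPTIDEK", "SEQK"],
})
expanded = expand_protein_ids_sketch(example)
```

The all_protein_ids column is kept unchanged, so each exploded row still records the full protein group it came from.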
The get_peptide_position function annotates a peptide's start and end position in the given protein sequence.
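
As an illustration of the position lookup, the sketch below searches for the naked peptide sequence in a protein sequence with `str.find`. The helper name `find_peptide_position`, the 0-based indexing, and the inclusive end position are assumptions made for this example; the conventions used by get_peptide_position may differ.

```python
from typing import Optional, Tuple


def find_peptide_position(
    peptide: str, protein_sequence: str
) -> Optional[Tuple[int, int]]:
    """Hypothetical sketch: locate a peptide in a protein sequence.

    Returns (start, end) as 0-based positions with an inclusive end
    (an assumption of this sketch), or None if the peptide is absent.
    """
    start = protein_sequence.find(peptide)  # first occurrence only
    if start == -1:
        return None
    end = start + len(peptide) - 1
    return start, end


position = find_peptide_position("PEPT", "MKPEPTIDE")
missing = find_peptide_position("XXX", "MKPEPTIDE")
```

Note that `str.find` reports only the first occurrence, so a peptide that appears several times in the same protein would need extra handling.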
The get_modifications function annotates sequence positions and modification types of all PTMs on a peptide.
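
The sketch below shows one way such an annotation could be extracted. It assumes a modified-sequence notation in which each PTM is written in square brackets directly after the modified residue (for example `S[Phospho]`); this notation, the helper name `parse_modifications`, and the 0-based peptide-relative positions are assumptions for illustration and may not match the format get_modifications expects.

```python
import re
from typing import List, Tuple


def parse_modifications(modified_sequence: str) -> Tuple[List[int], List[str]]:
    """Hypothetical sketch: return (PTM sites, PTM types) for a peptide.

    Sites are 0-based indices of the modified residue within the naked
    peptide (an assumption of this sketch).
    """
    sites: List[int] = []
    types: List[str] = []
    residue_count = 0
    # Each token is either a bracketed annotation or a single residue letter.
    for token in re.finditer(r"\[([^\]]+)\]|([A-Z])", modified_sequence):
        if token.group(1) is not None:
            # The annotation refers to the residue that was just counted.
            sites.append(residue_count - 1)
            types.append(token.group(1))
        else:
            residue_count += 1
    return sites, types


sites, types = parse_modifications("AS[Phospho]PEPM[Oxidation]K")
```

Adding the peptide's start position to each site would convert these peptide-relative indices into positions on the protein sequence.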
The format_input_data function is a wrapper for all previous functions described in this notebook. It calls the following functions:
- expand_protein_ids: to split protein groups into unique UniProt accessions
- get_peptide_position: to annotate each peptide with its start and end position on its respective protein sequence
- get_modifications: to get the sequence positions and modification types of all PTMs of a peptide
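
The order of the three steps can be sketched end to end as follows. This is a simplified stand-in, not the actual format_input_data: the function name, the `protein_sequences` dictionary (accession to sequence), the bracket notation for PTMs, the 0-based positions, and the use of -1 for peptides that are not found are all assumptions of this example.

```python
import re
import pandas as pd


def format_input_data_sketch(df: pd.DataFrame, protein_sequences: dict) -> pd.DataFrame:
    """Hypothetical sketch of the wrapper's three steps."""
    # Step 1: one row per accession.
    out = df.assign(unique_protein_id=df["all_protein_ids"].str.split(";"))
    out = out.explode("unique_protein_id").reset_index(drop=True)

    # Step 2: 0-based start and inclusive end; -1 marks "not found".
    starts = [
        protein_sequences.get(acc, "").find(pep)
        for acc, pep in zip(out["unique_protein_id"], out["naked_sequence"])
    ]
    out["start"] = starts
    out["end"] = [
        s + len(pep) - 1 if s >= 0 else -1
        for s, pep in zip(starts, out["naked_sequence"])
    ]

    # Step 3: PTM sites and types, assuming "[Type]" follows the residue.
    def parse(modified_sequence):
        sites, types, n = [], [], 0
        for m in re.finditer(r"\[([^\]]+)\]|([A-Z])", modified_sequence):
            if m.group(1) is not None:
                sites.append(n - 1)
                types.append(m.group(1))
            else:
                n += 1
        return sites, types

    parsed = [parse(s) for s in out["modified_sequence"]]
    out["PTMsites"] = [p[0] for p in parsed]
    out["PTMtypes"] = [p[1] for p in parsed]
    return out


# Toy input with placeholder accessions "A" and "B".
peptides = pd.DataFrame({
    "all_protein_ids": ["A;B", "B"],
    "modified_sequence": ["PES[Phospho]K", "TIDE"],
    "naked_sequence": ["PESK", "TIDE"],
})
proteins = {"A": "MMPESKAA", "B": "GGPESKTIDE"}
formatted = format_input_data_sketch(peptides, proteins)
```

Expanding the protein groups first means the position lookup runs once per peptide and accession pair, so a shared peptide can receive a different start and end on each protein it maps to.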