This notebook contains functions to import a UniProt annotation file and format it as a pandas dataframe for further use in alphamap.

The preprocessed UniProt annotation includes information about:

  • the known preprocessing events for proteins, such as signal peptide, transit peptide, propeptide, chain, peptide;
  • all post-translational modifications available in UniProt, such as modified residues (Phosphorylation, Methylation, Acetylation, etc.), Lipidation, Glycosylation, etc.;
  • information on sequence similarities with other proteins and the domain(s) present in a protein, such as domain, repeat, region, motif, etc.;
  • information on the secondary and tertiary structure of proteins, such as turn, beta strand, helix.

Instructions on how to download a UniProt annotation file

  1. Go to the UniProt website (https://www.uniprot.org/uniprot/) and click on the organism of interest in the "Popular organisms" section.
  2. Click the "Download" button and select "Text" format.
  3. Select the "Compressed" radio button and click "Go".
  4. Unzip the downloaded file and specify the path to this file.
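Step 4 can also be done from Python. The following is a minimal sketch, assuming the compressed download is a gzip archive (`.gz`); the function name `decompress_gz` and the file names in the usage note are hypothetical and not part of alphamap.

```python
import gzip
import shutil


def decompress_gz(gz_path: str, out_path: str) -> str:
    """Decompress a gzip file to out_path and return out_path.

    Hypothetical helper, assuming the download is a .gz archive.
    """
    # Stream the content so that large annotation files are not
    # loaded into memory at once.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, `decompress_gz("uniprot_download.txt.gz", "uniprot_download.txt")` would write the plain-text file whose path is then passed to the preprocessing function.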

Helper functions

extract_note

extract_note(string:str, splitted:bool=False)

Helper function to extract the note of a protein from a UniProt annotation string using a regular expression.

Args:
    string (str): Uniprot annotation string.
    splitted (bool, optional): Flag to allow linebreaks. Default is 'False'.
Returns:
    str: Extracted string of the uniprot note section.

extract_note_end

extract_note_end(string:str, has_mark:bool=True)

Helper function to extract the end of a protein's note from a UniProt annotation string using a regular expression.

Args:
    string (str): Uniprot annotation string.
    has_mark (bool, optional): Flag if end quotation marks are present. Default is 'True'.
Returns:
    str: Extracted string of the uniprot note section.
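To illustrate the idea behind the two note helpers, here is a minimal sketch of how a `/note="..."` qualifier could be pulled out of a flat-file feature line with regular expressions. It is not alphamap's implementation: the function names `note_start_sketch` and `note_end_sketch`, the patterns, and the assumed line layout (an `FT` prefix followed by the qualifier) are all assumptions made for illustration.

```python
import re


def note_start_sketch(line: str, splitted: bool = False) -> str:
    """Return the text after /note=" on a feature line (hypothetical helper).

    With splitted=True the closing quote is assumed to be on a later
    line, so everything up to the end of this line is returned.
    """
    pattern = r'/note="(.*)' if splitted else r'/note="([^"]*)"'
    match = re.search(pattern, line)
    return match.group(1).strip() if match else ""


def note_end_sketch(line: str, has_mark: bool = True) -> str:
    """Return the note text from a continuation line (hypothetical helper).

    With has_mark=True the text up to the closing quote is returned;
    otherwise the whole line content after the 'FT' prefix is returned.
    """
    content = re.sub(r"^FT\s+", "", line).strip()
    if has_mark:
        return content.split('"', 1)[0]
    return content
```

Under these assumptions, a note that wraps over two lines would be reassembled by joining the result of `note_start_sketch(first_line, splitted=True)` with `note_end_sketch(second_line)`.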

resolve_unclear_position

resolve_unclear_position(value:str)

Resolve an unclear start/end position of a modification: a position given as '?' is replaced with -1, and a position given as '?N', '>N' or '<N' is resolved to N by removing the '?', '>' or '<' sign.

Args:
    value (str): Unclear sequence position from uniprot.
Returns:
    float: Resolved sequence position.
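The rule described for this function can be sketched in a few lines. This is an illustrative reimplementation based only on the description on this page, not the library's source; the name `resolve_position_sketch` is hypothetical.

```python
def resolve_position_sketch(value: str) -> float:
    """Resolve an unclear UniProt position (hypothetical helper).

    '?' alone means the position is unknown and maps to -1.
    '?N', '>N' and '<N' map to N by dropping the leading sign.
    """
    value = value.strip()
    if value == "?":
        return -1.0
    # Remove any leading uncertainty markers, then convert to float.
    return float(value.lstrip("?<>"))
```

For example, `resolve_position_sketch("?")` gives `-1.0` and `resolve_position_sketch(">100")` gives `100.0`.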

extract_positions

extract_positions(posit_string:str)

Extract the isoform_id (str) and the start/end positions (float) of any feature key from the position string.

Args:
    posit_string (str): Uniprot position string.
Returns:
    [str, float, float]: str: Uniprot isoform accession, float: start position, float: end position
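As an illustration of the kind of parsing involved, the sketch below splits a position string into an isoform accession, a start and an end. It assumes position strings of the forms `N`, `N..M` and `ISOFORM:N..M`, and it returns an empty string when no isoform is given; both are assumptions, as is the name `positions_sketch`. The actual alphamap function may behave differently.

```python
from typing import List, Union


def _to_float(value: str) -> float:
    """Convert a position to float; '?' alone becomes -1 (assumed rule)."""
    value = value.strip()
    return -1.0 if value == "?" else float(value.lstrip("?<>"))


def positions_sketch(posit_string: str) -> List[Union[str, float]]:
    """Split a position string into [isoform_id, start, end] (hypothetical)."""
    isoform_id = ""
    # An isoform accession, if present, is assumed to precede a colon.
    if ":" in posit_string:
        isoform_id, posit_string = posit_string.split(":", 1)
    # A range is written as 'start..end'; a single residue has no '..'.
    if ".." in posit_string:
        start, end = posit_string.split("..", 1)
    else:
        start = end = posit_string
    return [isoform_id, _to_float(start), _to_float(end)]
```

With these assumptions, `positions_sketch("P12345-2:10..20")` yields `["P12345-2", 10.0, 20.0]`, and a single-residue feature such as `"5"` yields the same value for start and end.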

Uniprot preprocessing function

preprocess_uniprot

preprocess_uniprot(path_to_file:str)

End-to-end function to preprocess UniProt data: it takes the path to a flat text file and returns a dataframe containing information about:

- protein_id(str)
- feature(category)
- isoform_id(str)
- start(float)
- end(float)
- note information(str)

Args:
    path_to_file (str): Path to a .txt annotation file directly downloaded from uniprot.
Returns:
    pd.DataFrame: Dataframe with formatted uniprot annotations for alphamap.
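To make the shape of the output concrete, the following sketch parses a small piece of flat text into one record per feature, using the column names listed above. It is a simplified illustration under stated assumptions, not alphamap's implementation: it assumes the flat-file layout with `AC` accession lines, `FT` feature lines, `/note="..."` qualifier lines and `//` entry terminators, it handles only single-line notes, and the name `parse_flat_text_sketch` is hypothetical.

```python
import re
from typing import Dict, List

# Feature line: 'FT', three spaces, feature key, whitespace, location.
FEATURE_RE = re.compile(r"^FT   (\S+)\s+(\S+)\s*$")
# Qualifier line carrying a complete, single-line note.
NOTE_RE = re.compile(r'^FT\s+/note="([^"]*)"')


def _pos(value: str) -> float:
    """'?' alone becomes -1; leading '?', '<', '>' are dropped."""
    return -1.0 if value == "?" else float(value.lstrip("?<>"))


def parse_flat_text_sketch(text: str) -> List[Dict[str, object]]:
    """Collect one record per feature line (hypothetical helper)."""
    rows: List[Dict[str, object]] = []
    protein_id = ""
    for line in text.splitlines():
        if line.startswith("AC   ") and not protein_id:
            # Use the first accession of the first AC line of the entry.
            protein_id = line[5:].split(";")[0].strip()
        elif line.startswith("//"):
            # End of entry: reset for the next protein.
            protein_id = ""
        else:
            feature = FEATURE_RE.match(line)
            note = NOTE_RE.match(line)
            if feature:
                key, location = feature.groups()
                isoform_id = ""
                if ":" in location:
                    isoform_id, location = location.split(":", 1)
                start, _, end = location.partition("..")
                rows.append({
                    "protein_id": protein_id,
                    "feature": key,
                    "isoform_id": isoform_id,
                    "start": _pos(start),
                    "end": _pos(end or start),
                    "note": "",
                })
            elif note and rows:
                # Attach the note to the most recent feature.
                rows[-1]["note"] = note.group(1)
    return rows
```

A list of records like this can be turned into a dataframe with `pandas.DataFrame(rows)`; the real function additionally stores the feature column as a categorical type, as listed above.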

UniProt feature dictionary

The following is a dictionary that maps feature names to the feature entries in the processed uniprot annotation file.