alphapepttools.tl.get_id2gene_map


alphapepttools.tl.get_id2gene_map(fasta_input, source_type='file')

Reannotate protein groups with gene names from a FASTA input.

The function tries to extract the UniProt ID from the second '|'-delimited field of a standard FASTA header (see the example below) and takes the gene name from whatever follows the 'GN=' tag in the header, up to the next whitespace (matching via regex r"GN=([^\s]+)"). The FASTA file is typically the one that was used during the search step.
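The header parsing described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the library's actual code: the helper name sketch_id2gene_map and the HEADER_ID pattern are hypothetical, and the gene-name pattern is assumed to be GN=([^\s]+), i.e. everything after 'GN=' up to the next whitespace.

```python
import re

# Hypothetical patterns for illustration; the library's internals may differ.
HEADER_ID = re.compile(r"^>[^|]*\|([^|]+)\|")  # second '|'-delimited field
GENE_NAME = re.compile(r"GN=([^\s]+)")         # text after 'GN=' up to whitespace


def sketch_id2gene_map(fasta_text: str) -> dict[str, str]:
    """Map UniProt IDs to gene names, falling back to the ID itself."""
    id2gene = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        id_match = HEADER_ID.match(line)
        if id_match is None:
            continue  # header does not follow the 'db|ID|NAME' layout
        uniprot_id = id_match.group(1)
        gene_match = GENE_NAME.search(line)
        # Use the UniProt ID as the fallback when no 'GN=' tag is present
        id2gene[uniprot_id] = gene_match.group(1) if gene_match else uniprot_id
    return id2gene
```

Headers that lack a 'GN=' tag still yield an entry in this sketch, with the UniProt ID standing in for the gene name, which mirrors the fallback described under Returns.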

Parameters:
  • fasta_input (str | Path) – If source_type is ‘file’ (default), this is interpreted as a filepath to a FASTA file. If source_type is ‘string’, this is parsed directly as FASTA-formatted text (multi-line, with headers and sequences).

  • source_type (str, optional) – Specifies the source type of the FASTA input, either ‘file’ or ‘string’. Defaults to ‘file’.

Return type:

dict[str, str]

Returns:

A dictionary mapping UniProt IDs to gene names. If no gene name is found, the UniProt ID is used as fallback.

Examples

Example for string FASTA input:

fasta_string = '''\
>tr|ID0|ID0_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN0 PE=1 SV=1
PEPTIDEKPEPTIDEK
>tr|ID1|ID1_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN1 PE=1 SV=1
PEPTIDEKPEPTIDEK
'''

alphapepttools.tl.get_id2gene_map(fasta_string, source_type="string")
> {'ID0': 'GN0', 'ID1': 'GN1'}
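
One way the returned dictionary might then be applied to reannotate protein groups is sketched below. It assumes that a protein group is a string of UniProt IDs joined by ';', which is a common convention but is not stated in this documentation; the helper annotate_group is hypothetical and not part of the package.

```python
# Mapping as returned in the example above
id2gene = {"ID0": "GN0", "ID1": "GN1"}


def annotate_group(protein_group: str, id2gene: dict[str, str], sep: str = ";") -> str:
    """Replace each UniProt ID in a group with its gene name, keeping unknown IDs."""
    return sep.join(id2gene.get(pid, pid) for pid in protein_group.split(sep))


print(annotate_group("ID0;ID1;ID9", id2gene))  # GN0;GN1;ID9
```

IDs missing from the mapping are left unchanged, so the output always has as many entries as the input group.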