alphapepttools.tl.get_id2gene_map
- alphapepttools.tl.get_id2gene_map(fasta_input, source_type='file')
Reannotate protein groups with gene names from a FASTA input.
The function tries to extract UniProt IDs from the second '|'-separated field of a standard FASTA header (see the example below) and takes the gene name from whatever follows the 'GN=' tag in the header (matched via the regex r"GN=([^\s]+)"). The FASTA file typically corresponds to the file that was used during the search step.
- Parameters:
fasta_input (str | Path) – If source_type is ‘file’ (default), this is interpreted as a filepath to a FASTA file. If source_type is ‘string’, this is parsed directly as FASTA-formatted text (multi-line, with headers and sequences).
source_type (str, optional) – Specifies the source type of the FASTA input, either ‘file’ or ‘string’. Defaults to ‘file’.
- Return type:
- Returns:
A dictionary mapping UniProt IDs to gene names. If no gene name is found, the UniProt ID is used as fallback.
Examples
Example for string FASTA input:
fasta_string = '''\
>tr|ID0|ID0_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN0 PE=1 SV=1
PEPTIDEKPEPTIDEK
>tr|ID1|ID1_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN1 PE=1 SV=1
PEPTIDEKPEPTIDEK
'''
alphapepttools.tl.get_id2gene_map(fasta_string, source_type="string")
> {'ID0': 'GN0', 'ID1': 'GN1'}
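
Illustration of the header parsing and the fallback behaviour:
The sketch below reproduces the documented logic in plain Python, so the ID extraction, the 'GN=' match, and the fallback to the UniProt ID can be inspected in isolation. It is not the library's implementation: the name sketch_id2gene_map is hypothetical, and skipping headers that contain no '|' separator is an assumption made here, since the documentation does not say how such headers are handled.

```python
import re

# Gene name = everything after 'GN=' up to the next whitespace character.
GENE_NAME_PATTERN = re.compile(r"GN=([^\s]+)")


def sketch_id2gene_map(fasta_text: str) -> dict[str, str]:
    """Hypothetical re-implementation of the documented mapping logic."""
    id2gene = {}
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence line or blank line, not a header

        # Header layout assumed here: >db|UNIPROT_ID|ENTRY_NAME description
        fields = line[1:].split("|")
        if len(fields) < 2:
            continue  # assumption: headers without '|' are skipped

        uniprot_id = fields[1]
        match = GENE_NAME_PATTERN.search(line)
        # Fall back to the UniProt ID when the header has no 'GN=' tag.
        id2gene[uniprot_id] = match.group(1) if match else uniprot_id
    return id2gene


fasta_text = """\
>tr|ID0|ID0_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN0 PE=1 SV=1
PEPTIDEKPEPTIDEK
>tr|ID2|ID2_HUMAN Protein2 OS=Homo sapiens OX=9606 PE=1 SV=1
PEPTIDEKPEPTIDEK
"""
result = sketch_id2gene_map(fasta_text)
print(result)  # {'ID0': 'GN0', 'ID2': 'ID2'}
```

The second header has no 'GN=' tag, so its UniProt ID doubles as the gene name, matching the fallback described under Returns.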