alphapepttools.tl.map_genes_to_protein_groups

alphapepttools.tl.map_genes_to_protein_groups(id2gene_map, protein_groups, delimiter=';')

Map gene names to protein groups using the provided id2gene_map.

Protein groups may consist of multiple UniProt IDs, separated by a delimiter. This function iterates over each protein group and assigns the corresponding unique genes to the protein group.
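
The behaviour described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name map_groups_sketch is hypothetical, and it assumes that genes are kept in order of first appearance, that IDs missing from the mapping are skipped, and that multiple genes are joined with the same delimiter.

```python
def map_groups_sketch(id2gene_map, protein_groups, delimiter=";"):
    """Hypothetical re-implementation, for illustration only."""
    gene_names = []
    for group in protein_groups:
        genes = []
        # Split the protein group into its individual UniProt IDs
        for protein_id in group.split(delimiter):
            gene = id2gene_map.get(protein_id)
            # Assumption: skip unmapped IDs and keep each gene only once
            if gene is not None and gene not in genes:
                genes.append(gene)
        # "NA" marks a protein group for which no gene name was found
        gene_names.append(delimiter.join(genes) if genes else "NA")
    return gene_names


id2gene_map = {"ID0": "GN0", "ID1": "GN1", "ID2": "GN1", "ID3": "GN3", "ID4": "GN4"}
result = map_groups_sketch(id2gene_map, ["ID0", "ID1;ID2", "ID3;ID4", "ID9"])
print(result)  # ['GN0', 'GN1', 'GN3;GN4', 'NA']
```

Here "ID1;ID2" collapses to a single "GN1" because both IDs map to the same gene, and "ID9" yields "NA" because it is absent from the mapping.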

Parameters:
  • id2gene_map (dict) – Dictionary mapping UniProt IDs to gene names

  • protein_groups (list) – List containing protein group identifiers, where each identifier may consist of multiple UniProt IDs

  • delimiter (str, optional) – Delimiter used to separate UniProt IDs in the protein group identifiers, by default “;”

Examples

You can map a list of protein groups (one or more UniProt IDs each) to gene names:

from alphapepttools.tl import map_genes_to_protein_groups

id2gene_map = {"ID0": "GN0", "ID1": "GN1", "ID2": "GN1", "ID3": "GN3", "ID4": "GN4"}
protein_groups = ["ID0", "ID1;ID2", "ID3;ID4"]
map_genes_to_protein_groups(id2gene_map, protein_groups, delimiter=";")
# ["GN0", "GN1", "GN3;GN4"]

To add gene names to an AnnData object, use the get_id2gene_map() function to create a mapping from a FASTA file or string, then assign the mapped gene names to a column of adata.var:

from alphapepttools.tl.tools import get_id2gene_map, map_genes_to_protein_groups

fasta = '''\
>tr|ID0|ID0_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN0 PE=1 SV=1
PEPTIDEKPEPTIDEK
>tr|ID1|ID1_HUMAN Protein1 OS=Homo sapiens OX=9606 GN=GN1 PE=1 SV=1
PEPTIDEKPEPTIDEK
'''
mapping = get_id2gene_map(fasta, source_type="string")
mapping
# {'ID0': 'GN0', 'ID1': 'GN1'}

# adata is an existing AnnData object whose var_names are protein group identifiers
adata.var
# Empty DataFrame
# Columns: []
# Index: [ID0, ID1]

adata.var["gene_id"] = map_genes_to_protein_groups(
    id2gene_map=mapping, protein_groups=adata.var_names
)

Return type:

list[str]

Returns:

List of gene names corresponding to each protein group identifier. If no gene name could be found, “NA” is returned.
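
Because unmapped protein groups are marked with the string "NA" rather than a missing value, they can be found with a simple string comparison. The snippet below is a small sketch that uses hand-written, illustrative values shaped like the return value; it does not call the library.

```python
protein_groups = ["ID0", "ID1;ID2", "ID9"]
# Illustrative values shaped like the function's return value (not real output)
gene_names = ["GN0", "GN1", "NA"]

# Collect the protein groups for which no gene name was found
unmapped = [pg for pg, gene in zip(protein_groups, gene_names) if gene == "NA"]
print(unmapped)  # ['ID9']
```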