alphapepttools.io.AnnDataFactory

class alphapepttools.io.AnnDataFactory(psm_df, intensity_column, sample_id_column, feature_id_column)

Factory class to convert AlphaBase PSM DataFrames to AnnData format.

Methods table

create_anndata([var_columns, obs_columns])

Create AnnData object from PSM DataFrame.

from_files(file_paths[, reader_type, level, ...])

Create AnnDataFactory from PSM files.

Methods

AnnDataFactory.create_anndata(var_columns=None, obs_columns=None)

Create AnnData object from PSM DataFrame.

Parameters:
  • var_columns (str | list[str] | None (default: None)) – Additional columns to include in var of the AnnData object.

  • obs_columns (str | list[str] | None (default: None)) – Additional columns to include in obs of the AnnData object.

Return type:

AnnData

Returns:

AnnData object where:
  • obs (rows) are samples
  • var (columns) are features (e.g., proteins, peptides, or genes)
  • X contains intensity values
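
The layout described above (samples as rows, features as columns, intensities as the matrix values) corresponds to reshaping a long-format table into a wide one. The following is a minimal pandas-only sketch of that reshaping, assuming one intensity value per sample and feature pair. It is not the library's implementation, and the column names are taken from the example data used on this page.

```python
import pandas as pd

# Long-format table: one row per (sample, feature) measurement
long_df = pd.DataFrame(
    {
        "raw_name": ["sample1"] * 3 + ["sample2"] * 3,
        "protein_group": ["PROT1", "PROT2", "PROT3"] * 2,
        "intensity": [100, 200, 150, 120, 210, 160],
    }
)

# Wide matrix: rows are samples (obs), columns are features (var),
# values are intensities (the role X plays in AnnData)
wide = long_df.pivot(index="raw_name", columns="protein_group", values="intensity")

print(wide.shape)
print(list(wide.index))
print(list(wide.columns))
```

In this sketch the row index plays the role of the obs names and the column index the role of the var names.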

Examples

import pandas as pd
from alphapepttools.io.anndata_factory import AnnDataFactory

# Create sample data with metadata
df = pd.DataFrame(
    {
        "raw_name": ["sample1"] * 3 + ["sample2"] * 3,
        "protein_group": ["PROT1", "PROT2", "PROT3"] * 2,
        "intensity": [100, 200, 150, 120, 210, 160],
        "gene_names": ["GENE1", "GENE2", "GENE3"] * 2,
        "condition": ["control"] * 3 + ["treated"] * 3,
    }
)

factory = AnnDataFactory(
    psm_df=df, intensity_column="intensity", sample_id_column="raw_name", feature_id_column="protein_group"
)

# Create AnnData with metadata
adata = factory.create_anndata(
    var_columns=["gene_names"],  # Add gene names to var
    obs_columns=["condition"],  # Add condition to obs
)

print(adata.shape)  # (2, 3) - 2 samples, 3 proteins
print(adata.var["gene_names"])  # Gene annotations
print(adata.obs["condition"])  # Sample conditions

classmethod AnnDataFactory.from_files(file_paths, reader_type='maxquant', level='proteins', *, intensity_column=None, feature_id_column=None, sample_id_column=None, additional_columns=None, **reader_kwargs)

Create AnnDataFactory from PSM files.

Parameters:
  • file_paths (str | list[str]) – Path(s) to PSM file(s)

  • reader_type (str (default: 'maxquant')) – Type of PSM reader to use.

  • level (str (default: 'proteins')) – Level of quantification to read. One of "proteins", "precursors", or "genes".

  • intensity_column (str | None (default: None)) – Name of the column storing intensity data. If None, the default is taken from psm_reader.yaml.

  • feature_id_column (str | None (default: None)) – Name of the column storing feature ids. If None, the default is taken from psm_reader.yaml.

  • sample_id_column (str | None (default: None)) – Name of the column storing sample ids. If None, the default is taken from psm_reader.yaml.

  • additional_columns (list[str] | None (default: None)) – Names of additional columns from the PSM table to retain for experiment-specific metadata. These columns can be added to the resulting AnnData object as annotations. Note that if a column has a higher cardinality than the feature_id_column (i.e., multiple values per feature), only the first value encountered will be kept.

  • **reader_kwargs – Additional arguments passed to the PSM reader.
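
The note on additional_columns above means that a column with several distinct values per feature is reduced to a single value per feature. A minimal pandas-only sketch of that "first value encountered" behavior follows. It illustrates the consequence for the resulting annotations and is not the library's actual code; the table contents are hypothetical.

```python
import pandas as pd

# Hypothetical PSM-like table: PROT1 is supported by two different sequences
psm_df = pd.DataFrame(
    {
        "protein_group": ["PROT1", "PROT1", "PROT2"],
        "sequence": ["PEPTIDEA", "PEPTIDEB", "PEPTIDEC"],
    }
)

# Keep one row per feature, retaining the first value encountered
per_feature = psm_df.drop_duplicates(subset="protein_group", keep="first").set_index(
    "protein_group"
)

print(per_feature)
```

Here PEPTIDEB is dropped for PROT1, so a column like this is only a reliable annotation when it is constant within each feature.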

Return type:

AnnDataFactory

Returns:

Initialized AnnDataFactory instance

Examples

from alphapepttools.io.anndata_factory import AnnDataFactory

# Load DIA-NN data at protein level
# Assumes a DIA-NN report called "report.tsv" exists in the current directory

factory = AnnDataFactory.from_files("report.tsv", reader_type="diann", level="proteins")
adata = factory.create_anndata()

# Load with custom column names and additional metadata columns
factory = AnnDataFactory.from_files(
    "report.tsv",  # same DIA-NN report as above
    reader_type="diann",
    intensity_column="Precursor.Quantity",
    additional_columns=["Precursor.Quantity"],  # additional columns need to be specified here
)
adata = factory.create_anndata(
    var_columns=["charge", "sequence"]
)  # Add charge and stripped sequence via their alphabase-standardized column names in var
print(adata.var)  # Check that additional columns are included in var