alphapepttools.pp.filter_data_completeness

alphapepttools.pp.filter_data_completeness(adata, max_missing, group_column=None, groups=None, action='flag', var_colname='passed_threshold_missing_values')

Filter features based on missing values.

Filters AnnData features (columns) based on their fraction of missing values. If group_column and groups are provided, missingness is computed only over the observations that belong to the selected levels of that metadata column. This is especially useful for imbalanced classes, where filtering by global missingness may leave too many missing values in the smaller class.
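
The effect of restricting missingness to selected groups can be sketched with plain NumPy. This illustrates the idea under stated assumptions and is not the library's implementation; the helper name missing_fraction and the toy data are hypothetical.

```python
import numpy as np


def missing_fraction(X, labels=None, groups=None):
    """Per-column fraction of NaN values.

    If `labels` and `groups` are given, only rows whose label is in
    `groups` are counted. Hypothetical helper, not part of alphapepttools.
    """
    X = np.asarray(X, dtype=float)
    if labels is not None and groups is not None:
        X = X[np.isin(labels, groups)]
    return np.isnan(X).mean(axis=0)


# Imbalanced toy data: 8 observations in class "big", 2 in class "small".
labels = np.array(["big"] * 8 + ["small"] * 2)
X = np.ones((10, 2))
X[8:, 0] = np.nan  # feature 0 is never observed in the small class

global_frac = missing_fraction(X)
small_frac = missing_fraction(X, labels, groups=["small"])

print(global_frac)  # feature 0 is 20 % missing overall
print(small_frac)   # but 100 % missing within "small"
```

With a global threshold such as 0.3, feature 0 would pass even though it carries no information for the smaller class; restricting the calculation to that class exposes the problem.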

(To filter rows instead, it is recommended to transpose the adata object before calling this function and to transpose the result back afterwards.)
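
Why the transpose approach works can be seen on a bare matrix: filtering rows by missingness is the same as filtering the columns of the transposed matrix and transposing back. The sketch below uses a hypothetical NumPy stand-in, drop_incomplete_columns, in place of the library function; with an AnnData object, the transposed object is available as adata.T.

```python
import numpy as np


def drop_incomplete_columns(X, max_missing):
    """Keep columns whose NaN fraction is not greater than `max_missing`.

    Hypothetical stand-in for a column-wise completeness filter.
    """
    keep = ~(np.isnan(X).mean(axis=0) > max_missing)
    return X[:, keep]


X = np.array(
    [
        [1.0, np.nan, np.nan],  # row 0: 2 of 3 values missing
        [2.0, 5.0, 6.0],        # row 1: complete
        [3.0, np.nan, 9.0],     # row 2: 1 of 3 values missing
    ]
)

# Transpose so rows become columns, filter, then transpose back.
rows_filtered = drop_incomplete_columns(X.T, max_missing=0.5).T
print(rows_filtered.shape)
```

Only row 0 exceeds the 0.5 threshold, so the result keeps rows 1 and 2 with all three columns intact.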

Parameters:
  • adata (AnnData) – AnnData object whose features are to be filtered.

  • max_missing (float) – Maximum fraction of missing values allowed. A feature fails only if its fraction of missing values is strictly greater than max_missing, i.e. if max_missing is 0.6 and the fraction of missing values is 0.6, the feature is kept. A strict "greater than" comparison is used so that max_missing = 0.0 is meaningful: it filters for 100 % data completeness.

  • group_column (str | None (default: None)) – Column in obs to determine groups for filtering.

  • groups (list[str] | None (default: None)) – List of levels of the group_column to consider in filtering. E.g. if the column has the levels [‘A’, ‘B’, ‘C’], and groups = [‘A’, ‘B’], only missingness of features in these groups is considered. If None, all groups are considered.

  • action (str (default: 'flag')) – Action to perform, either ‘flag’ (default) or ‘drop’. If ‘flag’, a boolean column is added to adata.var indicating whether each feature passed the missingness threshold. If ‘drop’, features that do not pass the threshold are removed from the AnnData object.

  • var_colname (str (default: 'passed_threshold_missing_values')) – Name of the adata.var boolean column to add if action is ‘flag’. Default is ‘passed_threshold_missing_values’.
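
The threshold semantics and the two actions described above can be sketched as follows. This is a minimal illustration with NumPy and pandas, assuming missing values are encoded as NaN; the variable names are hypothetical and the code is not the library's implementation.

```python
import numpy as np
import pandas as pd

X = np.array(
    [
        [1.0, np.nan, np.nan],
        [2.0, np.nan, np.nan],
        [3.0, np.nan, np.nan],
        [4.0, 8.0, np.nan],
        [5.0, 9.0, 7.0],
    ]
)
var = pd.DataFrame(index=["f1", "f2", "f3"])  # stands in for adata.var

max_missing = 0.6
frac_missing = np.isnan(X).mean(axis=0)  # 0.0, 0.6, 0.8
passed = ~(frac_missing > max_missing)   # strict comparison: exactly 0.6 is kept

# 'flag': record the result as a boolean column and keep all features.
flagged = var.assign(passed_threshold_missing_values=passed)

# 'drop': keep only the features that passed.
X_dropped = X[:, passed]
var_dropped = var.loc[passed]

print(flagged)
print(list(var_dropped.index))
```

Here f2 sits exactly on the threshold and is retained, while f3 is flagged as failing or dropped, depending on the action.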

Return type:

AnnData

Returns:

AnnData object with either a new boolean adata.var column added (if action is ‘flag’) or the failing features removed (if action is ‘drop’).

Examples

Flag features with too many missing values:

import anndata as ad
import pandas as pd
import numpy as np
from alphapepttools.pp.data import filter_data_completeness

# Create data with missing values
X = np.array(
    [
        [1.0, np.nan, 3.0, 4.0],
        [2.0, np.nan, 6.0, 8.0],
        [3.0, 5.0, np.nan, 12.0],
        [4.0, 6.0, 9.0, 16.0],
    ]
)
adata = ad.AnnData(
    X=X,
    obs=pd.DataFrame({"group": ["A", "A", "B", "B"]}),
    var=pd.DataFrame(index=["prot1", "prot2", "prot3", "prot4"]),
)

# Flag features with >30% missing values
adata = filter_data_completeness(adata, max_missing=0.3, action="flag")

# Drop features with >30% missing in group A only
adata = filter_data_completeness(adata, max_missing=0.3, group_column="group", groups=["A"], action="drop")

Filter by group-specific completeness:

# Consider missingness only in specific groups
adata = filter_data_completeness(adata, max_missing=0.5, group_column="group", groups=["A", "B"], action="flag")
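
The missing fractions behind the calls above can be checked by hand. The sketch below recomputes them with NumPy on the same example matrix; it only applies the documented "greater than" rule to the raw numbers and does not call alphapepttools.

```python
import numpy as np

X = np.array(
    [
        [1.0, np.nan, 3.0, 4.0],
        [2.0, np.nan, 6.0, 8.0],
        [3.0, 5.0, np.nan, 12.0],
        [4.0, 6.0, 9.0, 16.0],
    ]
)
group = np.array(["A", "A", "B", "B"])
names = np.array(["prot1", "prot2", "prot3", "prot4"])

# Fraction of missing values per feature, globally and within group A.
frac_all = np.isnan(X).mean(axis=0)
frac_a = np.isnan(X[group == "A"]).mean(axis=0)

# Features exceeding max_missing=0.3 under each view.
fail_all = names[frac_all > 0.3]
fail_a = names[frac_a > 0.3]

print(dict(zip(names.tolist(), frac_all.tolist())))
print(fail_all.tolist(), fail_a.tolist())
```

Globally, prot2 is 50 % missing and prot3 is 25 % missing, so only prot2 exceeds the 0.3 threshold; within group A, prot2 is entirely missing and again is the only feature above the threshold.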