alphapepttools.pp.filter_data_completeness
- alphapepttools.pp.filter_data_completeness(adata, max_missing, group_column=None, groups=None, keep_strategy='all', action='flag', var_colname='passed_threshold_missing_values')
Filter features based on missing values.
Operates globally, or per-group when group_column is set.
- Parameters:
  - adata (AnnData) – AnnData object.
  - max_missing (float) – Maximum fraction of missing values allowed to pass filtering, in the interval [0.0, 1.0]. Features with a fraction of missing values greater than (>) max_missing are filtered out.
  - group_column (str | None (default: None)) – Column name in adata.obs defining groups for group-wise filtering. If None (default), computes missingness across all samples. If specified, computes statistics separately for each group. This is useful to retain features that are exclusive to a specific sample group.
  - groups (list[str] | None (default: None)) – List of levels of the group_column to consider in filtering. E.g. if the column has the levels ['A', 'B', 'C'], and groups = ['A', 'B'], only missingness of features in these groups is considered. If None, all groups are considered.
  - keep_strategy (Literal['any', 'all'] (default: 'all')) – Only relevant for groupwise filtering.
    - all: keep a feature only if it passes the threshold in every group.
    - any: keep a feature if it passes the threshold in at least one group.
  - action (Literal['flag', 'drop'] (default: 'flag')) – Action to perform. Can be flag (default) or drop. If flag, a boolean column in adata.var is added to indicate whether the feature passed the missingness threshold. If drop, features that do not pass the threshold are dropped from the AnnData object.
  - var_colname (str (default: 'passed_threshold_missing_values')) – Name of the adata.var boolean column to add if action is flag.
- Return type: AnnData
- Returns: AnnData object with either a new adata.var column added (if flag) or filtered features (if drop).
Examples
Flag features with too many missing values:
    import anndata as ad
    import pandas as pd
    import numpy as np
    import alphapepttools as apt

    # Create data with missing values
    X = np.array(
        [[1.0, np.nan, 3.0, 4.0],
         [2.0, np.nan, 6.0, 8.0],
         [3.0, 5.0, np.nan, 12.0],
         [4.0, 6.0, 9.0, 16.0]]
    )
    adata = ad.AnnData(
        X=X,
        obs=pd.DataFrame({"group": ["A", "A", "B", "B"]}),
        var=pd.DataFrame(index=["prot1", "prot2", "prot3", "prot4"]),
    )

    # Flag features with >30% missing values
    adata = apt.pp.filter_data_completeness(adata, max_missing=0.3, action="flag")

    # Drop features with >30% missing in group A only
    adata = apt.pp.filter_data_completeness(
        adata, max_missing=0.3, group_column="group", groups=["A"], action="drop"
    )
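To make the threshold rule concrete, the following sketch hand-checks the example matrix with plain NumPy. It is not the library's implementation; it only assumes the rule stated in the parameter description (a feature is removed when its fraction of missing values is strictly greater than max_missing). The variable names are illustrative.

```python
import numpy as np

# Same matrix as in the example above: 4 samples (rows) x 4 features (columns)
X = np.array(
    [[1.0, np.nan, 3.0, 4.0],
     [2.0, np.nan, 6.0, 8.0],
     [3.0, 5.0, np.nan, 12.0],
     [4.0, 6.0, 9.0, 16.0]]
)
features = ["prot1", "prot2", "prot3", "prot4"]
group = np.array(["A", "A", "B", "B"])
max_missing = 0.3

# Global check: fraction of missing values per feature across all samples
frac_missing_all = np.isnan(X).mean(axis=0)
# A feature passes unless its missing fraction is strictly greater than max_missing
passed_all = frac_missing_all <= max_missing

# Check restricted to group A (the first two samples)
frac_missing_a = np.isnan(X[group == "A"]).mean(axis=0)
passed_a = frac_missing_a <= max_missing

for name, f_all, p_all, f_a, p_a in zip(
    features, frac_missing_all, passed_all, frac_missing_a, passed_a
):
    print(f"{name}: all samples {f_all:.2f} (pass={p_all}), group A {f_a:.2f} (pass={p_a})")
```

Under this reading, prot2 is the only feature that fails in both cases: it is missing in 50% of all samples and in 100% of group A. prot3 passes, because its single missing value (25% overall) is below the threshold and falls outside group A.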
Groupwise filtering, where keep_strategy controls how per-group results are combined:

    # Assumes adata.obs contains a "condition" column

    # No grouping: keep features with ≤50% missingness across the whole study
    apt.pp.filter_data_completeness(adata, max_missing=0.5)

    # Logical AND (default): keep features with ≤50% missingness in *every* condition.
    # A feature with 5/5 missing in condition A and 0/995 missing in condition B is removed
    # despite being 99.5% complete overall.
    apt.pp.filter_data_completeness(adata, max_missing=0.5, group_column="condition", keep_strategy="all")

    # Logical OR: keep features with ≤50% missingness in *at least one* condition.
    # Retains condition-specific features, which are often the most interesting
    # candidates in clinical studies.
    apt.pp.filter_data_completeness(adata, max_missing=0.5, group_column="condition", keep_strategy="any")
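The 5/5 versus 0/995 scenario from the comments above can be reproduced with a small NumPy/pandas sketch. This is an illustration of the documented any/all semantics on a single hypothetical feature, not the function's actual code; the data and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical study: 5 samples in condition A, 995 samples in condition B
condition = np.array(["A"] * 5 + ["B"] * 995)

# One feature that is missing in every A sample and observed in every B sample
values = np.where(condition == "A", np.nan, 1.0)

max_missing = 0.5

# No grouping: missing fraction over the whole study (5 of 1000 samples)
frac_global = np.isnan(values).mean()
keep_global = bool(frac_global <= max_missing)

# Groupwise: missing fraction within each condition
frac_by_group = pd.Series(np.isnan(values)).groupby(condition).mean()
passed_by_group = frac_by_group <= max_missing

keep_all = bool(passed_by_group.all())  # must pass in every group
keep_any = bool(passed_by_group.any())  # must pass in at least one group

print(f"global: {frac_global:.3f} missing, keep={keep_global}")
print(f"per group: {frac_by_group.to_dict()}")
print(f"keep_strategy='all': {keep_all}, keep_strategy='any': {keep_any}")
```

Here the feature is kept without grouping and with the any strategy, but removed with the all strategy, because condition A alone is 100% missing.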