alphapepttools.pp.filter_data_completeness

alphapepttools.pp.filter_data_completeness(adata, max_missing, group_column=None, groups=None, keep_strategy='all', action='flag', var_colname='passed_threshold_missing_values')

Filter features based on missing values.

Operates globally, or per-group when group_column is set.

Parameters:
  • adata (AnnData) – AnnData object

  • max_missing (float) – Maximum fraction of missing values, in the interval [0.0, 1.0], that a feature may have and still pass filtering. Features whose fraction of missing values is strictly greater than (>) max_missing are filtered out; a feature sitting exactly at the threshold passes.

  • group_column (str | None (default: None)) – Column name in adata.obs defining groups for group-wise filtering. If None (default), computes missingness across all samples. If specified, computes statistics separately for each group. This is useful to retain features that are exclusive to a specific sample group.

  • groups (list[str] | None (default: None)) – List of levels of the group_column to consider in filtering. E.g. if the column has the levels ['A', 'B', 'C'], and groups = ['A', 'B'], only missingness of features in these groups is considered. If None, all groups are considered.

  • keep_strategy (Literal['any', 'all'] (default: 'all')) – Only relevant for group-wise filtering.
      - all : keep a feature only if it passes the threshold in every group.
      - any : keep a feature if it passes the threshold in at least one group.

  • action (Literal['flag', 'drop'] (default: 'flag')) – Action to perform. Can be flag (default) or drop. If flag, a boolean column in adata.var is added to indicate whether the feature passed the missingness threshold. If drop, features that do not pass the threshold are dropped from the AnnData object.

  • var_colname (str (default: 'passed_threshold_missing_values')) – Name of the adata.var boolean column to add if action is flag.
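The parameter semantics above can be sketched in plain NumPy. This is an illustrative re-implementation under stated assumptions, not the library's actual code: completeness_mask is a hypothetical helper, and it assumes a dense array in which missing values are encoded as NaN.

```python
import numpy as np


def completeness_mask(X, max_missing, group_labels=None, groups=None, keep_strategy="all"):
    """Return a boolean mask over features (columns) that pass the threshold.

    Hypothetical helper for illustration only; not part of alphapepttools.
    Assumes missing values are NaN.
    """
    if keep_strategy not in ("all", "any"):
        raise ValueError("keep_strategy must be 'all' or 'any'")

    missing = np.isnan(X)
    if group_labels is None:
        # Global mode: one missing fraction per feature across all samples.
        # A feature fails only if its fraction is strictly greater than max_missing.
        return missing.mean(axis=0) <= max_missing

    group_labels = np.asarray(group_labels)
    levels = np.unique(group_labels) if groups is None else groups
    # One row per considered group: does each feature pass within that group?
    per_group = np.array(
        [missing[group_labels == level].mean(axis=0) <= max_missing for level in levels]
    )
    combine = np.all if keep_strategy == "all" else np.any
    return combine(per_group, axis=0)


# Two features, four samples; the second feature is missing in all of group A.
X = np.array([[1.0, np.nan], [2.0, np.nan], [3.0, 5.0], [4.0, 6.0]])
labels = ["A", "A", "B", "B"]

global_mask = completeness_mask(X, max_missing=0.3)
all_mask = completeness_mask(X, 0.3, group_labels=labels, keep_strategy="all")
any_mask = completeness_mask(X, 0.3, group_labels=labels, keep_strategy="any")
boundary_mask = completeness_mask(X, max_missing=0.5)  # 50% missing is not > 0.5
```

In this sketch the second feature fails globally (50% missing) and under 'all' (100% missing in group A), but passes under 'any' because it is complete in group B. With max_missing=0.5 it passes globally, since the comparison is strict.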

Return type:

AnnData

Returns:

AnnData object with either a new boolean adata.var column added (if action is flag) or the features that failed the threshold removed (if action is drop).

Examples

Flag features with too many missing values, then drop features based on a single group:

import anndata as ad
import pandas as pd
import numpy as np
import alphapepttools as apt

# Create data with missing values
X = np.array(
    [[1.0, np.nan, 3.0, 4.0], [2.0, np.nan, 6.0, 8.0], [3.0, 5.0, np.nan, 12.0], [4.0, 6.0, 9.0, 16.0]]
)
adata = ad.AnnData(
    X=X,
    obs=pd.DataFrame({"group": ["A", "A", "B", "B"]}),
    var=pd.DataFrame(index=["prot1", "prot2", "prot3", "prot4"]),
)

# Flag features with >30% missing values
adata = apt.pp.filter_data_completeness(adata, max_missing=0.3, action="flag")

# Drop features with >30% missing in group A only
adata = apt.pp.filter_data_completeness(
    adata, max_missing=0.3, group_column="group", groups=["A"], action="drop"
)
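To see which features these two calls should affect, the missing fractions of the toy matrix can be checked by hand. The snippet below is a sketch that uses only NumPy and assumes missing values are NaN; it mirrors the documented rule rather than calling the library.

```python
import numpy as np

# Same toy matrix as in the example: rows are samples, columns are prot1..prot4
X = np.array(
    [[1.0, np.nan, 3.0, 4.0], [2.0, np.nan, 6.0, 8.0], [3.0, 5.0, np.nan, 12.0], [4.0, 6.0, 9.0, 16.0]]
)
group = np.array(["A", "A", "B", "B"])

# Fraction of missing values per feature, over all samples and within group A
global_missing = np.isnan(X).mean(axis=0)
group_a_missing = np.isnan(X[group == "A"]).mean(axis=0)

print(global_missing.tolist())   # [0.0, 0.5, 0.25, 0.0]
print(group_a_missing.tolist())  # [0.0, 1.0, 0.0, 0.0]
```

With max_missing=0.3, only prot2 exceeds the threshold globally (0.5 > 0.3), so it is the feature one would expect to be flagged as failing. Within group A, prot2 is entirely missing, so it is also the feature one would expect the drop call to remove.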

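Group-wise filtering: keep_strategy controls how per-group results are combined. The calls below assume an adata whose adata.obs contains a "condition" column (the toy data above uses "group" instead):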

# No grouping: keep features with ≤50% missingness across the whole study
apt.pp.filter_data_completeness(adata, max_missing=0.5)

# Logical AND (default): keep features with ≤50% missingness in *every* condition.
# A feature with 5/5 missing in condition A and 0/995 missing in condition B is removed
# despite being 99.5% complete overall.
apt.pp.filter_data_completeness(adata, max_missing=0.5, group_column="condition", keep_strategy="all")

# Logical OR: keep features with ≤50% missingness in *at least one* condition.
# Retains condition-specific features, which are often the most interesting
# candidates in clinical studies.
apt.pp.filter_data_completeness(adata, max_missing=0.5, group_column="condition", keep_strategy="any")
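The 5/5 versus 0/995 case from the comments can be worked through numerically. This is a minimal NumPy sketch of the documented rule for a single feature, assuming NaN marks a missing value; the variable names are illustrative and nothing here calls the library.

```python
import numpy as np

# One feature measured in 1000 samples: missing in all 5 samples of condition A,
# present in all 995 samples of condition B.
condition = np.array(["A"] * 5 + ["B"] * 995)
feature = np.concatenate([np.full(5, np.nan), np.ones(995)])
max_missing = 0.5

# Overall missing fraction: 5 / 1000
overall_missing = np.isnan(feature).mean()

# Per-condition pass/fail under the rule "fail if missing fraction > max_missing"
passes = {
    level: bool(np.isnan(feature[condition == level]).mean() <= max_missing)
    for level in ("A", "B")
}

keep_all = all(passes.values())  # 'all': must pass in every condition
keep_any = any(passes.values())  # 'any': must pass in at least one condition

print(overall_missing)  # 0.005
print(passes)           # {'A': False, 'B': True}
print(keep_all, keep_any)  # False True
```

The feature is 99.5% complete overall and would pass an ungrouped filter, yet 'all' rejects it because condition A is entirely missing, while 'any' retains it because condition B is complete.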