alphapepttools.pp.filter_data_completeness
- alphapepttools.pp.filter_data_completeness(adata, max_missing, group_column=None, groups=None, action='flag', var_colname='passed_threshold_missing_values')
Filter features based on missing values.
Filters AnnData features (columns) based on the fraction of missing values. If group_column and groups are provided, only missingness within the specified levels of that obs column is considered. This is especially useful for imbalanced classes, where filtering by global missingness may leave too many missing values in the smaller class.
To filter rows instead, transpose the adata object before calling this function and transpose the result back afterwards.
- Parameters:
max_missing (float) – Maximum fraction of missing values allowed. A feature fails the threshold only if its fraction of missing values is strictly greater than max_missing, i.e. if max_missing is 0.6 and the fraction of missing values is 0.6, the feature is kept. A "greater than" comparison is used because max_missing may be 0.0, which corresponds to filtering for 100 % data completeness.
group_column (str | None (default: None)) – Column in obs that determines the groups for filtering.
groups (list[str] | None (default: None)) – List of levels of group_column to consider in filtering. E.g. if the column has the levels ['A', 'B', 'C'] and groups = ['A', 'B'], only missingness of features in these groups is considered. If None, all groups are considered.
action (str (default: 'flag')) – Action to perform. Can be 'flag' (default) or 'drop'. If 'flag', a boolean column is added to adata.var to indicate whether each feature passed the missingness threshold. If 'drop', features that do not pass the threshold are dropped from the AnnData object.
var_colname (str (default: 'passed_threshold_missing_values')) – Name of the boolean adata.var column to add if action is 'flag'.
- Return type:
AnnData
- Returns:
AnnData object with either a new adata.var column added (if action is 'flag') or filtered features (if action is 'drop').
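The threshold rule described for max_missing can be illustrated with plain NumPy. This is a minimal sketch of the documented "greater than" comparison, not the library's implementation; the helper name passes_missing_threshold is hypothetical and introduced only for illustration.

```python
import numpy as np


def passes_missing_threshold(X, max_missing):
    """Hypothetical helper (not part of alphapepttools).

    Returns True for each column whose fraction of NaN values
    is NOT strictly greater than max_missing.
    """
    missing_fraction = np.isnan(X).mean(axis=0)
    return ~(missing_fraction > max_missing)


nan = np.nan
# 5 samples x 3 features with 0, 3 and 4 missing values per column,
# i.e. missing fractions of 0.0, 0.6 and 0.8
X = np.array(
    [
        [1.0, nan, nan],
        [2.0, nan, nan],
        [3.0, nan, nan],
        [4.0, 1.0, nan],
        [5.0, 2.0, 9.0],
    ]
)

# A fraction equal to max_missing is kept: the second column passes at 0.6
passed_60 = passes_missing_threshold(X, max_missing=0.6)

# max_missing=0.0 keeps only fully complete columns
passed_strict = passes_missing_threshold(X, max_missing=0.0)
```

With max_missing=0.6 the first two columns pass and the third fails; with max_missing=0.0 only the fully observed first column passes.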
Examples
Flag features with too many missing values:
import anndata as ad
import pandas as pd
import numpy as np
from alphapepttools.pp.data import filter_data_completeness

# Create data with missing values
X = np.array(
    [[1.0, np.nan, 3.0, 4.0], [2.0, np.nan, 6.0, 8.0], [3.0, 5.0, np.nan, 12.0], [4.0, 6.0, 9.0, 16.0]]
)
adata = ad.AnnData(
    X=X,
    obs=pd.DataFrame({"group": ["A", "A", "B", "B"]}),
    var=pd.DataFrame(index=["prot1", "prot2", "prot3", "prot4"]),
)

# Flag features with >30% missing values
adata = filter_data_completeness(adata, max_missing=0.3, action="flag")

# Drop features with >30% missing in group A only
adata = filter_data_completeness(adata, max_missing=0.3, group_column="group", groups=["A"], action="drop")
Filter by group-specific completeness:
# Consider missingness only in specific groups
adata = filter_data_completeness(adata, max_missing=0.5, group_column="group", groups=["A", "B"], action="flag")
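To see why group-restricted filtering can give a different result from global filtering, the sketch below computes missing fractions by hand for the example matrix defined earlier on this page. It assumes that missingness is computed over the pooled rows of all selected groups; the documentation does not state how multiple groups are combined, so treat this as an illustration rather than the library's behaviour. The helper missing_fraction is hypothetical.

```python
import numpy as np
import pandas as pd

X = np.array(
    [[1.0, np.nan, 3.0, 4.0], [2.0, np.nan, 6.0, 8.0], [3.0, 5.0, np.nan, 12.0], [4.0, 6.0, 9.0, 16.0]]
)
obs = pd.DataFrame({"group": ["A", "A", "B", "B"]})
features = ["prot1", "prot2", "prot3", "prot4"]


def missing_fraction(X, obs, group_column=None, groups=None):
    """Hypothetical helper: per-feature NaN fraction, optionally restricted to rows in `groups`.

    Assumption: rows from all selected groups are pooled before computing the fraction.
    """
    if group_column is not None and groups is not None:
        row_mask = obs[group_column].isin(groups).to_numpy()
    else:
        row_mask = np.ones(X.shape[0], dtype=bool)
    return np.isnan(X[row_mask]).mean(axis=0)


global_fraction = pd.Series(missing_fraction(X, obs), index=features)
group_a_fraction = pd.Series(missing_fraction(X, obs, "group", ["A"]), index=features)

# With max_missing=0.5, prot2 passes globally (0.5 is not greater than 0.5)
# but fails when only group A is considered (1.0 is greater than 0.5).
passes_globally = ~(global_fraction > 0.5)
passes_in_group_a = ~(group_a_fraction > 0.5)
```

Here prot2 is missing in half of all samples but in every group A sample, so a global threshold of 0.5 keeps it while the same threshold restricted to group A does not.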