extraction
HDF5CellExtraction
- class scportrait.pipeline.extraction.HDF5CellExtraction(*args, **kwargs)
Bases: ProcessingStep
A class to extract single-cell images from a segmented scPortrait project and save the results to an HDF5 file.
Initialize a processing step and normalize configuration handling.
- Parameters:
config – Parsed configuration dictionary or path to a config file. If the top-level config contains a key matching the concrete step class name, that sub-dictionary is used as the step config.
directory – Working directory for this step.
project_location – Project root when running as part of a Project.
debug – Enable verbose stdout logging in addition to file logging.
overwrite – If True, existing step output may be removed before processing.
project – Active Project instance when this step is project-managed.
filehandler – Shared SpatialData file handler for project-managed runs.
from_project – Whether this step is invoked from a project-managed execution context.
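The config lookup described for the config parameter can be illustrated with a minimal sketch. The helper name resolve_step_config and the "OtherStep" entry are hypothetical and are not part of scPortrait; the sketch only mirrors the documented rule that a top-level key matching the step class name selects the sub-dictionary.

```python
from typing import Any


def resolve_step_config(config: dict[str, Any], step_name: str) -> dict[str, Any]:
    """Return config[step_name] if it is a dictionary, else the config itself.

    Hypothetical helper: it mirrors the documented lookup rule, not the
    library's actual implementation.
    """
    sub_config = config.get(step_name)
    if isinstance(sub_config, dict):
        return sub_config
    return config


# A project-wide config with one section per step ("OtherStep" is made up).
full_config = {
    "HDF5CellExtraction": {"threads": 80, "image_size": 128, "cache": "/mnt/temp/cache"},
    "OtherStep": {"threads": 4},
}
step_config = resolve_step_config(full_config, "HDF5CellExtraction")

# A config that already contains only the step's own settings is used as-is.
flat_config = resolve_step_config({"threads": 2, "image_size": 64}, "HDF5CellExtraction")
```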
- process(partial: bool = False, n_cells: int = None, seed: int = 42, output_folder_name: str | None = None) → None
Extracts single cell images from a segmented scPortrait project and saves the results to a standardized HDF5 file.
- Parameters:
input_segmentation_path – Path of the segmentation HDF5 file. If this class is used as part of a project processing workflow, this argument will be provided automatically.
partial – If True, only a random subset of n_cells cells is extracted.
n_cells – Number of cells to extract if partial is set to True.
seed – Seed for random sampling of cells for reproducibility if partial is set to True.
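To show how partial, n_cells, and seed interact, here is a minimal sketch of reproducible subset sampling. The function sample_cell_ids is hypothetical, and falling back to all cells when n_cells is None or exceeds the number of cells is an assumption; scPortrait's own sampling may differ in detail.

```python
import random


def sample_cell_ids(cell_ids, n_cells=None, seed=42):
    """Pick n_cells ids reproducibly (hypothetical helper, not scPortrait API)."""
    cell_ids = list(cell_ids)
    # Assumption: with no usable n_cells, every cell is kept.
    if n_cells is None or n_cells >= len(cell_ids):
        return cell_ids
    # A dedicated Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return sorted(rng.sample(cell_ids, n_cells))


all_ids = list(range(1, 1001))
first = sample_cell_ids(all_ids, n_cells=10, seed=42)
second = sample_cell_ids(all_ids, n_cells=10, seed=42)
```

Because the seed fixes the random draw, first and second contain the same ten ids, which is what makes a partial extraction repeatable.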
Important
If this class is used as part of a project processing workflow, all of the arguments will be provided by the Project class based on the previous segmentation. The Project class will automatically pass the most recent segmentation forward together with the supplied parameters.
Examples
# After the project is initialized and input data has been loaded and segmented
project.extract()
Notes
The following parameters are required in the config file when running this method:
HDF5CellExtraction:
    # threads used in multithreading
    threads: 80
    # image size in pixels
    image_size: 128
    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"
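A quick pre-flight check for the three required keys can catch an incomplete config before a long run starts. This is a sketch only: missing_required_keys is a hypothetical helper, and scPortrait may report missing keys differently.

```python
# Required keys as listed in the config example above.
REQUIRED_KEYS = ("threads", "image_size", "cache")


def missing_required_keys(step_config):
    """Return the required keys absent from step_config, in a fixed order."""
    return [key for key in REQUIRED_KEYS if key not in step_config]


complete = {"threads": 80, "image_size": 128, "cache": "/mnt/temp/cache"}
incomplete = {"threads": 80}

missing = missing_required_keys(incomplete)
if missing:
    print(f"Missing required HDF5CellExtraction keys: {missing}")
```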
The following optional parameters can also be configured:
| Parameter | Default | Description |
| normalize_output | True | Enable percentile normalization of extracted image channels. |
| normalization_range | (0.001, 0.999) | Lower and upper percentiles used for normalization. |
| compression | True | Compression mode for the output HDF5 dataset. True maps to lzf. gzip and False are also supported. |
| target_ram_utilization | 0.85 | Fraction of total system RAM the extraction job should aim to stay within when calibrating buffered result batches. |
| max_inflight_result_batches | auto-calibrated | Explicit override for the number of buffered multiprocessing result batches. If omitted, scPortrait calibrates this from the first worker wave. |
| flush_every | derived from effective in-flight batch limit | Flush cadence for HDF5 output and garbage collection during extraction. If omitted, it is derived from the effective in-flight batch limit. |
| max_batch_size | 1000 | Upper bound used when building multiprocessing mini-batches. |
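To make the effect of normalize_output and normalization_range concrete, the sketch below applies percentile normalization to one channel given as a flat list of pixel values. It assumes the range is given as quantile fractions, that values outside the range are clipped, and that the result is rescaled to [0, 1]; the exact behaviour inside scPortrait may differ. Both function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linearly interpolated quantile q (between 0 and 1) of an ascending list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def normalize_channel(pixels, normalization_range=(0.001, 0.999)):
    """Clip one channel to the given quantiles and rescale it to [0, 1]."""
    ordered = sorted(pixels)
    low = percentile(ordered, normalization_range[0])
    high = percentile(ordered, normalization_range[1])
    if high <= low:
        # Constant channel: nothing to stretch (assumed behaviour).
        return [0.0 for _ in pixels]
    return [min(max((p - low) / (high - low), 0.0), 1.0) for p in pixels]


# A channel with one extreme outlier that would otherwise compress the range.
channel = [float(v) for v in range(1001)]
channel[-1] = 50000.0
normalized = normalize_channel(channel)
```

With the default range, the single outlier is clipped to 1.0 instead of squeezing all other pixels into a narrow band near zero, which is why the chosen percentiles directly shape the dynamic range of the stored images.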
Normalization settings deserve special attention because they directly affect the dynamic range of the extracted single-cell images that are stored for downstream analysis. If you are unsure how to choose normalize_output or normalization_range, refer to the single-cell extraction tutorial for a more detailed walkthrough.
During extraction, scPortrait can use multiple worker processes to prepare single-cell image batches while the main process writes results to the output HDF5 file. On large datasets with multiple threads, preparing batches can be faster than writing them to disk, which would otherwise allow completed batch results to accumulate in memory. To keep memory use bounded, the extraction workflow can automatically limit how many completed batch results are allowed to be buffered in memory at the same time.
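The idea of capping buffered results can be sketched with the standard library. This is not scPortrait's implementation: run_bounded is a hypothetical function, and it uses threads rather than worker processes so the example runs anywhere (ProcessPoolExecutor offers the same submit interface). The producer stops submitting new batches once max_inflight results are outstanding and resumes only after the writer has consumed some.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(tasks, worker, write, max_workers=4, max_inflight=8):
    """Process tasks while keeping at most max_inflight unwritten results."""
    pending = set()
    peak = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for task in tasks:
            # Block until there is room for another buffered result.
            while len(pending) >= max_inflight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    write(future.result())
            pending.add(pool.submit(worker, task))
            peak = max(peak, len(pending))
        # Drain whatever is still outstanding.
        for future in pending:
            write(future.result())
    return peak


written = []
peak_inflight = run_bounded(
    range(50), lambda i: i * i, written.append, max_workers=4, max_inflight=8
)
```

Here write stands in for the step that appends a batch to the HDF5 file; because submission waits on the writer, memory held by completed batches never exceeds max_inflight payloads.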
In multiprocessing mode, max_inflight_result_batches is calibrated automatically when it is not provided explicitly. The calibration uses the first wave of worker batches to estimate the returned batch payload size together with the parent-process RSS, then chooses an in-flight batch limit that aims to respect target_ram_utilization. If the calculated value would fall below the active worker count, the worker count is used as a minimum and a warning is written to the log.
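The calibration described above can be approximated as a simple budget calculation. The formula below is an assumption made for illustration (RAM budget minus parent RSS, divided by the estimated payload per batch); the function name, its arguments, and the returned warning flag are hypothetical and do not reflect scPortrait's internals.

```python
import math


def calibrate_max_inflight(
    total_ram_bytes,
    parent_rss_bytes,
    batch_payload_bytes,
    n_workers,
    target_ram_utilization=0.85,
):
    """Estimate an in-flight batch limit; returns (limit, hit_worker_floor)."""
    # Memory left for buffered batches after the parent process is accounted for.
    budget = total_ram_bytes * target_ram_utilization - parent_rss_bytes
    estimate = math.floor(budget / batch_payload_bytes) if budget > 0 else 0
    if estimate < n_workers:
        # Documented rule: never go below the active worker count.
        return n_workers, True
    return estimate, False


GIB = 1024 ** 3
MIB = 1024 ** 2

# Plenty of headroom: the estimate is used directly.
roomy = calibrate_max_inflight(64 * GIB, 8 * GIB, 256 * MIB, n_workers=16)

# Little headroom: the worker count acts as the floor and a warning is due.
tight = calibrate_max_inflight(16 * GIB, 13 * GIB, 256 * MIB, n_workers=16)
```

In the second case the budget only fits two batches, so the limit is raised to the 16 active workers; this is the situation in which the documentation says a warning is written to the log.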