extraction

HDF5CellExtraction

class scportrait.pipeline.extraction.HDF5CellExtraction(*args, **kwargs)

Bases: ProcessingStep

A class that extracts single-cell images from a segmented scPortrait project and saves the results to an HDF5 file.

Initialize a processing step and normalize configuration handling.

Parameters:
  • config – Parsed configuration dictionary or path to a config file. If the top-level config contains a key matching the concrete step class name, that sub-dictionary is used as the step config.

  • directory – Working directory for this step.

  • project_location – Project root when running as part of Project.

  • debug – Enable verbose stdout logging in addition to file logging.

  • overwrite – If True, existing step output may be removed before processing.

  • project – Active Project instance when this step is project-managed.

  • filehandler – Shared SpatialData file handler for project-managed runs.

  • from_project – Whether this step is invoked from a project-managed execution context.
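The lookup rule described for config (use the sub-dictionary keyed by the step class name when it exists) can be sketched as below. This is a minimal illustration of the documented behaviour, not scPortrait's actual code; the helper name resolve_step_config is hypothetical.

```python
from typing import Any


def resolve_step_config(config: dict[str, Any], step_name: str) -> dict[str, Any]:
    """Return the sub-dictionary for `step_name` if present, else `config` itself.

    Hypothetical helper: mirrors the documented rule, not scPortrait's code.
    """
    sub_config = config.get(step_name)
    if isinstance(sub_config, dict):
        return sub_config
    return config


# A project-level config that nests the step settings under the class name.
full_config = {
    "HDF5CellExtraction": {"threads": 80, "image_size": 128, "cache": "/mnt/temp/cache"},
}
step_config = resolve_step_config(full_config, "HDF5CellExtraction")
```

A config that already contains only the step settings is returned unchanged, so both layouts end up in the same shape.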

process(partial: bool = False, n_cells: int = None, seed: int = 42, output_folder_name: str | None = None) -> None

Extracts single-cell images from a segmented scPortrait project and saves the results to a standardized HDF5 file.

Parameters:
  • input_segmentation_path – Path of the segmentation HDF5 file. If this class is used as part of a project processing workflow, this argument will be provided automatically.

  • partial – If True, only a random subset of n_cells cells is extracted.

  • n_cells – Number of cells to extract if partial is set to True.

  • seed – Seed for random sampling of cells for reproducibility if partial is set to True.
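How partial, n_cells and seed interact can be illustrated with a seeded random draw. The sketch below uses Python's standard random module and a hypothetical helper sample_cell_ids; the sampling routine scPortrait actually uses is not described here, so treat this only as a model of the reproducibility guarantee. Returning all cells when n_cells is None is likewise an assumption.

```python
import random


def sample_cell_ids(cell_ids, n_cells=None, seed=42):
    """Return a reproducible random subset of `cell_ids` (hypothetical helper)."""
    cell_ids = sorted(cell_ids)
    # Assumption: with no limit, or a limit larger than the population, keep everything.
    if n_cells is None or n_cells >= len(cell_ids):
        return cell_ids
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(cell_ids, n_cells))


all_ids = range(1, 1001)
subset_a = sample_cell_ids(all_ids, n_cells=50, seed=42)
subset_b = sample_cell_ids(all_ids, n_cells=50, seed=42)
```

Calling the helper twice with the same seed yields the same subset, which is what makes a partial extraction repeatable.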

Important

If this class is used as part of a project processing workflow, all of the arguments are supplied by the Project class based on the previous segmentation step. The Project class automatically passes the most recent segmentation forward together with the supplied parameters.

Examples

# After project is initialized and input data has been loaded and segmented
project.extract()

Notes

The following parameters are required in the config file when running this method:

HDF5CellExtraction:
    # threads used in multithreading
    threads: 80

    # image size in pixels
    image_size: 128

    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"
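A quick pre-flight check for the three required keys might look like the following. It is a sketch that assumes the config has already been parsed into a dictionary; the helper missing_required_keys is hypothetical and not part of scPortrait.

```python
# Required keys as listed in the config example above.
REQUIRED_KEYS = ("threads", "image_size", "cache")


def missing_required_keys(step_config: dict) -> list[str]:
    """Return the required keys that are absent from `step_config` (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in step_config]


step_config = {"threads": 80, "image_size": 128}
missing = missing_required_keys(step_config)
if missing:
    print(f"HDF5CellExtraction config is missing: {', '.join(missing)}")
```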

The following optional parameters can also be configured:

  • normalize_output (default: True) – Enable percentile normalization of extracted image channels.

  • normalization_range (default: (0.001, 0.999)) – Lower and upper percentiles used for normalization.

  • compression (default: True) – Compression mode for the output HDF5 dataset. True maps to lzf. gzip and False are also supported.

  • target_ram_utilization (default: 0.85) – Fraction of total system RAM the extraction job should aim to stay within when calibrating buffered result batches.

  • max_inflight_result_batches (default: auto-calibrated) – Explicit override for the number of buffered multiprocessing result batches. If omitted, scPortrait calibrates this from the first worker wave.

  • flush_every (default: derived from the effective in-flight batch limit) – Flush cadence for HDF5 output and garbage collection during extraction. If omitted, it is derived from the effective in-flight batch limit.

  • max_batch_size (default: 1000) – Upper bound used when building multiprocessing mini-batches.
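The compression setting accepts a mix of booleans and strings, which can be reduced to a single filter name. The sketch below uses a hypothetical helper resolve_compression; treating False as "no compression" (None) is an assumption, chosen because "lzf", "gzip" and None are the values h5py's create_dataset accepts for its compression argument.

```python
def resolve_compression(setting):
    """Map the config value to a compression filter name (hypothetical helper)."""
    if setting is True:
        return "lzf"  # documented: True maps to lzf
    if setting is False or setting is None:
        return None  # assumption: False disables compression
    if setting in ("lzf", "gzip"):
        return setting
    raise ValueError(f"Unsupported compression setting: {setting!r}")


default_filter = resolve_compression(True)
gzip_filter = resolve_compression("gzip")
no_filter = resolve_compression(False)
```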

Normalization settings deserve special attention because they directly affect the dynamic range of the extracted single-cell images that are stored for downstream analysis. If you are unsure how to choose normalize_output or normalization_range, refer to the single-cell extraction tutorial for a more detailed walkthrough.
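To make the effect of normalization_range concrete, here is a minimal NumPy sketch of percentile normalization on a single channel. The function percentile_normalize is hypothetical, and both the rescaling to [0, 1] and the clipping of out-of-range values are assumptions about how such a normalization typically works, not a description of scPortrait's implementation.

```python
import numpy as np


def percentile_normalize(channel, normalization_range=(0.001, 0.999)):
    """Rescale `channel` so the given quantiles map to 0 and 1 (hypothetical helper)."""
    lower_q, upper_q = normalization_range
    lower, upper = (float(v) for v in np.quantile(channel, [lower_q, upper_q]))
    if upper <= lower:
        # Flat image: no dynamic range to stretch.
        return np.zeros(channel.shape, dtype=np.float32)
    scaled = (channel.astype(np.float32) - lower) / (upper - lower)
    # Values outside the chosen quantiles are clipped to the ends of the range.
    return np.clip(scaled, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.integers(0, 65535, size=(128, 128)).astype(np.uint16)
normalized = percentile_normalize(image)
```

A narrower range, for example (0.01, 0.99), clips more of the brightest and darkest pixels and stretches the remaining intensities further.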

During extraction, scPortrait can use multiple worker processes to prepare single-cell image batches while the main process writes results to the output HDF5 file. On large datasets with multiple threads, preparing batches can be faster than writing them to disk, which would otherwise allow completed batch results to accumulate in memory. To keep this manageable, the extraction workflow can automatically limit how many completed batch results are allowed to be buffered in memory at the same time.
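The buffering limit described above is a form of backpressure: new batches are only submitted once earlier results have been written. The sketch below shows the general pattern with the standard library; it uses a thread pool to stay self-contained, whereas the text describes worker processes, and all names (extract_with_bounded_buffer, prepare_batch, write_batch) are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def extract_with_bounded_buffer(batches, prepare_batch, write_batch, n_workers, max_inflight):
    """Prepare batches in workers while the calling thread writes the results.

    At most `max_inflight` batches are submitted but not yet written at any time.
    """
    if max_inflight < 1:
        raise ValueError("max_inflight must be at least 1")
    pending = set()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for batch in batches:
            # Stop submitting until at least one buffered result has been written.
            while len(pending) >= max_inflight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    write_batch(future.result())
            pending.add(pool.submit(prepare_batch, batch))
        # Drain whatever is still in flight once all batches are submitted.
        for future in wait(pending).done:
            write_batch(future.result())


written = []
extract_with_bounded_buffer(
    batches=[[1, 2], [3, 4], [5, 6], [7, 8]],
    prepare_batch=lambda cell_ids: [cell_id * 10 for cell_id in cell_ids],
    write_batch=written.extend,
    n_workers=2,
    max_inflight=2,
)
```

Because results are written as they complete, the order of written can differ from the submission order.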

In multiprocessing mode, max_inflight_result_batches is calibrated automatically when it is not provided explicitly. The calibration uses the first wave of worker batches to estimate returned batch payload size together with the parent-process RSS, then chooses an in-flight batch limit that aims to respect target_ram_utilization. If the calculated value would fall below the active worker count, the worker count is used as a minimum and a warning is written to the log.
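One plausible reading of this calibration is a simple memory budget: take the RAM allowed by target_ram_utilization, subtract what the parent process already uses, and divide by the size of one returned batch. The exact formula scPortrait uses is not given here, so the function below (calibrate_inflight_limit, with hypothetical argument names) is only a sketch under that assumption.

```python
import logging
import math

logger = logging.getLogger(__name__)


def calibrate_inflight_limit(
    total_ram_bytes,
    parent_rss_bytes,
    batch_payload_bytes,
    n_workers,
    target_ram_utilization=0.85,
):
    """Estimate how many result batches may be buffered at once (hypothetical helper)."""
    # Memory left for buffered results under the utilization target (assumed formula).
    budget = target_ram_utilization * total_ram_bytes - parent_rss_bytes
    limit = math.floor(budget / batch_payload_bytes) if batch_payload_bytes > 0 else n_workers
    if limit < n_workers:
        # Documented behaviour: fall back to the worker count and log a warning.
        logger.warning(
            "Calibrated in-flight limit %d is below worker count %d; using %d.",
            limit, n_workers, n_workers,
        )
        return n_workers
    return limit


GIB = 1024**3
roomy = calibrate_inflight_limit(64 * GIB, 4 * GIB, 256 * 1024**2, n_workers=8)
tight = calibrate_inflight_limit(8 * GIB, 7 * GIB, 256 * 1024**2, n_workers=8)
```

With 64 GiB of RAM, a 4 GiB parent process and 256 MiB batches, the budget is 50.4 GiB, which gives a limit of 201 batches; in the second call the budget is negative, so the worker count of 8 is used instead.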