pipeline

pipeline#

project#

Within scPortrait, all operations are centered around the concept of a Project. A Project is a Python class that manages all scPortrait processing steps and is the central element through which all operations are performed. Each Project maps directly to a directory on the file system, which contains all inputs to a specific scPortrait run as well as the generated outputs. Depending on the structure of the data to be processed, a different Project class is required. Please see here for more information.

Project#

class scportrait.pipeline.project.Project(project_location, config_path, segmentation_f=None, extraction_f=None, classification_f=None, selection_f=None, overwrite=False, debug=False)#

Bases: Logable

DEFAULT_IMAGE_DTYPE#

alias of uint16

DEFAULT_SEGMENTATION_DTYPE#

alias of uint32

DEFAULT_SINGLE_CELL_IMAGE_DTYPE#

alias of float16

update_classification_f(classification_f) None#

Update the classification method chosen for the project without reinitializing the entire project.

Parameters:

classification_f (class) – The classification method that should be used for the project.

load_input_from_tif_files(file_paths, channel_names=None, crop=None, overwrite=None, remap=None, cache=None)#

Load the input image from a list of files. The channels need to be specified in the following order: nucleus, cytosol, other channels.

Parameters:
  • file_paths (List[str]) – List containing the path to each channel, for example ["path1/img.tiff", "path2/img.tiff", "path3/img.tiff"]. Expects a list of file paths with length "input_channel" as defined in the config.yml.

  • crop (List[Tuple], optional) – When set, it can be used to crop the input image. The first element refers to the first dimension of the image and so on. For example, use "[(0,1000),(0,2000)]" to crop the image to 1000 px height and 2000 px width from the top left corner.
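As a rough illustration of the crop convention described above, the following sketch converts a crop specification into Python slices and applies it to a small image stored as a list of rows. The helpers crop_to_slices and apply_crop are hypothetical names used only for this example; they are not part of scPortrait, and the library's own cropping code may work differently.

```python
def crop_to_slices(crop):
    """Convert [(start, stop), ...] into a tuple of slices, one per image dimension."""
    return tuple(slice(start, stop) for start, stop in crop)


def apply_crop(image_rows, crop):
    """Crop a 2D image given as a list of rows (first dimension: rows, second: columns)."""
    row_slice, col_slice = crop_to_slices(crop)
    return [row[col_slice] for row in image_rows[row_slice]]


# Toy 4 x 6 image whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]

# Keep rows 0-1 (height 2) and columns 0-2 (width 3), counted from the top left corner.
cropped = apply_crop(image, [(0, 2), (0, 3)])
print(cropped)
```

The same (start, stop) pair per dimension is used by the window attribute of the Segmentation class further down this page.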

load_input_from_sdata(sdata_path, input_image_name='input_image', nucleus_segmentation_name=None, cytosol_segmentation_name=None, overwrite=None)#

Load input image from a spatialdata object.

select(cell_sets: list[dict], calibration_marker: ndarray | None = None, segmentation_name: str = 'seg_all_nucleus', name: str | None = None)#

Select specified classes using the defined selection method.

segmentation#

Segmentation#

class scportrait.pipeline.segmentation.Segmentation(config, directory, nuc_seg_name, cyto_seg_name, _tmp_image_path, project_location, debug, overwrite, project, filehandler, **kwargs)#

Bases: ProcessingStep

Segmentation helper class used for creating segmentation workflows.

maps#

Segmentation workflows based on the Segmentation class can use maps for saving and loading checkpoints. Maps can be numpy arrays.

Type:

dict(str)

DEFAULT_FILTER_ADDTIONAL_FILE#
Type:

str, default filtered_classes.csv

PRINT_MAPS_ON_DEBUG#
Type:

bool, default False

identifier#

Only set if called by ShardedSegmentation. Unique index of the shard.

Type:

int, default None

window#

Only set if called by ShardedSegmentation. Defines the window which is assigned to the shard. The window will be applied to the input. The first element refers to the first dimension of the image and so on. For example, use [(0,1000),(0,2000)] to crop the image to 1000 px height and 2000 px width from the top left corner.

Type:

list(tuple), default None

input_path#

Only set if called by ShardedSegmentation. Location of the input HDF5 file. During sharded segmentation, the helper class derived from ShardedSegmentation saves the input image as an HDF5 file. This makes the input image available for parallel reading by the segmentation processes.

Type:

str, default None

Example

def process(self):
    # two maps are initialized
    self.maps = {"map0": None, "map1": None}

    # Check whether the segmentation directory already contains these maps
    # and load them if so. The index of the first map that was not found is
    # returned; it indicates the step where computation needs to resume.
    current_step = self.load_maps_from_disk()

    if current_step <= 0:
        # do stuff and generate map0
        self.save_map("map0")

    if current_step <= 1:
        # do stuff and generate map1
        self.save_map("map1")

save_map(map_name)#

Saves newly computed map.

Args

map_name (str): name of the map to be saved, as defined in self.maps.

Example

# declare all intermediate maps
self.maps = {"myMap": None}

# load intermediate maps if possible and get current processing step
current_step = self.load_maps_from_disk()

if current_step <= 0:
    # do some computations

    self.maps["myMap"] = myNumpyArray

    # save map
    self.save_map("myMap")

process(input_image)#

Process the input image with the segmentation method.
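The checkpointing pattern described for maps, load_maps_from_disk and save_map can be tried out in isolation. The sketch below is a toy stand-in, not scPortrait code: the class ToyCheckpointedWorkflow, the JSON file format and the return value used when every map is found (the number of maps) are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


class ToyCheckpointedWorkflow:
    """Hypothetical stand-in for a Segmentation subclass; not scPortrait code."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.maps = {"map0": None, "map1": None}
        self.computed = []  # records which steps actually ran

    def _path(self, map_name):
        return self.directory / f"{map_name}.json"

    def save_map(self, map_name):
        # Persist one intermediate result so that a later run can skip this step.
        self._path(map_name).write_text(json.dumps(self.maps[map_name]))

    def load_maps_from_disk(self):
        # Load maps in declaration order; return the index of the first missing one.
        for index, map_name in enumerate(self.maps):
            path = self._path(map_name)
            if not path.exists():
                return index
            self.maps[map_name] = json.loads(path.read_text())
        return len(self.maps)  # assumption: all maps found means nothing to redo

    def process(self):
        current_step = self.load_maps_from_disk()
        if current_step <= 0:
            self.maps["map0"] = [1, 2, 3]
            self.computed.append("map0")
            self.save_map("map0")
        if current_step <= 1:
            self.maps["map1"] = [value * 2 for value in self.maps["map0"]]
            self.computed.append("map1")
            self.save_map("map1")


with tempfile.TemporaryDirectory() as tmp:
    first = ToyCheckpointedWorkflow(tmp)
    first.process()  # nothing on disk yet, so both steps run

    (Path(tmp) / "map1.json").unlink()  # simulate a run interrupted after map0

    resumed = ToyCheckpointedWorkflow(tmp)
    resumed.process()  # map0 is loaded from disk, only map1 is recomputed

    first_run_steps = first.computed
    resumed_steps = resumed.computed
    resumed_map1 = resumed.maps["map1"]

print(first_run_steps, resumed_steps, resumed_map1)
```

Because every step is guarded by a comparison against current_step, a resumed run skips exactly the steps whose maps were already found on disk.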

ShardedSegmentation#

class scportrait.pipeline.segmentation.ShardedSegmentation(*args, **kwargs)#

Bases: Segmentation

Performs a sharded segmentation in which the input image is split into individual tiles (with overlap) that are processed individually before the results are joined back together.
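To make the idea of overlapping tiles more concrete, the sketch below computes tile windows in the same [(row_start, row_stop), (col_start, col_stop)] layout used by the window attribute. The function compute_tile_windows and its tile_size and overlap parameters are hypothetical; how scPortrait actually places its shards and resolves the overlap when joining results is not shown here.

```python
def compute_tile_windows(height, width, tile_size, overlap):
    """Return windows [(row_start, row_stop), (col_start, col_stop)] covering the image.

    Neighbouring tiles share `overlap` pixels along each axis.
    """
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must be non-negative and smaller than tile_size")

    step = tile_size - overlap

    def starts(length):
        # Tile origins along one axis; stop once a tile reaches the image edge.
        positions = []
        position = 0
        while True:
            positions.append(position)
            if position + tile_size >= length:
                return positions
            position += step

    windows = []
    for row_start in starts(height):
        for col_start in starts(width):
            windows.append(
                [
                    (row_start, min(row_start + tile_size, height)),
                    (col_start, min(col_start + tile_size, width)),
                ]
            )
    return windows


windows = compute_tile_windows(height=100, width=150, tile_size=60, overlap=10)
for window in windows:
    print(window)
```

With these toy numbers the image is covered by 2 x 3 tiles, and each tile shares a 10 px strip with its neighbours.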

process(input_image)#

Process the input image with the sharded segmentation method.

Important

This function is called automatically when a Segmentation Class is executed.

Parameters:

input_image (np.array) – Input image to be processed. The input image should be a numpy array of shape (C, H, W) where C is the number of channels, H is the height of the image and W is the width of the image.

complete_segmentation(input_image, force_run=False)#

Complete an already started sharded segmentation of the provided input image.

Parameters:
  • input_image (np.array) – Input image to be processed. The input image should be a numpy array of shape (C, H, W) where C is the number of channels, H is the height of the image and W is the width of the image.

  • force_run (bool) – If set to True the segmentation will be run even if a completed segmentation is already found in the sdata object. Default is False.

segmentation workflows#

class scportrait.pipeline.segmentation.workflows.WGASegmentation(*args, **kwargs)#
process(input_image)#

Process the input image with the segmentation method.

class scportrait.pipeline.segmentation.workflows.ShardedWGASegmentation(*args, **kwargs)#
method#

alias of WGASegmentation

class scportrait.pipeline.segmentation.workflows.DAPISegmentation(*args, **kwargs)#
process(input_image)#

Process the input image with the segmentation method.

class scportrait.pipeline.segmentation.workflows.ShardedDAPISegmentation(*args, **kwargs)#
method#

alias of DAPISegmentation

class scportrait.pipeline.segmentation.workflows.DAPISegmentationCellpose(*args, **kwargs)#
process(input_image)#

Process the input image with the segmentation method.

class scportrait.pipeline.segmentation.workflows.ShardedDAPISegmentationCellpose(*args, **kwargs)#
method#

alias of DAPISegmentationCellpose

class scportrait.pipeline.segmentation.workflows.CytosolSegmentationCellpose(*args, **kwargs)#
process(input_image)#

Process the input image with the segmentation method.

class scportrait.pipeline.segmentation.workflows.ShardedCytosolSegmentationCellpose(*args, **kwargs)#
method#

alias of CytosolSegmentationCellpose

class scportrait.pipeline.segmentation.workflows.CytosolSegmentationDownsamplingCellpose(*args, **kwargs)#
process(input_image)#

Process the input image with the segmentation method.

class scportrait.pipeline.segmentation.workflows.ShardedCytosolSegmentationDownsamplingCellpose(*args, **kwargs)#
method#

alias of CytosolSegmentationDownsamplingCellpose

class scportrait.pipeline.segmentation.workflows.CytosolOnlySegmentationCellpose(*args, **kwargs)#
class scportrait.pipeline.segmentation.workflows.Sharded_CytosolOnly_Cellpose_Segmentation(*args, **kwargs)#
method#

alias of CytosolOnlySegmentationCellpose

class scportrait.pipeline.segmentation.workflows.CytosolOnly_Segmentation_Downsampling_Cellpose(*args, **kwargs)#
process(input_image) None#

Process the input image with the segmentation method.

class scportrait.pipeline.segmentation.workflows.Sharded_CytosolOnly_Segmentation_Downsampling_Cellpose(*args, **kwargs)#
method#

alias of CytosolOnly_Segmentation_Downsampling_Cellpose

extraction#

HDF5CellExtraction#

class scportrait.pipeline.extraction.HDF5CellExtraction(*args, **kwargs)#

Bases: ProcessingStep

A class that extracts single-cell images from a segmented scPortrait project and saves the results to an HDF5 file.

process(partial=False, n_cells=None, seed=42)#

Extracts single cell images from a segmented scPortrait project and saves the results to an HDF5 file.

Parameters:
  • input_segmentation_path (str) – Path of the segmentation HDF5 file. If this class is used as part of a project processing workflow, this argument will be provided automatically.

  • filtered_classes_path (str, optional) – Path to the filtered classes that should be used for extraction. Default is None. If not provided, will use the automatically generated paths.

Important

If this class is used as part of a project processing workflow, all of the arguments will be provided by the Project class based on the previous segmentation. The Project class will automatically pass on the most recent segmentation together with the supplied parameters.

Examples

# After project is initialized and input data has been loaded and segmented
project.extract()

Notes

The following parameters are required in the config file when running this method:

HDF5CellExtraction:

    compression: True

    # threads used in multithreading
    threads: 80

    # image size in pixels
    image_size: 128

    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"

    # specs to define how HDF5 data should be chunked and saved
    hdf5_rdcc_nbytes: 5242880000 # 5GB 1024 * 1024 * 5000
    hdf5_rdcc_w0: 1
    hdf5_rdcc_nslots: 50000
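The cache size in the example config can be checked with a few lines of arithmetic. The sketch below reproduces the value of hdf5_rdcc_nbytes and estimates how many single-cell image stacks would fit into a cache of that size. It assumes float16 pixels (in line with DEFAULT_SINGLE_CELL_IMAGE_DTYPE) and a hypothetical 5 channels per cell; the channel count is chosen for illustration only and is not taken from scPortrait.

```python
# Value from the example config: 1024 * 1024 * 5000 bytes.
hdf5_rdcc_nbytes = 1024 * 1024 * 5000
print(hdf5_rdcc_nbytes)            # 5242880000
print(hdf5_rdcc_nbytes / 1024**3)  # 4.8828125 (GiB)

# Size of one single-cell image stack, assuming float16 pixels (2 bytes each)
# and a hypothetical 5 channels per cell.
image_size = 128
n_channels = 5  # assumption for illustration only
bytes_per_pixel = 2
bytes_per_cell = n_channels * image_size * image_size * bytes_per_pixel
print(bytes_per_cell)  # 163840

# Rough number of such stacks that would fit into a cache of this size.
cells_in_cache = hdf5_rdcc_nbytes // bytes_per_cell
print(cells_in_cache)  # 32000
```

The parameter names mirror the chunk cache settings of h5py (rdcc_nbytes, rdcc_w0, rdcc_nslots), so the "5GB" in the config comment corresponds to roughly 4.9 GiB.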

classification#

MLClusterClassifier#

class scportrait.pipeline.classification.MLClusterClassifier(*args, **kwargs)#

Class for classifying single cells using a pre-trained machine learning model.

This class takes a pre-trained model and uses it to classify single cells, using the model’s forward function or encoder function, depending on the user’s choice. The classification results are saved to a CSV file.

__call__(*args, debug=None, overwrite=None, **kwargs)#

Call the processing step.

Parameters:
  • debug (bool, optional, default None) – Allows overriding the value set on initiation. When set to True debug outputs will be printed where applicable.

  • overwrite (bool, optional, default None) – Allows overriding the value set on initiation. When set to True, the processing step directory will be completely deleted and newly created when called.

DEFAULT_MODEL_CLASS#

alias of MultilabelSupervisedModel

DEFAULT_DATA_LOADER#

alias of HDF5SingleCellDataset

process(extraction_dir: str, size: int = 0)#

Perform classification on the provided HDF5 dataset.

Parameters:
  • extraction_dir (str) – Directory containing the extracted HDF5 files from the project. If this class is used as part of a project processing workflow, this argument will be provided automatically.

  • size (int, optional) – How many cells should be selected for inference. Default is 0, which means all cells are selected.

Returns:

Results are written to CSV files located in the project directory.

Return type:

None

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction, so only the remaining arguments need to be provided. The Project class will automatically provide the most recent extracted single-cell dataset together with the supplied parameters.

Examples

project.classify()

Notes

The following parameters are required in the config file:

MLClusterClassifier:
    # Channel number on which the classification should be performed
    channel_classification: 4

    # Number of threads to use for dataloader
    dataloader_worker_number: 24

    # Batch size to pass to GPU
    batch_size: 900

    # Path to PyTorch checkpoint that should be used for inference
    network: "path/to/model/"

    # Classifier architecture implemented in scPortrait
    # Choose one of VGG1, VGG2, VGG1_old, VGG2_old
    classifier_architecture: "VGG2_old"

    # If more than one checkpoint is provided in the network directory, which checkpoint should be chosen
    # Should either be "max" or a numeric value indicating the epoch number
    epoch: "max"

    # Name of the classifier used for saving the classification results to a directory
    label: "Autophagy_15h_classifier1"

    # List of which inference methods should be performed
    # Available: "forward" and "encoder"
    # If "forward": images are passed through all layers of the model and the final inference results are written to file
    # If "encoder": activations at the end of the CNN are written to file
    encoders: ["forward", "encoder"]

    # On which device inference should be performed
    # For speed, should be "cuda"
    inference_device: "cuda"

    # Define dataset transforms
    transforms:
        resize: 128

CellFeaturizer#

class scportrait.pipeline.classification.CellFeaturizer(*args, **kwargs)#

Class for extracting general image features from SPARCS single-cell image datasets. The extracted features are saved to a CSV file. The features are calculated on the basis of a specified channel.

The features which are calculated are:

  • Area of the masks in pixels

  • Mean intensity of the chosen channel in the regions labelled by each of the masks

  • Median intensity of the chosen channel in the regions labelled by each of the masks

  • 75% quantile of the chosen channel in the regions labelled by each of the masks

  • 25% quantile of the chosen channel in the regions labelled by each of the masks

  • Summed intensity of the chosen channel in the regions labelled by each of the masks

  • Summed intensity of the chosen channel in the region labelled by each of the masks normalized for area

The features are written to the CSV file in this order.
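The listed features can be reproduced for a single mask with the Python standard library. The sketch below is an illustration only: featurize_region is a hypothetical helper, the image and mask are plain nested lists, and the quantiles use linear interpolation between data points, which may differ from the method scPortrait uses internally.

```python
import statistics


def featurize_region(channel, mask):
    """Compute the listed features for the pixels where mask is non-zero.

    channel and mask are 2D lists of equal shape. Returns the features in the
    documented order: area, mean, median, 75% quantile, 25% quantile,
    summed intensity, summed intensity normalized for area.
    """
    values = [
        pixel
        for channel_row, mask_row in zip(channel, mask)
        for pixel, inside in zip(channel_row, mask_row)
        if inside
    ]
    area = len(values)
    # method="inclusive" interpolates linearly between observed data points.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    total = sum(values)
    return (
        area,
        statistics.fmean(values),
        statistics.median(values),
        q75,
        q25,
        total,
        total / area,
    )


channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
features = featurize_region(channel, mask)
print(features)
```

For this toy input the masked pixels are 1, 2, 4 and 5, so the area is 4 and the summed intensity is 12.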

__call__(*args, debug=None, overwrite=None, **kwargs)#

Call the processing step.

Parameters:
  • debug (bool, optional, default None) – Allows overriding the value set on initiation. When set to True debug outputs will be printed where applicable.

  • overwrite (bool, optional, default None) – Allows overriding the value set on initiation. When set to True, the processing step directory will be completely deleted and newly created when called.

process(extraction_dir, size=0)#

Perform featurization on the provided HDF5 dataset.

Parameters:
  • extraction_dir (str) – Directory containing the extracted HDF5 files from the project. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • size (int, optional, default=0) – How many cells should be selected for inference. Default is 0, meaning all cells are selected.

Returns:

Results are written to CSV files located in the project directory.

Return type:

None

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction, so only the remaining arguments need to be provided. The Project class will automatically provide the most recent extraction results together with the supplied parameters.

Examples

# Define accessory dataset: additional HDF5 datasets that you want to perform an inference on
# Leave empty if you only want to infer on all extracted cells in the current project

project.classify()

Notes

The following parameters are required in the config file:

CellFeaturizer:
    # Channel number on which the featurization should be performed
    channel_classification: 4

    # Number of threads to use for dataloader
    dataloader_worker_number: 0 # needs to be 0 if using CPU

    # Batch size to pass to GPU
    batch_size: 900

    # On which device inference should be performed
    # For speed should be "cuda"
    inference_device: "cpu"

    # Label under which the results should be saved
    screen_label: "Ch3_Featurization"

selection#

LMDSelection#

class scportrait.pipeline.selection.LMDSelection(*args, **kwargs)#

Bases: ProcessingStep

Select single cells from a segmented sdata file and generate cutting data for the Leica LMD microscope. This class relies on the functionality of the pylmd library.

process(segmentation_name: str, cell_sets: list[dict], calibration_marker: array, name: str | None = None)#

Process function for selecting cells and generating their XML. Under the hood, this method relies on the pylmd library and utilizes its SegmentationLoader class.

Parameters:
  • segmentation_name (str) – Name of the segmentation to be used for shape generation in the sdata object.

  • cell_sets (list of dict) – List of dictionaries containing the sets of cells which should be sorted into a single well. Mandatory keys for each dictionary are: name, classes. Optional keys are: well.

  • calibration_marker (numpy.array) – Array of shape (3, 2) containing the calibration marker coordinates in (row, column) format.

Example

import numpy as np

# Calibration markers should be defined as (row, column).
marker_0 = np.array([-10, -10])
marker_1 = np.array([-10, 1100])
marker_2 = np.array([1100, 505])

# A numpy Array of shape (3, 2) should be passed.
calibration_marker = np.array([marker_0, marker_1, marker_2])


# Sets of cells can be defined by providing a name and a list of classes in a dictionary.
cells_to_select = [{"name": "dataset1", "classes": [1, 2, 3]}]

# Alternatively, a path to a csv file can be provided.
# If a relative path is provided, it is accessed relative to the project's base directory.
cells_to_select += [{"name": "dataset2", "classes": "segmentation/class_subset.csv"}]

# If desired, wells can be passed with the individual sets.
cells_to_select += [{"name": "dataset3", "classes": [4, 5, 6], "well": "A1"}]

project.select(cells_to_select, calibration_marker)
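Before calling project.select, it can be useful to check that the inputs follow the documented layout. The following sketch is a hypothetical pre-flight check and not part of scPortrait or pylmd: validate_cell_sets and validate_calibration_marker are illustrative names, and the checks only cover what is stated above (mandatory keys name and classes, classes given as a list or a csv path, and three (row, column) markers).

```python
REQUIRED_KEYS = {"name", "classes"}


def validate_cell_sets(cell_sets):
    """Raise ValueError if a cell set lacks a mandatory key or has invalid classes."""
    for index, cell_set in enumerate(cell_sets):
        missing = REQUIRED_KEYS - set(cell_set)
        if missing:
            raise ValueError(f"cell set {index} is missing keys: {sorted(missing)}")
        # classes is either a list of class ids or a path to a csv file
        if not isinstance(cell_set["classes"], (list, str)):
            raise ValueError(f"cell set {index}: classes must be a list or a csv path")
    return True


def validate_calibration_marker(marker):
    """Check for three (row, column) pairs, i.e. a (3, 2) layout."""
    return len(marker) == 3 and all(len(point) == 2 for point in marker)


cells_to_select = [
    {"name": "dataset1", "classes": [1, 2, 3]},
    {"name": "dataset2", "classes": "segmentation/class_subset.csv"},
    {"name": "dataset3", "classes": [4, 5, 6], "well": "A1"},
]
calibration_marker = [[-10, -10], [-10, 1100], [1100, 505]]

print(validate_cell_sets(cells_to_select))
print(validate_calibration_marker(calibration_marker))
```

A check like this fails early with a readable message instead of surfacing a missing key deep inside the selection step.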

Note

The following parameters are required in the config file:

LMDSelection:
    # The number of threads with which multithreaded tasks should be executed
    threads: 10

    # The number of parallel processes to use for generating cell sets.
    # Each set will be processed with the designated number of threads.
    processes_cell_sets: 1

    # defines the channel used for generating cutting masks
    # segmentation.hdf5 => labels => segmentation_channel
    # When using WGA segmentation:
    #    0 corresponds to nuclear masks
    #    1 corresponds to cytosolic masks.
    segmentation_channel: 0

    # dilation of the cutting mask in pixels
    shape_dilation: 10

    # Cutting masks are transformed by binary dilation and erosion
    binary_smoothing: 3

    # number of datapoints which are averaged for smoothing
    # the number of datapoints over a distance of n pixels is 2*n
    convolution_smoothing: 25

    # fold reduction of datapoints for compression
    poly_compression_factor: 30

    # Optimization of the cutting path between shapes
    # optimized paths improve the cutting time and the microscope's focus
    # valid options are ["none", "hilbert", "greedy"]
    path_optimization: "hilbert"

    # Parameter required for hilbert curve based path optimization.
    # Defines the order of the hilbert curve used, which needs to be tuned with the total cutting area.
    # For areas of 1 x 1 mm we recommend at least p = 4,  for whole slides we recommend p = 7.
    hilbert_p: 7

    # Parameter required for greedy path optimization.
    # Instead of a global distance matrix, the k nearest neighbours are approximated.
    # The optimization problem is then greedily solved for the known set of nearest neighbours until the first set of neighbours is exhausted.
    # Established edges are then removed and the nearest neighbour approximation is recursively repeated.
    greedy_k: 20

    # The LMD reads coordinates as integers which leads to rounding of decimal places.
    # Points spread between two whole coordinates are therefore collapsed to whole coordinates.
    # This can be mitigated by scaling the entire coordinate system by a defined factor.
    # For a resolution of 0.6 um / px a factor of 100 is recommended.
    xml_decimal_transform: 100

    # Overlapping shapes are merged based on a nearest neighbour heuristic.
    # All selected shapes closer than distance_heuristic pixels are checked for overlap.
    distance_heuristic: 300
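The effect described for xml_decimal_transform in the config above can be demonstrated with a few coordinates. The sketch below is illustrative only: to_lmd_coordinates is a hypothetical helper that mimics an integer-only reader by rounding, and it is not the code used by scPortrait or pylmd to write the XML.

```python
def to_lmd_coordinates(points, scale=1):
    """Scale coordinates, then round them to integers as an integer-only reader would."""
    return [(round(x * scale), round(y * scale)) for x, y in points]


# Three distinct points that lie between two whole coordinates.
points = [(10.2, 5.1), (10.4, 5.3), (10.6, 5.4)]

unscaled = to_lmd_coordinates(points)           # decimals are lost
scaled = to_lmd_coordinates(points, scale=100)  # two decimal places are preserved

print(unscaled)
print(scaled)
```

Without scaling, two of the three points collapse onto the same integer coordinate; with a factor of 100 all three remain distinct.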