pipeline

base

A collection of base classes from which other classes in the SPARCSpy environment inherit. They provide shared base functionality such as logging and directory creation.

Logable

Base class which provides the framework for logging.

class sparcscore.pipeline.base.Logable(debug=False)

Object which can create log entries.

Parameters

debug (bool, default False) – When set to True log entries will be printed to the console.

directory

A directory must be set in every descendant before log can be called.

Type

str

DEFAULT_LOG_NAME

Default log file name.

Type

str, default processing.log

DEFAULT_FORMAT

Date and time format used for logging. See datetime.strftime.

Type

str

log(message)

Log a message.

Parameters

message (str, list(str), dict(str)) – Message to write to the log. Strings are written directly; for lists and dicts, each entry is logged individually.

get_timestamp()

Get the current timestamp in the DEFAULT_FORMAT.

Returns

Formatted timestamp.

Return type

str
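
A minimal sketch of how a descendant might use these helpers (the subclass name and directory below are purely illustrative):

from sparcscore.pipeline.base import Logable

# hypothetical subclass; any descendant must set self.directory before calling log()
class MyStep(Logable):
    def __init__(self, directory, debug=False):
        super().__init__(debug=debug)
        self.directory = directory

step = MyStep("/tmp/my_project", debug=True)
step.log("starting processing")   # appended to processing.log in the directory
print(step.get_timestamp())       # current timestamp in DEFAULT_FORMAT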

ProcessingStep

Starting point for all processing steps. Reads a config file that contains the parameters used to set up a processing method and generates the folder structure necessary for saving the generated outputs.

class sparcscore.pipeline.base.ProcessingStep(config, directory, debug=False, intermediate_output=False, overwrite=True)

Bases: sparcscore.pipeline.base.Logable

Processing step. Can load a configuration file and create a subdirectory under the project class for the processing step.

Parameters
  • config (dict) – Config dictionary which is passed by the Project class when called. It is loaded from the project config based on the name of the class.

  • directory (str) – Directory which should be used by the processing step. The directory will be newly created if it does not exist yet. When used with the sparcscore.pipeline.project.Project class, a subdirectory of the project directory is passed.

  • intermediate_output (bool, default False) – When set to True intermediate outputs will be saved where applicable.

  • debug (bool, default False) – When set to True debug outputs will be printed where applicable.

  • overwrite (bool, default False) – When set to True, the processing step directory will be completely deleted and newly created when called.

register_parameter(key, value)

Registers a new parameter by adding it to the configuration dictionary if the key does not already exist.

Parameters
  • key (str) – Name of the parameter.

  • value – Value of the parameter.

get_directory()

Get the directory for this processing step.

Returns

Directory path.

Return type

str
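
A hedged sketch of how these methods might be used in a custom processing step (the config values and directory are illustrative; normally the Project class instantiates processing steps and passes both):

from sparcscore.pipeline.base import ProcessingStep

# config section for this step, as it would be passed by the Project class
config = {"threads": 4}

step = ProcessingStep(config, "/tmp/my_project/my_step", debug=True)

# only added if "cache" is not already defined in the config
step.register_parameter("cache", "/tmp/cache")

print(step.get_directory())   # /tmp/my_project/my_step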

project

Within SPARCSpy, all operations are centered around the concept of a Project. A Project is a Python class which manages all of the SPARCSpy processing steps and is the central element through which all operations are performed. Each Project directly maps to a directory on the file system which contains all of the inputs to a specific SPARCSpy run as well as the generated outputs. Depending on the structure of the data that is to be processed, a different Project class is required; see the documentation of the individual Project classes below for more information.

Project

class sparcscore.pipeline.project.Project(location_path, config_path='', *args, intermediate_output=False, debug=False, overwrite=False, segmentation_f=None, extraction_f=None, classification_f=None, selection_f=None, **kwargs)

Bases: sparcscore.pipeline.base.Logable

Project base class used to create a SPARCSpy project. This class manages all of the SPARCSpy processing steps. It directly maps to a directory on the file system which contains all of the project inputs as well as the generated outputs.

Parameters
  • location_path (str) – Path to the folder where the project should be created. The folder is created in case the specified folder does not exist.

  • config_path (str, optional, default "") – Path pointing to a valid configuration file. The file will be copied to the project directory and renamed to the name specified in DEFAULT_CONFIG_NAME. If no config is specified, the existing config in the project directory will be used, if possible. See the section configuration to find out more about the config file.

  • intermediate_output (bool, default False) – When set to True intermediate outputs will be saved where applicable.

  • debug (bool, default False) – When set to True debug outputs will be printed where applicable.

  • overwrite (bool, default False) – When set to True, the processing step directory will be completely deleted and newly created when called.

  • segmentation_f (Class, default None) – Class containing segmentation workflow.

  • extraction_f (Class, default None) – Class containing extraction workflow.

  • classification_f (Class, default None) – Class containing classification workflow.

  • selection_f (Class, default None) – Class containing selection workflow.

DEFAULT_CONFIG_NAME

Default config name which is used for the config file in the project directory. This name needs to be used when no config is supplied and the config is manually created in the project folder.

Type

str, default “config.yml”

DEFAULT_SEGMENTATION_DIR_NAME

Default foldername for the segmentation process.

Type

str, default “segmentation”

DEFAULT_EXTRACTION_DIR_NAME

Default foldername for the extraction process.

Type

str, default “extraction”

DEFAULT_CLASSIFICATION_DIR_NAME

Default foldername for the classification process.

Type

str, default “classification”

DEFAULT_SELECTION_DIR_NAME

Default foldername for the selection process.

Type

str, default “selection”

load_input_from_file(file_paths, crop=[(0, - 1), (0, - 1)])

Load input image from a list of files. The channels need to be specified in the following order: nucleus, cytosol, other channels.

Parameters
  • file_paths (list(str)) – List containing paths to each channel like [“path1/img.tiff”, “path2/img.tiff”, “path3/img.tiff”]. Expects a list of file paths with length “input_channel” as defined in the config.yml.

  • crop (list(tuple), optional) – When set, it can be used to crop the input image. The first element refers to the first dimension of the image and so on. For example use “[(0,1000),(0,2000)]” to crop the image to 1000 px height and 2000 px width from the top left corner.
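
For example, assuming the config defines three input channels (paths and crop window are placeholders):

project.load_input_from_file(
    ["data/nucleus.tiff", "data/cytosol.tiff", "data/marker.tiff"],
    crop=[(0, 1000), (0, 2000)],  # 1000 px height, 2000 px width from the top left corner
)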

load_input_from_array(array, remap=None)

Load input image from an already loaded numpy array. The numpy array needs to have the following shape: CXY. The channels need to be in the following order: nucleus, cell membrane channel, other channels, or a remapping needs to be defined.

Parameters
  • array (numpy.ndarray) – Numpy array of shape “[channels, height, width]”.

  • remap (list(int), optional) – Define remapping of channels. For example use “[1, 0, 2]” to change the order of the first and the second channel. The expected order is nucleus channel, cell membrane channel, followed by other channels.
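
For example (the array contents are placeholders):

import numpy as np

# hypothetical image stack of shape (channels, height, width)
array = np.random.randint(0, 2**16, size=(3, 1000, 1000), dtype=np.uint16)

# swap the first two channels so that the order is nucleus, cell membrane, other
project.load_input_from_array(array, remap=[1, 0, 2])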

segment(*args, **kwargs)

Segment project with the selected segmentation method.

extract(*args, **kwargs)

Extract single cells with the defined extraction method.

classify(*args, **kwargs)

Classify extracted single cells with the defined classification method.

select(*args, **kwargs)

Select specified classes using the defined selection method.
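
Taken together, a minimal project run might look as follows (the workflow classes and paths are assumptions chosen from the classes documented below and can be exchanged for any other combination):

from sparcscore.pipeline.project import Project
from sparcscore.pipeline.workflows import WGASegmentation
from sparcscore.pipeline.extraction import HDF5CellExtraction
from sparcscore.pipeline.classification import MLClusterClassifier
from sparcscore.pipeline.selection import LMDSelection

project = Project(
    "/path/to/project",
    config_path="/path/to/config.yml",
    segmentation_f=WGASegmentation,
    extraction_f=HDF5CellExtraction,
    classification_f=MLClusterClassifier,
    selection_f=LMDSelection,
)

project.load_input_from_file(["nucleus.tiff", "cytosol.tiff", "marker.tiff"])
project.segment()
project.extract()
project.classify(accessory=([], [], []))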

TimecourseProject

class sparcscore.pipeline.project.TimecourseProject(*args, **kwargs)

Bases: sparcscore.pipeline.project.Project

TimecourseProject class used to create a SPARCSpy project for datasets that have multiple fields of view that should be processed and analysed together. It is also capable of handling multiple timepoints for the same field of view or a combination of both. Like the base SPARCSpy Project, it manages all of the SPARCSpy processing steps. Because the input data has a different dimensionality than the base SPARCSpy Project class, it requires the use of specialized processing classes that are able to handle this additional dimensionality.

Parameters
  • location_path (str) – Path to the folder where the project should be created. The folder is created in case the specified folder does not exist.

  • config_path (str, optional, default "") – Path pointing to a valid configuration file. The file will be copied to the project directory and renamed to the name specified in DEFAULT_CONFIG_NAME. If no config is specified, the existing config in the project directory will be used, if possible. See the section configuration to find out more about the config file.

  • intermediate_output (bool, default False) – When set to True intermediate outputs will be saved where applicable.

  • debug (bool, default False) – When set to True debug outputs will be printed where applicable.

  • overwrite (bool, default False) – When set to True, the processing step directory will be completely deleted and newly created when called.

  • segmentation_f (Class, default None) – Class containing segmentation workflow.

  • extraction_f (Class, default None) – Class containing extraction workflow.

  • classification_f (Class, default None) – Class containing classification workflow.

  • selection_f (Class, default None) – Class containing selection workflow.

DEFAULT_CONFIG_NAME

Default config name which is used for the config file in the project directory. This name needs to be used when no config is supplied and the config is manually created in the project folder.

Type

str, default “config.yml”

DEFAULT_INPUT_IMAGE_NAME

Default file name for loading the input image.

Type

str, default “input_segmentation.h5”

DEFAULT_SEGMENTATION_DIR_NAME

Default foldername for the segmentation process.

Type

str, default “segmentation”

DEFAULT_EXTRACTION_DIR_NAME

Default foldername for the extraction process.

Type

str, default “extraction”

DEFAULT_CLASSIFICATION_DIR_NAME

Default foldername for the classification process.

Type

str, default “classification”

DEFAULT_SELECTION_DIR_NAME

Default foldername for the selection process.

Type

str, default “selection”

load_input_from_array(img, label, overwrite=False)

Function to load imaging data from an array into the TimecourseProject.

The provided array needs to fulfill the following conditions:

  • shape: NCYX

  • all images need to have the same dimensions and the same number of channels

  • channels need to be in the following order: nucleus, cytosol, other channels

  • dtype uint16

Parameters
  • img (numpy.ndarray) – Numpy array of shape “[num_images, channels, height, width]”.

  • label (numpy.ndarray) – Numpy array of shape “[num_images, num_labels]” containing the labels for each image. The labels need to have the following structure: “image_index”, “unique_image_identifier”, “…”

  • overwrite (bool, default False) – If set to True, the function will overwrite the existing input image.

load_input_from_files(input_dir, channels, timepoints, plate_layout, img_size=1080, overwrite=False)

Function to load timecourse experiments recorded with an Opera Phenix into the TimecourseProject.

Before being able to use this function the exported images from the Opera Phenix first need to be parsed, sorted and renamed using the sparcstools package.

In addition, a plate layout file needs to be created that contains information on the imaged experiment and the experimental conditions for each well. This file needs to be in the following format, using the well notation RowXX_WellXX:

Well            Condition1    Condition2
RowXX_WellXX    A             B

A tab needs to be used as a separator and the file saved as a .tsv file.
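
For illustration, a minimal plate layout file with two placeholder conditions could be written like this (well names and conditions are hypothetical):

# hypothetical plate layout with two condition columns; values are placeholders
plate_layout = (
    "Well\tCondition1\tCondition2\n"
    "Row01_Well01\tDMSO\t24h\n"
    "Row01_Well02\tCompound_A\t24h\n"
)

with open("plate_layout.tsv", "w") as f:
    f.write(plate_layout)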

Parameters
  • input_dir (str) – Path to the directory containing the sorted images from the Opera Phenix.

  • channels (list(str)) – List containing the names of the channels that should be loaded.

  • timepoints (list(str)) – List containing the names of the timepoints that should be loaded. Will return a warning if you try to load a timepoint that is not found in the data.

  • plate_layout (str) – Path to the plate layout file. For the format please see above.

  • img_size (int, default 1080) – Size of the images that should be loaded. All images will be cropped to this size.

  • overwrite (bool, default False) – If set to True, the function will overwrite the existing input image.

Example

>>> channels = ["DAPI", "Alexa488", "mCherry"]
>>> timepoints = ["Timepoint"+str(x).zfill(3) for x in list(range(1, 3))]
>>> input_dir = "path/to/sorted/outputs/from/sparcstools"
>>> plate_layout = "plate_layout.tsv"
>>> project.load_input_from_files(input_dir = input_dir,  channels = channels,  timepoints = timepoints, plate_layout = plate_layout, overwrite = True)
load_input_from_files_and_merge(input_dir, channels, timepoints, plate_layout, img_size=1080, stitching_channel='Alexa488', overlap=0.1, max_shift=10, overwrite=False, nucleus_channel='DAPI', cytosol_channel='Alexa488')

Function to load timecourse experiments recorded with an Opera Phenix into a TimecourseProject. In addition to loading the images, this wrapper function also stitches images acquired in the same well (this assumes that the tiles were acquired with overlap and in a rectangular shape) using the sparcstools package. Implementation of this function is currently still slow for many wells/timepoints as stitching is handled consecutively and not in parallel. This will be fixed in the future.

Parameters
  • input_dir (str) – Path to the directory containing the sorted images from the Opera Phenix.

  • channels (list(str)) – List containing the names of the channels that should be loaded.

  • timepoints (list(str)) – List containing the names of the timepoints that should be loaded. Will return a warning if you try to load a timepoint that is not found in the data.

  • plate_layout (str) – Path to the plate layout file. For the format please see above.

  • img_size (int, default 1080) – Size of the images that should be loaded. All images will be cropped to this size.

  • stitching_channel (str, default "Alexa488") – String indicating on which channel the stitching should be calculated.

  • overlap (float, default 0.1) – Float indicating the overlap between the tiles that were acquired.

  • max_shift (int, default 10) – Int indicating the maximum shift that is allowed when stitching the tiles. If the calculated shift between two tiles is larger than this threshold, the position of these tiles is not updated and is instead set according to the position calculated from the overlap.

  • overwrite (bool, default False) – If set to True, the function will overwrite the existing input image.

  • nucleus_channel (str, default "DAPI") – String indicating which channel should be used as the nucleus channel.

  • cytosol_channel (str, default "Alexa488") – String indicating which channel should be used as the cytosol channel.
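
By analogy with the load_input_from_files example above, a call might look like this (paths and channel names are placeholders):

channels = ["DAPI", "Alexa488", "mCherry"]
timepoints = ["Timepoint" + str(x).zfill(3) for x in range(1, 3)]

project.load_input_from_files_and_merge(
    input_dir="path/to/sorted/outputs/from/sparcstools",
    channels=channels,
    timepoints=timepoints,
    plate_layout="plate_layout.tsv",
    stitching_channel="Alexa488",
    overlap=0.1,
    max_shift=10,
    overwrite=True,
)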

segment(overwrite=False, *args, **kwargs)

Segment the timecourse project with the defined segmentation method.

extract(*args, **kwargs)

Extract single cells from a timecourse project with the defined extraction method.

segmentation

Segmentation

class sparcscore.pipeline.segmentation.Segmentation(*args, **kwargs)

Bases: sparcscore.pipeline.base.ProcessingStep

Segmentation helper class used for creating segmentation workflows.

maps

Segmentation workflows based on the Segmentation class can use maps for saving and loading checkpoints so that interrupted computations can be resumed. Maps can be numpy arrays.

Type

dict(str)

DEFAULT_OUTPUT_FILE

Default output file name for segmentations.

Type

str, default segmentation.h5

DEFAULT_FILTER_FILE

Default file with filtered class IDs.

Type

str, default classes.csv

PRINT_MAPS_ON_DEBUG

Type

bool, default False

identifier

Only set if called by ShardedSegmentation. Unique index of the shard.

Type

int, default None

window

Only set if called by ShardedSegmentation. Defines the window which is assigned to the shard. The window will be applied to the input. The first element refers to the first dimension of the image and so on. For example use [(0,1000),(0,2000)] to crop the image to 1000 px height and 2000 px width from the top left corner.

Type

list(tuple), default None

input_path

Only set if called by ShardedSegmentation. Location of the input hdf5 file. During sharded segmentation the ShardedSegmentation derived helper class will save the input image in form of a hdf5 file. This makes the input image available for parallel reading by the segmentation processes.

Type

str, default None

Example

def process(self):
    # two maps are initialized
    self.maps = {"map0": None,
                 "map1": None}

    # check whether the segmentation directory already contains these maps and load them if so.
    # The index of the first map that was not found is returned; it indicates the step at which
    # computation needs to resume.
    current_step = self.load_maps_from_disk()

    if current_step <= 0:
        # do stuff and generate map0
        self.save_map("map0")

    if current_step <= 1:
        # do stuff and generate map1
        self.save_map("map1")
initialize_as_shard(identifier, window, input_path)

Initialize Segmentation Step with further parameters needed for federated segmentation.

Important

This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.

Parameters
  • identifier (int) – Unique index of the shard.

  • window (list(tuple)) – Defines the window which is assigned to the shard. The window will be applied to the input. The first element refers to the first dimension of the image and so on. For example use [(0,1000),(0,2000)] to crop the image to 1000 px height and 2000 px width from the top left corner.

  • input_path (str) – Location of the input hdf5 file. During sharded segmentation the ShardedSegmentation derived helper class will save the input image in form of a hdf5 file. This makes the input image available for parallel reading by the segmentation processes.

call_as_shard()

Wrapper function for calling a sharded segmentation.

Important

This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.

save_segmentation(channels, labels, classes)

Saves the results of a segmentation at the end of the process.

Parameters
  • channels (np.array) – Numpy array of shape (height, width) or (channels, height, width). Channels are all data which are saved as floating point values, e.g. images.

  • labels (np.array) – Numpy array of shape (height, width). Labels are all data which are saved as integer values. These are mostly segmentation maps with integer values corresponding to the labels of cells.

  • classes (list(int)) – List of all classes in the labels array, which have passed the filtering step. All classes contained in this list will be extracted.
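
A rough sketch of how a custom process method might hand its results over (the segmentation itself is replaced by a placeholder):

import numpy as np

def process(self, input_image):
    # input_image: numpy array of shape (channels, height, width)
    channels = input_image.astype(np.float32)

    # placeholder segmentation: a single cell covering the whole image
    labels = np.ones(input_image.shape[1:], dtype=np.int32)

    # IDs of all cells that passed filtering; here simply every label present
    classes = [int(c) for c in np.unique(labels) if c != 0]

    self.save_segmentation(channels, labels, classes)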

save_segmentation_zarr(channels, labels)

Saves the results of a segmentation at the end of the process to ome.zarr.

load_maps_from_disk()

Tries to load all maps which were defined in self.maps and returns the current state of processing.

Returns

(int): Index of the first map which could not be loaded. An index of zero indicates that computation needs to start at the first map.

save_map(map_name)

Saves newly computed map.

Args

map_name (str): name of the map to be saved, as defined in self.maps.

Example

# declare all intermediate maps
self.maps = {"myMap": None}

# load intermediate maps if possible and get current processing step
current_step = self.load_maps_from_disk()

if current_step <= 0:

    # do some computations

    self.maps["myMap"] = myNumpyArray

    # save map
    self.save_map("myMap")

ShardedSegmentation

class sparcscore.pipeline.segmentation.ShardedSegmentation(*args, **kwargs)

Bases: sparcscore.pipeline.segmentation.Segmentation

Segmentation helper class for sharded segmentation workflows, in which the input image is split into shards (tiles) that are segmented individually and afterwards stitched back into a single segmentation (see resolve_sharding).

DEFAULT_OUTPUT_FILE

Default output file name for segmentations.

Type

str, default segmentation.h5

DEFAULT_FILTER_FILE

Default file with filtered class IDs.

Type

str, default classes.csv

DEFAULT_INPUT_IMAGE_NAME

Default name for the input image, which is written to disk as hdf5 file.

Type

str, default input_image.h5

DEFAULT_SHARD_FOLDER

Default folder name under which the individual shards are saved.

Type

str, default tiles

resolve_sharding(sharding_plan)

The function iterates over a sharding plan and generates a new stitched hdf5 based segmentation.

TimecourseSegmentation

class sparcscore.pipeline.segmentation.TimecourseSegmentation(*args, **kwargs)

Bases: sparcscore.pipeline.segmentation.Segmentation

Segmentation helper class used for creating segmentation workflows working with timecourse data.

initialize_as_shard(index, input_path)

Initialize Segmentation Step with further parameters needed for federated segmentation.

Important

This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.

Parameters
  • index (int) – Unique indexes of the elements that need to be segmented.

  • input_path (str) – Location of the input hdf5 file. During sharded segmentation the ShardedSegmentation derived helper class will save the input image in form of a hdf5 file. This makes the input image available for parallel reading by the segmentation processes.

call_as_shard()

Wrapper function for calling a sharded segmentation.

Important

This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.

save_segmentation(input_image, labels, classes)

Saves the results of a segmentation at the end of the process.

Parameters
  • labels (np.array) – Numpy array of shape (height, width). Labels are all data which are saved as integer values. These are mostly segmentation maps with integer values corresponding to the labels of cells.

  • classes (list(int)) – List of all classes in the labels array, which have passed the filtering step. All classes contained in this list will be extracted.

adjust_segmentation_indexes()

The function iterates over all present segmented files and adjusts the indexes so that they are unique throughout.

MultithreadedTimecourseSegmentation

class sparcscore.pipeline.segmentation.MultithreadedSegmentation(*args, **kwargs)

Bases: sparcscore.pipeline.segmentation.TimecourseSegmentation

workflows

class sparcscore.pipeline.workflows.BaseSegmentation(*args, **kwargs)
class sparcscore.pipeline.workflows.WGASegmentation(*args, **kwargs)
class sparcscore.pipeline.workflows.ShardedWGASegmentation(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.WGASegmentation

class sparcscore.pipeline.workflows.DAPISegmentation(*args, **kwargs)
class sparcscore.pipeline.workflows.ShardedDAPISegmentation(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.DAPISegmentation

class sparcscore.pipeline.workflows.DAPISegmentationCellpose(*args, **kwargs)
class sparcscore.pipeline.workflows.ShardedDAPISegmentationCellpose(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.DAPISegmentationCellpose

class sparcscore.pipeline.workflows.CytosolSegmentationCellpose(*args, **kwargs)
class sparcscore.pipeline.workflows.ShardedCytosolSegmentationCellpose(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.CytosolSegmentationCellpose

class sparcscore.pipeline.workflows.WGATimecourseSegmentation(*args, **kwargs)

Specialized processing for timecourse segmentation (i.e. smaller tiles from many different wells and/or timepoints that are not stitched together). No intermediate results are saved and everything is written to one .hdf5 file.

class WGASegmentation_Timecourse(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.WGASegmentation

method

alias of sparcscore.pipeline.workflows.WGATimecourseSegmentation.WGASegmentation_Timecourse

class sparcscore.pipeline.workflows.MultithreadedWGATimecourseSegmentation(*args, **kwargs)
class WGASegmentation_Timecourse(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.WGASegmentation

method

alias of sparcscore.pipeline.workflows.MultithreadedWGATimecourseSegmentation.WGASegmentation_Timecourse

class sparcscore.pipeline.workflows.Cytosol_Cellpose_TimecourseSegmentation(*args, **kwargs)

Specialized processing for timecourse segmentation (i.e. smaller tiles from many different wells and/or timepoints that are not stitched together). No intermediate results are saved and everything is written to one .hdf5 file. Uses Cellpose segmentation models.

class CytosolSegmentationCellpose_Timecourse(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.CytosolSegmentationCellpose

method

alias of sparcscore.pipeline.workflows.Cytosol_Cellpose_TimecourseSegmentation.CytosolSegmentationCellpose_Timecourse

class sparcscore.pipeline.workflows.Multithreaded_Cytosol_Cellpose_TimecourseSegmentation(*args, **kwargs)
class CytosolSegmentationCellpose_Timecourse(*args, **kwargs)
method

alias of sparcscore.pipeline.workflows.CytosolSegmentationCellpose

method

alias of sparcscore.pipeline.workflows.Multithreaded_Cytosol_Cellpose_TimecourseSegmentation.CytosolSegmentationCellpose_Timecourse

extraction

HDF5CellExtraction

class sparcscore.pipeline.extraction.HDF5CellExtraction(*args, **kwargs)

Bases: sparcscore.pipeline.base.ProcessingStep

A class to extract single-cell images from a segmented SPARCSpy project and save the results to an HDF5 file.

process(input_segmentation_path, filtered_classes_path)

Process function to run the extraction method.

Parameters
  • input_segmentation_path (str) – Path of the segmentation hdf5 file. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • filtered_classes_path (str) – Path of the filtered classes resulting from segmentation. If this class is used as part of a project processing workflow this argument will be provided automatically.

Important

If this class is used as part of a project processing workflow, all of the arguments will be provided by the Project class based on the previous segmentation. The Project class will automatically forward the most recent segmentation together with the supplied parameters.

Example

#after project is initialized and input data has been loaded and segmented
project.extract()

Note

The following parameters are required in the config file when running this method:

HDF5CellExtraction:

    compression: True

    #threads used in multithreading
    threads: 80

    # image size in pixel
    image_size: 128

    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"

    #specs to define how hdf5 data should be chunked and saved
    hdf5_rdcc_nbytes: 5242880000 # 5gb 1024 * 1024 * 5000
    hdf5_rdcc_w0: 1
    hdf5_rdcc_nslots: 50000

TimecourseHDF5CellExtraction

class sparcscore.pipeline.extraction.TimecourseHDF5CellExtraction(*args, **kwargs)

Bases: sparcscore.pipeline.extraction.HDF5CellExtraction

A class to extract single-cell images from a segmented SPARCSpy Timecourse project and save the results to an HDF5 file.

Functionality is the same as HDF5CellExtraction, except that the class is able to deal with an additional dimension (t) in the input data.

process(input_segmentation_path, filtered_classes_path)

Process function to run the extraction method.

Parameters
  • input_segmentation_path (str) – Path of the segmentation hdf5 file. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • filtered_classes_path (str) – Path of the filtered classes resulting from segmentation. If this class is used as part of a project processing workflow this argument will be provided automatically.

Important

If this class is used as part of a project processing workflow, all of the arguments will be provided by the Project class based on the previous segmentation. The Project class will automatically forward the most recent segmentation together with the supplied parameters.

Example

#after project is initialized and input data has been loaded and segmented
project.extract()

Note

The following parameters are required in the config file when running this method:

HDF5CellExtraction:

    compression: True

    #threads used in multithreading
    threads: 80

    # image size in pixel
    image_size: 128

    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"

    #specs to define how hdf5 data should be chunked and saved
    hdf5_rdcc_nbytes: 5242880000 # 5gb 1024 * 1024 * 5000
    hdf5_rdcc_w0: 1
    hdf5_rdcc_nslots: 50000

classification

MLClusterClassifier

class sparcscore.pipeline.classification.MLClusterClassifier(config, path, debug=False, overwrite=False, intermediate_output=True)

Class for classifying single cells using a pre-trained machine learning model. This class takes a pre-trained model and uses it to classify single cells, using the model's forward function or encoder function, depending on the user's choice. The classification results are saved to a TSV file.

__call__(extraction_dir, accessory, size=0, project_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>, accessory_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>)

Function called to perform classification on the provided HDF5 dataset.

Parameters
  • extraction_dir (str) – directory containing the extracted HDF5 files from the project. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • accessory (list) – list containing accessory datasets on which inference should be performed in addition to the cells contained within the current project

  • size (int, default = 0) – How many cells should be selected for inference. Default is 0, in which case all cells are selected.

Return type

Writes results to tsv files located in the project directory.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third argument need to be provided. The Project class will automatically provide the most recent extracted single-cell dataset together with the supplied parameters.

Example

# define accessory dataset: additional hdf5 datasets on which you want to perform inference
# leave empty if you only want to infer on all extracted cells in the current project

accessory = ([], [], [])
project.classify(accessory = accessory)

Note

The following parameters are required in the config file:

MLClusterClassifier:
    # channel number on which the classification should be performed
    channel_classification: 4

    #number of threads to use for dataloader
    threads: 24
    dataloader_worker: 24

    #batch size to pass to GPU
    batch_size: 900

    #path to pytorch checkpoint that should be used for inference
    network: "path/to/model/"

    #classifier architecture implemented in SPARCSpy
    # choose one of VGG1, VGG2, VGG1_old, VGG2_old
    classifier_architecture: "VGG2_old"

    #if more than one checkpoint is provided in the network directory which checkpoint should be chosen
    # should either be "max" or a numeric value indicating the epoch number
    epoch: "max"

    #name of the classifier used for saving the classification results to a directory
    screen_label: "Autophagy_15h_classifier1"

    # list of which inference methods should be performed
    # available: "forward" and "encoder"
    # if "forward": images are passed through all layers of the model and the final inference results are written to file
    # if "encoder": activations at the end of the CNN are written to file
    encoders: ["forward", "encoder"]

    # on which device inference should be performed
    # for speed should be "cuda"
    inference_device: "cuda"

CellFeaturizer

class sparcscore.pipeline.classification.CellFeaturizer(config, path, debug=False, overwrite=False, intermediate_output=True)

Class for extracting general image features from SPARCS single-cell image datasets. The extracted features are saved to a TSV file. The features are calculated on the basis of a specified channel.

The features which are calculated are:

  • area of the nucleus in px,

  • area of the cytosol in px,

  • mean intensity of chosen channel

  • median intensity of chosen channel,

  • 75% quantile of chosen channel,

  • 25% quantile of chosen channel,

  • summed intensity of the chosen channel in the region labeled as nucleus,

  • summed intensity of the chosen channel in the region labeled as cytosol,

  • summed intensity of the chosen channel in the region labelled as nucleus normalized to the nucleus area,

  • summed intensity of the chosen channel in the region labelled as cytosol normalized to the cytosol area

The features are output in this order in the tsv file.
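
As a rough numpy illustration of the statistics listed above (not the actual implementation of this class; the image crop and masks are placeholders):

import numpy as np

image = np.random.rand(128, 128)                 # chosen channel of one single-cell crop
nucleus_mask = np.zeros((128, 128), dtype=bool)
nucleus_mask[48:80, 48:80] = True                # placeholder nucleus region
cytosol_mask = ~nucleus_mask                     # placeholder cytosol region

features = {
    "nucleus_area": nucleus_mask.sum(),
    "cytosol_area": cytosol_mask.sum(),
    "mean_intensity": image.mean(),
    "median_intensity": np.median(image),
    "q75": np.quantile(image, 0.75),
    "q25": np.quantile(image, 0.25),
    "nucleus_sum": image[nucleus_mask].sum(),
    "cytosol_sum": image[cytosol_mask].sum(),
}
features["nucleus_sum_norm"] = features["nucleus_sum"] / features["nucleus_area"]
features["cytosol_sum_norm"] = features["cytosol_sum"] / features["cytosol_area"]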

__call__(extraction_dir, accessory, size=0, project_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>, accessory_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>)

Function called to perform featurization on the provided HDF5 dataset.

Parameters
  • extraction_dir (str) – directory containing the extracted HDF5 files from the project. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • accessory (list) – list containing accessory datasets on which inference should be performed in addition to the cells contained within the current project

  • size (int, default = 0) – How many cells should be selected for inference. Default is 0, in which case all cells are selected.

Return type

Writes results to tsv files located in the project directory.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third argument need to be provided. The Project class will automatically provide the most recent extraction results together with the supplied parameters.

Example

# define accessory dataset: additional hdf5 datasets on which you want to perform inference
# leave empty if you only want to infer on all extracted cells in the current project

accessory = ([], [], [])
project.classify(accessory = accessory)

Note

The following parameters are required in the config file:

CellFeaturizer:
    # channel number on which the featurization should be performed
    channel_classification: 4

    #number of threads to use for dataloader
    dataloader_worker: 0 #needs to be 0 if using cpu

    #batch size to pass to GPU
    batch_size: 900

    # on which device inference should be performed
    # for speed should be "cuda"
    inference_device: "cpu"

    #label under which the results should be saved
    screen_label: "Ch3_Featurization"

selection

LMDSelection

class sparcscore.pipeline.selection.LMDSelection(*args, **kwargs)

Bases: sparcscore.pipeline.base.ProcessingStep

Select single cells from a segmented hdf5 file and generate cutting data for the Leica LMD microscope. This method class relies on the functionality of the pylmd library.

process(hdf_location, cell_sets, calibration_marker)

Process function for selecting cells and generating their XML. Under the hood this method relies on the pylmd library and utilizes its SegmentationLoader class.

Parameters
  • hdf_location (str) – Path of the segmentation hdf5 file. If this class is used as part of a project processing workflow, this argument will be provided.

  • cell_sets (list of dict) – List of dictionaries containing the sets of cells which should be sorted into a single well.

  • calibration_marker (numpy.array) – Array of size ‘(3,2)’ containing the calibration marker coordinates in the ‘(row, column)’ format.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous segmentation. Therefore, only the second and third argument need to be provided. The Project class will automatically forward the most recent segmentation together with the supplied parameters.

Example

# Calibration marker should be defined as (row, column).
marker_0 = np.array([-10,-10])
marker_1 = np.array([-10,1100])
marker_2 = np.array([1100,505])

# A numpy Array of shape (3, 2) should be passed.
calibration_marker = np.array([marker_0, marker_1, marker_2])


# Sets of cells can be defined by providing a name and a list of classes in a dictionary.
cells_to_select = [{"name": "dataset1", "classes": [1,2,3]}]

# Alternatively, a path to a csv file can be provided.
# If a relative path is provided, it is accessed relative to the project's base directory.
cells_to_select += [{"name": "dataset2", "classes": "segmentation/class_subset.csv"}]

# If desired, wells can be passed with the individual sets.
cells_to_select += [{"name": "dataset3", "classes": [4,5,6], "well":"A1"}]

project.select(cells_to_select, calibration_marker)

Note

The following parameters are required in the config file:

LMDSelection:
    threads: 10

    # defines the channel used for generating cutting masks
    # segmentation.hdf5 => labels => segmentation_channel
    # When using WGA segmentation:
    #    0 corresponds to nuclear masks
    #    1 corresponds to cytosolic masks.
    segmentation_channel: 0

    # dilation of the cutting mask in pixel
    shape_dilation: 10

    # Cutting masks are transformed by binary dilation and erosion
    binary_smoothing: 3

    # number of datapoints which are averaged for smoothing
    # the number of datapoints over a distance of n pixels is 2*n
    convolution_smoothing: 25

    # fold reduction of datapoints for compression
    poly_compression_factor: 30

    # Optimization of the cutting path in between shapes
    # optimized paths improve the cutting time and the microscopes focus
    # valid options are ["none", "hilbert", "greedy"]
    path_optimization: "hilbert"

    # Parameter required for hilbert curve based path optimization.
    # Defines the order of the hilbert curve used, which needs to be tuned with the total cutting area.
    # For areas of 1 x 1 mm we recommend at least p = 4,  for whole slides we recommend p = 7.
    hilbert_p: 7

    # Parameter required for greedy path optimization.
    # Instead of a global distance matrix, the k nearest neighbours are approximated.
    # The optimization problem is then greedily solved for the known set of nearest neighbours until the first set of neighbours is exhausted.
    # Established edges are then removed and the nearest neighbour approximation is recursively repeated.
    greedy_k: 20

    # The LMD reads coordinates as integers which leads to rounding of decimal places.
    # Points spread between two whole coordinates are therefore collapsed to whole coordinates.
    # This can be mitigated by scaling the entire coordinate system by a defined factor.
    # For a resolution of 0.6 um / px a factor of 100 is recommended.
    xml_decimal_transform: 100

    # Overlapping shapes are merged based on a nearest neighbour heuristic.
    # All selected shapes closer than distance_heuristic pixel are checked for overlap.
    distance_heuristic: 300