pipeline
base
A collection of base classes from which other classes in the SPARCSpy environment can inherit that manage base functionalities like logging or directory creation.
Logable
Base Class which generates framework for logging.
- class sparcscore.pipeline.base.Logable(debug=False)
Object which can create log entries.
- Parameters
debug (bool, default False) – When set to True, log entries will be printed to the console.
- directory
A directory must be set in every descendant before log can be called.
- Type
str
- DEFAULT_LOG_NAME
Default log file name.
- Type
str, default processing.log
- DEFAULT_FORMAT
Date and time format used for logging. See datetime.strftime.
- Type
str
- log(message)
Log a message.
- Parameters
message (str, list(str), dict(str)) – Message to be written to the log.
- get_timestamp()
Get the current timestamp in the DEFAULT_FORMAT.
- Returns
Formatted timestamp.
- Return type
str
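As an illustration, a descendant class only needs to set a directory before logging. The sketch below is illustrative: the subclass name and directory are placeholders, and it assumes the directory exists and is writable.
from sparcscore.pipeline.base import Logable

class MyStep(Logable):  # hypothetical descendant
    def __init__(self, directory, debug=False):
        super().__init__(debug=debug)
        # a directory must be set before log() can be called
        self.directory = directory

step = MyStep("path/to/project", debug=True)
step.log("starting processing")   # appended to processing.log and, with debug=True, printed to the console
print(step.get_timestamp())       # current timestamp formatted with DEFAULT_FORMAT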
ProcessingStep
Starting point for all processing steps. Reads a config file that contains the parameters used to set up a processing method and generates the folder structure necessary for saving the generated outputs.
- class sparcscore.pipeline.base.ProcessingStep(config, directory, debug=False, intermediate_output=False, overwrite=True)
Bases:
sparcscore.pipeline.base.Logable
Processing step. Can load a configuration file and create a subdirectory under the project class for the processing step.
- Parameters
config (dict) – Config file which is passed by the Project class when called. Is loaded from the project based on the name of the class.
directory (str) – Directory which should be used by the processing step. The directory will be newly created if it does not exist yet. When used with the vipercore.pipeline.project.Project class, a subdirectory of the project directory is passed.
intermediate_output (bool, default False) – When set to True, intermediate outputs will be saved where applicable.
debug (bool, default False) – When set to True, debug outputs will be printed where applicable.
overwrite (bool, default False) – When set to True, the processing step directory will be completely deleted and newly created when called.
- register_parameter(key, value)
Registers a new parameter by updating the configuration dictionary if the key does not already exist.
- Parameters
key (str) – Name of the parameter.
value – Value of the parameter.
- get_directory()
Get the directory for this processing step.
- Returns
Directory path.
- Return type
str
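The following sketch shows how a custom processing step might be defined and used, assuming a config dictionary and a target directory are available; the class name, parameter name, and process method are illustrative and mirror the pattern used by the segmentation and extraction steps documented below.
from sparcscore.pipeline.base import ProcessingStep

class MyProcessingStep(ProcessingStep):  # hypothetical processing step
    def process(self):
        # register a default value in case the config does not define the parameter
        self.register_parameter("window_size", 128)
        self.log(f"writing outputs to {self.get_directory()}")

# config for the step, as it would be read from the project's config.yml
config = {"window_size": 256}
step = MyProcessingStep(config, "path/to/project/my_step", debug=True)
step.process()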
project
Within SPARCSpy, all operations are centered around the concept of a Project. A Project is a Python class which manages all of the SPARCSpy processing steps and is the central element through which all operations are performed. Each Project directly maps to a directory on the file system which contains all of the inputs to a specific SPARCSpy run as well as the generated outputs. Depending on the structure of the data that is to be processed, a different Project class is required. Please see here for more information.
Project
- class sparcscore.pipeline.project.Project(location_path, config_path='', *args, intermediate_output=False, debug=False, overwrite=False, segmentation_f=None, extraction_f=None, classification_f=None, selection_f=None, **kwargs)
Bases:
sparcscore.pipeline.base.Logable
Project base class used to create a SPARCSpy project. This class manages all of the SPARCSpy processing steps. It directly maps to a directory on the file system which contains all of the project inputs as well as the generated outputs.
- Parameters
location_path (str) – Path to the folder where the project should be created. The folder is created in case the specified folder does not exist.
config_path (str, optional, default "") – Path pointing to a valid configuration file. The file will be copied to the project directory and renamed to the name specified in DEFAULT_CONFIG_NAME. If no config is specified, the existing config in the project directory will be used, if possible. See the configuration section to find out more about the config file.
intermediate_output (bool, default False) – When set to True, intermediate outputs will be saved where applicable.
debug (bool, default False) – When set to True debug outputs will be printed where applicable.
overwrite (bool, default False) – When set to True, the processing step directory will be completely deleted and newly created when called.
segmentation_f (Class, default None) – Class containing segmentation workflow.
extraction_f (Class, default None) – Class containing extraction workflow.
classification_f (Class, default None) – Class containing classification workflow.
selection_f (Class, default None) – Class containing selection workflow.
- DEFAULT_CONFIG_NAME
Default config name which is used for the config file in the project directory. This name needs to be used when no config is supplied and the config is manually created in the project folder.
- Type
str, default “config.yml”
- DEFAULT_SEGMENTATION_DIR_NAME
Default foldername for the segmentation process.
- Type
str, default “segmentation”
- DEFAULT_EXTRACTION_DIR_NAME
Default foldername for the extraction process.
- Type
str, default “extraction”
- DEFAULT_CLASSIFICATION_DIR_NAME
Default foldername for the classification process.
- Type
str, default “classification”
- DEFAULT_SELECTION_DIR_NAME
Default foldername for the selection process.
- Type
str, default “selection”
- load_input_from_file(file_paths, crop=[(0, -1), (0, -1)])
Load input image from a list of files. The channels need to be specified in the following order: nucleus, cytosol, other channels.
- Parameters
file_paths (list(str)) – List containing paths to each channel like [“path1/img.tiff”, “path2/img.tiff”, “path3/img.tiff”]. Expects a list of file paths with length “input_channel” as defined in the config.yml.
crop (list(tuple), optional) – When set, it can be used to crop the input image. The first element refers to the first dimension of the image and so on. For example use “[(0,1000),(0,2000)]” to crop the image to 1000 px height and 2000 px width from the top left corner.
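A usage sketch, assuming the project has been initialized and the config expects three input channels; the file paths are placeholders.
# channel order: nucleus, cytosol, other channels
file_paths = ["path1/nucleus.tiff", "path2/cytosol.tiff", "path3/marker.tiff"]

# optionally crop to 1000 px height and 2000 px width from the top left corner
project.load_input_from_file(file_paths, crop=[(0, 1000), (0, 2000)])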
- load_input_from_array(array, remap=None)
Load input image from an already loaded numpy array. The numpy array needs to have the following shape: CXY. The channels need to be in the following order: nucleus, cell membrane channel, other channels, or a remapping needs to be defined.
- Parameters
array (numpy.ndarray) – Numpy array of shape “[channels, height, width]”.
remap (list(int), optional) – Define remapping of channels. For example use “[1, 0, 2]” to change the order of the first and the second channel. The expected order is Nucleus Channel, Cellmembrane Channel followed by other channels.
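A corresponding sketch for in-memory data; the array contents and the remapping shown are illustrative only.
import numpy as np

# array of shape [channels, height, width], here: cytosol, nucleus, marker
array = np.zeros((3, 1000, 1000), dtype=np.uint16)

# remap so that the nucleus channel comes first, followed by the cell membrane channel
project.load_input_from_array(array, remap=[1, 0, 2])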
- segment(*args, **kwargs)
Segment project with the selected segmentation method.
- extract(*args, **kwargs)
Extract single cells with the defined extraction method.
- classify(*args, **kwargs)
Classify extracted single cells with the defined classification method.
- select(*args, **kwargs)
Select specified classes using the defined selection method.
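Taken together, a typical SPARCSpy run might look like the following sketch. The workflow classes, paths, and config file are placeholders chosen from the classes documented further below; selection is omitted because it additionally requires cell sets and calibration markers (see LMDSelection).
from sparcscore.pipeline.project import Project
from sparcscore.pipeline.workflows import WGASegmentation
from sparcscore.pipeline.extraction import HDF5CellExtraction
from sparcscore.pipeline.classification import MLClusterClassifier

project = Project(
    "path/to/project",
    config_path="path/to/config.yml",
    segmentation_f=WGASegmentation,
    extraction_f=HDF5CellExtraction,
    classification_f=MLClusterClassifier,
)

project.load_input_from_file(["nucleus.tiff", "cytosol.tiff", "marker.tiff"])
project.segment()
project.extract()
project.classify(accessory=([], [], []))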
TimecourseProject
- class sparcscore.pipeline.project.TimecourseProject(*args, **kwargs)
Bases:
sparcscore.pipeline.project.Project
TimecourseProject class used to create a SPARCSpy project for datasets that have multiple fields of view that should be processed and analysed together. It is also capable of handling multiple timepoints for the same field of view, or a combination of both. Like the base SPARCSpy Project, it manages all of the SPARCSpy processing steps. Because the input data has a different dimensionality than the base SPARCSpy Project class, it requires the use of specialized processing classes that are able to handle this additional dimensionality.
- Parameters
location_path (str) – Path to the folder where the project should be created. The folder is created in case the specified folder does not exist.
config_path (str, optional, default "") – Path pointing to a valid configuration file. The file will be copied to the project directory and renamed to the name specified in DEFAULT_CONFIG_NAME. If no config is specified, the existing config in the project directory will be used, if possible. See the configuration section to find out more about the config file.
intermediate_output (bool, default False) – When set to True, intermediate outputs will be saved where applicable.
debug (bool, default False) – When set to True debug outputs will be printed where applicable.
overwrite (bool, default False) – When set to True, the processing step directory will be completely deleted and newly created when called.
segmentation_f (Class, default None) – Class containing segmentation workflow.
extraction_f (Class, default None) – Class containing extraction workflow.
classification_f (Class, default None) – Class containing classification workflow.
selection_f (Class, default None) – Class containing selection workflow.
- DEFAULT_CONFIG_NAME
Default config name which is used for the config file in the project directory. This name needs to be used when no config is supplied and the config is manually created in the project folder.
- Type
str, default “config.yml”
- DEFAULT_INPUT_IMAGE_NAME
Default file name for loading the input image.
- Type
str, default “input_segmentation.h5”
- DEFAULT_SEGMENTATION_DIR_NAME
Default foldername for the segmentation process.
- Type
str, default “segmentation”
- DEFAULT_EXTRACTION_DIR_NAME
Default foldername for the extraction process.
- Type
str, default “extraction”
- DEFAULT_CLASSIFICATION_DIR_NAME
Default foldername for the classification process.
- Type
str, default “classification”
- DEFAULT_SELECTION_DIR_NAME
Default foldername for the selection process.
- Type
str, default “selection”
- load_input_from_array(img, label, overwrite=False)
Function to load imaging data from an array into the TimecourseProject.
The provided array needs to fulfill the following conditions:
- shape: NCYX
- all images need to have the same dimensions and the same number of channels
- channels need to be in the following order: nucleus, cytosol, other channels
- dtype uint16
- Parameters
img (numpy.ndarray) – Numpy array of shape “[num_images, channels, height, width]”.
label (numpy.ndarray) – Numpy array of shape “[num_images, num_labels]” containing the labels for each image. The labels need to have the following structure: “image_index”, “unique_image_identifier”, “…”
overwrite (bool, default False) – If set to True, the function will overwrite the existing input image.
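A sketch of loading a small stack together with its labels; the exact label columns beyond image_index and unique_image_identifier, as well as the identifier format, are illustrative.
import numpy as np

# four images, three channels (nucleus, cytosol, marker), 1080 x 1080 px, dtype uint16
img = np.zeros((4, 3, 1080, 1080), dtype=np.uint16)

# one label row per image: image_index, unique_image_identifier, further annotations
label = np.array(
    [[str(i), f"Row02_Well02_Timepoint{i:03d}", "conditionA"] for i in range(4)]
)

project.load_input_from_array(img, label, overwrite=True)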
- load_input_from_files(input_dir, channels, timepoints, plate_layout, img_size=1080, overwrite=False)
Function to load timecourse experiments recorded with an opera phenix into the TimecourseProject.
Before being able to use this function the exported images from the opera phenix first need to be parsed, sorted and renamed using the sparcstools package.
In addition, a plate layout file needs to be created that contains information on the imaged experiment and the experimental conditions for each well. This file needs to be in the following format, using the well notation RowXX_WellXX:

Well            Condition1    Condition2    …
RowXX_WellXX    A             B             …

A tab needs to be used as a separator and the file saved as a .tsv file.
- Parameters
input_dir (str) – Path to the directory containing the sorted images from the opera phenix.
channels (list(str)) – List containing the names of the channels that should be loaded.
timepoints (list(str)) – List containing the names of the timepoints that should be loaded. Will return a warning if you try to load a timepoint that is not found in the data.
plate_layout (str) – Path to the plate layout file. For the format please see above.
img_size (int, default 1080) – Size of the images that should be loaded. All images will be cropped to this size.
overwrite (bool, default False) – If set to True, the function will overwrite the existing input image.
Example
>>> channels = ["DAPI", "Alexa488", "mCherry"]
>>> timepoints = ["Timepoint" + str(x).zfill(3) for x in list(range(1, 3))]
>>> input_dir = "path/to/sorted/outputs/from/sparcstools"
>>> plate_layout = "plate_layout.tsv"
>>> project.load_input_from_files(input_dir = input_dir, channels = channels, timepoints = timepoints, plate_layout = plate_layout, overwrite = True)
- load_input_from_files_and_merge(input_dir, channels, timepoints, plate_layout, img_size=1080, stitching_channel='Alexa488', overlap=0.1, max_shift=10, overwrite=False, nucleus_channel='DAPI', cytosol_channel='Alexa488')
Function to load timecourse experiments recorded with an opera phenix into a TimecourseProject. In addition to loading the images, this wrapper function also stitches images acquired in the same well (this assumes that the tiles were acquired with overlap and in a rectangular shape) using the sparcstools package. Implementation of this function is currently still slow for many wells/timepoints as stitching is handled consecutively and not in parallel. This will be fixed in the future.
- Parameters
input_dir (str) – Path to the directory containing the sorted images from the opera phenix.
channels (list(str)) – List containing the names of the channels that should be loaded.
timepoints (list(str)) – List containing the names of the timepoints that should be loaded. Will return a warning if you try to load a timepoint that is not found in the data.
plate_layout (str) – Path to the plate layout file. For the format please see above.
img_size (int, default 1080) – Size of the images that should be loaded. All images will be cropped to this size.
stitching_channel (str, default "Alexa488") – string indicating on which channel the stitching should be calculated.
overlap (float, default 0.1) – float indicating the overlap between the tiles that were acquired.
max_shift (int, default 10) – int indicating the maximum shift that is allowed when stitching the tiles. If a calculated shift is larger than this threshold between two tiles then the position of these tiles is not updated and is set according to the calculated position based on the overlap.
overwrite (bool, default False) – If set to True, the function will overwrite the existing input image.
nucleus_channel (str, default "DAPI") – string indicating the channel that should be used for the nucleus channel.
cytosol_channel (str, default "Alexa488") – string indicating the channel that should be used for the cytosol channel.
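A usage sketch analogous to the load_input_from_files example above, assuming tiles were acquired with 10% overlap; all paths are placeholders.
channels = ["DAPI", "Alexa488", "mCherry"]
timepoints = ["Timepoint" + str(x).zfill(3) for x in range(1, 3)]

project.load_input_from_files_and_merge(
    input_dir="path/to/sorted/outputs/from/sparcstools",
    channels=channels,
    timepoints=timepoints,
    plate_layout="plate_layout.tsv",
    stitching_channel="Alexa488",
    overlap=0.1,
    max_shift=10,
    nucleus_channel="DAPI",
    cytosol_channel="Alexa488",
    overwrite=True,
)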
- segment(overwrite=False, *args, **kwargs)
Segment timecourse project with the defined segmentation method.
- extract(*args, **kwargs)
Extract single cells from a timecourse project with the defined extraction method.
segmentation
Segmentation
- class sparcscore.pipeline.segmentation.Segmentation(*args, **kwargs)
Bases:
sparcscore.pipeline.base.ProcessingStep
Segmentation helper class used for creating segmentation workflows.
- maps
Segmentation workflows based on the Segmentation class can use maps for saving and loading checkpoints. Maps can be numpy arrays.
- Type
dict(str)
- DEFAULT_OUTPUT_FILE
- Type
str, default segmentation.h5
- DEFAULT_FILTER_FILE
- Type
str, default classes.csv
- PRINT_MAPS_ON_DEBUG
- Type
bool, default False
- identifier
Only set if called by ShardedSegmentation. Unique index of the shard.
- Type
int, default None
- window
Only set if called by ShardedSegmentation. Defines the window which is assigned to the shard. The window will be applied to the input. The first element refers to the first dimension of the image and so on. For example, use [(0,1000),(0,2000)] to crop the image to 1000 px height and 2000 px width from the top left corner.
- Type
list(tuple), default None
- input_path
Only set if called by ShardedSegmentation. Location of the input hdf5 file. During sharded segmentation the ShardedSegmentation derived helper class will save the input image in the form of an hdf5 file. This makes the input image available for parallel reading by the segmentation processes.
- Type
str, default None
Example
def process(self):
    # two maps are initialized
    self.maps = {"map0": None, "map1": None}

    # it is checked whether the segmentation directory already contains these maps; if so, they are loaded.
    # The index of the first map which has not been found is returned, indicating the step where computation needs to resume.
    current_step = self.load_maps_from_disk()

    if current_step <= 0:
        # do stuff and generate map0
        self.save_map("map0")

    if current_step <= 1:
        # do stuff and generate map1
        self.save_map("map1")
- initialize_as_shard(identifier, window, input_path)
Initialize Segmentation Step with further parameters needed for federated segmentation.
Important
This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.
- Parameters
identifier (int) – Unique index of the shard.
window (list(tuple)) – Defines the window which is assigned to the shard. The window will be applied to the input. The first element refers to the first dimension of the image and so on. For example, use [(0,1000),(0,2000)] to crop the image to 1000 px height and 2000 px width from the top left corner.
input_path (str) – Location of the input hdf5 file. During sharded segmentation the ShardedSegmentation derived helper class will save the input image in the form of an hdf5 file. This makes the input image available for parallel reading by the segmentation processes.
- call_as_shard()
Wrapper function for calling a sharded segmentation.
Important
This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.
- save_segmentation(channels, labels, classes)
Saves the results of a segmentation at the end of the process.
- Parameters
channels (np.array) – Numpy array of shape (height, width) or (channels, height, width). Channels are all data which are saved as floating point values, e.g. images.
labels (np.array) – Numpy array of shape (height, width). Labels are all data which are saved as integer values. These are mostly segmentation maps with integer values corresponding to the labels of cells.
classes (list(int)) – List of all classes in the labels array which have passed the filtering step. All classes contained in this list will be extracted.
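A minimal sketch of how a custom workflow might end its process method by saving its results; the map name and the way labels and classes are derived are placeholders under the documented signature.
import numpy as np

def process(self, input_image):
    # ... segmentation computation producing an integer label map ...
    labels = self.maps["segmentation"]          # hypothetical map holding the final segmentation
    classes = np.unique(labels)[1:].tolist()    # all cell labels except the background value 0

    # channels: floating point image data, labels: integer maps, classes: labels that passed filtering
    self.save_segmentation(input_image, labels, classes)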
- save_segmentation_zarr(channels, labels)
Saves the results of a segmentation at the end of the process to ome.zarr.
- load_maps_from_disk()
Tries to load all maps which were defined in self.maps and returns the current state of processing.
- Returns
(int): Index of the first map which could not be loaded. An index of zero indicates that computation needs to start at the first map.
- save_map(map_name)
Saves newly computed map.
- Args
map_name (str): Name of the map to be saved, as defined in self.maps.
Example
# declare all intermediate maps
self.maps = {"myMap": None}

# load intermediate maps if possible and get current processing step
current_step = self.load_maps_from_disk()

if current_step <= 0:
    # do some computations
    self.maps["myMap"] = myNumpyArray
    # save map
    self.save_map("myMap")
TimecourseSegmentation
- class sparcscore.pipeline.segmentation.TimecourseSegmentation(*args, **kwargs)
Bases:
sparcscore.pipeline.segmentation.Segmentation
Segmentation helper class used for creating segmentation workflows working with timecourse data.
- initialize_as_shard(index, input_path)
Initialize Segmentation Step with further parameters needed for federated segmentation.
Important
This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.
- Parameters
index (int) – Unique indexes of the elements that need to be segmented.
input_path (str) – Location of the input hdf5 file. During sharded segmentation the ShardedSegmentation derived helper class will save the input image in the form of an hdf5 file. This makes the input image available for parallel reading by the segmentation processes.
- call_as_shard()
Wrapper function for calling a sharded segmentation.
Important
This function is intended for internal use by the ShardedSegmentation helper class. In most cases it is not relevant to the creation of custom segmentation workflows.
- save_segmentation(input_image, labels, classes)
Saves the results of a segmentation at the end of the process.
- Parameters
labels (np.array) – Numpy array of shape (height, width). Labels are all data which are saved as integer values. These are mostly segmentation maps with integer values corresponding to the labels of cells.
classes (list(int)) – List of all classes in the labels array which have passed the filtering step. All classes contained in this list will be extracted.
- adjust_segmentation_indexes()
The function iterates over all present segmented files and adjusts the indexes so that they are unique throughout.
MultithreadedTimecourseSegmentation
- class sparcscore.pipeline.segmentation.MultithreadedSegmentation(*args, **kwargs)
Bases:
sparcscore.pipeline.segmentation.TimecourseSegmentation
workflows
- class sparcscore.pipeline.workflows.BaseSegmentation(*args, **kwargs)
- class sparcscore.pipeline.workflows.WGASegmentation(*args, **kwargs)
- class sparcscore.pipeline.workflows.DAPISegmentation(*args, **kwargs)
- class sparcscore.pipeline.workflows.DAPISegmentationCellpose(*args, **kwargs)
- class sparcscore.pipeline.workflows.ShardedDAPISegmentationCellpose(*args, **kwargs)
- method
alias of
sparcscore.pipeline.workflows.DAPISegmentationCellpose
- class sparcscore.pipeline.workflows.CytosolSegmentationCellpose(*args, **kwargs)
- class sparcscore.pipeline.workflows.ShardedCytosolSegmentationCellpose(*args, **kwargs)
- method
alias of
sparcscore.pipeline.workflows.CytosolSegmentationCellpose
- class sparcscore.pipeline.workflows.WGATimecourseSegmentation(*args, **kwargs)
Specialized processing for timecourse segmentation (i.e. smaller tiles not stitched together, from many different wells and/or timepoints). No intermediate results are saved and everything is written to one .hdf5 file.
- class sparcscore.pipeline.workflows.MultithreadedWGATimecourseSegmentation(*args, **kwargs)
- class sparcscore.pipeline.workflows.Cytosol_Cellpose_TimecourseSegmentation(*args, **kwargs)
Specialized processing for timecourse segmentation (i.e. smaller tiles not stitched together, from many different wells and/or timepoints). No intermediate results are saved and everything is written to one .hdf5 file. Uses Cellpose segmentation models.
- class CytosolSegmentationCellpose_Timecourse(*args, **kwargs)
- method
alias of
sparcscore.pipeline.workflows.CytosolSegmentationCellpose
- class sparcscore.pipeline.workflows.Multithreaded_Cytosol_Cellpose_TimecourseSegmentation(*args, **kwargs)
- class CytosolSegmentationCellpose_Timecourse(*args, **kwargs)
- method
alias of
sparcscore.pipeline.workflows.CytosolSegmentationCellpose
extraction
HDF5CellExtraction
- class sparcscore.pipeline.extraction.HDF5CellExtraction(*args, **kwargs)
Bases:
sparcscore.pipeline.base.ProcessingStep
A class to extract single cell images from a segmented SPARCSpy project and save the results to an HDF5 file.
- process(input_segmentation_path, filtered_classes_path)
Process function to run the extraction method.
- Parameters
input_segmentation_path (str) – Path of the segmentation hdf5 file. If this class is used as part of a project processing workflow this argument will be provided automatically.
filtered_classes_path (str) – Path of the filtered classes resulting from segmentation. If this class is used as part of a project processing workflow this argument will be provided automatically.
Important
If this class is used as part of a project processing workflow, all of the arguments will be provided by the Project class based on the previous segmentation. The Project class will automatically forward the most recent segmentation together with the supplied parameters.
Example
# after the project has been initialized and input data has been loaded and segmented
project.extract()
Note
The following parameters are required in the config file when running this method:
HDF5CellExtraction:
    compression: True
    # threads used in multithreading
    threads: 80
    # image size in pixel
    image_size: 128
    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"
    # specs to define how hdf5 data should be chunked and saved
    hdf5_rdcc_nbytes: 5242880000 # 5 GB (1024 * 1024 * 5000)
    hdf5_rdcc_w0: 1
    hdf5_rdcc_nslots: 50000
TimecourseHDF5CellExtraction
- class sparcscore.pipeline.extraction.TimecourseHDF5CellExtraction(*args, **kwargs)
Bases:
sparcscore.pipeline.extraction.HDF5CellExtraction
A class to extract single cell images from a segmented SPARCSpy Timecourse project and save the results to an HDF5 file.
Functionality is the same as the HDF5CellExtraction except that the class is able to deal with an additional dimension (t) in the input data.
- process(input_segmentation_path, filtered_classes_path)
Process function to run the extraction method.
- Parameters
input_segmentation_path (str) – Path of the segmentation hdf5 file. If this class is used as part of a project processing workflow this argument will be provided automatically.
filtered_classes_path (str) – Path of the filtered classes resulting from segmentation. If this class is used as part of a project processing workflow this argument will be provided automatically.
Important
If this class is used as part of a project processing workflow, all of the arguments will be provided by the Project class based on the previous segmentation. The Project class will automatically forward the most recent segmentation together with the supplied parameters.
Example
# after the project has been initialized and input data has been loaded and segmented
project.extract()
Note
The following parameters are required in the config file when running this method:
HDF5CellExtraction:
    compression: True
    # threads used in multithreading
    threads: 80
    # image size in pixel
    image_size: 128
    # directory where intermediate results should be saved
    cache: "/mnt/temp/cache"
    # specs to define how hdf5 data should be chunked and saved
    hdf5_rdcc_nbytes: 5242880000 # 5 GB (1024 * 1024 * 5000)
    hdf5_rdcc_w0: 1
    hdf5_rdcc_nslots: 50000
classification
MLClusterClassifier
- class sparcscore.pipeline.classification.MLClusterClassifier(config, path, debug=False, overwrite=False, intermediate_output=True)
Class for classifying single cells using a pre-trained machine learning model. This class takes a pre-trained model and uses it to classify single cells, using the model’s forward function or encoder function, depending on the user’s choice. The classification results are saved to a TSV file.
- __call__(extraction_dir, accessory, size=0, project_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>, accessory_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>)
Function called to perform classification on the provided HDF5 dataset.
- Parameters
extraction_dir (str) – directory containing the extracted HDF5 files from the project. If this class is used as part of a project processing workflow this argument will be provided automatically.
accessory (list) – list containing accessory datasets on which inference should be performed in addition to the cells contained within the current project
size (int, default = 0) – How many cells should be selected for inference. Default is 0, then all cells are selected.
- Return type
Writes results to tsv files located in the project directory.
Important
If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third argument need to be provided. The Project class will automatically provide the most recent extracted single-cell dataset together with the supplied parameters.
Example
# define accessory dataset: additional hdf5 datasets on which inference should be performed
# leave empty if you only want to run inference on all extracted cells in the current project
accessory = ([], [], [])
project.classify(accessory=accessory)
Note
The following parameters are required in the config file:
MLClusterClassifier:
    # channel number on which the classification should be performed
    channel_classification: 4
    # number of threads to use for dataloader
    threads: 24
    dataloader_worker: 24
    # batch size to pass to GPU
    batch_size: 900
    # path to pytorch checkpoint that should be used for inference
    network: "path/to/model/"
    # classifier architecture implemented in SPARCSpy
    # choose one of VGG1, VGG2, VGG1_old, VGG2_old
    classifier_architecture: "VGG2_old"
    # if more than one checkpoint is provided in the network directory, which checkpoint should be chosen
    # should either be "max" or a numeric value indicating the epoch number
    epoch: "max"
    # name of the classifier used for saving the classification results to a directory
    screen_label: "Autophagy_15h_classifier1"
    # list of which inference methods should be performed
    # available: "forward" and "encoder"
    # if "forward": images are passed through all layers of the model and the final inference results are written to file
    # if "encoder": activations at the end of the CNN are written to file
    encoders: ["forward", "encoder"]
    # on which device inference should be performed
    # for speed this should be "cuda"
    inference_device: "cuda"
CellFeaturizer
- class sparcscore.pipeline.classification.CellFeaturizer(config, path, debug=False, overwrite=False, intermediate_output=True)
Class for extracting general image features from SPARCS single-cell image datasets. The extracted features are saved to a TSV file. The features are calculated on the basis of a specified channel.
The features which are calculated are:
area of the nucleus in px,
area of the cytosol in px,
mean intensity of the chosen channel,
median intensity of the chosen channel,
75% quantile of the chosen channel,
25% quantile of the chosen channel,
summed intensity of the chosen channel in the region labelled as nucleus,
summed intensity of the chosen channel in the region labelled as cytosol,
summed intensity of the chosen channel in the region labelled as nucleus, normalized to the nucleus area,
summed intensity of the chosen channel in the region labelled as cytosol, normalized to the cytosol area.
The features are output in this order in the tsv file.
- __call__(extraction_dir, accessory, size=0, project_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>, accessory_dataloader=<class 'sparcscore.ml.datasets.HDF5SingleCellDataset'>)
Function called to perform featurization on the provided HDF5 dataset.
- Parameters
extraction_dir (str) – directory containing the extracted HDF5 files from the project. If this class is used as part of a project processing workflow this argument will be provided automatically.
accessory (list) – list containing accessory datasets on which inference should be performed in addition to the cells contained within the current project
size (int, default = 0) – How many cells should be selected for inference. Default is 0, then all cells are selected.
- Return type
Writes results to tsv files located in the project directory.
Important
If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third argument need to be provided. The Project class will automatically provide the most recent extraction results together with the supplied parameters.
Example
# define accessory dataset: additional hdf5 datasets on which inference should be performed
# leave empty if you only want to run inference on all extracted cells in the current project
accessory = ([], [], [])
project.classify(accessory=accessory)
Note
The following parameters are required in the config file:
CellFeaturizer:
    # channel number on which the featurization should be performed
    channel_classification: 4
    # number of threads to use for dataloader
    dataloader_worker: 0 # needs to be 0 if using cpu
    # batch size to pass to GPU
    batch_size: 900
    # on which device inference should be performed
    # for speed this should be "cuda"
    inference_device: "cpu"
    # label under which the results should be saved
    screen_label: "Ch3_Featurization"
selection
LMDSelection
- class sparcscore.pipeline.selection.LMDSelection(*args, **kwargs)
Bases:
sparcscore.pipeline.base.ProcessingStep
Select single cells from a segmented hdf5 file and generate cutting data for the Leica LMD microscope. This method class relies on the functionality of the pylmd library.
- process(hdf_location, cell_sets, calibration_marker)
Process function for selecting cells and generating their XML. Under the hood this method relies on the pylmd library and utilizes its SegmentationLoader class.
- Parameters
hdf_location (str) – Path of the segmentation hdf5 file. If this class is used as part of a project processing workflow, this argument will be provided.
cell_sets (list of dict) – List of dictionaries containing the sets of cells which should be sorted into a single well.
calibration_marker (numpy.array) – Array of size ‘(3,2)’ containing the calibration marker coordinates in the ‘(row, column)’ format.
Important
If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous segmentation. Therefore, only the second and third argument need to be provided. The Project class will automatically forward the most recent segmentation together with the supplied parameters.
Example
# Calibration markers should be defined as (row, column).
marker_0 = np.array([-10, -10])
marker_1 = np.array([-10, 1100])
marker_2 = np.array([1100, 505])

# A numpy array of shape (3, 2) should be passed.
calibration_marker = np.array([marker_0, marker_1, marker_2])

# Sets of cells can be defined by providing a name and a list of classes in a dictionary.
cells_to_select = [{"name": "dataset1", "classes": [1, 2, 3]}]

# Alternatively, a path to a csv file can be provided.
# If a relative path is provided, it is accessed relative to the project's base directory.
cells_to_select += [{"name": "dataset2", "classes": "segmentation/class_subset.csv"}]

# If desired, wells can be passed with the individual sets.
cells_to_select += [{"name": "dataset3", "classes": [4, 5, 6], "well": "A1"}]

project.select(cells_to_select, calibration_marker)
Note
The following parameters are required in the config file:
LMDSelection:
    threads: 10

    # defines the channel used for generating cutting masks
    # segmentation.hdf5 => labels => segmentation_channel
    # When using WGA segmentation:
    #   0 corresponds to nuclear masks
    #   1 corresponds to cytosolic masks
    segmentation_channel: 0

    # dilation of the cutting mask in pixel
    shape_dilation: 10

    # cutting masks are transformed by binary dilation and erosion
    binary_smoothing: 3

    # number of datapoints which are averaged for smoothing
    # the number of datapoints over a distance of n pixel is 2*n
    convolution_smoothing: 25

    # fold reduction of datapoints for compression
    poly_compression_factor: 30

    # optimization of the cutting path in between shapes
    # optimized paths improve the cutting time and the microscope's focus
    # valid options are ["none", "hilbert", "greedy"]
    path_optimization: "hilbert"

    # Parameter required for hilbert curve based path optimization.
    # Defines the order of the hilbert curve used, which needs to be tuned with the total cutting area.
    # For areas of 1 x 1 mm we recommend at least p = 4, for whole slides we recommend p = 7.
    hilbert_p: 7

    # Parameter required for greedy path optimization.
    # Instead of a global distance matrix, the k nearest neighbours are approximated.
    # The optimization problem is then greedily solved for the known set of nearest neighbours until the first set of neighbours is exhausted.
    # Established edges are then removed and the nearest neighbour approximation is recursively repeated.
    greedy_k: 20

    # The LMD reads coordinates as integers, which leads to rounding of decimal places.
    # Points spread between two whole coordinates are therefore collapsed to whole coordinates.
    # This can be mitigated by scaling the entire coordinate system by a defined factor.
    # For a resolution of 0.6 um / px a factor of 100 is recommended.
    xml_decimal_transform: 100

    # Overlapping shapes are merged based on a nearest neighbour heuristic.
    # All selected shapes closer than distance_heuristic pixel are checked for overlap.
    distance_heuristic: 300