ml

datasets

class sparcscore.ml.datasets.HDF5SingleCellDataset(*args: Any, **kwargs: Any)

Class for handling SPARCSpy single cell datasets stored in HDF5 files.

This class provides a convenient interface for SPARCSpy-formatted hdf5 files containing single cell datasets. It supports loading data from multiple hdf5 files within specified directories, applying transformations to the data, and returning the required information, such as label or id, along with the single cell data.

root_dir

Root directory where the hdf5 files are located.

Type: str

dir_labels

List of labels corresponding to the directories in dir_list.

Type: list of int

dir_list

List of path(s) where the hdf5 files are stored. Supports specifying a path to a specific hdf5 file or directory containing hdf5 files.

Type: list of str

transform

An optional user-defined function to apply transformations to the data. Default is None.

Type: callable, optional

max_level

Maximum levels of directory to search for hdf5 files. Default is 5.

Type: int, optional

return_id

Whether to return the index of the cell with the data. Default is False.

Type: bool, optional

return_fake_id

Whether to return a fake index (0) with the data. Default is False.

Type: bool, optional

select_channel

Specify a specific channel to select from the data. Default is None, which returns all channels.

Type: int, optional

add_hdf_to_index(current_label, path): Adds single cell data from the hdf5 file located at ‘path’ with the specified ‘current_label’ to the index.

scan_directory(path, current_label, levels_left): Scans directories for hdf5 files and adds their data to the index with the specified ‘current_label’.

stats(): Prints dataset statistics including total count and count per label.

len(): Returns the total number of single cells in the dataset.

getitem(idx): Returns the data, label, and optional id/fake_id of the single cell specified by the index ‘idx’.

Examples

>>> hdf5_data = HDF5SingleCellDataset(dir_list=['data1.hdf5', 'data2.hdf5'],
...                                   dir_labels=[0, 1],
...                                   root_dir='/path/to/data',
...                                   transform=None,
...                                   return_id=True)
>>> len(hdf5_data)
2000
>>> sample = hdf5_data[0]
>>> sample[0].shape
torch.Size([1, 128, 128])
>>> sample[1]
tensor(0)
>>> sample[2]
tensor(0)
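
The search bounded by max_level can be pictured as a depth-limited recursion over the directory tree. The sketch below is not the SPARCSpy implementation of scan_directory: `find_hdf5_files` is a hypothetical helper, and recognising files by the `.hdf5` or `.h5` suffix is an assumption made for illustration.

```python
from pathlib import Path


def find_hdf5_files(path, levels_left=5, suffixes=(".hdf5", ".h5")):
    """Hypothetical helper: collect HDF5 files below `path`.

    Descends at most `levels_left` directory levels. A path that points
    directly at a file is returned as-is if its suffix matches (assumed rule).
    """
    path = Path(path)
    if path.is_file():
        return [path] if path.suffix in suffixes else []
    if levels_left <= 0 or not path.is_dir():
        # Depth budget used up, or the path does not exist.
        return []
    found = []
    for entry in sorted(path.iterdir()):
        found.extend(find_hdf5_files(entry, levels_left - 1, suffixes))
    return found
```

With `levels_left=1` only files directly inside the given directory are found; each extra level allows one more layer of subdirectories.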

metrics

sparcscore.ml.metrics.precision(predictions, labels, pos_label=0)

Calculate precision for predicting class pos_label.

Parameters

predictions (torch.Tensor) – Model predictions.
labels (torch.Tensor) – Ground truth labels.
pos_label (int, optional, default = 0) – The positive label for which to calculate precision.

Returns

precision – Precision for predicting class pos_label.

Return type

float

sparcscore.ml.metrics.recall(predictions, labels, pos_label=0)

Calculate recall for predicting class pos_label.

Parameters

predictions (torch.Tensor) – Model predictions.
labels (torch.Tensor) – Ground truth labels.
pos_label (int, optional, default = 0) – The positive label for which to calculate recall.

Returns

recall – Recall for predicting class pos_label.

Return type

float
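
As a reminder of what the two metrics measure, the following sketch computes both from plain Python sequences. It is an illustration only, not the library code: `precision_recall` is a hypothetical name, the inputs are assumed to already hold class ids (tensors could be converted with `.tolist()`), and returning 0.0 for an empty denominator is an assumption.

```python
def precision_recall(predictions, labels, pos_label=0):
    """Return (precision, recall) for class `pos_label`.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == pos_label and y == pos_label)
    fp = sum(1 for p, y in pairs if p == pos_label and y != pos_label)
    fn = sum(1 for p, y in pairs if p != pos_label and y == pos_label)
    # Assumed convention: report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


predictions = [0, 0, 1, 0, 1, 1]
labels = [0, 1, 1, 0, 0, 1]
precision, recall = precision_recall(predictions, labels, pos_label=0)
```

In this example class 0 has two true positives, one false positive, and one false negative, so both values are 2/3.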

models

class sparcscore.ml.models.VGGBase(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

Base implementation of the VGG model architecture. Can be built with a varying number of convolutional layers and fully connected layers.

make_layers(cfg, in_channels, batch_norm=True)

Create sequential model layers for the CNN according to the configuration provided in cfg, with optional batch normalization.

Parameters

cfg (list) – A list of integers and “M” representing the specific VGG architecture.
in_channels (int) – Number of input channels for the first convolutional layer.
batch_norm (bool, optional, default=True) – Whether to include batch normalization layers.

Returns

A sequential model representing the VGG architecture.

Return type

nn.Sequential
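
To make the cfg format concrete, the sketch below expands such a list into a plan of layers described as plain tuples instead of torch modules. It assumes the conventional VGG reading (an integer is a 3x3 convolution with that many output channels, "M" is a 2x2 max pooling step); the actual kernel sizes used by VGGBase are not stated here, and `plan_layers` is a hypothetical name.

```python
def plan_layers(cfg, in_channels, batch_norm=True):
    """Expand a VGG-style cfg list into a list of layer descriptions.

    Each integer adds a convolution (plus optional batch norm) and a ReLU;
    each "M" adds a max pooling step. Assumed layout, for illustration only.
    """
    layers = []
    channels = in_channels
    for item in cfg:
        if item == "M":
            layers.append(("maxpool", 2))
        else:
            layers.append(("conv3x3", channels, item))
            if batch_norm:
                layers.append(("batchnorm", item))
            layers.append(("relu",))
            # The next convolution consumes this layer's output channels.
            channels = item
    return layers


plan = plan_layers([16, "M", 32, "M"], in_channels=5)
```

In a real model each tuple would correspond to a torch.nn module collected into an nn.Sequential.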

make_layers_MLP(cfg_MLP, cfg)

Create sequential model layers for the MLP according to the configuration provided in cfg_MLP.

Parameters

cfg (list) – A list of integers and “M” representing the specific VGG architecture of the CNN
cfg_MLP (list) – A list of integers and “M” representing the specific VGG architecture of the MLP

Returns

A sequential model representing the VGG architecture.

Return type

nn.Sequential

class sparcscore.ml.models.VGG1(*args: Any, **kwargs: Any)

Bases: sparcscore.ml.models.VGGBase

Instance of VGGBase with the model architecture 1.

class sparcscore.ml.models.VGG2(*args: Any, **kwargs: Any)

Bases: sparcscore.ml.models.VGGBase

Instance of VGGBase with the model architecture 2.

plmodels

class sparcscore.ml.plmodels.MultilabelSupervisedModel(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule

A PyTorch Lightning module for training a multi-label supervised model.

Parameters

type (str, optional, default = "VGG2") – Network architecture to use in the model. Architectures are defined in sparcscore.ml.models. Valid options: “VGG1”, “VGG2”, “VGG1_old”, “VGG2_old”.
kwargs (dict) – Additional parameters passed to the model.

network

The selected network architecture.

Type: torch.nn.Module

train_metrics

MetricCollection for evaluating model on training data.

Type: torchmetrics.MetricCollection

val_metrics

MetricCollection for evaluating model on validation data.

Type: torchmetrics.MetricCollection

test_metrics

MetricCollection for evaluating model on test data.

Type: torchmetrics.MetricCollection

forward(x): perform forward pass of model.

configure_optimizers(): Optimization function

on_train_epoch_end(): Callback function after each training epoch

on_validation_epoch_end(): Callback function after each validation epoch

confusion_plot(matrix): Generate confusion matrix plot

training_step(batch, batch_idx): Perform a single training step

validation_step(batch, batch_idx): Perform a single validation step

test_step(batch, batch_idx): Perform a single test step

test_epoch_end(outputs): Callback function after testing epochs

pretrained_models

Collection of functions to load pretrained models to use in the SPARCSpy environment.

sparcscore.ml.pretrained_models.autophagy_classifier1_0(device='cuda'): Load the binary autophagy classification model published as Model 1.0 in the original SPARCSpy publication.

sparcscore.ml.pretrained_models.autophagy_classifier2_0(device='cuda'): Load the binary autophagy classification model published as Model 2.0 in the original SPARCSpy publication.

sparcscore.ml.pretrained_models.autophagy_classifier2_1(device='cuda'): Load the binary autophagy classification model published as Model 2.1 in the original SPARCSpy publication.

transforms

class sparcscore.ml.transforms.RandomRotation(choices=4, include_zero=True): Randomly rotate input image in 90 degree steps.
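
A rotation in 90 degree steps can be sketched on a plain 2D list as shown here. The reading of the parameters is an assumption: the sketch draws a number of quarter-turns k below `choices` and skips k = 0 when `include_zero` is False. `rot90` and `random_rotation` are hypothetical helpers, not the transform itself.

```python
import random


def rot90(image):
    """Rotate a 2D list-of-rows image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def random_rotation(image, choices=4, include_zero=True, rng=random):
    """Apply k quarter-turns, with k drawn at random (assumed semantics)."""
    start = 0 if include_zero else 1
    k = rng.randrange(start, choices)
    for _ in range(k):
        image = rot90(image)
    return image
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.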

class sparcscore.ml.transforms.GaussianNoise(sigma=0.1, channels_to_exclude=[]): Add gaussian noise to the input image.
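
The effect of `channels_to_exclude` can be illustrated with a small stand-alone function. This is a sketch under assumptions: the image is a nested list indexed as [channel][row][column], the noise is zero-mean with standard deviation `sigma`, and `add_gaussian_noise` is a hypothetical name.

```python
import random


def add_gaussian_noise(image, sigma=0.1, channels_to_exclude=(), rng=random):
    """Return a noisy copy of `image`; excluded channels are copied unchanged."""
    noisy = []
    for c, channel in enumerate(image):
        if c in channels_to_exclude:
            # Channels such as binary masks are typically left untouched.
            noisy.append([row[:] for row in channel])
        else:
            noisy.append([[v + rng.gauss(0.0, sigma) for v in row] for row in channel])
    return noisy
```

Excluding mask channels keeps their values exactly as stored while the intensity channels are perturbed.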

class sparcscore.ml.transforms.GaussianBlur(kernel_size=[1, 1, 1, 1, 5, 5, 7, 9], sigma=(0.1, 2), channels=[]): Apply a gaussian blur to the input image.

class sparcscore.ml.transforms.ChannelReducer(channels=5): Reduce an imaging dataset to 5, 3, or 1 channel(s).

5: nuclei_mask, cell_mask, channel_nucleus, channel_cellmask, channel_of_interest
3: nuclei_mask, cell_mask, channel_of_interest
1: channel_of_interest
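
The channel listing for ChannelReducer implies a simple index selection. The sketch below assumes the five channels are stored in the listed order (indices 0 to 4); `reduce_channels` and the `KEEP` table are hypothetical names used only for illustration.

```python
# Assumed storage order of the five channels, taken from the listing.
CHANNEL_ORDER = [
    "nuclei_mask",
    "cell_mask",
    "channel_nucleus",
    "channel_cellmask",
    "channel_of_interest",
]

# Indices kept for each supported channel count.
KEEP = {5: [0, 1, 2, 3, 4], 3: [0, 1, 4], 1: [4]}


def reduce_channels(image, channels=5):
    """Keep only the channels belonging to the requested reduction."""
    if channels not in KEEP:
        raise ValueError("channels must be 5, 3 or 1")
    return [image[i] for i in KEEP[channels]]
```

For example, `reduce_channels(CHANNEL_ORDER, channels=3)` keeps the two masks and the channel of interest.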

class sparcscore.ml.transforms.ChannelSelector(channels=[0, 1, 2, 3, 4], num_channels=5): select the channel used for prediction.

utils

sparcscore.ml.utils.combine_datasets_balanced(list_of_datasets, class_labels, train_per_class, val_per_class, test_per_class, seed=None)

Combine multiple datasets to create a single balanced dataset with a specified number of samples per class for train, validation, and test set. A balanced dataset means that from each label source an equal number of data instances are used.

Parameters

list_of_datasets (list[torch.utils.data.Dataset]) – List of datasets to be combined.
class_labels (list[str|int]) – List of class labels present in the datasets.
train_per_class (int) – Number of samples per class in the train set.
val_per_class (int) – Number of samples per class in the validation set.
test_per_class (int) – Number of samples per class in the test set.
seed (None | int) – Seed for the random number generator. Defaults to None.

Returns

torch.utils.data.Dataset: Combined train dataset with balanced samples per class.
torch.utils.data.Dataset: Combined validation dataset with balanced samples per class.
torch.utils.data.Dataset: Combined test dataset with balanced samples per class.

Return type

torch.utils.data.Dataset

Raises

ValueError – If a dataset’s length is too small to be split according to the provided sizes.
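
The balancing idea can be sketched without torch: draw the same number of items from every source dataset, split them into train, validation, and test portions, and concatenate the portions. This is an illustration under assumptions, not the library code: `split_balanced` is a hypothetical name, each dataset is assumed to hold exactly one class, and plain lists stand in for torch datasets.

```python
import random


def split_balanced(list_of_datasets, train_per_class, val_per_class, test_per_class, seed=None):
    """Return (train, val, test) lists with equal counts from each dataset."""
    rng = random.Random(seed)
    needed = train_per_class + val_per_class + test_per_class
    train, val, test = [], [], []
    for dataset in list_of_datasets:
        if len(dataset) < needed:
            raise ValueError(
                f"dataset has {len(dataset)} items but {needed} are required"
            )
        # Sample without replacement so the three splits never overlap.
        indices = rng.sample(range(len(dataset)), needed)
        items = [dataset[i] for i in indices]
        train.extend(items[:train_per_class])
        val.extend(items[train_per_class:train_per_class + val_per_class])
        test.extend(items[train_per_class + val_per_class:])
    return train, val, test
```

Fixing `seed` makes the split reproducible across runs, which matters when comparing models trained on the same combined dataset.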