Featurization#

MLClusterClassifier#

class scportrait.pipeline.featurization.MLClusterClassifier(*args, **kwargs)#

Perform classification on scPortrait’s single-cell image datasets using a pretrained machine learning model.

Parameters:
  • config – Configuration for the extraction passed over from the pipeline.Project.

  • directory – Directory for the extraction log and results. Created if it does not exist yet.

  • debug – Flag used to output debug information and map images.

  • overwrite – Flag used to overwrite existing results.

process(dataset_paths: str | list[str], dataset_labels: int | list[int] = 0, size: int = 0, return_results: bool = False) None | list[DataFrame]#
Parameters:
  • dataset_paths – Path(s) to the single-cell dataset files on which inference should be performed. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • dataset_labels – Integer label(s) for the dataset(s) provided in dataset_paths.

  • size – Number of cells that should be selected for inference. The default, 0, selects all cells.

  • return_results – Boolean value indicating whether the classification results should be returned as a list of pandas DataFrames or written directly to disk.

Returns:

None unless return_results is True, in which case the results are returned as a list of pandas DataFrames. Otherwise, the results are written directly to file.

Important

If this class is used as part of a project processing workflow, the Project class will automatically provide the most recent extracted single-cell dataset. Therefore, only the second and third arguments need to be provided.
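The process signature accepts either a single path or a list of paths, and either a single integer label or a list of labels. The sketch below shows one way such inputs could be normalized into two matched lists before inference. The helper name `normalize_inputs`, the file names, and the rule that a single label is reused for every path are assumptions made for illustration; they are not taken from scPortrait's implementation.

```python
from __future__ import annotations


def normalize_inputs(
    dataset_paths: str | list[str],
    dataset_labels: int | list[int] = 0,
) -> tuple[list[str], list[int]]:
    """Return paths and labels as two lists of equal length (hypothetical helper)."""
    # A bare string is treated as a single dataset path.
    paths = [dataset_paths] if isinstance(dataset_paths, str) else list(dataset_paths)

    if isinstance(dataset_labels, int):
        # Assumption: a single label is applied to every dataset.
        labels = [dataset_labels] * len(paths)
    else:
        labels = list(dataset_labels)

    if len(paths) != len(labels):
        raise ValueError(
            f"got {len(paths)} dataset path(s) but {len(labels)} label(s)"
        )
    return paths, labels


# Hypothetical file names, used only to show the call.
paths, labels = normalize_inputs(["control.h5", "treated.h5"], [0, 1])
```

Failing early on a length mismatch keeps each dataset unambiguously tied to one label.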

Example

Note

The following parameters are required in the config file:

MLClusterClassifier:

    # channel number on which the classification should be performed
    channel_selection: 4

    # batch size for inference
    batch_size: 900

    # device on which the inference should be performed
    inference_device: "cpu"

    # number of workers for the dataloader
    dataloader_worker_number: 0 # needs to be 0 if using cpu

    # pretrained model to use for classification
    network: "autophagy_classifier"

    # label that should be applied to the results
    label: "Autophagy_15h_classifier2_1"

    # which output of the model should be returned
    encoders: ["forward"]
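The config comments state that dataloader_worker_number needs to be 0 when inference runs on the CPU. A minimal sketch of a pre-flight check for this section of the config follows, assuming the config has already been loaded into a plain dictionary. The function `check_classifier_config` is hypothetical and not part of scPortrait.

```python
# Keys listed as required for the MLClusterClassifier section.
REQUIRED_KEYS = {
    "channel_selection",
    "batch_size",
    "inference_device",
    "dataloader_worker_number",
    "network",
    "label",
    "encoders",
}


def check_classifier_config(config: dict) -> list:
    """Return a list of problems; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]

    device = config.get("inference_device")
    workers = config.get("dataloader_worker_number")
    # Constraint taken from the config comment: no dataloader workers on CPU.
    if device == "cpu" and workers not in (None, 0):
        problems.append(
            "dataloader_worker_number must be 0 when inference_device is 'cpu'"
        )
    return problems


example = {
    "channel_selection": 4,
    "batch_size": 900,
    "inference_device": "cpu",
    "dataloader_worker_number": 0,
    "network": "autophagy_classifier",
    "label": "Autophagy_15h_classifier2_1",
    "encoders": ["forward"],
}
problems = check_classifier_config(example)
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.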

EnsembleClassifier#

class scportrait.pipeline.featurization.EnsembleClassifier(*args, **kwargs)#

Perform classification on scPortrait’s single-cell image datasets using an ensemble of pretrained machine learning models.

Parameters:
  • config – Configuration for the extraction passed over from the pipeline.Project.

  • directory – Directory for the extraction log and results. Created if it does not exist yet.

  • debug – Flag used to output debug information and map images.

  • overwrite – Flag used to overwrite existing results.

process(dataset_paths: str, dataset_labels: int | list[int] = 0, size: int = 0, return_results: bool = False) None | dict#
Parameters:
  • dataset_paths – Path(s) to the single-cell dataset files on which inference should be performed. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • dataset_labels – Integer label(s) for the dataset(s) provided in dataset_paths.

  • size – Number of cells that should be selected for inference. The default, 0, selects all cells.

  • return_results – Boolean value indicating whether the classification results should be returned as a list of pandas DataFrames or written directly to disk.

Returns:

None unless return_results is True, in which case the results are returned as a list of pandas DataFrames. Otherwise, the results are written directly to file.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, no parameters need to be provided explicitly; the remaining arguments have default values.

Example

Note

The following parameters are required in the config file:

EnsembleClassifier:
    # channel number on which the classification should be performed
    channel_selection: 4

    # number of threads to use for dataloader
    dataloader_worker_number: 24

    # batch size to pass to GPU
    batch_size: 900

    # paths to the PyTorch checkpoints that should be used for inference
    networks:
        model1: "path/to/model1/"
        model2: "path/to/model2/"

    # input size that the models expect; provided images will be rescaled to this size
    input_image_px: 128

    # label under which the results will be saved
    classification_label: "Autophagy_15h_classifier1"

    # on which device inference should be performed
    # for speed should be "cuda"
    inference_device: "cuda"

CellFeaturizer#

class scportrait.pipeline.featurization.CellFeaturizer(*args, **kwargs)#

Class for extracting general image features from scPortrait’s single-cell image datasets. The extracted features are saved to a CSV file. The features are calculated on the basis of all channels.

The features which are calculated are:

  • Area of the masks in pixels

  • Mean intensity in the regions labelled by each of the masks

  • Median intensity in the regions labelled by each of the masks

  • 75% quantile in the regions labelled by each of the masks

  • 25% quantile in the regions labelled by each of the masks

  • Summed intensity in the regions labelled by each of the masks

  • Summed intensity in the regions labelled by each of the masks, normalized by area
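To make the listed features concrete, the following sketch computes them for one mask and one channel using only the Python standard library. It is an illustration under stated assumptions, not scPortrait's implementation: the function name, the dictionary keys, the linear-interpolation quantile method, and the reading of "normalized for area" as sum divided by mask area are all assumptions.

```python
from __future__ import annotations

from statistics import mean, median, quantiles


def mask_features(image: list[list[float]], mask: list[list[int]]) -> dict[str, float]:
    """Summarize the pixels of `image` that lie inside the nonzero region of `mask`."""
    values = [
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, inside in zip(image_row, mask_row)
        if inside
    ]
    area = len(values)  # mask area in pixels
    # method="inclusive" interpolates linearly between observed values;
    # the mask should cover at least two pixels.
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    total = sum(values)
    return {
        "area": area,
        "mean": mean(values),
        "median": median(values),
        "quant75": q75,
        "quant25": q25,
        "summed_intensity": total,
        # Assumption: normalization divides the summed intensity by the mask area.
        "summed_intensity_area_normalized": total / area,
    }


# Toy 3x3 channel and a mask covering four pixels (values 2, 3, 5, 6).
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
mask = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
features = mask_features(image, mask)
```

Under this reading, the area-normalized sum coincides with the mean inside the mask; the library's exact definition may differ.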

Parameters:
  • config – Configuration for the extraction passed over from the pipeline.Project.

  • directory – Directory for the extraction log and results. Created if it does not exist yet.

  • debug – Flag used to output debug information and map images.

  • overwrite – Flag used to overwrite existing results.

process(dataset_paths: str | list[str], dataset_labels: int | list[int] = 0, size: int = 0, return_results: bool = False) None | DataFrame#
Parameters:
  • dataset_paths – Paths to the single-cell dataset files on which inference should be performed. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • dataset_labels – Labels for the provided single-cell image datasets.

  • size – How many cells should be selected for inference. Default is 0, meaning all cells are selected.

  • return_results – If True, the results are returned as a pandas DataFrame. Otherwise the results are written out to file.

Returns:

None if return_results is False, otherwise a pandas DataFrame containing the results.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third arguments need to be provided. The Project class will automatically provide the most recent extraction results together with the supplied parameters.

Note

The following parameters are required in the config file:

CellFeaturizer:
    # Number of threads to use for dataloader
    dataloader_worker_number: 0 # needs to be 0 if using CPU

    # Batch size to pass to GPU
    batch_size: 900

    # On which device inference should be performed
    # For speed should be "cuda"
    inference_device: "cpu"

    # Label under which the results should be saved
    screen_label: "all_channels"

class scportrait.pipeline.featurization.CellFeaturizer_single_channel(*args, **kwargs)#

Class for extracting general image features from scPortrait’s single-cell image datasets. The extracted features are saved to a CSV file. The features are calculated on the basis of a single specified channel.

The features which are calculated are:

  • Area of the masks in pixels

  • Mean intensity of the chosen channel in the regions labelled by each of the masks

  • Median intensity of the chosen channel in the regions labelled by each of the masks

  • 75% quantile of the chosen channel in the regions labelled by each of the masks

  • 25% quantile of the chosen channel in the regions labelled by each of the masks

  • Summed intensity of the chosen channel in the regions labelled by each of the masks

  • Summed intensity of the chosen channel in the regions labelled by each of the masks, normalized by area

Parameters:
  • config – Configuration for the extraction passed over from the pipeline.Project.

  • directory – Directory for the extraction log and results. Created if it does not exist yet.

  • debug – Flag used to output debug information and map images.

  • overwrite – Flag used to overwrite existing results.

process(dataset_paths: str | list[str], dataset_labels: int | list[int] = 0, size=0, return_results: bool = False) None | DataFrame#
Parameters:
  • dataset_paths – Paths to the single-cell dataset files on which inference should be performed. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • dataset_labels – Labels for the provided single-cell image datasets.

  • size – How many cells should be selected for inference. Default is 0, meaning all cells are selected.

  • return_results – If True, the results are returned as a pandas DataFrame. Otherwise the results are written out to file.

Returns:

None if return_results is False, otherwise a pandas DataFrame containing the results.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third arguments need to be provided. The Project class will automatically provide the most recent extraction results together with the supplied parameters.

Note

The following parameters are required in the config file:

CellFeaturizer:
    # Channel number on which the featurization should be performed
    channel_selection: 4

    # Number of threads to use for dataloader
    dataloader_worker_number: 0 # needs to be 0 if using CPU

    # Batch size to pass to GPU
    batch_size: 900

    # On which device inference should be performed
    # For speed should be "cuda"
    inference_device: "cpu"

    # Label under which the results should be saved
    screen_label: "Ch3_Featurization"

ConvNeXtFeaturizer#

class scportrait.pipeline.featurization.ConvNeXtFeaturizer(*args, **kwargs)#
CLEAN_LOG = True#

Compute ConvNeXt features from scPortrait’s single-cell image datasets.

This class uses the pretrained ConvNeXt model available from the Hugging Face transformers library to extract features from single-cell image datasets. To use this class, you need to install the optional dependencies for the transformers library. You can do this with pip install "scportrait[convnext]".

This featurizer does not work with Python 3.12 or later, because the required version of the transformers library is not compatible with those Python versions.
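Because of this version constraint, it can be useful to check the interpreter before configuring a ConvNeXt featurization run. The helper below is a hypothetical guard, not something scPortrait provides; it only encodes the Python 3.12 cutoff stated above.

```python
import sys


def convnext_supported(version_info=None) -> bool:
    """Return True for interpreters older than Python 3.12 (hypothetical helper)."""
    if version_info is None:
        # Default to the interpreter that is running this code.
        version_info = sys.version_info
    return tuple(version_info[:2]) < (3, 12)


if not convnext_supported():
    print("ConvNeXtFeaturizer is documented as incompatible with Python 3.12 or later.")
```

Passing an explicit version tuple, for example `convnext_supported((3, 11))`, makes the check easy to exercise without switching interpreters.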

Parameters:
  • config – Configuration for the extraction passed over from the pipeline.Project.

  • directory – Directory for the extraction log and results. Created if it does not exist yet.

  • debug – Flag used to output debug information and map images.

  • overwrite – Flag used to overwrite existing results.

process(dataset_paths: str | list[str], dataset_labels: int | list[int] = 0, size: int = 0, return_results: bool = False) None | DataFrame#
Parameters:
  • dataset_paths – Path(s) to the single-cell dataset files on which inference should be performed. If this class is used as part of a project processing workflow this argument will be provided automatically.

  • dataset_labels – Integer label(s) for the dataset(s) provided in dataset_paths.

  • size – Number of cells that should be selected for inference. The default, 0, selects all cells.

  • return_results – Boolean value indicating whether the results should be returned as a pandas DataFrame or written directly to disk.

Returns:

None if return_results is False, otherwise a pandas DataFrame containing the results.

Important

If this class is used as part of a project processing workflow, the first argument will be provided by the Project class based on the previous single-cell extraction. Therefore, only the second and third arguments need to be provided. The Project class will automatically provide the most recent extracted single-cell dataset together with the supplied parameters.

Example

Note

The following parameters are required in the config file:

ConvNeXtFeaturizer:
    # number of cells in a minibatch
    batch_size: 900

    # number of threads to use for dataloader
    dataloader_worker_number: 10 #needs to be 0 if using cpu

    # what device should be used for inference
    inference_device: "auto"

    # how the results should be saved
    label: "ConvNeXtFeaturizer"

    # which channels to run inference on
    channel_selection: 4