Data

DataModule

The datamodule encapsulates all data-related functionality. It defines the pre-processing and data-augmentation strategies and is ultimately responsible for defining the train/test/validation data loaders. It is a self-contained piece of code that ensures reproducibility of every step of the data manipulation process.

For most users it suffices to use the predefined class tissue_purifier.data.datamodule.AnndataFolderDM. This is the simplest way to create a datamodule starting from a folder containing anndata objects in .h5ad format. More advanced users can subclass either tissue_purifier.data.datamodule.SslDM or tissue_purifier.data.datamodule.SparseSslDM to have extra flexibility.

Our datamodules include the definition of the cropping strategy (both at train and test time) and the data-augmentation strategy. In the tissue_purifier.models.ssl_model.dino.DinoModel self-supervised learning framework, the model is trained using multiple global and local crops from each image. Accordingly, the datamodule defines different augmentations for global and local crops. Other models, such as tissue_purifier.models.ssl_model.vae.VaeModel, tissue_purifier.models.ssl_model.simclr.SimclrModel and tissue_purifier.models.ssl_model.barlow.BarlowModel, do not use local crops.
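The multi-crop idea can be sketched in plain Python. This is a minimal illustration, not the library's cropper: the helper sample_crop_boxes is hypothetical, boxes are drawn uniformly at random, and the sizes and counts simply mirror the defaults shown in the SparseSslDM signature (global_size=96, local_size=64, two crops of each kind).

```python
import random


def sample_crop_boxes(width, height, crop_size, n_crops, rng):
    """Return n_crops square boxes (x, y, w, h) that lie fully inside the image."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must fit inside the image")
    boxes = []
    for _ in range(n_crops):
        # randint is inclusive on both ends, so the box never leaves the image.
        x = rng.randint(0, width - crop_size)
        y = rng.randint(0, height - crop_size)
        boxes.append((x, y, crop_size, crop_size))
    return boxes


rng = random.Random(0)  # fixed seed so the sketch is reproducible
global_boxes = sample_crop_boxes(500, 400, crop_size=96, n_crops=2, rng=rng)
local_boxes = sample_crop_boxes(500, 400, crop_size=64, n_crops=2, rng=rng)
```

In a Dino-style setup each kind of crop would then go through its own augmentation pipeline; models without local crops would only use the global boxes.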

class SslDM(*args, **kwargs)[source]¶: Base class to inherit from to build a DataModule that can be used with any self-supervised learning framework.

class SparseSslDM(global_size=96, local_size=64, n_local_crops=2, n_global_crops=2, global_scale=(0.8, 1.0), local_scale=(0.5, 0.8), global_intensity=(0.8, 1.2), n_element_min_for_crop=200, drop_spot_probs=(0.1, 0.2, 0.3), rasterize_sigmas=(1.0, 1.5), occlusion_fraction=(0.1, 0.3), drop_channel_prob=0.0, drop_channel_relative_freq=None, n_crops_for_tissue_test=50, n_crops_for_tissue_train=50, batch_size_per_gpu=64, **kargs)[source]¶

Bases: tissue_purifier.data.datamodule.SslDM

Datamodule for sparse images with the parameters for the transform (i.e. data augmentation) specified. If you inherit from this class, you only have to override ‘prepare_data’, ‘setup’, ‘get_metadata_to_classify’ and ‘get_metadata_to_regress’.

classmethod add_specific_args(parent_parser)[source]¶

Utility function that adds parameters to argparse to simplify setting up a CLI.

Example

>>> import sys
>>> import argparse
>>> parser = argparse.ArgumentParser(add_help=False, conflict_handler='resolve')
>>> parser = SslDM.add_specific_args(parser)
>>> args = parser.parse_args(sys.argv[1:])

Return type: ArgumentParser
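A classmethod of this kind could be written roughly as follows. This is a sketch under assumptions: ToyDM and its two flags are hypothetical stand-ins chosen for illustration and are not the library's implementation; only the argparse calls are real.

```python
import argparse


class ToyDM:
    """Hypothetical stand-in for a datamodule; not part of tissue_purifier."""

    @classmethod
    def add_specific_args(cls, parent_parser):
        # Inherit the arguments of the parent parser and add our own on top.
        parser = argparse.ArgumentParser(
            parents=[parent_parser], add_help=False, conflict_handler="resolve"
        )
        parser.add_argument("--global_size", type=int, default=96)
        parser.add_argument("--n_local_crops", type=int, default=2)
        return parser


base = argparse.ArgumentParser(add_help=False, conflict_handler="resolve")
base.add_argument("--batch_size_per_gpu", type=int, default=64)
parser = ToyDM.add_specific_args(base)
args = parser.parse_args(["--global_size", "128"])
```

Chaining parsers this way lets each class in the hierarchy contribute its own flags, while conflict_handler="resolve" lets a later definition replace an earlier one with the same name.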

property cropper_test: tissue_purifier.data.dataset.CropperSparseTensor¶

Cropper to be used at test time. This specifies the cropping strategy to use at test time.

Return type: CropperSparseTensor

property cropper_train: tissue_purifier.data.dataset.CropperSparseTensor¶

Cropper to be used at train time. This specifies the cropping strategy to use at train time.

Return type: CropperSparseTensor

property global_size: int¶

Size in pixels of the global crops. This specifies the size of the patch processed by the ssl model.

Return type: int

property local_size: int¶

Size in pixels of the local crops (used only for Dino). This specifies the size of the patch processed by the ssl model.

Return type: int

property n_global_crops: int¶

Number of global crops for each image to use for training (used only for Dino).

Return type: int

property n_local_crops: int¶

Number of local crops for each image to use for training (used only for Dino).

Return type: int

property trsfm_test: tissue_purifier.data.transforms.TransformForList¶

Transformation to be applied at test time. This specifies the data augmentation at test time.

Return type: TransformForList

property trsfm_train_global: tissue_purifier.data.transforms.TransformForList¶

Global transformation to be applied at train time. This specifies the data augmentation for the global crops.

Return type: TransformForList

property trsfm_train_local: tissue_purifier.data.transforms.TransformForList¶

Local transformation to be applied at train time. This specifies the data augmentation for the local crops. Used by Dino only.

Return type: TransformForList

class AnndataFolderDM(data_folder, pixel_size, x_key, y_key, category_key, categories_to_channels, metadata_to_classify, metadata_to_regress, num_workers, gpus, n_neighbours_moran, **kargs)[source]¶

Bases: tissue_purifier.data.datamodule.SparseSslDM

Create a Datamodule ready for Self-supervised learning starting from a folder full of anndata files in .h5ad format.

classmethod add_specific_args(parent_parser)[source]¶

Utility function that adds parameters to argparse to simplify setting up a CLI.

Example

>>> import sys
>>> import argparse
>>> parser = argparse.ArgumentParser(add_help=False, conflict_handler='resolve')
>>> parser = AnndataFolderDM.add_specific_args(parser)
>>> args = parser.parse_args(sys.argv[1:])

Return type: ArgumentParser

anndata_to_sparseimage(anndata)[source]¶: Converts an anndata object to a SparseImage.

property ch_in: int¶

How many channels will be present in the images returned by the train/test/val dataloaders?

Return type: int

get_metadata_to_classify(metadata)[source]¶

Extract one or more quantities to classify from the metadata

Return type: Dict[str, int]

get_metadata_to_regress(metadata)[source]¶

Extract one or more quantities to regress from the metadata

Return type: Dict[str, float]

SparseImage

The SparseImage is the most important concept in the TissuePurifier library. It interoperates easily with Anndata, a data structure specifically designed for transcriptomic data. Unlike Anndata, which stores the data in the form of a pandas DataFrame, SparseImage stores the data in a sparse torch tensor for fast (GPU-enabled) processing.

SparseImage keeps information at three levels of description:

1. The spot-level description. This is similar to Anndata. Cell-level annotations are stored at this level.
2. The patch-level description. For example, when an image patch is processed by a self-supervised learning model, the resulting embedding (which describes properties of the entire patch) is stored at this level.
3. The image-level description, which contains image-level properties.

SparseImage provides built-in methods for transferring information between different levels of description. For example, a collection of patch-level properties can be glued together to obtain image-level properties (overlapping patches are supported), and image-level properties can be evaluated at discrete locations to obtain spot-level properties.

Finally, SparseImage provides two methods, tissue_purifier.data.sparse_image.SparseImage.compute_ncv() and tissue_purifier.data.sparse_image.SparseImage.compute_patch_features(), to easily extract information about the cellular micro-environment.

class SparseImage(spot_properties_dict, x_key, y_key, category_key, categories_to_codes, pixel_size, padding=10, patch_properties_dict=None, image_properties_dict=None, anndata=None)[source]¶

Sparse torch tensor containing the spatial data (for example spatial gene expression or spatial cell types).

It has 3 dictionaries with spot, patch and image properties.

__init__(spot_properties_dict, x_key, y_key, category_key, categories_to_codes, pixel_size, padding=10, patch_properties_dict=None, image_properties_dict=None, anndata=None)[source]¶

The user can initialize a SparseImage using this constructor or the from_anndata() class method.

Parameters

spot_properties_dict (dict) – the dictionary with the spot properties (at the minimum x,y,category)
x_key (str) – str, the key where the x_coordinates are stored in the spot_properties_dict
y_key (str) – str, the key where the y_coordinates are stored in the spot_properties_dict
category_key (str) – str, the key where the category are stored in the spot_properties_dict
categories_to_codes (dict) – dictionary with the mapping from categories (keys) to codes (values). The codes must be integers starting from zero. For example {“macrophage” : 0, “t-cell”: 1}.
pixel_size (float) – float, size of the pixel. It is used in the conversion between raw coordinates and pixel coordinates.
padding (int) – int, padding of the image (must be >= 1)
patch_properties_dict (Optional[dict]) – the dictionary with the patch properties. If None (default) an empty dict is generated.
image_properties_dict (Optional[dict]) – the dictionary with the image properties. If None (default) an empty dict is generated.
anndata (Optional[AnnData]) – the anndata object with other information (such as count_matrix etc)

property anndata: scanpy.AnnData¶

The anndata object used to create the image. Might be None.

Return type: AnnData

property cat_raw: numpy.ndarray¶

The categorical labels (gene-identities or cell-identities) from the original data

Return type: ndarray

property category_to_channels: dict¶

The mapping between categories and channels in the image.

Return type: dict

property channels_to_category: numpy.ndarray¶

The mapping between channels in the image and categories. Note that a channel can represent more than one category. For example CD+ and CD- cells can both be shown in channel 7. In that case channels_to_category[7] -> “CD+_CD-”.

Return type: ndarray

clear_dicts(patch_dict=True, image_dict=True)[source]¶

Clear the patch_properties_dict and image_properties_dict in their entirety. Useful to restart the analysis from scratch. It will never modify the spot_properties_dict.

Parameters

patch_dict (bool) – If True (default) the patch_properties_dictionary is cleared
image_dict (bool) – If True (default) the image_properties_dictionary is cleared

compute_ncv(feature_name=None, k=None, r=None, overwrite=False)[source]¶

Compute the neighborhood composition vectors (ncv) of every spot and store the results in the spot_properties_dictionary under feature_name.

Parameters

feature_name (Optional[str]) – the key under which the results will be stored.
k (Optional[int]) – if specified the k nearest neighbours are used to compute the ncv.
r (Optional[float]) – if specified the neighbours at a distance less than r (in raw units) are used to compute the ncv.
overwrite (bool) – if the feature_name is already present in the spot_properties_dict, this variable controls whether to overwrite it.
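The k-nearest-neighbour variant of a neighborhood composition vector can be sketched in plain Python. This is an illustration under assumptions, not the library's code: the helper neighborhood_composition is hypothetical, the spot itself is excluded from its own neighbourhood, and the counts are normalized to fractions; the library may make different choices on both points.

```python
import math
from collections import Counter


def neighborhood_composition(xs, ys, codes, n_categories, k):
    """For each spot, the fraction of its k nearest neighbours in each category."""
    n = len(xs)
    ncv = []
    for i in range(n):
        # Distances from spot i to every other spot (brute force, O(n^2)).
        dists = [
            (math.hypot(xs[i] - xs[j], ys[i] - ys[j]), j)
            for j in range(n) if j != i
        ]
        nearest = [j for _, j in sorted(dists)[:k]]
        counts = Counter(codes[j] for j in nearest)
        ncv.append([counts.get(c, 0) / len(nearest) for c in range(n_categories)])
    return ncv


# Four spots on a line, with category codes 0 or 1.
xs = [0.0, 1.0, 2.0, 10.0]
ys = [0.0, 0.0, 0.0, 0.0]
codes = [0, 1, 1, 0]
ncv = neighborhood_composition(xs, ys, codes, n_categories=2, k=2)
```

A radius-based variant would replace the sorted-and-truncated list with a filter on the distance. For realistic numbers of spots a spatial index would be preferable to the brute-force loop.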

compute_patch_features(feature_name, datamodule, model, apply_transform=True, batch_size=64, n_patches_max=100, overwrite=False, return_crops=False)¶

Split the sparse image into (possibly overlapping) patches. Each patch is analyzed by the (pretrained) model. The features are stored in the patch_properties_dict under the feature_name.

Parameters

feature_name (str) – the key under which the results will be stored.
datamodule (AnndataFolderDM) – Datamodule used for training the model. Passing it here guarantees that the cropping strategy and the data augmentations are identical to the ones used during training.
model (Module) – the trained model will ingest the patch and produce the features.
apply_transform (bool) – if True (default) the test-time transform of the datamodule (datamodule.trsfm_test) will be applied to the crops before feeding them into the model. If False no transformation is applied and the sparse tensors are fed into the model.
batch_size (int) – how many crops to process simultaneously (default = 64). Use to adjust the GPU memory footprint.
n_patches_max (int) – maximum number of patches generated to analyze the current picture (default = 100)
overwrite (bool) – if the feature_name is already present in the patch_properties_dict, this variable controls whether to overwrite it.
return_crops (bool) – if True the method returns a (batched) torch.Tensor of shape \((\text{n_patches_max}, c, w, h)\) with all the crops which were fed to the model. Default is False.

Returns

patches – If return_crops is True, returns a tensor of shape \((N, C, W, H)\) with all the crops which were cropped and analyzed. Otherwise returns None.

Return type

Optional[Tensor]

crops(crop_size, strategy='random', n_crops=10, n_element_min=0, stride=200, random_order=True)[source]¶

Wrapper around tissue_purifier.data.dataset.CropperSparseTensor.

Returns

sp_img – list of crops represented as sparse torch Tensor
x_list – list with the x coordinates of the bottom left corners of the crops
y_list – list with the y coordinates of the bottom left corners of the crops

Return type

Tuple[List[Tensor], List[int], List[int]]

classmethod from_anndata(anndata, x_key, y_key, category_key, pixel_size=None, categories_to_channels=None, padding=10)[source]¶

Create a SparseImage object from an AnnData object.

Note

The minimal adata object must have a categorical variable (such as cell_type or gene_identities) and the spatial coordinates. Additional fields can be present.

Parameters

anndata (AnnData) – the AnnData object with the spatial data
x_key (str) – str, the key associated with the x_coordinate in the AnnData object
y_key (str) – str, the key associated with the y_coordinate in the AnnData object
category_key (str) – str, the key associated with the categorical values (cell_types or gene_identities)
pixel_size (Optional[float]) – float, pixel_size used to convert from raw coordinates to pixel coordinates. If it is not specified, it is set to 1/3 of the median of the nearest-neighbour distances between spots. Explicitly setting this attribute ensures that the pixel_size is consistent across multiple images.
categories_to_channels (Optional[dict]) – dictionary with the mapping from the names (of cell_types or genes) to channels. Explicitly setting this attribute ensures that the encoding between categories and channel codes is consistent across multiple images. If not given, the mapping is inferred from the anndata object.
padding (int) – int, padding of the image so that the image has a bit of black around it

Returns

sp_img – A sparse image instance

Examples

>>> # create an AnnData object with cell types and a sparse image from it
>>> anndata = AnnData(obs={"cell_type": cell_type}, obsm={"spatial": spot_xy_coordinates})
>>> cell_types = ("ES", "Endothelial", "Leydig", "Macrophage", "Myoid", "RS", "SPC", "SPG", "Sertoli")
>>> # the mapping must use the same name that is passed to from_anndata below
>>> categories_to_channels = dict(zip(cell_types, range(len(cell_types))))
>>> sparse_image = SparseImage.from_anndata(
...     anndata=anndata,
...     x_key="spatial",
...     y_key="spatial",
...     category_key="cell_type",
...     categories_to_channels=categories_to_channels)

Examples

>>> # create an AnnData object with gene locations and a sparse image from it
>>> anndata = AnnData(obs={"gene": gene, "x": gene_location_x, "y": gene_location_y})
>>> # x_key and y_key are the names of the columns in anndata.obs
>>> sparse_image = SparseImage.from_anndata(
...     anndata=anndata,
...     x_key="x",
...     y_key="y",
...     category_key="gene",
...     pixel_size=6.5,
...     padding=8)
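The default pixel_size rule described in the parameter list (one third of the median nearest-neighbour distance between spots) can be sketched as follows. This is an illustration only: the helper default_pixel_size is hypothetical and uses a brute-force search, and the library's own computation may differ in its details.

```python
import math
from statistics import median


def default_pixel_size(xs, ys):
    """One third of the median nearest-neighbour distance between spots."""
    nn_dists = []
    for i in range(len(xs)):
        # Distance from spot i to its closest other spot (brute force, O(n^2)).
        nn_dists.append(min(
            math.hypot(xs[i] - xs[j], ys[i] - ys[j])
            for j in range(len(xs)) if j != i
        ))
    return median(nn_dists) / 3.0


# Spots on a 3 x 3 grid with spacing 6: every nearest neighbour is 6 away.
xs = [6.0 * (i % 3) for i in range(9)]
ys = [6.0 * (i // 3) for i in range(9)]
pixel_size = default_pixel_size(xs, ys)
```

Because this value depends on the spot density of each image, passing pixel_size explicitly is the way to keep the scale identical across several images.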

classmethod from_state_dict(state_dict)[source]¶

Create a sparse image from the state dictionary obtained from get_state_dict().

Returns: sp_img – A sparse_image instance

Example

>>> state_dict_v1 = sparse_image_old.get_state_dict(include_anndata=True)
>>> torch.save(state_dict_v1, "ckpt.pt")
>>> state_dict_v2 = torch.load("ckpt.pt")
>>> sparse_image_new = SparseImage.from_state_dict(state_dict_v2)

Return type: SparseImage

get_state_dict(include_anndata=True)[source]¶

Get a dictionary with the state of the system

Parameters: include_anndata (bool) – If True (default) the anndata is included
Returns: state_dict – A dictionary with the state
Return type: dict

gready(n_neighbours=6, radius=None, neigh_correct=False)[source]¶

Wrapper around tissue_purifier.models.patch_analyzer.SpatialAutocorrelation.

Returns: score – array of shape C with the Geary’s C score for each channel (i.e. how one channel is mixed with all the others)
Return type: Tensor

inspect()[source]¶: Describe the content of the spot, patch and image properties dictionaries.

moran(n_neighbours=6, radius=None, neigh_correct=False)[source]¶

Wrapper around tissue_purifier.models.patch_analyzer.SpatialAutocorrelation.

Returns: score – array of shape C with the Moran’s I score for each channel (i.e. how one channel is mixed with all the others)
Return type: Tensor
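For reference, the textbook definition of Moran's I for a set of values and a neighbour weight matrix can be written in a few lines of plain Python. This is a sketch of the standard statistic, not of SpatialAutocorrelation itself: the helper morans_i and the hand-written weight matrix are assumptions, whereas the library builds its neighbourhoods from n_neighbours or radius and works per channel.

```python
def morans_i(values, weights):
    """Moran's I for values x_i and a weight matrix w_ij (w_ii is ignored)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w_total = sum(weights[i][j] for i, j in pairs)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    variance_sum = sum(d * d for d in dev)
    return (n / w_total) * cross / variance_sum


# Four spots on a line; each spot is a neighbour of the adjacent ones.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
clustered = morans_i([1.0, 1.0, 0.0, 0.0], chain)    # similar values side by side
alternating = morans_i([1.0, 0.0, 1.0, 0.0], chain)  # neighbours always differ
```

Positive values indicate that neighbouring spots tend to carry similar values, negative values that they tend to differ.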

pixel_to_raw(x_pixel, y_pixel)[source]¶

Utility to convert the pixel coordinates to the raw_coordinates. This is a simple scale and shift transformation.

Parameters

x_pixel (Tensor) – tensor of arbitrary shape with the x_index of the pixels
y_pixel (Tensor) – tensor of arbitrary shape with the y_index of the pixels

Returns

x_raw – tensor with the x_coordinates in raw unit. It has the same shape as input.
y_raw – tensor with the y_coordinates in raw unit. It has the same shape as input.

Return type

Tuple[Tensor, Tensor]

raw_to_pixel(x_raw, y_raw)[source]¶

Utility to convert the raw coordinates to pixel coordinates. This is a simple scale and shift transformation.

Parameters

x_raw (Tensor) – tensor of arbitrary shape with the x_raw coordinates
y_raw (Tensor) – tensor of arbitrary shape with the y_raw coordinates

Returns

x_pixel – tensor with the x_coordinates in pixel unit. It has the same shape as input.
y_pixel – tensor with the y_coordinates in pixel unit. It has the same shape as input.

Return type

Tuple[Tensor, Tensor]
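A scale-and-shift pair of this kind can be sketched as follows. The convention used here is an assumption made for illustration (the smallest raw coordinate is placed padding pixels away from the border) and the helper make_converters is hypothetical; the origin actually used by SparseImage is not stated in this page.

```python
def make_converters(x_raw_min, y_raw_min, pixel_size, padding):
    """Build a raw -> pixel map and its inverse under the assumed convention."""
    # Shift so that the smallest raw coordinate lands `padding` pixels from the border.
    x_origin = x_raw_min - padding * pixel_size
    y_origin = y_raw_min - padding * pixel_size

    def raw_to_pixel(x_raw, y_raw):
        return (x_raw - x_origin) / pixel_size, (y_raw - y_origin) / pixel_size

    def pixel_to_raw(x_pixel, y_pixel):
        return x_pixel * pixel_size + x_origin, y_pixel * pixel_size + y_origin

    return raw_to_pixel, pixel_to_raw


raw_to_pixel, pixel_to_raw = make_converters(100.0, 200.0, pixel_size=6.5, padding=10)
x_pix, y_pix = raw_to_pixel(165.0, 200.0)
x_back, y_back = pixel_to_raw(x_pix, y_pix)  # round trip back to raw units
```

Whatever the exact origin, the two maps are inverses of each other, so a round trip returns the starting coordinates up to floating-point error.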

read_from_image_dictionary(key)[source]¶

Helper function to read from the image dictionary.

Parameters: key (str) – the key corresponding to the information to read
Returns: values – array of shape \((ch, w, h)\) with the image-level information
Return type: Tensor

read_from_patch_dictionary(key)[source]¶

Helper function to read from the patch dictionary.

Parameters

key – the key corresponding to the information to read

Returns

values – array of shape \((n_\text{patches}, *, *, *)\) with the patch-level information
patches_xywh – array of shape \((n_\text{patches}, 4)\) with the coordinates of the patches

read_from_spot_dictionary(key)[source]¶

Helper function to read from the spot dictionary.

Parameters: key (str) – the key corresponding to the information to read
Returns: values – array of shape \((n_\text{spot}, *)\) with the spot-level information
Return type: Tensor

to(*args, **kwargs)[source]¶

Move the data to a device or cast it to a different type

Return type: SparseImage

to_anndata(export_full_state=False, verbose=False)[source]¶

Export the spot_properties (and optionally the entire state dict) to the anndata object.

Parameters

export_full_state (bool) – if True (default is False) the entire state_dict is exported into the anndata.uns
verbose (bool) – if True (default is False) prints some intermediate statements

Returns

AnnData – object containing the spot_properties_dict (and optionally the full state).

Note

This will make a copy of the anndata object that was used to create the sparse image (if any)

Examples

>>> adata = sparse_image.to_anndata()
>>> sparse_image_new = SparseImage.from_anndata(adata, x_key="x", y_key="y", category_key="cell_type")

to_dense()[source]¶

Create a dense torch tensor of shape \((C, W, H)\) where the number of channels is equal to the number of categories of the underlying spatial data.

Returns: dense_img – A dense representation of the sparse image

Note

This will convert the sparse array into a dense array and might lead to a very large memory footprint.

Note

It is useful for visualization of the data.

Return type: Tensor

to_rgb(spot_size=1.0, cmap=None, figsize=(8, 8), show_colorbar=True, contrast=1.0)[source]¶

Make a 3 channel RGB image.

Parameters

spot_size – standard deviation (sigma) of the Gaussian kernel used to render the spots
cmap – the colormap to use
figsize – the size of the figure
show_colorbar – If True show the colorbar
contrast – change to increase/decrease the contrast in the figure. It does not affect the returned tensor. It changes only the way the figure is displayed.

Returns

dense_img – A torch.Tensor of size \((3, W, H)\) with the rgb rendering of the image
fig – matplotlib figure.

transfer_image_to_spot(keys_to_transfer, overwrite=False, verbose=False, strategy='bilinear')[source]¶

Evaluate the image_properties_dict at the spot locations. Store the results in the spot_properties_dict under the same name.

Parameters

keys_to_transfer (List[str]) – the keys of the quantity to transfer from image_properties_dict to spot_properties_dict.
overwrite (bool) – bool, in case of collision between the keys this variable controls whether the value will be overwritten.
verbose (bool) – bool, if true intermediate messages are displayed.
strategy (str) – str, either ‘closest’ or ‘bilinear’ (default). This describes the interpolation method.
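The difference between the two interpolation strategies can be illustrated on a tiny scalar field. This is a plain-Python sketch, not the library's implementation: the helpers bilinear and closest are hypothetical, the field is indexed as field[ix][iy], and the query point is assumed to lie inside the field.

```python
import math


def bilinear(field, x, y):
    """Interpolate a 2D field (indexed field[ix][iy]) at a fractional location."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the upper neighbours so points on the last row/column stay in bounds.
    x1 = min(x0 + 1, len(field) - 1)
    y1 = min(y0 + 1, len(field[0]) - 1)
    tx, ty = x - x0, y - y0
    along_y0 = (1 - tx) * field[x0][y0] + tx * field[x1][y0]
    along_y1 = (1 - tx) * field[x0][y1] + tx * field[x1][y1]
    return (1 - ty) * along_y0 + ty * along_y1


def closest(field, x, y):
    """Take the value of the nearest pixel."""
    return field[int(round(x))][int(round(y))]


field = [[0.0, 10.0],
         [20.0, 30.0]]
center = bilinear(field, 0.5, 0.5)   # blends all four pixels equally
nearest = closest(field, 0.2, 0.9)   # picks the pixel at (0, 1)
```

Bilinear interpolation gives values that vary smoothly between pixels, while the closest-pixel rule preserves the original pixel values exactly.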

transfer_patch_to_image(keys_to_transfer, overwrite=False, verbose=False, strategy='average')[source]¶

Collect the properties computed separately for each patch and stored in patch_properties_dict to create an image property which will be stored in image_properties_dict under the same name.

Parameters

keys_to_transfer (List[str]) – keys of the quantity to transfer from patch_properties_dict to image_properties_dict. The patch_quantity can be: a scalar, a vector, a scalar field or a vector field. This corresponds to patch_quantity having shapes: (N_patches), (N_patches, ch), (N_patches, w, h) or (N_patches, ch, w, h) respectively.
overwrite (bool) – bool, in case of collision between keys this variable controls whether to overwrite the values in the image_properties_dict.
strategy (str) – str, either ‘average’ (default) or ‘closest’. If ‘average’ the value of each pixel in the image is obtained by averaging the contribution of all patches which contain that pixel. If ‘closest’ each pixel takes the value from the patch whose center is closest to the pixel.
verbose (bool) – bool, if true print intermediate messages
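The ‘average’ strategy for a scalar patch quantity can be sketched in plain Python. This is an illustration under assumptions: the helper patches_to_image is hypothetical, patches are given as (x, y, w, h) boxes in pixel units, and pixels not covered by any patch are left at zero, which may differ from the library's behaviour.

```python
def patches_to_image(values, patches_xywh, width, height):
    """Average scalar patch values over every pixel the patches cover."""
    total = [[0.0] * height for _ in range(width)]
    count = [[0] * height for _ in range(width)]
    for value, (x, y, w, h) in zip(values, patches_xywh):
        for ix in range(x, x + w):
            for iy in range(y, y + h):
                total[ix][iy] += value
                count[ix][iy] += 1
    # Pixels not covered by any patch are left at 0.0 in this sketch.
    return [
        [total[ix][iy] / count[ix][iy] if count[ix][iy] else 0.0
         for iy in range(height)]
        for ix in range(width)
    ]


# Two 3 x 3 patches that overlap in the column ix = 2.
image = patches_to_image([1.0, 3.0], [(0, 0, 3, 3), (2, 0, 3, 3)], width=5, height=3)
```

In the overlap region the image takes the mean of the two patch values, which is how overlapping patches produce a smooth image-level property.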

transfer_patch_to_spot(keys_to_transfer, overwrite=False, verbose=False, strategy_patch_to_image='average', strategy_image_to_spot='bilinear')[source]¶: Utility function which sequentially transfers annotations from patch -> image -> spot.

trim_image_dictionary(keys)[source]¶

Clear selective entries in the image_properties_dictionary.

Parameters: keys (List[str]) – the list of keys to remove from the image dictionary

trim_patch_dictionary(keys)[source]¶

Clear selective entries in the patch_properties_dictionary.

Parameters: keys (List[str]) – the list of keys to remove from the patch dictionary

trim_spot_dictionary(keys)[source]¶

Clear selective entries in the spot_properties_dictionary.

Parameters: keys (List[str]) – the list of keys to remove from the spot dictionary

write_to_image_dictionary(key, values, overwrite=False)[source]¶

Helper function to write information to the image dictionary.

Parameters

key – the key corresponding to the information to write
values (Union[Tensor, ndarray]) – array of shape \((*, w, h)\) with the image-level information
overwrite (bool) – If True (default is False) overwrite the value if already present

write_to_patch_dictionary(key, values, patches_xywh, overwrite=False)[source]¶

Helper function to write info to the patch dictionary.

Parameters

key (str) – the name under which to save the information
values (Union[Tensor, ndarray]) – the patch-level information of shape \((n_\text{patches}, *, *, *)\)
patches_xywh (Union[Tensor, ndarray]) – the location from where the patches were taken of shape \((n_\text{patches}, 4)\)
overwrite (bool) – If True (default is False) overwrite the value if already present

write_to_spot_dictionary(key, values, overwrite=False)[source]¶

Helper function to write info to the spot dictionary.

Parameters

key (str) – the key corresponding to the information to write
values (Union[Tensor, ndarray]) – array of shape \((n, *)\) with the spot-level information
overwrite (bool) – If True (default is False) overwrite the value if already present

property x_raw: numpy.ndarray¶

The x_coordinates (in raw units) of the spots used to create the sparse image

Return type: ndarray

property y_raw: numpy.ndarray¶

The y_coordinates (in raw units) of the spots used to create the sparse image

Return type: ndarray