# Data

## DataModule

The datamodule encapsulates all the data-related functionality. It defines both the pre-processing and the data-augmentation strategies, and it is ultimately responsible for defining the train/test/validation data loaders. It is a self-contained piece of code that ensures reproducibility of all the steps in the data manipulation process.

For most users it suffices to use the predefined class tissue_purifier.data.datamodule.AnndataFolderDM. This is the simplest way to create a datamodule starting from a folder containing anndata objects in .h5ad format. More advanced users can subclass either tissue_purifier.data.datamodule.SslDM or tissue_purifier.data.datamodule.SparseSslDM to have extra flexibility.

Our datamodules include the definition of the cropping strategy (both at train and test time) and the data-augmentation strategy. In the tissue_purifier.models.ssl_model.dino.DinoModel self-supervised learning framework, the model is trained using multiple global and local crops from each image. Accordingly, the datamodule defines different augmentations for global and local crops. Other models, such as tissue_purifier.models.ssl_model.vae.VaeModel, tissue_purifier.models.ssl_model.simclr.SimclrModel and tissue_purifier.models.ssl_model.barlow.BarlowModel, do not use local crops.

class SslDM(*args, **kwargs)[source]

Base class to inherit from to make a DataModule which can be used with any Self Supervised Learning framework

class SparseSslDM(global_size=96, local_size=64, n_local_crops=2, n_global_crops=2, global_scale=(0.8, 1.0), local_scale=(0.5, 0.8), global_intensity=(0.8, 1.2), n_element_min_for_crop=200, drop_spot_probs=(0.1, 0.2, 0.3), rasterize_sigmas=(1.0, 1.5), occlusion_fraction=(0.1, 0.3), drop_channel_prob=0.0, drop_channel_relative_freq=None, n_crops_for_tissue_test=50, n_crops_for_tissue_train=50, batch_size_per_gpu=64, **kargs)[source]

Datamodule for sparse images, with the parameters of the transforms (i.e. the data augmentation) specified. If you inherit from this class you only have to override: ‘prepare_data’, ‘setup’, ‘get_metadata_to_classify’ and ‘get_metadata_to_regress’.

Utility functions which add parameters to argparse to simplify setting up a CLI

Example

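>>> import sys
>>> import argparse
>>> # start from a plain parser; the datamodule utility then adds its own parameters to it
>>> parser = argparse.ArgumentParser()
>>> args = parser.parse_args(sys.argv[1:])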

Return type

ArgumentParser
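The pattern behind this utility can be sketched with the standard library alone. In the sketch below, `add_datamodule_args` is a hypothetical stand-in for the datamodule's own helper, whose name and signature are not shown in this section. The option names and defaults mirror a few arguments of the `SparseSslDM` constructor above.

```python
import argparse


def add_datamodule_args(parent_parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Hypothetical helper: extend a parent parser with a few datamodule options."""
    parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
    parser.add_argument("--global_size", type=int, default=96)
    parser.add_argument("--local_size", type=int, default=64)
    parser.add_argument("--n_local_crops", type=int, default=2)
    parser.add_argument("--global_scale", type=float, nargs=2, default=(0.8, 1.0))
    return parser


base_parser = argparse.ArgumentParser(add_help=False)
parser = add_datamodule_args(base_parser)
# In a script this would be parser.parse_args(sys.argv[1:]).
args = parser.parse_args(["--global_size", "128", "--global_scale", "0.7", "1.0"])
print(args.global_size, args.local_size, args.global_scale)
```

Keeping the options in a helper that receives and returns a parser lets a training script combine datamodule options with model options without either side knowing about the other.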

property cropper_test: tissue_purifier.data.dataset.CropperSparseTensor

Cropper to be used at test time. This specifies the cropping strategy to use at test time.

Return type

CropperSparseTensor

property cropper_train: tissue_purifier.data.dataset.CropperSparseTensor

Cropper to be used at train time. This specifies the cropping strategy to use at train time.

Return type

CropperSparseTensor

property global_size: int

Size in pixels of the global crops. This specifies the size of the patch processed by the ssl model.

Return type

int

property local_size: int

Size in pixels of the local crops (used only for Dino). This specifies the size of the patch processed by the ssl model.

Return type

int

property n_global_crops: int

Number of global crops for each image to use for training (used only for Dino).

Return type

int

property n_local_crops: int

Number of local crops for each image to use for training (used only for Dino).

Return type

int

property trsfm_test: tissue_purifier.data.transforms.TransformForList

Transformation to be applied at test time. This specifies the data augmentation at test time.

Return type

TransformForList

property trsfm_train_global: tissue_purifier.data.transforms.TransformForList

Global transformation to be applied at train time. This specifies the data augmentation for the global crops.

Return type

TransformForList

property trsfm_train_local: tissue_purifier.data.transforms.TransformForList

Local transformation to be applied at train time. This specifies the data augmentation for the local crops. Used by Dino only.

Return type

TransformForList

class AnndataFolderDM(data_folder, pixel_size, x_key, y_key, category_key, categories_to_channels, metadata_to_classify, metadata_to_regress, num_workers, gpus, n_neighbours_moran, **kargs)[source]

Create a Datamodule ready for Self-supervised learning starting from a folder full of anndata files in .h5ad format.

Utility functions which add parameters to argparse to simplify setting up a CLI

Example

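>>> import sys
>>> import argparse
>>> # start from a plain parser; the datamodule utility then adds its own parameters to it
>>> parser = argparse.ArgumentParser()
>>> args = parser.parse_args(sys.argv[1:])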

Return type

ArgumentParser

anndata_to_sparseimage(anndata)[source]

Converts an anndata object to a SparseImage.

property ch_in: int

How many channels will be present in the images returned by the train/test/val dataloaders?

Return type

int

Extract one or more quantities to classify from the metadata

Return type

Dict[str, int]

Extract one or more quantities to regress from the metadata

Return type

Dict[str, float]

## SparseImage

The SparseImage is the most important concept in the TissuePurifier library. It interoperates easily with Anndata, a data structure specifically designed for transcriptomic data. Unlike Anndata, which stores the data in the form of a pandas DataFrame, SparseImage stores the data in a sparse torch tensor for fast (GPU-enabled) processing.

SparseImage keeps information at three levels of description:

1. The spot-level description. This is similar to Anndata. Cell-level annotations are stored at this level.
2. The patch-level description. For example, when an image patch is processed by a self-supervised learning model, the resulting embedding (which describes properties of the entire patch) is stored at this level.
3. The image-level description, which contains image-level properties.

SparseImage provides built-in methods for transferring information between different levels of description. For example, a collection of patch-level properties can be glued together to obtain image-level properties (overlapping patches are handled), and image-level properties can be evaluated at discrete locations to obtain spot-level properties.

Finally, SparseImage provides two methods, tissue_purifier.data.sparse_image.SparseImage.compute_ncv() and tissue_purifier.data.sparse_image.SparseImage.compute_patch_features(), to easily extract information about the cellular micro-environment.

class SparseImage(spot_properties_dict, x_key, y_key, category_key, categories_to_codes, pixel_size, padding=10, patch_properties_dict=None, image_properties_dict=None, anndata=None)[source]

Sparse torch tensor containing the spatial data (for example spatial gene expression or spatial cell types).

It has 3 dictionaries with spot, patch and image properties.

__init__(spot_properties_dict, x_key, y_key, category_key, categories_to_codes, pixel_size, padding=10, patch_properties_dict=None, image_properties_dict=None, anndata=None)[source]

The user can initialize a SparseImage using this constructor or the from_anndata() classmethod.

Parameters
• spot_properties_dict (dict) – the dictionary with the spot properties (at the minimum x,y,category)

• x_key (str) – str, the key where the x_coordinates are stored in the spot_properties_dict

• y_key (str) – str, the key where the y_coordinates are stored in the spot_properties_dict

• category_key (str) – str, the key where the category are stored in the spot_properties_dict

• categories_to_codes (dict) – dictionary with the mapping from categories (keys) to codes (values). The codes must be integers starting from zero. For example {“macrophage” : 0, “t-cell”: 1}.

• pixel_size (float) – float, size of the pixel. It is used in the conversion between raw coordinates and pixel coordinates.

• padding (int) – int, padding of the image (must be >= 1)

• patch_properties_dict (Optional[dict]) – the dictionary with the patch properties. If None (defaults) an empty dict is generated.

• image_properties_dict (Optional[dict]) – the dictionary with the image properties. If None (defaults) an empty dict is generated.

• anndata (Optional[AnnData]) – the anndata object with other information (such as count_matrix etc)
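The requirement on categories_to_codes (integer codes starting from zero) can be illustrated with a short sketch. The helper names below are hypothetical and not part of the library. Whether the library tolerates gaps in the codes is not stated in this section, so the validity check is an assumption.

```python
from typing import Dict, Iterable


def build_categories_to_codes(categories: Iterable[str]) -> Dict[str, int]:
    """Map each distinct category to an integer code, starting from zero."""
    unique_sorted = sorted(set(categories))
    return {name: code for code, name in enumerate(unique_sorted)}


def codes_are_valid(mapping: Dict[str, int]) -> bool:
    """Check that the codes are integers covering 0..max with no gaps.

    Several categories may share one code (assumption of this sketch).
    """
    codes = set(mapping.values())
    return all(isinstance(c, int) for c in codes) and codes == set(range(len(codes)))


spot_categories = ["t-cell", "macrophage", "t-cell", "b-cell"]
categories_to_codes = build_categories_to_codes(spot_categories)
print(categories_to_codes)
```

Sorting the category names before assigning codes makes the mapping deterministic, which helps when the same mapping has to be reused across several images.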

property anndata: scanpy.AnnData

The anndata object used to create the image. Might be None.

Return type

AnnData

property cat_raw: numpy.ndarray

The categorical labels (gene-identities or cell-identities) from the original data

Return type

ndarray

property category_to_channels: dict

The mapping between categories and channels in the image.

Return type

dict

property channels_to_category: numpy.ndarray

The mapping between channels in the image and categories. Note that a channel can represent more than one category. For example CD+ and CD- cells can both be shown in channel 7. In that case channels_to_category[7] -> “CD+_CD-”.

Return type

ndarray
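As a rough illustration of how category_to_channels and channels_to_category relate, the sketch below inverts a category-to-channel dictionary and joins the names that share a channel. The function name is hypothetical. The join order and the plain-list return type (the property returns a numpy array) are assumptions of this sketch.

```python
from typing import Dict, List


def invert_mapping(category_to_channels: Dict[str, int]) -> List[str]:
    """Return a list whose i-th entry names the categories shown in channel i."""
    n_channels = max(category_to_channels.values()) + 1
    names: List[List[str]] = [[] for _ in range(n_channels)]
    for category, channel in category_to_channels.items():
        names[channel].append(category)
    # Categories that share a channel are joined with an underscore.
    return ["_".join(group) for group in names]


category_to_channels = {"macrophage": 0, "CD+": 1, "CD-": 1}
channels_to_category = invert_mapping(category_to_channels)
print(channels_to_category[1])
```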

clear_dicts(patch_dict=True, image_dict=True)[source]

Clear the patch_properties_dict and image_properties_dict in their entirety. Useful to restart the analysis from scratch. It will never modify the spot_properties_dict.

Parameters
• patch_dict (bool) – If True (defaults) the patch_properties_dictionary is cleared

• image_dict (bool) – If True (defaults) the image_properties_dictionary is cleared

compute_ncv(feature_name=None, k=None, r=None, overwrite=False)[source]

Compute the neighborhood composition vectors (ncv) of every spot and store the results in the spot_properties_dictionary under feature_name.

Parameters
• feature_name (Optional[str]) – the key under which the results will be stored.

• k (Optional[int]) – if specified the k nearest neighbours are used to compute the ncv.

• r (Optional[float]) – if specified the neighbours at a distance less than r (in raw units) are used to compute the ncv.

• overwrite (bool) – if the feature_name is already present in the spot_properties_dict, this variable controls whether to overwrite it.
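The idea of a neighborhood composition vector can be sketched in plain Python for the k-nearest-neighbour case. This is a brute-force illustration, not the library's implementation. The function name, the exclusion of the spot itself from its own neighbourhood, and the normalization to fractions are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def compute_ncv_knn(
    xy: Sequence[Tuple[float, float]],
    codes: Sequence[int],
    n_categories: int,
    k: int,
) -> List[List[float]]:
    """For every spot, return the fraction of each category among its k nearest spots."""
    ncv = []
    for i, (xi, yi) in enumerate(xy):
        # Distances from spot i to every other spot (the spot itself is excluded).
        dist = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(xy)
            if j != i
        ]
        nearest = [j for _, j in sorted(dist)[:k]]
        counts = [0.0] * n_categories
        for j in nearest:
            counts[codes[j]] += 1.0
        total = sum(counts) or 1.0  # avoid dividing by zero when there are no neighbours
        ncv.append([c / total for c in counts])
    return ncv


xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
codes = [0, 1, 1, 0]
ncv = compute_ncv_knn(xy, codes, n_categories=2, k=2)
print(ncv)
```

A radius-based variant (the `r` parameter) would keep every neighbour with distance below `r` instead of the first `k` entries of the sorted list.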

compute_patch_features(feature_name, datamodule, model, apply_transform=True, batch_size=64, n_patches_max=100, overwrite=False, return_crops=False)

Split the sparse image into (possibly overlapping) patches. Each patch is analyzed by the (pretrained) model. The features are stored in the patch_properties_dict under the feature_name.

Parameters
• feature_name (str) – the key under which the results will be stored.

• datamodule (AnndataFolderDM) – Datamodule used for training the model. Passing it here guarantees that the cropping strategy and the data augmentations are identical to the one used during training.

• model (Module) – the trained model will ingest the patch and produce the features.

• apply_transform (bool) – if True (default) the datamodule.trsfm_test will be applied to the crops before feeding them into the model. If False no transformation is applied and the sparse tensors are fed into the model.

• batch_size (int) – how many crops to process simultaneously (default = 64). Use to adjust the GPU memory footprint.

• n_patches_max (int) – maximum number of patches generated to analyze the current picture (default = 100)

• overwrite (bool) – if the feature_name is already present in the patch_properties_dict, this variable controls whether to overwrite it.

• return_crops (bool) – if True the method returns a (batched) torch.Tensor of shape $$(\text{n_patches_max}, c, w, h)$$ with all the crops which were fed to the model. Default is False.

Returns

patches – If return_crops is True returns tensor of shape $$(N, C, W, H)$$ with all the crops which were cropped and analyzed. Else returns None.

Return type

Optional[Tensor]

crops(crop_size, strategy='random', n_crops=10, n_element_min=0, stride=200, random_order=True)[source]

Wrapper around tissue_purifier.data.dataset.CropperSparseTensor.

Returns
• sp_img – list of crops represented as sparse torch Tensor

• x_list – list with the x coordinates of the bottom left corners of the crops

• y_list – list with the y coordinates of the bottom left corners of the crops

Return type

Tuple[List[Tensor], List[int], List[int]]

classmethod from_anndata(anndata, x_key, y_key, category_key, pixel_size=None, categories_to_channels=None, padding=10)[source]

Create a SparseImage object from an AnnData object.

Note

The minimal adata object must have a categorical variable (such as cell_type or gene_identities) and the spatial coordinates. Additional fields can be present.

Parameters
• anndata (AnnData) – the AnnData object with the spatial data

• x_key (str) – str, the key associated with the x_coordinate in the AnnData object

• y_key (str) – str, the key associated with the y_coordinate in the AnnData object

• category_key (str) – str, the key associated with the categorical values (cell_types or gene_identities)

• pixel_size (Optional[float]) – float, pixel_size used to convert from raw coordinates to pixel coordinates. If it is not specified it will be chosen to be 1/3 of the median of the nearest-neighbour distances between spots. Explicitly setting this attribute ensures that the pixel_size will be consistent across multiple images.

• categories_to_channels (Optional[dict]) – dictionary with the mapping from the names (of cell_types or genes) to channels. Explicitly setting this attribute ensures that the encoding between categories and channel codes will be consistent across multiple images. If not given, the mapping will be inferred from the anndata object.

• padding (int) – int, padding of the image so that the image has a bit of black around it

Returns

sp_img – A sparse image instance
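The default rule for pixel_size described above (one third of the median nearest-neighbour distance between spots) can be sketched as follows. This is a brute-force illustration with a hypothetical function name, not the library's code, and it needs at least two spots.

```python
import math
import statistics
from typing import Sequence, Tuple


def default_pixel_size(xy: Sequence[Tuple[float, float]]) -> float:
    """One third of the median nearest-neighbour distance between spots."""
    nn_distances = []
    for i, (xi, yi) in enumerate(xy):
        # Distance from spot i to its closest other spot.
        nn_distances.append(
            min(math.hypot(xi - xj, yi - yj) for j, (xj, yj) in enumerate(xy) if j != i)
        )
    return statistics.median(nn_distances) / 3.0


spots = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 3.0), (9.0, 9.0)]
pixel_size = default_pixel_size(spots)
print(pixel_size)
```

Because the value depends on the spot density of each image, two images processed with the default rule can end up with different pixel sizes, which is why passing pixel_size explicitly is recommended when images must be comparable.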

Examples

>>> # create an AnnData object and a sparse image from it
>>> # (cell_type and spot_xy_coordinates are arrays provided by the user)
>>> anndata = AnnData(obs={"cell_type": cell_type}, obsm={"spatial": spot_xy_coordinates})
>>> cell_types = ("ES", "Endothelial", "Leydig", "Macrophage", "Myoid", "RS", "SPC", "SPG", "Sertoli")
>>> categories_to_channels = dict(zip(cell_types, range(len(cell_types))))
>>> sparse_image = SparseImage.from_anndata(
...     anndata=anndata,
...     x_key="spatial",
...     y_key="spatial",
...     category_key="cell_type",
...     categories_to_channels=categories_to_channels)

Examples

>>> # create an AnnData object and a sparse image from it
>>> anndata = AnnData(obs={"gene": gene, "x": gene_location_x, "y": gene_location_y})
>>> sparse_image = SparseImage.from_anndata(
...     anndata=anndata,
...     x_key="x",
...     y_key="y",
...     category_key="gene",
...     pixel_size=6.5)

classmethod from_state_dict(state_dict)[source]

Create a sparse image from the state_dictionary which was obtained by the get_state_dict()

Returns

sp_img – A sparse_image instance

Example

>>> state_dict_v1 = sparse_image_old.get_state_dict(include_anndata=True)
>>> torch.save(state_dict_v1, "ckpt.pt")
>>> state_dict_v2 = torch.load("ckpt.pt")
>>> sparse_image_new = SparseImage.from_state_dict(state_dict_v2)

Return type

SparseImage

get_state_dict(include_anndata=True)[source]

Get a dictionary with the state of the system

Parameters

include_anndata (bool) – If True (default) the anndata is included

Returns

state_dict – A dictionary with the state

Return type

dict

Wrapper around tissue_purifier.models.patch_analyzer.SpatialAutocorrelation.

Returns

score – array of shape C with the Geary’s C score for each channel (i.e. how one channel is mixed with all the others)

Return type

Tensor

inspect()[source]

Describe the content of the spot, patch and image properties dictionaries

Wrapper around tissue_purifier.models.patch_analyzer.SpatialAutocorrelation.

Returns

score – array of shape C with the Moran’s I score for each channel (i.e. how one channel is mixed with all the others)

Return type

Tensor

pixel_to_raw(x_pixel, y_pixel)[source]

Utility to convert the pixel coordinates to the raw_coordinates. This is a simple scale and shift transformation.

Parameters
• x_pixel (Tensor) – tensor of arbitrary shape with the x_index of the pixels

• y_pixel (Tensor) – tensor of arbitrary shape with the y_index of the pixels

Returns
• x_raw – tensor with the x_coordinates in raw unit. It has the same shape as input.

• y_raw – tensor with the y_coordinates in raw unit. It has the same shape as input.

Return type

Tuple[Tensor, Tensor]

raw_to_pixel(x_raw, y_raw)[source]

Utility to convert the raw coordinates to pixel coordinates. This is a simple scale and shift transformation.

Parameters
• x_raw (Tensor) – tensor of arbitrary shape with the x_raw coordinates

• y_raw (Tensor) – tensor of arbitrary shape with the y_raw coordinates

Returns
• x_pixel – tensor with the x_coordinates in pixel unit. It has the same shape as input.

• y_pixel – tensor with the y_coordinates in pixel unit. It has the same shape as input.

Return type

Tuple[Tensor, Tensor]
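The "scale and shift" relation between raw and pixel coordinates can be illustrated with a small sketch. The exact offsets used by the library are not given in this section. The class below is hypothetical and assumes that the raw origin is mapped to the pixel just inside the padding.

```python
from typing import Tuple


class CoordinateMap:
    """Hypothetical scale-and-shift map between raw and pixel coordinates."""

    def __init__(self, origin: Tuple[float, float], pixel_size: float, padding: int = 10):
        self.origin = origin          # raw coordinates mapped to pixel (padding, padding); assumed
        self.pixel_size = pixel_size  # raw units per pixel
        self.padding = padding        # empty border around the data, in pixels

    def raw_to_pixel(self, x_raw: float, y_raw: float) -> Tuple[float, float]:
        x_pixel = (x_raw - self.origin[0]) / self.pixel_size + self.padding
        y_pixel = (y_raw - self.origin[1]) / self.pixel_size + self.padding
        return x_pixel, y_pixel

    def pixel_to_raw(self, x_pixel: float, y_pixel: float) -> Tuple[float, float]:
        x_raw = (x_pixel - self.padding) * self.pixel_size + self.origin[0]
        y_raw = (y_pixel - self.padding) * self.pixel_size + self.origin[1]
        return x_raw, y_raw


cmap = CoordinateMap(origin=(100.0, 200.0), pixel_size=6.5, padding=10)
x_pix, y_pix = cmap.raw_to_pixel(165.0, 265.0)
x_back, y_back = cmap.pixel_to_raw(x_pix, y_pix)
print(x_pix, y_pix, x_back, y_back)
```

The two methods are inverses of each other, so a round trip through pixel coordinates recovers the raw coordinates up to floating-point error.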

Helper function to read from the image dictionary.

Parameters

key (str) – the key corresponding to the information to read

Returns

values – array of shape $$(ch, w, h)$$ with the image-level information

Return type

Tensor

Helper function to read from the patch dictionary.

Parameters

key – the key corresponding to the information to read

Returns
• values – array of shape $$(n_\text{patches}, *, *, *)$$ with the patch-level information

• patches_xywh – array of shape $$(n_\text{patches}, 4)$$ with the coordinates of the patches

Helper function to read from the spot dictionary.

Parameters

key (str) – the key corresponding to the information to read

Returns

values – array of shape $$(n_\text{spot}, *)$$ with the spot-level information

Return type

Tensor

to(*args, **kwargs)[source]

Move the data to a device or cast it to a different type

Return type

SparseImage

to_anndata(export_full_state=False, verbose=False)[source]

Export the spot_properties (and optionally the entire state dict) to the anndata object.

Parameters
• export_full_state (bool) – if True (default is False) the entire state_dict is exported into the anndata.uns

• verbose (bool) – if True (default is False) prints some intermediate statements

Returns

AnnData – object containing the spot_properties_dict (and optionally the full state).

Note

This will make a copy of the anndata object that was used to create the sparse image (if any)

Examples

>>> adata = sparse_image.to_anndata()
>>> sparse_image_new = SparseImage.from_anndata(adata, x_key="x", y_key="y", category_key="cell_type")

to_dense()[source]

Create a dense torch tensor of shape $$(C, W, H)$$ where the number of channels is equal to the number of categories of the underlying spatial data.

Returns

dense_img – A dense representation of the sparse image

Note

This will convert the sparse array into a dense array and might lead to a very large memory footprint.

Note

It is useful for visualization of the data.

Return type

Tensor
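The memory note above follows from the shape of the dense result: every (channel, x, y) cell is materialized, whether or not it contains a spot. The sketch below shows the conversion on nested Python lists. It is only an illustration with a hypothetical function name and entry format, not the library's torch-based implementation.

```python
from typing import List, Sequence, Tuple


def to_dense(
    entries: Sequence[Tuple[int, int, int]],
    n_channels: int,
    width: int,
    height: int,
) -> List[List[List[float]]]:
    """Build a dense (C, W, H) nested list from (channel, x_pixel, y_pixel) entries."""
    dense = [[[0.0] * height for _ in range(width)] for _ in range(n_channels)]
    for channel, x, y in entries:
        dense[channel][x][y] += 1.0  # spots falling in the same pixel accumulate
    return dense


entries = [(0, 1, 2), (1, 0, 0), (0, 1, 2)]
dense_img = to_dense(entries, n_channels=2, width=3, height=4)
n_cells = len(dense_img) * len(dense_img[0]) * len(dense_img[0][0])
print(n_cells, dense_img[0][1][2])
```

Three sparse entries become 24 dense cells here. For a real image with many channels the ratio is far larger, which is why the dense form is mainly useful for visualization.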

to_rgb(spot_size=1.0, cmap=None, figsize=(8, 8), show_colorbar=True, contrast=1.0)[source]

Make a 3 channel RGB image.

Parameters
• spot_size – size of sigma of gaussian kernel for rendering the spots

• cmap – the colormap to use

• figsize – the size of the figure

• show_colorbar – If True show the colorbar

• contrast – change to increase/decrease the contrast in the figure. It does not affect the returned tensor. It changes only the way the figure is displayed.

Returns
• dense_img – A torch.Tensor of size $$(3, W, H)$$ with the rgb rendering of the image

• fig – matplotlib figure.

transfer_image_to_spot(keys_to_transfer, overwrite=False, verbose=False, strategy='bilinear')[source]

Evaluate the image_properties_dict at the spots location. Store the results in the spot_properties_dict under the same name.

Parameters
• keys_to_transfer (List[str]) – the keys of the quantity to transfer from image_properties_dict to spot_properties_dict.

• overwrite (bool) – bool, in case of collision between the keys this variable controls whether the value will be overwritten.

• verbose (bool) – bool, if true intermediate messages are displayed.

• strategy (str) – str, either ‘closest’ or ‘bilinear’ (default). This specifies the interpolation method.
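The two interpolation strategies can be sketched on a small 2D field stored as nested lists and indexed as image[x][y]. The function names are hypothetical, and the clamping at the image border is an assumption of this sketch rather than documented library behaviour.

```python
import math
from typing import List


def bilinear_sample(image: List[List[float]], x: float, y: float) -> float:
    """Interpolate a 2D field, indexed as image[x][y], at a fractional location."""
    w, h = len(image), len(image[0])
    # Clamp so that the four neighbouring pixels stay inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    at_y0 = (1 - tx) * image[x0][y0] + tx * image[x1][y0]
    at_y1 = (1 - tx) * image[x0][y1] + tx * image[x1][y1]
    return (1 - ty) * at_y0 + ty * at_y1


def closest_sample(image: List[List[float]], x: float, y: float) -> float:
    """Take the value of the pixel nearest to the fractional location."""
    w, h = len(image), len(image[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return image[xi][yi]


image = [[0.0, 1.0], [2.0, 3.0]]
center_value = bilinear_sample(image, 0.5, 0.5)
corner_value = bilinear_sample(image, 1.0, 0.0)
nearest_value = closest_sample(image, 0.9, 0.2)
print(center_value, corner_value, nearest_value)
```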

transfer_patch_to_image(keys_to_transfer, overwrite=False, verbose=False, strategy='average')[source]

Collect the properties computed separately for each patch and stored in patch_properties_dict to create an image property, which is stored in image_properties_dict under the same name.

Parameters
• keys_to_transfer (List[str]) – keys of the quantity to transfer from patch_properties_dict to image_properties_dict. The patch_quantity can be: a scalar, a vector, a scalar field or a vector field. This corresponds to patch_quantity having shapes: (N_patches), (N_patches, ch), (N_patches, w, h) or (N_patches, ch, w, h) respectively.

• overwrite (bool) – bool, in case of collision between keys this variable controls whether to overwrite the values in the image_properties_dict.

• strategy (str) – str, either ‘average’ (default) or ‘closest’. If ‘average’ the value of each pixel in the image is obtained by averaging the contribution of all patches which contain that pixel. If ‘closest’ each pixel takes the value from the patch whose center is closest to the pixel.

• verbose (bool) – bool, if true print intermediate messages
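The ‘average’ strategy for overlapping patches can be sketched for the simplest case of one scalar per patch. The function name is hypothetical. Reading patches_xywh as (x, y, width, height) with (x, y) a patch corner, and leaving uncovered pixels at zero, are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def average_patches_to_image(
    values: Sequence[float],
    patches_xywh: Sequence[Tuple[int, int, int, int]],
    width: int,
    height: int,
) -> List[List[float]]:
    """Spread one scalar per patch over the image, averaging where patches overlap."""
    total = [[0.0] * height for _ in range(width)]
    count = [[0] * height for _ in range(width)]
    for value, (x, y, w, h) in zip(values, patches_xywh):
        for i in range(x, min(x + w, width)):
            for j in range(y, min(y + h, height)):
                total[i][j] += value
                count[i][j] += 1
    # Pixels not covered by any patch are left at zero.
    return [
        [total[i][j] / count[i][j] if count[i][j] else 0.0 for j in range(height)]
        for i in range(width)
    ]


image = average_patches_to_image(
    values=[1.0, 3.0],
    patches_xywh=[(0, 0, 2, 2), (1, 1, 2, 2)],
    width=3,
    height=3,
)
print(image)
```

The pixel at (1, 1) lies in both patches and therefore receives the mean of their values, while pixels covered by a single patch keep that patch's value.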

transfer_patch_to_spot(keys_to_transfer, overwrite=False, verbose=False, strategy_patch_to_image='average', strategy_image_to_spot='bilinear')[source]

Utility function which sequentially transfers annotations from patch -> image -> spot

trim_image_dictionary(keys)[source]

Clear selective entries in the image_properties_dictionary.

Parameters

keys (List[str]) – the list of keys to remove from the image dictionary

trim_patch_dictionary(keys)[source]

Clear selective entries in the patch_properties_dictionary.

Parameters

keys (List[str]) – the list of keys to remove from the patch dictionary

trim_spot_dictionary(keys)[source]

Clear selective entries in the spot_properties_dictionary.

Parameters

keys (List[str]) – the list of keys to remove from the spot dictionary

write_to_image_dictionary(key, values, overwrite=False)[source]

Helper function to write information to the image dictionary.

Parameters
• key – the key corresponding to the information to write

• values (Union[Tensor, ndarray]) – array of shape $$(*, w, h)$$ with the image-level information

• overwrite (bool) – If True (default is False) overwrite the value if already present

write_to_patch_dictionary(key, values, patches_xywh, overwrite=False)[source]

Helper function to write info to the patch dictionary.

Parameters
• key (str) – the name under which to save the information

• values (Union[Tensor, ndarray]) – the patch-level information of shape $$(n_\text{patches}, *, *, *)$$

• patches_xywh (Union[Tensor, ndarray]) – the location from where the patches were taken of shape $$(n_\text{patches}, 4)$$

• overwrite (bool) – If True (default is False) overwrite the value if already present

write_to_spot_dictionary(key, values, overwrite=False)[source]

Helper function to write info to the spot dictionary.

Parameters
• key (str) – the key corresponding to the information to write

• values (Union[Tensor, ndarray]) – array of shape $$(n, *)$$ with the spot-level information

• overwrite (bool) – If True (default is False) overwrite the value if already present

property x_raw: numpy.ndarray

The x_coordinates (in raw units) of the spots used to create the sparse image

Return type

ndarray

property y_raw: numpy.ndarray

The y_coordinates (in raw units) of the spots used to create the sparse image

Return type

ndarray