Data
DataModule
The datamodule encapsulates all data-related functionality. It defines both the pre-processing and the data-augmentation strategies, and it is ultimately responsible for defining the train/test/validation data loaders. It is a self-contained piece of code that ensures reproducibility of every step of the data manipulation process.
For most users it suffices to use the predefined class tissue_purifier.data.datamodule.AnndataFolderDM. This is the simplest way to create a datamodule starting from a folder containing anndata objects in .h5ad format.
More advanced users can subclass either tissue_purifier.data.datamodule.SslDM or tissue_purifier.data.datamodule.SparseSslDM for extra flexibility.
Our datamodules define the cropping strategy (both at train and test time) and the data-augmentation strategy.
In the tissue_purifier.models.ssl_model.dino.DinoModel self-supervised learning framework, the model is trained using multiple global and local crops from each image. Accordingly, the datamodule defines different augmentations for global and local crops.
Other models, such as tissue_purifier.models.ssl_model.vae.VaeModel, tissue_purifier.models.ssl_model.simclr.SimclrModel and tissue_purifier.models.ssl_model.barlow.BarlowModel, do not use local crops.
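To make the global/local distinction concrete, here is a minimal sketch of how multi-crop windows could be sampled. It is not the library's cropper: the function `sample_crop_boxes`, the 200 x 200 image size, and the rule that the crop side is a uniform fraction of the shorter image edge are all assumptions made for illustration. Only the crop counts and scale ranges mirror the defaults shown in the SparseSslDM signature.

```python
import random


def sample_crop_boxes(width, height, n_crops, scale_range, rng):
    """Sample n_crops square windows (x, y, side) inside a width x height image.

    Assumption for this sketch: the side is a uniform random fraction
    (drawn from scale_range) of the shorter image edge.
    """
    boxes = []
    short_edge = min(width, height)
    for _ in range(n_crops):
        side = max(1, int(rng.uniform(*scale_range) * short_edge))
        # choose the corner so that the window stays inside the image
        x = rng.randint(0, width - side)
        y = rng.randint(0, height - side)
        boxes.append((x, y, side))
    return boxes


rng = random.Random(0)
# two large "global" views and two smaller "local" views of the same image
global_boxes = sample_crop_boxes(200, 200, n_crops=2, scale_range=(0.8, 1.0), rng=rng)
local_boxes = sample_crop_boxes(200, 200, n_crops=2, scale_range=(0.5, 0.8), rng=rng)
```

In a Dino-style setup each set of windows would then receive its own augmentation pipeline; models that do not use local crops would simply skip the second call.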
- class SslDM(*args, **kwargs)
Base class to inherit from to make a DataModule which can be used with any self-supervised learning framework.
- class SparseSslDM(global_size=96, local_size=64, n_local_crops=2, n_global_crops=2, global_scale=(0.8, 1.0), local_scale=(0.5, 0.8), global_intensity=(0.8, 1.2), n_element_min_for_crop=200, drop_spot_probs=(0.1, 0.2, 0.3), rasterize_sigmas=(1.0, 1.5), occlusion_fraction=(0.1, 0.3), drop_channel_prob=0.0, drop_channel_relative_freq=None, n_crops_for_tissue_test=50, n_crops_for_tissue_train=50, batch_size_per_gpu=64, **kargs)
Bases: tissue_purifier.data.datamodule.SslDM
Datamodule for sparse images with the parameters of the transform (i.e. data augmentation) specified. If you are inheriting from this class then you only have to override: ‘prepare_data’, ‘setup’, ‘get_metadata_to_classify’ and ‘get_metadata_to_regress’.
- classmethod add_specific_args(parent_parser)
Utility function which adds parameters to argparse to simplify setting up a CLI.
Example
>>> import sys
>>> import argparse
>>> parser = argparse.ArgumentParser(add_help=False, conflict_handler='resolve')
>>> parser = SslDM.add_specific_args(parser)
>>> args = parser.parse_args(sys.argv[1:])
- Return type
ArgumentParser
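The `add_specific_args` pattern can be illustrated with the standard library alone. The sketch below uses a hypothetical `ToyDM` class (not part of tissue_purifier) and registers only two of the SparseSslDM parameters. How the real datamodules build their parser is not shown in this page, so treat the body of the classmethod as an assumption.

```python
import argparse


class ToyDM:
    """Hypothetical stand-in for a datamodule; not part of tissue_purifier."""

    def __init__(self, global_size=96, batch_size_per_gpu=64):
        self.global_size = global_size
        self.batch_size_per_gpu = batch_size_per_gpu

    @classmethod
    def add_specific_args(cls, parent_parser):
        # Inherit whatever arguments are already registered on the parent parser
        parser = argparse.ArgumentParser(
            parents=[parent_parser], add_help=False, conflict_handler="resolve"
        )
        parser.add_argument("--global_size", type=int, default=96)
        parser.add_argument("--batch_size_per_gpu", type=int, default=64)
        return parser


base = argparse.ArgumentParser(add_help=False, conflict_handler="resolve")
parser = ToyDM.add_specific_args(base)
# an explicit list is parsed here; a real CLI would pass sys.argv[1:]
args = parser.parse_args(["--global_size", "128"])
dm = ToyDM(**vars(args))
```

Because every argument has a default, the parsed namespace can be unpacked straight into the constructor, which keeps the CLI and the class signature in sync.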
- property cropper_test: tissue_purifier.data.dataset.CropperSparseTensor
Cropper to be used at test time. It specifies the test-time cropping strategy.
- Return type
CropperSparseTensor
- property cropper_train: tissue_purifier.data.dataset.CropperSparseTensor
Cropper to be used at train time. It specifies the train-time cropping strategy.
- Return type
CropperSparseTensor
- property global_size: int
Size in pixels of the global crops. This specifies the size of the patch processed by the ssl model.
- Return type
int
- property local_size: int
Size in pixels of the local crops (used only for Dino). This specifies the size of the patch processed by the ssl model.
- Return type
int
- property n_global_crops: int
Number of global crops for each image to use for training (used only for Dino).
- Return type
int
- property n_local_crops: int
Number of local crops for each image to use for training (used only for Dino).
- Return type
int
- property trsfm_test: tissue_purifier.data.transforms.TransformForList
Transformation to be applied at test time. This specifies the data augmentation at test time.
- Return type
TransformForList
- property trsfm_train_global: tissue_purifier.data.transforms.TransformForList
Global transformation to be applied at train time. This specifies the data augmentation for the global crops.
- Return type
TransformForList
- property trsfm_train_local: tissue_purifier.data.transforms.TransformForList
Local transformation to be applied at train time. This specifies the data augmentation for the local crops. Used by Dino only.
- Return type
TransformForList
- class AnndataFolderDM(data_folder, pixel_size, x_key, y_key, category_key, categories_to_channels, metadata_to_classify, metadata_to_regress, num_workers, gpus, n_neighbours_moran, **kargs)
Bases: tissue_purifier.data.datamodule.SparseSslDM
Create a Datamodule ready for self-supervised learning starting from a folder full of anndata files in .h5ad format.
- classmethod add_specific_args(parent_parser)
Utility function which adds parameters to argparse to simplify setting up a CLI.
Example
>>> import sys
>>> import argparse
>>> parser = argparse.ArgumentParser(add_help=False, conflict_handler='resolve')
>>> parser = AnndataFolderDM.add_specific_args(parser)
>>> args = parser.parse_args(sys.argv[1:])
- Return type
ArgumentParser
- property ch_in: int
Number of channels in the images returned by the train/test/val dataloaders.
- Return type
int
SparseImage
The SparseImage is the most important concept in the TissuePurifier library. It has easy interoperability with Anndata, which is a data structure specifically designed for transcriptomic data. Contrary to Anndata, which stores the data in the form of a pandas DataFrame, SparseImage stores the data in a sparse torch tensor for fast (GPU-enabled) processing.
SparseImage keeps information at three levels of description:
1. the spot-level description. This is similar to Anndata. Cell-level annotations are stored at this level.
2. the patch-level description. For example, when an image patch is processed by a self-supervised learning model, the resulting embedding (which describes properties of the entire patch) is stored at this level of description.
3. the image-level description, which contains image-level properties.
SparseImage provides built-in methods for transferring information between different levels of description. For example, a collection of patch-level properties can be glued together to obtain image-level properties (note that we can deal with overlapping patches), and image-level properties can be evaluated at discrete locations to obtain spot-level properties.
Finally, SparseImage provides two methods, tissue_purifier.data.sparse_image.SparseImage.compute_ncv() and tissue_purifier.data.sparse_image.SparseImage.compute_patch_features(), to easily extract information about the cellular micro-environment.
- class SparseImage(spot_properties_dict, x_key, y_key, category_key, categories_to_codes, pixel_size, padding=10, patch_properties_dict=None, image_properties_dict=None, anndata=None)
Sparse torch tensor containing the spatial data (for example spatial gene expression or spatial cell types).
It has 3 dictionaries with spot, patch and image properties.
- __init__(spot_properties_dict, x_key, y_key, category_key, categories_to_codes, pixel_size, padding=10, patch_properties_dict=None, image_properties_dict=None, anndata=None)
The user can initialize a SparseImage using this constructor or from_anndata().
- Parameters
spot_properties_dict (dict) – the dictionary with the spot properties (at the minimum x, y, category)
x_key (str) – str, the key where the x_coordinates are stored in the spot_properties_dict
y_key (str) – str, the key where the y_coordinates are stored in the spot_properties_dict
category_key (str) – str, the key where the categories are stored in the spot_properties_dict
categories_to_codes (dict) – dictionary with the mapping from categories (keys) to codes (values). The codes must be integers starting from zero. For example {"macrophage": 0, "t-cell": 1}.
pixel_size (float) – float, size of the pixel. It is used in the conversion between raw coordinates and pixel coordinates.
padding (int) – int, padding of the image (must be >= 1)
patch_properties_dict (Optional[dict]) – the dictionary with the patch properties. If None (default) an empty dict is generated.
image_properties_dict (Optional[dict]) – the dictionary with the image properties. If None (default) an empty dict is generated.
anndata (Optional[AnnData]) – the anndata object with other information (such as count_matrix etc.)
- property anndata: scanpy.AnnData
The anndata object used to create the image. Might be None.
- Return type
AnnData
- property cat_raw: numpy.ndarray
The categorical labels (gene-identities or cell-identities) from the original data.
- Return type
ndarray
- property category_to_channels: dict
The mapping between categories and channels in the image.
- Return type
dict
- property channels_to_category: numpy.ndarray
The mapping between channels in the image and categories. Note that a channel can represent more than one category. For example CD+ and CD- cells can both be shown in channel 7. In that case channels_to_category[7] -> "CD+_CD-".
- Return type
ndarray
- clear_dicts(patch_dict=True, image_dict=True)
Clear the patch_properties_dict and image_properties_dict in their entirety. Useful to restart the analysis from scratch. It will never modify the spot_properties_dict.
- Parameters
patch_dict (bool) – If True (default) the patch_properties_dict is cleared
image_dict (bool) – If True (default) the image_properties_dict is cleared
- compute_ncv(feature_name=None, k=None, r=None, overwrite=False)
Compute the neighborhood composition vectors (ncv) of every spot and store the results in the spot_properties_dict under feature_name.
- Parameters
feature_name (Optional[str]) – the key under which the results will be stored.
k (Optional[int]) – if specified the k nearest neighbours are used to compute the ncv.
r (Optional[float]) – if specified the neighbours at a distance less than r (in raw units) are used to compute the ncv.
overwrite (bool) – if the feature_name is already present in the spot_properties_dict, this variable controls whether to overwrite it.
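As an illustration of what a neighborhood composition vector is, the following pure-Python sketch counts the categories among the k nearest neighbours of each spot. It is not the library's implementation: the function name `neighborhood_composition`, the exclusion of the spot itself, and the normalization to fractions are assumptions of this sketch. It also assumes k >= 1 and at least two spots.

```python
import math


def neighborhood_composition(xy, codes, n_categories, k):
    """Return one composition vector per spot, built from its k nearest neighbours."""
    ncv = []
    for i, p in enumerate(xy):
        # rank every other spot by Euclidean distance to spot i
        others = sorted(
            (j for j in range(len(xy)) if j != i),
            key=lambda j: math.dist(p, xy[j]),
        )
        counts = [0] * n_categories
        for j in others[:k]:
            counts[codes[j]] += 1
        total = sum(counts)
        # normalize the counts so that each vector sums to one
        ncv.append([c / total for c in counts])
    return ncv


xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
codes = [0, 1, 1, 0]  # integer category code of each spot
ncv = neighborhood_composition(xy, codes, n_categories=2, k=2)
```

A radius-based variant (the `r` argument above) would replace the `others[:k]` slice with a filter on the distance.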
- compute_patch_features(feature_name, datamodule, model, apply_transform=True, batch_size=64, n_patches_max=100, overwrite=False, return_crops=False)
Split the sparse image into (possibly overlapping) patches. Each patch is analyzed by the (pretrained) model. The features are stored in the patch_properties_dict under feature_name.
- Parameters
feature_name (str) – the key under which the results will be stored.
datamodule (AnndataFolderDM) – Datamodule used for training the model. Passing it here guarantees that the cropping strategy and the data augmentations are identical to the ones used during training.
model (Module) – the trained model which will ingest the patches and produce the features.
apply_transform (bool) – if True (default) the datamodule.trsfm_test will be applied to the crops before feeding them into the model. If False no transformation is applied and the sparse tensors are fed into the model.
batch_size (int) – how many crops to process simultaneously (default = 64). Use it to adjust the GPU memory footprint.
n_patches_max (int) – maximum number of patches generated to analyze the current picture (default = 100)
overwrite (bool) – if feature_name is already present in the patch_properties_dict, this variable controls whether to overwrite it.
return_crops (bool) – if True the method returns a (batched) torch.Tensor of shape \((\text{n_patches_max}, c, w, h)\) with all the crops which were fed to the model. Default is False.
- Returns
patches – If return_crops is True, returns a tensor of shape \((N, C, W, H)\) with all the crops which were cropped and analyzed. Else returns None.
- Return type
Optional[Tensor]
- crops(crop_size, strategy='random', n_crops=10, n_element_min=0, stride=200, random_order=True)
Wrapper around tissue_purifier.data.dataset.CropperSparseTensor.
- Returns
sp_img – list of crops represented as sparse torch Tensors
x_list – list with the x coordinates of the bottom left corners of the crops
y_list – list with the y coordinates of the bottom left corners of the crops
- Return type
Tuple[List[Tensor], List[int], List[int]]
- classmethod from_anndata(anndata, x_key, y_key, category_key, pixel_size=None, categories_to_channels=None, padding=10)
Create a SparseImage object from an AnnData object.
Note
The minimal adata object must have a categorical variable (such as cell_type or gene_identities) and the spatial coordinates. Additional fields can be present.
- Parameters
anndata (AnnData) – the AnnData object with the spatial data
x_key (str) – str, the key associated with the x_coordinate in the AnnData object
y_key (str) – str, the key associated with the y_coordinate in the AnnData object
category_key (str) – str, the key associated with the categorical values (cell_types or gene_identities)
pixel_size (Optional[float]) – float, pixel_size used to convert from raw coordinates to pixel coordinates. If it is not specified it will be chosen to be 1/3 of the median of the nearest-neighbour distances between spots. Explicitly setting this attribute ensures that the pixel_size will be consistent across multiple images.
categories_to_channels (Optional[dict]) – dictionary with the mapping from the names (of cell_types or genes) to channels. Explicitly setting this attribute ensures that the encoding between categories and channel codes will be consistent across multiple images. If not given, the mapping will be inferred from the anndata object.
padding (int) – int, padding of the image so that the image has a bit of black around it
- Returns
sp_img – A sparse image instance
Examples
>>> # create an AnnData object and a sparse image from it
>>> anndata = AnnData(obs={"cell_type": cell_type}, obsm={"spatial": spot_xy_coordinates})
>>> cell_types = ("ES", "Endothelial", "Leydig", "Macrophage", "Myoid", "RS", "SPC", "SPG", "Sertoli")
>>> categories_to_channels = dict(zip(cell_types, range(len(cell_types))))
>>> sparse_image = SparseImage.from_anndata(
...     anndata=anndata,
...     x_key="spatial",
...     y_key="spatial",
...     category_key="cell_type",
...     categories_to_channels=categories_to_channels)
- Examples:
>>> # create an AnnData object and a sparse image from it
>>> anndata = AnnData(obs={"gene": gene, "x": gene_location_x, "y": gene_location_y})
>>> sparse_image = SparseImage.from_anndata(
...     anndata=anndata,
...     x_key="x",
...     y_key="y",
...     category_key="gene",
...     pixel_size=6.5,
...     padding=8)
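The default pixel_size rule described above (one third of the median nearest-neighbour distance between spots) can be sketched in a few lines. This is an illustration only: the helper name `default_pixel_size` is hypothetical, and the brute-force O(n^2) neighbour search is chosen for readability, not because the library necessarily works this way.

```python
import math
import statistics


def default_pixel_size(xy):
    """One third of the median nearest-neighbour distance between spots."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(xy) if j != i)
        for i, p in enumerate(xy)
    ]
    return statistics.median(nearest) / 3.0


# every spot has its nearest neighbour at distance 3 (raw units)
spots = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0), (6.0, 3.0)]
pixel_size = default_pixel_size(spots)
```

With this rule neighbouring spots end up roughly three pixels apart. Since the value depends on the spots of each image, passing an explicit pixel_size is what keeps several images on the same scale.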
- classmethod from_state_dict(state_dict)
Create a sparse image from the state dictionary which was obtained from get_state_dict().
- Returns
sp_img – A sparse_image instance
Example
>>> state_dict_v1 = sparse_image_old.get_state_dict(include_anndata=True)
>>> torch.save(state_dict_v1, "ckpt.pt")
>>> state_dict_v2 = torch.load("ckpt.pt")
>>> sparse_image_new = SparseImage.from_state_dict(state_dict_v2)
- Return type
SparseImage
- get_state_dict(include_anndata=True)
Get a dictionary with the state of the system.
- Parameters
include_anndata (bool) – If True (default) the anndata is included
- Returns
state_dict – A dictionary with the state
- Return type
dict
- gready(n_neighbours=6, radius=None, neigh_correct=False)
Wrapper around tissue_purifier.models.patch_analyzer.SpatialAutocorrelation.
- Returns
score – array of shape C with the Gready’s score for each channel (i.e. how one channel is mixed with all the others)
- Return type
Tensor
- moran(n_neighbours=6, radius=None, neigh_correct=False)
Wrapper around tissue_purifier.models.patch_analyzer.SpatialAutocorrelation.
- Returns
score – array of shape C with the Moran’s I score for each channel (i.e. how one channel is mixed with all the others)
- Return type
Tensor
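For readers unfamiliar with Moran's I, the sketch below evaluates the textbook univariate statistic, I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, on a toy chain of four sites. It is meant only to convey the meaning of the score. The per-channel computation performed by SpatialAutocorrelation (neighbour selection, `neigh_correct`, and so on) is not documented here and may differ, and the function name `morans_i` is hypothetical.

```python
def morans_i(values, weights):
    """Textbook Moran's I for scalar values and a square weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    # cross-products of deviations between neighbouring sites
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * num / den


# four sites on a line; each site is a neighbour of the adjacent ones
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 0, 0], chain)    # like values sit next to each other
alternating = morans_i([1, 0, 1, 0], chain)  # like values avoid each other
```

Positive values indicate that similar values cluster in space, while negative values indicate that they are interleaved.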
- pixel_to_raw(x_pixel, y_pixel)
Utility to convert pixel coordinates to raw coordinates. This is a simple scale and shift transformation.
- Parameters
x_pixel (Tensor) – tensor of arbitrary shape with the x_index of the pixels
y_pixel (Tensor) – tensor of arbitrary shape with the y_index of the pixels
- Returns
x_raw – tensor with the x_coordinates in raw units. It has the same shape as the input.
y_raw – tensor with the y_coordinates in raw units. It has the same shape as the input.
- Return type
Tuple[Tensor, Tensor]
- raw_to_pixel(x_raw, y_raw)
Utility to convert raw coordinates to pixel coordinates. This is a simple scale and shift transformation.
- Parameters
x_raw (Tensor) – tensor of arbitrary shape with the x_raw coordinates
y_raw (Tensor) – tensor of arbitrary shape with the y_raw coordinates
- Returns
x_pixel – tensor with the x_coordinates in pixel units. It has the same shape as the input.
y_pixel – tensor with the y_coordinates in pixel units. It has the same shape as the input.
- Return type
Tuple[Tensor, Tensor]
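A scale-and-shift conversion of this kind can be sketched for a single coordinate as follows. The exact shift used by SparseImage is not stated in this page, so the form below (subtract an `origin`, divide by `pixel_size`, add the `padding`) is an assumption, and the `toy_` function names are hypothetical.

```python
def toy_raw_to_pixel(x_raw, origin, pixel_size, padding):
    """Scale and shift a raw coordinate into (fractional) pixel units."""
    return (x_raw - origin) / pixel_size + padding


def toy_pixel_to_raw(x_pixel, origin, pixel_size, padding):
    """Inverse transformation: pixel units back to raw units."""
    return (x_pixel - padding) * pixel_size + origin


# assumed setup: smallest raw coordinate is 100.0, pixels are 6.5 raw units wide
x_pixel = toy_raw_to_pixel(113.0, origin=100.0, pixel_size=6.5, padding=10)
x_back = toy_pixel_to_raw(x_pixel, origin=100.0, pixel_size=6.5, padding=10)
```

The two functions are exact inverses of each other, which is the property that lets annotations move between raw and pixel space without drift.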
- read_from_image_dictionary(key)
Helper function to read from the image dictionary.
- Parameters
key (str) – the key corresponding to the information to read
- Returns
values – array of shape \((ch, w, h)\) with the image-level information
- Return type
Tensor
- read_from_patch_dictionary(key)
Helper function to read from the patch dictionary.
- Parameters
key – the key corresponding to the information to read
- Returns
values – array of shape \((n_\text{patches}, *, *, *)\) with the patch-level information
patches_xywh – array of shape \((n_\text{patches}, 4)\) with the coordinates of the patches
- read_from_spot_dictionary(key)
Helper function to read from the spot dictionary.
- Parameters
key (str) – the key corresponding to the information to read
- Returns
values – array of shape \((n_\text{spot}, *)\) with the spot-level information
- Return type
Tensor
- to_anndata(export_full_state=False, verbose=False)
Export the spot_properties (and optionally the entire state dict) to the anndata object.
- Parameters
export_full_state (bool) – if True (default is False) the entire state_dict is exported into anndata.uns
verbose (bool) – if True (default is False) prints some intermediate statements
- Returns
AnnData – object containing the spot_properties_dict (and optionally the full state).
Note
This will make a copy of the anndata object that was used to create the sparse image (if any).
Examples
>>> adata = sparse_image.to_anndata()
>>> sparse_image_new = SparseImage.from_anndata(adata, x_key="x", y_key="y", category_key="cell_type")
- to_dense()
Create a dense torch tensor of shape \((C, W, H)\) where the number of channels is equal to the number of categories of the underlying spatial data.
- Returns
dense_img – A dense representation of the sparse image
Note
This will convert the sparse array into a dense array and might lead to a very large memory footprint.
Note
It is useful for visualization of the data.
- Return type
Tensor
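The memory warning is easy to see in a toy version of the densification step. The sketch below stores a sparse image as a dictionary from (channel, x, y) to a value and expands it into nested lists of shape (C, W, H). The dictionary layout and the function name `toy_to_dense` are assumptions for illustration; the library itself works with sparse torch tensors.

```python
def toy_to_dense(entries, n_channels, width, height):
    """Expand {(channel, x, y): value} into nested lists of shape (C, W, H)."""
    dense = [[[0.0] * height for _ in range(width)] for _ in range(n_channels)]
    for (c, x, y), value in entries.items():
        dense[c][x][y] += value
    return dense


# three spots in a 2-channel, 4 x 3 image
entries = {(0, 1, 2): 1.0, (1, 0, 0): 1.0, (1, 3, 1): 2.0}
dense = toy_to_dense(entries, n_channels=2, width=4, height=3)
n_stored_sparse = len(entries)
n_stored_dense = sum(len(col) for channel in dense for col in channel)
```

The sparse form stores one entry per spot, whereas the dense form always stores C * W * H numbers, most of which are zero for typical spatial data.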
- to_rgb(spot_size=1.0, cmap=None, figsize=(8, 8), show_colorbar=True, contrast=1.0)
Make a 3 channel RGB image.
- Parameters
spot_size – sigma of the Gaussian kernel used for rendering the spots
cmap – the colormap to use
figsize – the size of the figure
show_colorbar – If True show the colorbar
contrast – change to increase/decrease the contrast in the figure. It does not affect the returned tensor. It changes only the way the figure is displayed.
- Returns
dense_img – A torch.Tensor of size \((3, W, H)\) with the rgb rendering of the image
fig – matplotlib figure.
- transfer_image_to_spot(keys_to_transfer, overwrite=False, verbose=False, strategy='bilinear')
Evaluate the image_properties_dict at the spot locations. Store the results in the spot_properties_dict under the same name.
- Parameters
keys_to_transfer (List[str]) – the keys of the quantities to transfer from image_properties_dict to spot_properties_dict.
overwrite (bool) – bool, in case of collision between the keys this variable controls whether the value will be overwritten.
verbose (bool) – bool, if True intermediate messages are displayed.
strategy (str) – str, either ‘closest’ or ‘bilinear’ (default). This describes the interpolation method.
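The two interpolation strategies can be illustrated on a tiny single-channel image stored as nested lists indexed as image[x][y]. This is a generic sketch of bilinear and closest-pixel lookup, not the library's code. The function names are hypothetical and the spot coordinates are assumed to lie inside the image, i.e. 0 <= x <= W - 1 and 0 <= y <= H - 1.

```python
import math


def bilinear(image, x, y):
    """Interpolate image[ix][iy] at a fractional pixel location (x, y)."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # clamp the upper neighbours so that locations on the border stay valid
    x1 = min(x0 + 1, len(image) - 1)
    y1 = min(y0 + 1, len(image[0]) - 1)
    fx, fy = x - x0, y - y0
    at_y0 = image[x0][y0] * (1 - fx) + image[x1][y0] * fx
    at_y1 = image[x0][y1] * (1 - fx) + image[x1][y1] * fx
    return at_y0 * (1 - fy) + at_y1 * fy


def closest(image, x, y):
    """Take the value of the pixel nearest to (x, y)."""
    return image[int(round(x))][int(round(y))]


image = [[0.0, 10.0], [20.0, 30.0]]
value_bilinear = bilinear(image, 0.5, 0.5)
value_closest = closest(image, 0.9, 0.2)
```

Bilinear interpolation gives values that vary smoothly as a spot moves across pixel boundaries, while the closest-pixel rule is piecewise constant.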
- transfer_patch_to_image(keys_to_transfer, overwrite=False, verbose=False, strategy='average')
Collect the properties computed separately for each patch and stored in patch_properties_dict to create an image property which will be stored in image_properties_dict under the same name.
- Parameters
keys_to_transfer (List[str]) – keys of the quantities to transfer from patch_properties_dict to image_properties_dict. The patch_quantity can be: a scalar, a vector, a scalar field or a vector field. This corresponds to patch_quantity having shapes: (N_patches), (N_patches, ch), (N_patches, w, h) or (N_patches, ch, w, h) respectively.
overwrite (bool) – bool, in case of collision between keys this variable controls whether to overwrite the values in the image_properties_dict.
strategy (str) – str, either ‘average’ (default) or ‘closest’. If ‘average’ the value of each pixel in the image is obtained by averaging the contribution of all patches which contain that pixel. If ‘closest’ each pixel takes the value from the patch whose center is closest to the pixel.
verbose (bool) – bool, if True print intermediate messages
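The ‘average’ strategy for overlapping patches can be sketched for the simplest case, a scalar value per patch. The code below is an illustration under assumptions, not the library's implementation: the function name `patches_to_image` is hypothetical, patches are given as (x, y, w, h) in pixel units, and pixels that no patch covers are set to 0.0.

```python
def patches_to_image(values, patches_xywh, width, height):
    """Average scalar patch values over every pixel covered by a patch."""
    total = [[0.0] * height for _ in range(width)]
    count = [[0] * height for _ in range(width)]
    for value, (x, y, w, h) in zip(values, patches_xywh):
        for ix in range(x, min(x + w, width)):
            for iy in range(y, min(y + h, height)):
                total[ix][iy] += value
                count[ix][iy] += 1
    # pixels not covered by any patch are left at 0.0 (an assumption of this sketch)
    return [
        [t / c if c else 0.0 for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(total, count)
    ]


# two 2 x 2 patches that overlap on the column x = 1
image = patches_to_image([1.0, 3.0], [(0, 0, 2, 2), (1, 0, 2, 2)], width=4, height=2)
```

Where the two patches overlap the pixel takes the mean of both contributions, which is what makes densely overlapping patches produce a smooth image-level map.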
- transfer_patch_to_spot(keys_to_transfer, overwrite=False, verbose=False, strategy_patch_to_image='average', strategy_image_to_spot='bilinear')
Utility function which sequentially transfers annotations from patch -> image -> spot.
- trim_image_dictionary(keys)
Clear selected entries in the image_properties_dict.
- Parameters
keys (List[str]) – the list of keys to remove from the image dictionary
- trim_patch_dictionary(keys)
Clear selected entries in the patch_properties_dict.
- Parameters
keys (List[str]) – the list of keys to remove from the patch dictionary
- trim_spot_dictionary(keys)
Clear selected entries in the spot_properties_dict.
- Parameters
keys (List[str]) – the list of keys to remove from the spot dictionary
- write_to_image_dictionary(key, values, overwrite=False)
Helper function to write information to the image dictionary.
- Parameters
key – the key corresponding to the information to write
values (Union[Tensor, ndarray]) – array of shape \((*, w, h)\) with the image-level information
overwrite (bool) – If True (default is False) overwrite the value if already present
- write_to_patch_dictionary(key, values, patches_xywh, overwrite=False)
Helper function to write information to the patch dictionary.
- Parameters
key (str) – the name under which to save the information
values (Union[Tensor, ndarray]) – the patch-level information of shape \((n_\text{patches}, *, *, *)\)
patches_xywh (Union[Tensor, ndarray]) – the locations from where the patches were taken, of shape \((n_\text{patches}, 4)\)
overwrite (bool) – If True (default is False) overwrite the value if already present
- write_to_spot_dictionary(key, values, overwrite=False)
Helper function to write information to the spot dictionary.
- Parameters
key (str) – the key corresponding to the information to write
values (Union[Tensor, ndarray]) – array of shape \((n, *)\) with the spot-level information
overwrite (bool) – If True (default is False) overwrite the value if already present
- property x_raw: numpy.ndarray
The x_coordinates (in raw units) of the spots used to create the sparse image.
- Return type
ndarray
- property y_raw: numpy.ndarray
The y_coordinates (in raw units) of the spots used to create the sparse image.
- Return type
ndarray