Getting Started

What is Tissue Purifier?

Tissue Purifier is a Python library for the analysis of biological tissue and cellular micro-environments based on self-supervised learning. It is built on PyTorch, PyTorch Lightning, Pyro and AnnData.

Spatially resolved transcriptomic technologies (such as Slide-seq, MERFISH, smFISH, BaristaSeq, ExSeq, STARmap and others) measure gene expression with spatial resolution. Deconvolution methods and/or analysis of marker genes can be used to assign a discrete cell-type (such as macrophage, B cell, …) to each cell.

This type of data can be conveniently organized into anndata objects, which are data structures specifically designed for transcriptomic data. Each anndata object contains a list of all the cells in a tissue together with (at a minimum):

  1. the gene expression profile

  2. the cell-type label

  3. the spatial coordinates (either in 2D or 3D)

This rich data can unlock interesting scientific discoveries, but it is difficult to analyze. Here is where Tissue Purifier comes in.

In short, tissues are converted into images and cropped into overlapping patches. Semantic features are associated with each patch via self-supervised learning (ssl). The learned features are then used in downstream tasks (such as differential gene expression analysis).
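The conversion from cells to image patches can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the function names, the pixel size, the patch size and the use of a dense array (the library works with sparse images) are all assumptions made for illustration, and the input is synthetic.

```python
import numpy as np

def cells_to_image(xy, cell_type, n_types, pixel_size):
    """Rasterize cells into an image of shape (n_types, height, width).

    Each channel counts the cells of one type that fall in each pixel.
    """
    ij = np.floor((xy - xy.min(axis=0)) / pixel_size).astype(int)
    height, width = ij[:, 1].max() + 1, ij[:, 0].max() + 1
    image = np.zeros((n_types, height, width), dtype=np.float32)
    # np.add.at accumulates correctly when several cells share a pixel
    np.add.at(image, (cell_type, ij[:, 1], ij[:, 0]), 1.0)
    return image

def crop_patches(image, patch_size, stride):
    """Return square patches; a stride smaller than patch_size makes them overlap."""
    _, height, width = image.shape
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[:, top:top + patch_size, left:left + patch_size])
    return np.stack(patches)

# Synthetic input, for illustration only: 500 cells of 3 types
rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 100.0, size=(500, 2))
cell_type = rng.integers(0, 3, size=500)

image = cells_to_image(xy, cell_type, n_types=3, pixel_size=2.0)
patches = crop_patches(image, patch_size=16, stride=8)
print(image.shape, patches.shape)
```

Note that only the cell-type labels and the coordinates enter the image; the gene expression profile is not used at this stage.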

What’s appealing about this approach is that it is unbiased, meaning that the researcher does not need to know a priori which features are important. Given enough data and a sufficiently large neural network, this approach should be able to extract biologically relevant features that are useful for solving downstream tasks.

Negative results are also interesting because they suggest that the task at hand cannot be solved based on cellular co-arrangement alone (i.e. cell-type labels and spatial coordinates). In that case, more information (for example histopathology imaging) might be necessary to define the tissue micro-environments.

Typical workflow

A typical workflow consists of 3 steps:

  1. Multiple anndata objects (corresponding to multiple tissues, possibly from a diverse set of conditions) are converted to (sparse) images. These images are cropped into overlapping patches of a characteristic length and are fed into an ssl framework. Importantly, in this step the model has no access to the gene expression profile. It only uses the cell-type labels together with their spatial coordinates to create a multi-channel image (in which each channel encodes the density of a specific cell-type). Therefore, the model can only leverage the cellular co-arrangement as a learning signal. See notebook1.

  2. Once a model is trained, any (new or old) anndata object can be processed. As described above, the anndata object is transformed into a sparse image and cropped into overlapping patches. Semantic features are associated with each patch and then transferred to the cells belonging to the patch. Ultimately, each cell acquires a new set of annotations describing its local micro-environment. This step can be repeated multiple times (once for each trained model) to compare the quality of the features generated by different ssl models and/or different patch sizes. See notebook2.

  3. Finally, we evaluate the quality of the features. To this end, we use the ssl annotations to predict the gene expression profile conditioned on the cell-type. We compare multiple baselines to show that the ssl features are biologically informative. See notebook3.
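The transfer of patch features to cells described in step 2 can be sketched as follows. This is a hedged illustration, not the library's code: it assumes that a cell covered by several overlapping patches receives the average of their feature vectors, and the function name and arguments are hypothetical.

```python
import numpy as np

def transfer_features_to_cells(xy, patch_corners, patch_size, patch_features):
    """Average the features of every patch that contains each cell.

    xy:             (n_cells, 2) cell coordinates
    patch_corners:  (n_patches, 2) lower-left corner of each square patch
    patch_features: (n_patches, n_features) one feature vector per patch
    Cells covered by no patch get NaN features.
    """
    # inside[c, p] is True when cell c lies within patch p
    offset = xy[:, None, :] - patch_corners[None, :, :]
    inside = np.all((offset >= 0) & (offset < patch_size), axis=-1)
    counts = inside.sum(axis=1, keepdims=True)
    summed = inside.astype(float) @ patch_features
    return np.where(counts > 0, summed / np.maximum(counts, 1), np.nan)

# Toy example: two overlapping patches of side 10 and three cells
patch_corners = np.array([[0.0, 0.0], [5.0, 0.0]])
patch_features = np.array([[1.0, 0.0], [0.0, 1.0]])
xy = np.array([[2.0, 2.0],    # only in the first patch
               [7.0, 2.0],    # in both patches
               [20.0, 2.0]])  # in neither patch
cell_features = transfer_features_to_cells(xy, patch_corners, 10.0, patch_features)
print(cell_features)
```

The second cell lies in both patches and therefore receives the mean of the two feature vectors, which is how overlapping patches give each cell a smoothed description of its neighborhood.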

Why image-based self supervised learning?

Spatial transcriptomic data is a type of tabular data and could be analyzed without converting it to images. However, image-based approaches offer three remarkable advantages:

  1. We can leverage state-of-the-art approaches which are continuously developed by the larger ML community.

  2. By changing the patch size, we can easily obtain information about the cellular environment at different spatial resolutions, from local (a few cells) to global (thousands of cells).

  3. In this approach it is trivial to combine cell-typing information with other imaging modalities such as histopathology. The images corresponding to cell-typing and histopathology can be simply concatenated before feeding them to the algorithm.
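The concatenation mentioned in the last point amounts to stacking images along the channel axis. The sketch below is conceptual only (pairing with histopathology is not a feature the library currently supports); the channel counts and image size are hypothetical, and both images are assumed to be already registered to the same pixel grid.

```python
import numpy as np

# Hypothetical shapes: 9 cell-type channels and a 3-channel (RGB) histology
# image, both defined on the same 256 x 256 pixel grid.
cell_type_image = np.zeros((9, 256, 256), dtype=np.float32)
histology_image = np.zeros((3, 256, 256), dtype=np.float32)

# Stack along the channel axis; the spatial axes must agree.
assert cell_type_image.shape[1:] == histology_image.shape[1:]
combined = np.concatenate([cell_type_image, histology_image], axis=0)
print(combined.shape)
```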

Installation

First, you need Python 3.9 and PyTorch (with CUDA support). The following command, run from your terminal, should report True:

python -c 'import torch; print(torch.cuda.is_available())'

Next install the most recent version of Pyro (not yet available using pip):

git clone https://github.com/pyro-ppl/pyro.git
cd pyro
pip install .

Finally install Tissue Purifier and its dependencies:

git clone https://github.com/broadinstitute/tissue_purifier.git
cd tissue_purifier
pip install -r requirements.txt
pip install .
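After installation, a short script can confirm that the main dependencies are visible to the interpreter. This is a sketch using only the standard library; the list of import names (torch, pytorch_lightning, pyro, anndata) reflects the usual import names of the packages mentioned above and is an assumption, not something taken from requirements.txt.

```python
import importlib.util
import sys

def check_environment(required=("torch", "pytorch_lightning", "pyro", "anndata")):
    """Return the required packages that cannot be found, without importing them."""
    return [name for name in required if importlib.util.find_spec(name) is None]

# Warn (but do not fail) when the interpreter is not the version named above.
if sys.version_info[:2] != (3, 9):
    print(f"Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, expected 3.9")

missing = check_environment()
print("Missing packages:", ", ".join(missing) if missing else "none")
```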

Docker Image

A GPU-enabled docker image is available from the Google Container Registry (GCR) as:

us.gcr.io/broad-dsde-methods/tissuepurifier:latest

Older versions are available at the same location, for example as

us.gcr.io/broad-dsde-methods/tissuepurifier:0.0.5

How to run

There are 3 ways to run the code:

You can run the notebooks sequentially. Each notebook (notebook1, notebook2, notebook3) demonstrates one step of the typical workflow described in Typical workflow.

Or you can run the code locally from the command line. First download the example data (first published in Dissecting Mammalian Spermatogenesis Using Spatial Transcriptomics by Chen et al.) and untar it in the “testis_anndata” directory.

gsutil -m cp gs://ld-data-bucket/tissue-purifier/slideseq_testis_anndata_h5ad.tar.gz ./
mkdir -p ./testis_anndata
tar -xzf slideseq_testis_anndata_h5ad.tar.gz -C ./testis_anndata

Next, navigate to the “tissue_purifier/run” directory and train the model (this will take about 6 hours on an NVIDIA P100):

cd tissue_purifier/run
python main_1_train_ssl.py --config config_barlow_ssl.yaml --data_folder testis_anndata

# or alternatively
# python main_1_train_ssl.py --config config_dino_ssl.yaml --data_folder testis_anndata --gpus 2
# python main_1_train_ssl.py --config config_simclr_ssl.yaml --data_folder testis_anndata --gpus 2
# python main_1_train_ssl.py --config config_vae_ssl.yaml --data_folder testis_anndata --gpus 2

Next, extract the features (this takes only a few minutes to run):

python main_2_featurize.py \
    --anndata_in adata_0_raw.h5ad \
    --anndata_out adata_0_annotated.h5ad \
    --ckpt_in ckpt_barlow.ckpt \
    --feature_key barlow \
    --n_patches 500 \
    --ncv_k 10 25 100
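Because the featurization step can be repeated once per trained model, it may be convenient to generate the commands programmatically. The sketch below only builds and prints the command lines, using the flags shown above; the second checkpoint name (ckpt_dino.ckpt) and the output file naming are hypothetical and should be replaced with your own.

```python
import shlex

# Hypothetical checkpoint names, one per trained model; replace with your own.
checkpoints = {"barlow": "ckpt_barlow.ckpt", "dino": "ckpt_dino.ckpt"}

def featurize_command(anndata_in, anndata_out, ckpt, feature_key):
    """Build the argument list for one main_2_featurize.py run."""
    return [
        "python", "main_2_featurize.py",
        "--anndata_in", anndata_in,
        "--anndata_out", anndata_out,
        "--ckpt_in", ckpt,
        "--feature_key", feature_key,
        "--n_patches", "500",
        "--ncv_k", "10", "25", "100",
    ]

# One command per model, each writing to its own (hypothetical) output file.
commands = [
    featurize_command("adata_0_raw.h5ad", f"adata_0_annotated_{key}.h5ad", ckpt, key)
    for key, ckpt in checkpoints.items()
]
for command in commands:
    print(shlex.join(command))
```

Each printed line can be pasted into a terminal, or the argument lists can be passed to subprocess.run to execute the runs in sequence.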

Finally, evaluate the features based on their ability to predict the gene expression profile.

python main_3_genex.py --anndata_in XXX --l1 0.1 --n_pca 9 --XXX # DOUBLE CHECK
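The logic of this evaluation can be illustrated with a toy comparison between a baseline that knows only the cell-type and a model that also sees the ssl features. The sketch below is not the model used by main_3_genex.py: it swaps in ordinary least squares on a single synthetic gene within a single cell type, and the data are constructed so that the features are informative, so the outcome says nothing about real tissue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the cells of ONE cell type (illustration only).
n_cells, n_features = 400, 5
ssl_features = rng.normal(size=(n_cells, n_features))
true_weights = rng.normal(size=n_features)
expression = ssl_features @ true_weights + 0.1 * rng.normal(size=n_cells)

# Hold out the last quarter of the cells for testing.
split = 300
train, test = slice(0, split), slice(split, None)

# Baseline: predict the training mean, i.e. use the cell type only.
baseline_pred = np.full(n_cells - split, expression[train].mean())

# Model: least squares on the ssl features plus an intercept.
design = np.column_stack([np.ones(n_cells), ssl_features])
coef, *_ = np.linalg.lstsq(design[train], expression[train], rcond=None)
model_pred = design[test] @ coef

baseline_mse = np.mean((expression[test] - baseline_pred) ** 2)
model_mse = np.mean((expression[test] - model_pred) ** 2)
print(f"baseline MSE: {baseline_mse:.3f}, model MSE: {model_mse:.3f}")
```

On real data, a held-out error that is no better than the baseline would be the kind of negative result discussed earlier: the features carry no additional information about that gene.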

It might make sense to train your model remotely on Google Cloud (or another cloud provider) using Terra or Cromwell and cromshell. After installing cromshell and connecting to a Cromwell server, you can submit a run as follows:

cd tissue_purifier/run
./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_barlow_ssl.yaml

# or alternatively
# ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_dino_ssl.yaml
# ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_simclr_ssl.yaml
# ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_vae_ssl.yaml

Steps 2 and 3 can be run locally since they are much shorter (see above).

Features and Limitations

Features:

  1. We have implemented multiple ssl strategies (such as convolutional VAE, DINO, Barlow Twins, SimCLR) based on recent advances in image-based machine learning.

  2. Tissue Purifier can be used to analyze any type of localized quantitative measurement, for example spatial proteomics (not only mRNA count data).

Current limitations:

  1. Tissue Purifier works only with 2D tissue slices. No 3D support at the moment.

  2. Tissue Purifier assumes a hard cell-type assignment.

Future Improvements

We hope to soon support:

  1. probabilistic cell-type assignment

  2. pairing with histopathology (i.e. dense-image)

  3. extension to handle 3D images

Contributing

We aspire to make Tissue Purifier an easy-to-use and useful software package for the bioinformatics community. While we test and improve Tissue Purifier together with our research collaborators, your feedback is invaluable to us and allows us to steer Tissue Purifier in the direction that you find most useful in your research. If you have an interesting idea or suggestion, please do not hesitate to reach out to us.

If you encounter a bug, please file a detailed GitHub issue and we will get back to you as soon as possible.

Citation

This software package was developed by Luca D’Alessio and Fedor Grab.