Miscellaneous Utilities

These are general-purpose utility functions, grouped into dictionary utilities and validation utilities.

Dictionary utilities

These are utilities for manipulating dictionaries. See notebook4 for an example of why/how to use them.

are_dicts_equal(dict1, dict2, keys_to_include=None, keys_to_exclude=None)[source]

Compare two dictionaries. Returns True if all the compared entries are identical.

Parameters
  • dict1 (dict) – first dictionary to compare

  • dict2 (dict) – second dictionary to compare

  • keys_to_include (Optional[List[str]]) – list of keys to use for the comparison. If None (default), the union of the keys in the two dictionaries is used.

  • keys_to_exclude (Optional[List[str]]) – list of keys to exclude from the comparison. If None (default), no keys are excluded.

Returns

result – True if all the entries corresponding to keys_to_include are identical.

Note

float(1.0) is considered different from int(1).

Return type

bool
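
Example

A minimal usage sketch based on the documented behavior (the values are illustrative):

>>> d1 = {'a': 1, 'b': 2}
>>> d2 = {'a': 1, 'b': 3}
>>> are_dicts_equal(d1, d2, keys_to_include=['a'])
True
>>> are_dicts_equal(d1, d2)
False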

concatenate_list_of_dict(list_of_dict)[source]

Concatenate dictionaries that share the same set of keys

Parameters

list_of_dict – list of dictionaries to concatenate

Returns

output_dict – the concatenated dictionary

Return type

dict
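
Example

A minimal sketch; the exact concatenation rule per value type is documented only implicitly, so the output shown is an assumption:

>>> d1 = {'x': [1, 2]}
>>> d2 = {'x': [3]}
>>> concatenate_list_of_dict([d1, d2])  # expected: {'x': [1, 2, 3]}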

flatten_dict(input_dict, separator='_', prefix='')[source]

Flatten a (possibly nested) dictionary

Parameters
  • input_dict (dict) – the input dictionary to flatten

  • separator (str) – string used to merge nested keys. It defaults to “_”

  • prefix (str) – used in the recursive calls. Do not set manually
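
Example

A minimal sketch of the expected behavior with the default separator:

>>> flatten_dict({'a': {'b': 1, 'c': 2}})
{'a_b': 1, 'a_c': 2}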

inspect_dict(d, prefix='')[source]

Inspect the content of the dictionary

Parameters
  • d – the dictionary to inspect

  • prefix (str) – used recursively in case of nested dictionaries. Do not set it directly.

sort_dict_according_to_indices(input_dict, list_of_indices)[source]

Sort dictionaries w.r.t. a list of indices.

Parameters
  • input_dict (dict) – the dictionary to sort

  • list_of_indices (List[int]) – the indices to use in the sorting.

Returns

output_dict – the sorted dictionary.

Example

>>> input_dict = {'key': ['b', 'c', 'a']}
>>> list_of_indices = [3, 1, 2]
>>> output_dict = sort_dict_according_to_indices(input_dict, list_of_indices)
>>> print(output_dict)  # expected: {'key': ['a', 'b', 'c']}

Return type

dict

subset_dict(input_dict, mask)[source]

Subset all the elements of a dictionary according to a mask

Parameters
  • input_dict (dict) – dictionary with multiple entries in the form of lists, numpy.arrays, or torch.Tensors with the same leading dimension, (N)

  • mask (Tensor) – boolean tensor of shape (N).

Returns

output_dict – a new dictionary with the subset values
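
Example

A minimal sketch (the torch import and values are illustrative):

>>> import torch
>>> d = {'scores': torch.tensor([0.1, 0.7, 0.3])}
>>> mask = torch.tensor([True, False, True])
>>> subset_dict(d, mask)['scores']  # expected: tensor([0.1000, 0.3000])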

subset_dict_non_overlapping_patches(input_dict, key_tissue, key_patch_xywh='patches_xywh', iom_threshold=0.0)[source]

Subset a dictionary containing overlapping patches to a smaller dictionary containing only (weakly) overlapping ones.

Parameters
  • input_dict (dict) – the dictionary to subset.

  • key_tissue (str) – the dictionary key corresponding to the tissue identifier.

  • key_patch_xywh (str) – the dictionary key corresponding to the coordinates (i.e. x,y,w,h) of the patches.

  • iom_threshold (float) – threshold value for Intersection over Minimum (IoM). If two patches have \(\text{IoM} > \text{threshold}\), only one will survive the filtering process. Set iom_threshold = 0 to obtain a collection of strictly non-overlapping patches.

Returns

output_dict – Dictionary containing only patches with overlap less than threshold.

Note

The original dictionary will NOT be overwritten.

Return type

dict
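
Note

For reference, IoM is the intersection area divided by the smaller of the two patch areas. A minimal sketch for two patches in (x, y, w, h) format (the helper name iom_xywh is hypothetical):

def iom_xywh(p1, p2):
    # Intersection over Minimum for two axis-aligned patches (x, y, w, h).
    x1, y1, w1, h1 = p1
    x2, y2, w2, h2 = p2
    ix = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return (ix * iy) / min(w1 * h1, w2 * h2)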

transfer_annotations_between_dict(source_dict, dest_dict, annotation_keys, anchor_key, metric='euclidean')[source]

Transfer the annotations from the source dictionary to the destination dictionary. For each element in the destination dictionary it finds the closest element in the source dictionary and copies the annotations from there. Closeness is defined as the metric distance between the anchor elements.

Parameters
  • source_dict (dict) – source dictionary from which the annotations will be read

  • dest_dict (dict) – destination dictionary where the annotation will be written

  • annotation_keys (List[Any]) – list of keys. It is assumed that these keys are present in the source dictionary

  • anchor_key (Any) – The key of the element to be used to measure distances. It must be present in BOTH source and destination dictionaries.

  • metric (str) – the distance metric used between elements in the source and destination dictionaries. It defaults to ‘euclidean’.

Returns

dict – The updated destination dictionary

Return type

dict
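
Example

A minimal sketch (keys and values are illustrative):

>>> source = {'pos': [[0.0, 0.0], [5.0, 5.0]], 'label': ['a', 'b']}
>>> dest = {'pos': [[0.1, 0.2]]}
>>> updated = transfer_annotations_between_dict(source, dest, annotation_keys=['label'], anchor_key='pos')
>>> updated['label']  # expected: ['a'], since [0.1, 0.2] is closest to [0.0, 0.0]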

Validation utilities

These are utilities used during validation to analyze the embeddings. See notebook4 for an example of why/how to use them.

class SmartUmap(preprocess_strategy, compute_all_pairwise_distances=False, **kargs)[source]

Wrapper around standard UMAP with get_graph() exposed.

__init__(preprocess_strategy, compute_all_pairwise_distances=False, **kargs)[source]
Parameters
  • preprocess_strategy (str) – one of ‘center’, ‘z_score’, or ‘raw’. This is the operation performed on the data before UMAP

  • compute_all_pairwise_distances (bool) – if True (default is False), compute all pairwise distances

  • **kargs – All the arguments that standard UMAP can accept

fit(data, y=None)[source]

Fit the UMAP model given the data

Parameters

data – array of shape \((n, p)\) where n is the number of points and p the number of features

Return type

SmartUmap

fit_transform(data, y=None)[source]

Utility method which internally calls fit() and transform()

Return type

ndarray

get_distances()[source]

Returns the symmetric (dense) matrix with the DISTANCES between elements

Return type

Tensor

get_graph()[source]

Returns the symmetric (sparse) matrix with the SIMILARITIES between elements

Return type

coo_matrix

transform(data)[source]

Use the previously fitted model (including the mean and std used for centering and scaling the data) to transform the embeddings.

Parameters

data – array of shape \((n, p)\) to transform

Returns

embeddings – numpy.ndarray of shape (n_samples, n_components)

Return type

ndarray
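
Example

A usage sketch; n_neighbors is a standard UMAP keyword forwarded through **kargs, and the data are illustrative:

>>> import numpy as np
>>> data = np.random.randn(500, 32)
>>> umap = SmartUmap(preprocess_strategy='z_score', n_neighbors=15)
>>> embeddings = umap.fit_transform(data)  # ndarray of shape (500, n_components)
>>> graph = umap.get_graph()  # sparse similarity matrix, e.g. for SmartLeiden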

class SmartLeiden(graph, directed=True)[source]

Wrapper around the standard Leiden algorithm. It can be initialized using the output of SmartUmap.get_graph()

__init__(graph, directed=True)[source]
Parameters
  • graph (coo_matrix) – Usually a sparse matrix with the similarities among nodes describing the graph

  • directed (bool) – if True (default) builds a directed graph.

Note

The matrix obtained from the UMAP algorithm is symmetric; in that case, directed should be set to True

cluster(resolution=1.0, use_weights=True, random_state=0, n_iterations=-1, partition_type='RBC')[source]

Find the clusters in the data

Parameters
  • resolution (float) – resolution parameter controlling (indirectly) the number of clusters

  • use_weights (bool) – if True (default), the graph is weighted, i.e. the edges have different strengths

  • random_state (int) – controls the random state, for reproducibility

  • n_iterations (int) – how many iterations of the greedy algorithm to perform. If -1 (default), it iterates until convergence.

  • partition_type (str) – the metric to optimize when finding clusters. Either ‘CPM’ or ‘RBC’.

Returns

labels – the integer cluster labels

Return type

ndarray
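
Example

Continuing the SmartUmap sketch above:

>>> leiden = SmartLeiden(graph=umap.get_graph())
>>> labels = leiden.cluster(resolution=1.0, partition_type='RBC')  # integer labels, one per point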

class SmartPca(preprocess_strategy)[source]

Return the PCA embeddings.

__init__(preprocess_strategy)[source]
Parameters

preprocess_strategy (str) – one of ‘center’, ‘z_score’, or ‘raw’. This is the operation performed on the data before PCA

property explained_variance_

For compatibility with scikit-learn

property explained_variance_ratio_

For compatibility with scikit-learn

fit(data)[source]

Fit the PCA given the data. It automatically selects the algorithm based on the number of features.

Parameters

data – array of shape \((n, p)\) where n is the number of points and p the number of features

Return type

SmartPca

fit_transform(data, n_components=None)[source]

Utility method which internally calls fit() and transform().

Parameters
  • data – tensor of shape \((n, p)\)

  • n_components (Union[int, float, None]) – if an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None (default), the previously used value is reused.

Returns

data_transformed – array of shape \((n, q)\)

Return type

ndarray

transform(data, n_components=None)[source]

Use a previously fitted model to transform the data.

Parameters
  • data – tensor of shape \((n, p)\) where n is the number of points and p the number of features

  • n_components (Union[int, float, None]) – if an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None, the previously used value is reused.

Return type

ndarray
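
Example

A usage sketch (the data are illustrative):

>>> import numpy as np
>>> data = np.random.randn(500, 64)
>>> pca = SmartPca(preprocess_strategy='center')
>>> data_pca = pca.fit_transform(data, n_components=0.9)  # keep >= 90% explained variance
>>> pca.explained_variance_ratio_[:3]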

class SmartScaler(quantiles, clamp)[source]

Scale the values using the median and quantiles (which are robust versions of the mean and variance): \(data = (data - median) / scale\)

If clamp=True, each feature is clamped to the quantile range before applying the transformation. This is a simple way to deal with outliers.

It does not deal with the situation in which outliers are inside the “box” of the acceptable range but far from the reduced manifold, as shown below:

x x
x x
x x      o
x x

__init__(quantiles, clamp)[source]
Parameters
  • quantiles (Tuple[float, float]) – The lowest and largest quantile used to scale the data. Must be in (0.0, 1.0)

  • clamp (bool) – if True, the data is clamped to the [q_low, q_high] range before scaling.

fit(data)[source]

Fit the scaler to the data (i.e. compute the median and quantiles)

Return type

SmartScaler

fit_transform(data)[source]

Utility method which internally calls fit() and transform()

Return type

ndarray

transform(data)[source]

Transform the data

Parameters

data – tensor of shape \((n, p)\)

Returns

out – tensor of the same shape as data with the scaled values.

Return type

ndarray
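
Example

A usage sketch; the quantile pair is illustrative:

>>> scaler = SmartScaler(quantiles=(0.25, 0.75), clamp=True)
>>> scaled = scaler.fit_transform(data)  # robust analogue of z-scoring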

compute_distance_embedding(ref_embeddings, other_embeddings, metric, temperature=0.5)[source]

Compute the distance between embeddings.

Parameters
  • ref_embeddings (Tensor) – torch.Tensor of shape \((*, k)\) where k is the dimension of the embedding

  • other_embeddings (Tensor) – torch.Tensor of shape \((n, k)\)

  • metric (str) – can be either ‘contrastive’ or ‘euclidean’

  • temperature (float) – the temperature used to compute the contrastive distance

Returns

dist – distance of shape \((*, n)\)

Return type

Tensor
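
Example

A minimal sketch (tensor values are illustrative):

>>> import torch
>>> ref = torch.randn(10, 32)
>>> other = torch.randn(50, 32)
>>> dist = compute_distance_embedding(ref, other, metric='euclidean')
>>> dist.shape  # expected: torch.Size([10, 50])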

get_percentile(data, dim)[source]

Takes some data and converts it into percentiles (in [0.0, 1.0]) along a specified dimension. Useful to map a tensor into the range [0.0, 1.0] for visualization.

Parameters
  • data (Union[Tensor, ndarray]) – input data to convert to percentile in [0,1].

  • dim (int) – the dimension along which to compute the quantiles

Returns

percentile – torch.Tensor or numpy.ndarray (depending on the input type) with the same shape as the input, containing the percentile values. A percentile of 0.9 means that 90% of the input values were smaller.

Return type

Union[Tensor, ndarray]
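
Note

One common way to implement this is a double argsort; a minimal sketch assuming a torch.Tensor input (the helper name is hypothetical):

def get_percentile_sketch(data, dim):
    # Rank each entry along dim, then normalize the ranks to [0.0, 1.0].
    ranks = data.argsort(dim=dim).argsort(dim=dim)
    return ranks.float() / (data.shape[dim] - 1)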

get_z_score(x, dim)[source]

Standardize vector by removing the mean and scaling to unit variance

Parameters
  • x (Tensor) – torch.Tensor

  • dim (int) – the dimension along which to compute the mean and std

Returns

The z-score, i.e. \(z = (x - mean) / std\)

Return type

Tensor
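
Note

The formula maps directly to code; a minimal sketch assuming a torch.Tensor input (the helper name is hypothetical):

def get_z_score_sketch(x, dim):
    # Standardize along dim: subtract the mean and divide by the std.
    mean = x.mean(dim=dim, keepdim=True)
    std = x.std(dim=dim, keepdim=True)
    return (x - mean) / std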

inverse_one_hot(image_in, bg_label=-1, dim=-3, threshold=0.1)[source]

Takes a float tensor and computes the argmax and max value along the specified dimension. Returns an integer tensor of the same shape as the input but with dim removed. If the max value is less than the threshold, the bg_label is assigned.

Note

It can take an image of size \((C, W, H)\) and generate an integer mask of size \((W, H)\). This operation can be thought of as the inverse of the one-hot operation, which takes an integer tensor of size (*) and returns a float tensor with an extra dimension, for example (*, num_classes).

Parameters
  • image_in – any float tensor

  • bg_label (int) – the value assigned to entries whose max value is smaller than the threshold

  • dim (int) – the dimension along which to compute the max. For images this is usually the channel dimension, i.e. -3.

  • threshold (float) – the threshold value. Entries whose max value is smaller than this are assigned to the background label

Returns

out – an integer mask with the same size as the input tensor but with the dim removed.
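
Note

The described behavior maps to a few lines of torch; a minimal sketch (the helper name is hypothetical):

import torch

def inverse_one_hot_sketch(image_in, bg_label=-1, dim=-3, threshold=0.1):
    # Argmax and max value along the class dimension.
    max_values, argmax = torch.max(image_in, dim=dim)
    # Entries with a weak maximum fall back to the background label.
    return torch.where(max_values < threshold,
                       torch.full_like(argmax, bg_label),
                       argmax)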