Miscellaneous Utilities

These are general-purpose utility functions, grouped into dictionary utilities and validation utilities.

Dictionary utilities

These are utilities for manipulating dictionaries. See notebook4 for an example of why/how to use them.

are_dicts_equal(dict1, dict2, keys_to_include=None, keys_to_exclude=None)[source]

Compare two dictionaries. Returns True if all the compared entries are identical.

Parameters
  • dict1 (dict) – first dictionary to compare

  • dict2 (dict) – second dictionary to compare

  • keys_to_include (Optional[List[str]]) – list of keys to use for the comparison. If None (default), the union of the keys in the two dictionaries is used.

  • keys_to_exclude (Optional[List[str]]) – list of keys to exclude from the comparison. If None (default), no keys are excluded.

Returns

result – True if all the entries corresponding to keys_to_include are identical.

Note

float(1.0) is considered different from int(1).

Return type

bool
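
Example

A minimal usage sketch based on the documented behavior (the values are illustrative):

>>> d1 = {'a': 1, 'b': 2}
>>> d2 = {'a': 1, 'b': 3}
>>> are_dicts_equal(d1, d2, keys_to_include=['a'])
True
>>> are_dicts_equal(d1, d2)
False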

concatenate_list_of_dict(list_of_dict)[source]

Concatenate dictionaries that share the same set of keys

Parameters

list_of_dict – list of dictionaries to concatenate

Returns

output_dict – the concatenated dictionary

Return type

dict
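
Example

A minimal sketch; the exact concatenation rule per value type is documented only implicitly, so the output shown is an assumption:

>>> d1 = {'x': [1, 2]}
>>> d2 = {'x': [3]}
>>> concatenate_list_of_dict([d1, d2])  # expected: {'x': [1, 2, 3]}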

flatten_dict(input_dict, separator='_', prefix='')[source]

Flatten a (possibly nested) dictionary

Parameters
  • input_dict (dict) – the input dictionary to flatten

  • separator (str) – string used to merge nested keys. It defaults to “_”

  • prefix (str) – used in the recursive calls. Do not set manually
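
Example

A minimal sketch of the expected behavior with the default separator:

>>> flatten_dict({'a': {'b': 1, 'c': 2}})
{'a_b': 1, 'a_c': 2}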

inspect_dict(d, prefix='')[source]

Inspect the content of the dictionary

Parameters
  • d – the dictionary to inspect

  • prefix (str) – used recursively in case of nested dictionaries. Do not set it directly.

sort_dict_according_to_indices(input_dict, list_of_indices)[source]

Sort dictionaries w.r.t. a list of indices.

Parameters
  • input_dict (dict) – the dictionary to sort

  • list_of_indices (List[int]) – the indices to use in the sorting.

Returns

output_dict – the sorted dictionary.

Example

>>> input_dict = {'key': ['b', 'c', 'a']}
>>> list_of_indices = [3, 1, 2]
>>> output_dict = sort_dict_according_to_indices(input_dict, list_of_indices)
>>> print(output_dict)  # expected: {'key': ['a', 'b', 'c']}

Return type

dict

subset_dict(input_dict, mask)[source]

Subset all the elements of a dictionary according to a mask

Parameters
  • input_dict (dict) – dictionary with multiple entries in the form of lists, numpy.arrays, or torch.Tensors with the same leading dimension, (N)

  • mask (Tensor) – boolean tensor of shape (N).

Returns

output_dict – a new dictionary with the subset values
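
Example

A minimal sketch (the torch import and values are illustrative):

>>> import torch
>>> d = {'scores': torch.tensor([0.1, 0.7, 0.3])}
>>> mask = torch.tensor([True, False, True])
>>> subset_dict(d, mask)['scores']  # expected: tensor([0.1000, 0.3000])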

subset_dict_non_overlapping_patches(input_dict, key_tissue, key_patch_xywh='patches_xywh', iom_threshold=0.0)[source]

Subset a dictionary containing overlapping patches to a smaller dictionary containing only (weakly) overlapping ones.

Parameters
  • input_dict (dict) – the dictionary to subset.

  • key_tissue (str) – the dictionary key corresponding to the tissue identifier.

  • key_patch_xywh (str) – the dictionary key corresponding to the coordinates (i.e. x,y,w,h) of the patches.

  • iom_threshold (float) – threshold value for Intersection over Minimum (IoM). If two patches have \(\text{IoM} > \text{threshold}\), only one will survive the filtering process. Set iom_threshold = 0 to obtain a collection of strictly non-overlapping patches.

Returns

output_dict – Dictionary containing only patches with overlap less than threshold.

Note

The original dictionary will NOT be overwritten.

Return type

dict
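
Note

For reference, IoM is the intersection area divided by the smaller of the two patch areas. A minimal sketch for two patches in (x, y, w, h) format (the helper name iom_xywh is hypothetical):

def iom_xywh(p1, p2):
    # Intersection over Minimum for two axis-aligned patches (x, y, w, h).
    x1, y1, w1, h1 = p1
    x2, y2, w2, h2 = p2
    ix = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return (ix * iy) / min(w1 * h1, w2 * h2)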

transfer_annotations_between_dict(source_dict, dest_dict, annotation_keys, anchor_key, metric='euclidean')[source]

Transfer the annotations from the source dictionary to the destination dictionary. For each element in the destination dictionary it finds the closest element in the source dictionary and copies the annotations from there. Closeness is defined as the metric distance between the anchor elements.

Parameters
  • source_dict (dict) – source dictionary from which the annotations will be read

  • dest_dict (dict) – destination dictionary where the annotation will be written

  • annotation_keys (List[Any]) – list of keys. It is assumed that these keys are present in the source dictionary

  • anchor_key (Any) – The key of the element to be used to measure distances. It must be present in BOTH source and destination dictionaries.

  • metric (str) – the distance metric used between elements in the source and destination dictionaries. It defaults to ‘euclidean’.

Returns

dict – The updated destination dictionary

Return type

dict
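
Example

A minimal sketch (keys and values are illustrative):

>>> source = {'pos': [[0.0, 0.0], [5.0, 5.0]], 'label': ['a', 'b']}
>>> dest = {'pos': [[0.1, 0.2]]}
>>> updated = transfer_annotations_between_dict(source, dest, annotation_keys=['label'], anchor_key='pos')
>>> updated['label']  # expected: ['a'], since [0.1, 0.2] is closest to [0.0, 0.0]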

Validation utilities

These are utilities used during validation to analyze the embeddings. See notebook4 for an example of why/how to use them.

class SmartUmap(preprocess_strategy, compute_all_pairwise_distances=False, **kargs)[source]

Wrapper around standard UMAP with get_graph() exposed.

__init__(preprocess_strategy, compute_all_pairwise_distances=False, **kargs)[source]
Parameters
  • preprocess_strategy (str) – one of ‘center’, ‘z_score’, or ‘raw’. This is the operation performed on the data before UMAP

  • compute_all_pairwise_distances (bool) – if True (default is False), compute all pairwise distances

  • **kargs – All the arguments that standard UMAP can accept

fit(data, y=None)[source]

Fit the UMAP model given the data

Parameters

data – array of shape \((n, p)\) where n is the number of points and p the number of features

Return type

SmartUmap

fit_transform(data, y=None)[source]

Utility method which internally calls fit() and transform()

Return type

ndarray

get_distances()[source]

Returns the symmetric (dense) matrix with the DISTANCES between elements

Return type

Tensor

get_graph()[source]

Returns the symmetric (sparse) matrix with the SIMILARITIES between elements

Return type

coo_matrix

transform(data)[source]

Use the previously fitted model (including the mean and std used for centering and scaling the data) to transform the embeddings.

Parameters

data – array of shape \((n, p)\) to transform

Returns

embeddings – numpy.ndarray of shape (n_samples, n_components)

Return type

ndarray
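
Example

A usage sketch; n_neighbors is a standard UMAP keyword forwarded through **kargs, and the data are illustrative:

>>> import numpy as np
>>> data = np.random.randn(500, 32)
>>> umap = SmartUmap(preprocess_strategy='z_score', n_neighbors=15)
>>> embeddings = umap.fit_transform(data)  # ndarray of shape (500, n_components)
>>> graph = umap.get_graph()  # sparse similarity matrix, e.g. for SmartLeiden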

class SmartLeiden(graph, directed=True)[source]

Wrapper around the standard Leiden algorithm. It can be initialized using the output of SmartUmap.get_graph()

__init__(graph, directed=True)[source]
Parameters
  • graph (coo_matrix) – Usually a sparse matrix with the similarities among nodes describing the graph

  • directed (bool) – if True (default) builds a directed graph.

Note

The matrix obtained from the UMAP algorithm is symmetric; in that case, directed should be set to True

cluster(resolution=1.0, use_weights=True, random_state=0, n_iterations=-1, partition_type='RBC')[source]

Find the clusters in the data

Parameters
  • resolution (float) – resolution parameter controlling (indirectly) the number of clusters

  • use_weights (bool) – if True (default), the graph is weighted, i.e. the edges have different strengths

  • random_state (int) – controls the random state, for reproducibility

  • n_iterations (int) – how many iterations of the greedy algorithm to perform. If -1 (default), it iterates until convergence.

  • partition_type (str) – the metric to optimize when finding clusters. Either ‘CPM’ or ‘RBC’.

Returns

labels – the integer cluster labels

Return type

ndarray
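
Example

Continuing the SmartUmap sketch above:

>>> leiden = SmartLeiden(graph=umap.get_graph())
>>> labels = leiden.cluster(resolution=1.0, partition_type='RBC')  # integer labels, one per point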

class SmartPca(preprocess_strategy)[source]

Return the PCA embeddings.

__init__(preprocess_strategy)[source]
Parameters

preprocess_strategy (str) – one of ‘center’, ‘z_score’, or ‘raw’. This is the operation performed on the data before PCA

property explained_variance_

For compatibility with scikit-learn

property explained_variance_ratio_

For compatibility with scikit-learn

fit(data)[source]

Fit the PCA given the data. It automatically selects the algorithm based on the number of features.

Parameters

data – array of shape \((n, p)\) where n is the number of points and p the number of features

Return type

SmartPca

fit_transform(data, n_components=None)[source]

Utility method which internally calls fit() and transform().

Parameters
  • data – tensor of shape \((n, p)\)

  • n_components (Union[int, float, None]) – if an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None (default), the previously used value is reused.

Returns

data_transformed – array of shape \((n, q)\)

Return type

ndarray

transform(data, n_components=None)[source]

Use a previously fitted model to transform the data.

Parameters
  • data – tensor of shape \((n, p)\) where n is the number of points and p the number of features

  • n_components (Union[int, float, None]) – if an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None, the previously used value is reused.

Return type

ndarray
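
Example

A usage sketch (the data are illustrative):

>>> import numpy as np
>>> data = np.random.randn(500, 64)
>>> pca = SmartPca(preprocess_strategy='center')
>>> data_pca = pca.fit_transform(data, n_components=0.9)  # keep >= 90% explained variance
>>> pca.explained_variance_ratio_[:3]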

class SmartScaler(quantiles, clamp)[source]

Scale the values using the median and quantiles (which are robust versions of the mean and variance): \(data = (data - median) / scale\)

If clamp=True, each feature is clamped to the quantile range before applying the transformation. This is a simple way to deal with outliers.

It does not deal with the situation in which outliers are inside the “box” of the acceptable range but far from the reduced manifold, as shown below:

x x
x x
x x      o
x x

__init__(quantiles, clamp)[source]
Parameters
  • quantiles (Tuple[float, float]) – The lowest and largest quantile used to scale the data. Must be in (0.0, 1.0)

  • clamp (bool) – if True, the data is clamped to the [q_low, q_high] range before scaling.

fit(data)[source]

Fit the scaler to the data (i.e. compute the median and quantiles)

Return type

SmartScaler

fit_transform(data)[source]

Utility method which internally calls fit() and transform()

Return type

ndarray

transform(data)[source]

Transform the data

Parameters

data – tensor of shape \((n, p)\)

Returns

out – tensor of the same shape as data with the scaled values.

Return type

ndarray
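
Example

A usage sketch; the quantile pair is illustrative:

>>> scaler = SmartScaler(quantiles=(0.25, 0.75), clamp=True)
>>> scaled = scaler.fit_transform(data)  # robust analogue of z-scoring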

compute_distance_embedding(ref_embeddings, other_embeddings, metric, temperature=0.5)[source]

Compute the distance between embeddings.

Parameters
  • ref_embeddings (Tensor) – torch.Tensor of shape \((*, k)\) where k is the dimension of the embedding

  • other_embeddings (Tensor) – torch.Tensor of shape \((n, k)\)

  • metric (str) – can be either ‘contrastive’ or ‘euclidean’

  • temperature (float) – the temperature used to compute the contrastive distance

Returns

dist – distance of shape \((*, n)\)

Return type

Tensor
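
Example

A minimal sketch (tensor values are illustrative):

>>> import torch
>>> ref = torch.randn(10, 32)
>>> other = torch.randn(50, 32)
>>> dist = compute_distance_embedding(ref, other, metric='euclidean')
>>> dist.shape  # expected: torch.Size([10, 50])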

get_percentile(data, dim)[source]

Takes some data and converts it into percentiles (in [0.0, 1.0]) along a specified dimension. Useful to map a tensor into the range [0.0, 1.0] for visualization.

Parameters
  • data (Union[Tensor, ndarray]) – input data to convert to percentile in [0,1].

  • dim (int) – the dimension along which to compute the quantiles

Returns

percentile – torch.Tensor or numpy.ndarray (depending on the input type) with the same shape as the input, containing the percentile values. A percentile of 0.9 means that 90% of the input values were smaller.

Return type

Union[Tensor, ndarray]
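
Note

One common way to implement this is a double argsort; a minimal sketch assuming a torch.Tensor input (the helper name is hypothetical):

def get_percentile_sketch(data, dim):
    # Rank each entry along dim, then normalize the ranks to [0.0, 1.0].
    ranks = data.argsort(dim=dim).argsort(dim=dim)
    return ranks.float() / (data.shape[dim] - 1)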

get_z_score(x, dim)[source]

Standardize vector by removing the mean and scaling to unit variance

Parameters
  • x (Tensor) – torch.Tensor

  • dim (int) – the dimension along which to compute the mean and std

Returns

The z-score, i.e. \(z = (x - mean) / std\)

Return type

Tensor
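
Note

The formula maps directly to code; a minimal sketch assuming a torch.Tensor input (the helper name is hypothetical):

def get_z_score_sketch(x, dim):
    # Standardize along dim: subtract the mean and divide by the std.
    mean = x.mean(dim=dim, keepdim=True)
    std = x.std(dim=dim, keepdim=True)
    return (x - mean) / std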

inverse_one_hot(image_in, bg_label=-1, dim=-3, threshold=0.1)[source]

Takes a float tensor and computes the argmax and max value along the specified dimension. Returns an integer tensor of the same shape as the input but with dim removed. If the max value is less than the threshold, the bg_label is assigned.

Note

It can take an image of size \((C, W, H)\) and generate an integer mask of size \((W, H)\). This operation can be thought of as the inverse of the one-hot operation, which takes an integer tensor of size (*) and returns a float tensor with an extra dimension, for example (*, num_classes).

Parameters
  • image_in – any float tensor

  • bg_label (int) – the value assigned to entries whose max value is smaller than the threshold

  • dim (int) – the dimension along which to compute the max. For images this is usually the channel dimension, i.e. -3.

  • threshold (float) – the threshold value. Entries whose max value is smaller than this are assigned to the background label

Returns

out – an integer mask with the same size as the input tensor but with the dim removed.
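
Note

The described behavior maps to a few lines of torch; a minimal sketch (the helper name is hypothetical):

import torch

def inverse_one_hot_sketch(image_in, bg_label=-1, dim=-3, threshold=0.1):
    # Argmax and max value along the class dimension.
    max_values, argmax = torch.max(image_in, dim=dim)
    # Entries with a weak maximum fall back to the background label.
    return torch.where(max_values < threshold,
                       torch.full_like(argmax, bg_label),
                       argmax)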