Miscellaneous Utilities
These are miscellaneous helper functions and classes, grouped into dictionary utilities and validation utilities.
Dictionary utilities
These are utilities for manipulating dictionaries. See notebook4 for an example of why/how to use them.
- are_dicts_equal(dict1, dict2, keys_to_include=None, keys_to_exclude=None)
Compare two dictionaries. Returns True if all entries are identical.
- Parameters
dict1 (dict) – first dictionary to compare
dict2 (dict) – second dictionary to compare
keys_to_include (Optional[List[str]]) – list of keys to use for the comparison. If None (default), the union of the keys in the two dictionaries is used.
keys_to_exclude (Optional[List[str]]) – list of keys to exclude. If None (default), no keys are excluded.
- Returns
result – True if all the entries corresponding to keys_to_include are identical.
Note
float(1.0) is considered different from int(1).
- Return type
bool
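The comparison rules above can be illustrated with a small stand-alone sketch. This is not the library's implementation: `dicts_equal_sketch` is a hypothetical name, and the type check is only applied to top-level values, which is an assumption.

```python
from typing import Any, Dict, Iterable, Optional


def dicts_equal_sketch(
    dict1: Dict[str, Any],
    dict2: Dict[str, Any],
    keys_to_include: Optional[Iterable[str]] = None,
    keys_to_exclude: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical re-implementation of the documented behavior."""
    # Default to the union of the keys of both dictionaries.
    if keys_to_include is None:
        keys = set(dict1) | set(dict2)
    else:
        keys = set(keys_to_include)
    keys -= set(keys_to_exclude or [])

    missing = object()  # sentinel for a key absent from one dictionary
    for key in keys:
        v1, v2 = dict1.get(key, missing), dict2.get(key, missing)
        # Require matching types, so that 1.0 (float) differs from 1 (int).
        if type(v1) is not type(v2) or v1 != v2:
            return False
    return True


a = {"x": 1, "y": "text", "z": [1, 2]}
b = {"x": 1.0, "y": "text", "z": [1, 2]}
print(dicts_equal_sketch(a, b))                         # False
print(dicts_equal_sketch(a, b, keys_to_exclude=["x"]))  # True
```

Excluding the key whose values differ only by type makes the two dictionaries compare equal.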
- concatenate_list_of_dict(list_of_dict)
Concatenate dictionaries that share the same set of keys.
- Parameters
list_of_dict – list of dictionaries to concatenate
- Returns
output_dict – the concatenated dictionary
- Return type
dict
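As a rough illustration of key-by-key concatenation, the sketch below handles list-valued entries only. The name `concatenate_sketch` is hypothetical, and the error raised on mismatched keys is an assumption rather than documented behavior.

```python
from typing import Dict, List


def concatenate_sketch(list_of_dict: List[Dict[str, list]]) -> Dict[str, list]:
    """Hypothetical sketch: concatenate list-valued entries key by key."""
    if not list_of_dict:
        return {}
    keys = set(list_of_dict[0])
    output: Dict[str, list] = {key: [] for key in list_of_dict[0]}
    for d in list_of_dict:
        # Assumption: dictionaries with a different key set are rejected.
        if set(d) != keys:
            raise ValueError("all dictionaries must share the same set of keys")
        for key, values in d.items():
            output[key].extend(values)
    return output


batches = [{"id": [0, 1], "label": ["a", "b"]}, {"id": [2], "label": ["c"]}]
merged = concatenate_sketch(batches)
print(merged)  # {'id': [0, 1, 2], 'label': ['a', 'b', 'c']}
```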
- flatten_dict(input_dict, separator='_', prefix='')
Flatten a (possibly nested) dictionary.
- Parameters
input_dict (dict) – the input dictionary to flatten
separator (str) – string used to merge nested keys. It defaults to “_”.
prefix (str) – used in the recursive calls. Do not set manually.
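A minimal sketch of recursive flattening is given below. The name `flatten_sketch` is hypothetical, and the exact form of the merged keys (parent key, separator, child key) is an assumption based on the parameter descriptions above.

```python
def flatten_sketch(input_dict: dict, separator: str = "_", prefix: str = "") -> dict:
    """Hypothetical sketch: merge nested keys into a single level."""
    output = {}
    for key, value in input_dict.items():
        # Prepend the accumulated prefix, if any, to the current key.
        new_key = f"{prefix}{separator}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the merged key along.
            output.update(flatten_sketch(value, separator, new_key))
        else:
            output[new_key] = value
    return output


nested = {"optimizer": {"lr": 0.01, "betas": {"b1": 0.9}}, "epochs": 10}
flat = flatten_sketch(nested)
print(flat)
# {'optimizer_lr': 0.01, 'optimizer_betas_b1': 0.9, 'epochs': 10}
```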
- inspect_dict(d, prefix='')
Inspect the content of the dictionary.
- Parameters
d – the dictionary to inspect
prefix (str) – used recursively in case of nested dictionaries. Do not set it directly.
- sort_dict_according_to_indices(input_dict, list_of_indices)
Sort dictionaries w.r.t. a list of indices.
- Parameters
input_dict (dict) – the dictionary to sort
list_of_indices (List[int]) – the indices to use in the sorting.
- Returns
output_dict – the sorted dictionary.
Example
>>> input_dict = {'key': ['b', 'c', 'a']}
>>> list_of_indices = [3, 1, 2]
>>> output_dict = sort_dict_according_to_indices(input_dict, list_of_indices)
>>> print(output_dict)  # will be a,b,c
- Return type
dict
- subset_dict(input_dict, mask)
Subset all the elements of a dictionary according to a mask.
- Parameters
input_dict (dict) – dictionary whose entries are lists, numpy.ndarray or torch.Tensor objects sharing the same leading dimension (N)
mask (Tensor) – boolean tensor of shape (N).
- Returns
output_dict – a new dictionary with the subset values
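The masking logic can be sketched as follows. To keep the example light it uses NumPy arrays in place of torch tensors, which is an assumption; `subset_sketch` is a hypothetical name, not the library function.

```python
import numpy as np


def subset_sketch(input_dict: dict, mask) -> dict:
    """Hypothetical sketch: keep only the entries where the mask is True."""
    mask = np.asarray(mask, dtype=bool)
    output = {}
    for key, value in input_dict.items():
        if len(value) != mask.shape[0]:
            raise ValueError(f"entry '{key}' does not have leading dimension N")
        if isinstance(value, list):
            # Lists are filtered element by element.
            output[key] = [v for v, keep in zip(value, mask) if keep]
        else:
            # Arrays are filtered along their leading dimension.
            output[key] = np.asarray(value)[mask]
    return output


data = {"name": ["a", "b", "c"], "xy": np.array([[0, 0], [1, 1], [2, 2]])}
subset = subset_sketch(data, np.array([True, False, True]))
print(subset["name"])         # ['a', 'c']
print(subset["xy"].tolist())  # [[0, 0], [2, 2]]
```

The input dictionary is left unchanged in this sketch; a new dictionary is returned.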
- subset_dict_non_overlapping_patches(input_dict, key_tissue, key_patch_xywh='patches_xywh', iom_threshold=0.0)
Subset a dictionary containing overlapping patches to a smaller dictionary containing only (weakly) overlapping ones.
- Parameters
input_dict (dict) – the dictionary to subset.
key_tissue (str) – the dictionary key corresponding to the tissue identifier.
key_patch_xywh (str) – the dictionary key corresponding to the coordinates (i.e. x, y, w, h) of the patches.
iom_threshold (float) – threshold value for Intersection Over Minimum (IoM). If two patches have \(\text{IoM} > \text{threshold}\), only one will survive the filtering process. Set iom_threshold = 0 to obtain a collection of strictly non-overlapping patches.
- Returns
output_dict – Dictionary containing only patches with overlap less than threshold.
Note
The original dictionary will NOT be overwritten.
- Return type
dict
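For reference, Intersection Over Minimum divides the area of the overlap between two patches by the area of the smaller patch. The sketch below computes it for two axis-aligned patches; the function name is hypothetical, and treating (x, y) as a corner of the patch is an assumption.

```python
from typing import Tuple

Patch = Tuple[float, float, float, float]  # (x, y, w, h), with (x, y) a corner (assumed)


def intersection_over_minimum(p1: Patch, p2: Patch) -> float:
    """IoM = area(p1 intersect p2) / min(area(p1), area(p2))."""
    x1, y1, w1, h1 = p1
    x2, y2, w2, h2 = p2
    # Overlap extent along each axis, clipped at zero for disjoint patches.
    overlap_w = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    overlap_h = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    min_area = min(w1 * h1, w2 * h2)
    return (overlap_w * overlap_h) / min_area if min_area > 0 else 0.0


print(intersection_over_minimum((0, 0, 10, 10), (5, 5, 10, 10)))  # 0.25
print(intersection_over_minimum((0, 0, 10, 10), (2, 2, 4, 4)))    # 1.0
print(intersection_over_minimum((0, 0, 10, 10), (20, 20, 5, 5)))  # 0.0
```

A small patch fully contained in a larger one has IoM equal to 1, which is why IoM is stricter than intersection over union for patches of different sizes.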
- transfer_annotations_between_dict(source_dict, dest_dict, annotation_keys, anchor_key, metric='euclidean')
Transfer the annotations from the source dictionary to the destination dictionary. For each element in the destination dictionary, it finds the closest element in the source dictionary and copies the annotations from there. Closeness is measured by the chosen metric between the anchor elements.
- Parameters
source_dict (dict) – source dictionary from which the annotations will be read
dest_dict (dict) – destination dictionary where the annotations will be written
annotation_keys (List[Any]) – list of keys. It is assumed that these keys are present in the source dictionary.
anchor_key (Any) – the key of the element to be used to measure distances. It must be present in BOTH the source and destination dictionaries.
metric (str) – the distance metric used to measure the distance between elements in the source and destination dictionaries. It defaults to ‘euclidean’.
- Returns
dict – The updated destination dictionary
- Return type
dict
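The nearest-neighbor transfer can be sketched in pure Python for the Euclidean case. `transfer_sketch` is a hypothetical name, the brute-force search is chosen for clarity only, and returning a shallow copy rather than modifying the destination in place is an assumption.

```python
import math
from typing import Any, Dict, List


def transfer_sketch(
    source: Dict[Any, list],
    dest: Dict[Any, list],
    annotation_keys: List[Any],
    anchor_key: Any,
) -> Dict[Any, list]:
    """Hypothetical sketch: copy annotations from the closest source element."""
    output = dict(dest)  # shallow copy, so the input dictionary is not modified
    anchors = source[anchor_key]
    # For every destination point, find the index of the closest source point.
    nearest = [
        min(range(len(anchors)), key=lambda j: math.dist(point, anchors[j]))
        for point in dest[anchor_key]
    ]
    for key in annotation_keys:
        output[key] = [source[key][j] for j in nearest]
    return output


source = {"xy": [(0.0, 0.0), (10.0, 10.0)], "cell_type": ["A", "B"]}
dest = {"xy": [(1.0, 1.0), (9.0, 8.0), (0.5, 0.0)]}
result = transfer_sketch(source, dest, ["cell_type"], "xy")
print(result["cell_type"])  # ['A', 'B', 'A']
```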
Validation utilities
These are utilities used during validation to analyze the embeddings. See notebook4 for an example of why/how to use them.
- class SmartUmap(preprocess_strategy, compute_all_pairwise_distances=False, **kargs)
Wrapper around standard UMAP with get_graph() exposed.
- __init__(preprocess_strategy, compute_all_pairwise_distances=False, **kargs)
- Parameters
preprocess_strategy (str) – can be ‘center’, ‘z_score’ or ‘raw’. This is the operation to perform before UMAP.
compute_all_pairwise_distances (bool) – if True (default is False), compute all pairwise distances.
**kargs – all the arguments that standard UMAP can accept.
- fit(data, y=None)
Fit the UMAP given the data.
- Parameters
data – array of shape \((n, p)\) where n is the number of points and p the number of features
- Return type
- fit_transform(data, y=None)
Utility method which internally calls fit() and transform().
- Return type
ndarray
- get_distances()
Returns the symmetric (dense) matrix with the DISTANCES between elements
- Return type
Tensor
- class SmartLeiden(graph, directed=True)
Wrapper around the standard Leiden algorithm. It can be initialized using the output of SmartUmap.get_graph().
- __init__(graph, directed=True)
- Parameters
graph (coo_matrix) – usually a sparse matrix with the similarities among nodes describing the graph
directed (bool) – if True (default), builds a directed graph.
Note
The matrix obtained by the UMAP algorithm is symmetric; in that case directed should be set to True.
- cluster(resolution=1.0, use_weights=True, random_state=0, n_iterations=-1, partition_type='RBC')
Find the clusters in the data.
- Parameters
resolution (float) – resolution parameter controlling (indirectly) the number of clusters
use_weights (bool) – if True (default), the graph is weighted, i.e. the edges have different strengths
random_state (int) – controls the random state, for reproducibility
n_iterations (int) – how many iterations of the greedy algorithm to perform. If -1 (default), it iterates until convergence.
partition_type (str) – the metric to optimize to find clusters. Either ‘CPM’ or ‘RBC’.
- Returns
labels – the integer cluster labels
- Return type
ndarray
- class SmartPca(preprocess_strategy)
Return the PCA embeddings.
- __init__(preprocess_strategy)
- Parameters
preprocess_strategy (str) – can be ‘center’, ‘z_score’ or ‘raw’. This is the operation to perform before PCA.
- property explained_variance_
For compatibility with scikit-learn.
- property explained_variance_ratio_
For compatibility with scikit-learn.
- fit(data)
Fit the PCA given the data. It automatically selects the algorithm based on the number of features.
- Parameters
data – array of shape \((n, p)\) where n is the number of points and p the number of features
- Return type
- fit_transform(data, n_components=None)
Utility method which internally calls fit() and transform().
- Parameters
data – tensor of shape \((n, p)\)
n_components (Union[int, float, None]) – if an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None (default), the previously used value is reused.
- Returns
data_transformed – array of shape \((n, q)\)
- Return type
ndarray
- transform(data, n_components=None)
Use a previously fitted model to transform the data.
- Parameters
data – tensor of shape \((n, p)\) where n is the number of points and p the number of features
n_components (Union[int, float, None]) – if an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None, the previously used value is reused.
- Return type
ndarray
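The float form of n_components can be read as a threshold on the cumulative explained variance ratio. The sketch below shows one way to turn that threshold into a number of components; `select_n_components` is a hypothetical helper, not part of SmartPca, and it assumes the ratios are sorted in decreasing order.

```python
import numpy as np


def select_n_components(explained_variance_ratio, n_components) -> int:
    """Hypothetical helper: resolve n_components to an integer dimensionality."""
    ratio = np.asarray(explained_variance_ratio, dtype=float)
    if isinstance(n_components, int):
        return min(n_components, ratio.shape[0])
    if not 0.0 < n_components < 1.0:
        raise ValueError("a float n_components must lie in (0, 1)")
    cumulative = np.cumsum(ratio)
    # Index of the first component at which the cumulative ratio reaches the target.
    first = int(np.searchsorted(cumulative, n_components, side="left"))
    return min(first + 1, ratio.shape[0])


ratio = [0.5, 0.25, 0.125, 0.125]
print(select_n_components(ratio, 0.8))   # 3
print(select_n_components(ratio, 0.75))  # 2
print(select_n_components(ratio, 2))     # 2
```

With the ratios above, two components explain 75% of the variance, so a target of 0.8 requires a third one.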
- class SmartScaler(quantiles, clamp)
Scale the values using the median and quantiles (which are robust versions of the mean and variance): \(data = (data - median) / scale\)
If clamp=True, each feature is clamped to the quantile range before applying the transformation. This is a simple way to deal with outliers.
It does not handle outliers that lie inside the “box” of acceptable range but far from the reduced manifold. See the situation shown below:
x x
x x
x x o
x x
- __init__(quantiles, clamp)
- Parameters
quantiles (Tuple[float, float]) – the lowest and largest quantiles used to scale the data. Must be in (0.0, 1.0).
clamp (bool) – if True, the data is clamped into [q_low, q_high] before scaling.
- fit_transform(data)
Utility method which internally calls fit() and transform().
- Return type
ndarray
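The robust scaling described for SmartScaler can be sketched with NumPy. This is an illustration under assumptions, not the class itself: `robust_scale_sketch` is a hypothetical name, and taking the scale to be the difference between the two quantiles is an assumption, since the documentation only states \(data = (data - median) / scale\).

```python
import numpy as np


def robust_scale_sketch(data, quantiles=(0.25, 0.75), clamp=True):
    """Hypothetical sketch of median/quantile scaling, feature by feature."""
    data = np.asarray(data, dtype=float)
    q_low, q_high = np.quantile(data, quantiles, axis=0)
    median = np.median(data, axis=0)
    scale = q_high - q_low  # assumption: scale is the inter-quantile range
    if clamp:
        # Clamp each feature to its quantile range to tame outliers.
        data = np.clip(data, q_low, q_high)
    return (data - median) / scale


column = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
print(robust_scale_sketch(column, clamp=True).ravel().tolist())
# [-0.5, -0.5, 0.0, 0.5, 0.5]
print(robust_scale_sketch(column, clamp=False).ravel().tolist())
# [-1.0, -0.5, 0.0, 0.5, 49.0]
```

With clamping, the outlier 100 maps to the same value as the upper quantile. A feature whose two quantiles coincide would give a zero scale, which this sketch does not guard against.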
- compute_distance_embedding(ref_embeddings, other_embeddings, metric, temperature=0.5)
Compute the distance between embeddings.
- Parameters
ref_embeddings (Tensor) – torch.Tensor of shape \((*, k)\) where k is the dimension of the embedding
other_embeddings (Tensor) – torch.Tensor of shape \((n, k)\)
metric (str) – can be either ‘contrastive’ or ‘euclidean’
temperature (float) – the temperature used to compute the contrastive distance
- Returns
dist – distance of shape \((*, n)\)
- Return type
Tensor
- get_percentile(data, dim)
Takes some data and converts it into a percentile (in [0.0, 1.0]) along a specified dimension. Useful to convert a tensor into the range [0.0, 1.0] for visualization.
- Parameters
data (Union[Tensor, ndarray]) – input data to convert to percentile in [0, 1].
dim (int) – the dimension along which to compute the quantiles
- Returns
percentile – torch.tensor or numpy.array (depending on the input type) with the same shape as the input with the percentile values. A percentile of 0.9 means that 90% of the input values were smaller.
- Return type
Union[Tensor, ndarray]
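One way to obtain such percentiles is to rank the values along the chosen dimension. The sketch below is NumPy-only and hypothetical (`percentile_sketch` is not the library function); dividing the rank by the number of elements is one of several possible conventions and is an assumption here, as is the arbitrary handling of ties.

```python
import numpy as np


def percentile_sketch(data, dim: int) -> np.ndarray:
    """Hypothetical sketch: fraction of values along `dim` that are smaller."""
    data = np.asarray(data)
    # A double argsort gives the rank of each element along the dimension.
    ranks = np.argsort(np.argsort(data, axis=dim), axis=dim)
    return ranks / data.shape[dim]


values = np.array([[10, 40, 20, 30]])
print(percentile_sketch(values, dim=1).tolist())  # [[0.0, 0.75, 0.25, 0.5]]
```

The value 40 gets 0.75 because three of the four entries in its row are smaller.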
- get_z_score(x, dim)
Standardize a vector by removing the mean and scaling to unit variance.
- Parameters
x (Tensor) – torch.Tensor
dim (int) – the dimension along which to compute the mean and std
- Return type
Tensor
- Returns
The z-score, i.e. z = (x - mean) / std
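The formula z = (x - mean) / std can be checked on a small array. The sketch uses NumPy rather than torch and a hypothetical name, `z_score_sketch`; using the sample standard deviation (ddof=1), which mirrors the default of torch.std, is an assumption about the original function.

```python
import numpy as np


def z_score_sketch(x, dim: int) -> np.ndarray:
    """Hypothetical sketch: standardize along one dimension."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=dim, keepdims=True)
    # Assumption: sample standard deviation, as in torch.std's default.
    std = x.std(axis=dim, keepdims=True, ddof=1)
    return (x - mean) / std


x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
z = z_score_sketch(x, dim=1)
print(z.tolist())  # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```

Both rows map to the same z-scores even though their scales differ by a factor of ten.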
- inverse_one_hot(image_in, bg_label=-1, dim=-3, threshold=0.1)
Takes a float tensor and computes the argmax and max_value along the specified dimension. Returns an integer tensor of the same shape as the input tensor but with the dim removed. If the max_value is less than the threshold, the bg_label is assigned.
Note
It can take an image of size \((C, W, H)\) and generate an integer mask of size \((W, H)\). This operation can be thought of as the inverse of the one-hot operation, which takes an integer tensor of size (*) and returns a float tensor with an extra dimension, for example (*, num_classes).
- Parameters
image_in – any float tensor
bg_label (int) – the value assigned to the entries whose maximum value is smaller than the threshold
dim (int) – the dimension along which to compute the max. For images this is usually the channel dimension, i.e. -3.
threshold (float) – the value of the threshold. Values smaller than this are assigned to the background.
- Returns
out – An integer mask with the same size of the input tensor but with the dim removed.
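The argmax-with-threshold logic can be sketched in a few lines. The example uses NumPy instead of torch, and `inverse_one_hot_sketch` is a hypothetical name for illustration only.

```python
import numpy as np


def inverse_one_hot_sketch(image_in, bg_label: int = -1, dim: int = -3,
                           threshold: float = 0.1) -> np.ndarray:
    """Hypothetical sketch: channel-wise argmax with a background label."""
    image_in = np.asarray(image_in, dtype=float)
    labels = image_in.argmax(axis=dim)    # index of the strongest channel
    max_value = image_in.max(axis=dim)    # strength of that channel
    labels[max_value < threshold] = bg_label
    return labels


# Image of shape (C=2, W=2, H=2).
image = np.array([
    [[0.9, 0.05], [0.2, 0.0]],   # channel 0
    [[0.1, 0.02], [0.7, 0.0]],   # channel 1
])
mask = inverse_one_hot_sketch(image)
print(mask.tolist())  # [[0, -1], [1, -1]]
```

Pixels where no channel reaches the threshold receive the background label, and the channel dimension is removed from the output.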