pairwise_distances#

torchdr.pairwise_distances(X: Tensor | DataLoader, Y: Tensor | None = None, metric: str = 'euclidean', backend: str | FaissConfig | None = None, exclude_diag: bool = False, k: int | None = None, return_indices: bool = False, device: str = 'auto', distributed_ctx: DistributedContext | None = None)[source]#

Compute pairwise distances between two tensors or from a DataLoader.

This is the main distance computation function, supporting multiple backends for efficient computation. It can compute:

  • Full pairwise distance matrices between X and Y (or X and itself)

  • k-nearest neighbor distances when k is specified

  • Distances with various metrics (euclidean, manhattan, angular, etc.)

When X is a DataLoader, data is streamed to build the FAISS index incrementally, avoiding the need to hold the full dataset in CPU RAM. This is particularly useful for large datasets that don’t fit in memory.

For computing distances between specific indexed subsets, use pairwise_distances_indexed instead.

Parameters:
  • X (torch.Tensor of shape (n_samples, n_features) or DataLoader) – Input data. When a DataLoader is provided:

    - It must have shuffle=False for deterministic iteration

    - It must yield tensors of shape (batch_size, n_features)

    - The k parameter is required (only k-NN computation is supported)

    - Y must be None (self-distances only)

  • Y (torch.Tensor of shape (m_samples, n_features), optional) – Input data. If None, Y is set to X. Not supported with DataLoader input.

  • metric (str, optional) – Metric to use, e.g. “euclidean”, “manhattan”, or “angular”. Default is “euclidean”.

  • backend ({'keops', 'faiss', None} or FaissConfig, optional) – Backend to use for computation:

    - “keops”: use KeOps for memory-efficient symbolic computations

    - “faiss”: use FAISS for fast k-NN computations with default settings

    - FaissConfig object: use FAISS with custom configuration

    - None (default): use standard PyTorch operations

    DataLoader input forces the FAISS backend.

  • exclude_diag (bool, optional) – Whether to exclude the diagonal from the distance matrix. Only used when k is not None. Default is False.

  • k (int, optional) – If not None, return only the k-nearest neighbors. Required when using DataLoader input.

  • return_indices (bool, optional) – Whether to return the indices of the k-nearest neighbors. Default is False.

  • device (str, default="auto") – Device to use for computation. If “auto”, keeps data on its current device. Otherwise, temporarily moves data to specified device for computation. Output remains on the computation device.

  • distributed_ctx (DistributedContext, optional) – Distributed computation context for multi-GPU scenarios. When provided:

    - Each GPU computes distances for its assigned chunk of rows

    - k must be specified (sparse computation)

    - The backend is forced to “faiss” if not already set

    - Results remain distributed (no gathering across GPUs)

    Default is None (single-GPU computation).

Returns:

  • C (torch.Tensor) – Pairwise distances.

  • indices (torch.Tensor, optional) – Indices of the k-nearest neighbors. Only returned if return_indices is True and k is not None.

Examples

>>> import torch
>>> from torchdr.distance import pairwise_distances, FaissConfig
>>> # Basic usage with tensor
>>> X = torch.randn(1000, 128)
>>> distances = pairwise_distances(X, k=10, backend='faiss')
>>> # Using DataLoader for memory-efficient computation
>>> from torch.utils.data import DataLoader, TensorDataset
>>> dataset = TensorDataset(torch.randn(100000, 128))
>>> dataloader = DataLoader(dataset, batch_size=10000, shuffle=False)
>>> distances, indices = pairwise_distances(
...     dataloader, k=15, return_indices=True
... )
>>> # DataLoader with multi-GPU (after torch.distributed.init_process_group)
>>> from torchdr.distributed import DistributedContext
>>> dist_ctx = DistributedContext()
>>> distances, indices = pairwise_distances(
...     dataloader, k=15, distributed_ctx=dist_ctx, return_indices=True
... )
>>> # Each GPU gets its chunk of results
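To build intuition for the shape contract described above (a full (n_samples, m_samples) matrix, or (n_samples, k) distances and indices when k is given), here is a minimal plain-PyTorch sketch of the same idea using torch.cdist and torch.topk. This is an illustration, not torchdr's actual implementation, and the library's euclidean metric may differ in convention (e.g. squared vs. unsquared distances):

```python
import torch

# Dense pairwise euclidean distances: C[i, j] = ||X[i] - Y[j]||_2,
# analogous to pairwise_distances(X, Y) with the default backend.
X = torch.randn(100, 16)
Y = torch.randn(50, 16)
C = torch.cdist(X, Y)  # shape (100, 50)

# k-NN selection on top of the full matrix, analogous to passing
# k and return_indices=True: for each row of X, keep the k smallest
# distances and their column indices into Y.
k = 5
knn_distances, knn_indices = torch.topk(C, k, dim=1, largest=False)
# knn_distances: shape (100, 5); knn_indices: shape (100, 5)
```

The FAISS and KeOps backends avoid materializing the full (n, m) matrix that this sketch builds, which is what makes them preferable at scale.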