ExactIncrementalPCA#

class torchdr.ExactIncrementalPCA(n_components: int = 2, device: str = 'auto', distributed: str | bool = 'auto', verbose: bool = False, random_state: float | None = None, **kwargs)[source]#

Bases: DRModule

Exact Incremental Principal Component Analysis.

This implementation computes the exact PCA solution by incrementally building the covariance matrix X.T @ X (of the centered data) in batches. This is memory-efficient when the number of features is small, since only the (n_features, n_features) covariance matrix needs to be stored, never the full dataset.

Unlike IncrementalPCA, which uses an approximate incremental SVD algorithm, this method computes the exact PCA solution but requires two passes through the data: one to compute the mean, and one to build the covariance matrix.
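
The two-pass computation can be sketched in a few lines (a minimal illustration of the equivalence, not the library's internal code):

import torch

# Toy data: n_samples x n_features, processed in chunks
X = torch.randn(10_000, 20, dtype=torch.float64)
batches = X.split(1_000)

# Pass 1: accumulate the per-feature sum to obtain the global mean
mean = sum(b.sum(dim=0) for b in batches) / X.shape[0]

# Pass 2: accumulate the centered scatter matrix, one (d, d) update per batch
cov = torch.zeros(20, 20, dtype=torch.float64)
for b in batches:
    b_centered = b - mean
    cov += b_centered.T @ b_centered

# The batched accumulation matches the full-data computation
X_centered = X - X.mean(dim=0)
assert torch.allclose(cov, X_centered.T @ X_centered)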

Parameters:
  • n_components (int, default=2) – Number of components to keep.

  • device (str, default="auto") – Device on which the computations are performed.

  • distributed (str or bool, default="auto") –

Whether to use distributed mode for multi-GPU computation.

    • ”auto”: Automatically detect if torch.distributed is initialized and use distributed mode if available.

    • True: Force distributed mode (requires torch.distributed to be initialized).

    • False: Disable distributed mode.

In distributed mode, each GPU computes local statistics which are then aggregated using all-reduce operations. This is communication-efficient when the number of samples is much larger than the number of features (n >> d); see the sketch in the Notes below.

  • verbose (bool, default=False) – Whether to print information during the computations.

  • random_state (float, default=None) – Random seed for reproducibility.

mean_#

Per-feature empirical mean, calculated from the training set.

Type:

torch.Tensor of shape (n_features,)

components_#

Principal axes in feature space, representing the directions of maximum variance in the data.

Type:

torch.Tensor of shape (n_components, n_features)

explained_variance_#

The amount of variance explained by each of the selected components.

Type:

torch.Tensor of shape (n_components,)

n_samples_seen_#

The number of samples processed.

Type:

int

n_features_in_#

Number of features seen during fit.

Type:

int

Notes

When to use each incremental PCA variant:

  • IncrementalPCA: Use when you need single-pass processing, can tolerate approximate results, or have high-dimensional data where storing the full covariance matrix would be prohibitive.

  • ExactIncrementalPCA: Use when you need exact PCA results, have low-dimensional data (small n_features), and can afford two passes through the data.

In distributed mode:

  • Requires torch.distributed to be initialized (use torchrun or the TorchDR CLI)

  • Automatically uses local_rank for GPU assignment

  • Each GPU only needs its data chunk in memory

  • Uses covariance aggregation: O(d) communication for mean, O(d^2) for covariance

  • Mathematically equivalent to running on concatenated data from all GPUs
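
A minimal sketch of this aggregation pattern, assuming torch.distributed is already initialized (the helper name below is illustrative, not part of the TorchDR API):

import torch
import torch.distributed as dist

def distributed_mean_and_scatter(X_local):
    # X_local: this rank's (n_local, d) chunk of the data.
    # O(d) communication: all-reduce the per-feature sum and the sample count.
    local_sum = X_local.sum(dim=0)
    count = torch.tensor([float(X_local.shape[0])], device=X_local.device)
    dist.all_reduce(local_sum)  # default op is SUM
    dist.all_reduce(count)
    mean = local_sum / count

    # O(d^2) communication: all-reduce the local centered scatter matrix.
    X_centered = X_local - mean
    scatter = X_centered.T @ X_centered
    dist.all_reduce(scatter)
    return mean, scatter  # identical on every rank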

Examples

Using a PyTorch DataLoader for large datasets:

from torch.utils.data import DataLoader, TensorDataset
from torchdr.spectral_embedding import ExactIncrementalPCA

# Create a DataLoader for a huge dataset
dataset = TensorDataset(huge_X_tensor)
dataloader = DataLoader(dataset, batch_size=1000, shuffle=False)

# Initialize the model
pca = ExactIncrementalPCA(n_components=50, device='cuda')

# First pass: compute the mean (the DataLoader can be passed directly;
# tuple batches are handled, only X is used)
pca.compute_mean(dataloader)

# Second pass: fit the model (the DataLoader is iterated again)
pca.fit(dataloader)

# Transform new data
test_loader = DataLoader(test_dataset, batch_size=1000)
transformed_batches = []
for batch in test_loader:
    X_batch = batch[0]
    X_transformed = pca.transform(X_batch)
    transformed_batches.append(X_transformed)

Using data generators for streaming:

import torch
from torchdr.spectral_embedding import ExactIncrementalPCA

# Generate large dataset that doesn't fit in memory
def data_generator():
    for i in range(100):  # 100 batches
        yield torch.randn(1000, 50)  # 1000 samples, 50 features

# First pass: compute mean
pca = ExactIncrementalPCA(n_components=10)
pca.compute_mean(data_generator())

# Second pass: fit the model
pca.fit(data_generator())

# Transform new data
X_new = torch.randn(100, 50)
X_transformed = pca.transform(X_new)
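
Note that data_generator() is called once per pass: a Python generator can be consumed only once, so each pass needs a fresh iterator (lists and DataLoaders, by contrast, can be iterated repeatedly).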

Multi-GPU distributed usage (launch with torchrun --nproc_per_node=4):

import torch
from torchdr import ExactIncrementalPCA

# Initialize the process group (torchrun supplies the rank/world-size env vars)
torch.distributed.init_process_group(backend="nccl")

# Each GPU loads its chunk of the data
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
chunk_size = len(full_data) // world_size
X_local = full_data[rank * chunk_size:(rank + 1) * chunk_size]

# Create batches for incremental processing
batch_size = 1000
batches = [X_local[i:i+batch_size] for i in range(0, len(X_local), batch_size)]

# Distributed PCA - handles communication automatically
pca = ExactIncrementalPCA(n_components=50, distributed="auto")
pca.compute_mean(batches)  # First pass: compute global mean
pca.fit(batches)           # Second pass: build global covariance
X_transformed = pca.transform(X_local)  # Transform local data
compute_mean(X_batches)[source]#

Compute the mean from batches of data (first pass).

In distributed mode, each GPU computes its local sum and sample count, then all-reduce is used to compute the global mean.

Parameters:

X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, a DataLoader, or a single tensor. Each batch should have shape (n_samples, n_features). DataLoader batches can be tuples (X, y); only X will be used.

Returns:

self – Returns the instance itself.

Return type:

ExactIncrementalPCA
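
For instance, a DataLoader over a labeled dataset can be passed directly (a small usage sketch; only X is read from each (X, y) tuple):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torchdr.spectral_embedding import ExactIncrementalPCA

# The DataLoader yields (X, y) tuples; the labels are ignored
ds = TensorDataset(torch.randn(4000, 20), torch.randint(0, 10, (4000,)))
loader = DataLoader(ds, batch_size=500, shuffle=False)

pca = ExactIncrementalPCA(n_components=5)
pca.compute_mean(loader)  # first pass over the loader
pca.fit(loader)           # second pass over the loader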

fit(X_batches, y=None)[source]#

Fit the model with batches of samples.

This method assumes the mean has already been computed using compute_mean(). If the mean has not been computed, fit will compute it first (requiring two passes through the data).

In distributed mode, each GPU computes its local covariance contribution, then all-reduce is used to compute the global covariance matrix before eigendecomposition.

Parameters:
  • X_batches (iterable of torch.Tensor, DataLoader, or single torch.Tensor) – Either an iterable yielding batches of data, a DataLoader, or a single tensor. Each batch should have shape (n_samples, n_features). DataLoader batches can be tuples (X, y); only X will be used.

  • y (None) – Ignored. Present for API consistency.

Returns:

self – Returns the instance itself.

Return type:

ExactIncrementalPCA
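
Conceptually, the second pass reduces to accumulating the centered scatter matrix and taking the top eigenvectors of the covariance. A sketch, assuming the mean is already known and an unbiased 1/(n - 1) normalization (the library's internals may differ):

import torch

def exact_pca_from_batches(batches, mean, n_components):
    d = mean.shape[0]
    scatter = torch.zeros(d, d, dtype=mean.dtype)
    n = 0
    for X in batches:
        X_centered = X - mean
        scatter += X_centered.T @ X_centered
        n += X.shape[0]
    cov = scatter / (n - 1)
    # eigh returns eigenvalues in ascending order; keep the largest ones
    eigvals, eigvecs = torch.linalg.eigh(cov)
    components = eigvecs[:, -n_components:].flip(-1).T    # (k, d)
    explained_variance = eigvals[-n_components:].flip(0)  # (k,)
    return components, explained_variance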

fit_transform(X_batches, y=None)[source]#

Fit the model and transform the data.

Parameters:
  • X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor.

  • y (None) – Ignored. Present for API consistency.

Returns:

X_transformed – Transformed data (concatenated from all batches).

Return type:

torch.Tensor

partial_fit(X: Tensor)[source]#

Incrementally fit the model with a batch of samples.

This method assumes the mean has already been computed using compute_mean(). Accumulates X_centered.T @ X_centered for one batch.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – Training batch.

Returns:

self – Returns the instance itself.

Return type:

ExactIncrementalPCA

set_fit_request(*, X_batches: bool | None | str = '$UNCHANGED$') ExactIncrementalPCA#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_batches (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_batches parameter in fit.

Returns:

self – The updated object.

Return type:

object

transform(X: Tensor) Tensor[source]#

Apply dimensionality reduction on X.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – Data to transform.

Returns:

X_transformed – Transformed data.

Return type:

torch.Tensor of shape (n_samples, n_components)
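
Given the fitted attributes documented above, the projection is conceptually equivalent to:

# (n_samples, d) - (d,) broadcast, then (n_samples, d) @ (d, k)
X_transformed = (X - pca.mean_) @ pca.components_.T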