ExactIncrementalPCA#
- class torchdr.ExactIncrementalPCA(n_components: int = 2, device: str = 'auto', distributed: str | bool = 'auto', verbose: bool = False, random_state: float | None = None, **kwargs)[source]#
Bases: DRModule
Exact Incremental Principal Component Analysis.
This implementation computes the exact PCA solution by incrementally accumulating the covariance matrix (X_centered.T @ X_centered) in batches. This is memory-efficient when the number of features is small, as only the (n_features, n_features) covariance matrix needs to be stored, not the full dataset.
Unlike IncrementalPCA which uses an approximate incremental SVD algorithm, this method computes the exact PCA solution but requires two passes through the data: one to compute the mean, and one to build the covariance matrix.
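To make the two-pass structure concrete, the sketch below reproduces the underlying computation with plain torch operations. All names are illustrative and the estimator's internal conventions (e.g. variance normalization) may differ:

import torch

# Minimal sketch of the exact two-pass computation (illustrative
# names, not the library's internal code).
d, n_components = 50, 10
batches = [torch.randn(1000, d) for _ in range(5)]

# Pass 1: accumulate per-feature sums to obtain the global mean.
total, n = torch.zeros(d), 0
for X in batches:
    total += X.sum(dim=0)
    n += X.shape[0]
mean = total / n

# Pass 2: accumulate only the (d, d) centered scatter matrix.
cov = torch.zeros(d, d)
for X in batches:
    Xc = X - mean
    cov += Xc.T @ Xc
cov /= n - 1

# Exact principal axes from an eigendecomposition of the covariance.
eigvals, eigvecs = torch.linalg.eigh(cov)            # ascending order
components = eigvecs[:, -n_components:].flip(-1).T   # (k, d), top axes first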
- Parameters:
n_components (int, default=2) – Number of components to keep.
device (str, default="auto") – Device on which the computations are performed.
distributed (str or bool, default="auto") –
Whether to use distributed mode for multi-GPU training.
"auto": Automatically detect if torch.distributed is initialized and use distributed mode if available.
True: Force distributed mode (requires torch.distributed to be initialized).
False: Disable distributed mode.
In distributed mode, each GPU computes local statistics which are then aggregated using all-reduce operations. This is communication-efficient when the number of samples is much larger than the number of features (n >> d).
verbose (bool, default=False) – Whether to print information during the computations.
random_state (float, default=None) – Random seed for reproducibility.
- mean_#
Per-feature empirical mean, calculated from the training set.
- Type:
torch.Tensor of shape (n_features,)
- components_#
Principal axes in feature space, representing the directions of maximum variance in the data.
- Type:
torch.Tensor of shape (n_components, n_features)
- explained_variance_#
The amount of variance explained by each of the selected components.
- Type:
torch.Tensor of shape (n_components,)
Notes
When to use each incremental PCA variant:
IncrementalPCA: Use when you need single-pass processing, can tolerate approximate results, or have high-dimensional data where storing the full covariance matrix would be prohibitive.
ExactIncrementalPCA: Use when you need exact PCA results, have low-dimensional data (small n_features), and can afford two passes through the data.
In distributed mode:
Requires torch.distributed to be initialized (use torchrun or TorchDR CLI)
Automatically uses local_rank for GPU assignment
Each GPU only needs its data chunk in memory
Uses covariance aggregation: O(d) communication for mean, O(d^2) for covariance
Mathematically equivalent to running on concatenated data from all GPUs
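A minimal sketch of this aggregation pattern with plain torch.distributed calls (illustrative; TorchDR performs the equivalent communication internally):

import torch
import torch.distributed as dist

# Assumes dist.init_process_group() has already been called and each
# rank holds its own (n_local, d) chunk of the data.
X_local = torch.randn(1000, 50)  # stand-in for this rank's chunk

# Global mean: all-reduce a d-vector and a scalar count, O(d).
local_sum = X_local.sum(dim=0)
count = torch.tensor(float(X_local.shape[0]))
dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
dist.all_reduce(count, op=dist.ReduceOp.SUM)
global_mean = local_sum / count

# Global covariance: all-reduce a (d, d) scatter matrix, O(d^2).
Xc = X_local - global_mean
local_scatter = Xc.T @ Xc
dist.all_reduce(local_scatter, op=dist.ReduceOp.SUM)
# Every rank now holds the same global scatter matrix, so a local
# eigendecomposition matches a single-process run on the full data.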
Examples
Using with PyTorch DataLoader for large datasets:
from torch.utils.data import DataLoader, TensorDataset
from torchdr.spectral_embedding import ExactIncrementalPCA

# Create a DataLoader for a huge dataset
dataset = TensorDataset(huge_X_tensor)
dataloader = DataLoader(dataset, batch_size=1000, shuffle=False)

# Initialize the model
pca = ExactIncrementalPCA(n_components=50, device='cuda')

# First pass: compute mean (a DataLoader can be passed directly;
# its (X,) batch tuples are unpacked automatically)
pca.compute_mean(dataloader)

# Second pass: fit the model
pca.fit(dataloader)

# Transform new data batch by batch
test_loader = DataLoader(test_dataset, batch_size=1000)
transformed_batches = []
for batch in test_loader:
    X_batch = batch[0]  # DataLoader returns tuples
    X_transformed = pca.transform(X_batch)
    transformed_batches.append(X_transformed)
Using with data generators for streaming:
import torch
from torchdr.spectral_embedding import ExactIncrementalPCA

# Generate a large dataset that doesn't fit in memory
def data_generator():
    for i in range(100):  # 100 batches
        yield torch.randn(1000, 50)  # 1000 samples, 50 features

# First pass: compute mean
pca = ExactIncrementalPCA(n_components=10)
pca.compute_mean(data_generator())

# Second pass: fit the model
pca.fit(data_generator())

# Transform new data
X_new = torch.randn(100, 50)
X_transformed = pca.transform(X_new)
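Note that each pass constructs a fresh generator: a Python generator is exhausted after one iteration, so the same generator object cannot be reused for the second pass.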
Multi-GPU distributed usage (launch with torchrun --nproc_per_node=4):
import torch
from torchdr import ExactIncrementalPCA

# Each GPU loads its chunk of the data
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
chunk_size = len(full_data) // world_size
X_local = full_data[rank * chunk_size:(rank + 1) * chunk_size]

# Create batches for incremental processing
batch_size = 1000
batches = [X_local[i:i+batch_size] for i in range(0, len(X_local), batch_size)]

# Distributed PCA - handles communication automatically
pca = ExactIncrementalPCA(n_components=50, distributed="auto")
pca.compute_mean(batches)  # First pass: compute global mean
pca.fit(batches)           # Second pass: build global covariance
X_transformed = pca.transform(X_local)  # Transform local data
- compute_mean(X_batches)[source]#
Compute the mean from batches of data (first pass).
In distributed mode, each GPU computes its local sum and sample count, then all-reduce is used to compute the global mean.
- Parameters:
X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, a DataLoader, or a single tensor. Each batch should have shape (n_samples, n_features). DataLoader batches can be tuples (X, y) - only X will be used.
- Returns:
self – Returns the instance itself.
- Return type:
ExactIncrementalPCA
- fit(X_batches, y=None)[source]#
Fit the model with batches of samples.
This method assumes the mean has already been computed using compute_mean(). If the mean has not been computed, it is computed first (requiring two passes through the data).
In distributed mode, each GPU computes its local covariance contribution, then all-reduce is used to compute the global covariance matrix before eigendecomposition.
- Parameters:
X_batches (iterable of torch.Tensor, DataLoader, or single torch.Tensor) – Either an iterable yielding batches of data, a DataLoader, or a single tensor. Each batch should have shape (n_samples, n_features). DataLoader batches can be tuples (X, y) - only X will be used.
y (None) – Ignored. Present for API consistency.
- Returns:
self – Returns the instance itself.
- Return type:
ExactIncrementalPCA
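For data that fits in memory, a single tensor can be passed directly; a minimal sketch (here fit computes the mean itself, since compute_mean was not called):

import torch
from torchdr import ExactIncrementalPCA

X = torch.randn(10_000, 50)
pca = ExactIncrementalPCA(n_components=10)
pca.fit(X)            # single tensor accepted; mean computed internally
Z = pca.transform(X)  # shape (10000, 10)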
- fit_transform(X_batches, y=None)[source]#
Fit the model and transform the data.
- Parameters:
X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor.
y (None) – Ignored. Present for API consistency.
- Returns:
X_transformed – Transformed data (concatenated from all batches).
- Return type:
torch.Tensor of shape (n_samples, n_components)
- partial_fit(X: Tensor)[source]#
Incrementally fit the model with a batch of samples.
This method assumes the mean has already been computed using compute_mean(). Accumulates X_centered.T @ X_centered for one batch.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – Training batch.
- Returns:
self – Returns the instance itself.
- Return type:
ExactIncrementalPCA
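Conceptually, each call adds one batch's centered scatter contribution to a running (d, d) accumulator; a rough equivalent with hypothetical names, not the library's actual attributes:

import torch

d = 50
mean = torch.zeros(d)        # assumed precomputed by compute_mean()
scatter = torch.zeros(d, d)  # running X_centered.T @ X_centered
for X in (torch.randn(1000, d) for _ in range(3)):  # stand-in batches
    Xc = X - mean            # center with the precomputed mean
    scatter += Xc.T @ Xc     # add this batch's contribution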
- set_fit_request(*, X_batches: bool | None | str = '$UNCHANGED$') → ExactIncrementalPCA#
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- transform(X: Tensor) → Tensor[source]#
Apply dimensionality reduction on X.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_transformed – Transformed data.
- Return type:
torch.Tensor of shape (n_samples, n_components)
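Given the documented attributes, the projection is a centered matrix product; a minimal sketch assuming an already fitted estimator pca trained on 50-feature data:

import torch

X_new = torch.randn(100, 50)
# Conceptually equivalent to pca.transform(X_new): center the data,
# then project it onto the fitted principal axes.
X_proj = (X_new - pca.mean_) @ pca.components_.T  # (100, n_components)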