ExactIncrementalPCA#
- class torchdr.ExactIncrementalPCA(n_components: int = 2, device: str = 'auto', verbose: bool = False, random_state: float | None = None, **kwargs)[source]#
Bases:
DRModule
Exact Incremental Principal Component Analysis.
This implementation computes the exact PCA solution by incrementally building the covariance matrix X.T @ X in batches. This is memory-efficient when the number of features is small, as only the (n_features, n_features) covariance matrix needs to be stored, not the full dataset.
Unlike IncrementalPCA, which uses an approximate incremental SVD algorithm, this method computes the exact PCA solution. The cost is two passes through the data: one to compute the mean, and one to build the covariance matrix.
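The two-pass idea can be illustrated with a minimal sketch in plain PyTorch. This is not torchdr's implementation: the helper name two_pass_covariance, the make_batches callable, the float64 accumulation, and the division by n_samples - 1 are all assumptions made for illustration. It shows that accumulating the centered X.T @ X batch by batch gives the same covariance as processing the full dataset at once.

```python
import torch


def two_pass_covariance(make_batches):
    """Accumulate the covariance over batches in two passes.

    `make_batches` is a hypothetical zero-argument callable that returns a
    fresh iterable of (n_samples, n_features) tensors on every call.
    """
    # Pass 1: per-feature mean from running sums.
    total, count = None, 0
    for batch in make_batches():
        batch_sum = batch.to(torch.float64).sum(dim=0)
        total = batch_sum if total is None else total + batch_sum
        count += batch.shape[0]
    mean = total / count

    # Pass 2: sum of centered X.T @ X. Only an (n_features, n_features)
    # matrix is kept in memory, never the full dataset.
    n_features = mean.shape[0]
    scatter = torch.zeros(n_features, n_features, dtype=torch.float64)
    for batch in make_batches():
        centered = batch.to(torch.float64) - mean
        scatter += centered.T @ centered

    # Assumed normalisation: n_samples - 1 (sample covariance).
    return mean, scatter / (count - 1)


torch.manual_seed(0)
X = torch.randn(1000, 5)
mean, cov = two_pass_covariance(lambda: X.split(128))
```

Because the mean is fixed before the second pass, every batch is centered with the same vector, which is why the accumulated result is exact and not an approximation.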
- Parameters:
- mean_#
Per-feature empirical mean, calculated from the training set.
- Type:
torch.Tensor of shape (n_features,)
- components_#
Principal axes in feature space, representing the directions of maximum variance in the data.
- Type:
torch.Tensor of shape (n_components, n_features)
- explained_variance_#
The amount of variance explained by each of the selected components.
- Type:
torch.Tensor of shape (n_components,)
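As a rough illustration of how these three attributes relate to one another, the sketch below derives analogous quantities from a covariance matrix with torch.linalg.eigh. It uses local variables, not the estimator's attributes, and it assumes the usual PCA conventions (covariance normalised by n_samples - 1, projection as (X - mean) @ components.T); the source does not confirm that torchdr follows exactly these conventions.

```python
import torch

torch.manual_seed(0)
# Synthetic data with a different scale per feature (illustrative only).
scales = torch.tensor([5.0, 3.0, 2.0, 1.0, 0.5, 0.1], dtype=torch.float64)
X = torch.randn(500, 6, dtype=torch.float64) * scales
n_components = 2

mean = X.mean(dim=0)                      # (n_features,)
cov = torch.cov(X.T)                      # (n_features, n_features)

# eigh returns eigenvalues in ascending order, so pick the largest ones.
eigvals, eigvecs = torch.linalg.eigh(cov)
order = torch.argsort(eigvals, descending=True)[:n_components]
explained_variance = eigvals[order]       # (n_components,)
components = eigvecs[:, order].T          # (n_components, n_features)

# Projecting centered data onto the principal axes.
X_proj = (X - mean) @ components.T        # (n_samples, n_components)
```

Under these assumptions, the variance of each projected column equals the corresponding entry of explained_variance.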
Notes
When to use each incremental PCA variant:
IncrementalPCA: Use when you need single-pass processing, can tolerate approximate results, or have high-dimensional data where storing the full covariance matrix would be prohibitive.
ExactIncrementalPCA: Use when you need exact PCA results, have low-dimensional data (small n_features), and can afford two passes through the data.
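The trade-off above is largely a question of memory, which a quick back-of-the-envelope calculation makes concrete. The helper names below are hypothetical, and 4 bytes per value (float32) is an assumption.

```python
def covariance_bytes(n_features, bytes_per_value=4):
    """Memory for an (n_features, n_features) covariance matrix."""
    return n_features * n_features * bytes_per_value


def dataset_bytes(n_samples, n_features, bytes_per_value=4):
    """Memory for the full (n_samples, n_features) dataset."""
    return n_samples * n_features * bytes_per_value


# Low-dimensional case: 10 million samples, 50 features.
low_dim_cov = covariance_bytes(50)               # 10_000 bytes (10 kB)
low_dim_data = dataset_bytes(10_000_000, 50)     # 2_000_000_000 bytes (2 GB)

# High-dimensional case: 100,000 features.
high_dim_cov = covariance_bytes(100_000)         # 40_000_000_000 bytes (40 GB)
```

With 50 features the covariance matrix is negligible next to the dataset, so the exact variant is cheap to hold in memory. With 100,000 features the covariance matrix alone would be prohibitive.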
Examples
Using with PyTorch DataLoader for large datasets:
    from torch.utils.data import DataLoader, TensorDataset
    from torchdr.spectral_embedding import ExactIncrementalPCA

    # Create a DataLoader for a huge dataset
    dataset = TensorDataset(huge_X_tensor)
    dataloader = DataLoader(dataset, batch_size=1000, shuffle=False)

    # Initialize the model
    pca = ExactIncrementalPCA(n_components=50, device='cuda')

    # First pass: compute mean
    batch_list = []
    for batch in dataloader:
        X_batch = batch[0]  # DataLoader returns tuples
        batch_list.append(X_batch)
    pca.compute_mean(batch_list)

    # Second pass: fit the model
    pca.fit(batch_list)

    # Transform new data
    test_loader = DataLoader(test_dataset, batch_size=1000)
    transformed_batches = []
    for batch in test_loader:
        X_batch = batch[0]
        X_transformed = pca.transform(X_batch)
        transformed_batches.append(X_transformed)
Using with data generators for streaming:
    import torch
    from torchdr.spectral_embedding import ExactIncrementalPCA

    # Generate large dataset that doesn't fit in memory
    def data_generator():
        for i in range(100):  # 100 batches
            yield torch.randn(1000, 50)  # 1000 samples, 50 features

    # First pass: compute mean
    pca = ExactIncrementalPCA(n_components=10)
    pca.compute_mean(data_generator())

    # Second pass: fit the model
    pca.fit(data_generator())

    # Transform new data
    X_new = torch.randn(100, 50)
    X_transformed = pca.transform(X_new)
- compute_mean(X_batches)[source]#
Compute the mean from batches of data (first pass).
- Parameters:
X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor. Each batch should have shape (n_samples, n_features).
- Returns:
self – Returns the instance itself.
- Return type:
ExactIncrementalPCA
- fit(X_batches, y=None)[source]#
Fit the model with batches of samples.
This method assumes the mean has already been computed using compute_mean(). If the mean has not been computed, fit computes it first, which requires two passes through the data.
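Two passes mean the data must be iterable twice. The source does not say how fit treats a one-shot generator when it has to compute the mean itself, but as a general Python fact a generator object can be consumed only once. The sketch below uses plain lists in place of tensors to show this; the batches function and the ReIterable wrapper are hypothetical names, not part of torchdr.

```python
def batches():
    """Hypothetical batch source; each call starts a fresh stream."""
    for start in range(0, 6, 2):
        yield list(range(start, start + 2))


# A single generator object is exhausted after one pass.
stream = batches()
first_pass = list(stream)    # three batches
second_pass = list(stream)   # empty: the generator was already consumed


class ReIterable:
    """Wrap a generator function so every iteration restarts the stream."""

    def __init__(self, factory):
        self.factory = factory

    def __iter__(self):
        return iter(self.factory())


reusable = ReIterable(batches)
pass_one = list(reusable)
pass_two = list(reusable)    # same batches again
```

Calling compute_mean() and then fit() with two separate generator instances avoids depending on how a single exhausted generator would be handled.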
- Parameters:
X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor. Each batch should have shape (n_samples, n_features).
y (None) – Ignored. Present for API consistency.
- Returns:
self – Returns the instance itself.
- Return type:
ExactIncrementalPCA
- fit_transform(X_batches, y=None)[source]#
Fit the model and transform the data.
- Parameters:
X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor.
y (None) – Ignored. Present for API consistency.
- Returns:
X_transformed – Transformed data (concatenated from all batches).
- Return type:
torch.Tensor
- partial_fit(X: Tensor)[source]#
Incrementally fit the model with a batch of samples.
This method assumes the mean has already been computed using compute_mean().
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – Training batch.
- Returns:
self – Returns the instance itself.
- Return type:
ExactIncrementalPCA
- set_fit_request(*, X_batches: bool | None | str = '$UNCHANGED$') → ExactIncrementalPCA#
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_batches (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the X_batches parameter in fit.
- Returns:
self – The updated object.
- Return type:
object