ExactIncrementalPCA

class torchdr.ExactIncrementalPCA(n_components: int = 2, device: str = 'auto', verbose: bool = False, random_state: float | None = None, **kwargs)

Bases: DRModule

Exact Incremental Principal Component Analysis.

This implementation computes the exact PCA solution by accumulating the covariance matrix (X.T @ X on mean-centered data) batch by batch. This is memory-efficient when the number of features is small, because only the (n_features, n_features) covariance matrix is stored rather than the full dataset.

Unlike IncrementalPCA, which uses an approximate incremental SVD algorithm, this method computes the exact PCA solution. The cost is two passes through the data: one to compute the mean and one to build the covariance matrix.
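The two-pass procedure described above can be sketched in plain PyTorch. This is a minimal illustration, not torchdr's implementation: the helper name two_pass_pca, the make_batches callable, the float64 accumulation, and the n_samples - 1 normalization are all assumptions made for the example.

```python
import torch


def two_pass_pca(make_batches, n_components):
    """Exact PCA from batches; make_batches() must return a fresh iterable per call."""
    # Pass 1: accumulate per-feature sums to obtain the mean.
    total, n_samples = None, 0
    for X in make_batches():
        X = X.to(torch.float64)
        total = X.sum(dim=0) if total is None else total + X.sum(dim=0)
        n_samples += X.shape[0]
    mean = total / n_samples

    # Pass 2: accumulate the centered scatter matrix, shape (n_features, n_features).
    scatter = torch.zeros(mean.shape[0], mean.shape[0], dtype=torch.float64)
    for X in make_batches():
        Xc = X.to(torch.float64) - mean
        scatter += Xc.T @ Xc

    # eigh handles symmetric matrices and returns eigenvalues in ascending order,
    # so sort descending and keep the leading n_components.
    eigvals, eigvecs = torch.linalg.eigh(scatter / (n_samples - 1))
    order = torch.argsort(eigvals, descending=True)[:n_components]
    components = eigvecs[:, order].T  # (n_components, n_features)
    return mean, components, eigvals[order]


torch.manual_seed(0)
# Columns are scaled differently so the leading eigenvalues are well separated.
X_full = torch.randn(500, 8, dtype=torch.float64) * torch.arange(1, 9)


def batches():
    # Hypothetical stand-in for a data stream that can be read twice.
    return torch.split(X_full, 64)


mean, components, explained_variance = two_pass_pca(batches, n_components=3)
```

Between batches, only the running sum and the (n_features, n_features) scatter matrix are kept, which is why the approach suits a small n_features.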

Parameters:
  • n_components (int, default=2) – Number of components to keep.

  • device (str, default="auto") – Device on which the computations are performed.

  • verbose (bool, default=False) – Whether to print information during the computations.

  • random_state (float, default=None) – Random seed for reproducibility.

mean_

Per-feature empirical mean, calculated from the training set.

Type:

torch.Tensor of shape (n_features,)

components_

Principal axes in feature space, representing the directions of maximum variance in the data.

Type:

torch.Tensor of shape (n_components, n_features)

explained_variance_

The amount of variance explained by each of the selected components.

Type:

torch.Tensor of shape (n_components,)

n_samples_seen_

The number of samples processed.

Type:

int

n_features_in_

Number of features seen during fit.

Type:

int

Notes

When to use each incremental PCA variant:

  • IncrementalPCA: Use when you need single-pass processing, can tolerate approximate results, or have high-dimensional data where storing the full covariance matrix would be prohibitive.

  • ExactIncrementalPCA: Use when you need exact PCA results, have low-dimensional data (small n_features), and can afford two passes through the data.
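To make the trade-off concrete, a rough back-of-the-envelope estimate compares the size of the covariance matrix with the size of the dataset. The helper names and the float32 assumption (4 bytes per value) are illustrative only and are not part of torchdr.

```python
def covariance_bytes(n_features, bytes_per_value=4):
    # Storage for an (n_features, n_features) matrix.
    return n_features * n_features * bytes_per_value


def dataset_bytes(n_samples, n_features, bytes_per_value=4):
    # Storage for the full (n_samples, n_features) dataset.
    return n_samples * n_features * bytes_per_value


# Ten million samples, once low-dimensional and once high-dimensional.
for n_features in (50, 100_000):
    cov_mb = covariance_bytes(n_features) / 1e6
    data_gb = dataset_bytes(10_000_000, n_features) / 1e9
    print(f"n_features={n_features}: covariance {cov_mb:.2f} MB, dataset {data_gb:.1f} GB")
```

With 50 features the covariance matrix is negligible next to the data, while at 100,000 features the matrix alone takes tens of gigabytes, which is the regime where the approximate IncrementalPCA is the better fit.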

Examples

Using with PyTorch DataLoader for large datasets:

from torch.utils.data import DataLoader, TensorDataset
from torchdr.spectral_embedding import ExactIncrementalPCA

# Create a DataLoader for a huge dataset
dataset = TensorDataset(huge_X_tensor)
dataloader = DataLoader(dataset, batch_size=1000, shuffle=False)

# Initialize the model
pca = ExactIncrementalPCA(n_components=50, device='cuda')

# First pass: compute the mean, streaming one batch at a time
# (DataLoader yields tuples, so take element 0 of each batch)
pca.compute_mean(batch[0] for batch in dataloader)

# Second pass: fit the model, again without holding all batches in memory
pca.fit(batch[0] for batch in dataloader)

# Transform new data batch by batch (test_dataset is a placeholder for your own Dataset)
test_loader = DataLoader(test_dataset, batch_size=1000)
transformed_batches = []
for batch in test_loader:
    X_batch = batch[0]
    X_transformed = pca.transform(X_batch)
    transformed_batches.append(X_transformed)

Using with data generators for streaming:

import torch
from torchdr.spectral_embedding import ExactIncrementalPCA

# Stream a large dataset in batches; the fixed seed makes both passes see the same data
def data_generator():
    gen = torch.Generator().manual_seed(0)
    for _ in range(100):  # 100 batches
        yield torch.randn(1000, 50, generator=gen)  # 1000 samples, 50 features

# First pass: compute mean
pca = ExactIncrementalPCA(n_components=10)
pca.compute_mean(data_generator())

# Second pass: fit the model
pca.fit(data_generator())

# Transform new data
X_new = torch.randn(100, 50)
X_transformed = pca.transform(X_new)

compute_mean(X_batches)

Compute the mean from batches of data (first pass).

Parameters:

X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor. Each batch should have shape (n_samples, n_features).

Returns:

self – Returns the instance itself.

Return type:

ExactIncrementalPCA

fit(X_batches, y=None)

Fit the model with batches of samples.

This method assumes the mean has already been computed with compute_mean(). If it has not, fit computes the mean first, which requires two passes over X_batches.
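One practical consequence of needing two passes, shown here with plain Python and no torchdr calls: a generator object can be iterated only once, so an iterable consumed by the mean pass yields nothing on a second pass. How fit reacts to an exhausted iterable is not stated in this reference; the sketch only demonstrates the Python behavior, using a hypothetical toy_batches generator.

```python
def toy_batches():
    # Hypothetical stand-in for a stream of batches.
    for i in range(3):
        yield [float(i)]


gen = toy_batches()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to iterate over

print(len(first_pass), len(second_pass))
```

Passing a re-iterable object such as a list or a DataLoader, or calling compute_mean() and fit() with two fresh generators as in the streaming example, avoids the issue.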

Parameters:
  • X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor. Each batch should have shape (n_samples, n_features).

  • y (None) – Ignored. Present for API consistency.

Returns:

self – Returns the instance itself.

Return type:

ExactIncrementalPCA

fit_transform(X_batches, y=None)

Fit the model and transform the data.

Parameters:
  • X_batches (iterable of torch.Tensor or single torch.Tensor) – Either an iterable yielding batches of data, or a single tensor.

  • y (None) – Ignored. Present for API consistency.

Returns:

X_transformed – Transformed data (concatenated from all batches).

Return type:

torch.Tensor

partial_fit(X: Tensor)

Incrementally fit the model with a batch of samples.

This method assumes the mean has already been computed using compute_mean().

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – Training batch.

Returns:

self – Returns the instance itself.

Return type:

ExactIncrementalPCA

set_fit_request(*, X_batches: bool | None | str = '$UNCHANGED$') -> ExactIncrementalPCA

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_batches (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_batches parameter in fit.

Returns:

self – The updated object.

Return type:

object

transform(X: Tensor) -> Tensor

Apply dimensionality reduction on X.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – Data to transform.

Returns:

X_transformed – Transformed data.

Return type:

torch.Tensor of shape (n_samples, n_components)
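
For orientation, the standard PCA projection centers the input with the training mean and multiplies by the transposed principal axes. The sketch below assumes transform follows this convention, which this reference does not spell out. The project helper is hypothetical, and the axes come from a plain SVD rather than from a fitted ExactIncrementalPCA.

```python
import torch


def project(X, mean, components):
    """Center X with the training mean, then project onto the principal axes."""
    return (X - mean) @ components.T


torch.manual_seed(0)
X_train = torch.randn(200, 5)
mean = X_train.mean(dim=0)

# Stand-in for fitted axes: leading right singular vectors of the centered data.
_, _, Vh = torch.linalg.svd(X_train - mean, full_matrices=False)
components = Vh[:2]  # (n_components, n_features)

X_new = torch.randn(10, 5)
X_transformed = project(X_new, mean, components)
print(X_transformed.shape)
```

Under this convention, the training mean itself maps to the origin of the reduced space.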