Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions
- URL: http://arxiv.org/abs/2408.04129v1
- Date: Wed, 7 Aug 2024 23:30:53 GMT
- Title: Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions
- Authors: Luca Reichmann, David Hägele, Daniel Weiskopf
- Abstract summary: Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets.
We propose the use of out-of-sample extensions to perform DR on large data sets.
We provide an evaluation of the projection quality of five common DR algorithms.
- Score: 8.368145000145594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets. While DR methods are often applied to typical DR benchmark data sets in the literature, they might suffer from high runtime complexity and memory requirements, making them unsuitable for large data visualization, especially in environments outside of high-performance computing. To perform DR on large data sets, we propose the use of out-of-sample extensions. Such extensions allow inserting new data into existing projections, which we leverage to iteratively project data into a reference projection that consists only of a small, manageable subset. This process makes it possible to perform DR out-of-core on large data, which would otherwise not be possible due to memory and runtime limitations. For metric multidimensional scaling (MDS), we contribute an implementation with out-of-sample projection capability since typical software libraries do not support it. We provide an evaluation of the projection quality of five common DR algorithms (MDS, PCA, t-SNE, UMAP, and autoencoders) using quality metrics from the literature and analyze the trade-off between the size of the reference set and projection quality. The runtime behavior of the algorithms is also quantified with respect to reference set size, out-of-sample batch size, and dimensionality of the data sets. Furthermore, we compare the out-of-sample approach to other recently introduced DR methods, such as PaCMAP and TriMAP, which claim to handle larger data sets than traditional approaches. To showcase the usefulness of DR on this large scale, we contribute a use case where we analyze ensembles of streamlines amounting to one billion projected instances.
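The iterative scheme described in the abstract (fit a reference projection on a small subset, then insert all remaining data through an out-of-sample extension) can be sketched in a few lines of Python. This is a minimal illustration of the general idea rather than the authors' implementation: scikit-learn's PCA stands in for any DR model with an out-of-sample transform (umap-learn's UMAP exposes one as well), and the batch iterator simulates streaming data from disk; all names here are illustrative.

```python
# Minimal sketch: out-of-core DR via out-of-sample extension.
# PCA is a stand-in for any DR model exposing fit() on a reference
# subset and transform() for out-of-sample points.
import numpy as np
from sklearn.decomposition import PCA

def project_out_of_core(reference, batch_iter, n_components=2):
    """Fit on a small reference set, then project the remaining data
    batch by batch so the full data set never resides in memory."""
    model = PCA(n_components=n_components)
    model.fit(reference)               # reference projection (small subset)
    for batch in batch_iter:           # stream out-of-sample batches
        yield model.transform(batch)   # insert new points into the projection

# Illustrative usage with random data standing in for an on-disk set:
rng = np.random.default_rng(0)
reference = rng.standard_normal((5_000, 50))   # small reference subset
batches = (rng.standard_normal((100_000, 50)) for _ in range(10))
for projected in project_out_of_core(reference, batches):
    pass  # e.g., append `projected` to a memory-mapped output file
```

Peak memory is bounded by the reference set plus a single batch, which is what makes the procedure out-of-core; the trade-off studied in the paper is how small the reference set can be before projection quality degrades.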
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
- Dimensionality Reduction as Probabilistic Inference [10.714603218784175]
Dimensionality reduction (DR) algorithms compress high-dimensional data into a lower dimensional representation while preserving important features of the data.
We introduce the ProbDR variational framework, which interprets a wide range of classical DR algorithms as probabilistic inference algorithms.
arXiv Detail & Related papers (2023-04-15T23:48:59Z)
- RENs: Relevance Encoding Networks [0.0]
This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality.
We show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
arXiv Detail & Related papers (2022-05-25T21:53:48Z)
- High Performance Out-of-sample Embedding Techniques for Multidimensional Scaling [0.5156484100374058]
We propose an out-of-sample embedding (OSE) solution to extend the MDS algorithm for large-scale data.
We present two OSE techniques: the first based on an optimisation approach and the second based on a neural network model (the optimisation variant is sketched after this entry).
arXiv Detail & Related papers (2021-11-07T12:36:33Z)
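Since the main paper contributes exactly such an out-of-sample capability for metric MDS, a hedged sketch of the optimisation-based variant may help: a new point is placed by minimising the stress between its high-dimensional distances to the reference points and its low-dimensional distances to their embedded positions. The function below and its warm-start heuristic are illustrative assumptions, not the exact method of either paper.

```python
# Sketch of an optimisation-based out-of-sample extension for metric MDS:
# position one new point in a fixed reference embedding by minimising stress.
import numpy as np
from scipy.optimize import minimize

def mds_oos_embed(x_new, X_ref, Y_ref):
    d_high = np.linalg.norm(X_ref - x_new, axis=1)  # distances in data space
    def stress(y):
        d_low = np.linalg.norm(Y_ref - y, axis=1)   # distances in the embedding
        return np.sum((d_high - d_low) ** 2)
    y0 = Y_ref[np.argmin(d_high)]                   # warm start: nearest reference point
    return minimize(stress, y0, method="L-BFGS-B").x
```

Because each point is placed independently against a fixed reference embedding, out-of-sample batches can be processed in parallel without revisiting earlier projections.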
- Visual Cluster Separation Using High-Dimensional Sharpened Dimensionality Reduction [65.80631307271705]
'High-Dimensional Sharpened DR' (HD-SDR) is tested on both synthetic and real-world data sets.
Our method achieves good quality (measured by quality metrics) and scales computationally well with large high-dimensional data.
To illustrate its concrete applications, we further apply HD-SDR on a recent astronomical catalog.
arXiv Detail & Related papers (2021-10-01T11:13:51Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data is stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- A Visual Analytics Framework for Reviewing Multivariate Time-Series Data with Dimensionality Reduction [19.460188497780155]
Dimensionality reduction (DR) methods are often used to uncover the intrinsic structure and features of the data.
We present MulTiDR, a new DR framework that enables processing of time-dependent multivariate data as a whole.
By coupling with a contrastive learning method and interactive visualizations, our framework enhances analysts' ability to interpret DR results.
arXiv Detail & Related papers (2020-08-02T04:22:43Z)
- Longitudinal Variational Autoencoder [1.4680035572775534]
A common approach to analyse high-dimensional data that contains missing values is to learn a low-dimensional representation using variational autoencoders (VAEs).
Standard VAEs assume that the learnt representations are i.i.d., and fail to capture the correlations between the data samples.
We propose the Longitudinal VAE (L-VAE), that uses a multi-output additive Gaussian process (GP) prior to extend the VAE's capability to learn structured low-dimensional representations.
Our approach can simultaneously accommodate both time-varying shared and random effects and produce structured low-dimensional representations.
arXiv Detail & Related papers (2020-06-17T10:30:14Z)
- NCVis: Noise Contrastive Approach for Scalable Visualization [79.44177623781043]
NCVis is a high-performance dimensionality reduction method built on a sound statistical basis of noise contrastive estimation.
We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods.
arXiv Detail & Related papers (2020-01-30T15:43:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.