Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions
- URL: http://arxiv.org/abs/2408.04129v1
- Date: Wed, 7 Aug 2024 23:30:53 GMT
- Title: Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions
- Authors: Luca Reichmann, David Hägele, Daniel Weiskopf
- Abstract summary: Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets.
We propose the use of out-of-sample extensions to perform DR on large data sets.
We provide an evaluation of the projection quality of five common DR algorithms.
- Score: 8.368145000145594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets. While DR methods are often applied to typical DR benchmark data sets in the literature, they might suffer from high runtime complexity and memory requirements, making them unsuitable for large data visualization, especially in environments outside of high-performance computing. To perform DR on large data sets, we propose the use of out-of-sample extensions. Such extensions allow inserting new data into existing projections, which we leverage to iteratively project data into a reference projection that consists only of a small, manageable subset. This process makes it possible to perform DR out-of-core on large data, which would otherwise not be possible due to memory and runtime limitations. For metric multidimensional scaling (MDS), we contribute an implementation with out-of-sample projection capability since typical software libraries do not support it. We provide an evaluation of the projection quality of five common DR algorithms (MDS, PCA, t-SNE, UMAP, and autoencoders) using quality metrics from the literature and analyze the trade-off between the size of the reference set and projection quality. The runtime behavior of the algorithms is also quantified with respect to reference set size, out-of-sample batch size, and dimensionality of the data sets. Furthermore, we compare the out-of-sample approach to other recently introduced DR methods, such as PaCMAP and TriMAP, which claim to handle larger data sets than traditional approaches. To showcase the usefulness of DR on this large scale, we contribute a use case where we analyze ensembles of streamlines amounting to one billion projected instances.
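The iterative scheme described in the abstract (fit a reference projection on a small subset, then insert all remaining data through an out-of-sample extension) can be sketched in a few lines of Python. This is a minimal illustration of the general idea rather than the authors' implementation: scikit-learn's PCA stands in for any DR model with an out-of-sample transform (umap-learn's UMAP exposes one as well), and the batch iterator simulates streaming data from disk; all names here are illustrative.

```python
# Minimal sketch: out-of-core DR via out-of-sample extension.
# PCA is a stand-in for any DR model exposing fit() on a reference
# subset and transform() for out-of-sample points.
import numpy as np
from sklearn.decomposition import PCA

def project_out_of_core(reference, batch_iter, n_components=2):
    """Fit on a small reference set, then project the remaining data
    batch by batch so the full data set never resides in memory."""
    model = PCA(n_components=n_components)
    model.fit(reference)               # reference projection (small subset)
    for batch in batch_iter:           # stream out-of-sample batches
        yield model.transform(batch)   # insert new points into the projection

# Illustrative usage with random data standing in for an on-disk set:
rng = np.random.default_rng(0)
reference = rng.standard_normal((5_000, 50))   # small reference subset
batches = (rng.standard_normal((100_000, 50)) for _ in range(10))
for projected in project_out_of_core(reference, batches):
    pass  # e.g., append `projected` to a memory-mapped output file
```

Peak memory is bounded by the reference set plus a single batch, which is what makes the procedure out-of-core; the trade-off studied in the paper is how small the reference set can be before projection quality degrades.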
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
- Dimensionality Reduction as Probabilistic Inference [10.714603218784175]
Dimensionality reduction (DR) algorithms compress high-dimensional data into a lower dimensional representation while preserving important features of the data.
We introduce the ProbDR variational framework, which interprets a wide range of classical DR algorithms as probabilistic inference algorithms.
arXiv Detail & Related papers (2023-04-15T23:48:59Z)
- RENs: Relevance Encoding Networks [0.0]
This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality.
We show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
arXiv Detail & Related papers (2022-05-25T21:53:48Z)
- High Performance Out-of-sample Embedding Techniques for Multidimensional Scaling [0.5156484100374058]
We propose an out-of-sample embedding (OSE) solution to extend the MDS algorithm for large-scale data.
We present two OSE techniques: the first based on an optimisation approach and the second based on a neural network model (the optimisation variant is sketched after this entry).
arXiv Detail & Related papers (2021-11-07T12:36:33Z)
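Since the main paper contributes exactly such an out-of-sample capability for metric MDS, a hedged sketch of the optimisation-based variant may help: a new point is placed by minimising the stress between its high-dimensional distances to the reference points and its low-dimensional distances to their embedded positions. The function below and its warm-start heuristic are illustrative assumptions, not the exact method of either paper.

```python
# Sketch of an optimisation-based out-of-sample extension for metric MDS:
# position one new point in a fixed reference embedding by minimising stress.
import numpy as np
from scipy.optimize import minimize

def mds_oos_embed(x_new, X_ref, Y_ref):
    d_high = np.linalg.norm(X_ref - x_new, axis=1)  # distances in data space
    def stress(y):
        d_low = np.linalg.norm(Y_ref - y, axis=1)   # distances in the embedding
        return np.sum((d_high - d_low) ** 2)
    y0 = Y_ref[np.argmin(d_high)]                   # warm start: nearest reference point
    return minimize(stress, y0, method="L-BFGS-B").x
```

Because each point is placed independently against a fixed reference embedding, out-of-sample batches can be processed in parallel without revisiting earlier projections.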
- Visual Cluster Separation Using High-Dimensional Sharpened Dimensionality Reduction [65.80631307271705]
'High-Dimensional Sharpened DR' (HD-SDR) is tested on both synthetic and real-world data sets.
Our method achieves good quality (measured by quality metrics) and scales computationally well with large high-dimensional data.
To illustrate its concrete applications, we further apply HD-SDR on a recent astronomical catalog.
arXiv Detail & Related papers (2021-10-01T11:13:51Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data is stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- A Visual Analytics Framework for Reviewing Multivariate Time-Series Data with Dimensionality Reduction [19.460188497780155]
Dimensionality reduction (DR) methods are often used to uncover the intrinsic structure and features of the data.
We present MulTiDR, a new DR framework that enables processing of time-dependent multivariate data as a whole.
By coupling with a contrastive learning method and interactive visualizations, our framework enhances analysts' ability to interpret DR results.
arXiv Detail & Related papers (2020-08-02T04:22:43Z)
- Longitudinal Variational Autoencoder [1.4680035572775534]
A common approach to analyse high-dimensional data that contains missing values is to learn a low-dimensional representation using variational autoencoders (VAEs).
Standard VAEs assume that the learnt representations are i.i.d., and fail to capture the correlations between the data samples.
We propose the Longitudinal VAE (L-VAE), that uses a multi-output additive Gaussian process (GP) prior to extend the VAE's capability to learn structured low-dimensional representations.
Our approach can simultaneously accommodate both time-varying shared and random effects and produce structured low-dimensional representations.
arXiv Detail & Related papers (2020-06-17T10:30:14Z)
- NCVis: Noise Contrastive Approach for Scalable Visualization [79.44177623781043]
NCVis is a high-performance dimensionality reduction method built on a sound statistical basis of noise contrastive estimation.
We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods.
arXiv Detail & Related papers (2020-01-30T15:43:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.