Tuning the perplexity for and computing sampling-based t-SNE embeddings
- URL: http://arxiv.org/abs/2308.15513v1
- Date: Tue, 29 Aug 2023 16:24:11 GMT
- Title: Tuning the perplexity for and computing sampling-based t-SNE embeddings
- Authors: Martin Skrodzki, Nicolas Chaves-de-Plaza, Klaus Hildebrandt, Thomas Höllt, Elmar Eisemann
- Abstract summary: We show that a sampling-based embedding approach can circumvent problems with large data sets, speeding up the computation and increasing the quality of the embeddings.
- Score: 7.85331971049706
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Widely used pipelines for the analysis of high-dimensional data utilize
two-dimensional visualizations. These are created, e.g., via t-distributed
stochastic neighbor embedding (t-SNE). When it comes to large data sets,
applying these visualization techniques creates suboptimal embeddings, as the
hyperparameters are not suitable for large data. Cranking up these parameters
usually does not work as the computations become too expensive for practical
workflows. In this paper, we argue that a sampling-based embedding approach can
circumvent these problems. We show that hyperparameters must be chosen
carefully, depending on the sampling rate and the intended final embedding.
Further, we show how this approach speeds up the computation and increases the
quality of the embeddings.
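To make the idea concrete, the following is a minimal sketch of a sampling-based t-SNE pipeline, assuming scikit-learn's t-SNE and a simple linear rescaling of the perplexity with the sampling rate; the paper's actual tuning rule may differ.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = load_digits().data                      # (1797, 64) toy stand-in for a large data set

sampling_rate = 0.25                        # keep 25% of the points
idx = rng.choice(len(X), size=int(sampling_rate * len(X)), replace=False)
X_sub = X[idx]

base_perplexity = 30.0                      # value one would use on the full data
# Assumption: shrink the perplexity with the sampling rate so the effective
# neighborhood size stays comparable after subsampling.
perplexity = max(5.0, base_perplexity * sampling_rate)

emb = TSNE(n_components=2, perplexity=perplexity,
           init="pca", random_state=0).fit_transform(X_sub)
print(emb.shape)                            # (449, 2) embedding of the sample
```

Embedding only the sample keeps the pairwise-affinity computation small, which is where the speed-up comes from; the choice of perplexity then has to reflect how sparsely the sample covers the original data.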
Related papers
- Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets [11.105392318582677]
We propose a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees.
Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure.
We show that in a high-dimensional regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables.
arXiv Detail & Related papers (2024-07-01T18:48:55Z)
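A hedged sketch of the idea above, using the POT library to compute the entropic OT plan between two toy datasets and embedding both with the plan's leading singular vectors; the regularization strength, embedding dimension, and data are illustrative assumptions, not the paper's.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                       # dataset 1
Y = X @ rng.normal(size=(50, 50)) / np.sqrt(50) \
    + 0.1 * rng.normal(size=(200, 50))               # dataset 2, shared latent structure

a = np.full(len(X), 1 / len(X))                      # uniform marginals
b = np.full(len(Y), 1 / len(Y))
M = ot.dist(X, Y)                                    # squared Euclidean cost matrix
M /= M.max()                                         # normalize for numerical stability
P = ot.sinkhorn(a, b, M, reg=0.05)                   # entropic OT plan (200 x 200)

U, s, Vt = np.linalg.svd(P)
k = 3                                                # embedding dimension (illustrative)
emb_X = U[:, 1:k + 1]                                # skip the near-constant leading pair
emb_Y = Vt[1:k + 1].T                                # joint embedding of the second dataset
```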
- ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by the existing training schemes of diffusion generative models.
We present a simple fix to this problem by constructing processes that fully exploit the structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z)
- Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
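The normalization fix above can be illustrated with a toy experiment: gradient descent for a single-index model under anisotropic (spiked) Gaussian inputs, re-normalizing the weight vector after every step. The model, link function, and step size here are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 5000
scales = np.ones(d)
scales[0] = 10.0                            # spiked covariance direction
X = rng.normal(size=(n, d)) * np.sqrt(scales)

w_star = np.zeros(d)
w_star[0] = 1.0                             # true direction lies along the spike
y = np.maximum(X @ w_star, 0.0)             # single-index model, ReLU link

w = rng.normal(size=d)
w /= np.linalg.norm(w)
lr = 0.05
for _ in range(500):
    pred = np.maximum(X @ w, 0.0)
    grad = -X.T @ ((y - pred) * (X @ w > 0)) / n   # gradient of the mean squared error
    w -= lr * grad
    w /= np.linalg.norm(w)                  # the weight-normalization step
print(abs(w @ w_star))                      # alignment with the true direction
```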
- Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data [0.0]
This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to feature-distributed data.
The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets.
We apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports.
arXiv Detail & Related papers (2023-07-07T06:24:56Z)
- Beyond Individual Input for Deep Anomaly Detection on Tabular Data [0.0]
Anomaly detection is vital in many domains, such as finance, healthcare, and cybersecurity.
To the best of our knowledge, this is the first work to successfully combine feature-feature and sample-sample dependencies.
Our method achieves state-of-the-art performance, outperforming existing methods by 2.4% and 1.2% in terms of F1-score and AUROC, respectively.
arXiv Detail & Related papers (2023-05-24T13:13:26Z)
- Linking data separation, visual separation, and classifier performance using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and the visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- AVIDA: Alternating method for Visualizing and Integrating Data [1.6637373649145604]
AVIDA is a framework for simultaneously performing data alignment and dimension reduction.
We show that AVIDA correctly aligns high-dimensional datasets without common features.
In general applications, other methods can be used for the alignment and dimension reduction modules.
arXiv Detail & Related papers (2022-05-31T22:36:10Z)
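A skeleton of the alternating scheme above, assuming t-SNE as the dimension-reduction module and orthogonal Procrustes with a known row correspondence as the alignment module, purely for illustration; AVIDA itself is module-agnostic, as the summary notes, and aligns without such a correspondence.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))              # dataset 1
Y = X @ rng.normal(size=(30, 20))           # dataset 2: same samples, no common features

Zx = PCA(n_components=2).fit_transform(X)   # initial embeddings
Zy = PCA(n_components=2).fit_transform(Y)
for _ in range(3):
    # Alignment step: rotate Y's embedding onto X's (assumes known correspondence).
    R, _ = orthogonal_procrustes(Zy, Zx)
    Zy = Zy @ R
    # Reduction step: refine each embedding, warm-started from the aligned layout.
    Zx = TSNE(n_components=2, perplexity=30, init=Zx).fit_transform(X)
    Zy = TSNE(n_components=2, perplexity=30, init=Zy).fit_transform(Y)
```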
- RENs: Relevance Encoding Networks [0.0]
This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality.
We show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
arXiv Detail & Related papers (2022-05-25T21:53:48Z)
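A fragmentary sketch of the ARD mechanism above: a KL regularizer with learnable per-dimension prior variances, so that irrelevant latent dimensions can be switched off during training. The names and shapes are hypothetical, not the RENs code.

```python
import torch

latent_dim = 16
log_alpha = torch.zeros(latent_dim, requires_grad=True)  # learnable ARD prior log-variances

def ard_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, diag(exp(log_alpha))) ), summed per sample."""
    var, alpha = logvar.exp(), log_alpha.exp()
    kl = 0.5 * (log_alpha - logvar + (var + mu ** 2) / alpha - 1.0)
    return kl.sum(dim=-1)

# One evaluation of the regularizer for a batch of encoder outputs.
mu, logvar = torch.randn(8, latent_dim), torch.zeros(8, latent_dim)
print(ard_kl(mu, logvar).shape)  # torch.Size([8]): one KL value per sample
```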
- UnProjection: Leveraging Inverse-Projections for Visual Analytics of High-Dimensional Data [63.74032987144699]
We present NNInv, a deep learning technique with the ability to approximate the inverse of any projection or mapping.
NNInv learns to reconstruct high-dimensional data from any arbitrary point on a 2D projection space, giving users the ability to interact with the learned high-dimensional representation in a visual analytics system.
arXiv Detail & Related papers (2021-11-02T17:11:57Z)
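A minimal stand-in for the inverse-projection idea above: train a small regressor to map 2D projection coordinates back to the high-dimensional data. scikit-learn's MLPRegressor replaces the paper's deep network here; the dataset and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

X = load_digits().data / 16.0                    # high-dimensional data, scaled to [0, 1]
P = TSNE(n_components=2, random_state=0).fit_transform(X)  # any 2D projection

inv = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500,
                   random_state=0).fit(P, X)     # learn projection -> data

# Query an arbitrary point of the 2D space, not just projected samples.
x_hat = inv.predict(P[:1] + 0.5)                 # reconstructed 64-D vector
print(x_hat.shape)                               # (1, 64)
```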
- Revealing the Structure of Deep Neural Networks via Convex Duality [70.15611146583068]
We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of hidden layers.
We show that a set of optimal hidden layer weights for a norm regularized training problem can be explicitly found as the extreme points of a convex set.
We apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds.
arXiv Detail & Related papers (2020-02-22T21:13:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.