Tuning the perplexity for and computing sampling-based t-SNE embeddings
- URL: http://arxiv.org/abs/2308.15513v1
- Date: Tue, 29 Aug 2023 16:24:11 GMT
- Title: Tuning the perplexity for and computing sampling-based t-SNE embeddings
- Authors: Martin Skrodzki, Nicolas Chaves-de-Plaza, Klaus Hildebrandt, Thomas Höllt, Elmar Eisemann
- Abstract summary: We show that a sampling-based embedding approach can circumvent problems with large data sets.
We show how this approach speeds up the computation and increases the quality of the embeddings.
- Score: 7.85331971049706
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Widely used pipelines for the analysis of high-dimensional data utilize
two-dimensional visualizations. These are created, e.g., via t-distributed
stochastic neighbor embedding (t-SNE). When it comes to large data sets,
applying these visualization techniques creates suboptimal embeddings, as the
hyperparameters are not suitable for large data. Cranking up these parameters
usually does not work as the computations become too expensive for practical
workflows. In this paper, we argue that a sampling-based embedding approach can
circumvent these problems. We show that hyperparameters must be chosen
carefully, depending on the sampling rate and the intended final embedding.
Further, we show how this approach speeds up the computation and increases the
quality of the embeddings.
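To make the recipe concrete, here is a minimal sketch of a sampling-based embedding: embed only a random sample with a perplexity adjusted to the sampling rate, then place the remaining points relative to their embedded neighbors. The linear perplexity-scaling rule and the nearest-neighbor placement below are illustrative simplifications, not the paper's exact prescription.
```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

def sampled_tsne(X, sampling_rate=0.1, base_perplexity=30.0, seed=0):
    """Embed a random sample of X with t-SNE, then place non-sampled
    points at the centroid of their nearest embedded neighbors.

    The perplexity is scaled with the sampling rate so the neighborhood
    in the sample roughly matches the intended full-data neighborhood
    (an illustrative rule, not the paper's exact tuning)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=max(int(sampling_rate * n), 10), replace=False)
    sample = X[idx]

    perplexity = max(base_perplexity * sampling_rate, 5.0)
    emb_sample = TSNE(n_components=2, perplexity=perplexity,
                      random_state=seed).fit_transform(sample)

    # Place non-sampled points at the mean embedding position of their
    # k nearest sampled neighbors in the high-dimensional space.
    nn = NearestNeighbors(n_neighbors=5).fit(sample)
    _, neigh = nn.kneighbors(X)
    emb = emb_sample[neigh].mean(axis=1)
    emb[idx] = emb_sample  # keep exact coordinates for sampled points
    return emb
```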
Related papers
- Constructing Gaussian Processes via Samplets [0.0]
We examine recent convergence results to identify models with optimal convergence rates.
We propose a samplet-based approach to efficiently construct and train Gaussian processes.
arXiv Detail & Related papers (2024-11-11T18:01:03Z)
- Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets [11.105392318582677]
We propose a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees.
Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure.
We show that in a high-dimensional regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables.
arXiv Detail & Related papers (2024-07-01T18:48:55Z)
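The mechanics of the EOT-based joint embedding admit a compact sketch using the POT library: compute an entropic transport plan between the two datasets and embed each side with the leading singular vectors of the plan. The regularization strength and number of components below are illustrative choices, not the paper's settings.
```python
import numpy as np
import ot  # POT: Python Optimal Transport

def eot_joint_embedding(X, Y, reg=0.5, n_components=3):
    """Jointly embed datasets X and Y via the entropic OT plan
    (a minimal sketch of the EOT-eigenmaps idea)."""
    a = np.full(X.shape[0], 1.0 / X.shape[0])  # uniform marginals
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])
    M = ot.dist(X, Y)              # squared Euclidean cost matrix
    P = ot.sinkhorn(a, b, M, reg)  # entropic transport plan

    # Leading singular vectors of the plan capture shared structure;
    # the first pair is skipped as it reflects the marginals.
    U, S, Vt = np.linalg.svd(P, full_matrices=False)
    emb_X = U[:, 1:n_components + 1] * S[1:n_components + 1]
    emb_Y = Vt[1:n_components + 1, :].T * S[1:n_components + 1]
    return emb_X, emb_Y
```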
- ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by existing training schemes of diffusion generative models.
We present a simple fix to this problem by constructing processes that fully exploit the structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z)
- Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
- Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data [0.0]
This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to feature-distributed data.
The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets.
We apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports.
arXiv Detail & Related papers (2023-07-07T06:24:56Z)
- Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow one to trade off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z)
- Beyond Individual Input for Deep Anomaly Detection on Tabular Data [0.0]
Anomaly detection is vital in many domains, such as finance, healthcare, and cybersecurity.
To the best of our knowledge, this is the first work to successfully combine feature-feature and sample-sample dependencies.
Our method achieves state-of-the-art performance, outperforming existing methods by 2.4% and 1.2% in terms of F1-score and AUROC, respectively.
arXiv Detail & Related papers (2023-05-24T13:13:26Z)
- Generative modeling of time-dependent densities via optimal transport and projection pursuit [3.069335774032178]
We propose a cheap alternative to popular deep learning algorithms for temporal modeling.
Our method is highly competitive compared with state-of-the-art solvers.
arXiv Detail & Related papers (2023-04-19T13:50:13Z)
- Linking data separation, visual separation, and classifier performance using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and the visual separation present in the projection.
We demonstrate our results on the classification of five challenging real-world image datasets of human intestinal parasites, with only 1% of samples supervised.
arXiv Detail & Related papers (2023-02-06T10:01:38Z)
- Transport with Support: Data-Conditional Diffusion Bridges [18.933928516349397]
We introduce the Iterative Smoothing Bridge (ISB) to solve constrained time-series data generation tasks.
We show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times.
arXiv Detail & Related papers (2023-01-31T13:50:16Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- FaDIn: Fast Discretized Inference for Hawkes Processes with General Parametric Kernels [82.53569355337586]
This work offers an efficient solution to temporal point process inference using general parametric kernels with finite support.
The method's effectiveness is evaluated by modeling the occurrence of stimuli-induced patterns in brain signals recorded with magnetoencephalography (MEG).
Results show that the proposed approach leads to better estimation of pattern latency than the state-of-the-art.
arXiv Detail & Related papers (2022-10-10T12:35:02Z)
- AVIDA: Alternating method for Visualizing and Integrating Data [1.6637373649145604]
AVIDA is a framework for simultaneously performing data alignment and dimension reduction.
We show that AVIDA correctly aligns high-dimensional datasets without common features.
In general applications, other methods can be used for the alignment and dimension reduction modules.
arXiv Detail & Related papers (2022-05-31T22:36:10Z)
- RENs: Relevance Encoding Networks [0.0]
This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality.
We show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
arXiv Detail & Related papers (2022-05-25T21:53:48Z)
- UnProjection: Leveraging Inverse-Projections for Visual Analytics of High-Dimensional Data [63.74032987144699]
We present NNInv, a deep learning technique with the ability to approximate the inverse of any projection or mapping.
NNInv learns to reconstruct high-dimensional data from any arbitrary point on a 2D projection space, giving users the ability to interact with the learned high-dimensional representation in a visual analytics system.
arXiv Detail & Related papers (2021-11-02T17:11:57Z)
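The inverse-projection idea lends itself to a compact sketch: fit a regressor from 2D projection coordinates back to the original feature space, then query it at arbitrary 2D locations. The dataset, network size, and use of scikit-learn's MLPRegressor below are illustrative choices, not the authors' exact architecture.
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

X = load_digits().data  # high-dimensional points
Z = TSNE(n_components=2, random_state=0).fit_transform(X)  # 2D projection

# Train a network to map 2D projection coordinates back to the
# original feature space (the inverse-projection approximation).
inv = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500,
                   random_state=0)
inv.fit(Z, X)

# Reconstruct high-dimensional data at an arbitrary 2D location.
probe = np.array([[0.0, 0.0]])
x_hat = inv.predict(probe)
```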
- High Dimensional Level Set Estimation with Bayesian Neural Network [58.684954492439424]
This paper proposes novel methods to solve high-dimensional level set estimation problems using Bayesian neural networks.
For each problem, we derive a corresponding information-theoretic acquisition function to sample data points.
Numerical experiments on both synthetic and real-world datasets show that our proposed method can achieve better results compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2020-12-17T23:21:53Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
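Data echoing itself is simple to express: when the input pipeline is the bottleneck, each fetched batch is reused several times so the optimizer keeps working while the next batch loads. A minimal generator-based sketch; the echo factor and batch source are placeholders.
```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def data_echoing(batches: Iterable[T], echo_factor: int) -> Iterator[T]:
    """Repeat each batch `echo_factor` times before fetching the next,
    trading statistical freshness for pipeline throughput."""
    for batch in batches:             # expensive: read/augment from pipeline
        for _ in range(echo_factor):  # cheap: reuse the in-memory batch
            yield batch

# Usage sketch: the optimizer consumes echoed batches as usual.
# for batch in data_echoing(loader, echo_factor=3):
#     optimizer_step(batch)
```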
- Hyperparameter Selection for Subsampling Bootstraps [0.0]
A subsampling method like BLB serves as a powerful tool for assessing the quality of estimators for massive data.
The performance of subsampling methods is highly influenced by the selection of tuning parameters.
We develop a hyperparameter selection methodology, which can be used to select tuning parameters for subsampling methods.
Both simulation studies and real-data analysis demonstrate the advantage of our method.
arXiv Detail & Related papers (2020-06-02T17:10:45Z)
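The bag of little bootstraps (BLB) pattern behind this line of work is easy to sketch: draw small subsets of size n^gamma, resample each back to full size via multinomial weights, and aggregate the per-subset estimates. The exponent gamma is exactly the kind of tuning parameter such methods must select; the values below are illustrative.
```python
import numpy as np

def blb_stderr(x, gamma=0.7, n_subsets=20, n_boot=100, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the
    mean; gamma controls the subset size n**gamma (a tuning parameter)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    b = int(n ** gamma)  # subset size
    errs = []
    for _ in range(n_subsets):
        subset = rng.choice(x, size=b, replace=False)
        stats = []
        for _ in range(n_boot):
            # Resample to full size n via multinomial weights on the subset.
            w = rng.multinomial(n, np.full(b, 1.0 / b))
            stats.append(np.average(subset, weights=w))
        errs.append(np.std(stats))
    return float(np.mean(errs))

# Usage sketch: compare candidate gamma values on the same data.
# x = np.random.default_rng(1).normal(size=100_000)
# print(blb_stderr(x, gamma=0.6), blb_stderr(x, gamma=0.8))
```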
- Optimizing Vessel Trajectory Compression [71.42030830910227]
In previous work we introduced a trajectory detection module that can provide summarized representations of vessel trajectories by consuming AIS positional messages online.
This methodology can provide reliable trajectory synopses with little deviation from the original course while discarding at least 70% of the raw data as redundant.
However, such trajectory compression is very sensitive to parametrization.
We take into account the type of each vessel in order to provide a suitable configuration that can yield improved trajectory synopses.
arXiv Detail & Related papers (2020-05-11T20:38:56Z)
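Parameter-sensitive trajectory compression of this kind can be illustrated with classic Douglas-Peucker simplification, where the distance tolerance is looked up per vessel type. The tolerance table and type names below are invented for illustration; they are not the paper's configuration.
```python
import numpy as np

# Hypothetical per-vessel-type tolerances (degrees); illustrative only.
TOLERANCE_BY_TYPE = {"cargo": 5e-4, "tanker": 5e-4, "fishing": 1e-4}

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    if np.allclose(a, b):
        return np.linalg.norm(p - a)
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / np.linalg.norm(b - a)

def douglas_peucker(points, eps):
    """Recursively drop points closer than eps to the chord."""
    if len(points) < 3:
        return points
    d = [_point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    i = int(np.argmax(d)) + 1
    if d[i - 1] > eps:
        left = douglas_peucker(points[: i + 1], eps)
        return np.vstack([left[:-1], douglas_peucker(points[i:], eps)])
    return np.vstack([points[0], points[-1]])

def compress_trajectory(points, vessel_type):
    """Simplify a trajectory with a vessel-type-specific tolerance."""
    eps = TOLERANCE_BY_TYPE.get(vessel_type, 3e-4)
    return douglas_peucker(np.asarray(points, dtype=float), eps)
```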
- Revealing the Structure of Deep Neural Networks via Convex Duality [70.15611146583068]
We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of hidden layers.
We show that a set of optimal hidden layer weights for a norm regularized training problem can be explicitly found as the extreme points of a convex set.
We apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds.
arXiv Detail & Related papers (2020-02-22T21:13:44Z)
- Support recovery and sup-norm convergence rates for sparse pivotal estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level.
We show minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multi-task square-root Lasso-type estimators.
arXiv Detail & Related papers (2020-01-15T16:11:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.