Euclidean Distance Deflation Under High-Dimensional Heteroskedastic Noise
- URL: http://arxiv.org/abs/2507.18520v1
- Date: Thu, 24 Jul 2025 15:45:23 GMT
- Title: Euclidean Distance Deflation Under High-Dimensional Heteroskedastic Noise
- Authors: Keyi Li, Yuval Kluger, Boris Landa
- Abstract summary: We develop a principled, hyperparameter-free approach that jointly estimates the noise magnitudes and corrects the distances. Notably, when applied to single-cell RNA sequencing data, our method yields noise estimates consistent with an established model.
- Score: 9.887133861477233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pairwise Euclidean distance calculation is a fundamental step in many machine learning and data analysis algorithms. In real-world applications, however, these distances are frequently distorted by heteroskedastic noise, a prevalent form of inhomogeneous corruption characterized by variable noise magnitudes across data observations. Such noise inflates the computed distances in a nontrivial way, leading to misrepresentations of the underlying data geometry. In this work, we address the tasks of estimating the noise magnitudes per observation and correcting the pairwise Euclidean distances under heteroskedastic noise. Perhaps surprisingly, we show that in general high-dimensional settings and without assuming prior knowledge on the clean data structure or noise distribution, both tasks can be performed reliably, even when the noise levels vary considerably. Specifically, we develop a principled, hyperparameter-free approach that jointly estimates the noise magnitudes and corrects the distances. We provide theoretical guarantees for our approach, establishing probabilistic bounds on the estimation errors of both noise magnitudes and distances. These bounds, measured in the normalized $\ell_1$ norm, converge to zero at polynomial rates as both feature dimension and dataset size increase. Experiments on synthetic datasets demonstrate that our method accurately estimates distances in challenging regimes, significantly improving the robustness of subsequent distance-based computations. Notably, when applied to single-cell RNA sequencing data, our method yields noise magnitude estimates consistent with an established prototypical model, enabling accurate nearest neighbor identification that is fundamental to many downstream analyses.
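To make the inflation concrete, here is a minimal synthetic sketch, assuming isotropic Gaussian noise per observation. It is illustrative only: the correction uses the true noise magnitudes as an oracle, whereas the paper's contribution is a hyperparameter-free procedure that estimates them jointly with the distances; the error metric mirrors the normalized $\ell_1$ norm used in the guarantees.

```python
# Oracle illustration of distance deflation (not the paper's estimator):
# y_i = x_i + eps_i with eps_i ~ N(0, sigma_i^2 I_d), so in expectation
# ||y_i - y_j||^2 = ||x_i - x_j||^2 + d * (sigma_i^2 + sigma_j^2).
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 2000                                          # observations, features
X = rng.normal(size=(n, 10)) @ rng.normal(size=(10, d))  # low-rank "clean" data
sigma = rng.uniform(0.5, 2.0, size=n)                     # heteroskedastic levels
Y = X + sigma[:, None] * rng.normal(size=(n, d))          # noisy observations

def sq_dists(A):
    g = (A * A).sum(axis=1)
    return np.maximum(g[:, None] + g[None, :] - 2 * A @ A.T, 0.0)

D_clean, D_noisy = sq_dists(X), sq_dists(Y)

energy = d * sigma**2                                     # expected noise energy
D_corr = D_noisy - energy[:, None] - energy[None, :]      # deflate each pair
np.fill_diagonal(D_corr, 0.0)

off = ~np.eye(n, dtype=bool)
rel = lambda D: np.abs(D - D_clean)[off].sum() / D_clean[off].sum()
print(f"normalized l1 error: noisy {rel(D_noisy):.3f}, corrected {rel(D_corr):.3f}")
```

With oracle magnitudes the corrected distances concentrate around the clean ones as $d$ grows; the paper shows the same is achievable with estimated magnitudes, at polynomial rates in both $n$ and $d$.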
Related papers
- Robust Representation Consistency Model via Contrastive Denoising [83.47584074390842]
Randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples. We reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space.
arXiv Detail & Related papers (2025-01-22T18:52:06Z)
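The randomized-smoothing certificate mentioned in the entry above has a compact Monte Carlo form. The sketch below is a generic Cohen-et-al.-style recipe with a hypothetical stand-in classifier, not the paper's latent-space method, and it omits the confidence-interval correction used in practice.

```python
# Schematic randomized smoothing: vote over Gaussian perturbations of the input,
# then certify an l2 radius R = sigma * Phi^{-1}(p_top) for the smoothed classifier.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
sigma = 0.5                                # smoothing noise level

def base_classifier(x):
    return int(x.sum() > 0)                # stand-in for a trained network

def certify(x, n=10_000):
    noisy = x + sigma * rng.normal(size=(n, x.size))
    votes = np.array([base_classifier(z) for z in noisy])
    top = max(votes.mean(), 1 - votes.mean())
    top = min(top, 1 - 1e-4)               # avoid an infinite radius at p = 1
    return int(votes.mean() > 0.5), sigma * norm.ppf(top)

x = 0.2 + 0.1 * rng.normal(size=10)
label, radius = certify(x)
print(f"smoothed prediction {label}, certified l2 radius {radius:.3f}")
```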
- Quasi-Bayesian sequential deconvolution [7.10052009802944]
We develop a principled sequential approach to estimate $f$ in a streaming or online domain. Local and uniform Gaussian central limit theorems for $f_n$ are established, leading to credible intervals and bands for $f$. An empirical validation of our methods is presented on synthetic and real data.
arXiv Detail & Related papers (2024-08-26T16:40:04Z)
- A Bayesian Approach Toward Robust Multidimensional Ellipsoid-Specific Fitting [0.0]
This work presents a novel and effective method for fitting multidimensional ellipsoids to scattered data contaminated by noise and outliers.
We incorporate a uniform prior distribution to constrain the search for primitive parameters within an ellipsoidal domain.
We apply it to a wide range of practical applications such as microscopy cell counting, 3D reconstruction, geometric shape approximation, and magnetometer calibration tasks.
arXiv Detail & Related papers (2024-07-27T14:31:51Z)
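For contrast with the Bayesian formulation summarized above, a plain algebraic least-squares conic fit takes only a few lines (a hypothetical 2-D example). This baseline is exactly the kind of fit that degrades under outliers, which is what motivates a prior-constrained Bayesian search.

```python
# Plain algebraic conic fit in 2-D: minimize ||D w|| over ||w|| = 1, where each
# row of D holds the monomials [x^2, y^2, xy, x, y, 1] of one data point.
import numpy as np

rng = np.random.default_rng(8)
t = rng.uniform(0, 2 * np.pi, size=400)
pts = np.c_[2 * np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(400, 2))

x, y = pts.T
D = np.c_[x**2, y**2, x * y, x, y, np.ones_like(x)]
w = np.linalg.svd(D)[2][-1]          # right singular vector of smallest value
w /= -w[-1]                          # scale so the constant term equals -1
print(np.round(w, 3))                # true ellipse x^2/4 + y^2 = 1 -> [0.25, 1, 0, 0, 0, -1]
```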
- Effective Causal Discovery under Identifiable Heteroscedastic Noise Model [45.98718860540588]
Causal DAG learning has recently achieved promising performance in terms of both accuracy and efficiency.
We propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations.
We then propose an effective two-phase iterative DAG learning algorithm to address the increasing optimization difficulties.
arXiv Detail & Related papers (2023-12-20T08:51:58Z)
- Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation [80.07065346699005]
It is widely assumed that the optimal noise distribution should be made equal to the data distribution, as in Generative Adversarial Networks (GANs).
We turn to Noise-Contrastive Estimation, which frames this self-supervised task as an estimation problem for an energy-based model of the data.
We soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.
arXiv Detail & Related papers (2023-01-23T19:57:58Z)
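Two entries in this list concern the choice of noise distribution in Noise-Contrastive Estimation, so a minimal sketch may help fix ideas. The 1-D setup below is hypothetical: an unnormalized Gaussian model (mean and log-normalizer both learned) is fit by logistic discrimination against noise samples, tried with two different noise choices.

```python
# Minimal NCE: classify data against noise using the log-ratio of an unnormalized
# model to the known noise density; both noise choices should recover mu ~ 2.
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_expit          # numerically stable log(sigmoid)
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=5000)       # observations ~ N(2, 1)

def fit_nce(noise_loc, noise_scale):
    noise = rng.normal(noise_loc, noise_scale, size=5000)
    def loss(params):
        mu, log_z = params                   # model: log f(u) = -(u - mu)^2 / 2 - log_z
        g = lambda u: -0.5 * (u - mu) ** 2 - log_z - norm.logpdf(u, noise_loc, noise_scale)
        return -(log_expit(g(data)).mean() + log_expit(-g(noise)).mean())
    return minimize(loss, x0=[0.0, 0.0]).x

print("noise = data distribution:", fit_nce(2.0, 1.0))
print("noise = wider, shifted:   ", fit_nce(0.0, 3.0))
```

Both runs recover the mean; which noise distribution is statistically most efficient is exactly the question these two papers analyze.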
- Robust Inference of Manifold Density and Geometry by Doubly Stochastic Scaling [8.271859911016719]
We develop tools for robust inference under high-dimensional noise.
We show that our approach is robust to variability in technical noise levels across cell types.
arXiv Detail & Related papers (2022-09-16T15:39:11Z)
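The doubly stochastic scaling named in the title above has a short classical form: find a positive vector $d$ such that $\mathrm{diag}(d)\,K\,\mathrm{diag}(d)$ has unit row sums. A generic symmetric Sinkhorn-type sketch on a Gaussian affinity matrix follows; the paper's specific construction and robustness analysis are not reproduced here.

```python
# Symmetric Sinkhorn-type scaling: iterate d <- sqrt(d / (K d)), whose fixed
# point satisfies d_i (K d)_i = 1, i.e. diag(d) K diag(d) is doubly stochastic.
import numpy as np

def doubly_stochastic(K, n_iter=500, tol=1e-10):
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (K @ d))     # damped update for stability
        if np.max(np.abs(d_new - d)) < tol:
            d = d_new
            break
        d = d_new
    return d[:, None] * K * d[None, :]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = doubly_stochastic(np.exp(-sq / np.median(sq)))
print(W.sum(axis=1)[:5])                 # each row (and column) sums to ~1
```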
- The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from this assumption can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data's and even from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z)
- Partial Identification with Noisy Covariates: A Robust Optimization Approach [94.10051154390237]
Causal inference from observational datasets often relies on measuring and adjusting for covariates.
We show that this robust optimization approach can extend a wide range of causal adjustment methods to perform partial identification.
Across synthetic and real datasets, we find that this approach provides average treatment effect (ATE) bounds with a higher coverage probability than existing methods.
arXiv Detail & Related papers (2022-02-22T04:24:26Z)
- Fully Adaptive Bayesian Algorithm for Data Analysis, FABADA [0.0]
This paper describes a novel non-parametric noise reduction technique from the point of view of Bayesian inference.
It iteratively evaluates possible smoothed versions of the data (the smooth models), obtaining an estimate of the underlying signal.
Iterations stop based on the evidence and the $\chi^2$ statistic of the last smooth model, and the expected value of the signal is then computed.
arXiv Detail & Related papers (2022-01-13T18:54:31Z)
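The stopping idea in the FABADA summary above can be caricatured in a few lines. The loop below is a toy rendition, assuming a known noise level $\sigma$ and a fixed local smoother; it uses only the $\chi^2$ criterion and does not reproduce FABADA's Bayesian evidence computation.

```python
# Toy re-smoothing loop with a chi^2 stopping rule: keep smoothing until the
# residuals against the data are as large as expected from pure noise (chi2 ~ N).
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 512)
sigma = 0.3
data = np.sin(2 * np.pi * 3 * t) + sigma * rng.normal(size=t.size)

model = data.copy()
kernel = np.array([0.25, 0.5, 0.25])              # simple local smoother
while True:
    smoothed = np.convolve(model, kernel, mode="same")
    chi2 = np.sum((data - smoothed) ** 2) / sigma**2
    if chi2 > data.size:                          # residuals now look like noise
        break
    model = smoothed

print("chi2 / N at stopping:", chi2 / data.size)
```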
- Manifold learning with approximate nearest neighbors [1.8477401359673706]
We use a broad range of approximate nearest neighbor algorithms within manifold learning algorithms and evaluate their impact on embedding accuracy.
Via a thorough empirical investigation based on the benchmark MNIST dataset, it is shown that approximate nearest neighbors lead to substantial improvements in computational time.
This application demonstrates how the proposed methods can be used to visualize and identify anomalies and uncover underlying structure within high-dimensional data.
arXiv Detail & Related papers (2021-02-22T12:04:23Z)
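To make the accuracy-versus-speed trade-off above concrete, here is a toy single-table random-projection hash for approximate neighbor search (hypothetical data; practical pipelines use libraries such as Annoy or FAISS with many trees or tables, which is what raises recall).

```python
# Toy approximate NN: bucket points by the sign pattern of random projections,
# then search only within the query's bucket instead of over all points.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 50))

P = rng.normal(size=(50, 6))                      # 6 projections -> 64 buckets
codes = (X @ P > 0).astype(int) @ (1 << np.arange(6))

def approx_nn(i, k=10):
    cand = np.flatnonzero(codes == codes[i])      # candidates share the bucket
    d = np.linalg.norm(X[cand] - X[i], axis=1)
    return cand[np.argsort(d)[1:k + 1]]           # drop the query itself

def exact_nn(i, k=10):
    d = np.linalg.norm(X - X[i], axis=1)
    return np.argsort(d)[1:k + 1]

recall = len(set(approx_nn(0)) & set(exact_nn(0))) / 10
print(f"recall@10 for one query: {recall:.1f}")   # modest with one table
```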
- Optimal oracle inequalities for solving projected fixed-point equations [53.31620399640334]
We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space.
We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation.
arXiv Detail & Related papers (2020-12-09T20:19:32Z)
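To make the policy-evaluation setting above concrete, here is a generic TD(0) loop with linear function approximation on a hypothetical random Markov chain. TD converges to the projected fixed point in the feature subspace, which is the object the paper's oracle inequalities control.

```python
# TD(0) with linear features: theta tracks the projected fixed point of the
# Bellman operator restricted to span(Phi), not the exact value function.
import numpy as np

rng = np.random.default_rng(5)
n_states, dim, gamma = 20, 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)   # fixed-policy transitions
r = rng.normal(size=n_states)                          # expected rewards
Phi = rng.normal(size=(n_states, dim))                 # feature map

theta, s = np.zeros(dim), 0
for _ in range(200_000):
    s_next = rng.choice(n_states, p=P[s])
    td_err = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += 0.01 * td_err * Phi[s]                    # stochastic TD(0) step
    s = s_next

v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)
print("RMSE vs exact values:", np.sqrt(np.mean((Phi @ theta - v_true) ** 2)))
```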
- $\gamma$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator [95.71091446753414]
We propose to use a nearest-neighbor-based $\gamma$-divergence estimator as a data discrepancy measure.
Our method achieves significantly higher robustness than existing discrepancy measures.
arXiv Detail & Related papers (2020-06-13T06:09:27Z)
- Manifold Fitting under Unbounded Noise [4.54773250519101]
We introduce a new manifold-fitting method in which the output manifold is constructed by directly estimating the tangent spaces at the projected points on the underlying manifold.
Our method provides high-probability convergence guarantees, in terms of an upper bound on the distance between the estimated and underlying manifolds.
arXiv Detail & Related papers (2019-09-23T08:55:41Z)
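The tangent-space construction in the last entry has a simple local-PCA caricature, sketched below on a hypothetical noisy circle in $\mathbb{R}^3$: estimate each point's tangent from its neighbors and project the point onto it. The paper's actual method and its unbounded-noise guarantees are substantially more refined.

```python
# Local-PCA tangent projection: for each point, PCA over its k neighbors gives
# a tangent estimate; projecting the point onto that affine line denoises it.
import numpy as np

rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, size=1000)
clean = np.c_[np.cos(theta), np.sin(theta), np.zeros_like(theta)]
noisy = clean + 0.05 * rng.normal(size=clean.shape)

def project(i, k=20, intrinsic_dim=1):
    nbrs = noisy[np.argsort(np.linalg.norm(noisy - noisy[i], axis=1))[:k]]
    center = nbrs.mean(axis=0)
    T = np.linalg.svd(nbrs - center, full_matrices=False)[2][:intrinsic_dim]
    return center + (noisy[i] - center) @ T.T @ T     # onto the tangent line

fitted = np.array([project(i) for i in range(len(noisy))])
dev = lambda A: (np.abs(np.linalg.norm(A[:, :2], axis=1) - 1) + np.abs(A[:, 2])).mean()
print(f"mean deviation from the circle: noisy {dev(noisy):.4f}, fitted {dev(fitted):.4f}")
```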
This list is automatically generated from the titles and abstracts of the papers on this site.