Local Averaging Accurately Distills Manifold Structure From Noisy Data
- URL: http://arxiv.org/abs/2506.18761v1
- Date: Mon, 23 Jun 2025 15:32:16 GMT
- Title: Local Averaging Accurately Distills Manifold Structure From Noisy Data
- Authors: Yihan Shen, Shiyu Wang, Arnaud Lamy, Mariam Avagyan, John Wright
- Abstract summary: Local averaging is a cornerstone of state-of-the-art provable methods for manifold fitting and denoising. We provide theoretical analyses of a two-round mini-batch local averaging method applied to noisy samples drawn from a $d$-dimensional manifold. The proposed method can serve as a preprocessing step for a wide range of provable methods designed for lower-noise regimes.
- Score: 4.63748375343038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-dimensional data are ubiquitous, with examples ranging from natural images to scientific datasets, and often reside near low-dimensional manifolds. Leveraging this geometric structure is vital for downstream tasks, including signal denoising, reconstruction, and generation. However, in practice, the manifold is typically unknown and only noisy samples are available. A fundamental approach to uncovering the manifold structure is local averaging, which is a cornerstone of state-of-the-art provable methods for manifold fitting and denoising. However, to the best of our knowledge, no prior work rigorously analyzes the accuracy of local averaging in a manifold setting in high-noise regimes. In this work, we provide theoretical analyses of a two-round mini-batch local averaging method applied to noisy samples drawn from a $d$-dimensional manifold $\mathcal M \subset \mathbb{R}^D$, under a relatively high-noise regime where the noise size is comparable to the reach $\tau$. We show that with high probability, the averaged point $\hat{\mathbf q}$ achieves the bound $d(\hat{\mathbf q}, \mathcal M) \leq \sigma \sqrt{d\left(1+\frac{\kappa\,\mathrm{diam}(\mathcal M)}{\log(D)}\right)}$, where $\sigma$, $\mathrm{diam}(\mathcal M)$, and $\kappa$ denote the standard deviation of the Gaussian noise, the manifold's diameter, and a bound on its extrinsic curvature, respectively. This is the first analysis of local averaging accuracy over the manifold in the relatively high-noise regime where $\sigma \sqrt{D} \approx \tau$. The proposed method can serve as a preprocessing step for a wide range of provable methods designed for lower-noise regimes. Additionally, our framework can provide a theoretical foundation for a broad spectrum of denoising and dimensionality reduction methods that rely on local averaging techniques.
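To make the procedure concrete, here is a minimal runnable sketch of two-round mini-batch local averaging for denoising samples near a manifold. The $k$-nearest-neighbor averaging rule, the disjoint mini-batches, and all constants ($k$, $\sigma$, the unit circle as a toy manifold) are illustrative assumptions; the paper's exact neighborhood construction and batching differ.

```python
# Minimal sketch of two-round mini-batch local averaging (illustrative only).
# Toy manifold: the unit circle (d = 1) embedded in R^D; kNN averaging and
# the disjoint mini-batches are assumptions, not the paper's exact scheme.
import numpy as np

def local_average(points, queries, k):
    """Replace each query by the mean of its k nearest points."""
    out = np.empty_like(queries)
    for i, q in enumerate(queries):
        idx = np.argsort(np.linalg.norm(points - q, axis=1))[:k]
        out[i] = points[idx].mean(axis=0)
    return out

rng = np.random.default_rng(0)
n, D, sigma = 2000, 50, 0.05          # sigma*sqrt(D) ~ 0.35, reach tau = 1
theta = rng.uniform(0, 2 * np.pi, n)
clean = np.zeros((n, D))
clean[:, 0], clean[:, 1] = np.cos(theta), np.sin(theta)
noisy = clean + sigma * rng.standard_normal((n, D))

batch1, batch2 = noisy[: n // 2], noisy[n // 2:]   # disjoint mini-batches
round1 = local_average(batch1, noisy, k=50)        # first averaging round
qhat = local_average(batch2, round1, k=50)         # second, refining round

# Distance from each averaged point to the circle, d(q_hat, M).
r = np.linalg.norm(qhat[:, :2], axis=1)
dist = np.sqrt((r - 1.0) ** 2 + (qhat[:, 2:] ** 2).sum(axis=1))
print(f"mean d(q_hat, M) = {dist.mean():.4f} vs noise level sigma = {sigma}")
```

In the spirit of the abstract's bound, the averaged points should land much closer to $\mathcal M$ than the original noise level $\sigma\sqrt{D}$, at the cost of a curvature-dependent bias.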
Related papers
- Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees [56.80920351680438]
We study high-probability convergence in online learning in the presence of heavy-tailed noise. We provide guarantees for a broad class of nonlinearities, without any assumptions on the noise moments.
arXiv Detail & Related papers (2024-10-17T18:25:28Z)
- Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that uses the local curvature at a sample to adaptively define the neighborhood size.
Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
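As a rough illustration of the idea (not the paper's shape-operator estimator), the sketch below uses a local-PCA flatness score as a curvature proxy and shrinks the neighborhood where curvature is high; the probe size, the proxy, and the mapping to $k$ are all assumptions.

```python
# Hedged illustration of curvature-adaptive neighborhood sizes: flat regions
# get a larger k, curved regions a smaller one. The local-PCA curvature
# proxy and the linear mapping to k are assumptions for illustration.
import numpy as np

def adaptive_k(points, i, k_probe=20, k_min=5, k_max=40):
    """Choose a neighborhood size for point i from a local flatness score."""
    d2 = ((points - points[i]) ** 2).sum(axis=1)
    nbrs = points[np.argsort(d2)[1 : k_probe + 1]]   # exclude the point itself
    centered = nbrs - nbrs.mean(axis=0)
    evals = np.linalg.eigvalsh(centered.T @ centered / k_probe)
    # Fraction of variance off the top principal direction ~ local curvature.
    curvature = 1.0 - evals[-1] / max(evals.sum(), 1e-12)
    return int(round(k_max - (k_max - k_min) * min(1.0, 4.0 * curvature)))

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 400)
pts = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.standard_normal((400, 2))
print("adaptive k at sample 0:", adaptive_k(pts, 0))
```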
arXiv Detail & Related papers (2024-09-08T13:08:45Z)
- SGD with Clipping is Secretly Estimating the Median Gradient [19.69067856415625]
We study settings such as distributed learning with corrupted nodes, large outliers in the training data, learning under privacy constraints, and heavy-tailed noise arising from the dynamics of the algorithm itself.
We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed state-dependent noise.
We then propose an algorithm that estimates the median gradient across iterations, and find that several well-known methods, in particular different forms of clipping, are special cases of this framework.
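A minimal sketch of the first idea, replacing the mean with a coordinate-wise median of per-sample gradients, on a toy least-squares problem with heavy-tailed label noise; the objective, step size, and iteration count are illustrative assumptions, not the paper's setup.

```python
# Sketch: SGD with the coordinate-wise median of per-sample gradients, which
# stays stable under heavy-tailed (here Cauchy) label noise. Toy setup only.
import numpy as np

def median_grad_step(w, X, y, lr=0.05):
    residual = X @ w - y                     # shape (n,)
    per_sample = residual[:, None] * X       # gradient of 0.5*(x_i.w - y_i)^2
    g = np.median(per_sample, axis=0)        # robust aggregate, not the mean
    return w - lr * g

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + rng.standard_cauchy(256)    # heavy tails break the mean gradient
w = np.zeros(5)
for _ in range(300):
    w = median_grad_step(w, X, y)
print("estimation error:", np.linalg.norm(w - w_true))
```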
arXiv Detail & Related papers (2024-02-20T08:54:07Z)
- Classification Diffusion Models: Revitalizing Density Ratio Estimation [21.264497139730473]
Classification diffusion models (CDMs) are a DRE-based generative method that adopts the formalism of denoising diffusion models.
Our method is the first DRE-based technique that can successfully generate images beyond the MNIST dataset.
arXiv Detail & Related papers (2024-02-15T16:49:42Z)
- Optimal twirling depths for shadow tomography in the presence of noise [0.1227734309612871]
We consider the sample complexity as a function of the depth of the circuit, in the presence of noise.
We find that this noise has important implications for determining the optimal twirling ensemble.
These thresholds strongly constrain the search for optimal strategies to implement shadow tomography.
arXiv Detail & Related papers (2023-11-16T19:00:01Z)
- Robust Inference of Manifold Density and Geometry by Doubly Stochastic Scaling [8.271859911016719]
We develop tools for robust inference under high-dimensional noise.
We show that our approach is robust to variability in technical noise levels across cell types.
arXiv Detail & Related papers (2022-09-16T15:39:11Z)
- Manifold Free Riemannian Optimization [4.484251538832438]
A principled framework is proposed for solving optimization problems over a smooth manifold $\mathcal{M}$.
We use a noiseless sample set of the cost function $(x_i, y_i) \in \mathcal{M} \times \mathbb{R}$ and the intrinsic dimension of the manifold $\mathcal{M}$.
arXiv Detail & Related papers (2022-09-07T16:19:06Z)
- Mean-Square Analysis with An Application to Optimal Dimension Dependence of Langevin Monte Carlo [60.785586069299356]
This work provides a general framework for the non-asymptotic analysis of sampling error in the 2-Wasserstein distance.
Our theoretical analysis is further validated by numerical experiments.
arXiv Detail & Related papers (2021-09-08T18:00:05Z)
- Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors [59.33977545294148]
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function.
We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
arXiv Detail & Related papers (2021-06-29T12:49:54Z)
- Lattice partition recovery with dyadic CART [79.96359947166592]
We study piecewise-constant signals corrupted by additive Gaussian noise over a $d$-dimensional lattice.
Data of this form naturally arise in a host of applications, and the tasks of signal detection or testing, de-noising and estimation have been studied extensively in the statistical and signal processing literature.
In this paper we consider instead the problem of partition recovery, i.e., estimating the partition of the lattice induced by the constancy regions of the unknown signal.
We prove that a DCART-based procedure consistently estimates the underlying partition at a rate of order $\sigma^2 k^*$ ...
arXiv Detail & Related papers (2021-05-27T23:41:01Z)
- Learning Halfspaces with Tsybakov Noise [50.659479930171585]
We study the learnability of halfspaces in the presence of Tsybakov noise.
We give an algorithm that achieves misclassification error $\epsilon$ with respect to the true halfspace.
arXiv Detail & Related papers (2020-06-11T14:25:02Z)
- Manifold Fitting under Unbounded Noise [4.54773250519101]
We introduce a new manifold-fitting method in which the output manifold is constructed by directly estimating the tangent spaces at the projected points on the underlying manifold.
Our method provides a high-probability convergence guarantee, in the form of an upper bound on the distance between the estimated and underlying manifolds.
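The sketch below illustrates the general flavor of tangent-space-based fitting: estimate a tangent space by local PCA and project onto the resulting affine subspace. The neighborhood size, the PCA estimator, and the toy circle are assumptions, not the paper's construction.

```python
# Sketch of projection onto a locally estimated tangent space (local PCA).
# Illustrative only; the paper's estimator and guarantees are more refined.
import numpy as np

def project_to_tangent(points, x, d, k=30):
    """Project x onto the d-dim local PCA plane of its k nearest points."""
    d2 = ((points - x) ** 2).sum(axis=1)
    nbrs = points[np.argsort(d2)[:k]]
    mu = nbrs.mean(axis=0)
    _, _, vt = np.linalg.svd(nbrs - mu, full_matrices=False)
    T = vt[:d]                                # rows span the tangent estimate
    return mu + (x - mu) @ T.T @ T

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 500)
pts = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.standard_normal((500, 2))
x_hat = project_to_tangent(pts, pts[0], d=1)
print("distance to circle after projection:", abs(np.linalg.norm(x_hat) - 1.0))
```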
arXiv Detail & Related papers (2019-09-23T08:55:41Z)