Coresets for Clustering Under Stochastic Noise
- URL: http://arxiv.org/abs/2510.23438v1
- Date: Mon, 27 Oct 2025 15:41:27 GMT
- Title: Coresets for Clustering Under Stochastic Noise
- Authors: Lingxiao Huang, Zhize Li, Nisheeth K. Vishnoi, Runkai Yang, Haoyu Zhao
- Abstract summary: We study the problem of constructing coresets for $(k, z)$-clustering when the input dataset is corrupted by noise. We use surrogate error metrics that are tractable and provably related to the true clustering cost. We show that the coreset size can improve by a factor of up to $\mathrm{poly}(k)$, where $n$ is the dataset size.
- Score: 33.12252505929148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of constructing coresets for $(k, z)$-clustering when the input dataset is corrupted by stochastic noise drawn from a known distribution. In this setting, evaluating the quality of a coreset is inherently challenging, as the true underlying dataset is unobserved. To address this, we investigate coreset construction using surrogate error metrics that are tractable and provably related to the true clustering cost. We analyze a traditional metric from prior work and introduce a new error metric that more closely aligns with the true cost. Although our metric is defined independently of the noise distribution, it enables approximation guarantees that scale with the noise level. We design a coreset construction algorithm based on this metric and show that, under mild assumptions on the data and noise, enforcing an $\varepsilon$-bound under our metric yields smaller coresets and tighter guarantees on the true clustering cost than those obtained via classical metrics. In particular, we prove that the coreset size can improve by a factor of up to $\mathrm{poly}(k)$, where $n$ is the dataset size. Experiments on real-world datasets support our theoretical findings and demonstrate the practical advantages of our approach.
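The abstract's size bounds refine the classical sensitivity-sampling route to $(k, z)$-coresets. As background, here is a minimal sketch of that standard construction for the $z = 2$ (k-means) case; the paper's noise-aware error metric and its improved guarantees are not implemented here, and the sensitivity bound below is the usual over-approximation rather than the authors' construction.

```python
import numpy as np

def kmeans_pp(X, k, rng):
    """k-means++ seeding: a rough set of k centers for the sensitivity estimate."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def sensitivity_coreset(X, k, m, seed=0):
    """Sample an m-point weighted coreset via approximate sensitivities (z = 2)."""
    rng = np.random.default_rng(seed)
    C = kmeans_pp(X, k, rng)
    D = np.stack([((X - c) ** 2).sum(axis=1) for c in C])  # (k, n) squared dists
    d2, labels = D.min(axis=0), D.argmin(axis=0)           # per-point cost, cluster id
    cluster_cost = np.bincount(labels, weights=d2, minlength=k)
    cluster_size = np.bincount(labels, minlength=k)
    # Standard sensitivity upper bound: a point's share of its cluster's cost
    # plus a uniform 1/|cluster| term.
    s = d2 / (cluster_cost[labels] + 1e-12) + 1.0 / cluster_size[labels]
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, p=p)
    w = 1.0 / (m * p[idx])  # importance weights keep the coreset cost unbiased
    return X[idx], w

X = np.random.default_rng(1).normal(size=(2000, 5))
S, w = sensitivity_coreset(X, k=10, m=200)
print(S.shape, round(w.sum()))  # weights roughly sum to n
```

Under the paper's noise model, the sampling probabilities would instead be driven by the proposed surrogate error metric, which is what yields the up-to-$\mathrm{poly}(k)$ size improvement claimed in the abstract.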
Related papers
- Is Multi-Distribution Learning as Easy as PAC Learning: Sharp Rates with Bounded Label Noise [26.182166506085114]
We show that learning across $k$ distributions incurs slow rates scaling with $k/\epsilon^2$, even under constant noise levels, unless each distribution is learned separately. A key technical contribution is a structured hypothesis-testing framework that captures the statistical cost of certifying near-optimality.
arXiv Detail & Related papers (2026-02-24T16:00:15Z) - Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds [57.179679246370114]
We construct metrics that are geodesically complete while ensuring that every stationary point under the new metric remains stationary under the original one. An $\epsilon$-stationary point under the constructed metric $g'$ also corresponds to an $\epsilon$-stationary point under the original metric $g$. Experiments on a practical mesh optimization task demonstrate that our framework maintains stable convergence even in the absence of geodesic completeness.
arXiv Detail & Related papers (2026-01-12T22:08:03Z) - DAG DECORation: Continuous Optimization for Structure Learning under Hidden Confounding [0.0]
We study structure learning for linear Gaussian SEMs in the presence of latent confounding. We propose DECOR, a single likelihood-based estimator that jointly learns a DAG and a correlated noise model.
arXiv Detail & Related papers (2025-10-02T15:23:30Z) - Geometric Median Matching for Robust k-Subset Selection from Noisy Data [75.86423267723728]
We propose a novel k-subset selection strategy that leverages the Geometric Median -- a robust estimator with an optimal breakdown point of 1/2. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption.
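For reference, the geometric median itself is commonly computed with the Weiszfeld iteration; the sketch below shows only that estimator (the paper's k-subset matching loop is omitted, and this is not necessarily the authors' implementation).

```python
import numpy as np

def geometric_median(X, iters=100, eps=1e-8):
    """Weiszfeld iteration: an iteratively re-weighted mean converging to the GM."""
    y = X.mean(axis=0)                                       # start at the (non-robust) mean
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - y, axis=1), eps)   # guard against /0
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

X = np.vstack([np.random.default_rng(0).normal(size=(95, 2)),
               np.full((5, 2), 100.0)])   # 5% gross outliers
print(geometric_median(X))                # stays near the origin despite outliers
```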
arXiv Detail & Related papers (2025-04-01T09:22:05Z) - Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size.
Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
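A heavily hedged sketch of the adaptive idea follows: a PCA-residual "flatness" proxy stands in for the paper's shape-operator curvature estimate, which is not reproduced here, and the mapping from flatness to neighborhood size is an illustrative assumption.

```python
import numpy as np

def adaptive_k(X, x, k_min=3, k_max=25, probe=15):
    """Pick a per-query neighborhood size from a local flatness proxy."""
    d = np.linalg.norm(X - x, axis=1)
    nbrs = X[np.argsort(d)[:probe]]                 # fixed-size probe neighborhood
    s = np.linalg.svd(nbrs - nbrs.mean(0), compute_uv=False)
    flatness = s[:2].sum() / s.sum()                # ~1 if the patch is nearly planar
    # flat region -> a larger k is safe; curved region -> keep k small
    return int(round(k_min + (k_max - k_min) * flatness))
```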
arXiv Detail & Related papers (2024-09-08T13:08:45Z) - TANGO: Clustering with Typicality-Aware Nonlocal Mode-Seeking and Graph-Cut Optimization [2.4783546111391215]
Density-based mode-seeking methods generate a density-ascending dependency from low-density points towards higher-density neighbors. We introduce a novel concept called typicality, by exploring the locally defined dependency from a global perspective. We devise an algorithm that effectively and efficiently identifies modes with the help of the global-view typicality.
arXiv Detail & Related papers (2024-08-19T15:26:25Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
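For intuition on the distance-matrix step, the sketch below applies a Butterworth-style low-pass response elementwise to pairwise distances to obtain an affinity matrix; whether the cited paper applies the filter exactly this way is an assumption, and the multi-view tensor Schatten p-norm step is omitted.

```python
import numpy as np

def butterworth_affinity(X, d0=1.0, order=2):
    """Affinity A_ij = 1 / (1 + (d_ij / d0)^(2*order)): ~1 for d << d0, decays beyond."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    return 1.0 / (1.0 + (D / d0) ** (2 * order))
```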
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Revisiting Rotation Averaging: Uncertainties and Robust Losses [51.64986160468128]
We argue that the main problem of current methods is the minimized cost function, which is only weakly connected with the input data via the estimated epipolar geometries.
We propose to better model the underlying noise distributions by directly propagating the uncertainty from the point correspondences into the rotation averaging.
arXiv Detail & Related papers (2023-03-09T11:51:20Z) - Coresets for Time Series Clustering [33.801060211529354]
We study the problem of constructing coresets for clustering problems with time series data.
Our main contribution is an algorithm to construct coresets for a mixture model.
We empirically assess the performance of our coreset with synthetic data.
arXiv Detail & Related papers (2021-10-28T16:21:13Z) - Noise-robust Clustering [2.0199917525888895]
This paper presents noise-robust clustering techniques in unsupervised machine learning.
Uncertainty about the noise, consistency issues, and other ambiguities can become severe obstacles in data analytics.
arXiv Detail & Related papers (2021-10-17T17:15:13Z) - Spatially relaxed inference on high-dimensional linear models [48.989769153211995]
We study the properties of ensembled clustered inference algorithms, which combine spatially constrained clustering, statistical inference, and ensembling to aggregate several clustered inference solutions.
We show that ensembled clustered inference algorithms control the $\delta$-FWER under standard assumptions for $\delta$ equal to the largest cluster diameter.
arXiv Detail & Related papers (2021-06-04T16:37:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.