Robust Estimation Under Heterogeneous Corruption Rates
- URL: http://arxiv.org/abs/2508.15051v2
- Date: Tue, 30 Sep 2025 18:17:37 GMT
- Title: Robust Estimation Under Heterogeneous Corruption Rates
- Authors: Syomantak Chaudhuri, Jerry Li, Thomas A. Courtade,
- Abstract summary: We study the problem of robust estimation under heterogeneous corruption rates.<n>Existing robust estimators typically assume uniform or worst-case corruption.<n>We give tight minimax rates for all heterogeneous corruption patterns.
- Score: 8.778670369403029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of robust estimation under heterogeneous corruption rates, where each sample may be independently corrupted with a known but non-identical probability. This setting arises naturally in distributed and federated learning, crowdsourcing, and sensor networks, yet existing robust estimators typically assume uniform or worst-case corruption, ignoring structural heterogeneity. For mean estimation for multivariate bounded distributions and univariate gaussian distributions, we give tight minimax rates for all heterogeneous corruption patterns. For multivariate gaussian mean estimation and linear regression, we establish the minimax rate for squared error up to a factor of $\sqrt{d}$, where $d$ is the dimension. Roughly, our findings suggest that samples beyond a certain corruption threshold may be discarded by the optimal estimators -- this threshold is determined by the empirical distribution of the corruption rates given.
Related papers
- High-Dimensional Robust Mean Estimation with Untrusted Batches [38.14592862692954]
We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$.<n>We formalize this challenge through a double corruption landscape: an $varepsilon$-fraction of users are entirely adversarial, while the remaining good'' users provide data from distributions that are related to $P$, but deviate by a proximity parameter $$.<n>Our algorithms achieve the minimax-optimal error rate $O(sqrtvarepsilon/n + sqrtd/nN + s
arXiv Detail & Related papers (2026-02-24T08:59:37Z) - Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination [8.826173150779145]
Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set.<n>We show how IP credal sets translate into DRO objectives with interpretable tolerance levels.
arXiv Detail & Related papers (2026-01-29T06:37:36Z) - Linear Regression under Missing or Corrupted Coordinates [58.9213131489513]
We study how data may be corrupted or erased by an adversary under a coordinate-wise budget.<n>In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $eta$-fraction of samples per coordinate.<n>In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner.
arXiv Detail & Related papers (2025-09-23T17:01:43Z) - Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget [55.938644481736446]
We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability.<n>We validate our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity.
arXiv Detail & Related papers (2025-06-03T02:56:26Z) - Optimal Robust Estimation under Local and Global Corruptions: Stronger Adversary and Smaller Error [10.266928164137635]
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted.
We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average.
We show that information theoretically optimal error can indeed be achieved in time, under an even emphstronger local perturbation model.
arXiv Detail & Related papers (2024-10-22T17:51:23Z) - Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.<n>We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.<n>Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - CRIMED: Lower and Upper Bounds on Regret for Bandits with Unbounded
Stochastic Corruption [21.709840199725015]
We investigate the regret-minimisation problem in a multi-armed bandit setting with arbitrary corruptions.
We introduce CRIMED, an algorithm that achieves the exact lower bound on regret for bandits with known variance.
Notably, CRIMED can effectively handle corruptions with $varepsilon$ values as high as $frac12$.
arXiv Detail & Related papers (2023-09-28T16:19:53Z) - Robust Online Covariance and Sparse Precision Estimation Under Arbitrary
Data Corruption [1.5850859526672516]
We introduce a modified trimmed-inner-product algorithm to robustly estimate the covariance in an online scenario.
We provide the error-bound and convergence properties of the estimates to the true precision matrix under our algorithms.
arXiv Detail & Related papers (2023-09-16T05:37:28Z) - Correcting Underrepresentation and Intersectional Bias for Classification [49.1574468325115]
We consider the problem of learning from data corrupted by underrepresentation bias.
We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out rates.
We show that our algorithm permits efficient learning for model classes of finite VC dimension.
arXiv Detail & Related papers (2023-06-19T18:25:44Z) - General Gaussian Noise Mechanisms and Their Optimality for Unbiased Mean
Estimation [58.03500081540042]
A classical approach to private mean estimation is to compute the true mean and add unbiased, but possibly correlated, Gaussian noise to it.
We show that for every input dataset, an unbiased mean estimator satisfying concentrated differential privacy introduces approximately at least as much error.
arXiv Detail & Related papers (2023-01-31T18:47:42Z) - Robust Estimation for Nonparametric Families via Generative Adversarial
Networks [92.64483100338724]
We provide a framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems.
Our work extend these to robust mean estimation, second moment estimation, and robust linear regression.
In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance.
arXiv Detail & Related papers (2022-02-02T20:11:33Z) - A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms.
We develop a general framework from the perspective of Bregman minimization divergence.
We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z) - Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers [7.224832132296238]
We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples.
We show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time.
arXiv Detail & Related papers (2020-08-18T17:53:34Z) - Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable
Neural Distribution Alignment [52.02794488304448]
We propose a new distribution alignment method based on a log-likelihood ratio statistic and normalizing flows.
We experimentally verify that minimizing the resulting objective results in domain alignment that preserves the local structure of input domains.
arXiv Detail & Related papers (2020-03-26T22:10:04Z) - Error bounds in estimating the out-of-sample prediction error using
leave-one-out cross validation in high-dimensions [19.439945058410203]
We study the problem of out-of-sample risk estimation in the high dimensional regime.
Extensive empirical evidence confirms the accuracy of leave-one-out cross validation.
One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.
arXiv Detail & Related papers (2020-03-03T20:07:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.