Linear Regression under Missing or Corrupted Coordinates
- URL: http://arxiv.org/abs/2509.19242v1
- Date: Tue, 23 Sep 2025 17:01:43 GMT
- Title: Linear Regression under Missing or Corrupted Coordinates
- Authors: Ilias Diakonikolas, Jelena Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas
- Abstract summary: We study how data may be corrupted or erased by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner.
- Score: 58.9213131489513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study multivariate linear regression under Gaussian covariates in two settings where data may be erased or corrupted by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate, a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even information-theoretically. Unlike the clean setting, where the estimation error vanishes with more samples, here the optimal error remains a positive function of the problem parameters. Our main contribution is to characterize this error up to constant factors across essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error that match the error of (computationally efficient) algorithms. A key implication is that, perhaps surprisingly, the optimal error in the missing data setting matches that in the corruption setting, so knowing the corruption locations offers no general advantage.
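As a concrete illustration of the incomplete data setting, the sketch below simulates coordinate-wise adversarial deletion and shows that naive complete-case regression stays biased no matter how many samples are drawn. All parameter values are hypothetical, the adversary is given the noise directly for simplicity, and this is not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 5000, 3, 0.1           # samples, dimension, per-coordinate budget

# Clean model: Gaussian covariates, y = X @ beta + eps
beta = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, d))
eps = 0.5 * rng.standard_normal(n)
y = X @ beta + eps

# The adversary inspects the full dataset and erases entries in up to an
# eta-fraction of samples per coordinate. Here it deletes, for each
# coordinate j, the samples where x_ij * eps_i is largest, so the surviving
# rows have covariates negatively correlated with the noise.
X_missing = X.copy()
budget = int(eta * n)
for j in range(d):
    worst = np.argsort(-(X[:, j] * eps))[:budget]
    X_missing[worst, j] = np.nan

# Naive complete-case OLS (drop every row with a missing coordinate) is
# biased; unlike the clean setting, its error does not vanish as n grows.
keep = ~np.isnan(X_missing).any(axis=1)
beta_cc, *_ = np.linalg.lstsq(X_missing[keep], y[keep], rcond=None)
```

Because the deletions depend on the noise rather than occurring at random, the surviving complete cases are a selected subsample, which is exactly why estimation error here is bounded away from zero.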
Related papers
- Distributionally Robust Optimization with Adversarial Data Contamination [49.89480853499918]
We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
arXiv Detail & Related papers (2025-07-14T18:34:10Z) - Adversarial Robustness of Nonparametric Regression [14.20104019605888]
We characterize the adversarial robustness in nonparametric regression, assuming the regression function belongs to the second-order Sobolev space. We show that, perhaps surprisingly, the classical smoothing spline estimator, when properly regularized, exhibits robustness against adversarial corruption.
arXiv Detail & Related papers (2025-05-23T00:18:20Z) - Geometric Median Matching for Robust k-Subset Selection from Noisy Data [75.86423267723728]
We propose a novel k-subset selection strategy that leverages the Geometric Median, a robust estimator with an optimal breakdown point of 1/2. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption.
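Weiszfeld's classical iteration is one standard way to compute the Geometric Median used here; a minimal sketch (not the paper's k-subset selection procedure):

```python
import numpy as np

def geometric_median(points, iters=200, eps=1e-8):
    """Weiszfeld's algorithm: find the point minimizing the sum of
    Euclidean distances to all points (breakdown point 1/2)."""
    z = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - z, axis=1), eps)
        w = 1.0 / dist                     # inverse-distance weights
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

# Robustness check: corrupt 40% of the points with large outliers.
rng = np.random.default_rng(1)
clean = rng.standard_normal((60, 2))
outliers = 100.0 + rng.standard_normal((40, 2))
data = np.vstack([clean, outliers])
gm = geometric_median(data)
# The mean is dragged far toward the outliers; the GM stays near
# the clean majority of the mass.
```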
arXiv Detail & Related papers (2025-04-01T09:22:05Z) - Corruption-Robust Offline Reinforcement Learning with General Function Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Robust Online Covariance and Sparse Precision Estimation Under Arbitrary Data Corruption [1.5850859526672516]
We introduce a modified trimmed-inner-product algorithm to robustly estimate the covariance in an online scenario.
We provide error bounds for our algorithms and establish convergence of the estimates to the true precision matrix.
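The general trimming idea behind such estimators can be sketched as follows (a generic batch version for a single inner product, not the paper's exact online algorithm):

```python
import numpy as np

def trimmed_inner_product(u, v, trim_frac=0.02):
    """Robustly estimate E[u_i * v_i]: drop the samples with the
    most extreme elementwise products before averaging."""
    prods = u * v
    k = int(trim_frac * len(prods))
    if k > 0:
        keep = np.argsort(np.abs(prods))[:len(prods) - k]
        prods = prods[keep]
    return prods.mean()

# A few grossly corrupted entries ruin the naive average but not the
# trimmed one (true value E[u^2] = 1 for standard normal u = v).
rng = np.random.default_rng(4)
u = rng.standard_normal(10000)
v = u.copy()
u[:50] = v[:50] = 1000.0               # 0.5% of entries corrupted
naive = (u * v).mean()                 # exploded by the outliers
robust = trimmed_inner_product(u, v)   # close to the true value 1
```

Applying such a trimmed estimate entrywise yields a corruption-resistant covariance estimate, at the cost of a small downward bias from trimming genuine heavy products.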
arXiv Detail & Related papers (2023-09-16T05:37:28Z) - Detecting Errors in a Numerical Response via any Regression Model [21.651775224356214]
Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values.
We introduce veracity scores that distinguish between genuine errors and natural data fluctuations.
We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors.
arXiv Detail & Related papers (2023-05-26T02:15:26Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits. We generalize our algorithm to the episodic MDP setting and obtain the first additive dependence on the corruption level $\zeta$.
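Uncertainty-weighted least squares in its generic form solves a reweighted normal equation; a minimal sketch (the weights here are set by hand, whereas the paper derives them from uncertainty estimates):

```python
import numpy as np

def uncertainty_weighted_ls(X, y, w):
    """Solve min_beta sum_i w_i * (y_i - <x_i, beta>)^2, i.e. the
    normal equations (X^T W X) beta = X^T W y with W = diag(w).
    Down-weighting high-uncertainty samples limits their influence."""
    Xw = X * w[:, None]                       # rows scaled by weights
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta
y[0] += 1000.0                                # one corrupted response

beta_naive = uncertainty_weighted_ls(X, y, np.ones(100))  # far off
w = np.ones(100)
w[0] = 1e-6                                   # near-zero weight on it
beta_hat = uncertainty_weighted_ls(X, y, w)   # recovers beta
```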
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - Conservative Policy Construction Using Variational Autoencoders for Logged Data with Missing Values [77.99648230758491]
We consider the problem of constructing personalized policies using logged data when there are missing values in the attributes of features.
The goal is to recommend an action when $\tilde{X}$, a degraded version of $X$ with missing values, is observed. In particular, we introduce the conservative strategy, where the policy is designed to safely handle the uncertainty due to missingness.
arXiv Detail & Related papers (2021-09-08T16:09:47Z) - Classification and Uncertainty Quantification of Corrupted Data using Semi-Supervised Autoencoders [11.300365160909879]
We present a probabilistic approach to classify strongly corrupted data and quantify uncertainty.
A semi-supervised autoencoder trained on uncorrupted data is the underlying architecture.
We show that the model uncertainty strongly depends on whether the classification is correct or incorrect.
arXiv Detail & Related papers (2021-05-27T18:47:55Z) - On the Error Resistance of Hinge Loss Minimization [30.808062097285706]
We identify a set of conditions on the data under which surrogate loss minimization algorithms provably learn the correct classifier.
In particular, we show that if the data is linearly classifiable with a slightly non-trivial margin, then surrogate loss minimization has negligible error on the uncorrupted data.
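The claim can be checked empirically with a minimal hinge-loss minimizer on margin-separated data (a generic subgradient-descent sketch, not the paper's analysis):

```python
import numpy as np

def hinge_loss_gd(X, y, lr=0.5, epochs=300):
    """Full-batch (sub)gradient descent on the average hinge loss
    max(0, 1 - y_i * <w, x_i>), a standard surrogate for 0-1 loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                  # points inside the margin
        grad = -(y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * grad
    return w

# Linearly classifiable data with a small but non-trivial margin:
rng = np.random.default_rng(5)
Z = rng.standard_normal((600, 2))
m = Z @ np.array([1.0, 1.0])
mask = np.abs(m) > 0.3                        # enforce the margin
X, y = Z[mask], np.sign(m[mask])

w = hinge_loss_gd(X, y)
train_acc = (np.sign(X @ w) == y).mean()      # near-perfect on clean data
```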
arXiv Detail & Related papers (2020-12-02T06:49:24Z) - A Hypergradient Approach to Robust Regression without Correspondence [85.49775273716503]
We consider a variant of the regression problem, where the correspondence between input and output data is not available.
Most existing methods are only applicable when the sample size is small.
We propose a new computational framework -- ROBOT -- for the shuffled regression problem.
arXiv Detail & Related papers (2020-11-30T21:47:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.