From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching
- URL: http://arxiv.org/abs/2505.24843v2
- Date: Mon, 10 Nov 2025 16:59:58 GMT
- Title: From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching
- Authors: Ruqi Bai, Yao Ji, Zeyu Zhou, David I. Inouye
- Abstract summary: Models that learn spurious correlations from training data often fail when deployed in new environments. We propose a data-centric alternative that shifts the focus from learning invariant representations to leveraging invariant data pairs. Noisy Counterfactual Matching improves robustness by leveraging even a small number of *noisy* counterfactual pairs.
- Score: 14.679039741833305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models that learn spurious correlations from training data often fail when deployed in new environments. While many methods aim to learn invariant representations to address this, they often underperform standard empirical risk minimization (ERM). We propose a data-centric alternative that shifts the focus from learning invariant representations to leveraging invariant data pairs -- pairs of samples that should have the same prediction. We prove that certain counterfactuals naturally satisfy this invariance property. Based on this, we introduce Noisy Counterfactual Matching (NCM), a simple constraint-based method that improves robustness by leveraging even a small number of *noisy* counterfactual pairs -- improving upon prior works that do not explicitly consider noise. For linear causal models, we prove that NCM's test-domain error is bounded by its in-domain error plus a term dependent on the counterfactuals' quality and diversity. Experiments on synthetic data validate our theory, and we demonstrate NCM's effectiveness on real-world datasets.
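The abstract describes NCM only as a simple constraint-based method. As a rough illustration, the sketch below relaxes the matching constraint to a soft penalty on a linear model; the data-generating process, the penalty weight `lam`, and the plain gradient-descent solver are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 2, 10

# Synthetic linear SCM: y is caused by x_c; x_s is an anti-causal (spurious) feature.
x_c = rng.normal(size=n)
y = 2.0 * x_c + 0.5 * rng.normal(size=n)
x_s = y + 0.3 * rng.normal(size=n)
X = np.stack([x_c, x_s], axis=1)

# A few noisy counterfactual pairs: same causal part, spurious feature resampled.
Xp = X[:m].copy()
Xp[:, 1] = y[:m] + 0.3 * rng.normal(size=m)   # counterfactual spurious feature
Xp += 0.05 * rng.normal(size=(m, 2))          # pairs are observed with noise

def fit(lam, lr=0.02, steps=3000):
    """ERM plus a soft matching penalty: predictions on each counterfactual
    pair (x, x') are pulled together, which drives spurious weights to zero."""
    w = np.zeros(d)
    D = X[:m] - Xp
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n + lam * 2 * D.T @ (D @ w) / m
        w -= lr * grad
    return w

print("ERM:       w =", np.round(fit(lam=0.0), 2))   # leans on the spurious x_s
print("ERM + NCM: w =", np.round(fit(lam=50.0), 2))  # weight shifts to the causal x_c
```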
Related papers
- Robust Gaussian Processes via Relevance Pursuit [17.39376866275623]
We propose and study a GP model that achieves robustness against sparse outliers by inferring data-point-specific noise levels. We show, surprisingly, that the model can be parameterized such that the associated log marginal likelihood is strongly concave in the data-point-specific noise variances.
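The summary leaves the inference procedure abstract. The toy sketch below conveys the underlying idea, greedily inflating a data-point-specific noise variance wherever that most increases the GP log marginal likelihood; the paper itself exploits the stated concavity rather than this naive greedy search, and the kernel, lengthscale, and noise levels are illustrative:

```python
import numpy as np

def rbf(X1, X2, ls=0.2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def log_marginal(K, y, noise):
    # GP log marginal likelihood with per-point noise variances.
    C = K + np.diag(noise)
    L = np.linalg.cholesky(C)
    a = np.linalg.solve(L, y)
    return -0.5 * a @ a - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def greedy_relevance(X, y, base_noise=0.05, outlier_noise=25.0, n_rounds=2):
    """Greedily inflate the noise variance of whichever point most increases
    the marginal likelihood (a naive stand-in for the paper's optimization)."""
    K = rbf(X, X)
    noise = np.full(len(y), base_noise)
    for _ in range(n_rounds):
        cur = log_marginal(K, y, noise)
        best_gain, best_i = 0.0, None
        for i in np.where(noise == base_noise)[0]:
            trial = noise.copy()
            trial[i] = outlier_noise
            gain = log_marginal(K, y, trial) - cur
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break
        noise[best_i] = outlier_noise
    return noise

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.normal(size=30)
y[[5, 20]] += 3.0                                  # two gross outliers
print("flagged:", np.where(greedy_relevance(X, y) > 1.0)[0])
```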
arXiv Detail & Related papers (2024-10-31T17:59:56Z)
- Adapting to Shifting Correlations with Unlabeled Data Calibration [6.84735357291896]
Distribution shifts between sites can seriously degrade model performance since models are prone to exploiting unstable correlations.
We propose Generalized Prevalence Adjustment (GPA), a flexible method that adjusts model predictions to the shifting correlations between prediction target and confounders.
GPA can infer the interaction between target and confounders in new sites using unlabeled samples from those sites.
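GPA's handling of confounders is not detailed in the summary, but this family of methods builds on classical prevalence adjustment. As background, here is a minimal sketch of the standard EM prior-adjustment scheme (Saerens et al., 2002), which re-estimates class prevalence from unlabeled target-site predictions; it covers only the simpler label-shift case, not GPA itself:

```python
import numpy as np

def em_prior_adjustment(probs, train_prior, n_iter=100, tol=1e-8):
    """EM prevalence adjustment: re-estimate the target-site class prior from
    unlabeled predictions and rescale the source model's posteriors.

    probs: (n, k) source-model posteriors on unlabeled target samples.
    train_prior: (k,) class prior of the training distribution."""
    prior = train_prior.copy()
    for _ in range(n_iter):
        w = probs * (prior / train_prior)           # reweight by prior ratio
        post = w / w.sum(axis=1, keepdims=True)     # renormalized posteriors
        new_prior = post.mean(axis=0)               # updated prevalence estimate
        done = np.max(np.abs(new_prior - prior)) < tol
        prior = new_prior
        if done:
            break
    return prior, post

# Toy usage: a model trained with balanced classes, deployed where class 1 dominates.
rng = np.random.default_rng(0)
probs = rng.dirichlet([1.0, 3.0], size=500)
target_prior, adjusted = em_prior_adjustment(probs, np.array([0.5, 0.5]))
print("estimated target prior:", target_prior)
```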
arXiv Detail & Related papers (2024-09-09T18:45:43Z)
- Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [59.779795063072655]
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems.
We analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity.
arXiv Detail & Related papers (2024-08-25T04:07:18Z)
- ROTI-GCV: Generalized Cross-Validation for right-ROTationally Invariant Data [1.194799054956877]
Two key tasks in high-dimensional regularized regression are tuning the regularization strength for accurate predictions and estimating the out-of-sample risk.
We introduce a new framework, ROTI-GCV, for reliably performing cross-validation under challenging conditions.
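For context, classical GCV, which ROTI-GCV generalizes to right-rotationally invariant designs, can be written compactly for ridge regression. The following sketch is the textbook estimator, not the paper's new one:

```python
import numpy as np

def gcv_ridge(X, y, lams):
    """Classical GCV for ridge: pick lambda minimizing
    GCV(lam) = (1/n) ||y - y_hat||^2 / (1 - tr(S_lam)/n)^2,
    where S_lam = X (X^T X + lam I)^{-1} X^T is the smoother matrix."""
    n, _ = X.shape
    # Economy SVD lets us score all lambdas cheaply.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    scores = []
    for lam in lams:
        shrink = s**2 / (s**2 + lam)        # eigenvalues of the smoother matrix
        y_hat = U @ (shrink * Uty)
        resid = y - y_hat
        dof = shrink.sum()                  # tr(S_lam), the effective dof
        scores.append((resid @ resid / n) / (1 - dof / n) ** 2)
    return lams[int(np.argmin(scores))], scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta = rng.normal(size=20)
y = X @ beta + rng.normal(size=100)
best, _ = gcv_ridge(X, y, np.logspace(-3, 3, 25))
print("GCV-selected lambda:", best)
```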
arXiv Detail & Related papers (2024-06-17T15:50:00Z)
- The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing [9.551225697705199]
This paper studies the implicit bias of stochastic gradient descent (SGD) over heterogeneous data and shows that this implicit bias drives the model toward an invariant solution. Specifically, we theoretically investigate the multi-environment low-rank matrix sensing problem, where in each environment the signal comprises (i) a lower-rank invariant part shared across all environments, and (ii) a significantly varying, environment-dependent spurious component. The key insight is that simply running large-step-size, large-batch SGD sequentially in each environment, without any explicit regularization, produces oscillations from the heterogeneity that provably prevent the model from learning the spurious signals.
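A toy instantiation of this multi-environment matrix sensing setup is sketched below: a shared rank-one invariant signal plus per-environment spurious components, fit by alternating full-batch gradient steps on a factored objective. The scales, step size, and step count are illustrative assumptions and do not reproduce the paper's large-step-size analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_meas, lr = 10, 400, 0.002

# Invariant rank-1 signal shared by all environments, plus a per-environment
# spurious rank-1 component.
u = rng.normal(size=(d, 1)); u /= np.linalg.norm(u)
M_inv = 5.0 * u @ u.T
envs = []
for _ in range(2):
    v = rng.normal(size=(d, 1)); v /= np.linalg.norm(v)
    M_e = M_inv + 3.0 * v @ v.T
    A = rng.normal(size=(n_meas, d, d))
    A = (A + A.transpose(0, 2, 1)) / 2          # symmetric sensing matrices
    y = np.einsum("nij,ij->n", A, M_e)
    envs.append((A, y))

# Alternate full-batch gradient steps across environments on the factored
# objective  f(W) = mean_n (<A_n, W W^T> - y_n)^2.
W = 0.1 * rng.normal(size=(d, d))
for step in range(400):
    A, y = envs[step % 2]
    resid = np.einsum("nij,ij->n", A, W @ W.T) - y
    W -= lr * 4 * (np.einsum("n,nij->ij", resid, A) / n_meas) @ W

# Alignment of the learned top direction with the invariant direction u.
evals, evecs = np.linalg.eigh(W @ W.T)
print("|<top eigvec, u>| =", abs(evecs[:, -1] @ u[:, 0]))
```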
arXiv Detail & Related papers (2024-03-03T07:38:24Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Effective Causal Discovery under Identifiable Heteroscedastic Noise Model [45.98718860540588]
Causal DAG learning has recently achieved promising performance in terms of both accuracy and efficiency.
We propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations.
We then propose an effective two-phase iterative DAG learning algorithm to address the increasing optimization difficulties.
arXiv Detail & Related papers (2023-12-20T08:51:58Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
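The exact iterative procedure is not spelled out in the summary; a generic minority-majority mixing step in the spirit described might look like the sketch below, where the mixing distribution and its parameters are assumptions:

```python
import numpy as np

def mix_minority_majority(X_min, X_maj, n_new, alpha=0.75, rng=None):
    """Create synthetic minority samples as convex mixtures of one minority and
    one majority sample, with the mixing weight biased toward the minority side.
    A generic sketch, not the paper's exact iterative scheme."""
    if rng is None:
        rng = np.random.default_rng()
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_maj), size=n_new)
    lam = rng.beta(alpha, alpha, size=(n_new, 1))
    lam = np.maximum(lam, 1.0 - lam)        # keep the minority sample dominant
    return lam * X_min[i] + (1.0 - lam) * X_maj[j]

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(500, 2))     # majority class
X_min = rng.normal(3.0, 1.0, size=(25, 2))      # minority class
X_syn = mix_minority_majority(X_min, X_maj, n_new=475, rng=rng)
print(X_syn.shape)                               # (475, 2) new minority-labeled samples
```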
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Faithful Heteroscedastic Regression with Neural Networks [2.2835610890984164]
Parametric methods that employ neural networks for parameter maps can capture complex relationships in the data.
We make two simple modifications to optimization to produce a heteroscedastic model with mean estimates that are provably as accurate as those from its homoscedastic counterpart.
Our approach provably retains the accuracy of an equally flexible mean-only model while also offering best-in-class variance calibration.
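The two modifications are not specified in the summary. One generic construction consistent with the claim is sketched below: the mean head is trained with unweighted squared error while the variance head fits detached residuals, so variance fitting sends no gradient through the mean prediction. It is not necessarily the authors' exact recipe:

```python
import torch
import torch.nn as nn

class HeteroscedasticMLP(nn.Module):
    """Shared trunk with separate mean and log-variance heads."""
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mean = nn.Linear(d_hidden, 1)
        self.log_var = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.mean(h), self.log_var(h)

def faithful_loss(mu, log_var, y):
    # Mean head: plain (unweighted) squared error, as in a mean-only model.
    mse = (y - mu).pow(2).mean()
    # Variance head: Gaussian NLL on residuals detached from the mean, so this
    # term backpropagates only into the variance path.
    r2 = (y - mu).detach().pow(2)
    nll = 0.5 * (log_var + r2 * torch.exp(-log_var)).mean()
    return mse + nll

torch.manual_seed(0)
x = torch.randn(256, 3)
y = x[:, :1] + (0.1 + x[:, :1].abs()) * torch.randn(256, 1)  # input-dependent noise
model = HeteroscedasticMLP(3)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    mu, log_var = model(x)
    loss = faithful_loss(mu, log_var, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```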
arXiv Detail & Related papers (2022-12-18T22:34:42Z)
- PAC Generalization via Invariant Representations [41.02828564338047]
We consider the notion of $\epsilon$-approximate invariance in a finite-sample setting.
Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees.
Our results show bounds that do not scale in ambient dimension when intervention sites are restricted to lie in a constant size subset of in-degree bounded nodes.
arXiv Detail & Related papers (2022-05-30T15:50:14Z)
- Learn from Unpaired Data for Image Restoration: A Variational Bayes Approach [18.007258270845107]
We propose LUD-VAE, a deep generative method to learn the joint probability density function from data sampled from marginal distributions.
We apply our method to real-world image denoising and super-resolution tasks and train the models using the synthetic data generated by the LUD-VAE.
arXiv Detail & Related papers (2022-04-21T13:27:17Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Invariant Ancestry Search [6.583725235299022]
We introduce the concept of minimal invariance and propose invariant ancestry search (IAS).
In its population version, IAS outputs a set which contains only ancestors of the response and is a superset of the output of ICP.
We develop scalable algorithms and perform experiments on simulated and real data.
arXiv Detail & Related papers (2022-02-02T08:28:00Z)
- Training on Test Data with Bayesian Adaptation for Covariate Shift [96.3250517412545]
Deep neural networks often make inaccurate predictions with unreliable uncertainty estimates.
We derive a Bayesian model that provides for a well-defined relationship between unlabeled inputs under distributional shift and model parameters.
We show that our method improves both accuracy and uncertainty estimation.
arXiv Detail & Related papers (2021-09-27T01:09:08Z)
- Robust Bayesian Inference for Discrete Outcomes with the Total Variation Distance [5.139874302398955]
Models of discrete-valued outcomes are easily misspecified if the data exhibit zero-inflation, overdispersion or contamination.
Here, we introduce a robust discrepancy-based Bayesian approach using the Total Variation Distance (TVD).
We empirically demonstrate that our approach is robust and significantly improves predictive performance on a range of simulated and real world data.
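To see why a TVD-based discrepancy confers robustness, consider a minimal point-estimation analogue; the paper works with a full Bayesian posterior, and the Poisson model and contamination levels below are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

def tvd_fit_poisson(counts, max_k=200):
    """Fit a Poisson rate by minimizing the Total Variation Distance
    TVD(p, q) = 0.5 * sum_k |p_k - q_k| between the empirical pmf and the
    model pmf. A point-estimate sketch of the discrepancy, not the paper's
    Bayesian procedure."""
    ks = np.arange(max_k + 1)
    emp = np.bincount(counts, minlength=max_k + 1)[: max_k + 1] / len(counts)

    def tvd(lam):
        return 0.5 * np.abs(emp - poisson.pmf(ks, lam)).sum()

    return minimize_scalar(tvd, bounds=(1e-3, 50), method="bounded").x

rng = np.random.default_rng(0)
clean = rng.poisson(3.0, size=950)
outliers = rng.poisson(40.0, size=50)          # 5% contamination
data = np.concatenate([clean, outliers])
print("MLE (mean):  ", data.mean())            # dragged upward by contamination
print("TVD estimate:", tvd_fit_poisson(data))  # stays near the clean rate of 3
```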
arXiv Detail & Related papers (2020-10-26T09:53:06Z)
- Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.