Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
- URL: http://arxiv.org/abs/2504.00186v1
- Date: Mon, 31 Mar 2025 19:50:04 GMT
- Title: Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
- Authors: Olawale Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo
- Abstract summary: Conventional wisdom suggests that models relying on spurious correlations will fail to generalize out-of-distribution. We show that many widely used benchmarks for evaluating robustness to spurious correlations are misspecified. We highlight the need to rethink how robustness to spurious correlations is assessed, identify well-specified benchmarks the field should prioritize, and enumerate strategies for designing future benchmarks that meaningfully reflect robustness under distribution shift.
- Score: 11.534630666670568
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Spurious correlations are unstable statistical associations that hinder robust decision-making. Conventional wisdom suggests that models relying on such correlations will fail to generalize out-of-distribution (OOD), especially under strong distribution shifts. However, empirical evidence challenges this view as naive in-distribution empirical risk minimizers often achieve the best OOD accuracy across popular OOD generalization benchmarks. In light of these results, we propose a different perspective: many widely used benchmarks for evaluating robustness to spurious correlations are misspecified. Specifically, they fail to include shifts in spurious correlations that meaningfully impact OOD generalization, making them unsuitable for evaluating the benefit of removing such correlations. We establish conditions under which a distribution shift can reliably assess a model's reliance on spurious correlations. Crucially, under these conditions, we should not observe a strong positive correlation between in-distribution and OOD accuracy, often called "accuracy on the line." Yet, most state-of-the-art benchmarks exhibit this pattern, suggesting they do not effectively assess robustness. Our findings expose a key limitation in current benchmarks used to evaluate domain generalization algorithms, that is, models designed to avoid spurious correlations. We highlight the need to rethink how robustness to spurious correlations is assessed, identify well-specified benchmarks the field should prioritize, and enumerate strategies for designing future benchmarks that meaningfully reflect robustness under distribution shift.
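The paper's diagnostic is directly computable: pool many models' ID and OOD test accuracies and check whether they fall on a strong linear trend. A minimal sketch of that check, with hypothetical accuracy values and probit scaling as used in the accuracy-on-the-line literature (an assumption about the exact test):

```python
import numpy as np
from scipy.stats import norm, pearsonr

def accuracy_on_the_line(id_acc, ood_acc, threshold=0.95):
    """Check whether a benchmark exhibits 'accuracy on the line':
    a strong linear correlation between probit-transformed ID and
    OOD accuracies across a pool of models."""
    to_probit = lambda a: norm.ppf(np.clip(a, 1e-6, 1 - 1e-6))
    r, _ = pearsonr(to_probit(id_acc), to_probit(ood_acc))
    return r, r >= threshold

# Hypothetical accuracies for a pool of five models (illustrative only).
id_acc = np.array([0.72, 0.80, 0.85, 0.90, 0.94])
ood_acc = np.array([0.55, 0.63, 0.69, 0.75, 0.81])
r, on_line = accuracy_on_the_line(id_acc, ood_acc)
print(f"probit-scale correlation r = {r:.3f}; on the line: {on_line}")
```

By the paper's argument, a benchmark that passes this check is a poor probe of reliance on spurious correlations, since removing such correlations should break the ID-OOD linear trend.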
Related papers
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions.
This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context.
We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement.
To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z) - Mitigating Spurious Correlations via Disagreement Probability [4.8884049398279705]
Models trained with empirical risk minimization (ERM) are prone to be biased towards spurious correlations between target labels and bias attributes.
We introduce a training objective designed to robustly enhance model performance across all data samples.
We then derive a debiasing method, Disagreement Probability based Resampling for debiasing (DPR), which does not require bias labels.
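One plausible instantiation of disagreement-based resampling, sketched under the assumption that a biased ERM reference model's disagreement with the ground-truth label flags bias-conflicting samples; the paper's exact estimator may differ:

```python
import numpy as np

def disagreement_resample_weights(ref_probs, labels):
    """Sampling weights proportional to the reference model's
    disagreement probability with the label (1 - p_true); samples a
    biased model gets wrong are likely bias-conflicting and are
    oversampled. No bias labels are needed."""
    disagreement = 1.0 - ref_probs[np.arange(len(labels)), labels]
    weights = disagreement + 1e-8  # keep every sample reachable
    return weights / weights.sum()

# Hypothetical reference-model outputs for four samples, two classes.
ref_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
w = disagreement_resample_weights(ref_probs, labels)
batch = np.random.choice(len(labels), size=8, p=w)  # debiased resampling
```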
arXiv Detail & Related papers (2024-11-04T02:44:04Z) - Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation [70.36344590967519]
We show that noisy data and nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon.
We demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
arXiv Detail & Related papers (2024-06-27T09:57:31Z) - Assessing Model Generalization in Vicinity [34.86022681163714]
This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels.
We propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample.
The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy.
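A minimal sketch of this vicinal assessment, assuming neighborhood-averaged softmax confidence as the label-free correctness proxy (the paper's exact aggregation may differ):

```python
import numpy as np

def vicinal_score(features, probs, k=5):
    """Smooth each test sample's prediction with the responses of its k
    nearest neighbors in feature space, then average the resulting max
    probabilities as a label-free proxy for test accuracy.

    features: (n, d) embeddings of the test samples.
    probs:    (n, c) softmax outputs of the classifier.
    """
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]    # k nearest, including self
    smoothed = probs[nn].mean(axis=1)     # (n, c) vicinal average
    return smoothed.max(axis=1).mean()
```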
arXiv Detail & Related papers (2024-06-13T15:58:37Z) - MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts [25.643876327918544]
Leveraging the models' outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution samples.
Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias.
We propose MaNo, which applies a data-dependent normalization to the logits to reduce prediction bias, and takes the $L_p$ norm of the matrix of normalized logits as the estimation score.
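A sketch in the spirit of this score, using plain softmax as a stand-in for the paper's data-dependent normalization (an assumption; the published normalization differs in detail):

```python
import numpy as np

def mano_style_score(logits, p=4):
    """Normalize the (n, k) logit matrix row-wise, then reduce it to a
    single bounded score via an entrywise L_p norm; higher scores are
    read as higher estimated OOD accuracy."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n, k = probs.shape
    return (probs ** p).sum() ** (1 / p) / (n * k) ** (1 / p)
```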
arXiv Detail & Related papers (2024-05-29T10:45:06Z) - Mixture Data for Training Cannot Ensure Out-of-distribution Generalization [21.801115344132114]
We quantitatively redefine OOD data as data situated outside the convex hull of the mixed training data.
Under this definition, we show that increasing the size of the training data does not always reduce the test generalization error.
Our proof of the new risk bound shows that the efficacy of well-trained models can be guaranteed only for unseen data within this convex hull.
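The convex-hull criterion is directly checkable with a small feasibility linear program; a minimal sketch:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, X):
    """Return True iff x lies in the convex hull of the rows of X,
    i.e. there exist lam >= 0 with sum(lam) = 1 and X.T @ lam = x.
    Under the paper's definition, infeasibility marks x as OOD."""
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy training data
print(in_convex_hull(np.array([0.3, 0.3]), X))  # True: in-distribution
print(in_convex_hull(np.array([1.0, 1.0]), X))  # False: OOD by this definition
```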
arXiv Detail & Related papers (2023-12-25T11:00:38Z) - Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision-language models.
We show that both OOD classification and OOD calibration errors share an upper bound consisting of two terms measured on ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z) - Test-Time Adaptation Induces Stronger Accuracy and Agreement-on-the-Line [65.14099135546594]
Recent test-time adaptation (TTA) methods drastically strengthen the accuracy-on-the-line (ACL) and agreement-on-the-line (AGL) trends in models, even under shifts where models previously showed very weak correlations.
Our results show that by combining TTA with AGL-based estimation methods, we can estimate the OOD performance of models with high precision for a broader set of distribution shifts.
arXiv Detail & Related papers (2023-10-07T23:21:25Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
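The workhorse of quantile-regression-based counterfactual inference is the pinball loss, which makes a regressor estimate a chosen conditional quantile; a minimal sketch (the paper's network architecture and counterfactual machinery are omitted):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: asymmetric absolute error whose
    minimizer is the tau-th conditional quantile of y."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# tau = 0.5 recovers (half) the mean absolute error, i.e. the median.
print(pinball_loss(np.array([1.0, 2.0]), np.array([1.5, 1.5]), tau=0.9))
```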
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with attack-based model rankings on RobustBench while costing significantly less to compute.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
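A schematic of a generative robustness score in this spirit, assuming a clipped top-two probability margin on generated samples as the per-sample robustness proxy (the certified-radius constant from the paper is omitted):

```python
import numpy as np

def great_style_score(probs, labels):
    """Average, over samples drawn from a generative model, the clipped
    gap between the true-class probability and the best other class;
    larger average margins are read as greater global robustness."""
    n = len(labels)
    true_p = probs[np.arange(n), labels]
    rivals = probs.copy()
    rivals[np.arange(n), labels] = -np.inf
    margin = true_p - rivals.max(axis=1)
    return np.maximum(margin, 0.0).mean()
```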
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Free Lunch for Generating Effective Outlier Supervision [46.37464572099351]
We propose an ultra-effective method to generate near-realistic outlier supervision.
Our proposed BayesAug significantly reduces the false positive rate by over 12.50% compared with previous schemes.
arXiv Detail & Related papers (2023-01-17T01:46:45Z) - Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift [108.30303219703845]
We find that ID-calibrated ensembles outperform the prior state-of-the-art (based on self-training) in both ID and OOD accuracy.
We analyze this method in stylized settings and identify two important conditions for ensembles to perform well both ID and OOD.
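A minimal sketch of an ID-calibrated ensemble, assuming per-member temperature scaling on held-out ID data as the calibration step:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on held-out ID data."""
    nll = [-np.log(softmax(val_logits / T)[np.arange(len(val_labels)),
                                           val_labels] + 1e-12).mean()
           for T in grid]
    return grid[int(np.argmin(nll))]

def id_calibrated_ensemble(member_logits, member_val_logits, val_labels):
    """Temperature-calibrate each member on ID validation data, then
    average the calibrated probabilities."""
    temps = [fit_temperature(v, val_labels) for v in member_val_logits]
    return np.mean([softmax(m / T) for m, T in zip(member_logits, temps)],
                   axis=0)
```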
arXiv Detail & Related papers (2022-07-18T23:14:44Z) - Improved OOD Generalization via Conditional Invariant Regularizer [43.62211060412388]
We show that, given a class label, models that are conditionally independent of spurious attributes generalize OOD.
Based on this, we propose the metric Conditional Spurious Variation (CSV), which controls the OOD error, to measure such conditional independence.
An algorithm with a minimax-optimal convergence rate is proposed to solve the resulting problem.
arXiv Detail & Related papers (2022-07-14T06:34:21Z) - Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift [18.760716606922482]
We show a similar but surprising phenomenon also holds for the agreement between pairs of neural network classifiers.
Our prediction algorithm outperforms previous methods both in shifts where agreement-on-the-line holds and, surprisingly, when accuracy is not on the line.
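A sketch of the resulting label-free estimator, assuming probit-scale linear fits as in the accuracy/agreement-on-the-line literature: since agreement between classifier pairs needs no labels, its ID-to-OOD trend can be fit directly and reused to map ID accuracy to predicted OOD accuracy.

```python
import numpy as np
from scipy.stats import norm

def predict_ood_accuracy(id_agree, ood_agree, id_acc):
    """Fit the probit-scale line between ID and OOD agreement rates of
    classifier pairs (computable without OOD labels), then apply the
    same line to each model's probit ID accuracy."""
    t = lambda a: norm.ppf(np.clip(a, 1e-6, 1 - 1e-6))
    slope, intercept = np.polyfit(t(id_agree), t(ood_agree), 1)
    return norm.cdf(slope * t(id_acc) + intercept)
```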
arXiv Detail & Related papers (2022-06-27T07:50:47Z) - Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance.
By definition, the proposed SCORE (self-consistent robust error) facilitates the reconciliation between robustness and accuracy while still handling worst-case uncertainty.
arXiv Detail & Related papers (2022-02-21T10:36:09Z) - Unveiling Project-Specific Bias in Neural Code Models [20.131797671630963]
Neural code models built on Large Language Models (LLMs) often struggle to generalize effectively to real-world, inter-project out-of-distribution (OOD) data.
We show that this phenomenon is caused by the heavy reliance on project-specific shortcuts for prediction instead of ground-truth evidence.
We propose a novel bias mitigation mechanism that regularizes the model's learning behavior by leveraging latent logic relations among samples.
arXiv Detail & Related papers (2022-01-19T02:09:48Z) - Provably Robust Detection of Out-of-distribution Data (almost) for free [124.14121487542613]
Deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data.
In this paper we propose a novel method where from first principles we combine a certifiable OOD detector with a standard classifier into an OOD aware classifier.
In this way we achieve the best of two worlds: certifiably adversarially robust OOD detection, even for OOD samples close to the in-distribution, without loss in prediction accuracy and close to state-of-the-art OOD detection performance for non-manipulated OOD data.
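A non-certified schematic of coupling a detector with a classifier; the paper's construction is provably robust, which this interface-only sketch is not:

```python
import numpy as np

def ood_aware_predict(cls_probs, ood_scores, threshold):
    """Return the class prediction where the detector deems the input
    in-distribution (score >= threshold) and a reject label (-1)
    otherwise. Scores and threshold are assumed given."""
    preds = cls_probs.argmax(axis=1)
    return np.where(ood_scores >= threshold, preds, -1)
```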
arXiv Detail & Related papers (2021-06-08T11:40:49Z) - Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z) - Learn what you can't learn: Regularized Ensembles for Transductive Out-of-distribution Detection [76.39067237772286]
We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios.
This paper studies how such "hard" OOD scenarios can benefit from adjusting the detection method after observing a batch of the test data.
We propose a novel method that uses an artificial labeling scheme for the test data and regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch.
arXiv Detail & Related papers (2020-12-10T16:55:13Z) - Beyond Marginal Uncertainty: How Accurately can Bayesian Regression Models Estimate Posterior Predictive Correlations? [13.127549105535623]
It is often more useful to estimate predictive correlations between the function values at different input locations.
We first consider a downstream task that depends on posterior predictive correlations: transductive active learning (TAL).
Since TAL is too expensive and indirect to guide development of algorithms, we introduce two metrics which more directly evaluate the predictive correlations.
arXiv Detail & Related papers (2020-11-06T03:48:59Z) - Latent Causal Invariant Model [128.7508609492542]
Current supervised learning can learn spurious correlations during the data-fitting process.
We propose a Latent Causal Invariance Model (LaCIM) which pursues causal prediction.
arXiv Detail & Related papers (2020-11-04T10:00:27Z) - Learning Causal Semantic Representation for Out-of-Distribution Prediction [125.38836464226092]
We propose a Causal Semantic Generative model (CSG) based on causal reasoning, so that the semantic and variation factors are modeled separately.
We show that CSG can identify the semantic factor by fitting training data, and this semantic-identification guarantees the boundedness of OOD generalization error.
arXiv Detail & Related papers (2020-11-03T13:16:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.