Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations
- URL: http://arxiv.org/abs/2510.24884v1
- Date: Tue, 28 Oct 2025 18:35:57 GMT
- Title: Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations
- Authors: Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi,
- Abstract summary: The accuracy-on-the-line pattern is often taken to imply that correlations that improve ID but reduce OOD performance are rare in practice. Using a simple gradient-based method, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness.
- Score: 23.364199238965075
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations - correlations that improve ID but reduce OOD performance - are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
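The abstract's idea can be illustrated with a toy sketch. Everything below (the data, the objective, the hyperparameters, the function name) is an assumption for illustration, not the authors' actual OODSelect implementation: we learn soft inclusion weights over OOD examples by gradient descent on the covariance between each model's ID accuracy and its weighted OOD accuracy, then threshold the weights into a subset on which the ID-OOD slope turns negative.

```python
# Sketch of gradient-based OOD subset selection (illustrative assumptions
# throughout; this is not the paper's released code).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 models, 500 OOD examples. Most examples follow the
# "accuracy-on-the-line" trend; a minority anti-correlates with ID accuracy.
M, N = 20, 500
id_acc = np.sort(rng.uniform(0.6, 0.95, size=M))      # ID accuracy per model
p_line = id_acc - 0.2                                  # "on the line" examples
p_anti = 1.1 - id_acc                                  # anti-correlated examples
correct = np.empty((M, N))
correct[:, :400] = rng.random((M, 400)) < p_line[:, None]
correct[:, 400:] = rng.random((M, 100)) < p_anti[:, None]

def select_anti_correlated(correct, id_acc, steps=300, lr=0.1):
    """Descend example logits to minimize cov(ID accuracy, weighted OOD accuracy)."""
    M, N = correct.shape
    a = id_acc - id_acc.mean()          # centered ID accuracies
    z = np.zeros(N)                     # one logit per OOD example
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-z))    # soft inclusion weights in (0, 1)
        S = s.sum()
        A = correct @ s / S             # weighted OOD accuracy per model
        # dJ/dz for J = (1/M) * sum_m a_m * A_m, via the chain rule through
        # the weighted mean and the sigmoid.
        grad = (a / M) @ ((correct - A[:, None]) * (s * (1.0 - s)) / S)
        z -= lr * grad / (np.abs(grad).max() + 1e-12)  # normalized step
    return 1.0 / (1.0 + np.exp(-z))

w = select_anti_correlated(correct, id_acc)
subset = w > 0.5
sub_acc = correct[:, subset].mean(axis=1)

# Compare the ID-vs-OOD slope on the full OOD set and on the selected subset.
slope_full = np.polyfit(id_acc, correct.mean(axis=1), 1)[0]
slope_sub = np.polyfit(id_acc, sub_acc, 1)[0]
print(f"slope on full OOD set:    {slope_full:+.2f}")
print(f"slope on selected subset: {slope_sub:+.2f}")
```

On the aggregate set the planted majority keeps the slope positive, while the selected subset concentrates the anti-correlated examples, mirroring the paper's claim that aggregation can hide such subsets.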
Related papers
- Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified? [11.534630666670568]
Spurious correlations, unstable statistical shortcuts a model can exploit, are expected to degrade performance out-of-distribution. We show that current practice evaluates "robustness" without truly stressing the spurious signals we seek to eliminate.
arXiv Detail & Related papers (2025-03-31T19:50:04Z) - The Best of Both Worlds: On the Dilemma of Out-of-distribution Detection [75.65876949930258]
Out-of-distribution (OOD) detection is essential for model trustworthiness.
We show that the superior OOD detection performance of state-of-the-art methods is achieved by secretly sacrificing the OOD generalization ability.
arXiv Detail & Related papers (2024-10-12T07:02:04Z) - Rethinking the Evaluation of Out-of-Distribution Detection: A Sorites Paradox [70.57120710151105]
Most existing out-of-distribution (OOD) detection benchmarks classify samples with novel labels as OOD data.
Some marginal OOD samples actually have semantic content close to the in-distribution (ID) samples, which makes determining whether a sample is OOD a Sorites Paradox.
We construct a benchmark named Incremental Shift OOD (IS-OOD) to address the issue.
arXiv Detail & Related papers (2024-06-14T09:27:56Z) - Model-free Test Time Adaptation for Out-Of-Distribution Detection [62.49795078366206]
We propose a non-parametric test-time adaptation framework for out-of-distribution detection (abbr).
abbr utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions.
We demonstrate the effectiveness of abbr through comprehensive experiments on multiple OOD detection benchmarks.
arXiv Detail & Related papers (2023-11-28T02:00:47Z) - How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models? [29.75562085178755]
We study how fine-tuning impacts OOD detection for few-shot downstream tasks.
Our results suggest that a proper choice of OOD scores is essential for CLIP-based fine-tuning.
We also show that prompt learning demonstrates the state-of-the-art OOD detection performance over the zero-shot counterpart.
arXiv Detail & Related papers (2023-06-09T17:16:50Z) - Pseudo-OOD training for robust language models [78.15712542481859]
OOD detection is a key component of a reliable machine-learning model for any industry-scale application.
We propose POORE (POsthoc pseudo-Ood REgularization), which generates pseudo-OOD samples using in-distribution (IND) data.
We extensively evaluate our framework on three real-world dialogue systems, achieving new state-of-the-art in OOD detection.
arXiv Detail & Related papers (2022-10-17T14:32:02Z) - Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift [108.30303219703845]
We find that ID-calibrated ensembles outperform the prior state-of-the-art (based on self-training) on both ID and OOD accuracy.
We analyze this method in stylized settings, and identify two important conditions for ensembles to perform well both ID and OOD.
arXiv Detail & Related papers (2022-07-18T23:14:44Z) - Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift [18.760716606922482]
We show that a similar but surprising phenomenon also holds for the agreement between pairs of neural network classifiers.
Our prediction algorithm outperforms previous methods both in shifts where agreement-on-the-line holds and, surprisingly, when accuracy is not on the line.
arXiv Detail & Related papers (2022-06-27T07:50:47Z) - Robust Out-of-distribution Detection for Neural Networks [51.19164318924997]
We show that existing detection mechanisms can be extremely brittle when evaluated on in-distribution and OOD inputs.
We propose an effective algorithm called ALOE, which performs robust training by exposing the model to both adversarially crafted inlier and outlier examples.
arXiv Detail & Related papers (2020-03-21T17:46:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.