Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering
- URL: http://arxiv.org/abs/2508.18407v1
- Date: Mon, 25 Aug 2025 18:49:50 GMT
- Title: Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering
- Authors: Michal Štefánik, Timothee Mickus, Marek Kadlčík, Michal Spiegel, Josef Kuchař,
- Abstract summary: A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. We challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models. We find that different datasets used for OOD evaluations in QA provide estimates of models' robustness to shortcuts of vastly different quality, some largely under-performing even a simple, in-distribution evaluation.
- Score: 4.123456708238846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide estimates of models' robustness to shortcuts of vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset's quality for training and its quality for evaluation are largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.
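As a rough illustration of the kind of comparison the abstract describes, the sketch below (not the authors' code; the toy data and the "first capitalised word" heuristic are invented purely for illustration) scores a QA predictor on an in-distribution set, an out-of-distribution set, and on splits where a known spurious heuristic alone would succeed or fail. A model that leans on the shortcut scores well only where the heuristic happens to agree with the gold answer.

```python
# Minimal sketch: contrast ID accuracy, OOD accuracy, and accuracy on splits where a
# hypothetical spurious heuristic alone would succeed or fail. The data and heuristic
# below are toy placeholders, not the paper's datasets or shortcuts.
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str, str]  # (question, context, gold_answer)

def exact_match(predict: Callable[[str, str], str], data: Iterable[Example]) -> float:
    """Fraction of examples where the predicted string equals the gold answer."""
    data = list(data)
    hits = sum(predict(q, c).strip().lower() == a.strip().lower() for q, c, a in data)
    return hits / max(len(data), 1)

def shortcut_splits(data: Iterable[Example], heuristic: Callable[[str, str], str]):
    """Split examples by whether the spurious heuristic alone answers them correctly.
    A model relying on the shortcut should do well on `aligned` and poorly on `misaligned`."""
    aligned: List[Example] = []
    misaligned: List[Example] = []
    for q, c, a in data:
        target = aligned if heuristic(q, c).strip().lower() == a.strip().lower() else misaligned
        target.append((q, c, a))
    return aligned, misaligned

if __name__ == "__main__":
    # Toy heuristic: "the answer is the first capitalised word in the context".
    def first_capitalised(question: str, context: str) -> str:
        return next((w for w in context.split() if w[0].isupper()), "")

    id_data = [("Who wrote Hamlet?", "Shakespeare wrote Hamlet in 1601.", "Shakespeare")]
    ood_data = [("In 1601, which play did Shakespeare write?", "Shakespeare wrote Hamlet in 1601.", "Hamlet")]

    model = first_capitalised  # stand-in for a real QA system's predict function
    aligned, misaligned = shortcut_splits(id_data + ood_data, first_capitalised)
    print("ID EM:                 ", exact_match(model, id_data))
    print("OOD EM:                ", exact_match(model, ood_data))
    print("shortcut-aligned EM:   ", exact_match(model, aligned))
    print("shortcut-misaligned EM:", exact_match(model, misaligned))
```

Swapping the stand-in `model` for a real QA system's prediction function makes the gap between ID, OOD, and shortcut-misaligned scores directly comparable.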
Related papers
- OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models [48.08263342427679]
In real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy the assumption that data are independent and identically distributed. We propose OODBench, a predominantly automated method with minimal human verification. We show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common.
arXiv Detail & Related papers (2026-02-20T09:34:21Z) - Do Generalisation Results Generalise? [19.855708462203097]
We evaluate a model's performance across multiple OOD testsets throughout a fine-tuning run. We then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results.
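The partial-correlation step described in that summary can be sketched in a few lines of NumPy. This is an assumed reconstruction, not the paper's code, and the checkpoint accuracies below are invented placeholders:

```python
# Partial correlation between two OOD test-set accuracy trajectories across
# fine-tuning checkpoints, with the in-domain (ID) trajectory regressed out of both.
import numpy as np

def residualise(y: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Residuals of y after a least-squares fit on z (plus an intercept)."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def partial_corr(x: np.ndarray, y: np.ndarray, control: np.ndarray) -> float:
    """Pearson correlation of x and y with `control` regressed out of both."""
    rx, ry = residualise(x, control), residualise(y, control)
    return float(np.corrcoef(rx, ry)[0, 1])

if __name__ == "__main__":
    # Accuracies of one model at successive checkpoints (illustrative numbers only).
    id_acc = np.array([0.55, 0.63, 0.70, 0.74, 0.78, 0.80])
    ood_a  = np.array([0.40, 0.46, 0.52, 0.55, 0.57, 0.58])  # OOD test set A
    ood_b  = np.array([0.42, 0.41, 0.44, 0.43, 0.45, 0.44])  # OOD test set B
    print("raw corr(A, B):         ", float(np.corrcoef(ood_a, ood_b)[0, 1]))
    print("partial corr(A, B | ID):", partial_corr(ood_a, ood_b, id_acc))
```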
arXiv Detail & Related papers (2025-12-08T18:59:51Z) - The Best of Both Worlds: On the Dilemma of Out-of-distribution Detection [75.65876949930258]
Out-of-distribution (OOD) detection is essential for model trustworthiness.
We show that the superior OOD detection performance of state-of-the-art methods is achieved by secretly sacrificing the OOD generalization ability.
arXiv Detail & Related papers (2024-10-12T07:02:04Z) - A Survey on Evaluation of Out-of-Distribution Generalization [41.39827887375374]
Out-of-Distribution (OOD) generalization is a complex and fundamental problem.
This paper serves as the first effort to conduct a comprehensive review of OOD evaluation.
We categorize existing research into three paradigms: OOD performance testing, OOD performance prediction, and OOD intrinsic property characterization.
arXiv Detail & Related papers (2024-03-04T09:30:35Z) - Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z) - Towards Realistic Out-of-Distribution Detection: A Novel Evaluation Framework for Improving Generalization in OOD Detection [14.541761912174799]
This paper presents a novel evaluation framework for Out-of-Distribution (OOD) detection.
It aims to assess the performance of machine learning models in more realistic settings.
arXiv Detail & Related papers (2022-11-20T07:30:15Z) - Pseudo-OOD training for robust language models [78.15712542481859]
OOD detection is a key component of a reliable machine-learning model for any industry-scale application.
We propose POORE (POsthoc pseudo-Ood REgularization), which generates pseudo-OOD samples using in-distribution (IND) data.
We extensively evaluate our framework on three real-world dialogue systems, achieving new state-of-the-art in OOD detection.
arXiv Detail & Related papers (2022-10-17T14:32:02Z) - ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets [30.82918381331854]
In-distribution (ID) and out-of-distribution (OOD) performance of models in computer vision and NLP are compared.
Some studies report a frequent positive correlation, while others surprisingly never observe an inverse correlation that would indicate a necessary trade-off.
This paper shows with multiple datasets that inverse correlations between ID and OOD performance do happen in real-world data.
arXiv Detail & Related papers (2022-09-01T17:27:25Z) - Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data [30.471871571256198]
Deep network models perform excellently on in-distribution (ID) data but can fail significantly on out-of-distribution (OOD) data.
This study analyzes the problem of experimental ID testing and designs an OOD test paradigm.
arXiv Detail & Related papers (2021-11-17T15:29:07Z) - ATOM: Robustifying Out-of-distribution Detection Using Outlier Mining [51.19164318924997]
Adversarial Training with informative Outlier Mining (ATOM) improves the robustness of OOD detection.
ATOM achieves state-of-the-art performance under a broad family of classic and adversarial OOD evaluation tasks.
arXiv Detail & Related papers (2020-06-26T20:58:05Z) - On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law [78.10523907729642]
VQA-CP has become the standard OOD benchmark for visual question answering.
Most published methods rely on explicit knowledge of the construction of the OOD splits.
We show that embarrassingly-simple methods, including one that generates answers at random, surpass the state of the art on some question types.
arXiv Detail & Related papers (2020-05-19T06:45:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.