Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering
- URL: http://arxiv.org/abs/2508.18407v1
- Date: Mon, 25 Aug 2025 18:49:50 GMT
- Title: Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering
- Authors: Michal Štefánik, Timothee Mickus, Marek Kadlčík, Michal Spiegel, Josef Kuchař,
- Abstract summary: A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. We challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models. We find that different datasets used for OOD evaluations in QA provide estimates of models' robustness to shortcuts of vastly different quality, some largely under-performing even a simple, in-distribution evaluation.
- Score: 4.123456708238846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide estimates of models' robustness to shortcuts of vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset's quality for training and its quality for evaluation are largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.
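As a rough illustration of the kind of comparison the abstract describes, the sketch below (not the authors' code; the toy data and the "first capitalised word" heuristic are invented purely for illustration) scores a QA predictor on an in-distribution set, an out-of-distribution set, and on splits where a known spurious heuristic alone would succeed or fail. A model that leans on the shortcut scores well only where the heuristic happens to agree with the gold answer.

```python
# Minimal sketch: contrast ID accuracy, OOD accuracy, and accuracy on splits where a
# hypothetical spurious heuristic alone would succeed or fail. The data and heuristic
# below are toy placeholders, not the paper's datasets or shortcuts.
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str, str]  # (question, context, gold_answer)

def exact_match(predict: Callable[[str, str], str], data: Iterable[Example]) -> float:
    """Fraction of examples where the predicted string equals the gold answer."""
    data = list(data)
    hits = sum(predict(q, c).strip().lower() == a.strip().lower() for q, c, a in data)
    return hits / max(len(data), 1)

def shortcut_splits(data: Iterable[Example], heuristic: Callable[[str, str], str]):
    """Split examples by whether the spurious heuristic alone answers them correctly.
    A model relying on the shortcut should do well on `aligned` and poorly on `misaligned`."""
    aligned: List[Example] = []
    misaligned: List[Example] = []
    for q, c, a in data:
        target = aligned if heuristic(q, c).strip().lower() == a.strip().lower() else misaligned
        target.append((q, c, a))
    return aligned, misaligned

if __name__ == "__main__":
    # Toy heuristic: "the answer is the first capitalised word in the context".
    def first_capitalised(question: str, context: str) -> str:
        return next((w for w in context.split() if w[0].isupper()), "")

    id_data = [("Who wrote Hamlet?", "Shakespeare wrote Hamlet in 1601.", "Shakespeare")]
    ood_data = [("In 1601, which play did Shakespeare write?", "Shakespeare wrote Hamlet in 1601.", "Hamlet")]

    model = first_capitalised  # stand-in for a real QA system's predict function
    aligned, misaligned = shortcut_splits(id_data + ood_data, first_capitalised)
    print("ID EM:                 ", exact_match(model, id_data))
    print("OOD EM:                ", exact_match(model, ood_data))
    print("shortcut-aligned EM:   ", exact_match(model, aligned))
    print("shortcut-misaligned EM:", exact_match(model, misaligned))
```

Swapping the stand-in `model` for a real QA system's prediction function makes the gap between ID, OOD, and shortcut-misaligned scores directly comparable.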
Related papers
- OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models [48.08263342427679]
In real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy the assumption that data are independent and identically distributed. We propose OODBench, a predominantly automated method with minimal human verification. We show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common.
arXiv Detail & Related papers (2026-02-20T09:34:21Z) - Do Generalisation Results Generalise? [19.855708462203097]
We evaluate a model's performance across multiple OOD testsets throughout a fine-tuning run. We then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results.
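The partial-correlation step described in that summary can be sketched in a few lines of NumPy. This is an assumed reconstruction, not the paper's code, and the checkpoint accuracies below are invented placeholders:

```python
# Partial correlation between two OOD test-set accuracy trajectories across
# fine-tuning checkpoints, with the in-domain (ID) trajectory regressed out of both.
import numpy as np

def residualise(y: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Residuals of y after a least-squares fit on z (plus an intercept)."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def partial_corr(x: np.ndarray, y: np.ndarray, control: np.ndarray) -> float:
    """Pearson correlation of x and y with `control` regressed out of both."""
    rx, ry = residualise(x, control), residualise(y, control)
    return float(np.corrcoef(rx, ry)[0, 1])

if __name__ == "__main__":
    # Accuracies of one model at successive checkpoints (illustrative numbers only).
    id_acc = np.array([0.55, 0.63, 0.70, 0.74, 0.78, 0.80])
    ood_a  = np.array([0.40, 0.46, 0.52, 0.55, 0.57, 0.58])  # OOD test set A
    ood_b  = np.array([0.42, 0.41, 0.44, 0.43, 0.45, 0.44])  # OOD test set B
    print("raw corr(A, B):         ", float(np.corrcoef(ood_a, ood_b)[0, 1]))
    print("partial corr(A, B | ID):", partial_corr(ood_a, ood_b, id_acc))
```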
arXiv Detail & Related papers (2025-12-08T18:59:51Z) - The Best of Both Worlds: On the Dilemma of Out-of-distribution Detection [75.65876949930258]
Out-of-distribution (OOD) detection is essential for model trustworthiness.
We show that the superior OOD detection performance of state-of-the-art methods is achieved by secretly sacrificing the OOD generalization ability.
arXiv Detail & Related papers (2024-10-12T07:02:04Z) - A Survey on Evaluation of Out-of-Distribution Generalization [41.39827887375374]
Out-of-Distribution (OOD) generalization is a complex and fundamental problem.
This paper serves as the first effort to conduct a comprehensive review of OOD evaluation.
We categorize existing research into three paradigms: OOD performance testing, OOD performance prediction, and OOD intrinsic property characterization.
arXiv Detail & Related papers (2024-03-04T09:30:35Z) - Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z) - Towards Realistic Out-of-Distribution Detection: A Novel Evaluation Framework for Improving Generalization in OOD Detection [14.541761912174799]
This paper presents a novel evaluation framework for Out-of-Distribution (OOD) detection.
It aims to assess the performance of machine learning models in more realistic settings.
arXiv Detail & Related papers (2022-11-20T07:30:15Z) - Pseudo-OOD training for robust language models [78.15712542481859]
OOD detection is a key component of a reliable machine-learning model for any industry-scale application.
We propose POORE (POsthoc pseudo-Ood REgularization), which generates pseudo-OOD samples using in-distribution (IND) data.
We extensively evaluate our framework on three real-world dialogue systems, achieving new state-of-the-art in OOD detection.
arXiv Detail & Related papers (2022-10-17T14:32:02Z) - ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets [30.82918381331854]
In-distribution (ID) and out-of-distribution (OOD) performance of models in computer vision and NLP are compared.
Some studies report a frequent positive correlation, while others surprisingly never observe an inverse correlation that would indicate a necessary trade-off.
This paper shows with multiple datasets that inverse correlations between ID and OOD performance do happen in real-world data.
arXiv Detail & Related papers (2022-09-01T17:27:25Z) - Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data [30.471871571256198]
Deep network models perform excellently on in-distribution (ID) data but can fail significantly on out-of-distribution (OOD) data.
This study analyzes the problem of experimental ID testing and designs an OOD test paradigm.
arXiv Detail & Related papers (2021-11-17T15:29:07Z) - ATOM: Robustifying Out-of-distribution Detection Using Outlier Mining [51.19164318924997]
Adversarial Training with informative Outlier Mining (ATOM) improves the robustness of OOD detection.
ATOM achieves state-of-the-art performance under a broad family of classic and adversarial OOD evaluation tasks.
arXiv Detail & Related papers (2020-06-26T20:58:05Z) - On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law [78.10523907729642]
VQA-CP has become the standard OOD benchmark for visual question answering.
Most published methods rely on explicit knowledge of the construction of the OOD splits.
We show that embarrassingly-simple methods, including one that generates answers at random, surpass the state of the art on some question types.
arXiv Detail & Related papers (2020-05-19T06:45:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.