Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
- URL: http://arxiv.org/abs/2306.04618v2
- Date: Thu, 26 Oct 2023 06:59:50 GMT
- Title: Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
- Authors: Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou,
Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun
- Abstract summary: This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
- Score: 111.88727295707454
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper reexamines the research on out-of-distribution (OOD) robustness in
the field of NLP. We find that the distribution shift settings in previous
studies commonly lack adequate challenges, hindering the accurate evaluation of
OOD robustness. To address these issues, we propose a benchmark construction
protocol that ensures clear differentiation and challenging distribution
shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution
robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we
conduct a series of experiments on pre-trained language models for analysis and
evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the
relationship between in-distribution (ID) and OOD performance. We identify
three typical relationship types that unveil the inner learning mechanism and
could potentially facilitate forecasting OOD robustness from advancements on
ID datasets. Then, we evaluate 5 classic methods on BOSS and find that,
despite exhibiting some effectiveness in specific cases, they do not offer
significant improvements over vanilla fine-tuning. Further, we evaluate 5 LLMs
under various adaptation paradigms and find that when sufficient ID data is
available, fine-tuned domain-specific models significantly outperform LLMs on
ID examples. For OOD instances, however, prioritizing LLMs with in-context
learning yields better results. We find that both fine-tuned small models and
LLMs still face challenges in effectively addressing downstream tasks. The
code is publicly available at https://github.com/lifan-yuan/OOD_NLP.
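To make the protocol concrete, below is a minimal sketch of the ID-vs-OOD evaluation loop the abstract describes. It is not the BOSS code: the fine-tuned checkpoint name is hypothetical, and SST-2 / IMDB merely stand in for an ID/OOD sentiment pair.
```python
# Minimal sketch of an ID-vs-OOD evaluation loop (not the BOSS code).
# "my-org/roberta-finetuned-sst2" is a hypothetical ID-finetuned checkpoint.
from datasets import load_dataset
from transformers import pipeline

def accuracy(clf, texts, labels):
    """Fraction of examples whose predicted class matches the gold label."""
    preds = clf(texts, truncation=True)
    # Assumes the default head label names of the form "LABEL_<id>".
    return sum(
        int(p["label"].split("_")[-1]) == y for p, y in zip(preds, labels)
    ) / len(labels)

clf = pipeline("text-classification", model="my-org/roberta-finetuned-sst2")

id_data = load_dataset("glue", "sst2", split="validation")  # ID: the tuning domain
ood_data = load_dataset("imdb", split="test[:1000]")        # OOD: a shifted domain

print("ID accuracy: ", accuracy(clf, id_data["sentence"], id_data["label"]))
print("OOD accuracy:", accuracy(clf, ood_data["text"], ood_data["label"]))
```
Swapping the classifier for an in-context-learning prompt over an LLM turns the same loop into the fine-tuning-vs-ICL comparison from the abstract.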
Related papers
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating uncertainty in large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to perform best.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
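A minimal sketch of the adaptive label smoothing idea in the UAL entry above, as we read it (not the authors' code): samples with higher uncertainty receive softer training targets.
```python
# Sketch of per-sample adaptive label smoothing in the spirit of UAL.
# How `uncertainty` is estimated is a separate design choice.
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_smooth=0.2):
    """Cross-entropy against per-sample-smoothed targets.

    logits:      (batch, num_classes) model outputs
    targets:     (batch,) gold class indices
    uncertainty: (batch,) per-sample uncertainty in [0, 1]
    """
    num_classes = logits.size(-1)
    eps = max_smooth * uncertainty.unsqueeze(1)        # (batch, 1) smoothing
    one_hot = F.one_hot(targets, num_classes).float()
    soft = one_hot * (1 - eps) + eps / num_classes     # soft targets sum to 1
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Confident samples keep near-one-hot targets; uncertain ones are smoothed.
loss = ual_loss(torch.randn(4, 3), torch.tensor([0, 2, 1, 0]),
                torch.tensor([0.0, 0.9, 0.5, 0.1]))
```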
- Your Finetuned Large Language Model is Already a Powerful Out-of-distribution Detector [17.305076703258813]
We revisit the likelihood ratio between a pretrained large language model (LLM) and its finetuned variant as a criterion for out-of-distribution (OOD) detection.
We show, for the first time, that this likelihood ratio can serve as an effective OOD detector.
arXiv Detail & Related papers (2024-04-07T10:32:49Z)
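The likelihood-ratio criterion above reduces to a few lines; this sketch assumes a hypothetical (base, finetuned) checkpoint pair sharing one tokenizer, which may differ from the paper's exact setup.
```python
# Sketch of a likelihood-ratio OOD score: how much better the finetuned
# model fits a text than its pretrained base. Checkpoint names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2").eval()
finetuned = AutoModelForCausalLM.from_pretrained("my-org/gpt2-finetuned-id").eval()

def avg_log_likelihood(model, text):
    """Mean per-token log-likelihood of `text` under `model`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)   # out.loss is the mean NLL per token
    return -out.loss.item()

def ood_score(text):
    # High ratio: the finetuned model fits the text much better (likely ID).
    # Low or negative ratio: likely OOD for the finetuning distribution.
    return avg_log_likelihood(finetuned, text) - avg_log_likelihood(base, text)
```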
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- How Good Are LLMs at Out-of-Distribution Detection? [13.35571704613836]
Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning (ML) models.
This paper embarks on a pioneering empirical investigation of OOD detection in the domain of large language models (LLMs).
arXiv Detail & Related papers (2023-08-20T13:15:18Z)
- Do-GOOD: Towards Distribution Shift Evaluation for Pre-Trained Visual Document Understanding Models [68.12229916000584]
We develop an out-of-distribution (OOD) benchmark, termed Do-GOOD, for fine-grained analysis of document image-related tasks.
We then evaluate the robustness and perform a fine-grained analysis of 5 recent visual document understanding (VDU) pre-trained models and 2 typical OOD generalization algorithms.
arXiv Detail & Related papers (2023-06-05T06:50:42Z)
- Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z)
- Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
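A generic sketch of the contrastive objective this entry describes (not the MMBS implementation): pull each sample toward its "debiased" positive view and away from the rest of the batch.
```python
# InfoNCE-style loss over (original, debiased) representation pairs.
# Building the debiased views (removing spurious cues) is the paper's
# contribution and is assumed to happen upstream of this loss.
import torch
import torch.nn.functional as F

def contrastive_loss(orig, debiased, temperature=0.1):
    """orig, debiased: (batch, dim); row i of `debiased` is the positive of row i."""
    orig = F.normalize(orig, dim=-1)
    debiased = F.normalize(debiased, dim=-1)
    logits = orig @ debiased.t() / temperature   # (batch, batch) similarities
    labels = torch.arange(orig.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```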
- Energy-bounded Learning for Robust Models of Code [16.592638312365164]
In programming, learned code representations have a variety of applications, including code classification, code search, comment generation, and bug prediction.
We propose an energy-bounded learning objective that assigns higher scores to in-distribution samples and lower scores to out-of-distribution samples, so that such OOD samples can be incorporated into the training of source code models.
arXiv Detail & Related papers (2021-12-20T06:28:56Z)
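A sketch of an energy-bounded objective in the spirit of the entry above; the energy definition and margin values follow common energy-based OOD practice and may differ from the paper's exact formulation.
```python
# Task loss on ID data plus hinge penalties that push ID energy low
# (high score, ID-like) and OOD energy high (low score).
import torch
import torch.nn.functional as F

def energy(logits):
    """Free energy: lower for confident, ID-like inputs."""
    return -torch.logsumexp(logits, dim=-1)

def energy_bounded_loss(id_logits, id_labels, ood_logits,
                        m_in=-7.0, m_out=-1.0, weight=0.1):
    task = F.cross_entropy(id_logits, id_labels)
    reg = (F.relu(energy(id_logits) - m_in) ** 2).mean() \
        + (F.relu(m_out - energy(ood_logits)) ** 2).mean()
    return task + weight * reg
```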
- Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data [30.471871571256198]
Deep network models perform excellently on in-distribution data but can fail significantly on out-of-distribution data.
This study analyzes the problem of experimental ID testing and designs an OOD test paradigm.
arXiv Detail & Related papers (2021-11-17T15:29:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information (including all of its content) and is not responsible for any consequences of its use.