GLUE-X: Evaluating Natural Language Understanding Models from an
Out-of-distribution Generalization Perspective
- URL: http://arxiv.org/abs/2211.08073v4
- Date: Mon, 22 May 2023 11:55:52 GMT
- Title: GLUE-X: Evaluating Natural Language Understanding Models from an
Out-of-distribution Generalization Perspective
- Authors: Linyi Yang, Shuibai Zhang, Libo Qin, Yafu Li, Yidong Wang, Hanmeng
Liu, Jindong Wang, Xing Xie, Yue Zhang
- Abstract summary: This paper presents the first attempt at creating a unified benchmark named GLUE-X for evaluating OOD robustness in NLP models. Evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5.
- Score: 36.24251509242988
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pre-trained language models (PLMs) are known to improve the generalization
performance of natural language understanding models by leveraging large
amounts of data during the pre-training phase. However, the out-of-distribution
(OOD) generalization problem remains a challenge in many NLP tasks, limiting
the real-world deployment of these methods. This paper presents the first
attempt at creating a unified benchmark named GLUE-X for evaluating OOD
robustness in NLP models, highlighting the importance of OOD robustness and
providing insights on how to measure the robustness of a model and how to
improve it. The benchmark includes 13 publicly available datasets for OOD
testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for
improved OOD accuracy in NLP tasks, as significant performance degradation was
observed in all settings compared to in-distribution (ID) accuracy.
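To make the ID-versus-OOD comparison concrete, here is a minimal sketch of the kind of measurement GLUE-X reports: run the same model on an in-distribution test set and an OOD test set for the same task, then report the absolute and relative accuracy drop. The `predict` callable and the toy datasets are placeholders, not GLUE-X's actual pipeline.

```python
from typing import Callable, List, Sequence

def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def ood_gap(predict: Callable[[List[str]], List[int]],
            id_texts: List[str], id_labels: List[int],
            ood_texts: List[str], ood_labels: List[int]) -> dict:
    # Same model, two test sets: the gap between the two accuracies is
    # the degradation the benchmark is designed to expose.
    id_acc = accuracy(predict(id_texts), id_labels)
    ood_acc = accuracy(predict(ood_texts), ood_labels)
    return {"id_acc": id_acc, "ood_acc": ood_acc,
            "absolute_drop": id_acc - ood_acc,
            "relative_drop": (id_acc - ood_acc) / id_acc}

# Toy usage with a trivial length-based "classifier" standing in for a PLM.
predict = lambda texts: [int(len(t) > 20) for t in texts]
print(ood_gap(predict,
              ["short one", "a considerably longer sentence"], [0, 1],
              ["tiny", "another fairly long ood sentence here"], [0, 1]))
```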
Related papers
- A Survey on Evaluation of Out-of-Distribution Generalization [41.39827887375374]
Out-of-Distribution (OOD) generalization is a complex and fundamental problem.
This paper serves as the first effort to conduct a comprehensive review of OOD evaluation.
We categorize existing research into three paradigms: OOD performance testing, OOD performance prediction, and OOD intrinsic property characterization.
arXiv Detail & Related papers (2024-03-04T09:30:35Z)
- L3 Ensembles: Lifelong Learning Approach for Ensemble of Foundational Language Models [15.726224465017596]
We propose an approach that focuses on extracting meaningful representations from unseen data and constructing a structured knowledge base.
We conducted experiments on various NLP tasks to validate its effectiveness, including benchmarks like GLUE and SuperGLUE.
The proposed L3 ensemble method increases model accuracy by 4% to 36% compared to the fine-tuned FLM.
arXiv Detail & Related papers (2023-11-11T06:59:50Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
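The paper's construction protocol is more involved than this, but a crude stand-in for "how shifted is this candidate OOD set" is lexical divergence from the ID training set. The sketch below scores candidates by one minus the Jaccard overlap of their vocabularies; it illustrates the idea of ranking candidate datasets by distributional distance, not the paper's actual method.

```python
def vocab_shift(id_texts, ood_texts):
    # 1 - Jaccard overlap of whitespace-token vocabularies: 0 means
    # identical vocabularies, values near 1 mean a large lexical shift.
    id_vocab = {tok for t in id_texts for tok in t.lower().split()}
    ood_vocab = {tok for t in ood_texts for tok in t.lower().split()}
    jaccard = len(id_vocab & ood_vocab) / len(id_vocab | ood_vocab)
    return 1.0 - jaccard

# Rank hypothetical candidate OOD sets by shift from the ID training data.
id_train = ["the movie was great", "a dull and slow film"]
candidates = {"reviews_b": ["the film felt slow"], "tweets": ["lol so gud fr"]}
print(sorted(candidates, key=lambda k: vocab_shift(id_train, candidates[k]),
             reverse=True))
```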
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
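Of the four facets, calibration is the easiest to show in a few lines. The sketch below computes expected calibration error (ECE), a standard calibration metric; this is a generic implementation, not necessarily the exact measurement the paper uses.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |empirical accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Bin weight times the accuracy/confidence mismatch in that bin.
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece

# Toy check: confidently wrong predictions yield a large ECE.
print(expected_calibration_error([0.95, 0.90, 0.92, 0.88], [1, 0, 0, 1]))
```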
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- Pseudo-OOD training for robust language models [78.15712542481859]
OOD detection is a key component of a reliable machine-learning model for any industry-scale application.
We propose POORE (POsthoc pseudo-Ood REgularization), which generates pseudo-OOD samples using in-distribution (IND) data.
We extensively evaluate our framework on three real-world dialogue systems, achieving a new state of the art in OOD detection.
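POORE's exact sampling procedure is not described in this summary, but the general idea of manufacturing pseudo-OOD examples from IND data can be illustrated by splicing token spans from unrelated in-distribution utterances, so the result is locally fluent but belongs to no in-scope intent. The function below is a generic stand-in, not the paper's method.

```python
import random

def pseudo_ood(ind_texts, n_samples, seed=0):
    """Splice spans from two unrelated IND utterances into one sample."""
    rng = random.Random(seed)
    tokenized = [t.split() for t in ind_texts]
    samples = []
    for _ in range(n_samples):
        a, b = rng.sample(tokenized, 2)
        # Cut each utterance at a random point and glue the halves together.
        cut_a = rng.randrange(1, max(2, len(a)))
        cut_b = rng.randrange(1, max(2, len(b)))
        samples.append(" ".join(a[:cut_a] + b[cut_b:]))
    return samples

ind = ["play some jazz in the kitchen",
       "what is the weather in oslo tomorrow",
       "set an alarm for six thirty"]
print(pseudo_ood(ind, 2))
```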
arXiv Detail & Related papers (2022-10-17T14:32:02Z)
- Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
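The underlying analysis can be sketched as a correlation between two per-model measurements on a task; the numbers below are invented purely for illustration.

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-model measurements on one task:
# samples needed to reach a target ID accuracy (lower = more efficient)
samples_to_target = [12_000, 8_000, 30_000, 5_000]
ood_accuracy      = [0.71,   0.74,  0.63,   0.76]

# Negate the sample counts so "more sample-efficient" is the larger value,
# then ask whether efficiency and OOD accuracy move together.
print(pearson([-s for s in samples_to_target], ood_accuracy))
```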
arXiv Detail & Related papers (2022-10-12T17:54:59Z)
- Evaluating the Robustness of Neural Language Models to Input Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts.
We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations.
The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
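Two toy perturbations in this spirit are sketched below: a character-level typo (swapping adjacent characters in one word) and a word-level drop. The study implements a much wider catalogue; these are minimal stand-ins.

```python
import random

def char_swap(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters in one randomly chosen word."""
    words = text.split()
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 1:
        j = rng.randrange(len(w) - 1)
        w = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    words[i] = w
    return " ".join(words)

def word_drop(text: str, rng: random.Random) -> str:
    """Delete one randomly chosen word."""
    words = text.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

rng = random.Random(0)
print(char_swap("language models are sensitive to noise", rng))
print(word_drop("language models are sensitive to noise", rng))
```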
arXiv Detail & Related papers (2021-08-27T12:31:17Z)
- SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness [66.37077266814822]
In natural language, it is difficult to generate new examples that stay on the underlying data manifold.
We introduce SSMBA, a data augmentation method for generating synthetic training examples.
In experiments on benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods.
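SSMBA's corrupt-and-reconstruct idea can be shown in miniature with a masked language model via the `transformers` fill-mask pipeline (this downloads `bert-base-uncased` on first run). The real method masks a fraction of tokens and samples reconstructions rather than taking the argmax, and pairs the new text with the original label; this single-mask sketch only illustrates the mechanism.

```python
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
rng = random.Random(0)

def ssmba_augment(text: str) -> str:
    """Corrupt one token with [MASK], then reconstruct it with the MLM."""
    words = text.split()
    i = rng.randrange(len(words))
    masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
    # Take the top reconstruction; SSMBA samples from the MLM instead.
    return fill(masked)[0]["sequence"]

print(ssmba_augment("the movie was surprisingly good"))
```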
arXiv Detail & Related papers (2020-09-21T22:02:33Z)