Are Sample-Efficient NLP Models More Robust?
- URL: http://arxiv.org/abs/2210.06456v2
- Date: Tue, 30 May 2023 20:33:26 GMT
- Title: Are Sample-Efficient NLP Models More Robust?
- Authors: Nelson F. Liu and Ananya Kumar and Percy Liang and Robin Jia
- Abstract summary: We investigate the relationship between sample efficiency (amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
- Score: 90.54786862811183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent results in image classification and extractive question answering have
observed that pre-trained models trained on less in-distribution data have
better out-of-distribution performance. However, it is unclear how broadly
these trends hold. We conduct a large empirical study across three tasks, three
broadly-applicable modeling interventions (increasing model size, using a
different adaptation method, and pre-training on more data), and 14 diverse
datasets to investigate the relationship between sample efficiency (amount of
data needed to reach a given ID accuracy) and robustness (how models fare on
OOD evaluation). We find that higher sample efficiency is only correlated with
better average OOD robustness on some modeling interventions and tasks, but not
others. On individual datasets, models with lower sample efficiency can even be
more robust. These results suggest that general-purpose methods for improving
sample efficiency are unlikely to yield universal OOD robustness improvements,
since such improvements are highly dataset- and task-dependent. Even in an era
of large, multi-purpose pretrained models, task-specific decisions may often be
necessary for OOD generalization.
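The abstract's two central quantities, sample efficiency (the amount of training data needed to reach a given ID accuracy) and OOD robustness (how the model fares on out-of-distribution evaluation), can be made concrete with a short sketch. This is not the authors' experimental code: the interpolation-based estimate, the function names, and all of the numbers below are illustrative assumptions.

```python
import numpy as np

def sample_efficiency(train_sizes, id_accuracies, target_id_acc):
    """Smallest training-set size (linearly interpolated) at which ID accuracy reaches the target."""
    sizes = np.asarray(train_sizes, dtype=float)
    accs = np.asarray(id_accuracies, dtype=float)
    if accs.max() < target_id_acc:
        return float("inf")  # target never reached within the data budgets tried
    # np.interp needs increasing x values, so interpolate size as a function of accuracy.
    order = np.argsort(accs)
    return float(np.interp(target_id_acc, accs[order], sizes[order]))

# Hypothetical learning curves for two modeling interventions (e.g. a smaller vs. a larger model).
sizes = [500, 1000, 2000, 4000, 8000]
models = {
    "smaller model": ([0.62, 0.68, 0.74, 0.79, 0.82], 0.65),  # (ID accuracy curve, OOD accuracy)
    "larger model": ([0.70, 0.76, 0.81, 0.85, 0.87], 0.71),
}
for name, (id_curve, ood_acc) in models.items():
    needed = sample_efficiency(sizes, id_curve, target_id_acc=0.80)
    print(f"{name}: ~{needed:.0f} examples to reach 80% ID accuracy; OOD accuracy {ood_acc:.2f}")
```

In the paper's terms, the larger model in this made-up example is both more sample efficient (fewer examples to hit the target ID accuracy) and more robust; the study's finding is that this pairing does not hold uniformly across interventions, tasks, and datasets.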
Related papers
- Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Domain Generalization Using Large Pretrained Models with Mixture-of-Adapters [33.401355417911084]
This study is on leveraging the knowledge of large pretrained models to improve handling of OOD scenarios and tackle domain generalization problems.
We employ parameter-efficient fine-tuning (PEFT) techniques to effectively preserve OOD robustness while working with large models.
Our experiments and analysis confirm that the most effective approaches involve ensembling diverse models and increasing the scale of pretraining.
arXiv Detail & Related papers (2023-10-17T07:01:24Z)
- Effective Robustness against Natural Distribution Shifts for Models with Different Training Data [113.21868839569]
"Effective robustness" measures the extra out-of-distribution robustness beyond what can be predicted from the in-distribution (ID) performance.
We propose a new metric to evaluate and compare the effective robustness of models trained on different data; an illustrative sketch of effective robustness appears after this list.
arXiv Detail & Related papers (2023-02-02T19:28:41Z)
- Exploring The Landscape of Distributional Robustness for Question Answering Models [47.178481044045505]
The investigation spans over 350 models and 16 question answering datasets.
We find that, in many cases, model variations do not affect robustness.
We release all evaluations to encourage researchers to further analyze robustness trends for question answering models.
arXiv Detail & Related papers (2022-10-22T18:17:31Z)
- Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data [30.471871571256198]
Deep network models perform excellently on In-Distribution data, but can significantly fail on Out-Of-Distribution data.
This study analyzes the problem with experimental ID testing and designs an OOD test paradigm.
arXiv Detail & Related papers (2021-11-17T15:29:07Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning [25.85044477227461]
Models that are more accurate on out-of-distribution data than a baseline fit to in-distribution accuracy would predict exhibit "effective robustness".
We find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence.
We discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.
arXiv Detail & Related papers (2021-06-30T06:21:42Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
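The two effective-robustness entries above describe the same quantity: the OOD accuracy a model achieves beyond what a baseline fit of OOD accuracy against ID accuracy over a pool of reference models predicts. The sketch below is a simplified illustration with made-up accuracy pairs and a plain linear baseline; published work typically fits the baseline on logit-transformed accuracies, so this is not any paper's exact metric.

```python
import numpy as np

# Hypothetical (ID accuracy, OOD accuracy) pairs for a pool of reference models.
id_acc = np.array([0.70, 0.75, 0.80, 0.85, 0.90])
ood_acc = np.array([0.52, 0.58, 0.63, 0.69, 0.74])

# Baseline: least-squares line predicting OOD accuracy from ID accuracy.
slope, intercept = np.polyfit(id_acc, ood_acc, deg=1)

def effective_robustness(model_id_acc, model_ood_acc):
    """OOD accuracy above (positive) or below (negative) what the ID-accuracy baseline predicts."""
    predicted_ood = slope * model_id_acc + intercept
    return model_ood_acc - predicted_ood

# A model with 0.80 ID accuracy and 0.67 OOD accuracy sits above the trend line,
# so it has positive effective robustness (about +0.04 here).
print(effective_robustness(0.80, 0.67))
```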