Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
- URL: http://arxiv.org/abs/2311.09694v2
- Date: Wed, 3 Apr 2024 15:07:45 GMT
- Title: Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
- Authors: Ashim Gupta, Rishanth Rajendhran, Nathan Stringham, Vivek Srikumar, Ana Marasović
- Abstract summary: We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs.
We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
- Score: 29.312873775442757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Do larger and more performant models resolve NLP's longstanding robustness issues? We investigate this question using over 20 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all out-of-domain tests provide insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them adequately robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
Related papers
- LoGU: Long-form Generation with Uncertainty Expressions [49.76417603761989]
We introduce the task of Long-form Generation with Uncertainty (LoGU).
We identify two key challenges: Uncertainty Suppression and Uncertainty Misalignment.
Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims.
Experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
arXiv Detail & Related papers (2024-10-18T09:15:35Z)
- Rigorous Probabilistic Guarantees for Robust Counterfactual Explanations [80.86128012438834]
We show for the first time that computing the robustness of counterfactuals with respect to plausible model shifts is NP-complete.
We propose a novel probabilistic approach which is able to provide tight estimates of robustness with strong guarantees.
arXiv Detail & Related papers (2024-07-10T09:13:11Z)
- Exploring The Landscape of Distributional Robustness for Question Answering Models [47.178481044045505]
Investigation spans over 350 models and 16 question answering datasets.
We find that, in many cases, model variations do not affect robustness.
We release all evaluations to encourage researchers to further analyze robustness trends for question answering models.
arXiv Detail & Related papers (2022-10-22T18:17:31Z)
- Robust Models are less Over-Confident [10.42820615166362]
Adversarial training (AT) aims to achieve robustness against such attacks.
We empirically analyze a variety of adversarially trained models that achieve high robust accuracies.
AT has an interesting side-effect: it leads to models that are significantly less overconfident with their decisions.
arXiv Detail & Related papers (2022-10-12T06:14:55Z)
- Analyzing Modality Robustness in Multimodal Sentiment Analysis [48.52878002917685]
Building robust multimodal models is crucial for achieving reliable deployment in the wild.
We propose simple diagnostic checks for modality robustness in a trained multimodal model.
We analyze well-known robust training strategies to alleviate the issues.
arXiv Detail & Related papers (2022-05-30T23:30:16Z)
- Measure and Improve Robustness in NLP Models: A Survey [23.515869499536237]
Robustness has been separately explored in applications like vision and NLP, with various definitions, evaluation and mitigation strategies in multiple lines of research.
We first connect multiple definitions of robustness, then unify various lines of work on identifying robustness failures and evaluating models' robustness.
We present mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in NLP models.
arXiv Detail & Related papers (2021-12-15T18:02:04Z)
- Voting based ensemble improves robustness of defensive models [82.70303474487105]
We study whether it is possible to create an ensemble to further improve robustness.
By ensembling several state-of-the-art pre-trained defense models, our method can achieve a 59.8% robust accuracy.
arXiv Detail & Related papers (2020-11-28T00:08:45Z)
- RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation.
We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks.
We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.