Robustness Gym: Unifying the NLP Evaluation Landscape
- URL: http://arxiv.org/abs/2101.04840v1
- Date: Wed, 13 Jan 2021 02:37:54 GMT
- Title: Robustness Gym: Unifying the NLP Evaluation Landscape
- Authors: Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan
Zheng, Caiming Xiong, Mohit Bansal, Christopher Ré
- Abstract summary: Deep neural networks are often brittle when deployed in real-world systems.
Recent research has focused on testing the robustness of such models.
We propose a solution in the form of Robustness Gym, a simple and extensible evaluation toolkit.
- Score: 91.80175115162218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite impressive performance on standard benchmarks, deep neural networks
are often brittle when deployed in real-world systems. Consequently, recent
research has focused on testing the robustness of such models, resulting in a
diverse set of evaluation methodologies ranging from adversarial attacks to
rule-based data transformations. In this work, we identify challenges with
evaluating NLP systems and propose a solution in the form of Robustness Gym
(RG), a simple and extensible evaluation toolkit that unifies 4 standard
evaluation paradigms: subpopulations, transformations, evaluation sets, and
adversarial attacks. By providing a common platform for evaluation, Robustness
Gym enables practitioners to compare results from all 4 evaluation paradigms
with just a few clicks, and to easily develop and share novel evaluation
methods using a built-in set of abstractions. To validate Robustness Gym's
utility to practitioners, we conducted a real-world case study with a
sentiment-modeling team, revealing performance degradations of 18%+. To verify
that Robustness Gym can aid novel research analyses, we perform the first study
of state-of-the-art commercial and academic named entity linking (NEL) systems,
as well as a fine-grained analysis of state-of-the-art summarization models.
For NEL, commercial systems struggle to link rare entities and lag their
academic counterparts by 10%+, while state-of-the-art summarization models
struggle on examples that require abstraction and distillation, degrading by
9%+. Robustness Gym can be found at https://robustnessgym.com/
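The four evaluation paradigms share a common shape: each produces named slices of an evaluation set, and the model is scored on every slice so that degradations become visible side by side. The sketch below illustrates that idea in plain Python; it is a conceptual illustration using assumed helper names and a toy sentiment model, not the actual Robustness Gym API.

```python
# Conceptual sketch only (not the Robustness Gym API): subpopulations and
# transformations both act as ways of building named "slices" of a test set,
# and a model is scored per slice.
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, label)
Slice = List[Example]

def subpopulation(data: List[Example], predicate: Callable[[str], bool]) -> Slice:
    """Subpopulation: filter the evaluation set by a property of the input."""
    return [(x, y) for x, y in data if predicate(x)]

def transformation(data: List[Example], transform: Callable[[str], str]) -> Slice:
    """Transformation: perturb every input while keeping the label fixed."""
    return [(transform(x), y) for x, y in data]

def evaluate(model: Callable[[str], int], slices: Dict[str, Slice]) -> Dict[str, float]:
    """Report accuracy per slice so per-slice degradations are comparable."""
    return {
        name: sum(model(x) == y for x, y in sl) / max(len(sl), 1)
        for name, sl in slices.items()
    }

# Hypothetical usage with a toy rule-based sentiment model.
data = [("great movie", 1), ("not a great movie", 0), ("terrible plot", 0)]
model = lambda text: 0 if "not" in text or "terrible" in text else 1
report = evaluate(model, {
    "all": data,
    "negation": subpopulation(data, lambda x: "not" in x),
    "uppercased": transformation(data, str.upper),
})
print(report)
```

In this toy example the model scores perfectly on the full set and the negation subpopulation but drops to one-third accuracy on the uppercased transformation slice, which is the kind of gap a per-slice report is meant to surface.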
Related papers
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantics-preserving but misleading perturbations to the inputs.
The existing practice of robustness evaluation may suffer from incomplete evaluation, impractical evaluation protocols, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services expressed in social media texts or reviews.
We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z)
- Resisting Adversarial Attacks in Deep Neural Networks using Diverse Decision Boundaries [12.312877365123267]
Deep learning systems are vulnerable to crafted adversarial examples, which may be imperceptible to the human eye but can cause the model to misclassify.
We develop a new ensemble-based solution that constructs defender models with diverse decision boundaries with respect to the original model.
We present extensive experiments on standard image classification datasets, namely MNIST, CIFAR-10, and CIFAR-100, against state-of-the-art adversarial attacks.
arXiv Detail & Related papers (2022-08-18T08:19:26Z)
- FLEX: Unifying Evaluation for Few-Shot NLP [17.425495611344786]
We formulate desiderata for an ideal few-shot NLP benchmark.
We present FLEX, the first benchmark, public leaderboard, and framework for unified measurement of few-shot NLP techniques.
We also present UniFew, a simple yet strong prompt-based model for few-shot learning.
arXiv Detail & Related papers (2021-07-15T07:37:06Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- A Comprehensive Evaluation Framework for Deep Model Robustness [44.20580847861682]
Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications.
However, they are vulnerable to adversarial examples, which motivates research on adversarial defenses.
This paper presents a model evaluation framework containing a comprehensive, rigorous, and coherent set of evaluation metrics.
arXiv Detail & Related papers (2021-01-24T01:04:25Z)
- RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to overestimation of robustness.
We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks.
We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z)
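The standardized evaluation behind the RobustBench entry above is driven by the AutoAttack ensemble. The sketch below shows the typical two-call usage of the autoattack package; the toy convolutional model and random tensors are placeholders for a real checkpoint and test set, so the robust accuracy it reports is meaningless on its own.

```python
# Sketch of a standardized robustness evaluation with AutoAttack, the attack
# ensemble used by RobustBench. The model and data below are placeholders.
import torch
import torch.nn as nn
from autoattack import AutoAttack  # pip install autoattack

# Toy CIFAR-10-shaped classifier standing in for a real (robust) checkpoint.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).eval()

# Placeholder evaluation batch: images scaled to [0, 1] and integer labels.
x_test = torch.rand(64, 3, 32, 32)
y_test = torch.randint(0, 10, (64,))

# The 'standard' version runs a fixed ensemble of white-box (APGD, FAB) and
# black-box (Square) attacks under an L-inf perturbation budget and reports
# robust accuracy; eps = 8/255 is the common CIFAR-10 setting.
adversary = AutoAttack(model, norm='Linf', eps=8 / 255, version='standard')
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=64)
```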
This list is automatically generated from the titles and abstracts of the papers in this site.