TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing
- URL: http://arxiv.org/abs/2103.11441v1
- Date: Sun, 21 Mar 2021 17:20:38 GMT
- Title: TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing
- Authors: Tao Gui, Xiao Wang, Qi Zhang, Qin Liu, Yicheng Zou, Xin Zhou, Rui
Zheng, Chong Zhang, Qinzhuo Wu, Jiacheng Ye, Zexiong Pang, Yongxin Zhang,
Zhengyan Li, Ruotian Ma, Zichu Fei, Ruijian Cai, Jun Zhao, Xinwu Hu, Zhiheng
Yan, Yiding Tan, Yuan Hu, Qiyuan Bian, Zhihua Liu, Bolin Zhu, Shan Qin,
Xiaoyu Xing, Jinlan Fu, Yue Zhang, Minlong Peng, Xiaoqing Zheng, Yaqian Zhou,
Zhongyu Wei, Xipeng Qiu and Xuanjing Huang
- Abstract summary: We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint)
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
- Score: 73.16475763422446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various robustness evaluation methodologies from different perspectives have
been proposed for different natural language processing (NLP) tasks. These
methods have often focused on either universal or task-specific generalization
capabilities. In this work, we propose a multilingual robustness evaluation
platform for NLP tasks (TextFlint) that incorporates universal text
transformation, task-specific transformation, adversarial attack,
subpopulation, and their combinations to provide comprehensive robustness
analysis. TextFlint enables practitioners to automatically evaluate their
models from all aspects or to customize their evaluations as desired with just
a few lines of code. To guarantee user acceptability, all the text
transformations are linguistically based, and we provide a human evaluation for
each one. TextFlint generates complete analytical reports as well as targeted
augmented data to address the shortcomings of the model's robustness. To
validate TextFlint's utility, we performed large-scale empirical evaluations
(over 67,000 evaluations) on state-of-the-art deep learning models, classic
supervised methods, and real-world systems. Almost all models showed
significant performance degradation, including a decline of more than 50% of
BERT's prediction accuracy on tasks such as aspect-level sentiment
classification, named entity recognition, and natural language inference.
Therefore, we call for robustness to be included in model evaluation, so as to
promote the healthy development of NLP technology.
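The claim that evaluations can be run "with just a few lines of code" can be illustrated with a short sketch. The following is a minimal, self-contained Python illustration of the robustness-evaluation loop that a toolkit like TextFlint automates (perturb the test set with a universal text transformation, re-run the model, report the accuracy drop); the keyboard-typo transformation, the `robustness_report` helper, and the toy sentiment model are hypothetical stand-ins for illustration, not TextFlint's actual API.
```python
# A minimal sketch of the robustness-evaluation loop automated by toolkits such
# as TextFlint: perturb the test set, re-run the model, report the degradation.
# All names below (keyboard_typo, robustness_report, toy_model) are illustrative
# assumptions, not part of the TextFlint library.
import random


def keyboard_typo(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Toy universal transformation: swap some characters with a neighbouring key."""
    neighbours = {'a': 's', 's': 'a', 'e': 'r', 'r': 'e', 'o': 'p', 'i': 'o'}
    rng = random.Random(seed)
    return ''.join(
        neighbours[c] if c in neighbours and rng.random() < rate else c
        for c in text
    )


def accuracy(predict, samples):
    """Fraction of (text, label) pairs the model labels correctly."""
    return sum(predict(text) == label for text, label in samples) / len(samples)


def robustness_report(predict, samples):
    """Compare clean vs. transformed accuracy, mirroring a TextFlint-style report."""
    clean = accuracy(predict, samples)
    perturbed = accuracy(predict, [(keyboard_typo(t), y) for t, y in samples])
    return {'clean_acc': clean, 'perturbed_acc': perturbed,
            'degradation': clean - perturbed}


if __name__ == '__main__':
    # Toy sentiment "model": predicts positive iff the word 'good' appears.
    def toy_model(text):
        return 'positive' if 'good' in text else 'negative'

    data = [('the food was really good', 'positive'),
            ('service was slow and rude', 'negative'),
            ('good value for the price', 'positive')]
    print(robustness_report(toy_model, data))
```
In the toolkit itself, the universal and task-specific transformations, subpopulations, and adversarial attacks are selected per task through a configuration, and the output is a full analytical report together with targeted augmented data, as described in the abstract.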
Related papers
- Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [25.078498180620425]
We present Fennec, a step-by-step, fine-grained evaluation framework extended through branching and bridging.
We employ the fine-grained correction capabilities induced by the evaluation model to refine multiple model responses, leading to an improvement of 1-2 points on the MT-Bench.
arXiv Detail & Related papers (2024-05-20T16:47:22Z) - AllHands: Ask Me Anything on Large-scale Verbatim Feedback via Large Language Models [34.82568259708465]
AllHands is an innovative analytic framework designed for large-scale feedback analysis through a natural language interface.
It leverages large language models (LLMs) to enhance accuracy, robustness, generalization, and user-friendliness.
AllHands delivers comprehensive multi-modal responses, including text, code, tables, and images.
arXiv Detail & Related papers (2024-03-22T12:13:16Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - A Unified Neural Network Model for Readability Assessment with Feature
Projection and Length-Balanced Loss [17.213602354715956]
We propose a BERT-based model with feature projection and length-balanced loss for readability assessment.
Our model achieves state-of-the-art performances on two English benchmark datasets and one dataset of Chinese textbooks.
arXiv Detail & Related papers (2022-10-19T05:33:27Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - Compression, Transduction, and Creation: A Unified Framework for
Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves for specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z) - Language Model Evaluation in Open-ended Text Generation [0.76146285961466]
We study different evaluation metrics that have been proposed to evaluate the quality, diversity, and consistency of machine-generated text.
From there, we propose a practical pipeline to evaluate language models in open-ended generation tasks.
arXiv Detail & Related papers (2021-08-08T06:16:02Z) - Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embedding different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)