TAPE: Assessing Few-shot Russian Language Understanding
- URL: http://arxiv.org/abs/2210.12813v1
- Date: Sun, 23 Oct 2022 18:28:25 GMT
- Title: TAPE: Assessing Few-shot Russian Language Understanding
- Authors: Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis
Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg
Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova,
Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in zero-shot and few-shot learning have shown promise for a
scope of research and practical purposes. However, this fast-growing area lacks
standardized evaluation suites for non-English languages, hindering progress
outside the Anglo-centric paradigm. To address this gap, we
propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that
includes six more complex NLU tasks for Russian, covering multi-hop reasoning,
ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on
systematic zero-shot and few-shot NLU evaluation: (i) linguistically oriented
adversarial attacks and perturbations for analyzing robustness, and (ii)
subpopulations for nuanced interpretation. A detailed analysis of the
autoregressive baselines indicates that simple spelling-based perturbations
degrade performance the most, while paraphrasing the input has a far smaller
effect. At the same time, the results demonstrate a significant gap
between the neural and human baselines for most tasks. We publicly release TAPE
(tape-benchmark.com) to foster research on robust LMs that can generalize to
new tasks when little to no supervision is available.
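As a rough illustration of the kind of spelling-based perturbation the abstract refers to, here is a minimal, hypothetical sketch of a typo-style attack (the function name and parameters are invented for illustration; this is not TAPE's actual perturbation code):

```python
import random


def spelling_perturbation(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent characters inside randomly chosen words to simulate typos.

    Hypothetical sketch of a spelling-based perturbation, not TAPE's
    actual implementation. `rate` is the per-word perturbation probability.
    """
    rng = random.Random(seed)  # fixed seed for reproducible perturbations
    out = []
    for word in text.split():
        # Only perturb words long enough that a swap is noticeable
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            # Transpose characters at positions i and i+1
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)
```

Such character-level transpositions preserve the token count and character multiset of the input, which makes them cheap to generate at scale while still being disruptive to subword tokenizers.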
Related papers
- Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation
Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language.
Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form.
We propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms.
arXiv Detail & Related papers (2024-10-13T11:48:09Z)
- ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection for ABSA
This research presents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST) joint detection.
ROAST seeks to close the gap between sentence-level and text-level ABSA by identifying every ABSA constituent at the review level.
We extend the available datasets to enable ROAST, addressing the drawbacks noted in previous research.
arXiv Detail & Related papers (2024-05-30T17:29:15Z)
- On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation.
Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions.
Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift.
arXiv Detail & Related papers (2024-04-12T15:35:20Z)
- Learning Shortcuts: On the Misleading Promise of NLU in Language Models
Large language models (LLMs) have enabled significant performance gains in the field of natural language processing.
Recent studies have found that LLMs often resort to shortcuts when performing tasks, creating an illusion of enhanced performance while lacking generalizability in their decision rules.
arXiv Detail & Related papers (2024-01-17T21:55:15Z)
- SOUL: Towards Sentiment and Opinion Understanding of Language
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer with Fine-tuning Slow and Fast
Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages.
This paper analyzes the fine-tuning process, discovers when the performance gap changes and identifies which network weights affect the overall performance most.
arXiv Detail & Related papers (2023-05-19T06:04:21Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.