Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian
SuperGLUE Tasks
- URL: http://arxiv.org/abs/2105.01192v1
- Date: Mon, 3 May 2021 22:19:22 GMT
- Title: Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian
SuperGLUE Tasks
- Authors: Tatyana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov
- Abstract summary: Leader-boards like SuperGLUE are seen as important incentives for the active development of NLP.
We show that its test datasets are vulnerable to shallow heuristics.
It is likely (as the simplest explanation) that a significant part of the SOTA models' performance on the RSG leader-board is due to exploiting these shallow heuristics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leader-boards like SuperGLUE are seen as important incentives for active
development of NLP, since they provide standard benchmarks for fair comparison
of modern language models. They have driven the world's best engineering teams
as well as their resources to collaborate and solve a set of tasks for general
language understanding. Their performance scores are often claimed to be close
to or even higher than human performance. These results encouraged more
thorough analysis of whether the benchmark datasets feature any statistical
cues that machine-learning-based language models can exploit. For English
datasets, it was shown that they often contain annotation artifacts. This
allows solving certain tasks with very simple rules and achieving competitive
rankings.
In this paper, a similar analysis was done for the Russian SuperGLUE (RSG), a
recently published benchmark set and leader-board for Russian natural language
understanding. We show that its test datasets are vulnerable to shallow
heuristics. Approaches based on simple rules often outperform or come close to
the results of well-known pre-trained language models such as GPT-3 or BERT. It
is likely (as the simplest explanation) that a significant part of the SOTA
models' performance on the RSG leader-board is due to exploiting these shallow
heuristics, which have nothing in common with real language understanding. We
provide a set of recommendations on how to improve these datasets, making the
RSG leader-board even more representative of the real progress in Russian NLU.
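To make concrete what such a shallow heuristic looks like in practice, below is a minimal sketch of a rule-based baseline. It is illustrative only: the entailment-style task framing, the lexical-overlap rule, and the 0.6 threshold are assumptions for the example, not rules taken from the paper.

```python
import re

# Minimal sketch of a shallow, rule-based baseline of the kind the paper
# describes. The task framing (premise/hypothesis entailment), the overlap
# rule, and the threshold are hypothetical illustrations, not the authors' code.

def tokenize(text: str) -> set[str]:
    """Lowercase and keep word tokens; crude, but enough for a heuristic."""
    return set(re.findall(r"\w+", text.lower()))

def rule_based_predict(premise: str, hypothesis: str) -> str:
    """Predict via lexical overlap: high overlap -> entailment, else not."""
    p_tokens, h_tokens = tokenize(premise), tokenize(hypothesis)
    # Fraction of hypothesis tokens that also appear in the premise.
    overlap = len(p_tokens & h_tokens) / max(len(h_tokens), 1)
    # An arbitrary threshold; a real probe would tune it on the training
    # split, which is exactly the artifact-fitting the paper warns about.
    return "entailment" if overlap > 0.6 else "not_entailment"

# Usage with made-up examples:
pairs = [
    ("The cat sat on the mat.", "The cat sat on the mat today."),
    ("The cat sat on the mat.", "Dogs can fly."),
]
for premise, hypothesis in pairs:
    print(rule_based_predict(premise, hypothesis))  # entailment / not_entailment
```

A baseline like this encodes no linguistic knowledge at all, so whenever it approaches SOTA scores, the simplest explanation is annotation artifacts in the dataset rather than language understanding.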
Related papers
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts (arXiv, 2024-11-08)
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" in speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark [28.472036496534116]
bgGLUE is a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian.
We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark.
arXiv Detail & Related papers (2023-06-04T12:54:00Z) - Pre-Trained Language-Meaning Models for Multilingual Parsing and
Generation [14.309869321407522]
We introduce multilingual pre-trained language-meaning models based on Discourse Representation Structures (DRSs).
Since DRSs are language neutral, cross-lingual transfer learning is adopted to further improve the performance of non-English tasks.
Automatic evaluation results show that our approach achieves the best performance on both the multilingual DRS parsing and DRS-to-text generation tasks.
arXiv Detail & Related papers (2023-05-31T19:00:33Z) - This is the way: designing and compiling LEPISZCZE, a comprehensive NLP
benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading
Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
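The answer-aware contrastive mechanism is described above only at a high level. As an illustration of the general idea, here is a minimal PyTorch sketch; the margin formulation, function name, and tensor shapes are assumptions for the example, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def answer_aware_contrastive_loss(
    scores: torch.Tensor,  # (num_candidates,) model scores for top-k answer spans
    gold_idx: int,         # index of the accurate answer among the candidates
    margin: float = 1.0,
) -> torch.Tensor:
    """Margin loss pushing the gold answer's score above every other candidate."""
    gold_score = scores[gold_idx]
    negatives = torch.cat([scores[:gold_idx], scores[gold_idx + 1:]])
    # Penalize any negative candidate whose score comes within `margin` of the gold score.
    return F.relu(margin - (gold_score - negatives)).mean()

# Usage with made-up scores for three candidate answers (gold is index 0):
loss = answer_aware_contrastive_loss(torch.tensor([2.0, 0.5, 1.8]), gold_idx=0)
print(loss)  # tensor(0.4000): only the 1.8 candidate violates the margin
```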
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - A Systematic Investigation of Commonsense Understanding in Large
Language Models [23.430757316504316]
Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting.
We ask whether these models exhibit commonsense understanding by evaluating models against four commonsense benchmarks.
arXiv Detail & Related papers (2021-10-31T22:20:36Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of building NER models for low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.