RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
- URL: http://arxiv.org/abs/2010.15925v2
- Date: Mon, 2 Nov 2020 11:02:10 GMT
- Title: RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
- Authors: Tatiana Shavrina and Alena Fenogenova and Anton Emelyanov and Denis
Shevelev and Ekaterina Artemova and Valentin Malykh and Vladislav Mikhailov
and Maria Tikhonova and Andrey Chertok and Andrey Evlampiev
- Abstract summary: We introduce an advanced Russian general language understanding evaluation benchmark -- RussianSuperGLUE.
For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language.
- Score: 5.258267224004844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce an advanced Russian general language
understanding evaluation benchmark -- RussianSuperGLUE. Recent advances in
universal language models and transformers require a methodology for their
broad diagnostics and for testing general intellectual skills: natural
language inference, commonsense reasoning, and the ability to perform simple
logical operations regardless of text subject or lexicon. For the first time,
a benchmark of nine tasks, collected and organized analogously to the
SuperGLUE methodology, was developed from scratch for the Russian language. We
provide baselines, a human-level evaluation, an open-source framework for
evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an
overall leaderboard of transformer models for the Russian language. In
addition, we present the first results of comparing multilingual models on the
adapted diagnostic test set and outline first steps toward further expanding
the benchmark or assessing state-of-the-art models independently of language.
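As a rough illustration of how a SuperGLUE-style benchmark is typically consumed, the minimal sketch below scores model predictions for a single task. The file names (val.jsonl, preds.jsonl) and the "label" field are assumptions made for this example; the authoritative task formats and evaluation code live in the RussianNLP/RussianSuperGLUE repository linked above.

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object (example) per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def accuracy(gold, predicted):
    """Fraction of examples whose predicted label matches the gold label."""
    assert len(gold) == len(predicted), "gold and predictions must align"
    hits = sum(g["label"] == p["label"] for g, p in zip(gold, predicted))
    return hits / len(gold)

if __name__ == "__main__":
    # File and field names are illustrative assumptions, not the benchmark spec.
    gold = load_jsonl("val.jsonl")
    preds = load_jsonl("preds.jsonl")
    print(f"accuracy: {accuracy(gold, preds):.3f}")
```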
Related papers
- The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design [39.80182519545138]
This paper focuses on research related to embedding models in the Russian language.
It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark.
arXiv Detail & Related papers (2024-08-22T15:53:23Z)
- Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite [17.764840326809797]
We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
arXiv Detail & Related papers (2023-09-15T14:52:23Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
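To make the first objective concrete, here is a minimal sketch of how "commonsense mask infilling" data could be derived from knowledge triples. The triples, templates, and mask token are illustrative assumptions for this sketch, not the paper's actual extraction pipeline.

```python
# Illustrative templates for verbalizing (head, relation, tail) triples;
# these are assumptions for the sketch, not the paper's actual templates.
TEMPLATES = {
    "UsedFor": "{head} is used for {tail}.",
    "Causes": "{head} causes {tail}.",
}

def mask_infilling_example(head, relation, tail, mask="[MASK]"):
    """Verbalize a triple, then mask the tail span so a model must infill it."""
    sentence = TEMPLATES[relation].format(head=head, tail=tail)
    return sentence.replace(tail, mask), tail  # (masked input, infill target)

print(mask_infilling_example("a knife", "UsedFor", "slicing bread"))
# -> ('a knife is used for [MASK].', 'slicing bread')
```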
arXiv Detail & Related papers (2023-06-04T15:44:51Z)
- Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models [53.95094814056337]
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models.
The new version includes a number of technical, user experience and methodological improvements.
We also integrate Russian SuperGLUE with MOROCCO, a framework for industrial evaluation of open-source models.
arXiv Detail & Related papers (2022-02-15T23:45:30Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark designed to meet these criteria.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
- Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We introduce the first study of automatic detoxification of Russian texts to combat offensive language.
We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on the pretrained GPT-2 language model.
The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
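As a toy illustration of what a "local correction" can look like, the sketch below swaps lexicon hits for neutral substitutes. The lexicon and substitutes are invented for this example and are not the paper's method or data.

```python
# Tiny toxic-to-neutral substitution lexicon, invented for this illustration.
SUBSTITUTES = {"idiotic": "misguided", "stupid": "questionable"}

def detoxify_locally(text):
    """Replace lexicon hits word by word, leaving the rest of the text intact."""
    return " ".join(SUBSTITUTES.get(word.lower(), word) for word in text.split())

print(detoxify_locally("that was a stupid idea"))
# -> 'that was a questionable idea'
```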
arXiv Detail & Related papers (2021-05-19T10:37:44Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose TextFlint, a multilingual robustness evaluation platform for NLP tasks.
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address shortcomings in a model's robustness.
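For intuition, the sketch below implements one generic "universal text transformation" of the kind such a toolkit applies: keyboard-neighbor typos injected at a fixed rate. This is not TextFlint's API; it only illustrates the transform-then-re-evaluate pattern.

```python
import random

# Partial keyboard-adjacency map, an assumption for this illustration only.
KEYBOARD_NEIGHBORS = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "s": "ad"}

def typo_transform(text, rate=0.1, seed=0):
    """Swap a fraction of characters for a keyboard neighbor to simulate typos."""
    rng = random.Random(seed)
    chars = list(text)
    for idx, ch in enumerate(chars):
        if ch in KEYBOARD_NEIGHBORS and rng.random() < rate:
            chars[idx] = rng.choice(KEYBOARD_NEIGHBORS[ch])
    return "".join(chars)

# Re-running a model on transformed inputs exposes robustness gaps.
print(typo_transform("the model is robust to noise", rate=0.3))
```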
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.