bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
- URL: http://arxiv.org/abs/2306.02349v2
- Date: Wed, 7 Jun 2023 03:57:51 GMT
- Title: bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
- Authors: Momchil Hardalov, Pepa Atanasova, Todor Mihaylov, Galia Angelova,
Kiril Simov, Petya Osenova, Ves Stoyanov, Ivan Koychev, Preslav Nakov,
Dragomir Radev
- Abstract summary: bgGLUE is a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian.
We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark.
- Score: 28.472036496534116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present bgGLUE (Bulgarian General Language Understanding Evaluation), a
benchmark for evaluating language models on Natural Language Understanding
(NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety
of NLP problems (e.g., natural language inference, fact-checking, named entity
recognition, sentiment analysis, question answering, etc.) and machine learning
tasks (sequence labeling, document-level classification, and regression). We
run the first systematic evaluation of pre-trained language models for
Bulgarian, comparing and contrasting results across the nine tasks in the
benchmark. The evaluation results show strong performance on sequence labeling
tasks, but there is a lot of room for improvement for tasks that require more
complex reasoning. We make bgGLUE publicly available together with the
fine-tuning and the evaluation code, as well as a public leaderboard at
https://bgglue.github.io/, and we hope that it will enable further advancements
in developing NLU models for Bulgarian.
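As a practical illustration of how a benchmark like bgGLUE is typically used, the sketch below fine-tunes a pretrained multilingual encoder on a single sentence-classification task with Hugging Face Transformers. This is not the official bgGLUE fine-tuning code: the dataset identifier "bgglue/bgglue", the task name "sentiment", and the "text"/"label" column names are placeholders, and the actual task loaders and evaluation scripts are the ones released at https://bgglue.github.io/.

```python
# Minimal sketch (not the official bgGLUE code): fine-tune a pretrained model
# on a bgGLUE-style single-sentence classification task with the Hugging Face
# Trainer. Dataset id, task name, and column names below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "xlm-roberta-base"  # any multilingual or Bulgarian checkpoint
TASK = "sentiment"               # hypothetical bgGLUE task name

dataset = load_dataset("bgglue/bgglue", TASK)  # placeholder dataset identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Assumes the task exposes a single "text" column and an integer "label".
    return tokenizer(batch["text"], truncation=True, max_length=256)

encoded = dataset.map(tokenize, batched=True)
num_labels = encoded["train"].features["label"].num_classes

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=num_labels
)

args = TrainingArguments(
    output_dir="bgglue-sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)

trainer.train()
print(trainer.evaluate())
```

Swapping MODEL_NAME for a Bulgarian-specific checkpoint and TASK for any of the nine bgGLUE tasks would follow the same pattern; sequence-labeling tasks would instead use a token-classification head.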
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z) - Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better
Language Models for Code Understanding [3.98345038769576]
We derive a set of benchmarks that assess code understanding based on tasks such as predicting the best answer to a question in a forum post.
We evaluate the performance of current state-of-the-art language models on these tasks and show that fine-tuning yields a significant improvement on each task.
arXiv Detail & Related papers (2021-09-15T17:42:44Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce the Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- GLGE: A New General Language Generation Evaluation Benchmark [139.25515221280767]
General Language Generation Evaluation (GLGE) is a new multi-task benchmark for evaluating the generalization capabilities of NLG models.
To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines.
arXiv Detail & Related papers (2020-11-24T06:59:45Z)
- CLUE: A Chinese Language Understanding Evaluation Benchmark [41.86950255312653]
We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark.
CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension.
We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
arXiv Detail & Related papers (2020-04-13T15:02:29Z)