Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
- URL: http://arxiv.org/abs/2305.14989v2
- Date: Tue, 24 Oct 2023 17:48:43 GMT
- Title: Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
- Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Ahmed El-Shangiti,
Muhammad Abdul-Mageed
- Abstract summary: Dolphin is a novel benchmark that addresses the need for a natural language generation (NLG) evaluation framework.
Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits.
It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models.
- Score: 21.06280737470819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Dolphin, a novel benchmark that addresses the need for a natural
language generation (NLG) evaluation framework dedicated to the wide collection
of Arabic languages and varieties. The proposed benchmark encompasses a broad
range of 13 different NLG tasks, including dialogue generation, question
answering, machine translation, and summarization. Dolphin comprises
a substantial corpus of 40 diverse and representative public datasets across 50
test splits, carefully curated to reflect real-world scenarios and the
linguistic richness of Arabic. It sets a new standard for evaluating the
performance and generalization capabilities of Arabic and multilingual models,
promising to enable researchers to push the boundaries of current
methodologies. We provide an extensive analysis of Dolphin, highlighting its
diversity and identifying gaps in current Arabic NLG research. We also offer a
public, interactive, and modular leaderboard and evaluate several models on our
benchmark, establishing strong baselines against which researchers can compare.
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on language varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [53.1913348687902]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA).
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain [24.54412069999257]
We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME.
The best baseline (XLM-R large) achieves both a dataset aggregate score and a language aggregate score of 61.3.
This indicates that LEXTREME is still very challenging and leaves ample room for improvement.
arXiv Detail & Related papers (2023-01-30T18:05:08Z)
- ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation.
To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation [13.947879344871442]
We propose a benchmark for Linguistic Code-switching Evaluation (LinCE).
LinCE combines ten corpora covering four different code-switched language pairs.
We provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT.
arXiv Detail & Related papers (2020-05-09T00:00:08Z)
- KLEJ: Comprehensive Benchmark for Polish Language Understanding [4.702729080310267]
We introduce a comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard.
We also release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks.
arXiv Detail & Related papers (2020-05-01T21:55:40Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark, evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.