BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
- URL: http://arxiv.org/abs/2502.04314v1
- Date: Thu, 06 Feb 2025 18:56:37 GMT
- Title: BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
- Authors: The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, Shireen Yates
- Abstract summary: This dataset is handcrafted in non-English languages first.
Each of these source languages is represented among the 23 languages commonly used by half of the world's population.
- Abstract: This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first; each of these source languages is represented among the 23 languages commonly used by half of the world's population and therefore has the potential to serve as a pivot language enabling more accurate translations. The dataset is specially designed to avoid contamination and to be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. BOUQuET is therefore especially suitable for the open initiative and call for translation participation that we are launching to extend it into a multi-way parallel corpus covering any written language.
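The paragraph-level, multi-way parallel organization described in the abstract can be pictured with a small sketch. The field names, language codes, and helper function below are illustrative assumptions, not BOUQuET's actual schema:

```python
# A minimal sketch of what one record of a BOUQuET-style dataset
# might look like: a non-English source paragraph plus a growing
# dictionary of target-language translations (multi-way parallel).
# All field names here are hypothetical.
record = {
    "source_lang": "spa",          # non-English source language
    "register": "casual",          # one of several registers/domains
    "paragraph": "Hola, ¿cómo estás? Hace tiempo que no hablamos.",
    "translations": {              # multi-way parallel targets
        "eng": "Hi, how are you? We haven't talked in a while.",
        "fra": "Salut, comment vas-tu ? Ça fait longtemps.",
    },
}

def translation_pair(rec, target_lang):
    """Return a (source_text, target_text) pair, or None if the
    target language has not been translated yet."""
    target = rec["translations"].get(target_lang)
    if target is None:
        return None
    return rec["paragraph"], target
```

Because every record carries all available target renderings of the same paragraph, any language in `translations` can act as a pivot when extending the corpus to a new language.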
Related papers
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
- MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z)
- EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain [2.4815579733050157]
We propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex).
Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages.
We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
arXiv Detail & Related papers (2022-10-24T17:58:59Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restoration task to bridge the gap between the pretraining and fine-tuning stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents.
The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy.
We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target).
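The zero-shot transfer protocol in this entry (train a classifier on labelled documents in a source language, then apply it unchanged to a target language) can be sketched in a few lines. The `embed` function below is a toy character-bigram stand-in for a real multilingual encoder, which is what makes genuine cross-lingual transfer possible; the function names and classifier choice are illustrative assumptions, not the paper's actual setup:

```python
from collections import Counter

def embed(text):
    """Toy character-bigram embedding. A real setup would use a
    shared multilingual encoder so that source- and target-language
    texts land in the same vector space."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def centroid(docs):
    """Sum of embeddings of all docs with a given label."""
    total = Counter()
    for doc in docs:
        total.update(embed(doc))
    return total

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(train_by_label, target_doc):
    """Fit label centroids on source-language docs only, then label a
    (possibly target-language) doc by its nearest label centroid."""
    cents = {lab: centroid(docs) for lab, docs in train_by_label.items()}
    return max(cents, key=lambda lab: cosine(cents[lab], embed(target_doc)))
```

The key point of the protocol is that no target-language labels are seen at training time; only the shared representation space carries the signal across languages.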
arXiv Detail & Related papers (2021-09-02T12:52:55Z)
- Complete Multilingual Neural Machine Translation [44.98358050355681]
We study the use of multi-way aligned examples to enrich the original English-centric parallel corpora.
We call MNMT with such a connectivity pattern complete Multilingual Neural Machine Translation (cMNMT).
In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs.
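The target-language-conditioned sampling idea above can be sketched as follows; the function name and the uniform-over-target-languages choice are assumptions for illustration, since the abstract does not spell out the exact sampling distribution:

```python
import random

def sample_batch(examples, batch_size, rng=None):
    """Sketch of target-language-conditioned sampling: pick the target
    language first (here uniformly), then an example for it, so that
    low-resource target languages are not drowned out by high-resource
    ones during multilingual NMT training."""
    rng = rng or random.Random(0)
    by_tgt = {}
    for ex in examples:
        by_tgt.setdefault(ex["tgt_lang"], []).append(ex)
    langs = sorted(by_tgt)
    return [rng.choice(by_tgt[rng.choice(langs)]) for _ in range(batch_size)]
```

With this scheme a target language with 10 parallel examples is sampled as often as one with 90, which is what "conditioned on the target language only" suggests in spirit.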
arXiv Detail & Related papers (2020-10-20T13:03:48Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.